NashTech Blog

AI’s role in reshaping testing practice: Using AI for testing GUI and Bug prediction


Welcome back to our ongoing exploration of AI’s profound impact on reshaping testing practices. In this installment, we delve into the intricate realm where artificial intelligence intersects with the evaluation of interfaces, probing the boundaries of innovation and reliability.

In our previous discussions, we uncovered the transformative potential of AI in optimizing regression testing and enhancing software stability. Now, we pivot our focus to the crucial aspect of testing interfaces, where user experience and functionality converge in the digital landscape.

In this blog, we’ll look at using AI to support different testing activities, focusing specifically on the testing of user interfaces and on bug prediction. For the first topic, we consider how AI can be used to automate testing that is performed through the user interface, as if by an actual user. This is in contrast to testing that bypasses the user interface by working, for instance, at the API level and treating the user interface separately.

Let’s start by considering how we move from manual testing to automated testing. This will hopefully allow us to see how AI can contribute to making this move. When performing system testing, nearly all testers will start by testing manually through the system’s user interface.

Traditionally, this has also been the starting point for many test automation initiatives. The resultant tests aim to emulate human interaction with the system under test. A simple but flawed approach to automating testing this way is to use a capture playback or capture replay approach. The way this works is that the tester runs tests on the system, and while they are doing this, a tool sits in the background and records or captures all of their inputs to the system.

The idea is that the recording can then be played back into the system at a later point. As far as the system under test is concerned, the same tests are run as if the tester were doing it all again themselves. Basic versions of capture playback tools record the actual coordinates of the mouse pointer as it moves, so that they can move the pointer in exactly the same way when the captured script is replayed.

 

If nothing in the user interface changes, then this is fine and can work well. Problems occur when the changes include modifications to the user interface. If an update has moved a button on the screen a centimeter to the right, for instance to accommodate a new button, then the recorded mouse coordinates may no longer fall on the button and so the replayed script will fail.

To mitigate such a risk, more sophisticated capture playback tools do not record the actual mouse coordinates on the screen, but instead record the objects or widgets that the tester interacts with.
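To make the contrast concrete, here is a minimal, purely illustrative sketch (not a real tool) of the two capture playback styles. The screen model, widget names, and recorded inputs are all invented for the example: coordinate-based replay stores raw click positions and breaks when the layout shifts, while object-based replay stores the widget that was clicked.

```python
# Coordinate-based replay stores raw (x, y) click positions; object-based
# replay stores the widget the tester interacted with, so it survives moves.

def replay_by_coordinates(screen, recorded_clicks):
    """Replay clicks at the exact recorded (x, y) positions."""
    results = []
    for x, y in recorded_clicks:
        widget = screen.get((x, y))      # fails if the button has moved
        results.append(widget if widget else "MISS")
    return results

def replay_by_object(widgets, recorded_names):
    """Replay clicks on widgets looked up by name, wherever they now are."""
    return [name if name in widgets else "MISS" for name in recorded_names]

# The "screen" maps positions to widget names; an update moved Save 10px right.
old_screen = {(100, 200): "save_button"}
new_screen = {(110, 200): "save_button"}

print(replay_by_coordinates(new_screen, [(100, 200)]))       # ['MISS'] — brittle
print(replay_by_object(new_screen.values(), ["save_button"]))  # ['save_button']
```

The coordinate-based script fails after the layout change, while the object-based script still finds the renamed-but-unmoved widget by name, which is exactly the improvement the more sophisticated tools provide.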

Now, if an object is moved on the user interface, it can still normally be found automatically when the script is replayed. Whichever of these approaches is used, the result is a test script that is brittle. We call these scripts brittle because a minor change in the system under test will often cause the automated test script to fail, or break. The changes that cause this brittleness can be minor changes in the interface elements, especially if mouse coordinates are used. However, renaming objects and widgets, even by a single letter, will have similarly catastrophic results as far as the test scripts are concerned.

Imagine if we port the system under test to a new platform. The many changes associated with moving to a new platform can have a disastrous effect on a brittle test script. The capture playback approach to testing through a graphical user interface has been known to be flawed for many years.

However, the benefits of utilizing tests that mimic real user interactions with the system have led to the frequent suggestion of new and improved methods for capture playback.

The following are a couple of ways that AI can be used to mitigate the brittleness of capture playback scripts. The first suggestion is that AI is used to recognise elements on the user interface by using a variety of different criteria on top of the object names or mouse coordinates. These criteria could include the XPath, the label, the ID, and the class. The second suggestion is that, since some of these criteria are more stable than others depending on the system, the AI can learn which criteria are most stable and assign them a higher weighting.

These options may help with the known problems of capture playback, but they are not a complete solution. In the next section, we will see a possible contender for fixing the problem: the next generation of technology for testing through the user interface, namely visual testing using AI-based image recognition.

With this approach, we do not need to use one of the various forms of metadata to identify user interface elements. Instead, we use image recognition to decide which objects to interact with. This approach allows us to get very close to how a real user interacts with the system under test. We do this by designing and executing tests that work through the user interface and are based on what the user or AI in this case can see in that interface.

A benefit of this approach is that it requires no knowledge of the underlying technology, and the only interaction necessary is through the user interface, so this is a purely black-box approach. Automated test scripts written for visual testing need only be written in terms of how a human user would interact with the system, such as visible buttons, dropdown lists and text fields. The assumption is that the AI-based tool will be able to recognize and interact with all the common elements of a user interface, in terms of both inputs and outputs.
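The core idea of locating an element by its appearance rather than its metadata can be shown with a deliberately tiny sketch. Real visual-testing tools use trained image-recognition models; the naive exact template match below, over toy grayscale grids, is only meant to illustrate the principle.

```python
# Locate a UI element by appearance: scan a "screenshot" (a 2D grid of pixel
# values) for the position where a template image matches exactly.

def find_template(screen, template):
    """Return the (row, col) where the template exactly matches, else None."""
    sh, sw = len(screen), len(screen[0])
    th, tw = len(template), len(template[0])
    for r in range(sh - th + 1):
        for c in range(sw - tw + 1):
            if all(screen[r + i][c + j] == template[i][j]
                   for i in range(th) for j in range(tw)):
                return (r, c)
    return None

# A 3x4 "screenshot" containing a 2x2 bright "button" at row 1, column 1.
screen = [
    [0, 0, 0, 0],
    [0, 9, 9, 0],
    [0, 9, 9, 0],
]
button = [[9, 9], [9, 9]]
print(find_template(screen, button))  # (1, 1)
```

A deep-learning-based tool generalises this: instead of requiring a pixel-perfect match, it recognises a button, dropdown or text field wherever it appears and however it is styled.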

Visual testing based on image recognition has been around for the past 20 or 30 years. However, in the past, this technology was restricted to systems with quite constrained user interfaces. This was largely because it required access to large amounts of computing power and memory to perform image recognition on more than a handful of interface elements.

With the advent of image recognition based on deep neural networks, it is now quite affordable to build tools that can learn to recognise practically anything that can be put on a user interface.

The following section is about using machine learning and heuristics to determine the quality of the user interface itself.

Source: Automated Reporting of GUI Design Violations for Mobile Apps by K. Moran

The chart above is from research done with Huawei for mobile app user interfaces. The chart shows the different types of user interface defects that were identified by the AI based system and their relative proportions.

For instance, in this form of testing, the AI-based system is checking whether the right colours and the right fonts have been used. At the top level, as shown by the inner ring, we can see that 39% of user interface violations were due to layout problems, 38% were due to resource violations, and 23% were due to text violations.

Another use of AI for testing graphical user interfaces is for comparing different versions of a user interface by looking at screenshots to determine what has changed between the two versions. Such a capability can be used as part of regression testing to determine what has changed and to identify any potential defects where changes have occurred that were not expected.
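A pixel-level diff is the simplest building block of such version comparison. The sketch below is illustrative only, using tiny grayscale grids in place of real screenshots; commercial tools such as Applitools layer AI on top of this to ignore insignificant rendering noise and highlight meaningful changes.

```python
# Diff two "screenshots" (2D grids of pixel values) and report the pixels
# that changed between the two versions of the user interface.

def diff_screens(before, after):
    """Return the list of (row, col) pixels that differ between versions."""
    return [(r, c)
            for r, row in enumerate(before)
            for c, pixel in enumerate(row)
            if pixel != after[r][c]]

v1 = [[255, 255], [0, 0]]
v2 = [[255, 128], [0, 0]]  # one pixel changed in the new version

changes = diff_screens(v1, v2)
print(changes)  # [(0, 1)]
print("regression suspect" if changes else "no visual change")
```

In regression testing, any non-empty diff in a region that was not expected to change becomes a candidate defect for a human tester to review.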

The screenshots above show two versions of the same screen that were compared by the Applitools commercial testing tool. The resultant screen on the right is a marked-up version that shows the differences between the two versions.

The two screens may not look particularly realistic, but that is because these screens were used as part of a hackathon. However, they do show the capabilities of the AI based tool, which can find differences practically instantaneously compared with doing the comparison manually.

Next Steps – Combined Checking and Comparison of UIs

Given that there are AI-based tools that can do automated checking of user interface elements, and other AI-based tools that can do comparisons between versions of user interfaces, we will consider what is possible when we combine these two technologies.

A tool with both these capabilities would be able to identify changes in an upgraded version of a user interface and then check the acceptability of those changes, performing a form of regression testing. The results from this tool would provide recommendations to human testers about which user interfaces or user interface elements should be checked for usability. Such functionality could also be used to compare user interfaces for the same application on various browsers, different mobile devices or different platforms.

This functionality would allow AI based portability testing to be automatically performed to ensure an app’s user interface looks the same in all different operational environments.
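The combined pipeline described above can be sketched speculatively: first diff two UI versions to find what changed, then run heuristic checks only on the changed elements. Everything here is invented for illustration, including the element fields, the colour palette, and the style rules.

```python
# Step 1: find elements whose properties differ between two UI versions.
def changed_elements(old_ui, new_ui):
    """Elements whose properties differ between the two UI versions."""
    return [name for name in new_ui if old_ui.get(name) != new_ui[name]]

# Step 2: check each changed element against simple style heuristics,
# echoing the violation categories from the Moran study (colour, text).
PALETTE = {"#000000", "#FFFFFF", "#0055AA"}  # assumed approved brand colours

def check_element(element, min_font=12):
    """Flag basic style violations: off-palette colour or too-small font."""
    issues = []
    if element["color"] not in PALETTE:
        issues.append("color violation")
    if element["font_size"] < min_font:
        issues.append("text violation")
    return issues

old_ui = {"save": {"color": "#0055AA", "font_size": 14}}
new_ui = {"save": {"color": "#FF0000", "font_size": 10}}

for name in changed_elements(old_ui, new_ui):
    print(name, check_element(new_ui[name]))
# save ['color violation', 'text violation']
```

Only the elements flagged by the diff are checked, which is what would let such a tool focus human attention on the parts of the interface that actually changed between versions or between platforms.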

Bug Prediction

The final application of AI to support testing that we will consider is for bug prediction. A bug prediction tool is used to help us predict which parts of a system are most likely to contain defects. Knowing this would be extremely useful as then we could target those parts with more testing.

Bug prediction has traditionally been based on code metrics, such as the size of a module measured as the number of lines of code, and the complexity of a module measured by the number of decisions in the source code. Generally, it is believed that the larger and more complex a module is, the more likely it is to contain defects.

Assume that we have collected bug data for an organisation. The graph shows the relationship between module size and whether a module contains bugs.

You can see that as module size increases, then it becomes more likely that the module is defective. From this data, we could identify a module size above which the probability is that it will contain defects. Now, if we are presented with a new module, we could determine if it was likely or not to include a defect.

This is a very simple binary classification problem. Obviously, with only the single input variable of module size, both humans and tools could handle it quite easily.
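That single-variable classifier can be written out in a few lines. The sketch below is illustrative, with invented historical data: it learns the size cutoff that best separates buggy from clean modules, then classifies a new module against it.

```python
# Learn a module-size threshold from historical bug data, then use it to
# classify new modules as defect-prone or not (simple binary classification).

def learn_threshold(history):
    """Pick the size cutoff that best separates buggy from clean modules."""
    sizes = sorted(size for size, _ in history)
    best_cut, best_correct = sizes[0], -1
    for cut in sizes:
        # Predict "defective" for modules at or above the cutoff.
        correct = sum((size >= cut) == buggy for size, buggy in history)
        if correct > best_correct:
            best_cut, best_correct = cut, correct
    return best_cut

# (lines of code, contains a defect?) — invented training data.
history = [(50, False), (120, False), (300, True), (450, True), (600, True)]
cut = learn_threshold(history)
print(cut)         # 300 — modules of 300+ lines are predicted defective
print(700 >= cut)  # True: a new 700-line module is flagged as defect-prone
```

With one variable this is trivial, which is exactly the point of the next step: the same idea stops being hand-computable once more factors are added.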

On the graph above, we have moved to two factors: module size and the number of revisions made to the module. The green ticks represent modules with no defects and the red crosses represent modules with defects. The purple line classifies modules as either defect-prone or not.

This is still a relatively easy problem for humans. However, when we move to three or more variables, we can no longer draw a simple 2D representation, and it becomes impossible for humans to reason about easily in their heads. Hence we need to consider using AI, and more precisely machine learning. It is unclear which factors are the best predictors of defects.

Research has identified a number of different measures and combinations of measures that can be used. The input data matrix above shows an example set of attribute data for five modules and six attributes, together with whether each module contained a defect.

If we use machine learning as opposed to doing it ourselves, then in theory we can use as many attributes as we believe may be useful. The machine learning algorithm will decide which attributes are the best predictors and which are the least useful. Of course, with many attributes, we cannot do this manually.
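A tiny sketch can show how a learner ranks attributes by predictive power. This is a simplified stand-in for real feature selection: for each attribute, it fits the best single-threshold rule (a decision stump) and ranks attributes by that rule's accuracy on the training data. The modules, attributes and labels below are invented.

```python
# Rank attributes by how well the best single-threshold rule on each one
# predicts defects — a toy version of what a learner does with many features.

def stump_accuracy(values, labels):
    """Best accuracy achievable by a single threshold on one attribute."""
    best = 0.0
    for cut in set(values):
        for flip in (False, True):            # try both rule directions
            preds = [(v >= cut) != flip for v in values]
            acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
            best = max(best, acc)
    return best

# Five modules, three candidate attributes, and the defect outcome.
data = {
    "lines_of_code": [80, 150, 300, 70, 90],
    "revisions":     [2, 9, 8, 11, 1],
    "authors":       [1, 3, 2, 4, 2],
}
defective = [False, True, True, True, False]

ranking = sorted(data, key=lambda a: stump_accuracy(data[a], defective),
                 reverse=True)
print(ranking[0])  # revisions — the attribute that best predicts defects here
```

In this made-up data, the revision count separates buggy from clean modules perfectly while the code metrics do not, so the learner would weight it most heavily; with real data and a real model the same principle decides which of many attributes matter.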

The following are a number of different metrics that could be used for bug prediction.

- Source code metrics
  - Lines of code
  - Number of comments
  - Cyclomatic complexity
  - Module dependencies
- Process metrics
  - Revisions made to the module
  - Times refactored
  - Time fixed/when fixed
  - Lines of code changed (code churn)
  - Module age
- People and organisational metrics
  - Number of authors
  - Developers’ experience

Earlier, we mentioned that traditionally source code metrics for a module were used to predict if a module contains bugs. However, here we can see that we also have process metrics and people and organizational metrics. The process metrics are obtained from data repositories such as version control systems and defect management systems.

The people metrics are concerned with the developers, such as the experience of the programmer and the number of developers when more than one was involved. Other people metrics could include historical data on the likelihood of each individual developer introducing defects into their modules. Although this is an attractive metric to use, we need to be careful that this information is put in context.

For instance, in many software development teams, the most experienced developers are normally assigned the most complex modules, while new developers are normally assigned the simplest ones.

Bug Prediction Using Supervised Learning

Different studies show that using machine learning for bug prediction can be very effective; research in this area has been carried out for many years.

In the Towson study (2010), which was performed at the largest telecoms operator in Turkey, the researchers’ tool was able to detect 87% of defects with a false alarm rate of only 26%.

The Kim study (2007) was performed on open source software with over 200,000 revisions. The machine learning model built by the researchers was 73 to 95% accurate at predicting defects across seven projects, while examining just 10% of the modules.

The following list of best predictors has been gathered from several research projects on bug prediction. The predictors are ordered roughly by how good they are at predicting defects.

The best predictors are:

- People and organisational measures (84%)
- Module change (80%)
- New module
- Fixed recently (and connected modules)
- Reused modules are more error-prone than new modules
- Module age

As you can see, the best predictors appear to be measures based on people and the organisation. The numbers next to the predictors show how useful the different individual metrics are when used as input to an AI tool.

For instance, the figure of 84% came from a study where code complexity scored just 66% and lack of code coverage scored just 54%. Thus, the traditional view that source code metrics should be the primary indicator of defects appears to be flawed.

Of course, each project and each organization is unique and so any machine learning model for bug prediction should be trained using data from that project or organization. The differences between projects and organizations are large enough that building a commercial tool based on a generic machine learning model is not normally useful.

Conclusion

Throughout our exploration of “AI’s Role in Reshaping Testing Practice: Using AI for Testing Interfaces,” we’ve journeyed through the transformative landscape where innovation converges with reliability to redefine the testing of user interfaces.

From the outset, we’ve uncovered the profound impact of AI on interface testing, from automated testing scenarios to the analysis of user interactions. The marriage of intelligent algorithms and interface evaluation has ushered in a new era of efficiency and precision, enabling testers to navigate the complexities of digital experiences with unparalleled depth.

 

Yet, as we conclude this discussion, it’s essential to recognize that the evolution of interface testing with AI is not without its challenges. The dynamic nature of user interfaces demands adaptability, and the reliance on artificial intelligence requires continual refinement and optimisation.

Reference:

ISTQB AI Testing


Ngoc Tran Ha Bao

Test Lead with over eight years of experience, especially in the Performance Testing area.
