AI in GUI-Based Software Testing

An Overview of the State of the Practice

By Daniel Kraus

Artificial intelligence (AI) and machine learning (ML) remain among the most hyped topics of the moment. In the realm of software testing in particular, various new tool vendors have emerged over the past few years that incorporate some form of AI in their products. This article gives an overview of how some of these vendors use AI to help developers and testers with software testing through the graphical user interface (GUI).

Background Information About AI

Before we dig into the wide variety of tools, the question arises: What actually is AI? Well, this topic has been disputed for decades. Shane Legg (cofounder of DeepMind) and Markus Hutter (professor at the Australian National University) say that “[a] fundamental problem in artificial intelligence is that nobody really knows what intelligence is.” [1, p. 391] In one of their research papers, they have created a collection of over 70 definitions from psychologists, AI researchers, and others, which shows that a consensus does not exist at this time—even among experts. Nonetheless, Legg and Hutter came up with the following definition [2, p. 21]:

"Intelligence measures an agent’s ability to achieve goals in a wide range of environments."

A key attribute here is that an intelligent agent is able to adapt to different objectives and environments, whereas most (if not all) AI-based systems today are only good at one specific task, for instance game playing, speech recognition, or computer vision. According to Michael Jordan (professor at the University of California, Berkeley), “[m]ost of what is being called ‘AI’ today, particularly in the public sphere, is what has been called ‘Machine Learning’ (ML) for the past several decades.” [3] He describes ML as an algorithmic field that utilizes statistics, computer science, and many other disciplines to make predictions and decisions—the foundation of the majority of today’s AI-labeled systems. Even though many of these achievements represent notable milestones towards “real” AI, the results are often “reclassified” afterwards; Wolfgang Banzhaf (professor at Michigan State University) puts it as follows [4]:

"[…] as soon as you have something that previously was considered a “real” AI task, as soon as it’s successfully performed, it’s not anymore AI, it’s something else."

For simplicity (and search engine optimization), we will refer to AI/ML in terms of methods that have been developed within the corresponding academic fields over the past couple of decades, including search and optimization, logic, artificial neural networks (ANNs), etc. That is, we don’t make any statements about whether a particular tool deserves to be labeled AI; we merely provide links between the underlying methods and the field of AI in a broad sense.

Tools! Tools! Tools!

AI-based tools within the scope of software testing range from static code analysis (e.g. Facebook Infer), through unit test generation (e.g. Diffblue), to entire testing platforms (e.g. SmartBear). As is often the case in test automation, there is a trend towards GUI-based system testing. System tests through the GUI are still a major pain point: not just because they tend to be “[…] brittle, expensive to write, and time consuming to run” [5], but also because developers and testers seem drawn to the GUI as a test interface, which often results in a test automation ice-cream cone rather than a pyramid.

GUI-based system tests are also special in that they stimulate the system under test (SUT) from end to end, i.e., from the user’s perspective down to persistence layers such as databases. When used wisely, they can be a powerful part of a testing strategy. For years, tool manufacturers have been advertising that they have already solved the aforementioned issues, often to the displeasure of their customers and of the testing community. With the current wave of AI-based tools, one may ask how these new methods can help.

Here is a look at eight AI-based tools which can be divided into the following two main categories (although some also fit in both):

  1. Test maintenance: the tool takes care of maintenance tasks such as updating locators after changes or carrying out special checks to detect visual differences.
  2. Test generation: the tool generates actual tests that can be executed. For instance, automated exploration to find broken links or the creation of regression test suites.

Within the context of test maintenance, many AI/ML tool manufacturers focus on the recognition of GUI elements, a common problem in GUI-based testing (known in academia as the GUI element identification problem). When it comes to test generation, the approaches are more diverse, ranging from crash testing to broken link detection. Without further ado, let’s have a look at the tools and their methods in detail.

Applitools

URL: https://applitools.com/
Supported platforms: All
Application type: SaaS or on-premise
Category: Test maintenance

Applitools focuses on visual testing (a.k.a. visual checking or visual regression testing), which verifies that the SUT’s GUI appears as intended. To do so, Applitools offers APIs for major programming languages and test frameworks to create tests with visual assertions, either for single GUI elements or for entire screens. During test execution, every check via the Applitools API (Eyes SDK) creates a screenshot that is uploaded to a dedicated service (Eyes Server) for comparison. This is where computer vision (CV) algorithms are used to report only perceptible differences. Afterwards, the Eyes Test Manager can be used to review the test results in detail. Changes can be approved, rejected, or ignored, and similar changes are grouped together to speed up the maintenance process.
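Applitools’ actual algorithms are proprietary, but the core idea of reporting only perceptible differences can be sketched in a few lines. The pixel representation and both thresholds below are illustrative assumptions, not the Eyes implementation:

```python
# Sketch (not the actual Eyes algorithm): report a visual regression only
# when pixel differences exceed a perceptual tolerance, so rendering noise
# is ignored but real changes are flagged.

def perceptible_diff(baseline, checkpoint, tolerance=10, max_ratio=0.001):
    """Compare two images given as flat lists of (r, g, b) tuples.

    A pixel counts as "changed" only if any channel differs by more than
    `tolerance`; the screens differ only if the share of changed pixels
    exceeds `max_ratio`. Both thresholds are invented for illustration.
    """
    assert len(baseline) == len(checkpoint), "images must match in size"
    changed = sum(
        1
        for p, q in zip(baseline, checkpoint)
        if any(abs(a - b) > tolerance for a, b in zip(p, q))
    )
    return changed / len(baseline) > max_ratio
```

A screenshot with only slight anti-aliasing noise would pass, while a moved or recolored element would exceed the ratio and be reported.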

Since the comparison itself is made against a given baseline (a.k.a. golden master), the approach can be seen as a derivative of Approval Testing: a means to characterize the behavior of the test object in order to protect it against unintended changes, regardless of its correctness (which is why academia calls this a consistency oracle, as it compares the consistency between two versions). Applitools adds a Git-like branching and merging mechanism for versioning the baseline images.

According to Applitools, the AI itself is based on the largest data set of GUI validations in the world, achieves 99.99% accuracy, and continuously evolves by analyzing new images on a daily basis. [6] To further improve the CV-based checks, the comparison uses sequences of images instead of a static association between checks and images. As a result, images can be added or removed without affecting other parts of a diff, just like with textual diffs, which again reduces the required maintenance effort when the SUT changes.
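The sequence-based comparison can be pictured with a textual-diff analogy: Python’s difflib aligns two runs of screenshots (reduced here to stand-in content hashes), so an inserted screen doesn’t shift every later comparison. This is a conceptual sketch, not Applitools’ algorithm:

```python
import difflib

def align_runs(baseline_hashes, checkpoint_hashes):
    """Return (op, baseline_slice, checkpoint_slice) tuples per diff hunk,
    aligning two runs of screenshots like lines in a textual diff."""
    matcher = difflib.SequenceMatcher(a=baseline_hashes, b=checkpoint_hashes)
    return [
        (op, baseline_hashes[i1:i2], checkpoint_hashes[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
    ]

# A new screen "X" inserted in the middle leaves "C" and "D" still matched:
hunks = align_runs(["A", "B", "C", "D"], ["A", "B", "X", "C", "D"])
```

Only the inserted screen shows up as a change; the screens after it remain paired with their baselines.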

Eggplant AI

URL: https://eggplant.io/products/dai/eggplant-ai/
Supported platforms: All
Application type: On-premise
Category: Test generation

Eggplant AI makes use of various concepts from model-driven software development (MDSD) to generate test cases. A model of the SUT is created, similar to a state machine, with states that represent screens and contain actions. Actions usually lead from one state to another, but there are also global actions which are feasible from every screen (e.g. a menu bar). Each state and action can have data values and additional attributes.

This abstract model of the SUT is then used to generate new test cases, especially tests that haven’t been considered before. During the test generation phase, Eggplant AI doesn’t actually execute anything on the SUT; instead, it only uses the model.
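A model of this kind can be sketched as a simple state machine. The screens, actions, and global actions below are invented for illustration, and the walk is random rather than Eggplant’s guided selection:

```python
import random

# Hypothetical model: screens as states, actions as transitions, plus
# "global" actions that are feasible from every screen (e.g. a menu bar).
MODEL = {
    "login": {"submit": "home"},
    "home": {"open_cart": "cart", "search": "results"},
    "results": {"add_to_cart": "results"},
    "cart": {"checkout": "confirmation"},
    "confirmation": {},
}
GLOBAL_ACTIONS = {"go_home": "home"}

def generate_path(start="login", steps=5, rng=random):
    """Walk the model (not the SUT) to produce a sequence of (state, action)."""
    state, path = start, []
    for _ in range(steps):
        actions = {**MODEL.get(state, {}), **GLOBAL_ACTIONS}
        if not actions:
            break
        action = rng.choice(sorted(actions))
        path.append((state, action))
        state = actions[action]
    return path
```

Every generated path is valid by construction, since each step follows a transition defined in the model.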

With regards to test generation in general, an important part is how actions are selected, which typically consists of picking an action and possibly additional data for this action. For example, when executing a text insertion, the tool needs to know what text to insert, which is not needed in case of a simple mouse click. Eggplant AI does this action selection based on the following four ingredients:

  1. Bug hunting: looks for common patterns that typically lead to bugs (e.g. going backwards and forwards in between screens).
  2. Coverage analysis: optimizes towards coverage in terms of states, actions, and data. Alternatively, other (proprietary) coverage algorithms can be selected.
  3. Real user journeys: prioritizes paths within the SUT that have been executed by actual users.
  4. User weights: actions can be extended with weights, which increase or decrease the probability of a particular action being chosen.
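The fourth ingredient, user weights, boils down to weighted random selection; the action names and weights here are made up for illustration:

```python
import random

def pick_action(actions, rng=random):
    """Select one action from {name: weight} proportionally to its weight."""
    names = list(actions)
    weights = [actions[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]

# A tester who wants the purchase flow exercised more often raises its weight:
actions = {"click_buy": 5.0, "open_help": 0.5, "log_out": 0.5}
```

Over many generated steps, "click_buy" is chosen roughly ten times as often as either alternative, without ever excluding the others entirely.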

Eggplant AI uses different ML algorithms to achieve different goals. One example is the bug hunting system, parts of which are built upon TensorFlow. Here, a dense (i.e. fully connected) ANN and a final classifier determine the action (and its data values) to execute next. [7, p. 6] Since completed tests that passed are fed into the ANN as well, it is also able to unlearn fixed issues.

It is important to note that the results of the generation process aren’t actual test scripts. Rather, actions are associated with one or more snippets that are executed. (According to Eggplant, eventually, snippets shall be generated as well.) Besides various reporting facilities (such as heat maps), default test cases for particular features and/or business requirements can be added to the model, too.

Functionize

URL: https://functionize.com/
Supported platforms: Web and mobile (Android, iOS)
Application type: SaaS
Category: Test maintenance and generation

Functionize describes itself as a testing platform that handles test creation, execution, maintenance, and analytics, with capabilities for functional, visual, performance, and load testing. Tests can be created via a recorder, a Robot-compatible framework, or a natural language processing (NLP) engine using keywords like open url, input text, verify, etc. The NLP engine uses “automated test creation techniques” and manually created tests to identify inputs, assertions, and interactions. Similar to Applitools, Functionize also allows CV-based visual testing with the aid of convolutional neural networks (CNNs) and analytics to only report relevant changes.

The ML core of Functionize, called Adaptive Event Analysis (AEA), is “[…] a patented and proprietary hybrid of algorithms which span supervised, unsupervised, reinforcement learning and computer vision, enabling lightning-fast test creation and execution, self-healing maintenance, and actionable analytics.” [8, p. 1] AEA creates a model of the SUT as a multi-dimensional application map. This model is used to compare the SUT with previous versions, e.g., to see if user flows are broken.

The three main uses of AI, where data is collected both via live user observation and manual input (i.e. by creating test cases), are:

  1. Live user analysis: data is first clustered with k-means and expectation maximization to identify user types and then a recurrent neural network (RNN) is used to understand the different user behavior.
  2. Autonomous test creation: based on Markov models / decision processes and ranking of recorded user behavior.
  3. Self-healing maintenance: uses reinforcement learning and CV to understand where tests are breaking (e.g. by identifying static and dynamic content via CNNs).
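The Markov-model part of autonomous test creation can be illustrated by learning transition frequencies from recorded user sessions and then following the most likely next steps. The session data below is invented, and the greedy walk is a simplification of any real ranking:

```python
import collections

def learn_transitions(sessions):
    """Count screen-to-screen transitions observed in recorded sessions."""
    counts = collections.defaultdict(collections.Counter)
    for session in sessions:
        for here, nxt in zip(session, session[1:]):
            counts[here][nxt] += 1
    return counts

def likely_path(counts, start, length):
    """Greedily follow the most frequent next step (a ranking simplification)."""
    path, state = [start], start
    for _ in range(length - 1):
        if not counts[state]:
            break
        state = counts[state].most_common(1)[0][0]
        path.append(state)
    return path

# Invented recordings of real user journeys through a shop:
sessions = [
    ["home", "search", "product", "cart", "checkout"],
    ["home", "search", "product", "home"],
    ["home", "product", "cart", "checkout"],
]
counts = learn_transitions(sessions)
```

A generated test then prioritizes the path users actually take most often, here the full search-to-checkout flow.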

To speed up test maintenance, the tool suggests test updates (in case self-healing is not possible), provides reusable page objects and also tracks different locators (including proprietary ones) for the targeted and nearby elements.

mabl

URL: https://mabl.com/
Supported platforms: Web
Application type: SaaS
Category: Test maintenance and generation

mabl focuses on regression testing and offers various features for this purpose. The AI is trained by recording user journeys (i.e. test cases with steps and assertions). Internally, a domain-specific language (DSL) is used to capture user interactions and intents (only executable with an evolving knowledge base of the SUT), which allows users to apply different AI techniques. Based on these recordings and other data, such as previous executions and knowledge from other SUTs, mabl attempts to simulate user input. The tool also autonomously crawls for “insights” like broken links, JavaScript errors, page load and test execution time changes, as well as visual changes. The accuracy improves over time with each test run, as mabl learns to better distinguish dynamic and static regions for visual testing.

Similar to Functionize, mabl manages multiple locators for each element to automatically maintain existing tests. The SaaS solution operates on top of the Google Cloud Platform (GCP) and uses many other Google services, which allows mabl to scale easily. For example, it uses Kubernetes for test runner container management or Dataflow and ML Engine for analytics.
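The multiple-locator idea can be sketched as a ranked fallback: instead of failing when the preferred locator breaks, the next one is tried. The locator strings and the dict standing in for a DOM are hypothetical:

```python
def find_element(dom, locators):
    """Try locators in ranked order; return (element, locator_used)."""
    for locator in locators:
        element = dom.get(locator)
        if element is not None:
            return element, locator
    raise LookupError("no locator matched; flag the test for review")

# Several ways to find the same button, best-ranked first (invented):
checkout_locators = ["#checkout-btn", "button[name=checkout]", "text=Checkout"]

# After a refactoring drops the id, the test still finds the button:
dom = {"button[name=checkout]": {"tag": "button", "text": "Checkout"}}
```

Only when every locator fails does the test actually break, which is what makes this style of maintenance "self-healing".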

The tool primarily uses Bayesian and heuristic models for load time distributions and visual change detection. CV and image processing techniques are used to see if the executed test is in the expected state. The team currently evaluates decision-theoretic and cognitive approaches to further improve the simulated user input.

retest

URL: https://retest.de/en/
Supported platforms: Swing (Java)
Application type: Desktop
Category: Test maintenance and generation

retest originally stems from the research projects EXSYST and EvoSuite. Besides functionality for creating, executing, and maintaining tests, retest exhibits two special properties. First, it uses an approach similar to Approval Testing: it captures the entire state of the SUT that is visible through the GUI by extracting all properties of all GUI elements. If a change is detected, it can be either accepted or ignored with a single action, much like in a version control system (VCS). Consequently, no assertions need to be defined and maintained. These tests are usually also more stable than tests that use explicit locators to find GUI elements, since the entire GUI state is available and can be used during the search. Second, it supports AI-based monkey testing within the so-called “surili” module. surili uses the following three building blocks to generate tests:

  1. SUT model: represents a behavioral model of the SUT, constructed as a nondeterministic finite automaton (NFA), that is updated iteratively. Users can also extend this model with manually created tests, which allows surili to perform specific tasks (e.g. login via valid credentials) in order to follow various paths through the SUT.
  2. Genetic algorithm (GA): used to optimize the generated tests for code coverage (and test length implicitly). The test generation process is represented as a search problem for which test suites with various test scripts have to be generated.
  3. ANN: recommends GUI elements that are picked to select the next action. Existing tests are exploited to extract information on how humans use the given GUI. The obtained data is utilized to train the network, which is then able to rank the available GUI elements at runtime, similar to a recommender system.
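retest’s state capture can be pictured as a property-by-property diff between a golden master and the current GUI state; the element and property names below are invented, and the real implementation is certainly richer:

```python
def diff_states(golden, current, ignored=()):
    """Return {element: {property: (old, new)}} for all changed properties,
    skipping any (element, property) pairs the user chose to ignore."""
    changes = {}
    for element in golden.keys() | current.keys():
        old, new = golden.get(element, {}), current.get(element, {})
        delta = {
            prop: (old.get(prop), new.get(prop))
            for prop in old.keys() | new.keys()
            if old.get(prop) != new.get(prop) and (element, prop) not in ignored
        }
        if delta:
            changes[element] = delta
    return changes

# Invented example: only the button label changed between versions.
golden = {"loginButton": {"text": "Login", "x": 10, "y": 20}}
current = {"loginButton": {"text": "Sign in", "x": 10, "y": 20}}
```

Accepting a change would copy the new value into the golden master; ignoring it would add the pair to `ignored`, so no hand-written assertions are needed.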

Since this difference-testing approach sidesteps the oracle problem, surili is not just able to perform crash testing, but can also generate actual regression test suites that verify the behavior of the SUT across different versions.

retest is currently limited to Swing (Java) GUIs, but there is work in progress on a port for web applications. Parts of the Approval Testing implementation for web-based GUIs have already been made available within the open source project recheck-web and the team is currently working on surili-web.

Sapienz

URL: n/a
Supported platforms: Android
Application type: n/a
Category: Test generation

Sapienz, formerly known as MaJiCKe, was acquired by Facebook in January 2017 and has been part of Facebook’s continuous integration (CI) system since September 2017. The tool uses multi-objective search-based testing (MoSBaT) to generate tests optimized for code coverage, crashes found, and test length. MoSBaT essentially relies on search and optimization techniques such as GAs, which are able to target multiple objectives.

In the case of Sapienz, tests are optimized for Pareto optimality, a state in which it isn’t possible to improve one objective without impairing another. The tool combines fuzz testing (“atomic genes”) and systematic exploration (“motif genes”):

  • Atomic genes: Android events like Touch, Motion, Rotation, etc. that cannot be further decomposed.
  • Motif genes: series of atomic events that achieve higher coverage by performing high-level functions (e.g. filling out and submitting a form).
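Pareto optimality over these three objectives can be sketched directly: a candidate test suite survives if no other suite is at least as good everywhere and strictly better somewhere. The candidate tuples are invented:

```python
def dominates(a, b):
    """a, b are (coverage, crashes, length) tuples; higher coverage and
    crashes are better, lower length is better."""
    at_least_as_good = a[0] >= b[0] and a[1] >= b[1] and a[2] <= b[2]
    strictly_better = a[0] > b[0] or a[1] > b[1] or a[2] < b[2]
    return at_least_as_good and strictly_better

def pareto_front(candidates):
    """Keep every candidate not dominated by any other candidate."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]

# Invented candidate suites: (coverage, crashes found, test length)
suites = [
    (0.80, 3, 120),  # same objectives as the next one, but longer
    (0.80, 3, 90),   # dominates the suite above
    (0.60, 5, 200),  # less coverage but more crashes: incomparable
]
```

The front keeps the short high-coverage suite and the crash-heavy suite, since neither can be improved on one objective without losing on another.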

Based on these “genes”, Sapienz generates test suites, including coverage and crash reports, which are used to auto-localize and auto-comment issues and, if possible, also auto-detect fixes in pull requests. In terms of coverage criteria, Sapienz allows three different levels: white box (uses source code for statement coverage), grey box (instruments APK for method coverage), and black box (uses GUI for skin coverage).

An earlier empirical study (conducted before Sapienz became part of Facebook) with more than 1,000 Android apps showed that Sapienz was able to reveal 558 previously unknown crashes. [9] Facebook is currently working on an iOS port, on open-sourcing the solution, and on SapFix, a tool that uses AI to fix bugs found by Sapienz.

test.ai

URL: https://test.ai/
Supported platforms: Mobile (Android, iOS)
Application type: SaaS or on-premise
Category: Test maintenance and generation

According to test.ai, their goal is to enable code-free automation, including autonomous element selection and test execution. The tool uses bots to explore the SUT’s functionality and to perform regression testing, including screenshot captures and performance metrics.

In order to achieve these goals, test.ai uses multiple ANNs to address different layers of abstraction: SUT states, actions, button types, etc. The networks are trained in a supervised manner to perform classification tasks. For SUT state classification, an ANN receives a screenshot and the DOM as its input and classifies the screen accordingly (e.g. login or help page); the initial training data is labeled by humans. The ML model is also capable of executing generic tasks such as adding something to the cart and performing a checkout. Test verification is based on three approaches:

  1. Automatic detection of common errors (e.g. crashes).
  2. Training on bug examples (e.g. a text like “ooops”).
  3. Human oracles.

Test cases can be defined in a Gherkin-like DSL, called Abstract Intent Test (AIT) syntax, which is currently in draft mode. AIT focuses on the intent of test cases, not on “how” to execute them. No locators for identifying GUI elements or related information are associated with the test cases, which allows better reuse of test specifications across platforms and SUTs.
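The locator-free idea can be illustrated by resolving abstract intents into platform-specific actions only at run time. The intent names and action strings below are entirely hypothetical, not the AIT syntax:

```python
# Hypothetical per-platform mapping: the test names *what* to do, and the
# concrete actions are looked up only when the test is executed.
INTENT_LIBRARY = {
    "android": {
        "add to cart": ["tap(product)", "tap(add_to_cart_button)"],
        "checkout": ["tap(cart_icon)", "tap(checkout_button)"],
    },
    "ios": {
        "add to cart": ["tap(product_cell)", "tap(add_button)"],
        "checkout": ["tap(cart_tab)", "tap(pay_button)"],
    },
}

def resolve(test_steps, platform):
    """Expand abstract intents into the platform's concrete action list."""
    actions = []
    for step in test_steps:
        actions.extend(INTENT_LIBRARY[platform][step])
    return actions

# The same abstract test runs unchanged on both platforms:
abstract_test = ["add to cart", "checkout"]
```

Because the test itself contains no locators, a GUI redesign only requires updating the mapping, not every test that uses it.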

test.ai operates on top of Appium and was recently released as an Appium element-finding plugin, which allows Appium users to leverage AI-based search strategies for GUI element identification. The training data used is also open source.

Testim

URL: https://testim.io/
Supported platforms: Web
Application type: SaaS or on-premise
Category: Test maintenance

Testim uses ML to “[…] speed-up the authoring, execution, and maintenance of automated tests.” [10] Tests can be either created via a recorder, code, or both. The tool incorporates historical data to rank GUI element locators for each element individually, stabilizing the tests over time and giving the possibility to perform self-maintenance.

In principle, each executed test improves the underlying AI. Testim uses “dynamic locators” that are based on a real-time analysis of the entire DOM of a page. Objects and their properties are first extracted and then compared to decide which is the best locator for a given GUI element. Consequently, tests are said to not break on each change due to invalid static locators.
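Ranking locators by how reliably they have matched in past runs can be sketched as simple per-locator statistics; the locator strings and run history below are invented:

```python
def record_run(history, locator, matched):
    """Update (hits, runs) statistics for a locator after a test run."""
    hits, runs = history.get(locator, (0, 0))
    history[locator] = (hits + (1 if matched else 0), runs + 1)

def ranked_locators(history):
    """Order locators by historical match rate, best first."""
    def score(locator):
        hits, runs = history[locator]
        return hits / runs if runs else 0.0
    return sorted(history, key=score, reverse=True)

history = {}
for _ in range(10):
    record_run(history, "css=#buy", True)
record_run(history, "css=#buy", False)  # the id changed in one release
for _ in range(11):
    record_run(history, "xpath=//button[2]", True)
```

After the id-based locator fails once, the historically perfect alternative moves to the top of the ranking for the next run.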

Testim describes itself not just as a tool, but as a platform, since it also provides an execution environment for the tests. Besides customization capabilities to choose among different browsers, operating systems, etc., Testim also supports visual validations for pixels, fonts, texts, and similar elements. The team is currently working on support for mobile testing and has just launched a beta program for Android.

The Right AI/ML Tool For The Job

There is a wide range of applications available when it comes to AI/ML in GUI-based software testing and beyond. Most of the tools mentioned focus on test maintenance, while a few allow tests to be generated. Which one is the best? There is no general answer; each organization has its own specific requirements. Time will tell which approaches prove their worth. This will be reflected in how successful the different manufacturers are with their tools, and in which ideas gradually find their way into the open-source community, for instance via widely used tools such as Selenium.

In general, the use of AI-based tools can be helpful to complement a traditional testing strategy. For example, by adding visual checks, by reducing maintenance efforts via self-healing capabilities, or by generating test suites for fast smoke testing. In any case, a small proof of concept should be implemented before making a tool decision to find out if a particular solution fits into the existing technology stack. More importantly, it shows if the AI is smart enough to deal with the given SUT.

References

[1]

S. Legg and M. Hutter, "Universal Intelligence: A Definition of Machine Intelligence", Minds and Machines, vol. 17, no. 4, pp. 391–444, URL: https://arxiv.org/abs/0712.3329/, 2007.

[2]

S. Legg and M. Hutter, "A Collection of Definitions of Intelligence", Frontiers in Artificial Intelligence and Applications, vol. 157, pp. 17–24, URL: https://arxiv.org/abs/0706.3639/, 2007.

[3]

M. Jordan, "Artificial Intelligence — The Revolution Hasn’t Happened Yet", URL: https://medium.com/@mijordan3/artificial-intelligence-the-revolution-hasnt-happened-yet-5e1d5812e1e7/, 2018.

[4]

W. Banzhaf, "Interview on 'Experts Weigh in on the Future of AI and Evolutionary Algorithms'", URL: https://www.youtube.com/watch?v=tj5-H6ECxyM/, 2018.

[5]

M. Fowler, "TestPyramid", URL: https://martinfowler.com/bliki/TestPyramid.html, 2012.

[6]

A. Prasad, "Our Enterprise Visual UI Testing Platform: 10,000 Users, 300 Companies, 100 Million Tests, 1 Billion Component Level Results", URL: https://applitools.com/blog/product-release-applitools-v10-our-enterprise-visual-ui-testing-platform-10000-users-300-companies-100-million-tests-1-billion-component-level-results/, 2018.

[7]

Eggplant, Eggplant AI: The brain of the Digital Automation Intelligence Suite (white paper), URL: https://eggplant.io/s/Eggplant_AI_white-paper.pdf, 2018.

[8]

Functionize, Intelligent Functional Testing (white paper), URL: https://www.functionize.com/project/intelligent-functional-testing-white-paper/, 2018.

[9]

K. Mao, M. Harman and Y. Jia, "Sapienz: Multi-objective Automated Testing for Android Applications", in Proceedings of the 25th International Symposium on Software Testing and Analysis (ISSTA), New York, NY, USA, URL: http://www0.cs.ucl.ac.uk/staff/K.Mao/archive/p_issta16_sapienz.pdf, 2016.

[10]

Testim, "Testim: Agile, self-healing End To End test automation that everyone can use", URL: https://testim.io/, 2018.

Author Bio

Daniel Kraus is an industrial PhD student at retest, where he applies models and machine learning to software testing. In 2017, he graduated (with distinction) with his master’s thesis “Machine Learning and Evolutionary Computing for GUI-Based Regression Testing” from the Karlsruhe University of Applied Sciences. In his more than six years of experience in software engineering, he has worked for various SMBs and enterprises such as Volkswagen R&D. You can find him on Twitter, Medium, and GitHub.
 
