The Future Of Software Testing Part Two

By Seth Eliot

In the November 2011 edition of The Testing Planet, I made several predictions about the future of software testing.

In part one, I made the case that Testing in Production is the natural way to test software services, using real users and real environments to find bugs that matter.

The subject of this article is what I call TestOps, and how it represents the potential for dramatic change to the testing profession and how software organisations test software and organise to assure software quality.

Software Testing Presently

To see the future we must first understand how most organisations test software in the present. Figure 1 is a simple but useful representation of this current state of affairs.

Figure 1. The traditional model of software testing

Tests are run against the product and the product reacts. If this result matches the expected result (the oracle) it passes… it’s green. A tester can look at all the passes (greens) and the fails (not-greens) then make pronouncements about the product quality. Distilling this down to three stages we get Table 1.

Table 1. Three stages of traditional Test

Test results are our output, our signal to be processed and interpreted. However, in the future, we need to look more at a new signal. Software services running in data centres have relied on Operations or “Ops” teams to run the data centre. Ops provisions hardware, configures networking, then monitors and repairs any production impacting issues. Now, like the Ops team, testers will need to have more focus on the live site. Our new quality signal will come from this live production data, hence the name TestOps.

Products of the Future

To understand the future of software testing, look where the growth in software products is taking place:

  • Google executes more than 3 billion searches, 2 billion video replays and absorbs 24 hours of video per minute (1)
  • Microsoft Bing has grown from 10% of U.S. searches in September 2010 to over 30% as of April 2011 (2)
  • Facebook has about 800 million active users sharing 30 billion pieces of content per month (3)
  • Amazon has more than 120 million user accounts, 2 million sellers and 262 billion objects stored in its S3 cloud storage system (4)
  • Twitter reached its one-billionth tweet after just 3 years, 2 months and 1 day (5)

These successful, growing products are all services. In "The Future of Software Testing Part One", I discussed how services allow us to Test in Production by giving us immediate access to production, where we examine and change what is happening there. But to see what else services give us that will change the way we test, consider these statistics:

  • At Google, system performance is tracked by their “Dapper” tool. It collects more than 1 TB of sampled trace data per day, and all data is kept for two weeks (7)
  • Facebook’s logging framework Scribe collects approximately 25 TB of data daily (8)
  • eBay collects approximately 50 TB of incremental data every day to perform analytics (9)
  • Twitter stores 12 TB of new data daily (10)
  • Microsoft online properties Bing, MSN, and AdCenter collect and process 2 PB of data per day (11)

What services give us is telemetry data from users and the systems they are using, and in the cases above it is big data. Changing our signal to assess software quality from test results to telemetry data is a game changer for testing. Even if your service does not generate big data, any product operating at internet scale is going to see usage and data of sufficient magnitude to assess quality.

A New Signal for Quality

Let us update our three stages of test. We’ve determined that our signal is going to be the telemetry or big data coming from the system under test. Then the input can be the real world usage of the product.

Then what about the observe stage? Data by itself has little value until it is used or transformed to find patterns or prove hypotheses. Such analysis can be done to calculate Key Performance Indicators (KPIs), which are then compared to target requirements. Or we can look for patterns in the data that give insight into the behaviours of users or systems. These stages are summarised in Table 2.

Table 2. Three Stages of Traditional Testing and TestOps

We can start by using real production usage to decide where to look. For example, with Bing we can identify the top 100 query types used and conclude that we might get the most value by starting to look there. Among the top queries are those for stock prices. On Bing, the query “Amazon Stock Price” should return the Finance Instant Answer as the top result. Figure 2 shows what actually happened with one release, as well as what the missing Finance Instant Answer should look like.

Figure 2. A Bing bug... Amazon stock query missing Finance Instant Answer

A classic test approach might be to iterate through the S&P 500 or the Wilshire 5000 and execute the query “[company_name] stock price”, substituting in each company name and searching for where the instant answer did not fire. We would then find that the following test cases pass:

  • Expedia stock price
  • Comcast stock price
  • Bed Bath and Beyond stock price

All-pass is not all good in this case, as there are bugs we still have not found. Bing therefore implemented Bug Miner (12), which enables testers to configure queries or patterns and see what results real users are getting. Using real user data, Bing found and fixed these bugs (queries that did not return the Finance Instant Answer) that traditional testing would have missed:

  • stock price
  • Comcast Corporation stock price
  • Bed Bath & Beyond stock price
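
The idea of mining real user queries for a missing answer can be sketched in a few lines of Python. This is not Bug Miner itself, whose implementation is not described in this article. The log record format, the field names `query` and `answers`, and the `finance_instant_answer` label are all assumptions made purely for illustration.

```python
import re
from collections import Counter

# Hypothetical log records: the user's query and the instant answers that
# fired for it. The field names and answer labels are assumptions.
SAMPLE_LOG = [
    {"query": "Comcast stock price", "answers": ["finance_instant_answer"]},
    {"query": "Comcast Corporation stock price", "answers": []},
    {"query": "Bed Bath & Beyond stock price", "answers": []},
    {"query": "bed bath & beyond stock price", "answers": []},
    {"query": "weather seattle", "answers": ["weather_answer"]},
]


def find_missing_answers(log, pattern, expected_answer):
    """Count real user queries matching `pattern` where `expected_answer` did not fire."""
    matcher = re.compile(pattern, re.IGNORECASE)
    misses = Counter()
    for record in log:
        if matcher.search(record["query"]) and expected_answer not in record["answers"]:
            # Normalise case so variants of the same query are counted together.
            misses[record["query"].lower()] += 1
    return misses


misses = find_missing_answers(SAMPLE_LOG, r"\bstock price$", "finance_instant_answer")
for query, count in misses.most_common():
    print(f"{count:>3}  {query}")
```

The point of the sketch is that the tester supplies only a pattern and an expectation; the test inputs come from whatever real users actually typed, including phrasings nobody thought to enumerate up front.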

A consideration when using telemetry data as your signal is that you may have different signals from different users on different products. The example in Figure 2 would still seem to “fail” in the UK, but the international sites do not use the same logic as the US site. Different regions and different products will have differing signals that each must be interpreted in the context of those respective products.

Another example of using telemetry data as the signal for quality is Microsoft Hotmail. By instrumenting their web pages they get anonymised signals telling them when a user performs actions like opening or sending a mail for example. Using this data, testers can calculate how long it takes across millions of users to perform key tasks and interrogate this data to see how Hotmail performs across different operating systems and web browsers. By using this information key bottlenecks can be identified. In one example Hotmail re-architected image size and static content to improve upstream traffic, improving performance by 50%.
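
As a rough sketch of this kind of analysis, the Python below groups timing events by browser and summarises each group. The event tuple layout, the action names, and the browser labels are hypothetical; the article does not describe how Hotmail's real instrumentation or analysis is built.

```python
from collections import defaultdict
from statistics import median, quantiles

# Hypothetical anonymised events: (action, browser, duration in milliseconds).
EVENTS = [
    ("open_mail", "Browser A", 400),
    ("open_mail", "Browser A", 450),
    ("open_mail", "Browser A", 500),
    ("open_mail", "Browser B", 900),
    ("open_mail", "Browser B", 1100),
    ("open_mail", "Browser B", 1300),
    ("send_mail", "Browser A", 700),
]


def timing_by_browser(events, action):
    """Summarise how long `action` takes in each browser (median and 95th percentile)."""
    durations = defaultdict(list)
    for name, browser, millis in events:
        if name == action:
            durations[browser].append(millis)
    summary = {}
    for browser, values in durations.items():
        if len(values) < 2:
            continue  # too few samples to estimate a percentile
        # With n=20 the last of the 19 cut points is the 95th percentile.
        p95 = quantiles(values, n=20, method="inclusive")[-1]
        summary[browser] = {"median": median(values), "p95": p95}
    return summary


summary = timing_by_browser(EVENTS, "open_mail")
slowest = max(summary, key=lambda b: summary[b]["p95"])
print(f"Slowest segment for open_mail: {slowest} {summary[slowest]}")
```

The same grouping could be keyed on operating system, region, or any other dimension carried in the telemetry.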

More Ways to Generate the Signal

Monitoring and analysing production usage to access the telemetry data signal for software quality is a very approachable way to implement Testing in Production. But there are also more active steps we can take to vary the input stage of our three-step approach and gain quality insights.

We can start with all those test cases we used to run in the traditional model. In part one I told you about Microsoft Exchange, who re-engineered the 70,000 automated test cases they used to run in a lab so that they could run them in production with their new hosted Microsoft Exchange Online cloud service. While test cases in the lab give us the traditional pass/fail signal, we can take a different approach in production, looking instead at the success rate over thousands of continuous runs. Success or failure is measured by availability and performance KPIs. For a given scenario, did we meet the “five nines” (99.999%) availability? Did we complete the task in less than 2 seconds 99.9% of the time? The signal is still the same, it is still the data coming out of our system under test, but we are triggering that signal with our testing. Exchange Online runs thousands of tests constantly in production, alerting the team to small non-customer-impacting outages which represent regressions and risks to the service (13). They find quality issues before the customer does.
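
To make the shift from pass/fail to KPIs concrete, here is a minimal Python sketch that turns a stream of synthetic test runs into the two measures mentioned above. The `RunResult` type, the function names, and the sample data are hypothetical; only the targets (99.999% availability, under 2 seconds 99.9% of the time) come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """Outcome of one synthetic test run in production (hypothetical shape)."""
    succeeded: bool
    seconds: float


def availability(runs):
    """Fraction of runs that succeeded."""
    return sum(r.succeeded for r in runs) / len(runs)


def fraction_within(runs, limit_seconds):
    """Fraction of all runs that succeeded in under `limit_seconds`."""
    fast = sum(r.succeeded and r.seconds < limit_seconds for r in runs)
    return fast / len(runs)


def evaluate(runs, availability_target=0.99999, limit_seconds=2.0, latency_target=0.999):
    """Compare measured KPIs with their targets; defaults mirror the figures in the text."""
    measured_availability = availability(runs)
    measured_latency = fraction_within(runs, limit_seconds)
    return {
        "availability": (measured_availability, measured_availability >= availability_target),
        "latency": (measured_latency, measured_latency >= latency_target),
    }


# Illustrative data only: 200,000 runs with one failure and 300 slow successes.
runs = (
    [RunResult(True, 0.8)] * 199_699
    + [RunResult(True, 3.0)] * 300
    + [RunResult(False, 5.0)] * 1
)
for kpi, (value, met) in evaluate(runs).items():
    print(f"{kpi}: {value:.6f} {'met' if met else 'MISSED'}")
```

Note that no single run decides the outcome here. One failed run in 200,000 still meets the availability target, while a few hundred slow but successful runs are enough to miss the latency target.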

Another way to vary our Inputs is to deploy the new version, feature, or product change so only a subset of users sees it. This is called Exposure Control. Using Exposure Control we can run A/B tests, comparing the data signal from the old version (A) to the new version (B). Google, Amazon, Microsoft and others regularly do this. Often UI changes are assessed based on business metrics, but testers can also make use of back-end changes “under the covers” to assess quality metrics. Google regularly tests changes by limiting exposure to explicit people, just Googlers, or a percent of all users.(14)
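
One common way to implement this kind of exposure control is to hash each user into a stable bucket and expose version B only to buckets below a chosen percentage, with an explicit allow list for named people. The Python sketch below assumes that approach; it is not a description of how Google, Amazon, or Microsoft actually do it, and every name in it (`bucket`, `variant`, the experiment label) is hypothetical.

```python
import hashlib


def bucket(user_id, experiment):
    """Map a user to a stable bucket in [0, 100) for this experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def variant(user_id, experiment, percent_exposed, allow_list=frozenset()):
    """Return "B" for explicitly listed users and for `percent_exposed`% of everyone else."""
    if user_id in allow_list:
        return "B"
    return "B" if bucket(user_id, experiment) < percent_exposed else "A"


def error_rate_by_variant(requests):
    """Turn (variant, failed) pairs into an error rate per variant."""
    totals, failures = {}, {}
    for name, failed in requests:
        totals[name] = totals.get(name, 0) + 1
        failures[name] = failures.get(name, 0) + int(failed)
    return {name: failures[name] / totals[name] for name in totals}


# Expose the hypothetical "new-backend" change to about 10% of users.
users = [f"user-{i}" for i in range(10_000)]
exposed = sum(variant(u, "new-backend", 10) == "B" for u in users)
print(f"{exposed / len(users):.1%} of users see version B")

# Illustrative request outcomes: compare a quality metric between A and B.
requests = (
    [("A", False)] * 990 + [("A", True)] * 10
    + [("B", False)] * 95 + [("B", True)] * 5
)
print(error_rate_by_variant(requests))
```

Hashing on the experiment name as well as the user ID keeps each user's assignment stable within one experiment while avoiding the same users always landing in version B across experiments.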

We can now update our testing model diagram to represent this new quality signal and the three inputs that drive it, as shown in Figure 3.

Figure 3. The TestOps model of software testing

TestOps? What Does This Mean for Me?

Just the change in signal alone indicates a big shift in the testing profession and how we test software. But what does TestOps mean to the day-to-day work of the tester?

The roles of the Developer and Tester begin to change, with each looking more like the other:

  • The tester’s focus can be much less on the up-front testing and running of discrete functional test cases. Developers have always had responsibility for unit testing, but to enable testers to focus more on production and high context scenarios consider moving up-front functional testing to the developers. Put another way, a developer must use every available means to produce code free of low context bugs that can be found by running on his desktop.
  • Testers now need to focus on a Test Oriented Architecture (15) that enables us to collect and identify the right data signal. Testers need to create tools to run the tests in production, monitor the data signal, and analyse this signal. These are roles that look very dev-like.

The roles of Operations and Tester begin to change, similarly with each looking more like the other:

  • Testers are now focused on the live site, the traditional domain of operations. The tester’s skill is now applied in determining the connection between the data signal and product quality.
  • Whether scanning the production data signal or firing synthetic transactions, the tests we write now look a lot like monitors, something Ops has been doing all along. The value test brings is in moving these monitors from heartbeats and simple scenarios to high-context, user-affecting scenarios.

Across the major services, these changes are already taking place. At Facebook and Microsoft Bing, they practice combined engineering, with no separate test discipline. The goal is for developers to not only develop code but to also do all of the TestOps tasks above. This can be a difficult model to execute and will fail if engineers revert to “classic” developer responsibilities alone thus abdicating quality. A separate TestOps role can ensure quality is foremost even while blending this test role with Dev and Ops. Therefore Google, Amazon, and Microsoft services (other than Bing) instead maintain their test disciplines, but with high Dev to Test ratios from 3:1 to 8:1. A 1:1 ratio is not necessary as testers are no longer writing rafts of tests for each discrete feature and are instead focusing on tools and processes to harness the power of the data signal. Companies like and Google maintain central teams to further put Developers’ and Testers’ focus on the live site creating tools to make deployment and monitoring easier.

It’s All Test, It’s All Good

Yes, Testers can still test up-front. Black box and exploratory testing still add value and the traditional test model can still be applied. But depending on your product – especially if it is a service – the TestOps model using the big data signal can be your most powerful tool to assess software quality. In Test is Dead (16) Google’s Director of Engineering made the case that the data signal tells us if we built the right thing and that traditional testing is much less important. Test is not dead. The power of big data and the data signal will also tell us what we need to know about quality – not just did we build the right thing, but did we build it right.


  12. Edward Unpingco; Bug Miner; Internal Microsoft Presentation, Bing Quality Day Feb 11, 2011
  13. Experiences of Test Automation; Dorothy Graham; Jan 2012; ISBN 0321754069; Chapter: “Moving to the Cloud: The Evolution of TiP, Continuous Regression Testing in Production”; Ken Johnston, Felix Deschamps
  14. Google: Seattle Conference on Scalability: Lessons In Building Scalable Systems, Reza Behforooz [timestamp: 20:35]

Author Bio

Seth Eliot is Senior Knowledge Engineer for Microsoft Test Excellence, focusing on driving best practices for services and cloud development and testing across the company. He previously was Senior Test Manager, most recently for the team solving exabyte storage and data processing challenges for Bing, and before that enabling developers to innovate by testing new ideas quickly with users “in production” with the Microsoft Experimentation Platform. Testing in Production (TiP), software processes, cloud computing, and other topics are ruminated upon at Seth’s blog and on Twitter (@setheliot). Prior to Microsoft, Seth applied his experience at delivering high-quality software services leading the Digital QA team that released Amazon MP3 download, Amazon Instant Video Streaming, and Kindle Services.