
Testing Big Data In An Agile Environment


By Adam Knight

One of the hot trends in technology this year is the topic of “Big Data” and products targeting the “Big Data” problem (1). In May 2011 the McKinsey Global Institute described big data as “The next frontier for innovation, competition, and productivity” (2), and in December 2011 the International Institute of Analytics predicted that ‘The growth of “big data analytics” will outpace all other areas of analytics in 2012’ (3). Previously unseen levels of technology adoption across the world are producing greater and greater quantities of data, which organisations are looking to analyse using scalable technologies such as the Hadoop distributed software architecture (4). At the same time, increasing legislation necessitates the ability to store this information securely.

While big data presents fantastic opportunities to software vendors, it also poses some significant problems for test teams in small organisations working in this arena. The use of agile development processes is becoming ubiquitous, with continuous integration and fast feedback loops allowing organisations to operate in responsive markets.

When it comes to testing long-running, large-scale big data technologies, it can be very difficult to fit the testing operations within the short development iterations that typify agile processes. In my current organisation, our product is used to store many terabytes of data that will typically take many servers months to import. In a staged testing development project, we would expect to have a long-running performance and load testing phase in which we test to the performance limits of the system. How, then, do we perform equivalent testing within a four-week agile iteration? To tackle this problem, we have to look at the software itself and test smarter, not harder.

Understanding the System

Big data tools, by their very design, incorporate indexing and layers of abstraction from the data itself in order to process massive volumes of data efficiently and in usable timescales. To test these applications, our testing, too, must look at these same indexes and abstraction layers and leverage corresponding tests at the appropriate layers. In this way, we can test the scalability of the system components without necessarily having to process the full data load that would normally accompany that scale of operation.

For example, within my current system, data is stored in a database-like structure of schemas and tables. Data in each table is stored in partitions of 100,000 to 1,000,000 records. These partitions are then indexed via a metadata database, which stores the appropriate metadata to look up each partition efficiently when querying the data from the application, as shown in Figure 1.

Figure 1. Metadata Layer Indexes Fully Populated Data Partitions
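To make the idea concrete, here is a minimal sketch of how such a metadata layer might be modelled. It uses SQLite and illustrative column names rather than the product's actual schema: each partition is registered with its table, record count and date range, and a query first consults the metadata to find only the partitions it needs.

```python
import sqlite3

# Illustrative metadata catalogue: one row per data partition, recording
# where the partition lives and which date range it covers. The column
# names are hypothetical, not the product's real schema.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE partition_metadata (
        partition_id INTEGER PRIMARY KEY,
        table_name   TEXT NOT NULL,
        record_count INTEGER NOT NULL,
        min_date     TEXT NOT NULL,   -- ISO dates, e.g. '2012-01-01'
        max_date     TEXT NOT NULL,
        storage_path TEXT NOT NULL
    )
""")

def partitions_for_query(table_name, start_date, end_date):
    """Return the storage paths of partitions whose date range overlaps the
    query range -- the lookup the application performs before touching any
    partition data."""
    rows = conn.execute(
        """SELECT storage_path FROM partition_metadata
           WHERE table_name = ? AND min_date <= ? AND max_date >= ?""",
        (table_name, end_date, start_date),
    )
    return [path for (path,) in rows]
```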

As testers, we have to work to understand this metadata database and the relationships that exist between it and the data. This knowledge allows us to create test archives in which each layer in the system behaves exactly as it would in a massive production system, but with a much lower setup and storage overhead (Figure 2). By essentially “mocking out” very small partitions for all but a target range of dates or imports, we create a metadata layer that is representative of a much larger system. Our knowledge of the query mechanism allows us to seed the remaining partitions (Figure 2 – Partitions 3 and 4) with realistic data, so that a customer query across that range is functionally equivalent to the same query on a much larger system.

Figure 2. Fully Populated Metadata with Reduced Data Storage
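A sketch of how such a test archive might be generated is shown below, building on the metadata catalogue from the previous sketch. The file layout and partition sizes are illustrative assumptions: every partition outside the target query range is written as a tiny stub, while partitions inside the range (the equivalent of Partitions 3 and 4 in Figure 2) are seeded with realistic, full-sized data, so the metadata layer is populated as it would be on a much larger system at a fraction of the storage cost.

```python
import csv
import os
import random
from datetime import date, timedelta

FULL_PARTITION_ROWS = 100_000   # representative production partition size
STUB_PARTITION_ROWS = 100       # "mocked" partitions outside the target range

def build_test_archive(root, table_name, n_partitions, target_range):
    """Create one daily partition per day from an arbitrary start date; only
    partitions whose date falls inside target_range get realistic volumes."""
    os.makedirs(root, exist_ok=True)
    start = date(2012, 1, 1)
    for pid in range(n_partitions):
        day = start + timedelta(days=pid)
        in_target = target_range[0] <= day <= target_range[1]
        rows = FULL_PARTITION_ROWS if in_target else STUB_PARTITION_ROWS
        path = os.path.join(root, f"{table_name}_{day.isoformat()}.csv")
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            for i in range(rows):
                writer.writerow([i, day.isoformat(), random.random()])
        # Register every partition, large or small, so the metadata layer
        # looks and behaves like that of a massive installation.
        conn.execute(
            "INSERT INTO partition_metadata "
            "(table_name, record_count, min_date, max_date, storage_path) "
            "VALUES (?, ?, ?, ?, ?)",
            (table_name, rows, day.isoformat(), day.isoformat(), path),
        )
    conn.commit()

# Example: ten years of daily partitions, with only one week carrying
# realistic volumes for the target customer query.
build_test_archive("/tmp/test_archive", "calls", 3_650,
                   (date(2016, 6, 1), date(2016, 6, 7)))
```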

Hitting the Ground Running

Whilst it is important to have some testing of newly installed systems, many of the important performance tests that need to be performed on big data systems will be more realistic against a populated installation. Building each of these up from scratch can be a time-consuming exercise which may not be possible within the confines of a sprint iteration. The ability to ‘snapshot’ a virtual machine and roll back to a known hard disk state is very useful for this kind of operation in smaller-scale testing, but is of limited use for big data archiving tests given the scale of storage involved. To overcome this problem, we can use various techniques to pre-populate data into the system prior to executing new tests. A few techniques that I currently use are:

Static Installation – Configuring the software against a static, “read-only” installation can be useful for testing query performance against a known data set for performance benchmarking.

Backup/Restore – Using the backup/restore and disaster recovery features of the system to restore an existing installation in a known state. As well as being an excellent way of restoring an installation, this also exercises the backup and recovery mechanisms themselves through real use (a sketch of this approach appears after this list).

Data Replication – If the software supports quick import or replication methods then we can leverage these to populate an installation with bulk data far more quickly than through the standard importing interfaces. For example, we utilise a product feature to support geographic replication of data across servers to bulk insert pre-built data into archives far more rapidly than the standard import process. Once we have reached a suitable capacity we can then switch to standard importing to test performance.

Rolling installation – Having an installation in a “rolling state”, whereby tests import new data and archive out old data at a continuous rate. This allows testing at a known capacity level in a realistic data lifecycle without the lead time of building up an archive for each iteration. It has the added benefit of boosting our version compatibility testing, as the installation has been running over a long period of time and across many software versions.
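As an illustration of the backup/restore technique, the restore step can be wrapped in a test fixture so that every performance run starts from the same known, pre-populated state. The sketch below uses pytest, and the `restore_backup` and `run_query` commands, paths and timing threshold are all hypothetical placeholders for whatever tooling the product actually provides.

```python
import subprocess
import time

import pytest

# Hypothetical paths and commands -- stand-ins for the product's real
# backup/restore tooling and installation layout.
BACKUP_IMAGE = "/backups/known_state_10tb"
ARCHIVE_DIR = "/data/archive"

@pytest.fixture(scope="session")
def populated_archive():
    """Restore a large, known-state archive once per test session.

    Running the restore here also exercises the backup and recovery
    mechanism itself through real use."""
    subprocess.run(
        ["restore_backup", "--source", BACKUP_IMAGE, "--target", ARCHIVE_DIR],
        check=True,
    )
    return ARCHIVE_DIR

def test_query_benchmark(populated_archive):
    """Time a representative query against the restored archive and check it
    stays within an agreed budget (the query and threshold are illustrative)."""
    started = time.monotonic()
    subprocess.run(
        ["run_query", "--archive", populated_archive,
         "--sql", "SELECT COUNT(*) FROM calls"],
        check=True,
    )
    assert time.monotonic() - started < 60
```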

Creative Testing

Big data is a growing market requiring specialist products, which in turn need specialist testing. A dedicated performance and load testing phase is no longer available to us when working with agile methodologies.

To support our adoption of agile methods, as testers we need to constantly use our creativity to find new ways of executing important performance and load tests to provide fast feedback within development iterations.

Here I’ve presented a few of the methods that we have used to address these challenges and support a successful big data product. This is certainly not a static list, as big data is getting bigger by the day. Even as I write this, we face greater scalability challenges and are making increased use of cloud resources to ramp up the testing of our own Hadoop integration and take the next step up the scalability ladder. As more companies look to move into the big data arena, I believe that testers will be a critical factor in the success of these organisations through their ability to devise innovative ways of testing massively scalable solutions with the resources available to them.

References

Author Bio

Adam Knight is Director of QA and Support for RainStor Inc. and writes an independent blog at http://www.a-sisyphean-task.com
