Core Processes – The Test Data Management Process

In the automated testing domain, one discipline tends to cause more problems than any other: test data management. In this article, we use the generic term ‘management’ to cover four aspects of handling test automation data: planning, designing, generating and storing. As we discuss each of these in detail, we’ll see that successful test data management depends on upfront planning with all four aspects in mind.

Often, issues arise when the QA team puts all its effort into generating test data and ignores the other three key components. In this article, we also emphasise the often-overlooked stages of planning, designing and storing test data.

Before we take a closer look at each stage, it’s worth examining the complexities that test data management presents to testing teams. Only once these are understood is it possible to clearly see the importance of each of the four stages.

The Complexities of Test Data Management

Test data management is one of the most complex areas of automated testing. This complexity stems from a few areas:

First, we need to create reusable data sets that the tests feed into the application under test. Second, some external or reference data is required (e.g. data feeds or lookup tables). Third, the data already populated in a test environment needs to be in a particular state. And fourth, there may be dependencies, meaning that some data may need to be entered first in order for the core test case data to work.

We’ve all had the experience of feeding in the same test data set a second time, only to find that there are duplicate records causing issues. This usually results from dates or primary key values being overly constrained. So while we do need to have reusable data sets, there will often be system constraints that prevent us from repeatedly using exactly the same data set.
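One common workaround is to rewrite the constrained fields with a run-specific suffix at load time, so each execution run feeds in logically identical but physically unique records. A minimal sketch (the field names and run ID format here are illustrative, not from any particular tool):

```python
import uuid

def make_unique(record, unique_fields, run_id=None):
    """Return a copy of a test record with a run-specific suffix
    appended to each field that must be unique per run."""
    run_id = run_id or uuid.uuid4().hex[:8]  # fresh ID per execution run
    unique = dict(record)
    for field in unique_fields:
        unique[field] = f"{record[field]}-{run_id}"
    return unique

# The base record in your stored data set never changes; only the
# loaded copy is made unique for this run.
base = {"customer_ref": "CUST-001", "name": "Test Customer"}
rec = make_unique(base, ["customer_ref"], run_id="20240101a")
```

The stored data set stays reusable, while the system under test never sees the same key twice.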

Another common issue is that the data from external data feeds and reference data changes from one run to the next. It’s not that your actual test data has changed, but that the external data you rely on has changed. This has a knock-on effect on the test results that are compared against the expected results.

A third complication arises from the growth of databases and data feeds over time. As test environments mature, they become increasingly difficult to clean and purge. In complex environments, it just isn’t viable to refresh the data before every automated test run.

Fourth, dealing with data dependencies usually entails feeding in one type of record and obtaining a reference key that is then used to feed in the next record. This sounds simple, but as test scenarios grow, setting up the ever-larger volume of initial data becomes quite onerous.
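The pattern just described (feed in a parent record, capture the returned key, inject it into each dependent record) can be sketched as follows. The `FakeApi` class is a hypothetical in-memory stand-in for whatever interface your system under test actually exposes:

```python
class FakeApi:
    """Hypothetical stand-in for the system under test: creating a
    record returns the reference key that dependent records need."""
    def __init__(self):
        self._next_id = 100
        self.records = {}

    def create(self, record):
        self._next_id += 1
        self.records[self._next_id] = record
        return self._next_id  # reference key for dependent records

def load_with_dependencies(api, account, orders):
    """Feed the parent record first, then inject its key into each child."""
    account_id = api.create(account)
    order_ids = [api.create({**o, "account_id": account_id}) for o in orders]
    return account_id, order_ids

api = FakeApi()
acc_id, order_ids = load_with_dependencies(
    api, {"name": "Test Account"},
    [{"item": "widget"}, {"item": "gadget"}],
)
```

Every extra level of chaining adds another place the setup can fail, which is why these dependencies deserve planning attention.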

Let’s take a look at some ways to overcome these issues.

Dynamic Data Generation

One option is to automatically update the test data prior to each test execution run. A scenario where this would be required would be when feeding in data from an Excel spreadsheet containing date values that expire (e.g. if they need to always be one week ahead for the test to work). A way around this could be to use formulas to define the date values as ‘Today+1w’. At run time, today’s date is taken and a week is added to it. This value is then used in the automated test case.
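A resolver for such formulas takes only a few lines. The ‘Today+1w’ syntax follows the example above; the exact grammar and supported units (days and weeks) are an assumption for illustration:

```python
import re
from datetime import date, timedelta

UNITS = {"d": 1, "w": 7}  # assumed units: days and weeks

def resolve_date(value, today=None):
    """Resolve formulas like 'Today+1w' or 'Today-3d' to a concrete
    ISO date at run time; plain values pass through unchanged."""
    today = today or date.today()
    match = re.fullmatch(r"Today([+-])(\d+)([dw])", value)
    if not match:
        return value
    sign, amount, unit = match.groups()
    days = int(amount) * UNITS[unit] * (1 if sign == "+" else -1)
    return (today + timedelta(days=days)).isoformat()

resolve_date("Today+1w", today=date(2024, 1, 1))  # '2024-01-08'
```

Run the resolver over each cell of the spreadsheet as it is loaded, and the data set never expires.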


Planning Test Data

Spending some time up front to work out the requirements for your test data will save you a significant amount of rework later on. During the planning phase, answer these questions: How much test data will you need, and where will it be stored? How will data sets be reused from one execution run to the next? What external or reference data does the application depend on? What state must the test environment be in before each run? And which records need to be loaded before others will work?

You may not be able to answer all of these questions up front, but the process of trying to answer them will draw your attention to any special requirements that need to be built into the system from the beginning. For example, if you know you’re going to need large volumes of test data, you can start thinking about storage methods up front even before you know the exact requirements.

Designing Data Sets

This is where art and science meet: the innate ability of a good tester to craft data sets using experience and intuition, coupled with the science of using tools and scripts to create or extract data. Neither approach is perfect, but neither of them should be ignored. Used together, they form a powerful combination.

It’s tempting to look at a range of input fields and think that writing a script to generate all possible permutations of input data is the best way to go. However, the number of permutations will inevitably grow beyond the capabilities of any tool or tester. A good tester should be able to select subsets of data according to the scenarios that are most likely to expose defects and, therefore, deliver the best test coverage.
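As a rough illustration of how quickly the full cross product outgrows a reduced selection, the sketch below contrasts exhaustive permutations with a simple ‘each value at least once’ reduction. The field names and values are invented; in practice a tester would choose subsets by risk, and named techniques such as pairwise testing give stronger coverage than this minimal approach:

```python
from itertools import product

browsers = ["chrome", "firefox", "safari"]
locales = ["en", "fr", "de", "ja"]
payment = ["card", "paypal", "invoice"]

# Exhaustive: every combination of every field value.
exhaustive = list(product(browsers, locales, payment))  # 3 * 4 * 3 = 36 cases

def each_value_once(*fields):
    """Cover every individual field value at least once by zipping the
    fields together, cycling the shorter ones. Far fewer cases than the
    cross product, at the cost of missing value *combinations*."""
    longest = max(len(f) for f in fields)
    return [tuple(f[i % len(f)] for f in fields) for i in range(longest)]

reduced = each_value_once(browsers, locales, payment)  # only 4 cases
```

Add a fourth field with five values and the exhaustive count jumps to 180 while the reduced set grows to just five, which is why blind permutation generation does not scale.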

There is much insight to be gained from the input of business users and developers. Business users can tell you what kind of data end users will be entering; it is pointless testing with data that doesn’t represent what the system under test will actually be doing once live. Developers can tell you where data sets are most needed to test the code, for example by identifying its most complex areas or where the highest levels of code churn are occurring.

But don’t overlook exploratory testing. Experimenting with small data sets, even entered manually, can provide design ideas for the bigger, auto-generated data sets. Quite often the application logic will preclude certain data configurations. Knowing this early on will save you a significant amount of work when you start generating the data.

Generating Data

Often the most important question to ask in this area is, ‘Will the test data need to be generated dynamically before every test run, or simply created up front and then stored and used in every test execution run without modification?’

These approaches lie at opposite ends of a spectrum, and the two can be combined: keep a static data set, then preprocess it just before each test execution run, for example by updating date-sensitive fields.
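One way to implement this middle ground, assuming the static data set is a CSV file, is a small preprocessing step that rewrites the sensitive columns just before the run and leaves everything else untouched (the column names here are illustrative):

```python
import csv
import io
from datetime import date

def preprocess(csv_text, overrides):
    """Rewrite the named columns of a static CSV data set just before
    a test run; all other columns pass through untouched."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        for column, fresh_value in overrides.items():
            row[column] = fresh_value
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

# The stored file keeps stale dates; only the in-flight copy is refreshed.
static = "order_id,delivery_date\n1,2020-01-01\n2,2020-01-02\n"
fresh = preprocess(static, {"delivery_date": date(2024, 6, 1).isoformat()})
```

The version-controlled file never changes, so diffs between runs stay meaningful, while the data actually fed into the application is always current.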

Another technique worth considering is using setup scripts to pre-populate the test environment with data that forms the foundation of the actual test. The complexity here is that values and keys from this foundation data are required in the test data that will actually be used. There will be reference numbers and unique IDs that need to be fed back into your main data set just prior to executing the run, which means your test data will have to be modified for every run. Dependencies like this tend to make test runs less reliable, so decouple them where possible. If you are forced to have dependencies, make sure the setup part of the run is as reliable as possible: build a strong foundation before you start building tests on top of it.

Storing Data

Managing the storage of test data is not just about making sure you can handle large volumes of data; you also need to consider version control requirements.

The version tracking of your test data is just as important as that of your automation code. If users are allowed unfettered access to modify test data, you’ll never know how these changes are affecting the quality of the results generated by your execution runs.

For this reason, where possible, store your test data alongside the automation code using a source code control tool like Git or SVN. This will become difficult with large volumes of data, but if you can do it, it will greatly simplify things, especially if you’re using CI tools like Jenkins. There will be just one place for your CI tool to obtain both the automation scripts and data. One version or revision number will cover everything.

Having data in a text format simplifies the process of working out the differences between one execution run and the next. Tools for diffing text or CSV files are easy to set up, assuming that the files are well structured and not too large. Once this is in place, a few mouse clicks will show you every change made to the test data between one automation run and the next: have your CI tool retrieve the data files from the two runs and compare them with your usual diff tool. You’ll easily be able to trace failures caused by changes in the test data.
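Using Python’s standard difflib as a stand-in for whatever diff tool you normally use, comparing the data files retrieved from two runs might look like this (the run directories and file names are hypothetical):

```python
import difflib

# Data files as retrieved by the CI tool from two execution runs.
run1 = "id,amount\n1,10\n2,20\n3,30\n"
run2 = "id,amount\n1,10\n2,25\n3,30\n"

# unified_diff pinpoints exactly which data rows changed between runs.
diff = list(difflib.unified_diff(
    run1.splitlines(), run2.splitlines(),
    fromfile="run-041/data.csv", tofile="run-042/data.csv", lineterm=""))

for line in diff:
    print(line)
```

Rows prefixed with `-` and `+` show exactly which record changed, so a test failure can be traced straight back to a data change.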

This process isn’t nearly as easy when your data is stored in a database. The advantages of using a database are that it allows you to manage large data sets and to make simple calls to access and modify records. And because databases are inherently well structured, it’s easier to display and visualise the data. However, it’s not as easy to assess how your data changes over time. So, you have to weigh up the pros and cons of each approach. It’s often useful to start out with a text-based, file-driven approach and then migrate to a database when the time is right.

Binary and image data can present interesting challenges. If you are doing lots of image comparison tests, you may need to store the images in a file system. Storing these types of files in source code control tools or databases is tricky. Keeping track of file versions and figuring out the changes from one execution run to the next is a complicated and more manual process. However, there are tools, like Artifactory (from JFrog), that provide version control and are designed to deal with large volumes of binary data. They also integrate well with CI and CD tools, so they’re well worth investigating.


In this article, we examined four aspects of automated test data management: planning, designing, generating and storing.

We highlighted the important things to consider when planning the contents of test data sets and the infrastructure that will store and handle them. Careful attention should be paid to how data will be reused, how external reference data will be handled, the needs of a growing test environment, and dependencies among data sets. When selecting constrained ranges of values to generate, considering the needs of developers and end users will focus your test coverage on the most important areas of the back-end code and user interface.

Thus, we have shown that careful planning up front will help you to design, generate and store test data in a way that facilitates automated test execution, maximises test coverage, and makes troubleshooting data-related failures quick and easy.