Software Testing Club - An Online Software Testing Community

Hi there.

I came across an argument today for using production data as your test data for system testing (the actual data you will use when everything is complete and ready to go live).

Although I thought this was a great idea, because you would at least see how the data works earlier, I also saw the downsides.

1. If I were to use this data, I would have had to import 2000 records and then establish whether or not they met all the requirements for data input, thus lengthening the test in an already limited time frame.

2. The production import data was still in a state of flux, meaning it could have more rows added or even changes made to existing data.

The PM I dealt with argued that if we had used real data earlier, I would have spotted more problems sooner. I can't argue with that, except that time constraints prevent us from using full production data.

Personally I prefer to use a subset or dummy data that mimics production. That way I can put errors into my data and test that the requirements are met by manipulating the data however I like. Far better than dealing with a real file, I think.
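To make the dummy-data idea concrete, here is a minimal sketch in Python. The field names, the three fault types and the accept/reject labels are all assumptions made up for illustration; the thread does not say what the real import file looks like.

```python
import random

def make_record(i, rng):
    """Build one well-formed dummy record (field names are hypothetical)."""
    return {
        "id": i,
        "name": f"Test User {i}",
        "email": f"user{i}@example.com",
        "salary": rng.randint(20000, 90000),
    }

# Each fault breaks exactly one assumed input rule.
FAULTS = {
    "missing_name": lambda r: {**r, "name": ""},
    "bad_email": lambda r: {**r, "email": "not-an-email"},
    "negative_salary": lambda r: {**r, "salary": -1},
}

def make_dataset(n_valid, seed=0):
    """Return n_valid clean records plus one record per known fault."""
    rng = random.Random(seed)
    data = [dict(make_record(i, rng), expected="accept") for i in range(n_valid)]
    for offset, (fault, corrupt) in enumerate(FAULTS.items()):
        rec = corrupt(make_record(n_valid + offset, rng))
        data.append(dict(rec, expected="reject", fault=fault))
    return data

dataset = make_dataset(5)
```

Because every bad record carries the name of the fault that was planted in it, the tester knows in advance which rows the import should reject, which is exactly what unaltered production data cannot tell you.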

However, I did say that once the product was fully tested and ready to progress, we could do a form of OAT to test real production data.

What are your thoughts on using complete real data to execute system testing? Bear in mind that by completely real I mean you don't know the flaws that exist in it and you are not allowed to alter it.

It would be interesting to hear them.


Replies to This Discussion

"...And hope for the best"

Well, what can they do?

They have the following:

1. We want to reduce expenses: let's close [expensive, unnecessary] testing positions inside the company.

2. We still need to do [some] testing: let's engage [cheap] [offshore] third-party testing companies.

3. We have to fulfill privacy and security policies.
- They answer it: "give as limited access as possible".

Giving any other answer requires thinking outside the box.

In the case of in-shore outsourcing, test consultants usually access the client's environment from the inside, are not allowed to take sensitive information outside, and sign all the required privacy and security papers, so there is no need to mask that much.

I'll show you one example. Let's take an imaginary HR application. It is used at an office of about 50 people. It holds the following information:
* Name
* Address
* Phone number
* Title
* E-mail address
* Salary
* Knowledge recorded in a formal way (in the database as SKILL and KNOWLEGE_LEVEL)
* Free-form information originally written by the person

If even 2 different values are left unmasked, in some cases you can very likely connect the information to the right person. E.g. let's imagine that salary and formal knowledge are not masked. Now if there is a Unix admin, her salary is visible to all testers, as she most likely has many Unix-related skills in the formal list. The same goes for everyone else who has unique skills, or for any small group with similar skills. So in the end most of the usable data gets masked with artificial data, so why take prod data at all?

The model has similar dangers too, so it also has to be done carefully. And this application was very simple. An 'attacker' doesn't even need to do heavy analysis to reveal personal information if even a couple of fields with personal information are left unscrambled.
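The re-identification risk described in this reply can be checked mechanically before a masked extract is handed to testers. The sketch below is only an illustration: the records are invented, and the threshold `k` is an assumption borrowed from the usual k-anonymity idea, not something the poster proposed.

```python
from collections import Counter

# Masked extract: names are replaced, but skill and salary are left as-is.
# All values are invented for illustration.
masked = [
    {"name": "Person A", "skill": "Java", "salary": 52000},
    {"name": "Person B", "skill": "Java", "salary": 52000},
    {"name": "Person C", "skill": "Java", "salary": 48000},
    {"name": "Person D", "skill": "Unix admin", "salary": 61000},
]

def risky_records(rows, quasi_identifiers, k=2):
    """Return rows whose combination of unmasked fields occurs fewer than k times."""
    def key(row):
        return tuple(row[field] for field in quasi_identifiers)
    counts = Counter(key(row) for row in rows)
    return [row for row in rows if counts[key(row)] < k]

exposed_by_skill = risky_records(masked, ["skill"])
exposed_by_both = risky_records(masked, ["skill", "salary"])
```

With only the skill left open, the single Unix admin stands out. Leave salary open as well and a second person becomes unique, which is the point made above about two unmasked values.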


Actually, salary is not even as sensitive as a SIN or bank account number. That could be enough data for identity fraud!

And having a valid phone number often allows finding address and personal information in a phonebook.


So you think it's nearly impossible to anonymize these fields? I disagree.

As I said, it may be tedious, difficult or impractical. And in your trivial 50-people in-house example, it may simply not be worth doing - it may be far simpler to just use artificially constructed test data.

But impossible? Hardly. On the contrary, it would be quite simple.


Please let me know how to anonymize those so that the data is not totally artificial. E.g. mention the fields which you are NOT going to scramble with totally new values.

Even with a larger data set there is a huge risk that personal information leaks to the testers, even though it is less likely.


"mention the fields which you are NOT going to scramble with totally new values"

There is seldom a need to scramble a State field, for example. Lots of other fields in many datasets are like this.

Some fields such as Years of Service or Salary can be algorithmically modified easily.

Other fields like Name can usually be replaced with random strings, or random names. Sometimes the names can come from within the dataset itself, just attached to different records than originally.

In the datasets I deal with, the vast majority of fields don't need to be anonymized/greeked - they have nothing to do with personally-identifiable information. Your mileage may vary, but it's always possible.
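A minimal sketch of the three treatments mentioned in this reply: leave State alone, modify Salary algorithmically, and reuse the dataset's own names on different records. The record layout, the 10% jitter and the function name are assumptions for illustration only, and as the earlier replies point out, this alone does not guarantee that nobody can be re-identified.

```python
import random

def anonymize(rows, seed=0, jitter=0.10):
    """Return a copy of rows with names shuffled and salaries perturbed."""
    rng = random.Random(seed)
    names = [row["name"] for row in rows]
    rng.shuffle(names)  # names stay realistic; a few may land on their own record
    result = []
    for row, name in zip(rows, names):
        factor = 1 + rng.uniform(-jitter, jitter)
        result.append({
            "name": name,
            "state": row["state"],  # left unchanged
            "salary": round(row["salary"] * factor),
        })
    return result

# Invented sample records.
rows = [
    {"name": "Alice Adams", "state": "NY", "salary": 50000},
    {"name": "Bob Brown", "state": "CA", "salary": 62000},
    {"name": "Carol Clark", "state": "TX", "salary": 48000},
    {"name": "Dan Davis", "state": "NY", "salary": 71000},
]
anonymized = anonymize(rows)
```

Using a fixed seed keeps the output repeatable between test runs. The originals in `rows` are not modified.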


Basically I agree with the previous answers: use both whenever possible, and choose wisely what to use when. Unfortunately for me, coming from the embedded world, "data" usually means analog signals, which are very hard to capture, save and obtain from real cases.
An interesting anecdote from the time I debugged fax relay boards: I had a list of fax machines all over the world known for being problematic. Once in a while I sent them a single page with an apology as regression testing...


I find that unless the production data has been properly filtered and a subset created, you are often left dealing with vast amounts of data. This often takes ages to load into the test environment (some places with large mainframes have cited over 48 hours to load test data into the test environment!). It can also be hard to pinpoint errors when confronted with such large amounts.
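One way to cut a production extract down to a loadable subset without reading it all into memory is reservoir sampling. This is just one possible technique, not something the poster named, and the function name and sample size below are assumptions.

```python
import random

def reservoir_sample(records, k, seed=0):
    """Pick k records uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, record in enumerate(records):
        if i < k:
            sample.append(record)      # fill the reservoir first
        else:
            j = rng.randint(0, i)      # inclusive on both ends
            if j < k:
                sample[j] = record     # replace with decreasing probability
    return sample

# Stand-in for a large production file: any iterable of rows works.
subset = reservoir_sample(range(100000), 50)
```

A purely random subset keeps the load time down, but it still needs the filtering mentioned above if particular record types must be present.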

I think a combination of both production and test created data is the best approach.


Anne-Marie has a good point about volumes of data. Security and confidentiality concerns obviously have to be addressed. However, I sometimes worry that people think these are the only problems, and that using production data gives them all the data that they'd require. In reality you're almost certainly getting huge amounts of data that is equivalent (in terms of equivalence partitioning), and you may not be getting the critical outlying values that could bring the application down. Does it matter? Maybe, maybe not, but the point has to be considered rather than blithely assuming that thrashing the application with huge input files will, in itself, give you the test coverage you need. Using unnecessarily large files may actually reduce coverage because of the sort of practical issues Anne-Marie refers to. Instead of multiple runs with carefully chosen data you get only one run with whatever data happens to lie in the production file.
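The equivalence-partitioning point can be illustrated with a quick count of how many records fall into each partition. The field (an order quantity), the partition boundaries and the sample values are all hypothetical.

```python
# Hypothetical partitions for an "order quantity" field.
PARTITIONS = {
    "negative": lambda q: q < 0,
    "zero": lambda q: q == 0,
    "typical": lambda q: 1 <= q <= 999,
    "over_limit": lambda q: q >= 1000,
}

def partition_coverage(values):
    """Count how many values land in each partition."""
    counts = {name: 0 for name in PARTITIONS}
    for value in values:
        for name, belongs in PARTITIONS.items():
            if belongs(value):
                counts[name] += 1
                break
    return counts

# Invented values standing in for a production file: all ordinary orders.
production_like = [3, 12, 7, 1, 250, 42, 8, 8, 19, 5]
coverage = partition_coverage(production_like)
missing = [name for name, count in coverage.items() if count == 0]
```

Here every record exercises the same partition, and the outlying cases that might bring the application down are not represented at all, however many more rows of the same kind are loaded.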

Testers need to identify exactly what value the production data would provide, and what it won't give you. It can be valuable for regression testing of the batch suite, and it can be essential for management information systems, which entail new processing for existing data. However, using existing production data for batch runs when there are new online functions can fog the issue. You have to do regression testing to see whether existing processing will work, but you have to test the effects of data that has been input through the new online functions. You have to be clear about what you're doing and what data you need for each purpose rather than just assuming that slinging hundreds of thousands of existing production records at the application will inevitably give you what you want. Using production data shouldn't be a substitute for test data planning. Sometimes I'm afraid that it is.


This is a really interesting subject and one that we advise on a lot. There are key principles and techniques as they relate to a test environment for targeted data extraction, data de-sensitisation, use of pair-wise optimised input scenarios, data synchronisation with test scenarios, as well as simultaneous user interface and database validation. The topic is being covered by George Wilson at the Test Management Forum Summit on 26th & 27th Jan in London. Check out the programme or TMF website for more info.


George will be continuing with the same theme when he presents "Five Easy Steps to Take Control of Your Data for Testing" during the BCS Data Management Specialist Group's free half-day event on 11 Feb at the BCS London offices. The details of the event can be found here and members and non-members alike can register online. Refreshments will be provided.



© 2010   Created by Rosie Sherry
