Testing - an exercise of skill, not a game of chance?

Yesterday evening Michael Bolton posted the sort of Tweet that makes Twitter worthwhile. It made me think.

"Q. What's wrong with this post? http://bit.ly/dDAL41 A. It treats testing as a game of chance, not an exercise of skill."

Normally I'd agree with Michael. Testing is an exercise of skill. In this context, however, I can't agree. The linked article is talking about the number of users required to detect usability defects. It's not about how many testers are required to detect defects in the functionality.

Testers do have to use their skill. Throwing more testers at an application without consideration of their skill levels is a ridiculous way to test. In fact, it wouldn't really be testing at all, but that's another debate.

Usability testing is about how the users will interact with the application. Testers can, and should, try to put themselves into the minds of the users, but they cannot be an adequate substitute for real users. 

Real users may approach an application with unrealistic expectations, or with attitudes that developers and testers consider irrational. They will usually have no existing knowledge of the application. In the case of internet applications they will often lack a knowledge of the conventions, and the culture, of these applications.

In short, users are liable to be ignorant or uninformed. Testers will be knowledgeable. It is extremely difficult to think oneself back into a state of ignorance. To do so would require us to consciously choose what knowledge we are going to dispense with. Genuine users aren't even necessarily aware of what they don't know.

I am not denigrating users. Why should they know as much as we do about the applications under test, or the culture of web applications? We have to adapt to them. If we expect them to adapt to us they will leave us for a competitor who is more flexible.

We can anticipate many of the problems they might have, by use of heuristics, inspections, prototyping and testing on wireframes. But this doesn't tell us what real users might do when they get their hands on the application. Real users surprise us.

Testing the functionality requires a high level of skill, and Michael is quite right that this is not a matter of chance. Testing the usability requires real users if it is to be effective. The decision about how many are required to give us an acceptable level of confidence in the application, at a cost we consider acceptable, becomes an important question. Probability is then relevant, and that is what Jeff Sauro was talking about in the article Michael referred to.

Am I right? Obviously I think so, but this isn't a matter of blind faith. My stance is just a working hypothesis I'm currently comfortable with. If someone wants to try and convince me that skilled, professional testers can do a good enough job impersonating real users, in all their baffling complexity, then I'd love to see the argument.


Replies to This Discussion

Jim,
I find myself agreeing both with Michael's comment about treating testing as a game of chance and with your comment that testers are not the same as users.
Where I have a problem is that we seem to forget where the numbers come from. Take this summary from the link:

"The five user number comes from the number of users you would need to detect approximately 85% of the problems in an interface, given that the probability a user would encounter a problem is about 31%. Most people either leave off the last part or are not sure what it means. This does not apply to all testing situations such as comparing two products or when trying to get a precise measure of task times or completion rates but to discovering problems with an interface. Where does 31% come from? It was found as an average problem frequency from several studies (more on this below)."

First of all, the 31% comes from an average problem frequency. It's not the same for each problem. Furthermore, we're not talking about random chance like tossing a coin. That's looking at it from a purely academic or mathematical perspective. We're talking about discovering fixed problems in the application. Fixed because they're in the code, in the configuration, etc. There's nothing random in it.
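
For reference, the arithmetic behind the quoted figures can be sketched as follows. This is a minimal illustration, assuming the simple model in which each user independently encounters a given problem with the same probability p. The function name is made up for this example and is not taken from the linked article.

```python
def chance_problem_seen(p: float, n_users: int) -> float:
    """Probability that at least one of n_users encounters a problem,
    assuming each user hits it independently with probability p."""
    return 1 - (1 - p) ** n_users


# Using the 31% average problem frequency quoted from the article:
for n in range(1, 7):
    print(n, round(chance_problem_seen(0.31, n), 3))
```

With p = 0.31 and five users the result is a little over 84%, which is where the "approximately 85%" comes from. Change p and the answer changes with it, which is why the "given that" part of the quote matters.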

The other important thing is that yes, we might discover most of the problems that users would discover this way. Are these the most important ones, though? If, for example, you have a small group of high-profile customers whose problems are in the 15% that you missed, the whole theory about the 5 users you need gets blown out of the water.

Since the users are likely to be called in during a UAT, alpha or beta phase, some people would call it testing. How are these 5 users selected? If it's a public web site, do you pick people at random out of the telephone book? That's not actually random: what about the people who are not in there?

That reminds me of the difficulty in clinical studies of getting a representative sample of animals for testing. Say you test a new drug and its impact on reaction time. You have a pool of 20 rats and you pick out 5 rats for each of 4 sets of experiments. If you pick out the rats by just grabbing the nearest, chances are you pick out the ones with generally slow reactions first and the quick and quirky ones last, which completely screws up your results for the whole study.
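
The rat example can be made concrete with a toy simulation. It is only a sketch: the speed scores are invented, and "grabbing the nearest" is modelled, as an assumption, by handing the animals out strictly in order of speed.

```python
import random


def group_means(animals, group_size=5):
    """Split the animals into consecutive groups and return each group's mean."""
    groups = [animals[i:i + group_size] for i in range(0, len(animals), group_size)]
    return [sum(group) / len(group) for group in groups]


# Hypothetical speed scores for 20 rats: 0 is the slowest, 19 the quickest.
pool = list(range(20))

# Assumption: grabbing the nearest rat catches the slow ones first, so the
# animals are effectively handed out in order of speed.
grabbed_in_order = sorted(pool)

# Alternative: shuffle the pool before dealing the rats into groups.
shuffled = pool[:]
random.Random(42).shuffle(shuffled)

print(group_means(grabbed_in_order))  # group means climb steadily
print(group_means(shuffled))          # group means depend on the shuffle
```

Dealing the animals out in order of speed gives the widest possible gap between the slowest and quickest group, so any difference between the experiments is confounded with how the rats were picked. The same concern applies to how five users are chosen.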

My point here is that the selection of users has an impact. And that with 5 users you won't get coverage over all the ideas that a large number of users can dream up. I agree that testers can't forget what they know. But picking just 5 users is not going to make a big difference in my view.

I'm happy to hear alternative views as well.

Thanks for the thoughtful reply, Thomas.

"First of all, the 31% come from an average problem frequency..."

I think your concern is largely answered in the "why the controversy" section at the bottom of Jeff Sauro's article. "Five Users" is a common rule of thumb in usability testing. Sauro is just explaining the rationale behind it, and he makes the point that 85% confidence is based on a 31% probability that the problem will affect a particular user.

Five Users is an idea that has been popularised by Jakob Nielsen. Yes, it is simplistic, but it is a good rule of thumb rather than a precise statistical prediction. Crucially, it is based on the assumption that it will be used in formative usability testing, rather than summative testing.

Usability testing that takes place in UAT is summative testing. It tells you what the product is like, and at that late stage the great danger is that the defects will wrongly be dismissed as cosmetic because the pressure to implement will be huge. In my opinion, usability testing at that stage is of limited value, often verging on useless. If usability testing is to be genuinely effective it has to shape the product (formative testing) through a series of iterations.

Nielsen's Five Users assumes that in each iteration of the design you test with a different set of five, with the result that by the end of the final iteration you can have reasonable confidence that you've flushed out most of, and the worst of, the problems.
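
The iteration argument can be put into rough numbers. This is only a sketch, assuming that each round of five users misses the same fraction of whatever problems remain and that fixes introduce no new problems. Neither assumption comes from Nielsen or Sauro, and the function name is invented for the example.

```python
def share_remaining(p: float, n_users: int, iterations: int) -> float:
    """Share of the original problems still undiscovered after several
    rounds, assuming every round misses the same fraction (1 - p) ** n_users
    of whatever is left and that fixes introduce no new problems."""
    missed_per_round = (1 - p) ** n_users
    return missed_per_round ** iterations


# Cumulative share of problems found after one, two and three rounds.
for rounds in (1, 2, 3):
    print(rounds, round(1 - share_remaining(0.31, 5, rounds), 4))
```

Under these assumptions three rounds of five users leave well under 1% of the original problems unseen. In practice the problems that survive a round are likely to be the harder ones, so the real figure would be less flattering.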

People do tend to ignore the fact that the Five Users rule is consistent with 85% confidence and a 31% probability that a defect will occur with a given user. However, you can't know what the real probability is, which is why Nielsen's rule can't be a precise prediction. You're right, the probability is not random, but I'm not sure that that's relevant. You don't even have to know what it really is. Applying the Five Users rule through a series of iterations can give you reasonable confidence at reasonable cost.

As for how the users are selected, that's an important issue with usability testing. You can select them well or badly. However, I don't think that affects the principle that it is better to do usability testing with real users, or close surrogates, rather than professional testers.

From my own experience, James, I would have to say that it takes a lot of effort for a tester to really take on the mindset of a user.

As you say, there is nothing to rival what a 'real' user will do to a piece of software and no matter how hard we try to predict what users will do or how they will react we will always be surprised by the ways they find to use our products.

I think we have to exercise skill in sampling our user community when we are putting a product out for public beta testing. If we are too narrow in our focus - for instance mainly getting computer literate people on board because they are the ones who have expressed an interest in joining the beta programme - we can end up with a false impression of usability because, again, our users have been at the 'technical end' of the spectrum. We have to make sure we are realistic in our expectations of public testers in this regard.

If we have built up a team of beta testers from across the whole spectrum of user knowledge and experience (very difficult to do) then we might be confident that we have all the bases covered. However, this takes a lot of skill in choosing the sample and I would not like to undertake this with a small number of users. It also becomes selective rather than random, which probably defeats the purpose of the exercise!

Thanks Stephen. I agree with most of that. I'd hesitate to say the non-random selection of users defeats the purpose of the exercise. That's going a bit far.

I think I answered your point about "a small number of users" in my reply to Thomas. If you're doing just a single pass of usability testing, especially at the end of the project, then a small number of users is a problem. In order to get the real benefits of usability testing, and to apply the Five Users rule wisely, you have to do iterative, formative testing earlier.

Hi James - thanks for your reply. My final sentence has come out rather OTT, hasn't it? I phrased it very badly - I'll use the 'Friday afternoon excuse'. What I was trying to say is that we have to be careful about how we select our sample of users.

I have been asked to participate in schemes as a result of my address appearing on a random sample of addresses taken from across the UK. This does not guarantee my experience or my ability to do what is being asked of me, and I would be happy for this to be classed as a random selection.

If, however, the testers had sent me a questionnaire or interviewed me to see which user profile I could be slotted into, repeating the exercise with other people until the allotted number of people had been selected for each profile, I would be more inclined to say that I had been selected for the test.

If you say in your test strategy that you are going to choose a sample of 5 people at random and then go out and select people based on their experience or user profile, I think you have to bear in mind that the results you get are going to be skewed because you have not been truly random in your selection technique.

Regards,


Stephen

Hi, James...

Thanks for continuing the conversation.

I certainly agree that it's a wonderful thing to involve the users in testing. I also agree that testers run the risk of inuring themselves to certain kinds of usability problems (although with the application of usability heuristics they can control this problem to some degree) but that's not at the centre of my objection to the original post.

My objection to the original post is with the application of the ludic fallacy (follow the link) to testing. It's bad enough to reduce testing to quantitative models; it's worse to reduce testing to bogus quantitative models.

The trouble with the quants is that the problems aren't hard to understand, we have a world-changing example from which they could learn, and yet they still don't get it. Let me try to explain.

The first issue is that a problem is not a thing. A problem is neither an object nor an event. A problem is a relationship between some person and some object or some event. A change in perception can change something from a problem to not-a-problem, or vice versa. So it's silly to count problems--and counting problems is the basis of all the measurements and statistics and projections that follow.

The second problem is with Nielsen's assertion that with one test, you know a third of all there is to know about the usability of the design. How, pray tell, does one quantify "all there is to know"? And how does Nielsen quantify a test? How long does a test take? What skills does it require? What coverage does it obtain? Problem 2A: how does one quantify a third of that stuff?

The third issue is that the model assumes that all usability problems are equally hard or equally easy to find.

The fourth problem is that the model assumes that all users engaged in usability testing are equivalent and fungible.

The fifth problem is that the model assumes that all products are equivalent in terms of the probability that they will manifest a problem, yet surely there are dozens of factors in the project context that would affect this assumption.

The fifth problem is that the model ignores the significance or impact of a usability problem. That's crucially important, since a usability problem could result in a minor inconvenience or a repeated pattern of killing people. It's silly to talk about the probability of a real-life outcome while ignoring the consequences--the cost or the value--of that outcome.

The sixth problem is that the model ignores the fact that the more rare the problem, the less you can know about it (because your sample size of rare problems is, by definition, small). So problems that are harder to find, take longer to find, and take more skill to find are problems that we know less about than other kinds of problems.

The seventh problem is that the data on usability comes from 1993—before there was an Internet, a Web, or graphical user interfaces, or even a desktop computer from the perspective of most users. Since this study was done, people who were newborns then are graduating from high school now—most with immense experience and sophistication in dealing with certain computer interfaces and not with others.

The eighth problem is that the study and the article treat testing and finding problems as though they were games of chance. In a test, the goal is not to simulate users (or use real ones) and hope that they'll bump into a problem by way of simple odds; it's to find important problems that matter. What's the point of usability testing, or testing of any other kind? Is it to find problems, or to play dice? I note that 85% is pretty close to the odds of hitting one of five empty chambers in a game of Russian roulette, in which, as Feynman says, "the fact that the last shot got off safely is of little comfort for the next."

The ninth problem is that the author of the blog post uses statistics for conclusions about what's right (five), rather than for suggestions on where to look. Nielsen doesn't make this mistake so egregiously.

The last problem is that no matter what I've said here, someone will come along and say "Well... sure; I guess there are all these problems in the model, but we did some usability testing and we found some bugs before release and only a few after release, so I feel good about the model, it's probably okay, and therefore there's no reason to question its validity." This is the only prediction I make about the future that I consider robust.

It's this kind of quantitative modelling that leads to Microsoft Vista and to global economic collapse. And some of the quants still haven't learned anything from either one, so it seems.

---Michael B.

Thank you for your detailed response, Michael. It has given me a lot to think about, and I think I understand your thinking more clearly than was possible from a 140-character tweet. When I looked at your detailed objections I found a great deal to agree with, but I still couldn’t agree with your conclusion. I’ve found the exchange helpful, and even if we haven’t reached agreement, I think I’m better aware of the dangers of blind over-reliance on probability.

"The first issue is that a problem is not a thing. A problem is neither an object nor an event. A problem is a relationship …"

Yes, I agree about a problem being a relationship, and a matter of perception. That’s why we need to use real, or realistic, users for usability testing. I’d also agree that counting them is of dubious value, and is plain wrong if you try to sell it as a significant arithmetical calculation. Counting them is indicative, not precise, and one is dealing with averages.

"The second problem is with Nielsen's assertion that with one test, you know a third of all there is to know about the usability of the design. …"

True. He should maybe have portrayed it as a third of all the problems that you’re going to find with reasonable effort. Of course, that would still have been open to your objection, but it would have been a more defensible stance.

"The third issue is that the model assumes that all usability problems are equally hard or equally easy to find."

I’m not sure it assumes that. It assumes an average probability, not that they are all the same.
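
The difference between an average probability and a uniform one can be shown with a small calculation. It is a sketch only: the individual problem frequencies below are invented, chosen so that they average 31%, and each user is assumed to hit each problem independently.

```python
def expected_share_found(frequencies, n_users):
    """Expected share of problems seen at least once by n_users, when
    problem i affects each user independently with probability frequencies[i]."""
    return sum(1 - (1 - p) ** n_users for p in frequencies) / len(frequencies)


# Hypothetical problem frequencies, chosen only so that they average 31%.
varied = [0.05, 0.10, 0.31, 0.50, 0.59]
uniform = [0.31] * 5

print(round(expected_share_found(uniform, 5), 3))  # every problem at 31%
print(round(expected_share_found(varied, 5), 3))   # same average, mixed difficulty
```

With these made-up figures five users would be expected to see roughly 69% of the problems rather than roughly 84%. Plugging the average into the formula gives the most optimistic answer for that average; any spread in difficulty pulls the expected share down.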

"The fourth problem is that the model assumes that all users engaged in usability testing are equivalent and fungible."

Again, I’m not convinced. It is based on averages, but that doesn’t relieve the testers of the obligation to choose users carefully to ensure that they are representative and relevant to the application.

"The fifth problem is that the model assumes that all products are equivalent in terms of the probability that they will manifest a problem, yet surely there are dozens of factors in the project context that would affect this assumption."

I agree, but the testers and developers should tackle all those other factors that are under their control. Usability testing with real users applies to those subjective, perceptual, problems that we haven’t anticipated. The better we are at applying UX principles and heuristics then the fewer will be the problems detected in usability testing. These remaining problems should also be less serious and less likely to get picked up by an individual user. That in turn would imply a higher number than five users per iteration if we are to clear up most of the outstanding usability problems in three iterations. On the other hand, the better we are at producing a good initial prototype for the first iteration, the fewer will be the final number of defects that go live, even if we are using as few as five users per iteration. What counts is the effectiveness of the whole process, and not just the usability testing. Maybe that’s what you’re arguing when you say that it’s skill that matters, not probability.
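
The claim that less visible problems imply more users per iteration follows directly from the same arithmetic. A rough sketch, assuming independent users and a single per-user problem frequency p; the frequencies tried here are hypothetical and the function name is invented.

```python
import math


def users_needed(p: float, target: float) -> int:
    """Smallest number of users for which 1 - (1 - p) ** n reaches target,
    under the same independence assumption as the Five Users arithmetic."""
    return math.ceil(math.log(1 - target) / math.log(1 - p))


# Hypothetical per-user problem frequencies, from easy-to-hit to subtle.
for p in (0.31, 0.20, 0.10, 0.05):
    print(p, users_needed(p, 0.85))
```

Strictly, five users at p = 0.31 reach about 84.4%, so a hard 85% threshold asks for a sixth. The requirement climbs quickly as p falls: at 10% it is already 19 users per iteration.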

"The fifth problem is that the model ignores the significance or impact of a usability problem. That's crucially important, …"

Yes, it’s a similar point to your first one. Adding oranges and grapes to get a total of pieces of fruit isn’t very useful. However, the model should be treated as a rule of thumb. If anyone applies it blindly without considering the problems and risks of the particular application then that would be dumb and unprofessional. In the case of a safety critical application it would probably be criminally negligent. Different criteria must apply to an air traffic control system, or a nuclear reactor control system, than some routine back-office administrative application.

"The sixth problem is that the model ignores the fact that the more rare the problem, the less you can know about it. …"

True, but I think this objection applies more forcibly to functional defects than usability defects. A usability defect that shows up only rarely is probably not a great concern for most applications. Such defects do take more effort to find, but the effort of doing so is probably not justified by the cost and risk. That obviously doesn’t apply to safety critical applications.

"The seventh problem is that the data on usability comes from 1993…"

Yes, that is the date of the original study cited. I’m not sure that that invalidates the model. Over time, as users change, the likelihood of defects being detected changes, but it’s not a simple relationship. More sophisticated users won’t experience the basic problems that can bring less experienced users to a dead stop, and they will move on to find deeper problems. That doesn’t mean that the initial basic problems that the sophisticated user skated past don’t exist. It all adds to the complexity of the decision about exactly which users to use and how many of them.

"The eighth problem is that the study and the article treat testing and finding problems as though they were games of chance. …"

I don’t find this criticism persuasive. As I said, the developers and testers go as far as they can to anticipate problems, and the usability testing attempts to find those problems they couldn’t see. When you’re dealing with what you can’t know then it is useful to look at probability. Anyway, it’s not a question of doing it once, with five users, to reach roughly 85% confidence. It’s about doing it with maybe five at a time, through a series of iterations, each exposing about 85% of the outstanding defects.

"The ninth problem is that the author of the blog post uses statistics for conclusions about what's right (five), rather than for suggestions on where to look…"

True, but I wouldn’t call it a mistake. The author is looking only at part of the problem. I don’t think it follows that he considers that his article covers the whole of the problem. He says that five users per iteration “may be all you need”. The title of the article is the more definite “Why you only need to test with five users (explained)”, but that title, and the dubious certainty comes from Nielsen.

"The last problem is that no matter what I've said here, someone will come along and say 'Well... sure; I guess there are all these problems in the model, but we did some usability testing and we found some bugs before release and only a few after release, so I feel good about the model, it's probably okay, and therefore there's no reason to question its validity.' This is the only prediction I make about the future that I consider robust."

I’m not quite sure what you’re saying here, but it looks close to my view, albeit slightly caricatured. Our skill and knowledge are crucial, but they take us only so far. When it comes to testing with real users we are stepping into the unpredictable, the unknown. We cannot tackle that problem using our skill (as I inferred from your tweet), except insofar as we select the users. I assume you’d argue that it would be more honest, and certainly more intellectually credible, to admit the limits of our confidence, and to admit that we don’t know, and can’t know the unknowable. I’d agree, but in practice we need to offer project managers and stakeholders more than “I don’t know and I can’t honestly say that I could know” about how much testing with real users should be done to produce an acceptable level of confidence in the product at reasonable cost.

I don’t think that referring to the Five Users model is introducing statistical snake oil, or bogus certainty, provided one is honest about the model and its limitations. The model deals with averages, and needs to be critically analysed and refined to ensure one is choosing an appropriate number of users, appropriate individuals and an appropriate number of iterations for the application. I believe, however, that the model is a useful rule of thumb and a good starting point for these refinements, and is a lot better than the alternative of honest uncertainty. Without that starting point there is a danger that one will just be flailing around with numbers that seem entirely arbitrary to management.

It’s hard enough to get projects to take usability testing seriously. I fear that a purist approach could be counter-productive and make it harder to justify testing with real, or realistic, users. Maybe, in this respect, there is a trade off between honesty and results. One doesn’t want to mislead, but one doesn’t want to be rendered ineffective by a naïve excess of honesty.

Finally, I don’t think relying on probability is an alternative to reliance on skill. That would be a false choice. Skill takes us so far. Looking at probability, admittedly rough and imprecise, helps us assess our options when we have to hand testing over to users who are by definition unskilled and unpredictable.

"I’d also agree that counting them is of dubious value, and is plain wrong if you try to sell it as a significant arithmetical calculation."

Look at this post (http://www.measuringusability.com/five-users.php, the one that I originally complained about) and observe how the author gleefully does exactly that. Note also that the comments are in a font so small as to be unreadable (in both Firefox and Internet Explorer), and that the lowest value you can assign to the article is one star (instead of 0)--points which, as of this writing, he seems to have ignored. My advice to him: put down the calculator and open your eyes.

Oh... and there are two more objections. The first is that the guy doesn't regard (or at least doesn't discuss) time as a factor in his model. Could it be that one user for five units of time might find more usability issues than five users for one unit of time? The second, related, problem is that "usability" is not only about ease of early use; it's also potentially about ease of long-term use. The distinction is important; something that's initially easy to use might not be so efficient later on; something that's initially hard to use might be much more efficient. (Consider the difference between a tricycle and a bicycle.)

"When it comes to testing with real users we are stepping into the unpredictable, the unknown. We cannot tackle that problem using our skill (as I inferred from your tweet), except insofar as we select the users."

I think you might be suggesting here—perhaps inadvertently—that tester skill is not a factor in detecting usability problems, or that the only way we can find usability problems is by having users naive in both the application and usability seek them. I hope that's not what you're suggesting, but it can sure be read that way.

"I’d agree, but in practice we need to offer project managers and stakeholders more than “I don’t know and I can’t honestly say that I could know” about how much testing with real users should be done to produce an acceptable level of confidence in the product at reasonable cost."

Here's an example of the binary fallacy at work: either accept this guy's bogus and shallow answer produced by unsupportable quantification, or cut off the conversation at “I don’t know and I can’t honestly say that I could know”. We can absolutely offer more than that—or at least, I can, and the people that I respect can. In fact, I'm sure you can also produce an answer that's better than saying "FIVE!" How about it?

"that the model is a useful rule of thumb and a good starting point for these refinements, and is a lot better than the alternative of honest uncertainty."


Are you saying that this heuristic--this "useful rule of thumb"--is something other than honest uncertainty? What is it?

---Michael B.

Michael – I think we might be in danger of violently agreeing with each other! I’m not sure there’s a huge difference in our views about what we would do in practice. We agree that the users should be involved, and I’d be astonished if you advocated the testers doing anything that I disagreed with.

I most certainly didn’t mean that the tester skill isn’t a factor, or that only users can find usability problems. That would be patent nonsense. There’s plenty they can and should do, though often in the early formative stages on traditional projects they’re wasting time on massive test plans rather than doing useful early testing. That was what I was referring to earlier when I mentioned applying usability heuristics, inspections, prototyping and testing on wireframes.

I’m also certainly not saying that test managers should abandon their own judgement, skills and experience in the selection of the number of users and the number of iterations of testing with users. I’m also not saying that they can forget about analysing the application under test, and the context in which it will be used. It’s vital that test managers do that, or they’re not doing their job. I’m pretty sure that Jeff Sauro’s not saying that either.

Where you and I differ is over the Five Users rule of thumb. You clearly believe that it cannot be used as a starting point for the planning I referred to in the previous paragraph. I think it can be a useful starting point; a start, not a final conclusion. I’m not convinced that in practice our final proposal would be radically different. If there was a significant difference then I suspect that it would be more to do with variances in our levels of experience and competence, rather than whether we’d referred to the Five Users. I think it would help me to do a better job if I’m aware of that model, rather than just saying “I didn’t know what the starting point should be”. That was all I meant by “honest uncertainty”.

I must admit I winced when I looked back at that mention of “honest uncertainty”. I think I was right, but in a rather narrow sense. It was an unfortunate phrase to use. In general, there is far too little honest uncertainty in software development projects, and far too much false certainty that is verging on the dishonest, or that is at least indifferent to honesty.

James

"Where you and I differ is over the Five Users rule of thumb. You clearly believe that it cannot be used as a starting point for the planning I referred to in the previous paragraph. I think it can be a useful starting point; a start, not a final conclusion."

No. I do believe that it can be used as a starting point. I also believe that three users could be used as a starting point, or six. Where we disagree (perhaps) is on whether the choice of five can be justified mathematically in a way that has construct validity. Honest uncertainty is a fine thing. What is unjustifiable is the line of reasoning in the original post (http://www.measuringusability.com/five-users.php) and in Nielsen's original post, too. The mathematics, to me, represent dishonest certainty.

---Michael B.

Hi James,

In proper Irish fashion I have to say I do and I don't agree with both you and Michael.

I actually quite like Jeff Sauro's page and its interactive displays of the binomial distribution. I think it would be a bit better if he had turned the logic on its head and stated the real usefulness of the binomial distribution - with n tests completed and k defects found, how confident can we be that the software is releasable? I think this also gets around all of Michael's "problems".

I thought my job as a skilled professional tester was to impersonate a real user? I take my lead from the manufacturing / production sector where sampling techniques are used to predict the probability of the user encountering a defective product (the eponymous Student t test to predict the quality of stout coming from Guinness' Brewery in Dublin for example).

Interesting post.

- Rob

Thanks Rob.

"I think it would be a bit better if he had turned the logic on its head and stated the real usefulness of the binomial distribution - with n tests completed and k defects found, how confident can we be that the software is releasable?"

I'll leave it to someone else to pass detailed comment on that! Personally, I'm very sceptical. I suspect that would be pushing the model way too far. Even if the statistical theory held up I think that the objections raised by Michael would have even greater strength if the model were to be used to generate a level of confidence.
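
One narrow case of that question can at least be worked through: n users try the application and none of them hits a particular problem. The sketch below assumes independent users and a fixed per-user frequency p, and asks how large p could still be. The function name and the 95% level are choices made for this example, not anything from Sauro's page.

```python
def upper_bound_on_frequency(n_users: int, confidence: float = 0.95) -> float:
    """Largest per-user problem frequency p still consistent, at the given
    one-sided confidence level, with n_users independent users all getting
    through without hitting the problem: solves (1 - p) ** n = 1 - confidence."""
    return 1 - (1 - confidence) ** (1 / n_users)


for n in (5, 10, 30, 100):
    print(n, round(upper_bound_on_frequency(n), 3))
```

With five clean sessions the bound is about 45%, so a problem affecting nearly half of all users could still plausibly have been missed. That is some way short of a statement about releasability, and it says nothing about how serious the problem would be.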

"I thought my job as a skilled professional tester was to impersonate a real user?"

Yes, indeed. As well as trying to be as skilled and expert as we can be, we should try to impersonate users. However, as I was arguing in my first post in this discussion, it's impossible for us to lose the knowledge and experience that we have and users lack. We do our best, we go as far as we can, but we need the users as well.


© 2012   Created by Rosie Sherry.
