Jakob Nielsen's Alertbox, June 26, 2006:
Summary:
When collecting usability metrics, testing 20 users typically offers a reasonably tight confidence interval.
We can define usability in terms of quality metrics, such as learning time, efficiency of use, memorability, user errors, and subjective satisfaction. Sadly, few projects collect such metrics because doing so is expensive: it requires four times as many users as simple user testing.
Many users are required because of the substantial individual differences in user performance. When you measure people, you'll always get some who are really fast and some who are really slow. Given this, you need to average these measures across a fairly large number of observations to smooth over the variability.
I analyzed 1,520 measures of user time-on-task performance for 70 different tasks from a broad spectrum of websites and intranets. Across these many studies, the standard deviation was 52% of the mean values. For example, if it took an average of 10 minutes to complete a certain task, then the standard deviation for that metric would be 5.2 minutes.
For most statistical analyses, however, you should eliminate the outliers. Because they occur randomly, you might have more outliers in one study than in another, and these few extreme values can seriously skew your averages and other conclusions.
The only reason to compute statistics is to compare them with other statistics. That my hypothetical task took an average of ten minutes means little on its own. Is ten minutes good or bad? You can't tell from putting that one number on a slide and admiring it in splendid isolation.
If you asked users to subscribe to an email newsletter, a ten-minute average task time would be extremely bad. We know from studies of many newsletter subscription processes that the average task time across other websites is four minutes, and users are only really satisfied if it takes less than two minutes. On the other hand, ten minutes would indicate very high usability for more complex tasks, such as applying for a mortgage.
The point is that you collect usability metrics to compare them with other usability metrics, such as comparing your site with your competitors' sites or your new design with your old.
When you eliminate outliers from both statistics, you still have a valid comparison. True, average task time in both cases will be a bit higher if you keep the outliers. But, without the outliers, you're more likely to reach correct conclusions, because you're less likely to overestimate an average that happened to have a greater number of outliers.
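As a minimal sketch of that comparison, the Python below drops slow outliers from two sets of task times before averaging them. The article does not say how an outlier is identified, so the cutoff used here (Tukey's upper fence, Q3 + 1.5 x IQR) is a swapped-in assumption. The function name and the task times are hypothetical, illustrative values, not data from the studies.

```python
import statistics

def drop_slow_outliers(times):
    """Return the times at or below Tukey's upper fence (Q3 + 1.5 * IQR).

    Assumption: the article does not define its outlier rule; this fence
    is a common stand-in, applied to the slow side only.
    """
    q1, _, q3 = statistics.quantiles(times, n=4)
    fence = q3 + 1.5 * (q3 - q1)
    return [t for t in times if t <= fence]

# Hypothetical task times in seconds for two designs (illustrative only).
old_design = [200, 220, 240, 250, 260, 270, 280, 300, 310, 1200]
new_design = [150, 160, 170, 180, 190, 200, 210, 220, 230, 240]

for name, times in (("old", old_design), ("new", new_design)):
    kept = drop_slow_outliers(times)
    print(name,
          "raw mean:", round(statistics.mean(times)),
          "trimmed mean:", round(statistics.mean(kept)),
          "dropped:", len(times) - len(kept))
```

In this made-up data, a single extreme value inflates the old design's raw average. Applying the same rule to both designs keeps the comparison fair.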
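We know that for user testing of websites and intranets, the SD is 52% of the mean. The SD of an average of n observations is the SD of the individual observations divided by the square root of n. So, if we tested ten users, the SD of the average would be 16% of the mean, because 1/√10 = .316 and .316 x .52 = .16.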
Let's say we're testing a task that takes five minutes to perform. So, with ten users, the SD of the average is 16% of 300 seconds = 48 seconds. For a normal distribution, about two-thirds of the cases fall within +/- 1 SD of the mean. Thus, our average would be within 48 seconds of the five-minute mean two-thirds of the time.
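The arithmetic above can be sketched in a few lines of Python. The 52% figure and the five-minute task come from the article; the function and variable names are hypothetical.

```python
import math

RELATIVE_SD = 0.52  # SD as a fraction of the mean, from the article

def relative_sd_of_average(n_users, relative_sd=RELATIVE_SD):
    """SD of the average of n_users observations, as a fraction of the mean."""
    return relative_sd / math.sqrt(n_users)

task_seconds = 300  # the five-minute example task
fraction = relative_sd_of_average(10)
print(f"{fraction:.3f} of the mean")
print(f"{fraction * task_seconds:.0f} seconds")
```

Without intermediate rounding, the result lands about a second above the 48 seconds quoted above, which comes from rounding to 16% first.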
The following chart shows the margin of error for testing various numbers of users, assuming that you want a 90% confidence interval (blue curve). This means that 90% of the time, you hit within the interval, 5% of the time you hit too low, and 5% of the time you hit too high. For practical Web projects, you really don't need a more accurate interval than this.
The red curve shows what happens if we relax our requirements to being right half of the time. (Meaning that we'd hit too low 1/4 of the time and too high 1/4 of the time.)
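The two curves can be approximated numerically. The sketch below assumes a normal approximation (a z-score multiplied by the SD of the average); the method behind the original chart is not stated, so the numbers may differ slightly from it. The function name and the chosen user counts are hypothetical.

```python
import math
from statistics import NormalDist

RELATIVE_SD = 0.52  # SD as a fraction of the mean, from the article

def margin_of_error(n_users, confidence, relative_sd=RELATIVE_SD):
    """Half-width of a two-sided interval, as a fraction of the mean.

    Assumption: normal approximation (z-score), which may differ from
    the method behind the article's chart.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * relative_sd / math.sqrt(n_users)

print("users    90%    50%")
for n in (5, 10, 19, 40, 71):
    print(f"{n:5d}  {margin_of_error(n, 0.90):5.1%}  {margin_of_error(n, 0.50):5.1%}")
```

Under this approximation, 19 users gives a margin a little under 20% at the 90% level and about 8% at the 50% level, close to the figures the article cites.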
This is a rather wide confidence interval, which is why I usually recommend testing with 20 users when collecting quantitative usability metrics. With 20 users, you'll probably have one outlier (since 6% of users are outliers), so you'll include data from 19 users in your average. This makes your confidence interval go from 243 to 357 seconds, since the margin of error is +/- 19% for testing 19 users.
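As a quick check of those numbers, the sketch below takes the 6% outlier rate and the +/- 19% margin directly from the text and applies them to the five-minute (300-second) example task. The function names are hypothetical, and rounding the expected number of outliers to a whole user is an assumption.

```python
OUTLIER_RATE = 0.06     # share of users who are outliers, from the article
MARGIN_19_USERS = 0.19  # 90% margin of error for 19 users, from the article

def users_analyzed(users_tested, outlier_rate=OUTLIER_RATE):
    """Users left after removing the expected (rounded) number of outliers."""
    return users_tested - round(users_tested * outlier_rate)

def interval(mean, margin):
    """Lower and upper bounds for a mean with a relative margin of error."""
    return mean * (1 - margin), mean * (1 + margin)

low, high = interval(300, MARGIN_19_USERS)
print(users_analyzed(20), round(low), round(high))  # prints: 19 243 357
```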
You might say that this is still a wide confidence interval, but the truth is that it's extremely expensive to tighten it up further. To get a margin of error of +/- 10%, you need data from 71 users, so you'd have to test 76 to account for the five likely outliers.
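The cost comes from the square root: the margin of error shrinks only with the square root of the number of users. The sketch below assumes a normal approximation with a z-score of 1.645 for a 90% interval, which is not necessarily how the article's figures were derived; the function name is hypothetical.

```python
import math

RELATIVE_SD = 0.52   # SD as a fraction of the mean, from the article
OUTLIER_RATE = 0.06  # share of users who are outliers, from the article
Z_90 = 1.645         # two-sided 90% z-score (normal approximation, an assumption)

def margin_90(n_analyzed):
    """Approximate 90% margin of error, as a fraction of the mean."""
    return Z_90 * RELATIVE_SD / math.sqrt(n_analyzed)

for n in (19, 71):
    print(f"{n} users analyzed: +/- {margin_90(n):.1%}")

# Roughly halving the margin takes close to four times the data.
print(f"{71 / 19:.1f}x as many users analyzed")

# Expected outliers among 76 tested users, rounded to a whole user.
expected_outliers = round(76 * OUTLIER_RATE)
print(76 - expected_outliers, "users left after removing outliers")
```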
Testing 76 users is a complete waste of money for almost all practical development projects. You can get good-enough data on four different designs by testing each of them with 20 users, rather than blow your budget on only slightly better metrics for a single design.
In practice, a confidence interval of +/- 19% is ample for most goals. Mainly, you're going to compare two designs to see which one measures better. And the average difference between websites is 68%, much more than the margin of error.
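To illustrate why a +/- 19% margin is enough at that scale, the sketch below checks whether two intervals overlap. It assumes, for illustration only, that a 68% difference means the slower design takes 1.68 times as long, and it uses a hypothetical 300-second baseline. Checking for non-overlapping intervals is a rough, conservative test, not a formal significance test.

```python
MARGIN = 0.19      # 90% margin of error with 19 users, from the article
DIFFERENCE = 0.68  # average difference between websites, from the article

def interval(mean, margin=MARGIN):
    """Lower and upper bounds for a mean with a relative margin of error."""
    return mean * (1 - margin), mean * (1 + margin)

def intervals_overlap(a, b):
    """True if the two (low, high) intervals share any point."""
    return a[0] <= b[1] and b[0] <= a[1]

faster = interval(300)                     # hypothetical baseline design
slower = interval(300 * (1 + DIFFERENCE))  # assumed reading of the 68% gap
slightly_slower = interval(300 * 1.10)     # hypothetical 10% gap, for contrast

print(intervals_overlap(faster, slower))           # prints: False
print(intervals_overlap(faster, slightly_slower))  # prints: True
```

A gap of typical size separates cleanly, while a small gap such as 10% would not be resolved at this sample size.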
Also, remember that the +/- 19% is pretty much a worst-case scenario; you'll do better 90% of the time. The red curve shows that half of the time you'll be within +/- 8% of the mean if you test with 20 users and analyze data from 19. In other words, half the time you get great accuracy and the other half you get good accuracy. That's all you need for non-academic projects.
Luckily, you don't have to measure usability to improve it. Usually, it's enough to test with a handful of users and revise the design in the direction indicated by a qualitative analysis of their behavior. When you see several people being stumped by the same design element, you don't really need to know how much the users are being delayed. If it's hurting users, change it or get rid of it.
You can usually run a qualitative study with 5 users, so quantitative studies are about 4 times as expensive. Furthermore, it's easy to get a quantitative study wrong and end up with misleading data. When you collect numbers instead of insights, everything must be exactly right, or you might as well not do the study.
Because they're expensive and difficult to get right, I usually warn against quantitative studies. The first several usability studies you perform should be qualitative. Only after your organization has progressed in maturity with respect to integrating usability into the design lifecycle and you're routinely performing usability studies should you start including a few quant studies in the mix.
Copyright © 2006 by Jakob Nielsen. ISSN 1548-5552.