When can you trust recommendations from usability studies or other user research?
This question arises in two main cases:
Planning your own research. You need to prioritize your research investment to maximize its impact on your project, thereby increasing the probability that the research leads to a more profitable design.
Reading about outside research. It’s important to know how much you can rely on other people’s findings—is it safe to base your design decisions on outside research? Which research should you trust when findings conflict?
Luckily, in both situations, the answer is almost the same. You should trust the same kind of research that you should invest in—that is, research that derives its conclusions from a broadly diversified base.
The main difference in the two situations is that ROI considerations dictate a smaller investment in your own studies. The sample sizes and other costs of making really, really sure are not worth it. Thus, you should implement the research recommendations even if, say, a statistical analysis indicates a 10% probability of being wrong. Better to have the right design 90% of the time than to shoot blind and design without the benefit of research.
(Make no mistake: if you require perfection in your internal research, you’ll have no research. Nobody has the budget to conduct exhaustive research on every small design decision; the only realistic choice is between no research and decent—if imperfect—research. Go with the option that gives you some data, because it’s much better than guessing.)
Strength in Numbers?
The one factor most people consider when judging the strength of research evidence is the sample size, usually known as N. How many people were in the study? Another common consideration is the level of statistical significance, often known as p.
However, a big N or a small p is a horrible indicator of validity when it comes to research findings.
Yes, statistical significance can be computed precisely, but it only tells you something that’s not terribly important: how unlikely a result this extreme would be if pure chance were the only thing at work in that one experiment.
Crucially, this says nothing about whether the experiment was done right or has any predictive power for your design problem. And these two issues are essential when deciding whether to trust research findings.
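To make that concrete, here’s a minimal sketch with made-up numbers: suppose a moderator’s prompting inflates task success for a redesign by 15 percentage points. With a couple hundred participants per condition, a standard two-proportion test will call the bogus difference highly significant, because significance testing has no way of knowing that the difference was manufactured by the procedure rather than the design:

```python
import math

# Hypothetical counts: 60% vs. 75% task success, where the entire gap is assumed
# to come from the moderator prompting users in the "new design" condition.
n = 200                          # participants per condition (a big N)
old_ok, new_ok = 120, 150        # successful tasks, old vs. new design

# Standard two-sided, two-proportion z-test
p_old, p_new = old_ok / n, new_ok / n
pooled = (old_ok + new_ok) / (2 * n)
se = math.sqrt(pooled * (1 - pooled) * (2 / n))
z = (p_new - p_old) / se
p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

print(f"p = {p_value:.4f}")      # about 0.001 -- "highly significant," yet wrong
```

The tiny p-value correctly says the gap is unlikely to be random noise; it says nothing about where the gap came from.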
Statistically significant research findings are vulnerable to 3 key problems:
Study was done wrong. Almost all usability research has weaknesses. The most common one is biasing the respondents, such as by talking too much. Sometimes, the study is just poorly designed; many eyetracking studies, for example, simply show people a static screenshot and record how they look around the image. But the way people look at a standalone image is completely different from the way they look at a sequence of screens encountered during normal website navigation. Obviously, if the methodology was inappropriate or outright incorrect, it doesn’t matter that you’d be highly likely to get the same (wrong) outcome by running the same (wrong) study a second time.
Study doesn’t generalize. Almost all academic research uses university students as test participants. Unless you happen to be designing a site for students, this means that the findings could be irrelevant for your target audience. Even if a study recruited users with a profile similar to your customers’, the types of tasks and design styles were probably different from your own. Usability is highly context-dependent; what’s good for one set of users and tasks might be terrible for other people doing something else.
Study was a fluke. A study might claim a finding to be statistically significant at the level of p<.05—that is, there’s only a 5% probability that the finding was a random coincidence. That sounds pretty good, until you realize that many more than 20 usability studies are run every day around the world, and when 20 studies test something that isn’t actually true, on average one of them will come up "significant" purely by chance (see the quick calculation below). Because of publication bias, you’ll hear only about the one study with the freak outcome; the 19 (or more) other studies that arrive at the truth are too boring to be published because they merely confirm what we already know. (Usability findings don’t change much over time.)
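A quick back-of-the-envelope calculation shows how easily such flukes arise. Assuming 20 independent studies each test a hypothesis that is actually false, the chance that at least one of them crosses the p<.05 bar purely by chance is roughly 64%:

```python
alpha = 0.05                     # false-positive rate implied by the p < .05 threshold
studies = 20                     # independent studies of the same (false) hypothesis

p_at_least_one_fluke = 1 - (1 - alpha) ** studies
print(f"{p_at_least_one_fluke:.0%}")   # -> 64%
```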
Rather than a big N employed narrowly in a single study, it’s better to trust in research that covers a diverse set of circumstances. If findings are derived from a broad base, they’re more likely to generalize and apply to your specific situation—and not just to the study stimuli.
Usability research should be diversified in several ways:
Users: test with consumers, business professionals, executives, geeks, doctors, children, teenagers, students, senior citizens, and many other groups.
Skills: test experienced users, high-IQ elites, people who know little about computers, low-literacy users, etc., and include users with disabilities.
Tasks: shopping, health research, checking news and investments—the list of tasks people do online goes on and on. If you want to know, say, how people search, don’t just ask them to search for one thing.
Companies or sites tested: big sites, small sites, famous brands, no-name sites.
Technology platforms: text-only UI, GUI, mobile phones, tablets, 3D.
Longitudinal research: compare results from 10–20 years ago vs. today. If they’re the same, they’ll probably hold in the future as well.
Geography: test in multiple countries.
Methodologies: triangulate findings by combining many different user research methods, including user testing, A/B testing, eyetracking, diary studies, and field research.
Finally, for internal use, you should also conduct research across the project lifecycle stages—that is, before you’ve done any design, with early prototypes, through iterative refinement, and after the product launches.
By conducting a wide variety of research, you ultimately achieve the goal of having tested a large number of users. For example, Nielsen Norman Group has tested 2,048 users in one-on-one usability sessions. More important, we tested these people with 1,524 websites and intranets in 14 different countries across North America, Europe, Asia, Australia, and the Middle East. Given this broad diversity, our findings are very likely to hold in circumstances beyond those in our testing.
When allocating your own research investment: spread the budget across multiple, smaller studies. When reading outside research: place greater trust in sources that conduct diverse research than in sources that have a large number of users do a single thing that probably doesn’t generalize to your circumstances.
The Economist published a nice analysis of the high probability that wrong quantitative results end up in academic journals. In one example, cancer researchers tried to replicate 53 published studies but could confirm the findings of only 6. In another, pharmaceutical researchers got the same result only a quarter of the time when repeating 67 so-called "seminal" studies.
The article has a helpful visualization of the statistical outcomes of 1,000 research studies under fairly reasonable assumptions: 125 of the studies would be published, comprising 80 correct results and 45 wrong results. The remaining 875 studies would have a much higher accuracy rate of 97%, but because they didn’t find anything interesting, they would not be published. Because of this publication bias, only 64% of published results would be true in this example, even though the research protocols produced good accuracy across all studies, published and unpublished alike.
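The arithmetic behind those figures is straightforward to reproduce. The inputs below (a 10% base rate of true hypotheses, 80% statistical power, and a 5% false-positive threshold) are assumptions inferred here because they yield exactly the numbers quoted above:

```python
total = 1000
true_hypotheses = 100                      # assumed 10% base rate of real effects
power, alpha = 0.80, 0.05                  # assumed power and significance threshold

true_positives = power * true_hypotheses                 # 80 correct "interesting" results
false_positives = alpha * (total - true_hypotheses)      # 45 wrong "interesting" results
published = true_positives + false_positives             # 125 studies get published

false_negatives = true_hypotheses - true_positives       # 20 real effects missed
true_negatives = (total - true_hypotheses) - false_positives
unpublished = false_negatives + true_negatives           # 875 "boring" studies

print(f"published results that are true: {true_positives / published:.0%}")    # 64%
print(f"accuracy of unpublished studies: {true_negatives / unpublished:.1%}")  # ~97%
```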
Given the low accuracy of published research, I am extra proud of Nielsen Norman Group’s track record over the last 15 years: the vast majority of our findings have subsequently been confirmed by other usability specialists running independent studies. We have a better chance of being proven right by subsequent studies, not because we’re more honorable than other researchers, but because we rely on qualitative insights before we publish a guideline. Other researchers’ errors are rarely caused by outright falsified data; mistakes are far more often due to statistics than to fraud. So two researchers can be equally honest, but the one who uses our method will be right more often.