Jakob Nielsen's Alertbox, July 19, 2004:
Summary:
Testing ever-more users in card sorting has diminishing returns, but you should still use three times more participants than you would in traditional usability tests.
One of the biggest challenges in website and intranet design is creating the information architecture: what goes where? A classic mistake is to structure the information space based on how you view the content -- which often results in different subsites for each of your company's departments or information providers.
Rather than simply mirroring your org chart, you can better enhance usability by creating an information architecture that reflects how users view the content. In each of our intranet studies, we've found that some of the biggest productivity gains occur when companies restructure their intranet to reflect employees' workflow. And in e-commerce, sales increase when products appear in the categories where users expect to find them.
All very good, but how do you find out the users' view of an information space and where they think each item should go? For researching this type of mental model, the primary method is card sorting.
In their study, Tullis and Wood first tested 168 users, generating very solid results. They then simulated the outcome of running card sorting studies with smaller user groups by analyzing random subsets of the total dataset. For example, to see what a test of twenty users would generate, they selected twenty users randomly from the total set of 168 and analyzed only that subgroup's card sorting data. By selecting many such samples, it was possible to estimate the average findings from testing different numbers of users.
The main quantitative data from a card sorting study is a set of similarity scores that measures the similarity of user ratings for various item pairs. If all users sorted two cards into the same pile, then the two items represented by the cards would have 100% similarity. If half the users placed two cards together and half placed them in separate piles, those two items would have a 50% similarity score.
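To make the similarity score concrete, here is a minimal sketch of how it could be computed. The data layout (each user's sort as a list of piles, each pile a set of card names), the function name `similarity_scores`, and the four sample users are all assumptions made for illustration; they are not taken from the study.

```python
from itertools import combinations

def similarity_scores(sorts):
    """Return the fraction of users who put each pair of cards in the same pile.

    `sorts` is a list with one entry per user; each entry is a list of piles,
    and each pile is a set of card names (an assumed layout, for illustration).
    """
    cards = sorted({card for sort in sorts for pile in sort for card in pile})
    scores = {}
    for a, b in combinations(cards, 2):
        # Count the users who placed both cards in one and the same pile.
        together = sum(
            any(a in pile and b in pile for pile in sort) for sort in sorts
        )
        scores[(a, b)] = together / len(sorts)
    return scores

# Four hypothetical users sorting three hypothetical intranet items.
sorts = [
    [{"Payroll", "Benefits"}, {"News"}],
    [{"Payroll", "Benefits"}, {"News"}],
    [{"Payroll"}, {"Benefits", "News"}],
    [{"Payroll", "News"}, {"Benefits"}],
]

scores = similarity_scores(sorts)
for pair, score in scores.items():
    print(pair, f"{score:.0%}")
```

Here two of the four users put Payroll and Benefits together, so that pair gets the 50% score described above.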
We can assess the outcome of a smaller card sorting study by asking how well its similarity scores correlate with the scores derived from testing a large user group. (A reminder: correlations run from -1 to +1. A correlation of 1 shows that the two datasets are perfectly aligned; 0 indicates no relationship; and negative correlations indicate datasets that are opposites of each other.)
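The subsampling approach can be sketched in a few lines. Everything below is an assumption for illustration: the 168 "users" are simulated with a made-up noise model, and the card count, pile count, and number of random samples are arbitrary. The code shows the shape of the analysis (draw random subsets, compute their similarity scores, correlate them with the full group's scores), not the study's actual data or results.

```python
import random
from itertools import combinations
from math import sqrt

def similarity_vector(sorts, n_cards):
    """Similarity score for every card pair: share of users who piled them together.

    Each sort is a list where position i holds the pile label of card i.
    """
    pairs = combinations(range(n_cards), 2)
    return [sum(s[a] == s[b] for s in sorts) / len(sorts) for a, b in pairs]

def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def simulate_user(rng, n_cards, n_piles, noise):
    """Hypothetical user: usually picks a card's 'true' pile, sometimes a random one."""
    return [
        rng.randrange(n_piles) if rng.random() < noise else card % n_piles
        for card in range(n_cards)
    ]

def mean_correlation(all_sorts, n_cards, sample_size, n_samples, rng):
    """Average correlation between random subsets' scores and the full group's scores."""
    full = similarity_vector(all_sorts, n_cards)
    total = 0.0
    for _ in range(n_samples):
        subset = rng.sample(all_sorts, sample_size)
        total += pearson(similarity_vector(subset, n_cards), full)
    return total / n_samples

rng = random.Random(0)   # fixed seed so the run is repeatable
N_CARDS, N_PILES = 20, 4  # arbitrary sizes for the simulated sort
all_sorts = [simulate_user(rng, N_CARDS, N_PILES, noise=0.4) for _ in range(168)]

results = {n: mean_correlation(all_sorts, N_CARDS, n, 50, rng) for n in (5, 15, 30, 60)}
for n, r in results.items():
    print(f"{n:3d} users: mean correlation {r:.2f}")
```

The exact numbers depend entirely on the simulated noise, so they should not be read as a replication of the study. What the sketch does illustrate is the method: the correlation with the full dataset rises as the subset grows.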
You must test fifteen users to reach a correlation of 0.90, which is a more comfortable place to stop. After fifteen users, diminishing returns set in and correlations increase very little: testing thirty people gives a correlation of 0.95 -- certainly better, but usually not worth twice the money. There are hardly any improvements from going beyond thirty users: you have to test sixty people to reach 0.98, and doing so is definitely wasteful.
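The diminishing returns are easy to see with a little arithmetic on the figures above. The sketch below only divides each reported gain in correlation by the number of extra users needed to get it; treating cost as proportional to the number of users is a simplifying assumption.

```python
# (users tested, correlation with the full dataset), as quoted in the text above.
reported = [(15, 0.90), (30, 0.95), (60, 0.98)]

gains = []
for (n0, r0), (n1, r1) in zip(reported, reported[1:]):
    # Correlation gained per additional user tested in this step.
    per_user = (r1 - r0) / (n1 - n0)
    gains.append((n0, n1, per_user))
    print(f"{n0} -> {n1} users: +{per_user:.4f} correlation per extra user")
```

Going from fifteen to thirty users buys roughly three times as much correlation per added participant as going from thirty to sixty.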
Tullis and Wood recommend testing twenty to thirty users for card sorting. Based on their data, my recommendation is to test fifteen users.
Why do I recommend testing fewer users? I think that correlations of 0.90 (for fifteen users) or maybe 0.93 (for twenty) are good enough for most practical purposes. I can certainly see testing thirty people and reaching 0.95 if you have a big, well-funded project with a lot of money at stake (say, an intranet for 100,000 employees or an e-commerce site with half a billion dollars in revenues). But most projects have very limited resources for user research; the remaining fifteen users are better "spent" on three qualitative usability tests of different design iterations.
Also, I don't recommend designing an information architecture based purely on a card sort's numeric similarity scores. When deciding specifics of what goes where, you should rely just as much on the qualitative insights you gain in the testing sessions. Much of the value from card sorting comes from listening to the users' comments as they sort the cards: knowing why people place certain cards together gives deeper insight into their mental models than the pure fact that they sorted cards into the same pile.
Luckily, you can combine the two methods: First, use generative studies to set the direction for your design. Second, draft up a design, preferably using paper prototyping, and run evaluation studies to refine the design. Because usability evaluations are fast and cheap, you can afford multiple rounds; they also provide quality assurance for your initial generative findings. This is why you shouldn't waste resources squeezing the last 0.02 points of correlation out of your card sorts. You'll catch any small mistakes in subsequent user testing, which will be much cheaper than doubling or tripling the size of your card sorting studies.
Even though more data would be comforting, I have confidence in the Fidelity study's conclusions because they match my own observations from numerous card sorting studies over many years. I've always said that it was necessary to test more users for card sorting than for traditional usability studies. And I've usually recommended about fifteen users, though we've also had good results with as few as twelve when budgets were tight or users were particularly hard to recruit.
There are myriad ways in which quantitative studies can go wrong and mislead you. Thus, if you see a single quantitative study that contradicts all that's known from qualitative studies, it's prudent to disregard the new study and assume that it's likely to be bogus. But when a quantitative study confirms what's already known, it's likely to be correct, and you can use the new numbers as decent estimates, even if they're based on less data than you would ideally like.
Thus, the current recommendation is to test fifteen users for card sorting in most projects, and thirty users in big projects with lavish funding.
Copyright 2004 Jakob Nielsen. ISSN 1548-5552.