The British Computer Society's human-computer interaction specialist group sponsored a one-day seminar on usability metrics and methodologies in connection with its 1990 annual meeting in London. The seminar was organized by Nigel Bevan from the National Physical Laboratory and attracted an audience of about 80 people. The lively discussions during the seminar and the tea breaks indicated substantial interest in the issues of usability engineering.
There is obviously a trend towards taking usability methodology seriously as an object of study in itself in order to improve our ability to choose the right method for the job at hand. My own work on "discount usability engineering" is one exponent of this trend, and my presentation at the seminar was based on a study of the effects of expertise on the quality of a usability evaluation. It does turn out that people are better at finding usability problems in an interface the more they know about usability principles. This may come as no particular surprise, but this type of precise understanding of the tradeoffs in using different personnel categories for usability work is essential for practical usability engineering.
As another example of the discount approach, Ralph Rengger from the National Physical Laboratory discussed the number of test subjects needed for user testing and mentioned that the lists of usability problems encountered by the first five test subjects and the first ten subjects looked very similar. So for most practical tests, there is really no need to run a large number of subjects.
Because of this substantial potential for change, Brooke felt that the system itself must be capable of change (unless, of course, its designer is omniscient and can predict all the changes). Since designers are not omniscient, he advocated an ecological metaphor for systems development: the survival, change, and adaptation of the fittest through some kind of "macro-selection" over time.
Brooke described two models for achieving ecological change of systems:
With the DB_builder, innovations spread through cultural diffusion as users acquire each other's tools, techniques, and experiences. Brooke and his team assume that there is no single end-point to the development of a system: it will continue to grow. Incidentally, this diffusion principle seems similar to that used by the user-tailorable "buttons" project at Xerox EuroPARC, described by Allan MacLean at CHI'90.
STL is not currently selling their video analysis tool but they are "definitely interested in developing it further." I talked to others at the seminar who were very interested in buying such a tool, so there seems to be a good market opportunity in making a human factors video analysis product commercially available. Walsh and Laws claimed that the use of their new tool had reduced the time required for analyzing the video of a usability test from ten times the duration of the test to only three times the duration. Even so, a factor of three still seems like a lot of time to me.
In the discussion session, John Brooke mentioned that he had also set up a usability lab with all the works when he came to DEC but that this lab is now falling into disuse as they conduct more and more of their testing on a contextual basis in the field. Brian Shackel added that the main reason for usability labs was to demonstrate to developers that users have real problems. We will still need the usability labs for this purpose every five years or so, when the next generation of eager computerists enters industry and needs convincing.
Because of these and other issues, usability methods vary widely in software development projects, and Whitefield wanted to progress beyond the stage where the choice of methods depended solely on the experience of the human factors specialist. He advocated work towards an increased understanding of the nature of usability engineering, and he presented a new framework for evaluation methods.
First, he defined human factors evaluation as an assessment of the conformity between a system's performance and its desired performance, where
"System" = user+computer+task+environment.
"Performance" = quality of the task product in relation to the resource costs associated with producing this product.
Second, he wanted an "assessment statement" to include both a report of any measurement studies and a diagnosis identifying the reasons for any shortfalls in the interface. Such assessment statements can in principle be produced on the basis of an evaluation of the real (working) computer system or of a representation of the system, such as a mockup of screen designs or a specification. Also, the evaluation can be performed using real users or using some surrogate to represent the users. For example, a usability specialist can perform a walkthrough of the interface and try to estimate where users will have trouble. Since the two evaluation components (computer and users) can each be either real or representational independently of the other, Whitefield concluded that there are in principle four categories of evaluation methods:
| | Representational users | Real users |
|---|---|---|
| Representational computer system | Analytic methods | User reports such as surveys, questionnaires |
| Real computer system | Specialist reports, walkthroughs | Observational methods, experiments |
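The four categories can also be written as a small lookup table. The following is only an illustrative sketch: the names `EVALUATION_METHODS` and `method_category` are invented here and are not part of Whitefield's framework.

```python
# Whitefield's 2x2 classification as a lookup table.
# Key: (computer_is_real, users_are_real). The identifiers are
# hypothetical; only the four category labels come from the table above.
EVALUATION_METHODS = {
    (False, False): "Analytic methods",
    (False, True): "User reports such as surveys, questionnaires",
    (True, False): "Specialist reports, walkthroughs",
    (True, True): "Observational methods, experiments",
}


def method_category(computer_is_real: bool, users_are_real: bool) -> str:
    """Return the category of evaluation method for one cell of the table."""
    return EVALUATION_METHODS[(computer_is_real, users_are_real)]


# A specialist walking through a working system stands in for the users.
print(method_category(computer_is_real=True, users_are_real=False))
```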
The basic principle is to divide the user's work up into a task hierarchy as shown in the figure. Task 1 represents the user's entire work with the system, while tasks 2 and 3 represent the two hypothetical main components into which task 1 can be subdivided. To perform task 2, the user actually performs subtasks 4 and 5, and the task hierarchy can be further recursively subdivided until one reaches the primitive subtasks that can be directly mapped onto commands or other operations.
To specify usability targets for new computer systems or to test a design, one then identifies those subtasks that are of the highest importance for overall usability. For example, it might be the case that subtasks 4, 6, 7, and 8 are the ones first encountered by new users or that they are the ones most frequently performed by experienced users. In either case, it becomes important to obtain high usability ratings for these subtasks. Because tasks 6, 7, and 8 are constituent subtasks of task 4, one would normally only measure the usability of performing task 4 since the exact scores on tasks 6, 7, and 8 would be irrelevant. The performance on task 4 is the sum of the performance on its subtasks and the relative performance on the subtasks is therefore not important.
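The summation argument can be sketched in code. This is a minimal illustration under stated assumptions: performance is taken to be task time in seconds, all the timings are invented, and only the part of the hierarchy described in the text is modelled (task 3 is left as a leaf because its subdivision is not given here).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    number: int
    subtasks: List["Task"] = field(default_factory=list)
    seconds: float = 0.0  # hypothetical measurement, used only for leaf tasks

    def total_seconds(self) -> float:
        """Time for a task is the sum of the times of its subtasks."""
        if not self.subtasks:
            return self.seconds
        return sum(t.total_seconds() for t in self.subtasks)


# Invented timings for the primitive subtasks of task 4.
task4 = Task(4, [Task(6, seconds=12.0), Task(7, seconds=8.0), Task(8, seconds=20.0)])
task2 = Task(2, [task4, Task(5, seconds=15.0)])
task1 = Task(1, [task2, Task(3, seconds=30.0)])

print(task4.total_seconds())  # 40.0
print(task1.total_seconds())  # 85.0
```

Under this assumption, a usability target set on task 4 constrains only the total of 40 seconds; how that total is split among tasks 6, 7, and 8 does not matter.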
One could further imagine that, say, task 10 was a critical task because it formed a potential differentiator where we hoped our system would be different from, and better than, the competition. If so, task 10 would also be classified as a key task for which we would want to set specific usability targets.
We also have difficulties with the simpler types of measurement scale. It may actually be hard to achieve a nominal scale for usability problems, even though nominal scales are the weakest form of measurement scale and only require us to be capable of deciding when two usability problems are the same. Unfortunately, it is hard to know to what extent two observations are really cases of the same underlying usability problem, and we do not even have a good enumeration of usability problems with which to match observations.
Using the somewhat more ambitious ordinal scales to measure usability problems, we face the counting problem. We would like to know whether one interface is better than another (the ordering of the ordinal scale) based on counting the frequencies of various usability problems in the two systems. Assume, for example, that we have measured the frequencies of usability problems A, B, and C as follows:
| | Problem A | Problem B | Problem C |
|---|---|---|---|
| System 1 | 1 | 1 | 8 |
| System 2 | 2 | 1 | 1 |
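One way to see why these counts do not by themselves order the two systems is to weight the problems by severity. The sketch below is purely illustrative: the severity weights are invented and do not come from the seminar, and the function names are hypothetical.

```python
# Problem frequencies from the table above.
frequencies = {
    "System 1": {"A": 1, "B": 1, "C": 8},
    "System 2": {"A": 2, "B": 1, "C": 1},
}


def weighted_score(counts, weights):
    """Sum of problem frequencies, each multiplied by a severity weight."""
    return sum(weights[problem] * n for problem, n in counts.items())


def better_system(weights):
    """Return the system with the lowest weighted score, or None on a tie."""
    scores = {name: weighted_score(c, weights) for name, c in frequencies.items()}
    best = min(scores.values())
    winners = [name for name, score in scores.items() if score == best]
    return winners[0] if len(winners) == 1 else None


# Hypothetical weightings: all problems equal, or problem A ten times worse.
equal_weights = {"A": 1, "B": 1, "C": 1}
a_is_severe = {"A": 10, "B": 1, "C": 1}

print(better_system(equal_weights))  # System 2 (4 problems against 10)
print(better_system(a_is_severe))    # System 1 (19 against 22)
```

With equal weights System 2 looks better, but if problem A is assumed to be much more severe than the others the ordering reverses, so the raw frequencies alone do not settle which interface is better.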