Jakob Nielsen's Alertbox, July 10, 2006:
Summary:
The relative popularity of a site's pages, the number of visitors referred by other sites, and the traffic from search queries continue to follow a Zipf distribution.
About 10 years ago, I showed that the popularity of a website's pages followed a power law. Briefly stated, a few pages on a website were extremely popular, a larger set was moderately popular, and the vast majority constituted the "long tail" of low-traffic pages.
Mathematically, the Zipf curve is a straight line on a double-logarithmic diagram, when we plot pages with their popularity rank on the x-axis and their number of hits on the y-axis.
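The straight-line property can be written out in one step. The symbols below are illustrative assumptions, not notation from the original analysis: r is a page's popularity rank, C is a positive constant that sets the traffic of the top-ranked page, and s is a positive exponent.

```latex
\mathrm{hits}(r) = C \, r^{-s}
\quad\Longrightarrow\quad
\log \mathrm{hits}(r) = \log C - s \, \log r
```

With log r on the x-axis and log hits on the y-axis, this is a line with slope -s and intercept log C.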
The old analyses showed that this same distribution also described both the number of incoming references to a website from other sites and the outgoing traffic from a company's employees.
Do these findings continue to hold today? The Web is now 2,200 times bigger, so traffic patterns might have changed. I decided to find out.
For the most popular 350 pages, the empirical data follows the theory remarkably well. Thereafter, however, the data trails off and the next 700 pages have less traffic than predicted. Also, in theory, there should have been about a quarter-million additional pages with low traffic, but I simply haven't gotten around to writing that much.
The small insert in the upper right shows the equivalent diagram from my analysis 10 years ago. It's uncanny how closely it resembles the new data. In particular, the old curve also trailed off toward the end and was missing a vast number of low-traffic pages.
The end of the "long tail" is absent from both diagrams because the sites haven't accumulated enough old content to have the expected number of low-traffic pages. Instead, they have a drooping tail. It might take hundreds of years for a site written by a single person (or even a single marketing department) to accumulate a quarter-million pages.
For incoming referrals, the empirical data hugs the theory's red line all the way down to the x-axis. The difference here is that there's no lack of other sites that might link to useit.com, and a huge number of these sites are low-traffic blogs that only occasionally refer individual users.
The chart's one obvious exception to the theory is that the site that referred the most visitors accounted for many more visits than predicted. Google (depicted as an extra-large dot) referred 257,040 visitors; in theory, it should have referred only 52,479.
Google is five times as popular as the theory predicts, but this phenomenon could fade as other search engines catch up. Only time will tell.
Also, while Google is disproportionately important, the other 35,631 referring sites, taken together, accounted for 35% more traffic than Google did. Clearly, it's not a good idea to focus only on #1.
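The referral arithmetic can be checked with a few lines of Python. This is only a sketch: it reads "35% more traffic" as 35% more than Google's own referrals, and the variable names are made up for illustration.

```python
# Figures quoted in the text.
google_actual = 257_040      # visitors Google actually referred
google_predicted = 52_479    # visitors the Zipf line predicts for the #1 referrer

# How far above the prediction is Google?
ratio = google_actual / google_predicted

# Assumption: the other referring sites together sent 35% more visitors than Google.
other_sites = google_actual * 1.35
google_share = google_actual / (google_actual + other_sites)

print(f"Google vs. prediction: {ratio:.1f}x")
print(f"Other referring sites combined: {other_sites:,.0f} visitors")
print(f"Google's share of referred visitors: {google_share:.1%}")
```

Under that reading, Google is far above the prediction yet still sends fewer than half of all referred visitors.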
The top 10 queries accounted for 10% of the total traffic, so each one of these queries is obviously more important than those that brought only one visitor. Taken together, however, the single-use queries accounted for three times as much traffic as the top 10 queries. This statistic shows the folly of focusing search engine optimization solely on a few high-performing queries. Your site must be found when users enter relevant queries -- and the possibilities are typically vast.
The following diagram shows the distribution of search engine queries, sorted by the number of incoming users who arrived after searching for that string.
This chart shows roughly the expected distribution, but with a hump for queries #5–300. In other words, queries in the middle range are more important than the theory predicts. Sample queries in this range include "response time", "open link in new window", "teenagers", "site maps", "eye tracking", and "link color". Useit.com is not particularly focused on any of these topics, which is why they're not at the top of the list. For each query, however, I have at least one good related article.
In general, my site covers a broad range of topics in some depth, which is probably why it has this hump of mid-range queries with extra traffic.
As we proceed down the list of query popularity, the queries become longer and longer. The following diagram shows the number of query words for each of the first nineteen groups of a thousand queries. (That is, queries #1–1,000 are first, followed by queries #1,001–2,000, and so on through queries #18,001–19,000.)
Single-word queries were fairly common among the first thousand queries (i.e., those that generated the most traffic), but dropped off quite quickly. Conversely, four- and five-word queries were rare among the most popular queries, but made up a big proportion of queries starting at about #7,000.
This shows the importance of considering longer queries in your search engine marketing: Multiple-word queries are the best way to capture the vast range of user interests.
In this case, long queries in the #7,000 traffic range included "radio buttons and check boxes" (five words) and "horizontal scrollbar in html" (four words).
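The grouping described above (query length tallied per block of a thousand queries) is simple to reproduce on your own search logs. The sketch below assumes you already have a list of query strings sorted from most to least traffic; the function name and the whitespace-based word count are illustrative choices, not the method used for the original chart.

```python
from collections import Counter

def word_counts_by_group(queries, group_size=1000):
    """Tally query lengths (in words) for each consecutive group of queries.

    `queries` must already be sorted by descending traffic, so the first
    group holds the most popular queries. Returns one Counter per group,
    mapping number of words -> number of queries of that length.
    """
    groups = []
    for start in range(0, len(queries), group_size):
        chunk = queries[start:start + group_size]
        # Words are split on whitespace; punctuation is not treated specially.
        groups.append(Counter(len(q.split()) for q in chunk))
    return groups

# Tiny illustration with a group size of 2 instead of 1,000.
sample = [
    "usability",
    "eye tracking",
    "radio buttons and check boxes",
    "horizontal scrollbar in html",
]
for number, counts in enumerate(word_counts_by_group(sample, group_size=2), start=1):
    print(number, dict(counts))
```

Dividing each tally by the group size gives the proportion of one-word, two-word, and longer queries in each block.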
The two exceptions are both search related:
Knowing that a single distribution describes these many forms of Web use can help you analyze your own log files: plot your statistics on a log-log scale and see if they fall on a straight line. If yes, your site follows the theory. If no, see where you deviate: in the head, the middle, or the tail? Above or below the prediction line? Any deviations help you understand ways in which your traffic is different from the norm. These insights may also help you spot opportunities for growth.
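If you would rather compute the check than eyeball it, one possible approach is to fit a straight line to the logarithms of rank and hits and then compare each item with the fitted prediction. The sketch below uses an ordinary least-squares fit, which is an assumption on my part and not necessarily how the red prediction lines in the charts were drawn; the function names are hypothetical.

```python
import math

def fit_loglog_line(hits):
    """Fit log10(hits) = intercept + slope * log10(rank) by least squares.

    `hits` is any list of positive counts (page views, referrals, query
    visits). It is sorted here so that rank 1 is the most popular item.
    Returns (slope, intercept).
    """
    counts = sorted((h for h in hits if h > 0), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two positive counts to fit a line")
    xs = [math.log10(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log10(c) for c in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def compare_with_line(hits, slope, intercept):
    """Return (rank, actual, predicted) triples for every ranked item."""
    counts = sorted((h for h in hits if h > 0), reverse=True)
    return [
        (rank, actual, 10 ** (intercept + slope * math.log10(rank)))
        for rank, actual in enumerate(counts, start=1)
    ]

# Synthetic data that follows hits = 1000 / rank exactly.
example_hits = [1000 / rank for rank in range(1, 101)]
slope, intercept = fit_loglog_line(example_hits)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

Ranks where the actual count sits well above or below the predicted one show whether your deviation is in the head, the middle, or the tail. Because a fit over all ranks is pulled toward a drooping tail, it may be more informative to fit only the top-ranked items and extend that line.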
Copyright © 2006 by Jakob Nielsen. ISSN 1548-5552