's column on why Web
pages must live forever
I have received some interesting user comments on my November Alertbox on why pages should be kept on your site forever once you publish anything online.
Historical Accuracy vs. Revisionism
, the publisher of
Unlike many of your columns, this one revealed nothing particularly new to me, not through any lack of insight in the column, but because it covers ground we've been emphasizing with
since the very beginning when TidBITS was distributed as a HyperCard stack that could merge itself into an archive of previous issues. I suppose it has something to do with my mother being the Cornell University Archivist. :-)
Now, of course, all of our back content is on the Web in searchable and browsable form (actually, several forms, since with a publication, there's a difference between providing access to individual articles and entire issues, complete with the context of surrounding articles). We even have a graphic (graphical text) in our banner that changes on an hourly basis to point at what we consider our most interesting articles or series of articles. We also point to our old content quite frequently (the most fun came recently, when the latest version of the backup program Retrospect added a feature that I had written about six years ago in an April Fools joke issue) - for that we run everything through a CGI that provides us with short and permanent URLs.
The one comment I would make about content gardening, as you explain it, is that it's often not appropriate to modify the old content in any way. The reason is simple - historical accuracy. When we publish an issue of TidBITS, we consider that issue set in stone afterwards, and the only change we consider acceptable in our database and archive of issues is one to fix a typo or misspelling. (There's little historical interest in those.)
At least with a publication, it's simply too tempting to go back and change articles to reflect what may have happened after deadline. But, everyone has seen the issue that went out, so it's deceptive to modify that information afterwards. If someone reads an article, then later finds it in a search after it has been modified, they stand to be confused. Think about an email address. People move, addresses change. Should we update email addresses for everyone who has ever been mentioned in TidBITS? That task alone would require a huge amount of constant vigilance, to say nothing of the many URLs that have appeared in issues since the Web came about.
Plus, from a historical standpoint, it's dangerous to go back and change content. Historians rely on originals and successive drafts at times, and although the computer has eliminated most drafts, we can at least save our originals so people have solid information to look back upon in the future.
(Modifications also play into historical revisionism - changing the words of the past to make the present sound better. A large software company that shall remain nameless once issued a press release announcing that some product would appear in April. April came and went, and in May, we checked that press release to verify that the product was late, and lo and behold, the release date had changed in the press release on the company's site. Having not saved the original press release, I couldn't prove this happened after the fact, but it underscores the concerns I have with changing content.)
The other problem is one of volume. We've been publishing weekly on the Internet since April of 1990. This week we're putting out issue 457, and we're at about 4,000 articles spanning 8.5 years. There's simply no way we could continually make sweeps through our back content to keep it up to date.
I agree that there is a danger of historical revisionism: I have a column planned for the end of December where I will revisit my predictions for the Web in 1998, and there is certainly some temptation to go back and rewrite the old article to make it look better.
While I think that it is important to maintain the original
of the page, I do not think that the original text has to be set in stone. For example, it would be quite appropriate to modify the press release to say that the scheduled release date had been changed and to list the new month. This should be done in a way that makes it clear that the original press release had mentioned a different date.
If we ever get a Web implementation of Ted Nelson's original hypertext vision as expressed in
, users could also employ Nelson's
to review previous versions of a page and see the original, pristine copy. One more reason we need a hugely more sophisticated publishing model for the Web.
It is true that content gardening can grow problematic for ever-increasing archives, but the following three steps can make it more practical:
don't aim to make every single update that presents itself: focus on those updates that add substantial value (e.g., a link from the April Fools story to the subsequent product release)
prioritize the gardening efforts to focus on those pages that continue to attract significant page views
automate as many aspects of content gardening as possible, for example using systems like LinkBot
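As a rough illustration of the second and third steps, here is a minimal Python sketch. It is only one possible approach, not a description of how LinkBot or any particular tool works. The function names, the page paths, and the page-view threshold are all hypothetical; the sketch assumes you already have page-view counts from your server logs.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag found in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_text):
    """Return all link targets in html_text, in document order."""
    collector = LinkCollector()
    collector.feed(html_text)
    return collector.links


def pages_worth_gardening(page_views, min_views):
    """Return paths with at least min_views page views, busiest first.

    page_views maps a page path to its view count (hypothetical data
    that would come from your own log analysis).
    """
    ranked = sorted(page_views.items(), key=lambda item: item[1], reverse=True)
    return [path for path, views in ranked if views >= min_views]


if __name__ == "__main__":
    # Hypothetical archive pages and view counts, for illustration only.
    views = {
        "/1992/april-fools.html": 900,
        "/1995/old-news.html": 12,
        "/1998/predictions.html": 4500,
    }
    for path in pages_worth_gardening(views, min_views=100):
        print("review:", path)
```

The links returned by `extract_links` for each high-traffic page could then be handed to whatever link-checking process you already use, so that the manual effort goes only to pages people still read.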
Keeping Old Pages From Being Searched
from Rochester, NY writes:
Old URLs should never die, in theory. The user should always be able to get to the page they want.
My site receives a huge percentage of its traffic from people trying various portals for movie information. Occasionally, I notice that one of the portals points folks to one of the three servers my site occupied in '97 and '98. The search engine FAQs typically say that the only way to remove a page from their index is to remove it from the server, AKA kill the URL. Since my traffic is generated by search engines, I want users to get to the right place (the current server) ASAP, but to do so requires that I cut off people using engines whose spiders are behind.
You raise a real issue: search engines are often far behind in updating their index. This is one of the reasons to keep the old pages alive so that users get
. I recommend having a prominent link on the old page stating that newer info is available.
A well-behaved search engine should skip any pages that have a NOINDEX meta tag in the HEAD section:
<META name="robots" content="NOINDEX">
Alternatively, use the robots.txt file to keep search crawlers from entering obsolete segments of your site. I would still recommend adding an individual NOINDEX tag to all pages that you want to be kept out of search engines.
Unfortunately, not all search engines are well-behaved, but the big ones are. Over the long term, users will gravitate to those search engines that employ these protocols correctly because they will tend to return more useful search results.
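To check which pages a well-behaved crawler would skip under a given set of robots.txt rules, something like the following Python sketch may help. It uses the standard library's `urllib.robotparser`; the rules, the directory names, and the example.com URLs are assumptions made up for illustration, and `is_crawlable` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules that close off two obsolete sections.
ROBOTS_TXT = """\
User-agent: *
Disallow: /1997/
Disallow: /old-server/
"""


def is_crawlable(robots_txt, url, user_agent="ExampleBot"):
    """Return True if a crawler obeying robots_txt may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical URLs on a placeholder domain.
    for url in (
        "http://example.com/1997/review.html",
        "http://example.com/current/review.html",
    ):
        status = "allowed" if is_crawlable(ROBOTS_TXT, url) else "blocked"
        print(url, status)
```

This only tells you what a crawler that honors the rules would do; it says nothing about crawlers that ignore them.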
Guided Tours to Old Content
Chug Roberts, Director of Editorial Services for the
Having been involved in a legal news service that considered many of the issues you discussed (West's Legal News, which ceased publishing January 1997, published simultaneously on WESTLAW and the web), I especially enjoyed your current article, Web Pages Must Live Forever.
However, I'm not sure that updating old links is practical for news-related archives. An archive containing numerous stories with numerous links rapidly becomes unwieldy to update. My preference would be to devote limited resources to developing new content unless we could generate revenue from keeping the old material updated.
West's Legal News did have several categories of documents that were updated periodically: Pathfinders, which were narrowly focused (e.g., Tobacco Litigation) and heavily researched documents, and our Source documents that essentially acted as scope notes for a practice area. Even then, we created new documents that linked to older versions so we could keep our citations stable and preserve older versions. In each Pathfinder, we provided instructions on how to search in the Pathfinder database for a newer version.
West's Legal News was unique because it essentially served as a merchandizing tool for a huge, stable, fee-based archival database: WESTLAW.
Great idea to identify categories that receive additional content gardening emphasis. I particularly like your idea of pathfinders
(not to be confused with the
site of the same name) and source documents with an overview of a particular area. The idea of pathfinders goes all the way back to Vannevar Bush, who talked about "trail blazers" in 1945.
Pathfinders and source documents are implementations of the concept of guided tours through a hypertext space: they point users to valuable resources as judged by somebody with dual expertise: domain knowledge and familiarity with the site content and structure.
Overview documents with pointers to the best old content can dramatically enhance the value of a site and free the user from having to search manually through an ever-growing pile of content. And since the original hyperlinks remain in place, the user can pursue additional old content by following the links from the recommended pages.