An interesting post by Allan Engelhardt from July 2004 provides additional context to the problem:
I do not wish to rely on authors creating their own categories. In my experience, people don’t categorise. Getting anybody to document what they are doing is enough of a challenge without bringing up topics like information architecture.
That is why I am looking for automatic (unsupervised) text clustering. If I have somebody in London who has great ideas for my retail shops; somebody in Manchester who is experimenting with practical changes to my consumer stores; and a man in Glasgow who would like to promote change in our high-street outlets; how do I enable them to discover each other and work together?
I cannot use a standard text classifier on this because I do not have a training set.
An alternative approach explored by people like Matt Mower of eVectors is to assume that 10-20% of people will classify and use that to automatically classify the rest. That is an interesting assumption and a well-understood problem (classify text based on examples) with well-documented solutions from naive Bayes through neural networks and on to support vector machines and similar solutions (listed here in roughly increasing order of performance).
However, the issue is that you are always using yesterday’s taxonomy to categorise. I am not very interested in this, because chances are that if you have a useful taxonomy then you have existing projects within the organisation dealing with the issues, and promoting existing (funded) projects within a company is a (largely) solved organisational problem.
I’m interested in tomorrow’s taxonomy to bring together people around new innovative ideas. In the example above, assume that retail stores are a new idea and that the corporate terminology (“retail shops”, “consumer stores”, or “high-street outlets”) has not yet been embedded within the corporate culture. How can I bring together the idea-man in London with the guys in Manchester who can implement them and the manager in Glasgow who can promote the change within the organisation and help it become a change project?
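To make the idea of grouping documents without a training set concrete, here is a minimal sketch in Python. It is not Engelhardt's method or any particular product's algorithm: it is a simple greedy grouping on bag-of-words cosine similarity, and the sample notes, the stopword list and the 0.25 threshold are all invented for illustration. Note that it only links the three retail notes because they happen to share surrounding words ("customer", "experience", "changes"); it does not know that shops, stores and outlets mean the same thing, which is the harder part of the problem described above.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list, just enough for this toy example.
STOPWORDS = {"a", "an", "and", "the", "to", "for", "in", "of", "our", "with"}

# Invented sample notes loosely based on the London/Manchester/Glasgow example.
NOTES = [
    "London: ideas to improve customer experience in retail shops",
    "Manchester: testing changes to customer experience in consumer stores",
    "Glasgow: promote changes to customer experience in high-street outlets",
    "Finance: quarterly payroll tax filing deadline",
]


def tokenize(text):
    """Lowercase the text and return its alphabetic words minus stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster(notes, threshold=0.25):
    """Greedily group notes: a note joins the first existing group that
    contains any member at least `threshold` similar to it; otherwise it
    starts a new group. Returns lists of indices into `notes`."""
    vectors = [Counter(tokenize(n)) for n in notes]
    groups = []
    for i, vec in enumerate(vectors):
        for group in groups:
            if any(cosine(vec, vectors[j]) >= threshold for j in group):
                group.append(i)
                break
        else:
            groups.append([i])
    return groups


if __name__ == "__main__":
    print(cluster(NOTES))  # [[0, 1, 2], [3]]
```

No one labelled anything here, so there is no yesterday's taxonomy involved; the groups fall out of the text itself. A real system would need something beyond raw word overlap to connect people who describe the same new idea in entirely different vocabulary.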
Reprise Media’s SearchViews provides a nice, brief summary of Rich Skrenta’s post: the Reference Web is goal-directed – it delivers results based on relevancy to a user’s search. This includes sites and services like Google, Amazon, and IMDB. The Incremental Web is goal-directed as well, but is organized chronologically. This includes subject feed sites like The NY Times, Gawker, and Google News.
Greg Linden: “Even if you monitor just a few tens of sources, you are facing a daily stream of hundreds or thousands of articles. It’s a painful, overwhelming task to manually skim it hunting for relevant content. There is precious little discovery in the current model.”
As I thought more about it (and brainstormed with others), I realised that there was something missing in this picture. Between the Reference Web and the Incremental Web, we need an Archived Web. More precisely, the Incremental Web and the Archived Web need to be built around subscriptions, tags and discovery. And outside of these three Webs will be the Community Web, one that is built around our social networks and which no search engine can crawl. Other than the Reference Web, each of the other Webs will be prefixed with “My”. This is what I call the Four-Web model.
Tomorrow: The Four-Web Model