Searching to Filtering

Brian writes:

A lot of people talk about search as the “killer app” of the Internet. It’s not. Search is easy. Filtering is hard. Speed is hard.

This is a vast oversimplification. Search isn’t really easy, but it isn’t terribly difficult to present search results for a search of hundreds of thousands or millions of documents. Presenting relevant results is another story. That’s where filtering and speed come in. Search is useless without some kind of filtering or sorting, and painful without the speed to which we’ve become so accustomed. Both are reasons why Google has been so successful with its PageRank algorithm.

Does anybody remember Usenet? There were a couple of NNTP newsreaders that would do article scoring: they would run each article in a newsgroup through a number of filters and assign each article a score. You could then sort the articles in a newsgroup by score and read the highest-scoring articles while merely skimming the titles (or ignoring completely) the rest of the articles. The web has much less formal structure than Usenet and so this kind of filtering is much more difficult to do.
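To make the article-scoring idea concrete, here is a minimal sketch in Python. It is not the format or logic of any particular newsreader; the `Article` class, the rule list, and the point values are all invented for illustration. Each rule names a header field, a pattern, and a score adjustment, and an article's score is the sum of the rules that match.

```python
import re
from dataclasses import dataclass


@dataclass
class Article:
    """A hypothetical, stripped-down article: just the fields we score on."""
    subject: str
    author: str
    body: str = ""


# Each rule is (header field, regular expression, score adjustment).
# These rules are made-up examples, not taken from any real score file.
RULES = [
    ("subject", r"\bC#", 50),
    ("subject", r"\bfor sale\b", -100),
    ("author", r"helpful\.poster@example\.com", 25),
]


def score(article, rules=RULES):
    """Sum the adjustments of every rule whose pattern matches its field."""
    total = 0
    for field, pattern, points in rules:
        if re.search(pattern, getattr(article, field), re.IGNORECASE):
            total += points
    return total


def sort_by_score(articles, rules=RULES):
    """Return the articles ordered from highest to lowest score."""
    return sorted(articles, key=lambda a: score(a, rules), reverse=True)
```

A reader would then work down the list returned by `sort_by_score` and stop once the scores drop below whatever cutoff feels right. Note that this only works because every article has reliable fields such as a subject and an author, which is exactly the formal structure the web lacks.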

In the world of weblogs, there are a few companies targeting weblog search. Nobody seems to be thinking about filtering. There will be a need, in the not-too-distant future, for weblog aggregators to have something like the current crop of spam filters.

Here’s why: Imagine a C# programmer who is subscribed to thirty or so .NET, C#, and related weblogs. Even if all of these weblogs stay completely on-topic, there will be a fair amount of noise in this programmer’s aggregator. This might include posts in a .NET weblog that are more focused on ASP.NET or upcoming features in Longhorn. All this programmer wants is to read tips, suggestions, and the occasional open-ended question on C# programming. Every day he reads roughly fifty posts and ignores about half of them. Over time he subscribes to more weblogs — about one new subscription per week. At that rate, he won’t be able to keep up with his subscriptions after a couple of months.

Now imagine that he has two buttons in his aggregator: “signal” and “noise”. He can mark any post he reads as either signal (interesting) or noise (uninteresting). The aggregator can use a number of techniques to learn what the programmer finds interesting, and can filter out the noise. With this feature, he can be subscribed to twice the number of weblogs and get twice the amount of good information in the same amount of time spent reading every day, much as we used to with article scoring on Usenet.
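One of the techniques an aggregator could use is the same one many spam filters rely on: a naive Bayes classifier over the words in each post. The sketch below is only an illustration under that assumption. The `SignalNoiseFilter` class and its method names are hypothetical, and a real aggregator would also want to persist the counts and look at more than raw words.

```python
import math
import re
from collections import Counter


class SignalNoiseFilter:
    """Hypothetical naive Bayes filter trained by 'signal'/'noise' clicks."""

    def __init__(self):
        self.word_counts = {"signal": Counter(), "noise": Counter()}
        self.post_counts = {"signal": 0, "noise": 0}

    @staticmethod
    def tokenize(text):
        # Keep tokens such as "c#" and "asp.net" in one piece.
        return re.findall(r"[a-z0-9#+]+(?:\.[a-z0-9#+]+)*", text.lower())

    def mark(self, text, label):
        """Record one button press; label is 'signal' or 'noise'."""
        self.word_counts[label].update(self.tokenize(text))
        self.post_counts[label] += 1

    def signal_probability(self, text):
        """Estimate the probability that a post is signal."""
        total_posts = sum(self.post_counts.values())
        if total_posts == 0:
            return 0.5  # nothing learned yet, so no opinion either way
        vocab = set(self.word_counts["signal"]) | set(self.word_counts["noise"])
        words = self.tokenize(text)
        log_scores = {}
        for label in ("signal", "noise"):
            # Smoothed prior: how often this button has been pressed.
            log_p = math.log((self.post_counts[label] + 1) / (total_posts + 2))
            total_words = sum(self.word_counts[label].values())
            for word in words:
                # Add-one smoothing, with one extra slot for unseen words.
                count = self.word_counts[label][word]
                log_p += math.log((count + 1) / (total_words + len(vocab) + 1))
            log_scores[label] = log_p
        # Convert the two log scores into a normalized probability.
        top = max(log_scores.values())
        weights = {k: math.exp(v - top) for k, v in log_scores.items()}
        return weights["signal"] / (weights["signal"] + weights["noise"])
```

Each press of a button calls `mark`, and the aggregator could hide or dim any new post whose `signal_probability` falls below a threshold the reader chooses. The more posts he marks, the better the estimate becomes.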

Published by

Rajesh Jain

An Entrepreneur based in Mumbai, India.