RSS’ Missing Bits

David Galbraith writes:

On the detailed level: RSS content is so unnormalized as to be almost useless for commercial applications. To build a searchable index of RSS content you need access to the full text of stories – and commercial publications are not going to syndicate the full text of stories – but you don’t need to syndicate the full text of stories to index them. Encouraging the use of tokenized full text (i.e. with stop words such as and, or, the, etc. removed) allows machines to index full articles but leaves humans to visit the original publishers’ sites for the full article. This should be the default content of a ‘content’ tag and needs to be built into the default output from weblog publishing tools.
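To make the tokenized full text idea concrete, here is a minimal Python sketch. The stop-word list and the function names `tokenize_for_index` and `content_field` are illustrative assumptions, not part of any RSS specification or weblog tool.

```python
import re

# Illustrative stop-word list (an assumption); a real tool would use a fuller one.
STOP_WORDS = {"a", "an", "and", "or", "the", "of", "to", "in", "is", "are", "for", "on"}

def tokenize_for_index(text):
    """Lowercase the text, split it into words, and drop stop words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def content_field(text):
    """Join the surviving tokens into a string suitable for a 'content' tag."""
    return " ".join(tokenize_for_index(text))
```

The output keeps enough vocabulary for a search index to match queries against the story, while being awkward enough to read that a person would still click through to the publisher’s site.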

On the medium scale: because of arguments over the RSS core, not enough focus has been put on tools to create modules and allow extensibility. Forms need to be built into applications such as Userland’s, Blogger and Movable Type to allow end-user creation of RSS modules within a user’s namespace, without users needing to know anything about the underlying XML. Rapid adoption of modules will take syndicated content beyond the headline/link pair that is the only metadata currently being syndicated in any volume.
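As a rough illustration of what such a form could generate behind the scenes, the sketch below uses Python’s standard `xml.etree.ElementTree` to add fields from a user-defined namespace to an RSS item. The function name `build_item`, the `review` prefix and the `http://example.com/ns/review` URI are all hypothetical.

```python
import xml.etree.ElementTree as ET

def build_item(title, link, namespace_uri, prefix, fields):
    """Build an RSS <item> whose extra fields live in the user's own namespace."""
    # Ask ElementTree to serialize the namespace with a readable prefix.
    ET.register_namespace(prefix, namespace_uri)
    item = ET.Element("item")
    ET.SubElement(item, "title").text = title
    ET.SubElement(item, "link").text = link
    # Each form field becomes one namespaced child element.
    for name, value in fields.items():
        ET.SubElement(item, f"{{{namespace_uri}}}{name}").text = str(value)
    return ET.tostring(item, encoding="unicode")

# Hypothetical module: a "review" namespace carrying a rating.
xml_text = build_item(
    "A post", "http://example.com/post/1",
    "http://example.com/ns/review", "review", {"rating": 4},
)
```

The user fills in a field name and a value; the namespace declaration and element syntax are produced for them.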

On the larger scale: content and the weblog API are two parts of the whole – most important of all perhaps is the ping server and related specs. In order to build personalized aggregators of real-time information, all of a weblog post needs to go to a neutral third-party ping server, and the ping server needs to have an API that allows clients to be alerted of changes in real time without having to scrape the ping server. Do this and you don’t have 15-minute-old Google-aggregated news but real-time news – the stuff that people like Reuters know the value of.
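The push model described here can be sketched as a simple publish/subscribe pattern. This is an in-process Python illustration only: the `PingServer` class and its methods are hypothetical, and a real ping server would deliver notifications over a network protocol rather than through direct function calls.

```python
class PingServer:
    """Toy ping server: weblogs push full posts in, clients are alerted at once."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        """Register a client callback taking (weblog_url, post)."""
        self._subscribers.append(callback)

    def ping(self, weblog_url, post):
        """Receive a new post and notify every subscriber immediately."""
        for callback in list(self._subscribers):
            callback(weblog_url, post)

# Usage: an aggregator collects posts as they arrive instead of polling.
inbox = []
server = PingServer()
server.subscribe(lambda url, post: inbox.append((url, post)))
server.ping("http://example.com/blog", "Full text of a new post")
```

The point of the design is that the client never has to scrape or poll: the server calls out to it the moment a post arrives.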


Published by

Rajesh Jain

An Entrepreneur based in Mumbai, India.