TECH TALK: News Refinery: News Refining Process
News pages are the ore. We set up a complete refining process to remove the junk from these pages so that we can get to the “metal”. The components of the Refinery are:
- PageBot: URLs are fed into a bot. The output of the bot is an HTML page, which contains the headlines. The bot also needs to check whether the page has changed, and to raise an alert if a page cannot be botted for 3 successive time periods. Fetched pages are queued for processing. The bots keep running constantly.
- PageQueue: Queues the pages and sends them to the HeadlinesExtractor. It needs to prioritise pages by importance if the queue gets big. It may also do checks, such as whether a page has not changed.
- HeadlinesExtractor: Gets the headlines from the HTML page, using a mix of automatic and manual techniques. The output is a collection of headlines. The description is included if it is available on the page alongside the headlines. NumberOnPage indicates the importance of the headline (its number in the sequence). The extractor also needs to check whether the headline/URL already exists and overwrite it if necessary.
- StoryBot: Takes the URL of the story and gets the full-text page. If the story is split across multiple pages, it should get the “printer-friendly” page, which most sites offer.
- StoryExtractor: Takes the story page and extracts the actual story from it, stripping away everything else on the page. It also tries to extract the author of the story and a summary (the first 20-odd words).
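The PageBot and PageQueue behaviour described above (change detection, an alert after 3 successive failures, and prioritisation by importance) could be sketched roughly as follows. This is only an illustration under assumptions: the class layout, the `fetch` callable, the use of a content hash to detect changes, and the numeric `importance` value (lower is served first) are all hypothetical and not part of the design above.

```python
import hashlib
import heapq
import itertools
from dataclasses import dataclass
from typing import Optional

MAX_FAILURES = 3  # from the text: alert after 3 successive failed periods


@dataclass
class PageState:
    """Per-URL bookkeeping (hypothetical structure)."""
    last_hash: Optional[str] = None
    consecutive_failures: int = 0


class PageBot:
    def __init__(self, fetch):
        # fetch is an assumed callable: url -> HTML string, raising on failure
        self.fetch = fetch
        self.states = {}
        self.alerts = []

    def poll(self, url):
        """Return the HTML if the page changed since the last poll, else None."""
        state = self.states.setdefault(url, PageState())
        try:
            html = self.fetch(url)
        except Exception:
            state.consecutive_failures += 1
            if state.consecutive_failures == MAX_FAILURES:
                self.alerts.append(url)  # raise the alert once, at the 3rd failure
            return None
        state.consecutive_failures = 0
        # Assumption: a page counts as "changed" when its content hash differs.
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        if digest == state.last_hash:
            return None
        state.last_hash = digest
        return html


class PageQueue:
    """Priority queue: lower importance number first, FIFO within ties."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps insertion order

    def put(self, url, html, importance):
        heapq.heappush(self._heap, (importance, next(self._counter), url, html))

    def get(self):
        _, _, url, html = heapq.heappop(self._heap)
        return url, html

    def __len__(self):
        return len(self._heap)
```

In a real system the hash comparison would probably have to ignore parts of the page that change on every load (dates, ads), which this sketch does not attempt.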
Thus, at this point, we have 3 databases: the URL database for botting, Headlines, and Stories.
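As a rough illustration of the Headlines database and the "overwrite if necessary" check, here is a minimal in-memory sketch. Apart from NumberOnPage and the description, which are mentioned above, the field names, the choice of the story URL as the key, and the `HeadlineStore` and `summarize` helpers are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Headline:
    url: str                           # URL of the full story (assumed key)
    text: str                          # headline text
    number_on_page: int                # position in the page sequence
    description: Optional[str] = None  # only if the page shows one


class HeadlineStore:
    """In-memory stand-in for the Headlines database (hypothetical)."""

    def __init__(self):
        self._by_url = {}

    def upsert(self, headline):
        """Store a headline; return "new", "updated", or "unchanged"."""
        existing = self._by_url.get(headline.url)
        if existing is None:
            self._by_url[headline.url] = headline
            return "new"
        if existing == headline:
            return "unchanged"
        self._by_url[headline.url] = headline  # overwrite the older record
        return "updated"


def summarize(story_text, max_words=20):
    """Summary as described for the StoryExtractor: the first 20-odd words."""
    return " ".join(story_text.split()[:max_words])
```

Keying on the URL means a reworded headline for the same story overwrites the old one, while the same headline appearing under a different URL is stored as a separate record; whether that is the right trade-off is left open here.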
Then, we come to the Analytics.
- Classifier: Classifies the headline/story by topic/concept and puts it in a hierarchy.
- Analyser: Does more detailed analysis, based on aggregated browsing histories.
- Miner: Looks for other types of trends in what people are reading and doing.
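To make the Classifier step concrete, here is a deliberately simple sketch that places a headline in a two-level topic hierarchy by counting keyword matches. The keyword-matching technique, the `TOPIC_KEYWORDS` table, and the topics in it are all assumptions for illustration; the text above does not say how classification would actually be done.

```python
import re

# Hypothetical hierarchy: (top-level topic, sub-topic) -> suggestive keywords.
TOPIC_KEYWORDS = {
    ("Business", "Markets"): {"stocks", "shares", "sensex", "nasdaq"},
    ("Technology", "Internet"): {"web", "browser", "email", "broadband"},
    ("Sports", "Cricket"): {"cricket", "wicket", "innings"},
}


def classify(text):
    """Return the hierarchy path with the most keyword hits, or None."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    best_path, best_score = None, 0
    for path, keywords in TOPIC_KEYWORDS.items():
        score = len(words & keywords)
        if score > best_score:
            best_path, best_score = path, score
    return best_path
```

For example, `classify("Stocks and shares rally")` matches two Markets keywords and returns `("Business", "Markets")`, while a headline with no known keywords returns `None`.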
I have not yet thought through how one makes money from this. The most likely route is to make it into a subscription service (a News Portal), charging for use of the advanced features. It’s a service which will grow on people. Like Google. No one likes to miss out on a story. People also like to share stories and be the first to share them. We all email interesting stories to friends. After email and instant messaging (both falling in the communications category), news and information is what we access most often on the Web. There is so much of it out there, but so little structure to it all.