Excerpts from a SearchEngineWatch interview with Rich Skrenta about Topix, which “combines an excellent news search engine with two other hot technologies: local search and personalization. The Topix database includes full text news stories from over 4,000 sources, including a great deal of content that’s difficult to quickly access elsewhere. The real power of this nifty news search engine comes from its easy-to-use pre-built pages that aggregate news and other information into more than 150,000 topic-specific pages.”
Rather than starting with a full web crawl, which has 4 billion+ pages, we started with news, which has 4,000 sources, and is very dynamic and high quality content. We don’t cover everything in the world yet, but we do have every place in the U.S., every sports team, music artist, movie personality, health condition, public company, business vertical, and many other topics.
We developed separate software modules to crawl, cluster and categorize articles. The heart of our system is a proprietary AI categorizer that uses a massive Knowledge Base (KB) to determine the geographic location and subject categorization for each story. The final step is the Robo-Editor, which picks the best stories for display.
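As a rough illustration of how cluster, categorize, and pick stages could be chained, here is a toy sketch. It is not Topix's implementation: the interview does not describe the algorithms, so the tiny keyword `KB`, the title-based clustering, the `prominence` field, and every function name below are assumptions made for the example. The crawl stage is left out, and articles are plain dictionaries.

```python
from collections import defaultdict

# Hypothetical toy knowledge base mapping a keyword to a category.
KB = {
    "seattle": "place/seattle-wa",
    "mariners": "sports/seattle-mariners",
}

def cluster(articles):
    """Group articles whose titles contain the same set of lowercase words."""
    groups = defaultdict(list)
    for article in articles:
        key = frozenset(article["title"].lower().split())
        groups[key].append(article)
    return list(groups.values())

def categorize(article):
    """Return the sorted KB categories whose keyword appears in the article."""
    words = set((article["title"] + " " + article["text"]).lower().split())
    return sorted({cat for keyword, cat in KB.items() if keyword in words})

def pick_best(clusters):
    """Stand-in for an editor step: keep the most prominent article per cluster."""
    return [max(c, key=lambda a: a["prominence"]) for c in clusters]

def run_pipeline(articles):
    """Categorize every article, cluster duplicates, keep one per cluster."""
    for article in articles:
        article["categories"] = categorize(article)
    return pick_best(cluster(articles))
```

Calling `run_pipeline` on a list of article dictionaries with `title`, `text`, and `prominence` keys returns one representative per cluster, each annotated with a `categories` list.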
We have a commercial feed business for companies that want to enhance their own website offerings with deeply categorized news content. Topix.net offers an extremely rich newsfeed — in addition to the standard URL, title, and summary, we have the latitude/longitude of the news source, the latitude/longitude for the subjects of the story, the prominence of the news source, the subject categorizations, and more. We can also “geo-spin” any subject category, to produce a locally focused version. These features give us a lot of flexibility to customize feeds for clients.
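The fields listed above can be pictured as a single record per story. The sketch below is only an illustration: the field names, the `FeedItem` class, and the `geo_spin` function are hypothetical, and the interview does not say how "geo-spinning" is actually computed. Here it is approximated as a filter that keeps stories in a category whose subject location falls within a radius of a chosen point.

```python
from dataclasses import dataclass, field
from math import asin, cos, radians, sin, sqrt
from typing import List, Tuple

LatLon = Tuple[float, float]  # (latitude, longitude) in degrees

@dataclass
class FeedItem:
    """Hypothetical record holding the feed fields named in the interview."""
    url: str
    title: str
    summary: str
    source_latlon: LatLon
    subject_latlons: List[LatLon]
    source_prominence: float
    categories: List[str] = field(default_factory=list)

def haversine_km(a: LatLon, b: LatLon) -> float:
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))

def geo_spin(items, category, center, radius_km=50.0):
    """Assumed reading of 'geo-spin': keep items in `category` with a subject near `center`."""
    return [
        item for item in items
        if category in item.categories
        and any(haversine_km(center, p) <= radius_km for p in item.subject_latlons)
    ]
```

For example, `geo_spin(items, "health", (47.6, -122.3))` would keep only health stories whose subjects lie within 50 km of that point near Seattle.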
In addition to newspapers, Topix.net is crawling radio and TV station websites, college papers, and some high school papers and weblogs. We’re also crawling government websites with “newsy” public information, such as police department crime alerts, health department reports, OSHA violation announcements, coast guard notices, and news releases from other city, county and state level government entities. We are crawling and including press releases too.
Our focus is on hyperlocal deep coverage of the U.S. We love police blotters and little papers with extremely local coverage. If your local PTA has online meeting minutes, that’s the kind of source we want to add.
The Seattle Times has an interview with Greg Linden of Findory, which “offers free, instant personalization of news searches at http://www.findory.com. It learns from the news you select to read and finds articles that match your interests.”
What’s the point? There’s a glut of news out there, and Findory News is one way to find some focus within that information, Linden said. The Web site keeps a record of the articles you’ve read in the past and uses that information to automatically pick the articles you would likely be interested in.
How does it work? Each visitor is assigned an anonymous identifier, a random number that’s part of a cookie (a piece of data that tracks a visitor’s preferences). It associates news searches with the individual identifiers. The service is anonymous in that it doesn’t know anything about users other than the articles they’ve read.
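The mechanism described above can be sketched in a few lines. This is a minimal illustration, not Findory's code: the cookie name, the in-memory `read_history` store, and both function names are assumptions, and cookies are modeled as a plain dictionary rather than real HTTP headers.

```python
import secrets
from collections import defaultdict

COOKIE_NAME = "visitor_id"  # hypothetical cookie name

# Maps an anonymous identifier to the list of articles that visitor has read.
read_history = defaultdict(list)

def get_or_assign_visitor_id(cookies: dict) -> str:
    """Return the visitor's random identifier, creating one on the first visit."""
    visitor_id = cookies.get(COOKIE_NAME)
    if visitor_id is None:
        visitor_id = secrets.token_hex(16)  # 128 random bits, no personal data
        cookies[COOKIE_NAME] = visitor_id
    return visitor_id

def record_read(cookies: dict, article_url: str) -> str:
    """Associate a read article with the visitor's anonymous identifier."""
    visitor_id = get_or_assign_visitor_id(cookies)
    read_history[visitor_id].append(article_url)
    return visitor_id
```

The only thing stored against the identifier is the reading history, which matches the interview's point that the service knows nothing about users beyond the articles they have read.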
Why news? Lots of companies are developing their own personalization services, and Linden could make a nice living as an executive geek. He said he picked the news business because it has a redeeming social value. “If you make it easier for people to read the news, to spend less time and be more informed, that actually has a lot of value,” he said. “People make better decisions.”