Emergic: Rajesh Jain's Blog

Screen Scraping for RSS

August 29th, 2003

Phil Windley points to a post by Bill Humphries: “He’s using curl to get the page, tidy to clean up the HTML, and an XSL program to convert the result into RSS. Because the example he’s using is making good use of CSS, he can use XPath to easily grab the right nodes in the HTML doc. Very different from the Perl screen scrapers we were writing 4 years ago.” It would be good to do this for all the Indian newspapers; none that I know of has its own RSS feed.
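
For a rough feel of how such a pipeline hangs together, here is a minimal sketch in Python using only the standard library. It is not Bill Humphries's curl, tidy and XSL setup. It assumes the page has already been fetched and cleaned into well-formed markup (the curl and tidy steps), then uses an XPath-style query on a CSS class to pick out the headline links and writes them as RSS 2.0 items. The names SAMPLE_PAGE and page_to_rss, the "headline" class, and the news.example.com address are all made up for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical sample: stands in for a news page that has already been
# fetched and cleaned into well-formed markup (the curl and tidy steps).
SAMPLE_PAGE = """<html><body>
<div class="headline"><a href="/story/1">First story</a></div>
<div class="sidebar"><a href="/ads">Advertisement</a></div>
<div class="headline"><a href="/story/2">Second story</a></div>
</body></html>"""


def page_to_rss(page_xml, site_title, site_url, css_class="headline"):
    """Return an RSS 2.0 string built from links inside <div class=css_class>."""
    page = ET.fromstring(page_xml)

    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = site_title
    ET.SubElement(channel, "link").text = site_url
    ET.SubElement(channel, "description").text = "Scraped headlines from " + site_title

    # ElementTree supports a small XPath subset; this predicate matches
    # divs whose class attribute is exactly css_class.
    for link in page.findall(f".//div[@class='{css_class}']/a"):
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = "".join(link.itertext()).strip()
        # Resolve relative hrefs against the site address.
        ET.SubElement(item, "link").text = urljoin(site_url, link.get("href", ""))

    return ET.tostring(rss, encoding="unicode")


if __name__ == "__main__":
    print(page_to_rss(SAMPLE_PAGE, "Example News", "http://news.example.com/"))
```

The approach depends on the site marking up its headlines consistently with CSS classes. A real newspaper page would need its own class name and probably a namespace-aware query if tidy emits XHTML.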


Tags: BlogStreet
