Screen Scraping for RSS

Phil Windley points to a post by Bill Humphries: “He’s using curl to get the page, tidy to clean up the HTML, and an XSL program to convert the result into RSS. Because the example he’s using is making good use of CSS, he can use XPATH to easily grab the right nodes in the HTML doc. Very different from the PERL screen scrapers we were writing 4 years ago.” It would be good to do this for all the Indian newspapers, since none that I know of has its own RSS feed.
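
For anyone who wants to try the idea, here is a rough sketch of the last step in Python. It is not Bill's XSL program. It swaps in the standard library's ElementTree and its limited XPath support, and it assumes the page has already been fetched and tidied into well-formed markup with no XML namespace. The sample page, the "headline" class, the feed title and the example.com URLs are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: well-formed markup as it might look after curl + tidy.
SAMPLE_XHTML = """<html>
  <body>
    <div class="headline"><a href="http://example.com/story1">First story</a></div>
    <div class="sidebar"><a href="http://example.com/ad">Advert</a></div>
    <div class="headline"><a href="http://example.com/story2">Second story</a></div>
  </body>
</html>"""


def page_to_rss(xhtml, feed_title, feed_link):
    """Build an RSS 2.0 document from the headline links in a cleaned-up page."""
    page = ET.fromstring(xhtml)

    # Minimal RSS 2.0 skeleton: a channel needs title, link and description.
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = feed_link
    ET.SubElement(channel, "description").text = "Scraped headlines from " + feed_title

    # The CSS class doubles as the XPath hook: only "headline" divs are picked.
    for anchor in page.findall(".//div[@class='headline']/a"):
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = (anchor.text or "").strip()
        ET.SubElement(item, "link").text = anchor.get("href")

    return ET.tostring(rss, encoding="unicode")


if __name__ == "__main__":
    print(page_to_rss(SAMPLE_XHTML, "Example Daily", "http://example.com/"))
```

A real newspaper page would need its own XPath expression, and tidy's XHTML output usually carries a namespace that the expression would have to account for.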


