I needed a very large list of feeds (thousands) to test the robustness of our feed parser, i.e. we want to be able to handle any feed, no matter what format (or lack of format) it is in.
Turns out getting a very large list is not that straightforward. Sure, you have 'top 100' lists and lots of directory sites that categorise feeds, but there is no one big list of a few thousand active blog/news feeds! Google does have an 'AJAX feed search' API, but it's server based and, more importantly, I wasn't entirely sure whether I'd be in breach of their T&Cs, so better safe than sorry.
So, I took to doing some simple screen scraping. I chose Google Blog Search to gather the feed URLs from, as it returns feed links in a single page when you search for a term.
The scraping process is quite simple:
- enter a search term into Google Blog Search
- get the 10 URL links returned (all conveniently flagged with class=f1; you can see this in Firebug or just by viewing the source)
- hit the next button to get the next 10 results
- repeat
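The extraction step above can be sketched roughly as follows. This is not the actual script: it uses plain regular expressions from the standard library as a stand-in for Hpricot, and it assumes each result is an anchor tag carrying class f1 and an href. The names `extract_f1_links` and `collect_feeds` are made up for illustration.

```ruby
require 'set'

# Assumption: each result link is an <a> tag with class f1 and an href,
# e.g. <a class="f1" href="...">. The real markup may differ.
ANCHOR_TAG = /<a\b[^>]*>/i
F1_CLASS   = /\bclass\s*=\s*(?:"[^"]*\bf1\b[^"]*"|'[^']*\bf1\b[^']*'|f1(?=[\s>]))/i
HREF_ATTR  = /\bhref\s*=\s*(?:"([^"]*)"|'([^']*)')/i

# Return the href of every anchor tagged with class f1, in page order.
def extract_f1_links(html)
  links = html.scan(ANCHOR_TAG).map do |tag|
    next unless tag =~ F1_CLASS
    match = HREF_ATTR.match(tag)
    match && (match[1] || match[2])
  end
  links.compact
end

# Accumulate links from many result pages; the Set drops duplicates.
def collect_feeds(pages)
  feeds = Set.new
  pages.each { |html| feeds.merge(extract_f1_links(html)) }
  feeds
end
```

Feeding every downloaded results page through `collect_feeds` gives you the unique list; with a real HTML parser the regular expressions would be replaced by a search for elements with that class.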
One problem though: Google detects that you're an automated script (which is correct!) and stops returning results if you misbehave. To get around this:
- use random sleeps between searches and between 'clicking' next (I was in no hurry for this script to run, so execution time didn't matter and I left very long pauses)
- make sure that the HTTP headers sent from your script match the headers your browser sends when you search manually (in my case Firefox on Windows). Wireshark is good for this sort of stuff; it makes it very easy to examine the HTTP headers.
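A minimal sketch of those two tricks, using only Ruby's standard library. The header values below are illustrative placeholders rather than the ones I captured, the 30 to 120 second pause range is an arbitrary example, and `random_pause`, `build_request` and `polite_fetch` are hypothetical helper names.

```ruby
require 'net/http'
require 'uri'

# Assumption: placeholder values. Replace them with the exact headers
# your own browser sends (e.g. as captured with Wireshark).
BROWSER_HEADERS = {
  'User-Agent'      => 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB) Firefox/3.0',
  'Accept'          => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language' => 'en-gb,en;q=0.5'
}

# Pick a random delay, in seconds, between min_seconds and max_seconds.
def random_pause(min_seconds, max_seconds)
  min_seconds + rand * (max_seconds - min_seconds)
end

# Build a GET request for the URL that carries the browser-like headers.
def build_request(url, headers = BROWSER_HEADERS)
  request = Net::HTTP::Get.new(URI.parse(url).request_uri)
  headers.each { |name, value| request[name] = value }
  request
end

# Sleep for a random interval, then fetch the page body.
def polite_fetch(url, min_seconds = 30, max_seconds = 120)
  uri = URI.parse(url)
  sleep(random_pause(min_seconds, max_seconds))
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(build_request(url)).body
  end
end
```

Calling `polite_fetch` once per results page keeps the pauses and headers in one place, so the widths of the delays are easy to tune if results stop coming back.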
The Ruby script itself is very simple (thanks to Hpricot, a fantastic library). I stopped it once it had gathered a list of roughly 25,000 unique feeds. Now, to start the real testing…