barnesdmd.co.uk

BBC feeds reparser (continued)

A brief update on the script I blogged about the other day, which reparses the partial feeds from the BBC News website and creates full feeds from them.

I’ve gone through the script tonight adding scrape caching (currently a 6 hour cache) and a few other tweaks. This has cut the build time by around three quarters, although of course this varies depending on whether articles have to be fetched.
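
To give a rough idea of what the scrape caching looks like, here is a minimal sketch in Python (not necessarily the language the script itself is written in). The names `cached_fetch`, `fetch` and `cache_dir` are made up for illustration, and `fetch` stands in for whatever actually downloads an article. The only detail taken from the script is the 6 hour lifetime.

```python
import hashlib
import os
import time

CACHE_TTL_SECONDS = 6 * 60 * 60  # the 6 hour cache mentioned above


def cached_fetch(url, fetch, cache_dir, ttl=CACHE_TTL_SECONDS, now=None):
    """Return the body for url, reusing a cached copy younger than ttl.

    fetch is any callable taking a URL and returning text (hypothetical;
    it stands in for the real downloading code).
    """
    now = time.time() if now is None else now
    os.makedirs(cache_dir, exist_ok=True)

    # Hash the URL so it is always safe to use as a file name.
    name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".html"
    path = os.path.join(cache_dir, name)

    # Cache hit: the file exists and was written less than ttl seconds ago.
    if os.path.exists(path) and now - os.path.getmtime(path) < ttl:
        with open(path, encoding="utf-8") as fh:
            return fh.read()

    # Cache miss or stale copy: fetch again and overwrite the cached file.
    body = fetch(url)
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(body)
    return body
```

With something like this in place, a feed rebuild only hits the site for articles that are new or whose cached copy has expired, which is where the saving in build time comes from.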

Now that I’ve done this, I can add a little more filtering to the scraper to remove the unwanted elements which crop up from time to time, such as voting forms, which I didn’t originally think about. Another thing I want to deal with can be seen if you look at the source code of any BBC news article: there are very few closing tags for paragraphs, so I’d like to handle this within my parser at least.
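
Both of those clean-ups can be sketched with Python's standard `html.parser` module. This is only an illustration under my own assumptions, not the scraper's actual code: `ArticleCleaner`, `clean_article` and the `BLOCK_TAGS` set are invented names, and the sketch silently drops comments and doctypes.

```python
from html import escape
from html.parser import HTMLParser

# Tags that end an open paragraph (an illustrative, incomplete set).
BLOCK_TAGS = {"div", "table", "tr", "td", "ul", "ol", "blockquote",
              "h1", "h2", "h3"}


class ArticleCleaner(HTMLParser):
    """Drops <form> blocks and closes any <p> that was left open."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.form_depth = 0   # > 0 while inside a <form> being dropped
        self.p_open = False   # True while a <p> is waiting for its </p>

    def close_paragraph(self):
        if self.p_open:
            self.out.append("</p>")
            self.p_open = False

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.form_depth += 1
        if self.form_depth:
            return            # skip everything inside a form
        if tag == "p" or tag in BLOCK_TAGS:
            self.close_paragraph()   # a new block ends the previous <p>
        if tag == "p":
            self.p_open = True
        attr_text = "".join(
            ' %s="%s"' % (k, escape(v or "", quote=True)) for k, v in attrs
        )
        self.out.append("<%s%s>" % (tag, attr_text))

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> get no separate end tag.
        if tag != "form":
            self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag == "form":
            self.form_depth = max(0, self.form_depth - 1)
            return
        if self.form_depth:
            return
        if tag == "p":
            self.close_paragraph()
            return
        if tag in BLOCK_TAGS:
            self.close_paragraph()   # close the <p> before its container
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.form_depth:
            self.out.append(escape(data, quote=False))


def clean_article(html):
    cleaner = ArticleCleaner()
    cleaner.feed(html)
    cleaner.close()
    cleaner.close_paragraph()   # a paragraph may still be open at the end
    return "".join(cleaner.out)
```

Feeding it a fragment with unclosed paragraphs and a voting form returns the paragraphs properly closed and the form gone, so the result is much easier to drop into a feed item.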

It’s been quite interesting now that I’ve released this. I added a log to the script so that I can see in more detail what is being fetched and when (and quite a large number of people are using it). I didn’t realise, for instance, that the useragent string in a Bloglines request gives the number of subscribers it’s catering for, which is quite useful. Google Reader, however, doesn’t appear to do this, which is a pity but of no real significance at this stage.
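
Pulling that subscriber number out of a logged useragent string could look something like the Python sketch below. I'm assuming the count appears as "N subscribers" somewhere in the string, as in the sample in the usage note, which is written from memory and may not match the real format exactly. The function name `subscriber_count` is just for illustration.

```python
import re

# Assumes the count appears as "<number> subscriber(s)" in the string.
SUBSCRIBERS_RE = re.compile(r"(\d+)\s+subscribers?", re.IGNORECASE)


def subscriber_count(user_agent):
    """Return the subscriber count reported in a useragent, or None."""
    match = SUBSCRIBERS_RE.search(user_agent)
    return int(match.group(1)) if match else None
```

For example, `subscriber_count("Bloglines/3.1 (http://www.bloglines.com; 25 subscribers)")` gives 25, while a useragent with no count in it gives `None`.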


One Response to “BBC feeds reparser (continued)”

  1. Simon Wakeman Says:

    Why doesn’t the BBC offer full RSS feeds?…

    There’s much discussion about the merits of publishing full or partial RSS feeds on your website. But I really don’t understand why the BBC doesn’t offer full feeds for their great range of news content.

    ……
