A brief update on the script I blogged about the other day, which reparses the partial feeds from the BBC News website and creates full feeds from them.
I’ve gone through the script tonight adding scrape caching (currently a 6 hour cache) and a few other tweaks. This has cut the build time by around three quarters, although of course this varies depending on whether articles have to be fetched.
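For anyone curious what a scrape cache of this kind looks like, here is a minimal sketch in Python. It is not the script's actual code: the `ScrapeCache` class, its in-memory dictionary and the `fetch` callable are all assumptions made for illustration, and only the 6 hour lifetime comes from the post.

```python
import time

CACHE_TTL_SECONDS = 6 * 60 * 60  # the 6 hour cache lifetime mentioned above


class ScrapeCache:
    """In-memory cache keyed by article URL (hypothetical helper)."""

    def __init__(self, ttl=CACHE_TTL_SECONDS, clock=time.time):
        self.ttl = ttl
        self.clock = clock      # injectable so the expiry logic can be tested
        self._store = {}        # url -> (fetched_at, html)

    def get(self, url, fetch):
        """Return cached HTML for url, calling fetch(url) only when stale."""
        entry = self._store.get(url)
        now = self.clock()
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]     # still fresh, skip the network round trip
        html = fetch(url)
        self._store[url] = (now, html)
        return html
```

Most of the saving comes from the fresh-entry branch: an article that has already been scraped within the last six hours costs a dictionary lookup instead of an HTTP request.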
Now that I’ve done this I can add a little more filtering to the scraper to remove the occasional unwanted elements which crop up from time to time, such as voting forms, which I didn’t originally think about. Another thing I want to deal with can be seen if you look at the source code of any BBC News article: there are very few closing tags for paragraphs, so I’d like to handle this within my parser at least.
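As a rough idea of how both clean-up jobs could be done, the sketch below uses Python's standard `html.parser` module to drop anything inside a `<form>` and to close paragraphs that were left open. The `ArticleCleaner` class and `clean_article` function are hypothetical names, not part of the real script, and the sketch only closes a paragraph when the next `<p>` starts or the input ends.

```python
from html import escape
from html.parser import HTMLParser


class ArticleCleaner(HTMLParser):
    """Drops <form> blocks and closes any <p> left open (illustrative only)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.form_depth = 0     # greater than 0 while inside a <form>
        self.p_open = False

    @staticmethod
    def _attrs(attrs):
        return "".join(f' {k}="{escape(v or "", quote=True)}"' for k, v in attrs)

    def close_paragraph(self):
        if self.p_open:
            self.out.append("</p>")
            self.p_open = False

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.form_depth += 1
            return
        if self.form_depth:
            return              # skip everything nested in a form
        if tag == "p":
            self.close_paragraph()   # a new paragraph ends the previous one
            self.p_open = True
        self.out.append(f"<{tag}{self._attrs(attrs)}>")

    def handle_startendtag(self, tag, attrs):
        if not self.form_depth:
            self.out.append(f"<{tag}{self._attrs(attrs)} />")

    def handle_endtag(self, tag):
        if tag == "form":
            self.form_depth = max(0, self.form_depth - 1)
        elif self.form_depth:
            return
        elif tag == "p":
            self.close_paragraph()
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.form_depth:
            self.out.append(escape(data, quote=False))


def clean_article(html):
    cleaner = ArticleCleaner()
    cleaner.feed(html)
    cleaner.close()             # flush any buffered trailing text
    cleaner.close_paragraph()   # close the final paragraph if needed
    return "".join(cleaner.out)
```

A real BBC page would need more care (block elements such as tables also end a paragraph implicitly), but the same event-driven approach extends to those cases.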
It’s been quite interesting now that I’ve released this. I added a log to the script to let me see in more detail what is being fetched and when (and there are quite a large number of people using it). I didn’t realise, for instance, that the user agent string sent with a Bloglines request gives the number of subscribers it is catering for, which is quite useful. Google Reader however doesn’t appear to do this, which is a pity but of no real significance at this stage.
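Pulling that subscriber figure out of a log line only needs a small regular expression. The Python sketch below assumes the count appears as a number followed by the word "subscribers" somewhere in the user agent; the exact Bloglines format is not given above, so the sample string in the usage note is illustrative rather than a real log entry, and `subscriber_count` is a hypothetical helper name.

```python
import re

# Assumed pattern: a number followed by "subscriber" or "subscribers".
SUBSCRIBERS_RE = re.compile(r"(\d+)\s+subscribers?", re.IGNORECASE)


def subscriber_count(user_agent):
    """Return the subscriber count advertised in a user agent, or None."""
    match = SUBSCRIBERS_RE.search(user_agent)
    return int(match.group(1)) if match else None
```

For example, `subscriber_count("Bloglines/3.1 (http://www.bloglines.com; 25 subscribers)")` gives `25`, while a user agent with no such figure gives `None`.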
Tags: BBC, RSS, Bloglines, Google Reader
What?
Basically I wanted full article text and images from the BBC news RSS feeds and so I’ve built what is at the moment a fairly simple reparser to scrape the rest of the content and include it in the feed.
I’ve built it as part of another project, but also so that when I’m getting the train into work I can read the full stories from the BBC RSS feeds, and not just the first line or so, without forking out for a mobile data plan. I use an O2 XDA MiniS, so essentially in the morning the Egress feed client I use updates over my home wifi before I leave. I am planning on testing on other devices but haven’t had an opportunity thus far.
How?
It’s not that complicated a script: essentially it reads the requested RSS feed, scrapes the target link for each item in the channel and pumps the feed back out with the full text included. You can also choose not to include images if you have a device with limited storage. The XML generated on its own is around 110Kb depending on the feed, and of course the images will increase the total download size quite considerably if you wish to do as I do and cache it all to your mobile device.
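The read, scrape and re-emit loop described above can be sketched in a few lines of Python with the standard `xml.etree.ElementTree` module. This is an illustration under assumptions rather than the script itself: `build_full_feed` is a hypothetical name, and `fetch_article` stands in for whatever scraping routine turns an article link into its full text.

```python
import xml.etree.ElementTree as ET


def build_full_feed(rss_xml, fetch_article):
    """Replace each item's description with the scraped article text.

    fetch_article is a hypothetical callable: link URL -> HTML string.
    """
    root = ET.fromstring(rss_xml)
    for item in root.iter("item"):
        link = item.findtext("link")
        if not link:
            continue                    # nothing to scrape for this item
        description = item.find("description")
        if description is None:
            description = ET.SubElement(item, "description")
        # ElementTree escapes the markup when the feed is serialised.
        description.text = fetch_article(link.strip())
    return ET.tostring(root, encoding="unicode")
```

Passing the scraper in as a function keeps the feed handling separate from the scraping, so an option such as leaving out images could live entirely inside `fetch_article`.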
At the moment it caches the original RSS feeds for an hour but doesn’t cache the scraped content; this is something I still need to work on. An optional item limit might be useful as well. It would be easy to implement but needs a spare moment or so, which I still need to find!
Using it…
The link below will allow you to build your own feed based on a BBC News RSS feed. I’ve tested the available feeds and believe that so far they are producing satisfactory, valid output according to the W3C feed validator.
I’ve tested so far in the Egress, Bloglines and Opera readers. It is still a little messy in its implementation, so please remember it’s still a work in progress!
Update! 29th Nov
- The script is being hit a great deal more than I expected, indicating that a) I need to optimise it a little more for efficiency/speed and b) there is a demand for full feeds (no surprises there!)
- I’ll be updating the script over the next 24 hours to include caching of the article texts. This will a) increase speed, a lot! and b) enable me to do a more comprehensive filter of the tags and article contents, to remove forms and clean up the rather dirty markup which ends up in the RSS>Item>Description part of the feed. I’ll have to think a bit more about how long to cache this for, but it should be done by early Thursday morning.
Tags: BBC, RSS, Egress, Opera browser, Bloglines
Something I’ve only just discovered (another excellent find from Ian’s blog): it seems the government have made another move to try and stay in step with technology by launching (albeit in beta) an online petitions service to complement the many received through the door. It follows the trend, which I thought was gone by now, of prefixing internet ‘stuff!’ with an ‘e’.
The skeptic in me believes these will be largely ignored, especially considering the lack of technical knowledge on the part of the PM. It does, though, make the whole system a bit easier and more environmentally friendly by getting away from paper! Oh, and there’s an RSS feed as well, which is useful.
Comments have now been turned off on this archive post as an added spam measure to Akismet
Reporters Without Borders have released a list of the top 13 “enemies of the internet”, a list of countries which aims to draw attention to those who suppress freedom of speech and expression on the internet.

Unsurprisingly, China and North Korea are on the list, along with various others including Egypt, who allegedly arrested three bloggers in June. It’s not that I don’t trust RSF, it’s just that I can’t find any major references anywhere else! This isn’t by any means the first time RSF have released the list, but it is the first time they have included an online petition of sorts, which also contains an attack on Yahoo for their involvement in censorship in China. It is quite surprising they didn’t go after Google in the same manner during this online protest, after Google restricted searches within the communist state.
Freedom of speech on the internet seems to have become a particularly big topic in recent months. Amnesty International launched the Irrepressible.info campaign a while ago, which shares many of the same ideals as the RSF campaign but in a broader sense. The UN Internet Bill of Rights workshop, at which the Amnesty campaign was aimed, took place in Athens recently and, reading from the workshop’s wiki, it aims to address:
- What rights are fundamental to freedom in a digital world?
- What obligations are necessary to create a digital society based on rule of law and civil liberty?
- Who are the appropriate stake-holders in making policy determinations for the Internet and what is the role of government?
- How to negotiate between conflicting values in setting policy for the online environment?
- What special challenges and opportunities does the Internet provide in the quest for life, liberty, and the pursuit of happiness?
It will be very interesting to see what the final document contains; the wiki makes for some interesting reading on what has already been done. Notably this was a topic raised in France as early as 1998 as an issue for discussion, and brought up by others in various forms earlier still. The United Nations is a large organisation with many of the world’s major powers heavily involved in it. Much as happened to the League of Nations, the UN’s predecessor, the influence of the UN has arguably diminished in recent years, most prominently through the coalition invasion of Iraq. Is the publication of an Internet Bill of Rights going to have any noticeable impact on the internet’s standing within individual world states? As the prime example: both online and diplomatic pressure have not yielded any major results with China over large scale censorship of the internet, and companies such as Google, who by their motto should know better, have been keen to boost their share prices through cooperation with China on this point. There is also the issue of where to draw the line: where does the bill fit in and how much governance should it contain?
We all have very different ideas on what should be allowed on the internet. In the UK recently Channel 4 aired ‘Dispatches Debate – Muslims and Free Speech’. This has been described as “sensationalist” in some places and others have gone further. I would agree with the points raised that the audience selected did not reflect the ethnic make-up of the average British population, so creating a potential bias in the audience poll results. Anyway, I digress. The key point I saw throughout was that different people had, as can be expected, different views on what is and what isn’t offensive. Jon Snow, presenting, summed up at the end by saying “The freedom not to be offended should be enjoyed by all”. As was the case here, the same is true for the internet: who gets to decide what is and what isn’t offensive, what is and what isn’t permitted on the internet? Whilst any decent person will say that child pornography, for example, should not be permitted, there are not many other examples of where a clear line can be drawn. China for one will of course not share the same views held by western democratic nations.
If we cannot find common international ground on what is and what isn’t a person’s right offline, how can we find it online without dividing up the internet?
I’ve had TIOTI on my feed list for a while now. From the day the original concept was published and rumours went around the blog world, it took a little while before it came to fruition (unsurprisingly). I’ve been playing around with the beta for a while now, although not as much as I expected to before it actually came out (if that makes any sense!). The concept is fantastic: a sort of socially charged, P2P breathing Radio Times without the schedule bit is about the best way of describing it. Being able to see what your friends have and haven’t watched potentially saves a lot of plot spoiling, for one thing! I’m not going to bother covering the whole thing here; Ian Forrester has written his thoughts on it and TechCrunch also did a piece on it, which between them have covered most of it.
I agree very much with what Ian has said about it, in that the AJAX perhaps goes a bit too far, and this is indeed my biggest gripe too, although perhaps in a different way. As a web application I don’t believe it fulfills its potential in the same way it might as a full desktop application. I’ve used it in a couple of ways recently to try and get the best out of it: I’ve used Opera to browse it and manage BitTorrent downloads and then VLC to play, and I’ve also tried using the Democracy player. Democracy works very well in a lot of ways, as it has the media player, BitTorrent support and RSS support built in, and so works quite well as the desktop interface. For me the best mix would be what Democracy does plus the social, planning and suggestion functions which TIOTI provides so well. It’s the little things which make it, in a lot of ways, more of a hassle to use; tagging shows which I’ve watched is something a desktop version could do automatically, for example. Perhaps a plugin for Democracy would make the most of it?
On another note, it’s going to be interesting to see where the founders of Skype go with the Venice Project. I’m signed up for the beta now, so hopefully this will be opened up a little more soon. From what is being described at the moment this looks to be a very interesting project, although I am skeptical about the full screen high quality video streams they are promising! Interestingly, viral is mentioned as a method of spreading the word, but there’s no RSS feed, just a very 90’s email newsletter. Fortunately mailbucket.org does a good job of fixing this!
It’s one of those things I think we’ve all seen and experienced at some point: the street preachers found increasingly on high streets in busy towns, proclaiming the value of their chosen religious stance. Walking through Bromley high street this afternoon there was a group of Christians doing exactly this. Personally I don’t mind it that much, though of course many others object to it. As I was walking past I overheard a fellow passerby commenting to his partner something along the lines of “I wish they wouldn’t do that”. Of course the actual phrasing was a little more crude, but that can be ignored. It got me thinking, and the first thing I thought was yes, I agree, I wish they wouldn’t do it either. Certainly my views on religion have a lot in common with Ben Metcalfe’s statement a little while ago on the subject, and in some ways I might go a little further down this line.
‘it’s always been the number one cause for segregation and conflict in, and is used ultimately as a control/influence mechanism for society’
At the same time I’m thankful that they are doing it and are able to do it. I think something we take for granted is the remarkable level of freedom of speech we have in this country compared to what many others have to live under. That said, I suppose there comes a point when it becomes a little over the top, as in the case of Philip Howard. All the same, I find it amusing that he simply moved down the road to Piccadilly Circus to circumvent the ASBO placed on him, another shining example of ineffective and ill thought out government legislation in action.
Archive Post; Comments and trackbacks disabled to help counter spam, please contact me directly.
ALIPR stands for Automatic Linguistic Indexing of Pictures (Real Time). It’s been developed by Penn State University professors and makes it possible to automatically tag images with keywords, rather than having a person manually label the photos. This could be quite an interesting one for sites like Flickr and Zooomr, whose users use tags extensively to categorise their pictures. The vocabulary is fairly limited at the moment and it doesn’t always get it right, but if you try uploading a picture it doesn’t do a bad job; then again, with such a limited vocabulary it has a pretty good chance of getting a few tags correct. You’ll notice when uploading that it says at the top…
‘ALIPR is like a child trying to learn about the world. Please help us to teach ALIPR’
…so hopefully it’ll get better although I’m not sure how its learning algorithms work as it doesn’t currently add manually suggested tags to the word list.
Worth having a look at anyway if you’re into the whole social tagging thing, which is becoming increasingly popular at the moment.
http://www.alipr.com/
Surprisingly enough I picked this one up off Slashdot. Occasionally something interesting gets posted on there, but the comments should be largely ignored!