Notes From a Small Internet

Beautiful Soup is a great little Python module that will read just about any HTML page and give you back a structured parse tree. It’s awesome because you can pass it just about any mangled markup; I’ve never known it to choke on anything. For some web service consumers I’ve had to write over the years, Beautiful Soup has saved me many, many hours of slogging through crappy HTML parsing. Great software deserves appreciation.
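To give a flavour of what parsing mangled markup means, here is a minimal sketch. It does not use Beautiful Soup itself (which is a third-party install); it swaps in the standard library's `html.parser` as a stand-in tolerant parser, and the `LinkCollector` class and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, however broken the markup around them."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


# Hypothetical input: unclosed <p> and <a> tags, plus stray closing tags.
mangled = '<html><body><p>First<p>Second <a href="/one">one<a href="/two">two</b></div>'

collector = LinkCollector()
collector.feed(mangled)
collector.close()
links = collector.links  # both links are found despite the broken nesting
```

Beautiful Soup goes a good deal further than this, building a navigable tree rather than just firing events, which is exactly why it saves so much time.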

Whilst browsing my good friend Rachel’s website I happened to notice that her brother Leonard wrote Beautiful Soup. He also wrote RESTful Web Services, which is part of my (recently pruned) dead tree collection, and which I’d heartily recommend to anyone who has to work with REST web services. The Django examples were especially useful!

Google’s AppEngine Beat Me To It

Recently I’ve been putting some time into writing a database adapter for Django that uses Amazon’s S3 and SimpleDB services as a storage layer, whilst trying to retain as much of Django’s QuerySet functional layer as possible. The general goal is to provide a storage back-end for Django that isn’t dependent on the traditional vertically-scaling database server, but can scale horizontally in the same way as the EC2 computing cloud does. My eventual goal is the ability to deploy Django in the cloud with no external dependencies. Just throw out a Django machine image, deploy your app’s code and config, and you have a scaling solution that takes minutes rather than days or weeks.

It’s a non-trivial exercise that is both stimulating and frustrating in equal measure, and progress has been steady, if not exactly rapid. It’s worth it to me though, as the ability to roll out scaling infrastructure is dramatically hampered by the database layer.

Imagine my delight then to find that Google have launched AppEngine, their own cloud-based web application system. It’s Python without any messy machine-based libraries, it uses WSGI so you can run pretty much any Python web app, and it comes with GFS for distributed file storage and BigTable as a data persistence layer. Google even throws in Django 0.96.1 with instructions on how to use their storage layers by doing away with Django’s own model (more on this later).

There’s a lot of whining about how Google’s solution cripples Python (which is crazy when you look at how trivial it is to refactor code to use Google’s supplied alternatives), and locks you into their solution. I suspect that this is mostly from people who have never even contemplated building an application that needs to really scale, and are therefore still thinking in terms of services provided by the underlying OS. That’s a big problem for scaling, because disk, IO, threads, sockets, etc. are finite resources that are hardware-bound. Abstracting access to these things is tough. Most scaling solutions these days are about providing multiple hardware instances, but unfortunately that only solves the hardware problem. Building an app that scales transparently over multiple hardware instances is a huge challenge in comparison to procuring more servers.

Google’s approach is to do away with the concept of hardware entirely. That means a change of mindset towards every request being an atomic operation. Persistence occurs (correctly) in your persistence layer and not in transient storage available to an instance of your application. Google have provided extensive Python libraries and API calls to enable applications to take advantage of this, but it seems that a fairly vocal group aren’t interested unless their applications work on AppEngine without any additional effort. Considering the paradigm shift that AppEngine represents (from machine-centric programming to distributed programming) it’s not unreasonable to expect some small effort to be required. Especially when you take into account that AppEngine is currently in a very limited trial phase.
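A rough sketch of that mindset shift, in plain Python. None of this is the AppEngine API: `DictStore`, the two counter classes and `serve` are hypothetical names, and the in-memory dict merely stands in for a shared persistence layer.

```python
class DictStore:
    """Stand-in for a shared persistence layer (hypothetical, not a real datastore)."""

    def __init__(self):
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def put(self, key, value):
        self._data[key] = value


class StatefulCounter:
    """Machine-centric habit: the hit count lives in this instance's memory."""

    def __init__(self):
        self.hits = 0

    def handle(self, path):
        self.hits += 1
        return self.hits


class StatelessCounter:
    """Each request reads and writes the shared store and keeps nothing locally."""

    def __init__(self, store):
        self.store = store

    def handle(self, path):
        # A real distributed store would need a transaction around this
        # read-modify-write; the sketch ignores concurrency.
        hits = self.store.get(path, 0) + 1
        self.store.put(path, hits)
        return hits


def serve(instances, paths):
    """Spread requests across application instances in simple rotation."""
    return [instances[i % len(instances)].handle(p) for i, p in enumerate(paths)]


# Two instances each, four requests for the same page.
stateful = serve([StatefulCounter(), StatefulCounter()], ["/"] * 4)

store = DictStore()
stateless = serve([StatelessCounter(store), StatelessCounter(store)], ["/"] * 4)
```

The stateful pair disagree about the count as soon as there is more than one instance, whereas the stateless pair give the same answer however many instances serve the requests.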

I’m extremely optimistic that Google’s approach will work well for a number of reasons. As an application programmer I spend huge amounts of time working around hardware and platform limitations that I should be spending on core functional areas. If Google can provide a solution that means I never have to worry about specific hardware problems ever again, I doubt I’ll look back.

iPlayer Flash — Embedded (with a little work)

The BBC has just released Flash streaming for iPlayer’s 7 Day Catch-Up feature. This is just about the best news to come out of the fiasco that is iPlayer, and for those of us unwilling or unable to install the bandwidth-destroying Kontiki client it’s the only way we can get our on-demand BBC programming.

Unfortunately, the BBC only makes the streams available on their site. I checked the iPlayer terms and conditions, and there’s nothing stopping remote embedding; it’s just a technical hurdle.

Fortunately, the internet eats technical hurdles for breakfast.

So, in the half hour before I go out to the pub, I’ve knocked together a little WordPress plugin that will take an iPlayer URI, and embed the streaming video into your page for you.

Like so:

Streaming full-screen Doctor Who on my website. Excuse me whilst I geek-out a moment, this is the coolest thing I’ve played with all week!

I’ll tidy up the code and release it to the community as soon as I get back from a night of drinking and dancing!

The Power of Python

I’m somewhat of a fan of Python, using it as I do for almost everything I do in programming, so it’s great to see other people appreciating the language.

Randall Munroe, however, might be getting a little… carried away:

import antigravity

Perl on Rails — Why the BBC Fails at the Internet

Perl on Rails is a project by the smart chaps over in BBC Audio and Music Interactive that replicates the Ruby On Rails MVC framework in Perl. They’re obviously rather proud of themselves, and I understand that internally the project is making waves. Whilst I applaud the technical achievement of the individual developers, I deplore the situation that has forced them to do this.

The problem is that the BBC doesn’t control its own technical infrastructure. In an act of staggering short-sightedness it was outsourced to Siemens as part of a much wider divesting of the BBC Technology unit. In typical fashion for the BBC, they managed to select a technology supplier without internet operations experience. We can only assume that this must have seemed like an acceptable risk to the towering intellects running the BBC at the time. Certainly the staff at ground level knew what this meant, and resigned en masse.

Several years later this puts the BBC in the unenviable situation of having an incumbent technology supplier which takes a least-possible-effort approach to running the BBC’s internet services. In my time at the BBC, critical operational tasks were known to take days or even weeks despite a contractual service level promising four-hour response times. Actual code changes for deploying new applications were known to take months. An upgrade to provide less than a dozen Linux boxes for additional server capacity (a project that was over a year old when I joined the BBC) was still being debated by Siemens when I left, eighteen months later.

The BBC’s infrastructure is shockingly outdated, having changed only by fractions over the past decade. Over-priced Sun Enterprise servers running Solaris and Apache provide the front-end layer. This is round-robin load balanced, with no management of session state and no load-based connection pool. The front-end servers proxy to the application layer, which is a handful of Solaris machines running Perl 5.6, a version that was superseded by the release of Perl 5.8 over five and a half years ago. Part of the reason for this is the bizarre insistence that any native modules or anything that can call code of any kind must be removed from the standard libraries and replaced with a neutered version of that library by a Siemens engineer.
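For anyone unsure why plain round-robin is a weak choice, here is a small illustrative sketch of the difference between rotating blindly and picking by load. The backend names and connection counts are invented, and this says nothing about how the BBC's balancers were actually configured.

```python
from itertools import cycle


def round_robin(backends):
    """Hand out backends in fixed rotation, ignoring how busy each one is."""
    return cycle(backends)


def least_loaded(active):
    """Pick the backend with the fewest active connections."""
    return min(active, key=active.get)


backends = ["app1", "app2", "app3"]  # hypothetical application servers

# Round-robin: the order never changes, whatever the load looks like.
rotation = round_robin(backends)
rr_picks = [next(rotation) for _ in range(5)]

# Load-based: track active connections and always choose the quietest box.
active = {"app1": 5, "app2": 3, "app3": 4}
ll_picks = []
for _ in range(3):
    choice = least_loaded(active)
    active[choice] += 1  # the new request now counts against that backend
    ll_picks.append(choice)
```

With round-robin a struggling server keeps receiving its full share of traffic; the load-based version steers new requests away from it until the others catch up.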

Yes, that’s right, Siemens forks Perl to remove features that their engineers don’t like.

This means that developers working at the BBC might not be able to code against documented features or interfaces because Siemens can, at their sole discretion, remove or change code in the standard libraries of the sole programming language in use. It also means that patches to the language, and widely available modules from CPAN may be several major versions out of date — if they are available at all. The recent deployment of Template Toolkit to the BBC servers is one such example — Siemens took years and objected to this constantly, and when finally they assented to provide the single most popular template language for Perl, they removed all code execution functions from the language.

So talented, underpaid, and frustrated software engineers at the BBC are forced to make a decision. Either they can produce websites using static HTML, and make a few remote calls to limited Perl functions, decorating their page with SSIs, or they can fight against a reluctant and incompetent technology supplier to make use of a crippled and outdated language on servers that more than likely are unable to meet the capacity requirements of a dynamic application being used by the BBC’s audience. Software engineers at the BBC must become masters of the sleight-of-hand, using every smoke and mirrors tactic they can to conjure the appearance of dynamic websites. That is not exactly what you would expect from one of the largest media corporations in the world. Oh, and if you’re an external agency working for the BBC and hoping to write a new application or build on technologies that the rest of the world has taken for granted for the best part of a decade, you might as well forget it. There’s only one externally available development server, and it’s not in synchronisation with the live environments.

It doesn’t have to be this way. If, instead of forcing its teams to waste valuable license fee payers’ money on duplicating existing free software, the BBC decided to take control of its technical infrastructure and provide a viable platform for complex, dynamic applications, then that creativity, effort, and time could be directed at making more of the kind of applications that make the BBC great.

Some work is already progressing in this direction. A large part of the BBC’s Creative Futures project is what the BBC calls “BBC 2.0” (often mistakenly referred to by executives and television-types as “Web 2.0”). The last I heard this was planning to deploy an architecture based around Java, Tomcat, Hibernate, Velocity, and MySQL. Whilst I disagree with the choice of technology for many reasons, this is at least an important step in the right direction for the BBC, as long as they exert control over the infrastructure from end to end.

It’s a ridiculous situation, and I know that many talented and respected technical staff have left the BBC in the past few years citing frustration at the insufficient technical infrastructure, and the inability of both Siemens and BBC management to keep up with the pace of technological change. Unfortunately, unless something dramatic changes with the upper levels of BBC management to recognise the nature of the problem, it’s a situation that will remain the status quo for a long time to come.

Seam Carving for Content-Aware Image Resizing

This video presentation from SIGGRAPH 2007 has been popping up all over the internets the last couple of days, and the implications are truly astonishing. Algorithmically this is a remarkably simple technique, and easily implemented in real-time. It should be pretty straightforward to write an implementation in ActionScript 3 (for Flash 9) or in IronPython (for Silverlight) and have this apply to images in webpages with a minimum of effort.

More exciting than simply resizing images is the weighting and treatment that can be applied manually to specific regions of an image. It’s easy to imagine a myriad of gaming opportunities that arise if you can hide data selectively in images across the web through this technique.
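To show just how simple the core of it is, here is a minimal pure-Python sketch of finding and removing one vertical seam with dynamic programming. It assumes the per-pixel energy grid is already computed (a real implementation would derive it from image gradients), and the function names, the sample grid and the `protect` weights are all mine, for illustration only.

```python
def find_vertical_seam(energy):
    """Return one column index per row tracing the lowest-energy top-to-bottom path."""
    rows, cols = len(energy), len(energy[0])
    cost = [row[:] for row in energy]  # cumulative minimum energy per cell
    for r in range(1, rows):
        for c in range(cols):
            lo, hi = max(c - 1, 0), min(c + 1, cols - 1)
            cost[r][c] += min(cost[r - 1][lo:hi + 1])

    # Start from the cheapest cell in the bottom row and walk back up,
    # moving at most one column left or right per row.
    c = min(range(cols), key=lambda j: cost[-1][j])
    seam = [c]
    for r in range(rows - 1, 0, -1):
        lo, hi = max(c - 1, 0), min(c + 1, cols - 1)
        c = min(range(lo, hi + 1), key=lambda j: cost[r - 1][j])
        seam.append(c)
    seam.reverse()
    return seam


def remove_seam(grid, seam):
    """Drop the seam's cell from each row, narrowing the grid by one column."""
    return [row[:c] + row[c + 1:] for row, c in zip(grid, seam)]


def add_weights(energy, weights):
    """Raise the energy of hand-picked regions so seams steer around them."""
    return [[e + w for e, w in zip(er, wr)] for er, wr in zip(energy, weights)]


# Made-up 3x4 energy grid; low values are "boring" pixels.
energy = [
    [5, 1, 9, 9],
    [9, 1, 9, 9],
    [9, 9, 1, 9],
]
seam = find_vertical_seam(energy)
narrower = remove_seam(energy, seam)

# Manually protect the pixel at row 1, column 1 with a large weight.
protect = [[0, 0, 0, 0], [0, 100, 0, 0], [0, 0, 0, 0]]
protected_seam = find_vertical_seam(add_weights(energy, protect))
```

Repeating the find-and-remove step shrinks the image one column at a time, and the weighted variant is where the manual treatment of specific regions comes in.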

Dr. Shamir’s other research work is pretty interesting as well, covering as he does:

  • Mesh Partitioning
  • Skeleton Based Representations
  • Multi-Resolution Models
  • Object Feature-Space Analysis
  • Digital Typography
  • Visual Succinct Representation of Information