Scaling the BBC iPlayer to handle demand
One of the key goals we set ourselves when we developed the new iPlayer was that it would have to be fast to use. We understand that any delay in getting you to the video is frustrating, as the site is just a jumping-off point into TV and Radio content.
But how do we make things fast? Displaying a web page in the browser involves many steps; some we can control, some we can't. We can't control the time the request and response spend travelling over the network, but we can control how long the pages take to generate and how large they are. We also have a degree of control over how long those pages take to render in your browser.
We had our work cut out for us on the new version of iPlayer.
Personalised websites require much more processing power and data storage
The current site uses one back-end service that we pull data from to build the pages. The new site uses many more, and we both post data to and pull data from them.
This means that every returning user gets a different homepage. There's already a small amount of difference between each homepage on our current site (your recently played) but the new site is driven much more by your favourites, recommendations and friends; they're key parts of the experience and they have to be fast.
We started developing in PHP
The BBC is standardising on PHP as its web tier development tool. Our current site is developed using Perl and Server Side Includes, and it's something that's well understood, but our new web tier framework (based on Zend Framework) means that teams can share components and modules. In fact, the team responsible for the social networking functionality develop modules that anyone within the BBC can integrate into their site easily.
This does come at a cost though: the usage of a framework sometimes introduces delay in generating a page as it needs to get hold of resources to do so. In some cases this is necessary, especially if there's an element of personalisation, but in others our web tier is just repeating the same tasks.
All this against a growing demand
The site will have to support a massive number of page views and users every day: on average 8 million page views a day from 1.3 million users. Previous versions of the site were able to grow into this demand; we'll have to hit the ground running from day one.
This graph shows our growth over the last year in terms of monthly page views.
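To put the daily figures quoted above into per-second terms, here is a rough back-of-envelope calculation. It assumes traffic is spread evenly across the day, which real traffic is not, so peak rates would be higher than the average shown.

```python
# Back-of-envelope conversion of the daily figures quoted in the article.
# Assumption: page views are spread evenly over 24 hours (real traffic is uneven).
PAGE_VIEWS_PER_DAY = 8_000_000
USERS_PER_DAY = 1_300_000
SECONDS_PER_DAY = 24 * 60 * 60

average_views_per_second = PAGE_VIEWS_PER_DAY / SECONDS_PER_DAY
average_views_per_user = PAGE_VIEWS_PER_DAY / USERS_PER_DAY

print(f"{average_views_per_second:.1f} page views per second on average")
print(f"{average_views_per_user:.1f} page views per user per day")
```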
So how do we do this?
One of the first things we can do is optimise the time it takes to generate the page.
Although changing architectures can be risky, we were confident that the one we moved to would enable us to meet all the challenges. At the heart of page generation is a PHP and customised Zend-based layer called PAL. This system then needs to integrate with our login system, BBC iD, our programme metadata system (Dynamite), our social networking systems, a Key Value data store and a few others. The homepage alone for a logged-in user with friends requires 15 calls across these services. Even if each of those calls takes a few milliseconds, we can spend a second or two just collecting the information required, which would push us well out of our 2.5s target.
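The budget problem described above can be sketched with simple arithmetic. The 15 calls and the 2.5s target come from the article; the per-call latency of 100ms is an assumed figure chosen purely for illustration, and the sketch assumes the calls are made one after another.

```python
# Hypothetical latency budget for a page that needs several back-end calls.
# 15 calls and the 2.5 s target are from the article; 0.1 s per call is assumed.
TARGET_SECONDS = 2.5
CALLS_PER_HOMEPAGE = 15
ASSUMED_CALL_SECONDS = 0.1


def sequential_cost(calls: int, seconds_per_call: float) -> float:
    """Total time spent waiting if the calls are made one after another."""
    return calls * seconds_per_call


spent = sequential_cost(CALLS_PER_HOMEPAGE, ASSUMED_CALL_SECONDS)
remaining = TARGET_SECONDS - spent
print(f"Waiting on services: {spent:.2f}s, left for everything else: {remaining:.2f}s")
```

Under these assumed numbers, more than half of the target is gone before any HTML is generated or sent, which is why avoiding repeated calls matters so much.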
We proved our architecture before we built it
At the start of re-architecting iPlayer, we did what we could to eliminate guesswork. We developed a number of architectures based on our requirements, and then built prototypes of three of them; all built to serve the homepage, which we then tested against some basic volumetrics. This gave us plenty of data about how many requests we could serve a second and CPU loads, which we could then weigh up against other softer factors, like how our dev team could work with it.
We actually ended up going for the one which offered us a good balance between these factors, as this enabled us to be the most flexible in building pages, rather than constraining what we could do with the site just to squeeze the extra speed out.
We cache a lot
Caching means storing a copy of the data in memory so subsequent requests for that data don't have to do the expensive things such as database queries.
It also allows us to get around any delays introduced by our framework starting up, as there's no such delay when delivering from cache.
Caching has its problems though. The data may have changed in the underlying system (programmes become available to play for example) but the change won't be reflected in our cache. This means we can only cache for seconds or minutes, but with the millions of page views we get, it can still make a crucial difference.
- Data caching We cache the data returned from the services. We use for this. Sometimes we share data between pages.
- HTML caching We also cache the resulting HTML for a short time. When you're hitting a page, it's highly likely you're just seeing the cached page. We use Varnish for this. Caching in this way is nothing new, but Varnish has a few tricks up its sleeve that we use which I'll explain later.
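As an illustration of the short-lived caching described above, here is a minimal in-memory sketch. It is not the caching software used on the site; the class name, the 30-second lifetime and the fake service call are all assumptions made for the example. It also shows the staleness trade-off: within the lifetime, a change in the underlying data is not seen.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Keeps values for a fixed number of seconds, then fetches them again."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str, fetch: Callable[[], Any]) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: skip the expensive call
        value = fetch()  # expired or missing: go back to the service
        self._entries[key] = (now + self._ttl, value)
        return value


# Usage with a fake clock and a fake service call (both hypothetical).
fake_now = [0.0]
service_calls = []


def fetch_schedule() -> dict:
    service_calls.append("call")
    # Pretend the programme becomes available by the time of the second call.
    return {"programme": "Afternoon Play", "available": len(service_calls) > 1}


cache = TTLCache(ttl_seconds=30, clock=lambda: fake_now[0])
first = cache.get("schedule", fetch_schedule)
second = cache.get("schedule", fetch_schedule)  # served from cache, may be stale
fake_now[0] = 31.0
third = cache.get("schedule", fetch_schedule)  # lifetime passed, fetched again
```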
We broke the page into personalised and standard components
If you look at our homepage, many of those components are the same for everyone, but some are just for you. With traditional whole-page caching in some reverse HTML caches, it's not possible to treat the two differently, so we break the page up. The main build of the page is cached; then when the page loads we use XHR (Ajax) to load in the personalised components. Varnish gives us the ability to control the caching at a low level like this. Every time we generate a page or a fragment, we can tell Varnish how long we want to cache it for. The main bulk of the homepage doesn't need caching for long to get some benefit, but your favourites we can cache for longer (although still only for a few minutes), and we know when you add a new favourite so we can clear out the cache and replace it with the new content. This means as you browse the site, the page loads quicker and your experience is smoother.
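The combination of per-fragment lifetimes and clearing the cache when you add a favourite can be sketched as follows. This is an in-memory stand-in, not Varnish configuration or any Varnish API; the fragment names, the lifetimes of 30 and 180 seconds, and the `add_favourite` helper are assumptions for illustration.

```python
from typing import Dict, Optional, Tuple

# Assumed lifetimes in seconds. The article only says the main page is cached
# briefly and favourites for "a few minutes".
FRAGMENT_TTL = {"homepage_main": 30, "favourites": 180}


class FragmentCache:
    """In-memory stand-in for an HTML cache with per-fragment lifetimes."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[str, str], Tuple[float, str]] = {}

    def put(self, fragment: str, user: str, html: str, now: float) -> None:
        self._store[(fragment, user)] = (now + FRAGMENT_TTL[fragment], html)

    def get(self, fragment: str, user: str, now: float) -> Optional[str]:
        entry = self._store.get((fragment, user))
        if entry is None or entry[0] <= now:
            return None  # missing or expired: the caller must regenerate it
        return entry[1]

    def purge(self, fragment: str, user: str) -> None:
        self._store.pop((fragment, user), None)


def add_favourite(cache: FragmentCache, user: str) -> None:
    # Action-based invalidation: the cached favourites fragment is now stale.
    cache.purge("favourites", user)


cache = FragmentCache()
cache.put("homepage_main", "everyone", "<div>shared</div>", now=0)
cache.put("favourites", "alice", "<ul><li>Old list</li></ul>", now=0)

shared_at_60 = cache.get("homepage_main", "everyone", now=60)  # expired after 30s
favourites_at_60 = cache.get("favourites", "alice", now=60)  # still cached
add_favourite(cache, "alice")
favourites_after_add = cache.get("favourites", "alice", now=61)  # purged
```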
We use loads of servers
After we've optimised all we can using a single server, we then scale horizontally using multiple servers joined together in a pool. None of our web servers store any state about who you are and what you're doing, so your request can go to any server at any time.
We also serve pages out of two locations (or data centres). This gives us a higher degree of resilience to failures; we can lose an entire data centre and still be able to serve the site.
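Because no web server holds state about you, routing can be as simple as handing each request to the next server in the pool. The sketch below uses hypothetical server names and a plain round-robin policy (the article does not say which balancing policy is used) to show that the pool keeps serving when one data centre is removed.

```python
from itertools import cycle
from typing import Dict, Iterator, List, Set

# Hypothetical server names grouped by data centre.
POOL: Dict[str, List[str]] = {
    "dc1": ["dc1-web1", "dc1-web2"],
    "dc2": ["dc2-web1", "dc2-web2"],
}


def healthy_servers(pool: Dict[str, List[str]], down: Set[str]) -> List[str]:
    """All servers whose data centre is not in the 'down' set."""
    return [s for dc, servers in pool.items() if dc not in down for s in servers]


def round_robin(servers: List[str]) -> Iterator[str]:
    # No session state lives on a server, so any server can take any request.
    return cycle(servers)


normal = round_robin(healthy_servers(POOL, set()))
normal_route = [next(normal) for _ in range(5)]

degraded = round_robin(healthy_servers(POOL, {"dc1"}))
degraded_route = [next(degraded) for _ in range(3)]
```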
We load tested the site before we launched
We're able to track how the site is used, so this gives us the ability to produce detailed volumetrics of how we think the new site is going to be used. Some of it is estimation, but it's always backed up with data. We can then produce detailed load tests, so we can simulate usage of the site. This enables us to find and resolve any problems we may experience under load, before we go live.
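One way to turn volumetrics into a load test is to generate requests in the same proportions as expected usage. The page types and percentages below are invented for the example, not real iPlayer figures, and the function name is hypothetical; the point is only that the simulated mix follows the estimated mix.

```python
import random
from collections import Counter

# Hypothetical volumetrics: estimated share of page views by page type.
PAGE_MIX = {"homepage": 0.40, "episode": 0.35, "search": 0.15, "favourites": 0.10}


def build_request_plan(total_requests: int, seed: int = 1) -> Counter:
    """Pick a page type for each simulated request, weighted by the page mix."""
    rng = random.Random(seed)  # seeded so that a test run is repeatable
    pages = rng.choices(list(PAGE_MIX), weights=list(PAGE_MIX.values()), k=total_requests)
    return Counter(pages)


plan = build_request_plan(10_000)
for page, count in plan.most_common():
    print(f"{page}: {count} requests")
```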
The end result
We're not 100% there yet (this is a beta after all) but from this sample 24 hours of monitoring data you can see that, apart from a couple of spikes, we're doing well at keeping to our target of 2.5 seconds. (We were also able to track down the spikes to some misbehaving components on the platform).
We're currently working hard behind the scenes at making sure we can continue to serve at this speed as usage increases, spreading the load across our infrastructure.
At the end of this though, we hope the result of our efforts is that you won't notice a thing: it'll just work.
Simon Frost is Technical Architect for BBC iPlayer.
Comment number 1.
At 2nd Jul 2010, Russ wrote:
I still feel you're missing the wood from the trees here. In the current (non-beta) iPlayer, it takes (typically) 4 clicks/pageloads to get to a Radio 4 Afternoon Play. Now you've taken the insane step of getting rid of per-channel programme lists in the beta, it takes 7 clicks/pageloads to get to the same programme.
Mad. Completely mad.
Russ
Comment number 2.
At 4th Jul 2010, Do I wrote:
Russ,
If you know what you want to listen to, have you tried the search box in the top right of the page. The auto suggest is much improved allowing you to find specific programmes more easily.
Also if afternoon play is one of your favourites, try the new favourites functionality. This way all episodes available on iplayer will be appear in your favourites carousel
D
Comment number 3.
At 4th Jul 2010, Russ wrote:
Yes, fair points, D, but:
- Using the search involves typing, which is unergonomic when the primary interface mechanism is predicated on mouse-clicking. (I'm not criticising the search function per se, but it is a sort of 'last resort' in interface design terms.) Searching also uses more BBC server resources.
- I do use favourites, and like it a lot, but it operates only when logged in of course. (I usually log in only to post in messageboards/blogs.) How many users are logged in when going to iPlayer - a few percent at the most? I recognise the BBC would like us to be logged in all the time, and user behaviour in this respect will change over time, but at the moment, I would argue the features dependent on being logged in will appeal only to a minority 'enthusiasts' set. The mistake in design strategy in my view is requiring users to be logged in to access basic functionality.
- The 'For you' feature on the beta console works only sporadically. (Very strange.) And when it does work, and one goes 'off genre', it will lock into that other genre, with no way back. The 'For you' recommendations can be very bizarre, and often bear no relation to the recommendations on the non-console pages. Some of the backend databases are either not talking with each other, or are just spewing out random suggestions.
If not wanting to type in things into the search box, it's now quicker to avoid iPlayer beta pages completely, and go via non-iPlayer channel pages. For example, 5 or 6 clicks (depending on route taken, and these will become 4 or 5 clicks when the homepage cookies get sorted out properly) from the BBC homepage to an Afternoon Play console. iPlayer beta's 7 clicks is a suicide note. It has just made itself redundant.
Admittedly, things can be quicker via a logged-in favourites route, but I don't always 'know' what I want to listen to, and have fairly catholic tastes across Radios 3, 4 and 7. This is where the axeing of per-channel programme lists in iPlayer beta is so inexplicable and perplexing. I note no one from the BBC has even mentioned this, let alone attempted to defend the rationale.
My basic point remains. The above blog explains at length how iPlayer beta has been made more efficient from the BBC's point of view. I would still disagree strongly with that premise. It is also demonstrably more inefficient from this user's point of view.
Russ
P.S. On latency aspects, the 2.5s target is interesting. I'm looking at my beta, and the list of new items in my favourites list still hasn't updated on a refreshed page 4 hours (and counting) after I listened to them.
Comment number 4.
At 5th Jul 2010, Nick Reynolds wrote:
Russ - I think you're off topic. This post is about scaling not navigation. Probably best to comment here.
Comment number 5.
At 5th Jul 2010, Russ wrote:
I take your point to an extent, Nick, but in my view, navigational aspects are intimately related to architecture, and thus to pageloads, server demands, caching, personalisation, etc, which is what this scaling blog is all about. I'm not sure we can really separate these aspects.
Russ
Comment number 6.
At 5th Jul 2010, Joenade wrote:
Russ, you have some valid points about the number of clicks it takes to get to your intended page or program - but that is a usability issue in the site menu design - which is a separate matter from the topic of this post which is about optimising the server performance and providing scalability for huge masses of visitors.
Although a higher number of clicks does relate to more server load - but that is only marginal and all the things described here are more to do with the stuff that goes on 'under the hood' and behind the scenes to keep the BBC site running smoothly and staying responsive.
The BBC server management team should be speaking to companies like Facebook and Twitter to see how they have scaled their site to meet an ever growing number of visitors. I recently read that Facebook had customized Memcached to meet its own needs and has provided the fruit of that labour by making it open source, which the BBC should consider taking advantage of if it serves an appropriate need within the server network.
Comment number 7.
At 4th Sep 2010, Andy wrote:
Great post about the architecture. I'm pleased the iPlayer is running on open source PHP rather than some ridiculously expensive oracle or Microsoft system :D
Comment number 8.
At 5th Sep 2010, Sharturse wrote:
Russ is on the wrong board here, but he makes excellent points.
Comment number 9.
At 5th Sep 2010, Suresh Kumar wrote:
I think this is a shocking waste of UK tax payers money.
The BBC should be focusing on core Content and Programming -- not trying to be a technology infrastructure provider.
Why are you wasting time learning on content infrastructure delivery. Why not leave that to the likes of Apple, Amazon and Google.
Comment number 10.
At 5th Sep 2010, iphp wrote:
This is a great article, thanks for sharing! We are a web development company in the UK and we highly recommend Zend Framework and PHP if it suits their requirements. There is a lot of negative press about PHP and I think you have made the right decision choosing PHP on its merits and not looking at how fashionable it is.
Comment number 11.
At 5th Sep 2010, UchihaJax wrote:
Oh I forgot to ask, what do you use for the db layer? MySql, Postgres, NoSQL (cassandra, etc) or a combination of the two types?
Comment number 12.
At 5th Sep 2010, Deja Vidor wrote:
This is a great, informative article. Thanks for posting it.
I agree with the comment that one should not lose sight of the big picture -- how many mouse clicks and page views it takes for users to accomplish the task.
Comment number 13.
At 5th Sep 2010, peterdragon wrote:
What proportion of users actually want or use the social media features?
I don't use them. It must cost a fortune catering for the minority who do.
"We use loads of servers":
Offer a simple choice up front to switch between a lightweight interface (default) and the fancy social media interface and remember that in a cookie.
The user can choose between a lightning fast response and a 2 second response -just like iGoogle.
Then the fact a good proportion are being served cheaper pages saves a large amount of server resource.
In the current economic climate shouldn't that be a priority?
Also I guess that while PHP is fine, Zend is too much of an overhead, even with lots of caching.
I agree with commenter #9, a lot of these are solved problems and you could benefit by partnering with Amazon, Facebook and Google to make selective use of their technologies, e.g. AWS spot instances to manage your peak load, Amazon Dynamo / S3 rather than trying to reinvent reliable Key-Value storage, Akamai ESI for scalable edge server caching.
Comment number 14.
At 5th Sep 2010, Mike K wrote:
How does your cache-invalidation work, exactly?
Comment number 15.
At 6th Sep 2010, Ajax Jones wrote:
I'd have a couple of comments however. Things like swfobject.js have no expiration date, nor favicon. Normally not a problem but surely they stack up big time for transfer on the site? Likewise a lot of the css and js has only a day expiration, surely you dont change the site css every day?
Also the CSS looks a bit bloated, according to the stats I ran
48.7% of CSS (estimated 64.8kB of 133.1kB) is not used by the current home page.
7kB of 23.7kB is not used
775 bytes of 1.5kB is not used
/iplayer/r23863/style/style.css: 42.6kB of 70.1kB is not used
14.4kB of 37.8kB is not used
Also has 14 very inefficient rules, 134 inefficient rules, so they would be worth fixing
That spread over so many users would certainly stack up to a lot of bandwidth
Comment number 16.
At 6th Sep 2010, Ajax Jones wrote:
Oh, and minify your HTML to save about 10% of transfer.
While I'm looking the following external CSS files were included after an external JavaScript file in /iplayer To ensure CSS files are downloaded in parallel, always include external CSS before external JavaScript.
So , & /iplayer/r23863/style/style.css should come before external JS
finally and not least
The following resources have identical contents, but are served from different URLs. Serve these resources from a consistent URL to save 1 request and 9.2KB, per user !
*
*
The following resources have identical contents, but are also served from different URLs. Serve these resources from a consistent URL to save another request and 2.8KiB.
*
*
in fact looking at some of the other images, there is another 16K to be saved for every page load by optimising some of the images that seem to have been added later.
Comment number 17.
At 6th Sep 2010, cordas wrote:
I just wish you would introduce buffering... Even 30seconds of buffering would greatly improve my experiences when using Iplayer... its incredibly annoying to have shows stutter with connection issues, especially when I consider how easy it should be to make this a none issue....
Comment number 18.
At 6th Sep 2010, Alex Farran wrote:
What made you decide on PHP/Zend rather than Ruby/Rails or Python/Django for example?
Comment number 19.
At 6th Sep 2010, Martin J wrote:
As a Web Professional, I was astounded to learn that you are currently still running Perl with SSIs - I stopped doing that 15 years ago! And you are switching to php! php, as another purely interpreted scripting language doesn't scale particularly well, and is definitely old-tech for large websites.
I would have chosen Java Servlets for their speed, elegance, supreme scalability, and resilience (fail-over session handling, for example).
Still, best of luck!
Comment number 20.
At 6th Sep 2010, benoconnor wrote:
Is this why there is suddenly a frame rate issue with video play back on iPlayer ? It's been happening for a month. Chrome 6.4 Mac OS 10.6.4 - headache inducing flickering hell. One star.
Comment number 21.
At 7th Sep 2010, Simon Frost wrote:
UchihaJax wrote:
"Oh I forgot to ask, what do you use for the db layer? MySql, Postgres, NoSQL (cassandra, etc) or a combination of the two types?"
We use a combination: MySQL where we need it, such as for programme metadata, and CouchDB for our KV store, where we need fast reads and writes for more user-focused data.
peterdragon wrote:
"Offer a simple choice up front to switch between a lightweight interface (default) and the fancy social media interface and remember that in a cookie.
The user can choose between a lightning fast response and a 2 second response -just like iGoogle.
Then the fact a good proportion are being served cheaper pages saves a large amount of server resource."
An interesting idea, but most of our content is served out of cache anyway and much of the overhead in the request is network latency. You'd also probably need a large number of people to make that change to see any kind of benefit, and in our experience we don't see users doing that kind of activity.
"Also I guess that while PHP is fine, Zend is too much of an overhead, even with lots of caching."
Zend gives us the ability to reuse components amongst teams and it's easy to recruit people with the skills we need, both important considerations for us.
"I agree with commenter #9, a lot of these are solved problems and you could benefit by partnering with Amazon, Facebook and Google to make selective use of their technologies, e.g. AWS spot instances to manage your peak load, Amazon Dynamo / S3 rather than trying to reinvent reliable Key-Value storage, Akamai ESI for scalable edge server caching."
We're not in this to solve engineering problems that have already been solved.
Pretty much everything we build is on top of open-source components; where input is provided by all kinds of organisations and companies. As I mentioned above, our KV store is CouchDB from the Apache Organisation.
As for ESI, it's something we've looked at; Varnish (our caching server) can make use of it. It's just not right for us at the moment.
Mike K wrote:
"How does your cache-invalidation work, exactly?"
Much of the invalidation is time or action-based (e.g. someone adds a new favourite). We don't need to cache for very long to see a massive improvement in performance and saving of resource.
@Ajax Jones: Thanks, I'll pass this onto the team to take a look.
Comment number 22.
At 7th Sep 2010, Simon Frost wrote:
cordas wrote:
"I just wish you would introduce buffering... Even 30seconds of buffering would greatly improve my experiences when using Iplayer... its incredibly annoying to have shows stutter with connection issues, especially when I consider how easy it should be to make this a none issue...."
Thanks for your suggestion - actually we do buffer. When you see the buffering symbol it means that your buffer has been depleted (because the rate at which frames are played exceeds the rate at which the connection can replenish them).
Our media playback team work hard to tune this to give the best possible experience, but at the end of the day this is limited by the performance of your connection.
Comment number 23.
At 11th Sep 2010, Alex Cockell wrote:
Hi Simon,
I think @22 is referring to progressive buffering a la Youtube, so someone on a slow connection could let the thing progressively load while paused, then hit play... without the extra load of AIR etc..
Also, as someone who uses Ubuntu on all his kit, I would like to make a minor observation. You state that the iPlayer infrastructure is built on a lot of open source components - but comments have been made by the open-source and Linux community that the BBC can appear at times to be leeching off the community and not giving anything back.
It is probably the case that that is not your intention... but it has been noticed that while taking advantage of GNU licences, the Beeb have only ever released stuff under traditional closed-source licences... frustrating developers who only want to help.
If you have a dual-core Intel CPU, you're in luck - but what about SPARC kit running Linux? Adobe don't have a version of Flash for them...
EG - you may have written a Froyo end-user client... but what about older Android phones? You have a Symbian client available for the Nokia N97, but what about the N900 (which is faster, btw). Runs Maemo - and the hardware is very similar. So - maybe look into porting the N97 client?
And with slapping C&D's on anyone who wants to help you extend the reach through developing end-user clients for other architectures (I can understand how the Beeb would get annoyed about different server infrastructure - but end-user client apps designed to talk to the Beeb's kit?!), it does mean that the BBC are seen as leeches.
Maybe I could suggest a way you can get the community back on-side? How about offering an API into the iPlayer infrastructure so that 3rd party end-user clients could be built . Or maybe look at plugins to native phone media players? Or at least offer something under the LGPL to enable the full-spec iPlayer feeds to be played out on older phones at whatever resolution they can handle?
Most people are *not* interested in ripping iPlayer streams - we simply want to *play* them on phones which we may be stuck with for up to 2 years. But Linux etc runs on a wide range of architectures - why not leverage all that help... people writing software in their spare time? These people writing playback clients for your streaming services could be likened to folks who, back in the '20s, made crystal sets and used the Beeb's 2LO and 4HY transmissions as feeds to test against.
It sticks in the craw a bit... but between us, we could go that extra step.
Thoughts?
Comment number 24.
At 18th Sep 2010, Ephemeron wrote:
Buffering.
"Our media playback team work hard to tune this to give the best possible experience, but at the end of the day this is limited by the performance of your connection."
Hmm, I have 20MB down 6MB up corporate link, 18.30 hours, one user and only since the new iPlayer does HD video stutter, you may want to reconsider the above statement.
Comment number 25.
At 16th Nov 2010, Karin wrote:
I am tired of the excuses. "The poor workman blames his tools."
Whenever I experience terrible playback performance on the BBC iPlayer, I find I can switch to YouTube and enjoy comfortable viewing and listening. Why is that? Same ISP. Same time of day. Oh, yeah, they use proper buffering and they evidently care about the user experience.
Comment number 26.
At 18th Nov 2010, Don Foley wrote:
What Alex says above in relation to progressive buffering makes sense.
Simon your statement just contradicted the statement made by your colleague to Karin earlier in this linked post relation to not doing Buffering at all - which is it or does the left hand of the development team not know what its right is doing?
I have a 50Mb corporate line, accessing iPlayer, still getting stutter, still can't get the download to work with this new iPlayer version.
I have some subtle suggestions as to what the right hand of the BBC iPlayer Development team might be doing, and it might do well to stop and try and refocus on architecting something that works properly !
Sorry its harsh, but its utterly fair. You should rename it the "Dell boy player", it performs like it fell of the back of a lorry in Peckham market.