Over the past 10 days we've had widespread reports of downloading issues on the sites, which have gone hand-in-hand with the annual Steam summer sale promotion that sees games massively discounted on Valve's gaming platform. These downloading issues were caused, simply put, by the fact that every single one of our 20 download servers was filled to capacity with people trying to download.
If our registration statistics are anything to go by, this year's summer sale was the most successful one yet for Steam. Over the past ten days we've averaged 8,200 new registrations a day, including a new Nexus record of 14,505 new members in a single day. That beats the previous record of 13,570 new members, set on November 26th 2011, just a couple of weeks after Skyrim's launch. Typically the Nexus averages 3,500-4,500 new registrations a day when something special isn't going on.
When you have a huge influx of new members in a short space of time it has quite a detrimental effect on the file servers. While you can typically only browse the site one page/tab at a time, which helps us conserve resources on the web servers, you can have many downloads running at any one time. The inherent problem with a huge influx of new people is that their downloading habits are different to those of "regular" users. As a new user you want to download a lot of mods all at once. You'll go through the top 100, look up "best mod" lists on the internet and try to download as many as possible. As a "regular" user you've already done this, your mod list is pretty set, and you're now browsing the Nexus to see what's new, perhaps only downloading one or two new files a day to augment your current mod list. So having a huge influx of 14,000 new users in a day is like adding an extra million regular users to the site overnight for a short period.
The result was 20 file servers each serving 400 concurrent downloads, which meant that during the Steam sale we were serving 8,000 concurrent file downloads at any given second and maxing out a 10Gbit line. That number would likely have been much higher if it weren't for the hard connection limits we've set on the servers. Hopefully you can appreciate that's a lot, and the infrastructure you need to handle it has to be extremely powerful and resilient. While our file server infrastructure is powerful, it's designed to handle around 6,000 concurrent downloads, and we average around 4,000-5,000 on a normal, non-Steam sale day.
Question: Why has this only become an issue now?
Aha, here's a silver lining (ahem). The reason this is the first time we've maxed out our file servers is that this is the first time our web servers (the servers we use just to display the sites) have held up under all this traffic. Secretly (ahem), we're patting ourselves on the back that the sites themselves were accessible for practically the entire Steam sale week, which means our new Cloud setup and centralised database cluster are finally working. We're obviously not happy about the file server setup, so we're working to sort it out.
Question: Why weren't you more prepared?
I thought we were :)
Back in January I posted that we had completely decommissioned our file server setup and were moving from a 15 standard download server setup to a 20 standard download server setup, an increase in capacity of 33%. The inherent problem was that, because our web servers always used to fail before the file servers did, we'd never thoroughly tested our file setup under extreme load conditions. Now that the web servers are up to scratch and holding under these conditions, the file servers are taking on a lot more load. And so now we can react.
Question: Why didn't you just buy more servers when the Steam sale started and it became apparent the load was too much?
The file servers we need can't just be requisitioned overnight. They need to be ordered, delivered, plugged in and have all the firmware and updates applied before we can even get the entire file database copied on to the drives. That takes time, more time than the Steam sale was going to last.
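For a rough sense of the gap, here's a back-of-the-envelope sketch in Python using only the figures quoted earlier in this post: 20 file servers, a hard limit of 400 connections each, a 10Gbit line and a design capacity of around 6,000 concurrent downloads. The function and variable names are made up for illustration, and the even split of load and bandwidth across servers is a simplifying assumption rather than a description of how our servers actually behave.

```python
import math

# Figures quoted earlier in this post.
FILE_SERVERS = 20
HARD_LIMIT_PER_SERVER = 400    # concurrent downloads allowed per server
DESIGN_CAPACITY = 6_000        # concurrent downloads the setup is designed for
LINE_CAPACITY_MBIT = 10_000    # the 10Gbit line, expressed in Mbit/s


def peak_concurrent_downloads(servers: int, limit_per_server: int) -> int:
    """Total downloads when every server sits at its hard connection limit."""
    return servers * limit_per_server


def average_mbit_per_download(line_mbit: float, downloads: int) -> float:
    """Average share of the line per download (assumes an even split)."""
    return line_mbit / downloads


def servers_needed(peak: int, design_capacity: int, servers: int) -> int:
    """Servers required to carry `peak` at the normal per-server design load.

    Assumes (hypothetically) that design capacity is spread evenly,
    i.e. 6,000 / 20 = 300 downloads per server.
    """
    per_server_design = design_capacity / servers
    return math.ceil(peak / per_server_design)


peak = peak_concurrent_downloads(FILE_SERVERS, HARD_LIMIT_PER_SERVER)
print(peak)                                                 # 8000
print(average_mbit_per_download(LINE_CAPACITY_MBIT, peak))  # 1.25
print(servers_needed(peak, DESIGN_CAPACITY, FILE_SERVERS))  # 27
```

On those assumptions, carrying the sale peak at normal loading would have taken roughly 7 more standard download servers, and the real demand was probably higher still, since the connection limits were capping it.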
Picture the situation like a huge rock festival (let's take Glastonbury as it's only just finished) that comes to a very small town (population just under 9,000) in England once a year. For 361 days of the year the local road infrastructure is completely fine, but for 4 days a year, when the Glastonbury festival sets up in nearby fields, the roads are completely choked with cars and the local residents can barely get out of their own town. Is it prudent for the local council to build an 8 lane highway to support a 3rd party event that may or may not happen from year to year and that will only be used for 4 days of the year? I think not. In a similar vein, we'd be talking an extra $5,000 expense each month, minimum, to accommodate an event that happens once or twice a year.
We can't just say to our server provider "we want these servers during November/December and June/July but for the rest of the year we don't want them". Contracts have to be signed and so on and so forth.
Question: So what are you going to do about it?
Last year we spent considerable time, effort and money sorting out our web server situation, and we moved to a much more flexible cloud and cluster setup. This has worked. It now makes sense that we continue those efforts and bring our file servers in line with the cloud ethos.
We're currently in talks with a big CDN service, which already partners with big video game players like Steam, CCP and Wargaming, to get rid of our current dedicated file server setup and move all of our file serving on to a CDN.
If you don't know what a CDN is I won't bore you by going into detail about what it is (a simple Google search will surely enlighten you!), but I will bullet some key advantages it will have over our current setup:
- Flexibility and scalability. There's practically no limit to the resources we can use and there's no time delay in making use of them, which means no bottlenecks. We contract for a set amount of usage and any overage due to one-off events, like a Steam sale, is charged at a standard and competitive rate.
- Less administration and more secure. Maintaining 27 file servers (20 normal, 3 Premium, 4 static content) is a huge undertaking that requires a lot of server administration to keep up-to-date and secure. Moving to a CDN places this responsibility in the hands of a team of qualified individuals who are much better suited for the job, freeing us up to both not worry as much, and not work as much on this issue.
- Increased performance and localisation. We currently have 14 download servers in the US and 6 download servers in the UK, but the Nexus has a global reach with many users from South America, Asia and Oceania. CDN networks have data centres distributed across the globe that should ensure you really will max out your connection when downloading from our servers, hopefully, irrespective of where you are in the world.
Question: It sounds good, so why haven't you done this in the past?
Partly because it wasn't necessary and partly because it costs more: between 30% and 70% more than our current dedicated file server setup, depending on how much bandwidth we use. We've come to the realisation from our work on the cloud and cluster setup that this really has to be the future for us, and the added cost, although tough, is necessary to secure the future of the sites. We need to be able to move fast during these sorts of situations, which is something we cannot do with a dedicated server setup.
Question: When?
As soon as possible. We're testing out the feasibility of the CDN for our setup as I'm typing this.