Yes, I’m sorry, here’s another one of those 5 page, 2,500 word rambling nonsense blog posts I like to make from time to time to show you that I’m not dead and we’re still moving forward. If you have no interest in servers, money or talking about my narrow views on corporate greed, then I would suggest skipping this one!
We’re about one month into the new year now and I wanted to share with you one of our main priorities for this year, as it’s as important to us as it is to you.
Perhaps one of the Nexus’s biggest pitfalls since its inception has been the stability of the service. I don’t have any uptime statistics on hand to report on but I think it’s pretty safe to assume that we’re not hitting the 99.97% uptime that most big sites aspire to. Not only is it annoying because you guys can’t access the site, or particular services on the sites (like the downloads) at random points, but it’s annoying for the staff to be awake at 3am restarting services and troubleshooting database crashes, and it’s annoying for me to be running a service that isn’t 100% reliable. Moving forward I’d like the Nexus to be taken seriously by game developers, and it’s hard to be taken seriously when you can’t guarantee service.
It’s not as though other sites and companies out there don’t have reliability issues. I’ve been locked out of DotA 2 many times recently because the Steam servers have been down, for instance, and that’s from a multi-billion dollar company like Valve. But the problem with the Nexus is that it’s a regular occurrence.
I can attribute this problem to all sorts of systemic issues right through the Nexus, from the way I’ve set up the business to the way we’ve run the site and prioritised things. It’s not as simple as pointing the finger at the lack of server resources, or hardware failures, or the code, or the DDoS’ers, or being a victim of our own success or any one thing, it’s a multitude of things. But the highest priority of this year is to resolve this issue and make the Nexus as stable and redundant as it possibly can be.
So let’s take a look at some of these systemic issues and then I’ll explain what we’ve been doing, and what we will be doing, to make the situation better and ensure the Nexus is future-proof.
The largest factor of all with the stability issues has been the sheer popularity of the Nexus sites coupled with having an inappropriate server architecture to accommodate the demand on sites that are extremely database (and ergo, resource) intensive. I’ve blogged many times on the popularity of these sites and the difficulty in keeping them up with the load placed upon them, so I won’t bother to go into the numbers again. So you’re thinking, “OK, why not buy more servers then?”. The answer isn’t in needing more servers, the answer is in needing to restructure the architecture of the servers and network we currently have so that the combined resources of all the servers can be used to keep the sites going.
Right now we have a situation where we have lots of lower traffic sites (Far Cry, Neverwinter, The Witcher, Morrowind and so on), some high traffic sites (Forums, Oblivion, Fallout 3, New Vegas) and one super mega ridiculous traffic site (Skyrim). Typically speaking almost every site on the internet can fit onto a single powerful dedicated server. Depending on the size of the sites you can even fit hundreds or thousands of normal sites onto a single dedicated server. We have 6 servers dedicated to just serving the Nexus sites (not the file servers, we’ve got 12 of them!). The problem though, is that Skyrim Nexus, and the forums, are not normal sites and they’re at a point where they can no longer fit on one single dedicated server. Similarly we’ve upgraded the hell out of the servers so we can’t make them any more powerful than they are now.
We’ve reached this point where Skyrim Nexus has outgrown being able to run on a single super-powerful dedicated server, so how do we resolve this issue? The solution is in server clustering, which is a technology that lets you pool together the resources of multiple servers to act as one super mega server, much like SLI allows you to connect up and combine multiple video cards in your PC to dramatically increase your frames per second. Unfortunately server clustering isn’t as simple as connecting an SLI bridge connector to your video cards. It’s a lot more complex.
Server clustering is not only complex, it’s also expensive. We have 6 web servers at the moment. We can’t just flip a switch in the servers we currently have and turn on clustering. We’ve got to buy completely new servers, set them up for clustering and then transfer the network on to these clusters. That means running our current setup in parallel with the new one until everything is transferred which means paying for the original 6 servers plus the new servers we need to buy to form our clusters. That’s a lot of money.
And therein lies another systemic issue with the way things are set up. Money. The Nexus sites have remained completely independent, free of corporate interest and investment, for their entire 11 years and they shall remain so for the very foreseeable future. The only investment these sites have had was the initial £10,000 I chucked in when I rebranded the sites as the Nexus back in 2007. I’m the sole owner and sole decision maker of the sites. There’s no outside interest, board of directors or investors pulling the strings behind the scenes. Similarly no game developers have any influence or sway over me. The buck stops at me.
If I wanted to I could make a business plan (I don’t have one, by the way) and go to Silicon Valley, pitch the idea to a load of private and angel investors, secure (potentially hundreds of) thousands of dollars in investment money and make a proper business out of it like many gaming sites and networks have done over the past few years. However, I then become answerable to shareholders and investors who are looking for a return on their investment as fast as possible. To be frank, F’ that.
Similarly it’s just me and 4 other programmers working on the Nexus. We have absolutely no one doing ad sales. I mean it, we have no ad reps at all. Others in the industry gawp at such an oddity. That’s why the ads you see on the site (if you don’t block them) are pretty crap, and in return we get pretty crap rates. While other networks have entire ad sales teams securing them crazy $1-$10 CPM rates on their ads, we don’t. We don’t get anywhere near that. So why don’t I hire some ad reps to better sell the inventory and use that money to pump it all back in to the sites? The reasoning is very similar to my private investment reasoning; when the focus of your business is on increasing your ad sales, and on ensuring a prompt ROI to your investors, you begin to lose sight of what your original goals were and instead focus on one very simple goal: making money. And money isn’t what I’m doing this for. Indeed, if money were my aim I’d be doing all these things I just mentioned, because the Nexus would be a cash cow. Case in point: I know sites that have 5-10 programmers working for them and 25-50 ad sales reps. Yes, that’s a 1:5 ratio of people working on content to people working on making money. To me, it’s crazy to have more people working on selling than actually improving and producing the content that sells. But that’s business for you, and I’m not a good businessman.
What this all breaks down to is limiting the stakeholders in the Nexus. Right now you guys, the people who use the site, the mod authors, the downloaders, the people on the forums, YOU are the Nexus sites’ biggest stakeholders. If I don’t appease you then these sites cease to have a point. If I seek private investment, or start directly selling the Nexus site ads, then my biggest stakeholders become the shareholders and the advertisers on these sites. My focus gets shifted from serving and pleasing you, the users, to serving and pleasing people who have no interest in you. And the point of the site changes from being about modding to being about making money. That’s not what I want at all. There may come a time in the future when direct ad sales and private investment are exactly what the Nexus needs, but that time is not now.
You guys are really, really good when things go wrong on these sites. By and large the reaction is often tame and supportive rather than stressed and raging. I like to think it’s because you know we’re not some corporate mega-money machine that’s cutting costs by cutting corners, but just 5 gamers trying to provide the best service we can. I don’t want to change that, because being greeted with “Ah that sucks, I hope you can fix it soon! Good luck!” is better than being attacked with “WTF this is the worst pile of crap I’ve ever used and you should be ashamed” when something goes wrong.
I want to retain that focus on you guys being the primary stakeholders in the future of the Nexus, which means it takes a lot of monetary planning and saving to buy more servers and invest in expensive technology like server clustering while other sites can simply throw their private investment resources or ad sales money at the problem. That’s why it takes a long time. Avoiding private investment and direct ad sales is a conscious decision that isn’t without its pitfalls, but one that I think is worth it to retain the core values of what these sites were set up for in the first place: to provide mod authors with an easy platform to share their work with others that will stand the test of time.
So setting up server clustering is currently one of our biggest priorities, and we’ll be setting that all up in the very near future, but in the run up to all this we’ve spent (and are still spending) some considerable time right now focusing on the software side of things.
Over the Christmas period, while the “normal” members of the staff were enjoying a forced two week break, Axel was working on an error logging system for the Nexus.
One of the most annoying aspects of bug hunting and troubleshooting is when someone leaves a comment on one of these news articles, or on the forums, or on the tracker that something is broken. Typically it will go something along the lines of “Downloads are broken at the moment”. To which my response is “.......”, followed by much hair pulling. Downloads are broken? What downloads are broken? On what site? What files? Is it all files or just one file? Is it only happening on one Nexus site or all Nexus sites? Is it just small files or large files? What error are you getting? Is it happening 100% of the time or just some of the time? What browser are you using? Have you tried using another browser? Have you tried turning your PC off and on again? Have you tried logging out and in again? Did downloads ever work for you? Have you installed any new browser plugins, firewalls or anti-virus programs recently? What time did this happen? These are but some of the questions we need answered to actually troubleshoot the issue, especially if all the staff try downloading and it works fine.
What I wanted was a system that would aggregate and parse all the error logs the servers produce and present them to the staff in a way that helps us easily pinpoint not only errors and problem areas of the site, but also specific times when the sites are worse than others. Typically the error logs that servers produce are all flat-file text documents. Line after line after line of errors with timestamps that can run up to gigabytes in size. It’s extremely hard to make use of these error logs without having a system to properly display the information, and there’s nothing worse than being told something isn’t working when it works for you and wondering if it’s affecting just one person, 1% of people, 25% of people or even more. With the error logging system we can now see that “wow, yes, at 10am today we had 5 times more errors than we usually do”. It’s helping us to investigate things more and we’ve already applied numerous hot fixes to the sites over the past month that have patched up errors and slow areas of the site.
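For the programmers reading, here’s roughly what the “5 times more errors than usual” idea looks like as a minimal Python sketch. To be clear, this is not Axel’s actual system: the log format (each line starting with a `YYYY-MM-DD HH:MM:SS` timestamp), the function names and the use of the median as “usual” are all assumptions made up for illustration.

```python
from collections import Counter
from datetime import datetime
from statistics import median


def errors_per_hour(lines):
    """Count error lines per hour.

    Assumes (hypothetically) that every error line starts with a
    'YYYY-MM-DD HH:MM:SS' timestamp; anything else is skipped.
    """
    counts = Counter()
    for line in lines:
        try:
            stamp = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # no parseable timestamp at the start of the line
        # Truncate to the start of the hour so errors group into hourly buckets.
        counts[stamp.replace(minute=0, second=0)] += 1
    return counts


def spikes(counts, factor=5):
    """Return the hours with at least `factor` times the median hourly count."""
    if not counts:
        return []
    typical = median(counts.values())
    return sorted(hour for hour, n in counts.items() if n >= factor * typical)
```

Feed it the lines of a log file and `spikes(errors_per_hour(lines))` gives you the hours worth looking at first. A real system would also need to group by site and by error type, which this sketch ignores.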
Similarly at this very moment we’re working on some more improvements to the downloading system for both manual downloads and downloads through NMM. Right now, if one of the file servers has hiccupped it can be a real pain in the ass trying to download something. These hiccups generally only last for minutes at a time, but during those minutes it can be hard to download any files, especially small files. With that in mind, we’re going to present the file server selection screen on all files now, irrespective of size. If a file server is down, you can quickly select another one to use. We’re also trying to implement a seamless redirect system in case you choose a file server that isn’t working for whatever reason. If the file server you choose isn’t working, the site will simply try another one until it finds a file server that is working. You won’t really notice a difference (except far fewer errors, or none at all!), although if you typically get fast speeds on only one or two servers you might get slower speeds, as your download might be served from a different file server from the ones you normally pick if they’re down.
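The redirect logic itself is simple enough to sketch in a few lines of Python. Again, this is only an illustration of the idea and not our production code: the server names and the `is_up` health check are hypothetical stand-ins.

```python
def pick_working_server(preferred, servers, is_up):
    """Return the preferred file server if it is up, otherwise the first
    other server that is. Returns None if every server is down.

    `is_up` is a hypothetical health check: a callable that takes a server
    name and returns True or False.
    """
    # Try the user's choice first, then every other server in list order.
    candidates = [preferred] + [s for s in servers if s != preferred]
    for server in candidates:
        if is_up(server):
            return server
    return None
```

So if you pick a server that happens to be down, `pick_working_server` quietly hands back another one instead of an error, which is exactly the trade-off mentioned above: fewer failures, but possibly a slower server than the one you chose.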
This concept of seamlessly being moved to a server that works is very similar to our plans with the sites and servers in general. Right now, if there’s a hardware or network failure on one of the servers a Nexus site is on, that Nexus site becomes unavailable. Once our full clustering solution is done we’ll have a load balanced, redundant solution that means all sites are being served from all servers. If one server goes down, the other servers pick up the strain but the sites still work. It reduces the bottlenecks and also reduces our single point of failure problems. And finally, clustering restores the status quo by making “buy more servers” the viable solution to our strained server issues. If the network needs more power you just tack another server in to the cluster and you’ve boosted the resources available to the network. That’s not possible with the current system. So for me, this is quite exciting. For you, it’s more like “I don’t care, just make it work”.
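If you’re curious what “all sites served from all servers” means in practice, here’s a toy Python sketch. It assumes plain round-robin balancing and made-up server names purely for illustration; the real clustering setup is far more involved (databases in particular), and this isn’t a description of what we’ll actually deploy.

```python
class Cluster:
    """Toy round-robin load balancer that skips servers marked as down."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.down = set()
        self._next = 0  # running counter used to rotate through servers

    def add_server(self, name):
        """'Buy more servers': a new machine simply joins the rotation."""
        self.servers.append(name)

    def mark_down(self, name):
        self.down.add(name)

    def mark_up(self, name):
        self.down.discard(name)

    def route(self):
        """Return the next healthy server, or raise if none are left."""
        for _ in range(len(self.servers)):
            server = self.servers[self._next % len(self.servers)]
            self._next += 1
            if server not in self.down:
                return server
        raise RuntimeError("no servers available")
```

When one server is marked down, `route` just carries on with the rest, and `add_server` is all it takes to put more capacity into the pool.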
I think I’ve gone on for long enough now. When I wrote my blog piece on Nexus development and expansion philosophy, I was interested to get a lot of emails and messages from people who were surprised I thought that hard about the process. I like to use these blog pieces to show you that no, I don’t just sit around all day watching stocks go up and down on my monitors and playing DotA 2; I think pretty damn hard about this network. The choices I make aren’t just knee-jerk “oh, I guess we’ll just do that then” solutions but plans that have been mapped out and expanded upon over a long period of time in consultation with others. We’ve wanted to do server clustering for years now (indeed, I mentioned it in that first YouTube video I did), but we’ve only now been in a position to actually afford it. And that’s why I’m excited, even though the topic is pretty boring.