An update on recent events and site up, down, and all-around...ness
So what’s going on at the moment? SHITE, that’s what! Where to start? Since our failed attempt at setting up the MySQL Cluster last Thursday and Friday we reverted back to our old setup. Naturally the weekend was bad, as they have been for the past month or two, then during last week things went back to being fine, and then, again, this weekend, it was awful. 502 errors, 504 errors, I’m sure you know what I’m talking about. I’ve had enough of that, I’m sure you have too. It may be frustrating for you, it’s freaking debilitating for me.
There are various issues right now that have reared their ugly heads and require immediate remedying. Remember when Skyrim first came out and we were struggling under all the load? The sheer number of people wanting mods was crushing us. Back then we were getting around 6 million page views a day. If you do the math that’s around 69 page views a second. Typically traffic is meant to go down over 20 months, but nope, we’re still pushing around 5.2 million page views a day, or 60 page views a second. Of course, we’ve made loads of optimisations since those days, but there’s also a big difference between back then and right now; back then our database was tiny. It didn’t have 30,000 files, or 300 million downloads in the database. Searching a database of 300 million rows for your user ID is, naturally, going to take a lot longer than searching a database of 1 million rows for the same user ID. So while traffic has gone down, slightly, our database has grown many, many, MANY times over, which has slowed things down. It’s made our jobs difficult as we constantly juggle new features, tweaking the servers, the new cluster setup and what not. Frankly, it’s a pissing nightmare and our jobs right now are most definitely not fun. They haven’t been for quite some time.
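For anyone who wants to check the math, here is a quick sketch in Python. It assumes traffic is spread perfectly evenly over a 24-hour day, which real traffic never is, so peak load will be higher than these averages.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def views_per_second(views_per_day):
    """Average page views per second, assuming perfectly even traffic."""
    return views_per_day / SECONDS_PER_DAY


launch = views_per_second(6_000_000)  # around the Skyrim launch
now = views_per_second(5_200_000)     # at the time of this post

print(f"launch: {launch:.1f}/s, now: {now:.1f}/s")
```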
This is what the Cluster setup was meant to fix for us, not only by providing us with a hell of a lot more power to make use of but also by making it that much easier to increase that power whenever needed. Instead of trying to eke out every last drop of performance from the servers with config/ini tweaks we could simply buy a new server, tack it on to our cluster and hey presto, we just increased the power available to us by X amount, quickly, easily, efficiently, freeing up time for us to work on making the sites better, rather than simply making the sites work.
I’ve been on the lookout for someone who knows what they’re doing to help us out with our cluster setup. Unfortunately the MySQL Cluster experts, certified in the job, cost an astronomical amount of money to hire. I got quoted $200/hour from 3 separate companies. If you consider it takes around 12 hours to import our forum database into the cluster, that’s $2,400 just to do that one simple task. I imagine it’ll need at least 50-60 hours spent on it, and $10,000 to $12,000 is quite an expense. Even then there’s no guarantee it will be completed within that timeframe and I can’t write a blank cheque. The cluster is our salvation and our future, we just need it to work!
Today Skyrim Nexus was absolutely awful, 502 and 504 errors galore. The reason? One of our software licenses decided to think it had expired, when it hadn’t (due to expire in 2014) and brought down the server with it. cpnginx, take a bow, you’re freaking badly made! We’ve got that fixed again and now we’re diagnosing yet another issue.
Our database isn’t the only thing that has grown in size over the past 2 years; the number of people using the Nexus Mod Manager has as well. We’ve just passed the 2 million unique users mark for NMM, and, similar to issues we had 6 months or so back, we’re having to tweak things again to be able to cope with the demand. We’ve actually turned off the NMM web services this evening to run some tests. Getting a “file does not exist” or “server unreachable” error when trying to download through NMM? Sorry, that’s us. Also, sorry we can’t give you a better error message than “file does not exist” when, actually, it does exist, the services are just down. That’s just stupid. I’ll get that put into the next version of NMM so that when we take down the services for whatever reason you’re informed properly instead of thinking something is actually broken and coming to the forums looking for blood. We’ve found that turning off the NMM services makes Skyrim Nexus more than good to browse; I was able to click through 20 search result pages in a minute, a freaking rarity in this Nexus day and age. If we turn the services back on everything slows down to a crawl again. Safe to say NMM is causing problems for the sites, so we’ll be looking into that as soon as possible as well. I’m going to leave the NMM services off until we go to sleep tonight in a couple of hours’ time. Browse Skyrim Nexus, isn’t it better? Isn’t it MUCH BETTER? That’s what we want it to be like all the time.
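To sketch the idea behind that better error message (in Python for brevity; this is not NMM's actual code, and every name here is made up for illustration): the client checks whether the web services are reachable at all before concluding that a file is missing.

```python
from enum import Enum


class DownloadError(Enum):
    # Hypothetical user-facing messages, not NMM's real strings.
    SERVICE_DOWN = "The Nexus web services are currently offline. Please try again later."
    FILE_MISSING = "This file does not exist on the server."


def classify_failure(service_reachable, file_found):
    """Pick the message to show after a failed download.

    Both arguments are assumed inputs: the result of a health check
    against the web services, and the result of the file lookup itself.
    """
    if not service_reachable:
        # With the services down we cannot know whether the file exists,
        # so never report it as missing.
        return DownloadError.SERVICE_DOWN
    if not file_found:
        return DownloadError.FILE_MISSING
    return None  # nothing went wrong
```

The point of the ordering is that "file does not exist" is only ever shown once the services have actually answered.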
And finally I received some “good news” last week, depending on how you want to view it. When our file servers were hacked a few months ago and some downloads were serving malware instead of mods we put in extensive work to not only “harden” the servers but add a ridiculous amount of monitoring to them. If you so much as look at our servers in a funny way we’ll be notified about everything we can possibly know about you. NSA, eat your heart out. A couple of weeks ago we were made aware of breaches of two of our servers again. We were notified automatically and immediately about the breach and took the servers down within 5 minutes of finding out. We spent a ridiculous amount of time trying to work out how the hacker(s) had gained access to our servers. When we had exhausted every option we got in contact with the provider for the servers. Turns out we wasted about 100 hours of our time as the breach was in their systems, and not ours. I cannot tell you much about the topic as it’s under investigation with the FBI, but the attack is thought to originate from Ukraine and it was specifically targeted at the Nexus and one other site/network at the provider. This, frankly, was a relief for us. Not because we were hacked, of course, but because it wasn’t “our fault”. There was nothing we could have done, or added to the servers, to prevent this attack other than not use this ISP altogether. This ISP has been good to us over many years, we had no reason to doubt their competence and they are a major international player on the world stage. The fact the hacks came through this ISP has meant we can sit back and let them bring the full force of their contacts at the FBI and local crime authorities to bear, rather than us trying to flex our petty influence.
This tells me a few things: (1) these people targeted us specifically, but (2) they couldn’t hack our servers, so they hacked our ISP instead to get to us, (3) we know when something bad has happened and can act very quickly to prevent anything reaching you, and (4) our ISP acts fast to rectify the problem on their end. It’s not exactly a happy story, but it was a relief for me at least.
Before I sign off let me just say this. We know there are problems with the performance on the site. It’s very hard to miss. I appreciate all the people letting me know about the problems through various mediums; the forums, support tickets, email, etc. A few people get pissed off that I don’t respond every day to the posts on the forums on the topic. The inherent problem is that when I respond, or even make a news post like this one, it gets buried within 5 posts and you get Joe Bloggs coming along who couldn’t be bothered to check the news or read all the posts in the thread, asking the exact same question for the exact same problem, normally across 5 or 6 different threads. This is annoying, and if I spent my time responding to every single post like this I’d have no time for anything else. So understand I most definitely know about these issues and silence from me doesn’t mean I’m uninterested; silence means we know about the problem, we’re working to fix it, and I’ll let you know as soon as I know more about the problem myself.
PS. You CAN still use NMM even without being able to log in. Click the "Offline" button and you have access to all your mods. You can even install mods, you just need to use the manual approach (and NMM supports dragging and dropping).
253 comments
Comments locked
A moderator has closed this comment topic for the time being
I've been lingering since I used to mod Oblivion on the TESNexus, and this place is the holy grail of the Bethesda modding community.
(wonder who's going to get that reference)
Well, you know, I do wonder why the heck some folks in the Ukraine suddenly take an interest in hacking a website such as this... I mean, The Nexus (afaik) takes no particular political stance on anything global (I don't consider Piracy a topic strong enough to incite such an attack), doesn't sell anything other than Premium memberships and I seriously doubt the Nexus' strict behavioural rules would cause such a severe attack in response.
The only logical conclusion, by deduction, would be that the Nexus is just a great target by proxy for reaching a huge mass of people. Just like the Malware downloads.
So in a sense, it's a huge compliment that people as far as the Ukraine target this place.
Still sucks, but hey look over there, Optimism!
I do have to say, if 'a' weak spot in this website were the ads, being premium would basically act as a safety measure against it, right? That's an interesting perspective...
I'm a professional DBA and quite willing to donate my time in assisting you. If you'd like a little more history on my experience, feel free to send me a PM or shoot me an email (I'm sure you have it somewhere in that DB). I have a background focusing on high rate OLTP and OLAP environments -- 5000+ QPS on a single instance.
As for some offhand suggestions:
* Partitioning may help a lot, at least in the short term.
* Cluster may not be quite the knockout solution you think it is/have been led to believe it is.
* Sharding will offer far more performance and keeps your setup fairly simple, though it will require development work to handle said sharding.
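As a rough illustration of what application-side sharding by user ID could look like (a minimal sketch in Python; the shard names and the modulo scheme are assumptions for illustration, not a description of the Nexus setup):

```python
# Hypothetical shard hosts; a real setup would load these from config.
SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]


def shard_for_user(user_id, shards=SHARDS):
    """Route a user ID to exactly one shard using simple modulo hashing."""
    return shards[user_id % len(shards)]


# Every query for a given user then goes to a single, smaller database.
print(shard_for_user(5))
```

The development work mentioned above is mostly this routing layer plus anything that has to query across shards. Note that with plain modulo routing, adding a shard remaps most users to a different shard, so growing the setup means migrating data.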
Hats off to you, sir!!