Here’s the best information I’ve got on the situation:
There was a bug in the kernel of the database server’s operating system. The bug kicks in when large files are accessed on particular hardware. The reports I’d seen suggested the problem only appeared on HP hardware, and since we don’t run HP hardware it looked like something we could safely ignore.
As we learned yesterday, though, networked storage was eventually affected as well. The server ran cleanly for about 70 days before the bug reared its head and noticed us.
THR was sluggish in the morning. Page loads were taking around ten seconds, when I’m used to seeing 1-2 second load times. I restarted the database application as a temporary fix. Generally that’s enough to get things happy again, so I could schedule a database optimization (and its required downtime) for later in the evening.
I started a thread to alert THR members to downtime later that evening, then posted a follow-up comment. That comment took over ten minutes to post. Ten-minute posts are slow enough to get me to do maintenance during the day, even with the hour of downtime it requires. So, I optimized the database…
…and saw the kernel kicking out some major errors. The optimization ran cleanly through our biggest database table (the one that stores the individual posts), but then ground to a halt. Bad news.
So, I restarted the server, upgraded the kernel (which appears to have solved the initial problem), and re-optimized the database.
When the optimization didn’t complete the first time, it marked the table it never started optimizing as “crashed.” This is not a big deal; it’s a feature designed to make me scan the database and ensure there are no issues, since the designers of the database software had to plan for all sorts of failures. So I ran the recommended scan.
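For the curious, here’s roughly what that step looks like. This is a sketch under an assumption on my part: the “crashed” flag behavior described above matches MySQL’s MyISAM storage engine, so the commands below are the standard MyISAM check/repair tools. Table and database names are placeholders, not our actual schema.

```sql
-- With the server running, from the MySQL client:
CHECK TABLE post;    -- reports the table's status (e.g. "crashed")
REPAIR TABLE post;   -- rebuilds the table and its indexes; can take hours on a big table
```

On a table this size, the repair has to rebuild every index, which is why the running time scales so badly with the size of the data.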
And now, 21 hours later, the scan/optimization is still running. I know we’re no more than 85% complete, but I can’t make any estimate more precise than that. It could be another 24 hours for all I know.
The mistake I made was running the repair process with a dated configuration file. I last tuned those settings years ago, the previous time we had to do this, not realizing that database growth since then has made them far slower than they could be. The server we’re running on today is much faster than the machine I last ran a repair on, but with 8.5 million posts in the post database that’s still enough to really bog things down.
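To give a concrete sense of the kind of settings involved: assuming a MySQL/MyISAM setup (my assumption, as above), the repair buffers live in the configuration file and look something like the fragment below. The variable names are the standard MyISAM repair knobs; the values are purely illustrative, not our actual configuration.

```ini
[myisamchk]
# Illustrative values only -- size these to the new server's RAM.
# Too-small buffers force the repair to spill to disk and crawl.
key_buffer_size  = 1G
sort_buffer_size = 1G
read_buffer      = 8M
write_buffer     = 8M
```

The point is that buffers sized for a much smaller database, on much older hardware, can make the same repair run an order of magnitude slower than it needs to.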
At this point I have a couple of options:
- I can let things continue without knowing how long they will take. This is the safest option, but I really have no idea how long we’ll be down as a result. There’s a better-than-5% chance it’ll be the weekend before we’re back up. Or it could be complete in an hour.
- I can stop the process, tune the settings for the new server, and restart it. From the research I’ve done, that could speed things up by as much as 25x, which would be a nice improvement. There are problems, though: there’s at least a one-in-three chance that even with the best settings we’d still have to sit through the same slow process we’re watching now, so I’d be throwing away 20 hours of work by restarting. And stopping the process runs a real risk of actually corrupting the database (right now it’s just marked as corrupt, as far as I can tell).
- I can give up, drop the database, and restore from yesterday morning’s backup. We’d lose about six hours’ worth of posts, which is probably acceptable. The problem is that the restore is a single-threaded operation, and the database is huge, so a restore might take longer than just waiting for the current process to finish.
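To illustrate why the restore option is so slow: assuming the backup is a standard SQL dump (an assumption; the file and database names below are placeholders), a restore is just one client replaying one long stream of statements.

```shell
# One connection, one thread, replaying every INSERT in order --
# there is no parallelism to exploit, so a huge dump takes a long time.
mysql forum < backup_yesterday_morning.sql
```

With millions of rows to re-insert and every index to rebuild along the way, there’s no guarantee this finishes before the repair that’s already running would.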
We’re waiting, hoping for completion sooner rather than later. The Posts table is being processed right now; once it’s done we’ll know we’re about 85% finished, but at the moment we could as easily be 10% done as 80% done. I just can’t tell without risking data loss.
And now you know what I know. But you probably slept better than I did.
It’ll be back up as soon as I can get it back up. Then decisions will need to be made about whether to rearchitect things a bit…