Here’s the best information I’ve got on the situation:

The Cause

There was a bug in the kernel of the database server’s operating system.  This bug kicks in when large files are accessed over particular hardware.  The reports I’ve seen suggest that this problem really only appeared on HP hardware, and since we don’t run HP hardware it looked like something we could ignore.

As we learned yesterday though, networked storage was eventually affected as well.  The server ran cleanly for about 70 days before the bug raised its head and noticed us.

Yesterday’s Actions

THR was sluggish in the morning.  Initially page loads were taking ~ ten seconds per page when I’m used to seeing 1-2 second page load times.  I restarted the database application as a temporary fix — generally this is enough to get things happy again, so I can schedule a database optimization (and its required downtime) later in the evening.

I started a thread to alert THR members to downtime later that evening, then posted a follow-up comment.  That comment took over 10 minutes to post.  10 minute posts are slow enough to get me to do maintenance during the day even with the hour of downtime it requires.  So, I optimized the database…

…and saw the kernel kicking out some major errors.  Things continued cleanly through our biggest database table (the one that stores the individual posts), but then things ground to a halt.  Bad news.

So, I restarted the server, upgraded the kernel (which appears to have solved the initial problem), and re-optimized the database.

The Problem

When the optimization didn’t complete the first time, it marked the database table it never finished optimizing as “crashed.”  This is not a big deal: it’s a feature designed to get me to scan the database in order to ensure there are no issues, because the designers of the database software had to plan for all sorts of failures.  So I ran the proper scan as recommended.

And now, 21 hours later, the scan/optimization is still running.  I know we’re not 85% complete, but I can’t make any estimates that are more accurate than that.  It could be another 24 hours for all I know.

The mistake I made was to run the repair process with a dated configuration file.  I modified the settings the last time we had to do this, not realizing that database growth over the last few years made these settings much slower than they could have been.  The server we’re running on today is much faster than the machine that I last ran a repair on, but with 8.5 million posts in the post database it’s still enough to really bog things down.

At this point I have a couple of options:

  1. I can let things continue without knowing how long they will take.  This is the safest solution, but I really have no idea how long we’ll be down as a result.  There’s a > 5% chance that it’ll be the weekend before we’re back up.  Or it could be complete in an hour.
  2. I can stop the process, optimize the settings for the new server, and restart it.  From research I’ve performed it looks like this can speed things up by up to 25x, which would be a nice improvement.  There are two problems though.  First, the chances are at least 1:3 that even with the best settings running we’ll still need to go through the same slow process we’re watching now, so I’d be throwing away 20 hours of work by restarting.  Second, stopping the process runs a real risk of actually corrupting the database (right now it’s just marked as corrupt as far as I can tell).
  3. I can give up, drop the database, and restore from yesterday morning’s backup.  We’ll lose about 6 hours’ worth of posts, which is probably reasonable.  The problem here is that the restore is a single-threaded operation, and the database is huge.  This means a restore might take longer than we would have to wait for the current process to finish.
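
For anyone who wants to play with the numbers behind options 1 and 2, here is a back-of-the-envelope sketch in Python.  It is not a real estimate.  It assumes the scan progresses at a steady rate (which I can’t verify), reads “1:3” as one chance in three, treats the 25x figure as the best case, and ignores the corruption risk entirely.  The function names are made up for illustration.

```python
ELAPSED_HOURS = 21.0      # from the post: the scan has run for about 21 hours
BEST_CASE_SPEEDUP = 25.0  # from the post: tuned settings help by "up to 25x"
P_STILL_SLOW = 1 / 3      # assumption: "1:3" read as one chance in three


def remaining_if_we_wait(fraction_done, elapsed=ELAPSED_HOURS):
    """Hours left for option 1, assuming progress is linear in time."""
    total = elapsed / fraction_done
    return total - elapsed


def expected_if_we_restart(fraction_done, elapsed=ELAPSED_HOURS,
                           speedup=BEST_CASE_SPEEDUP, p_slow=P_STILL_SLOW):
    """Expected hours for option 2: a full rerun from scratch.

    With probability p_slow the rerun is as slow as the current one;
    otherwise it gets the full best-case speedup.
    """
    total = elapsed / fraction_done
    return p_slow * total + (1 - p_slow) * total / speedup


if __name__ == "__main__":
    # The post says we could be anywhere from 10% to 80% done.
    for fraction in (0.10, 0.40, 0.80):
        wait = remaining_if_we_wait(fraction)
        restart = expected_if_we_restart(fraction)
        print(f"{fraction:.0%} done: wait {wait:.1f} h, "
              f"restart {restart:.1f} h (expected)")
```

Under these assumptions, restarting only comes out ahead if the scan is less than roughly 64% done, and nobody can tell where it actually is.  That uncertainty, plus the corruption risk the sketch leaves out, is why waiting remains the safest choice.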

Status

We’re waiting, hoping for completion sooner rather than later.  The Posts table is being processed right now; once this is complete we’ll know we’re about 85% finished, but right now we could be 10% done as easily as we could be 80% done.  I just can’t tell without risking loss of data.

And now you know what I know.  But you probably slept better than I did.

It’ll be back up as soon as I can get it back up.  Then decisions will need to be made about whether to rearchitect things a bit…

10 Responses to September 25th: Ongoing Outage

  1. Jorg says:

    Derek,

    You may want to consider editing the “includes/database_error_page.html” to include a link to this blog so people can see updates. You could also toss in a note mentioning that the error is normal from 3:30 to 4am Eastern Time or something.

    Good luck reviving the db!

  2. skidder says:

    I know how ya feel. I maintain Debian webservers. Keep up the good work and take some benadryl… helps me sleep when things go south.

  3. God bless and good luck. Thanks for your dedication.
    Doc

  4. Richard Ballard says:

    Thanks for the info, and all your hard work!!

    Best of luck!!

    rcmodel

  5. Doug Young says:

    It’s tough being the person everyone turns to when things go wrong but whom we too often forget to thank when things are going well.

    Thank you, in arrears and in advance!

    beatledog7

  6. Robert Davis says:

    I am a web developer. I know the issues with what you are dealing with. Best of luck to you. It’s not an easy process. Rogue Coder

  7. blarby says:

    If you need folding help to get things processing faster, you can have this IP if you can utilize it.

    No coffee Derek, you’ll just kill yourself.

    Lemme know…….

  8. tom e gun says:

    Sorry to hear about the troubles, THR has become my go to firearms community for the last couple years. Best of luck in getting things in order! What type of server platform are you running on? (Microsoft, Linux, others) Also, what kind of database setup do you have? (SQL server, MySQL, Access, etc.)

  9. wgaynor says:

    Thanks for the update. I am confident it will work out and will be patiently waiting. Best of luck.

    Wes

  10. mdauben says:

    I’ve been missing my regular doses of THR but I agree with your better-safe-than-fast approach to fixing the problem.
