I try to announce these outages in advance, but performance this morning was just atrocious.

I expected the outage to take about an hour, but it’s running beyond that.  I’ll post more as I learn more.

Sorry for the outage.

12:12 PM EST UPDATE: everything is still running cleanly, but slower than normal.  The cause seems to be a kernel bug that pops up when one program (like a database server) starts performing heavy file access (like a database that’s gigs in size).

I expect the maintenance to run cleanly, but once it’s complete I’m going to perform some other updates, so expect THR to come up, be slow, then go down for 5 minutes again.

Thanks for your patience and understanding.

12:31 EST UPDATE: I was overly optimistic. The database operation has stalled.

Updating the kernel now, then I’ll finish the database operation.  This will take longer because one of the database tables is now marked as “corrupt,” so I’ll need to run the repair checks as well.  I’m guessing at least 90 more minutes until we’re back up.

13:51 EST Update: It’s working hard, chugging along.  It’s just not there yet.

15:15 EST Update: Well, things still appear to be chugging along.  I’m worried about the timing here, though: most of the database maintenance generally focuses on the table that contains our posts, and that’s what I’m seeing now.  The problem is that I haven’t had to run a repair on the database in years, and the database has grown considerably since then, so I don’t know how long this should take.

I’ll let it run a bit longer, but I’m not liking what I’m seeing, even though things seem to be running cleanly.

16:45 EST Update: We’re coming up on four and a half hours for the process, and it still seems to be running cleanly.  Now we’re just waiting.

19:25 EST Update: We’re more than six and a half hours into the process and it’s still running.  I’m starting to think this may be an all-night thing.  If so, I can think of some steps to keep this from happening in the future.  Of course, that will also involve hours (tens of hours?) of downtime.


2 Responses to September 24th Outage

  1. brian says:

    Did you see the comment from that new 4chan/anonymous poster dragone pone about giving us problems? Could this problem have come from them?

  2. Derek says:

    No – as far as I can tell this is something tied to the particular kernel we were running.

