We were down for a couple of hours this morning. From the firewall log:
Here is what happened: our primary firewall failed around 7am, and the secondary firewall took over, but only partially. Servers behind the firewalls were still reachable, but for a reason we have not yet diagnosed, requests to THR were no longer going through the load balancer as they should have been.
To correct the problem I rebooted the firewalls and a switch they attach to. Losing that network switch confused the servers that house THR, and recovering them required 90 minutes of downtime that would otherwise have been avoidable.
Corrective actions at this point include:
- Swapping the firewalls, so that the current backup becomes the primary. The current primary has failed four times now, and swapping the hardware will tell us whether the problem lies with that unit itself, or instead with the firmware and/or current settings.
- Moving the (now identified as critical) network switch to a different power outlet, so that future firewall restarts cannot trigger this sort of cascading failure.
- Diagnosing why the firewall fail-over was partial, rather than complete.
There are a lot of moving parts here that have to work together. I should be able to fix the problems identified this morning; after that we will watch to see whether the system stays healthy, or whether these fixes help us discover another weakness in the system.
I’ll keep you all informed.