There are a number of issues to be resolved while I’m at the datacenter, but I expect that the total outage will last less than 2 hours.
The tasks I’ll be working on include:
- Putting new memory in one of the firewalls. I’m running a pair of firewalls in fail-over mode, and the fail-over transition works well and takes less than a second. That’s a good thing, because the primary firewall has been going down every 20 days or so since it was installed. I’ve had the motherboard swapped once, and the manufacturer now thinks swapping the memory may help. I’m willing to try that; if it keeps failing, I’ll move to a different piece of hardware. Total time on this is probably 20 minutes or so.
- I need to reboot the main switch the cluster is using, but doing this while the cluster is up and running is guaranteed to cause problems. Everything will be powered down first. Figure 5-8 minutes to make sure everything is powered down, then 2 minutes to reboot the switch, then another 8 minutes to make sure everything came back up properly. Total time should be another 20 minutes.
- The spare switches will be replaced with a single spare switch. It’s preconfigured, though, so this should be a quick swap. The only issue is that wiring paths may require that this be done while the servers are down; I’ll know more once I start.
- One of the spares will be kept in the rack, but it will be moved so that its purpose is completely clear. This makes recovery by a third party after a failure less error-prone. Again, the move itself should be fast, but it may require disconnecting other network cables, so it may also have to wait until the servers are powered down.
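The power-down and power-up checks in the switch-reboot step above boil down to "poll every host until it reaches the expected state, or give up." A minimal sketch of that loop, assuming hypothetical per-host check callables (in a real run these would wrap ping or similar):

```python
import time

def wait_for_state(checks, want_up, timeout=300, interval=5):
    """Poll until every host reports the desired state, or the timeout expires.

    checks: dict mapping a host name to a zero-argument callable that
            returns True when that host is reachable.
    want_up: True to wait for all hosts up, False to wait for all down.
    Returns True on success, False if the timeout expired first.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        pending = [name for name, is_up in checks.items() if is_up() != want_up]
        if not pending:
            return True
        time.sleep(interval)
    return False
```

For the reboot sequence this would be called twice: once with `want_up=False` before cycling the switch, and once with `want_up=True` afterward, which is roughly where the 5-8 minute and 8 minute estimates come from.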
Overall the process should be clean, and few things are likely to fail. Still, I’ll keep my fingers crossed.