I’m posting this here not because it’s relevant to THR’s userbase, but because Google doesn’t seem to turn up any results related to this problem.
For years I have periodically pulled backups from my colocated servers to an off-site location using a program called Rsnapshot. It has been reliable the whole time, but last week it started partially failing. Running it with the -V option (verbose messages, so I could watch it from the command line), I got this:
receiving incremental file list
Write failed: Broken pipe
rsync: connection unexpectedly closed (756 bytes received so far) [receiver]
rsync error: error in rsync protocol data stream (code 12) at io.c(601) [receiver=3.0.7]
rsync: connection unexpectedly closed (692 bytes received so far) [generator]
rsync error: unexplained error (code 255) at io.c(601) [generator=3.0.7]
rsnapshot encountered an error! The program was invoked with these options:
/usr/bin/rsnapshot -V hourly
I could tell something was wrong, but the message itself was not very helpful.
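The numeric codes in that output can at least be looked up. Here is a small Python sketch that pulls them out of a log and checks them against a partial table transcribed from the EXIT VALUES section of the rsync(1) man page. Code 255 isn't an rsync-defined value. When rsync runs over SSH it usually points at the transport (ssh itself exits with 255 on error) rather than at rsync, but treat that as a hint, not a diagnosis.

```python
import re

# Partial subset of the exit codes listed in the rsync(1) man page (EXIT VALUES).
RSYNC_EXIT_CODES = {
    0: "Success",
    1: "Syntax or usage error",
    2: "Protocol incompatibility",
    3: "Errors selecting input/output files, dirs",
    5: "Error starting client-server protocol",
    10: "Error in socket I/O",
    11: "Error in file I/O",
    12: "Error in rsync protocol data stream",
    23: "Partial transfer due to error",
    24: "Partial transfer due to vanished source files",
    30: "Timeout in data send/receive",
    35: "Timeout waiting for daemon connection",
}

# Matches lines such as "rsync error: ... (code 12) at io.c(601) [receiver=3.0.7]".
CODE_RE = re.compile(r"rsync error: .*?\(code (\d+)\)")


def explain_rsync_errors(log_text):
    """Return a (code, meaning) pair for every 'rsync error' line in log_text."""
    results = []
    for match in CODE_RE.finditer(log_text):
        code = int(match.group(1))
        meaning = RSYNC_EXIT_CODES.get(
            code,
            "Not an rsync-defined code (255 often comes from the ssh transport)",
        )
        results.append((code, meaning))
    return results
```

Fed the two "rsync error" lines above, this returns code 12 (protocol data stream) and code 255 (not rsync's own), which is a nudge toward looking at the connection rather than at rsnapshot.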
Googling suggested that this error (and the timeout associated with it) can happen when the Rsnapshot server runs out of memory because the number of files being backed up (not their size) is too large. In this instance, though, I was only trying to back up a dozen files.
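Because that explanation hinges on how many files a backup point contains rather than how big they are, it helps to measure both before chasing it. A minimal Python sketch, assuming you can run it against the source directory; it counts regular files only and does not follow symlinks.

```python
import os
import stat


def tree_stats(root):
    """Return (file_count, total_bytes) for regular files under root.

    os.walk does not descend into symlinked directories by default, and
    lstat plus S_ISREG skips symlinks to files as well.
    """
    file_count = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # vanished or unreadable; skip it
            if stat.S_ISREG(st.st_mode):
                file_count += 1
                total_bytes += st.st_size
    return file_count, total_bytes
```

For example, `tree_stats("/var/www/attachments")` (a hypothetical path) would show whether a folder is large because of a few big files or a great many small ones.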
Attempted Fix #1: Reduce File Size
I noticed that around the time this started, my database dumps had grown to just over 6 gigabytes. Modifying the backup script to gzip the dumps reduced their size considerably, and rsnapshot was again pulling my database backups off-site.
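The backup script itself isn't shown here, so this is only a sketch of the compression step, assuming the dump is first written to a file. It streams the data in chunks, so a multi-gigabyte dump never has to fit in memory.

```python
import gzip
import shutil


def gzip_file(src_path, dest_path=None, level=6):
    """Stream-compress src_path to dest_path (default: src_path + '.gz').

    shutil.copyfileobj copies in fixed-size chunks, keeping memory use flat
    regardless of how large the dump is.
    """
    if dest_path is None:
        dest_path = src_path + ".gz"
    with open(src_path, "rb") as src, gzip.open(dest_path, "wb", compresslevel=level) as dst:
        shutil.copyfileobj(src, dst)
    return dest_path
```

Calling `gzip_file("db_dump.sql")` (a hypothetical file name) writes `db_dump.sql.gz` next to it; the original can then be removed so rsnapshot only picks up the compressed copy.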
The problem was that I was still getting errors when trying to back up larger folders. Attachments, for instance, are stored in the file system, and we have accumulated a lot of them over the last decade. These were no longer being backed up successfully.
Attempted Fix #2: Add More Memory
The backup server I’m using is long on storage (6 TB right now, though I’ve still got room to expand) and short on memory (2 gigs initially). It seemed plausible that the lack of memory had become a problem, though I had my doubts. Still, maxing out the system’s memory cost less than $300, so I gave it a shot.
After I installed the memory, the problem persisted. I won’t say the money was wasted, but it didn’t fix anything.
Attempted Fix #3: Update rsync
The backup server and the production servers were running different Linux distributions, and the rsync versions were close but different (3.0.6 and 3.0.7 respectively). I downloaded and installed the most recent version and tried again. Still no go.
Test and Reevaluate
I installed rsnapshot on a spare server in the data center and ran it. Everything ran cleanly as expected.
I realized that my off-site firewall had failed around the time this started. (Note: Netgate makes some neat embedded systems, but two of the three I purchased have failed within 15 months. Great idea, but I’m starting to doubt the execution of these products and can’t recommend them, even though I love pfSense as a firewall and have used it on and off alongside commercial firewall appliances for years. Thank God for effective fail-over via CARP.)
Changing permissions on the new firewall didn’t help either. Rsync was running over SSH, and the SSH connections themselves stayed up as I’d expect, but rsync over SSH still timed out. Out of desperation I installed BackupPC to rule out some unidentified problem in my Rsnapshot configuration, but I was still seeing errors.
The Source of the Problem
Here’s a screenshot of a setting I’d previously ignored:
I’d seen it many times before and never had to touch it, but the option to ignore the “Don’t Fragment” flag in the IP header was turned on by default. Turning it off got my backups running properly again, and the network interface on my backup machine now showed backup traffic maxing out my Internet connection.
In my case, this firewall setting was blocking rsync backups of multiple production machines over a VPN. That’s it. It’s simple to see in hindsight, but a decade of managing firewalls for myself and clients (on more than seven different firewall products) had made me blind to this particular IPSec setting. Backups are now running cleanly again.
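The post doesn’t go into why the flag mattered, so what follows is a general diagnostic, not the method used here. A VPN tunnel adds encapsulation overhead, so full-size packets with the Don’t Fragment bit set may no longer fit, and how the firewall handles that bit decides whether they get through. One way to see what actually fits across a tunnel is to probe with DF-flagged pings and binary-search the largest payload that gets a reply. This sketch assumes a Linux host with iputils ping (`-M do` sets DF; flags differ on BSD and macOS).

```python
import subprocess

# 20-byte IPv4 header (no options) + 8-byte ICMP header.
IP_ICMP_HEADER_BYTES = 28


def ping_fits(host, payload_size):
    """Send one DF-flagged ping (Linux iputils syntax) and report whether it came back."""
    result = subprocess.run(
        ["ping", "-M", "do", "-c", "1", "-W", "2", "-s", str(payload_size), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def largest_payload(fits, lo=0, hi=1500 - IP_ICMP_HEADER_BYTES):
    """Binary-search the largest size in [lo, hi] for which fits(size) is True.

    Assumes fits is monotonic (once a size fails, every larger size fails).
    Returns None if even lo fails.
    """
    if not fits(lo):
        return None
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if fits(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo
```

Something like `largest_payload(lambda s: ping_fits("192.0.2.10", s))`, where 192.0.2.10 is a placeholder for a host on the far side of the tunnel, gives the largest payload; adding 28 gives the usable IPv4 packet size. A result noticeably below 1472 means full-size DF packets are not making it across the tunnel.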
For now, though, I’m sitting here waiting for the backups to finish in BackupPC rather than going back to rsnapshot. The reason is reporting:
As you can see, the backups are incomplete, but being able to check a single web console as part of my daily monitoring is a huge step forward in staying on top of backups proactively. Previously I had to check the contents of my backup folder manually, and I’m bad about doing that even once a week. The added control this offers is huge.
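BackupPC’s console is the route taken here. For anyone staying on rsnapshot, a small script run from cron could at least replace the manual folder check by flagging snapshots that are missing or stale. This sketch assumes rsnapshot’s usual interval.N directory naming and treats a directory’s modification time as a rough marker of the last completed run; the thresholds are hypothetical.

```python
import os
import time

# Hypothetical limits: hours each interval's newest snapshot may age before it is flagged.
DEFAULT_MAX_AGE_HOURS = {"hourly.0": 6, "daily.0": 26}


def stale_snapshots(snapshot_root, max_age_hours=None):
    """Return snapshot directory names under snapshot_root that are missing or too old."""
    if max_age_hours is None:
        max_age_hours = DEFAULT_MAX_AGE_HOURS
    now = time.time()
    stale = []
    for name, limit in max_age_hours.items():
        path = os.path.join(snapshot_root, name)
        try:
            age_hours = (now - os.path.getmtime(path)) / 3600
        except OSError:
            stale.append(name)  # a missing or unreadable snapshot counts as stale
            continue
        if age_hours > limit:
            stale.append(name)
    return stale
```

Pointing it at your snapshot_root (commonly `/.snapshots` in rsnapshot.conf) and mailing any non-empty result would surface a silently failing backup within a day instead of whenever someone gets around to looking.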
Either way, though, the problem was resolved by a single firewall setting.
If you found this page via Google because you were facing the same sort of issue, then I hope this information was useful.