The Importance Of Backups

As you probably didn't notice, given that I don't think anyone reads my blog… my server has been having issues over the last several days. The saga started on Sunday evening, when I got a call from the server owner that it had locked up. It had done this before on rare occasions, and previously a quick reboot had been enough to get things back in order.

So, we trekked over to the data center, where I hooked up a monitor and keyboard and rebooted the machine. I was watching the boot messages when I saw this one:

RAID Array Status: DEGRADED

Any of you who have ever admin-ed a server can probably imagine how I felt at that moment. Yes, it's RAID, so in theory you should be able to recover from a degraded array fairly easily. But to make things worse, my only previous experience with a failing RAID array involved a catastrophic failure that took out three of the four drives in a RAID 5 array, one of which was the hot spare.

So, I got the system rebooted, and made sure it was up. But something didn't seem right… The system was still locking up occasionally, and I didn't like it. So over the course of the day on Monday, I did some investigation. I installed some of the tools I'd been meaning to set up for a while and put them to work, checking the status of the RAID array in detail and pulling the SMART reports from the drives.
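For anyone curious, the checking itself was nothing exotic. On a Linux software RAID box with mdadm and smartmontools installed, it looks something like this (the device names here are just placeholders):

cat /proc/mdstat
mdadm --detail /dev/md0
smartctl -H /dev/sda
smartctl -a /dev/sda

The first two show the state of the md array and which member has dropped out; the smartctl output is where attributes like Reallocated_Sector_Ct and Current_Pending_Sector show up.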

The prognosis was bad. Very bad. Of the three remaining drives in the array, two were reporting error status, and one of those two was reporting a critical level of reallocated sectors, meaning it was almost out of spare sectors to remap to.

Having recently seen a small thread on the PLUG mailing list about what you should back up on a Linux box, I decided it was time to revisit my backup strategy. My existing backups consisted of rdiff-backup copies of /etc and /var/www, plus a MySQL dump. But after reading the thread on PLUG, I was pretty sure I was missing some pieces, so I decided to upgrade.
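Roughly speaking, the old setup amounted to something like this; the destination paths are just illustrative, and the database credentials are omitted:

rdiff-backup root@209.90.77.126::/etc /var/backup/nimitz_etc
rdiff-backup root@209.90.77.126::/var/www /var/backup/nimitz_www
ssh root@209.90.77.126 "mysqldump --all-databases | gzip" > /var/backup/nimitz_mysql.sql.gz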

I started by changing my strategy from including directories I wanted to back up, to excluding directories I didn’t want to back up. Here’s what I ended up with:

Cron scheduled nightly:
rdiff-backup --include-filelist ~/backuplist.txt \
root@209.90.77.126::/ /var/backup/nimitz_backup

cat ~/backuplist.txt:
- /var/cache
- /var/db
- /var/log
- /var/spool/mail
- /var/tmp
- /var/lib/boinc
- /var/lib/mysql
- /var/lib/denyhosts
- /var/lib/postgresql
- /var/lib/asterisk
- /var/lib/init.d
- /var/lib/ntop
- /LC_MESSAGES
- /bin
- /dev
- /lib
- /lib32
- /lib64
- /lost+found
- /mnt
- /proc
- /sbin
- /sys
- /tmp
+ /usr/local
- /usr

Basically, the above backs up everything on my main server (named nimitz, after the aircraft carrier) except a handful of select directories. The excluded directories are generally either binaries that are part of the OS install, or non-critical temp and cache files used by applications but not important to their function. The one exception inside an excluded tree is /usr/local, which is explicitly included ahead of the /usr exclusion so it still gets backed up.
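For completeness, the "cron scheduled nightly" bit is just an ordinary crontab entry on the backup box, something along these lines (the time is arbitrary, the filelist is assumed to live in root's home directory, and passwordless SSH keys need to be in place for the pull to run unattended):

30 3 * * * rdiff-backup --include-filelist /root/backuplist.txt root@209.90.77.126::/ /var/backup/nimitz_backup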

So, with my new backup plan in hand, I fired it up to pull a snapshot of my server with the unhealthy RAID array. The initial snapshot took several hours, but it finished.
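One thing worth doing once that first snapshot finishes is making sure you can actually get data back out of it. With rdiff-backup that's along these lines; the restore target is just an example:

rdiff-backup --list-increments /var/backup/nimitz_backup
rdiff-backup -r now /var/backup/nimitz_backup/etc /tmp/restore-test-etc

The first command lists the increments in the repository, and the second restores the most recent version of /etc to a scratch directory.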

Fast-forward to Tuesday night. The worst of the three drives in the array was steadily getting worse, and I was very uneasy, so the owner and I headed over to the server with a new drive in hand to try and get the array to rebuild.

I got the drive installed and started the rebuild, but something wasn't right. The system lockups were getting more and more frequent, and the rebuild restarted after getting only 2% of the way through.
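For the record, swapping a failed member into a Linux md array is normally a pretty mechanical process, roughly like the following; the array and partition names here are assumptions, not what was actually in the box:

mdadm /dev/md0 --fail /dev/sdb1
mdadm /dev/md0 --remove /dev/sdb1
# physically swap the drive and partition it to match the other members, then:
mdadm /dev/md0 --add /dev/sdb1
watch cat /proc/mdstat

Once the new member is added, the array rebuilds in the background and /proc/mdstat shows the progress, assuming of course that the machine stays up long enough to finish.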

At this point, we made the decision to bring up the backup server with all the production data, and let the main server work on rebuilding its array with no load. This took a couple hours to get set up properly, and we headed home around two in the morning. I stayed up for a couple more hours, fine-tuning things on the backup server, making sure things worked.

But, around 3 AM, the not entirely unexpected happened. Nimitz, the main server, stopped responding again. I had been trying to pull one last incremental backup off of it, to get my backup server as up to date as possible, but unfortunately, fate had other plans.

The lesson I learned from all of the above? It's NEVER too late to improve your backup strategy. Only 24 hours before the death of my server, I chose to broaden my backups, and that effort saved me many hours of work, and almost definitely saved me from losing irreplaceable data that wasn't covered by my previous backups.
