
A hard drive crash

Wednesday, June 1st, 2005

“Pumpkin, I can’t get my mail.” That was the cry for help from Chris which prompted me to investigate the status of my POP3 server.

The strange thing was that my POP3 server would not start up. I restarted inetd to see if that was the problem. It wasn’t. I attempted to connect to port 110 while checking the logs, and noticed a strange error: “I/O error.” Whatever that meant, it was bad. Further checks of the log files revealed a series of “DriveReady SeekComplete Error” messages followed by “UncorrectableError” from the dma_intr function in the IDE driver of the Linux kernel. This could only mean one thing: a corrupt disk!
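For anyone who wants to spot this kind of failure sooner, here is a minimal sketch of a log check for the messages quoted above. It is not what I ran at the time. The exact line layout (a device name such as hda, then "dma_intr:", then a status= or error= field with the flags in braces) and the sample lines are assumptions for illustration, and find_disk_errors is a made-up helper name.

```python
import re

# Assumed layout of the IDE driver messages, for example:
#   hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
#   hda: dma_intr: error=0x40 { UncorrectableError }
IDE_MESSAGE = re.compile(
    r"(\w+): dma_intr: (?:status|error)=0x[0-9a-fA-F]+ \{ ([^}]*)\}"
)


def find_disk_errors(lines):
    """Count dma_intr lines whose flags mention an error, per device."""
    counts = {}
    for line in lines:
        match = IDE_MESSAGE.search(line)
        if match and "Error" in match.group(2):
            device = match.group(1)
            counts[device] = counts.get(device, 0) + 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample lines; a real check would read the kernel log file.
    sample = [
        "kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }",
        "kernel: hda: dma_intr: error=0x40 { UncorrectableError }",
        "inetd: pop3 service started",
    ]
    for device, count in find_disk_errors(sample).items():
        print(f"{device}: {count} error line(s)")
```

Run from cron against the kernel log, something like this could send a warning before the mail stops working.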

Luckily, I had a spare 120 GB hard drive lying around waiting to be used. So I shut down the server and proceeded to perform an operation to install the new hard drive alongside the corrupt one. The operation went smoothly and was a success. The system recognized the new drive and proceeded to boot.

The system boot, on the other hand, did not go as smoothly. In fact, it barely went at all. The kernel module loader (/sbin/modprobe) had been corrupted. So although the system was running, it could not load any of the drivers for the network, floppy drive, CD-ROM drive, file systems, PCI cards, and pretty much everything else. The system was almost useless.

I booted to single user mode. The root disk failed to mount since it had inconsistencies which fsck could not fix. At least the disk was readable to some extent. I then partitioned the new drive to match the partitioning scheme of the old drive, created file systems on the new partitions, and copied the data from the partitions on the old drive to the new one. File system after file system was copied without errors. I breathed a sigh of relief as I discovered that all user data was safe. The drive crash had only corrupted one partition, the root partition. I copied as much as I could from the old root partition to the new one.
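The "copy as much as I could" step can be sketched as a copy that keeps going when a file is unreadable and records what it had to skip. This is an illustration under assumptions, not the commands I actually used: salvage_tree is a hypothetical helper, and it handles only regular files and symlinks to files, not device nodes, ownership, or symlinked directories.

```python
import os
import shutil


def salvage_tree(src, dst):
    """Copy every readable file from src to dst; return the paths that failed."""
    failed = []

    def note_walk_error(exc):
        # Called by os.walk when a directory cannot be listed.
        failed.append(exc.filename)

    for root, _dirs, files in os.walk(src, onerror=note_walk_error):
        target_dir = os.path.normpath(
            os.path.join(dst, os.path.relpath(root, src))
        )
        os.makedirs(target_dir, exist_ok=True)
        for name in files:
            source = os.path.join(root, name)
            try:
                # copy2 keeps timestamps and permission bits where it can.
                shutil.copy2(
                    source, os.path.join(target_dir, name), follow_symlinks=False
                )
            except OSError:
                # An I/O error on a bad sector lands here; skip and move on.
                failed.append(source)
    return failed
```

Calling salvage_tree with the mount points of the old and new root partitions would return the list of files that still need to be restored some other way, such as from the distribution's packages.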

Now that I had all the files I could possibly recover, how would I convince the system to boot from the new drive? My first attempt to configure lilo to boot from it failed. After some configuration adjustments, I was able to run lilo to install the boot block on the drive. I needed to test this. I opened the computer again, removed the old drive, and installed the new drive as the first drive on the system. I turned the system on, but it didn’t boot. “How do I get lilo to install a boot block?” I asked myself. The hunt for a bootable Linux floppy or CD turned up empty.

After reconfiguring the Linksys router to connect to my ISP, and the Mac to route through the Linksys, I was able to download a CD image of the Debian installer. This bootable CD gave me a window into my system with enough functionality that I could work on restoring the corrupted system files.

Several reboots later, I managed to re-install the corrupted Debian packages and get the system up and running again.

Total down time: approximately 3 hours.

Funny, this isn’t my first incident with a failed hard drive on my server. It seems like every three or four years, I go through this process. Each one is unique in its own way, with its unique challenges. One would think I would have processes and procedures in place to prevent these types of disasters, like a backup process. But I don’t. Will I ever learn?

-- Posted in Geeks Paradise