Chapter 12. Rescue and Recovery

Years of Linux administration have convinced me that you learn the most about a system by repairing it when it is broken. Nothing pushes you to the limits of find arguments, dd commands, or general shell know-how like a critical system that no longer boots. I have had my share of broken systems over the years (some my fault and some not), and in this chapter I describe some of the recovery techniques I find I use over and over again.

There are three main recovery tools I describe in this chapter. The first is the recovery boot mode that is included with a default Ubuntu Server install. This mode provides the most limited set of recovery tools, as it requires a system that can at least partially boot. The second is the recovery CD mode that comes with your Ubuntu Server install CD. This option gives you all of the functionality of the recovery mode but adds extra recovery tools that you can run directly from the CD. Unfortunately, both of these tools are somewhat limited in the types of disasters from which they can recover, so the final section of the chapter will describe some recovery techniques that require a separate rescue disc. In this case I describe how to use the Ubuntu Desktop live CD for rescue, but you could use any live CD that allows you to install packages to the live CD such as Knoppix.

Ubuntu Recovery Mode

The Ubuntu recovery mode is a boot option that is included with your default server install. As you boot your system, GRUB will provide a basic prompt before it starts the boot process. After you press Shift, you will see that each kernel on your server has a recovery mode option beneath it. When you select the recovery mode, Ubuntu will start the normal boot process, but instead of launching all of the services on your system, once it completes you will be greeted with a recovery menu as shown in Figure 12-1. This menu provides you with eight options:

resume

Choose this option to continue the boot process back to your regular system. You would pick this option if you accidentally chose the rescue mode or if you had successfully completed any fixes in the rescue mode and were ready to go back to the normal system.

clean

The clean option attempts to clear up some free space on your system in case you have filled up / and can’t access the system.

Figure 12-1 Ubuntu recovery mode menu

dpkg

This option performs an apt-get update and upgrade and attempts to repair any problems you might have with half-installed packages. You might choose this option if a package did not fully install or upgrade correctly and its init script is stalling out so the system can’t boot fully. This choice could potentially fix the package problems.

fsck

This option checks all file systems for errors and attempts to repair them.

grub

If you select this option, the rescue mode updates GRUB—handy if you have accidentally trashed your GRUB configuration.

network

This option enables networking on the host, which you may want to do before you select the root mode.

root

This mode is the most useful of the options in this menu as it just drops you to a root shell on your booted server. The rest of this section focuses on what you can recover with this option.

system-summary

This self-explanatory mode provides a summary of system information.

In the rest of this section I discuss some potential rescue steps you can take once you choose the root option from the recovery menu. This drops you to a root-owned shell on the system, brings up your network (if you selected the network mode), and is a bit further along the boot process. This recovery mode requires that you can at least partially boot and mount the root file system; depending on what is broken on your system, this may not be possible. If you can’t boot into this mode and need to recover a system, move on to the Ubuntu Server Recovery CD section later in this chapter.

The rescue root shell is both limited and unlimited. It is limited in that there are no apparent automated tools to recover common problems on the system; however, it is unlimited in that you have full access to any tools already on your server. Usually you go into a rescue mode like this because your system won’t fully boot, so I cover some of the common problems you might want to fix in this mode.

File Systems Won’t Mount

The file systems in /etc/fstab generally are mounted as the system boots. If a file system won’t mount at boot, you often need to drop to a rescue shell so you can either repair the file system or correct problems in /etc/fstab so it can mount. Of course, if the problem file system is the root file system, you probably won’t even be able to get to this rescue mode, so skip ahead to the Ubuntu Server Rescue CD section.

File System Corruption

There are a number of scenarios when a file system might get corrupted through either a hard reboot or some other error. In these cases the default fsck that runs at boot might not be sufficient to repair the file system. Be sure before you run fsck on a file system that it is unmounted. You can run the mount command in the shell to see all mounted file systems and type umount <devicename> to unmount any that are mounted (except the root file system). We are assuming that since this file system is preventing you from completing the boot process, it isn’t mounted. In this example let’s assume that your /home directory is mounted on a separate partition at /dev/sda5. To scan and repair any file system errors on this file system, type

# fsck -y -C /dev/sda5

The -y option automatically answers Yes to repair file system errors. Otherwise, if you do have any errors, you will find yourself hitting Y over and over again. The -C option gives you a nice progress bar so you can see how far along fsck is. A complete fsck can take some time on a large file system, so the progress bar can be handy.
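Since fsck should only run against an unmounted file system, a quick check beforehand can save you from a mistake. The following is a minimal sketch, not part of any Ubuntu tool: the is_mounted function name is my own, and it assumes the device shows up in /proc/mounts under the same name you pass in (a file system mounted by some other alias would be missed).

```shell
# is_mounted DEVICE [MOUNTS_FILE]
# Succeeds if DEVICE appears in the first column of the mounts table.
# MOUNTS_FILE defaults to /proc/mounts; the second argument exists only
# so the function can be tried against a sample file.
is_mounted() {
    awk -v dev="$1" '$1 == dev {found = 1} END {exit !found}' "${2:-/proc/mounts}"
}

# Hypothetical usage: refuse to check a partition that is still mounted.
# if is_mounted /dev/sda5; then
#     echo "/dev/sda5 is mounted; unmount it first" >&2
# else
#     fsck -y -C /dev/sda5
# fi
```

The usage lines are left commented out so that pasting the block into a shell only defines the function and touches nothing on disk.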

Sometimes file systems are so corrupted that the primary superblock cannot be found. Luckily, file systems create backup superblocks in case this happens, so you can tell fsck to use this superblock instead. Now I don’t expect you to automatically know the location of your backup superblock. You can use the mke2fs tool with the -n option to list all of the superblocks on a file system.

Warning

Be sure to use the -n option here! Otherwise mke2fs will simply format your file system and erase all of your old data.

# mke2fs -n /dev/sda5

Once you see the list of superblocks in the output, choose one and pass it as an argument to the -b option for fsck:

# fsck -b 8193 -y -C /dev/sda5

When you specify an alternate superblock, fsck will automatically update your primary superblock after it completes the file system check.
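If you would rather not pick the backup superblock numbers out of the mke2fs output by eye, a small filter can list them one per line. This is only a sketch: list_backup_superblocks is a name I made up, and it assumes the output contains the usual "Superblock backups stored on blocks:" heading followed by comma-separated block numbers.

```shell
# list_backup_superblocks
# Reads "mke2fs -n" output on standard input and prints each backup
# superblock number on its own line.
list_backup_superblocks() {
    awk '/Superblock backups stored on blocks:/ {found = 1; next}
         found && /^[ \t]*[0-9]/ {gsub(/[ \t,]+/, "\n"); print; next}
         found {exit}' | sed '/^$/d'
}

# Hypothetical usage (note the -n, so nothing is formatted):
# mke2fs -n /dev/sda5 | list_backup_superblocks
```

Any number it prints can then be handed to the -b option of fsck.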

Fstab Mistakes or UUID Changed

Another common problem you might face is that a file system won’t mount because of a mistake in your /etc/fstab file. It might be that you migrated a file system from one partition to another and forgot to update its UUID in /etc/fstab. If this happens for the root partition, the same steps apply, but you will likely have to either run the commands from a rescue CD or edit the GRUB prompt at boot time so that the root= option points to the partition’s device name instead of the UUID.

In any case, to discover the UUID for any file system, type

# ls -l /dev/disk/by-uuid

This directory provides symlinks between UUIDs and their partitions, so it’s easy to see what is assigned where. Just make a note of the correct UUID, open /etc/fstab in a text editor, and update the UUID reference.
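To spot a stale UUID quickly, you can compare the UUID= entries in /etc/fstab against what actually exists under /dev/disk/by-uuid. The function below is a rough sketch with a name of my own choosing (missing_fstab_uuids); it assumes fstab entries use the plain UUID=value form without quotes.

```shell
# missing_fstab_uuids [FSTAB] [UUID_DIR]
# Prints every UUID referenced in FSTAB that has no entry in UUID_DIR.
# The defaults are the real locations; the arguments let you point the
# function at copies instead.
missing_fstab_uuids() {
    fstab=${1:-/etc/fstab}
    uuid_dir=${2:-/dev/disk/by-uuid}
    # Take the first field of each line that starts with UUID=
    awk '$1 ~ /^UUID=/ {sub(/^UUID=/, "", $1); print $1}' "$fstab" |
    while read -r uuid; do
        [ -e "$uuid_dir/$uuid" ] || echo "$uuid"
    done
}

# Hypothetical usage:
# missing_fstab_uuids
```

Any UUID it prints is one that fstab refers to but no partition currently carries, which makes it a likely candidate for correction.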

Problem Init Scripts

Sometimes an init script on a server stalls out. It could be that it requires a network connection that isn’t available, or it could be any sort of other problem. No matter what the problem is, if an init script isn’t written to automatically time out, when it stalls it can completely tie up the rest of the boot process. In these cases you might want to temporarily disable the init script from starting at boot time so you can fully boot the system and solve the problem.

The problem init script is likely in one of two locations. If it is a system init script, it will be located under /etc/rcS.d. Otherwise, since the default runlevel on an Ubuntu server is runlevel 2, it will likely be found under /etc/rc2.d. In either case, to disable an init script, locate it under one of these directories and then rename it so that the S at the beginning is now a D. For instance, if I was having some sort of problem with custom programs I put in my rc.local script that tied up the boot process, I would type the following to disable it:

# mv /etc/rc2.d/S99rc.local /etc/rc2.d/D99rc.local

Now I could resume the boot process normally and look into the problem init script. Once I finish, I just rename the file again and replace the D with an S.
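If you find yourself doing this rename often, it can be wrapped in a pair of small helpers. This is a sketch only; disable_init_script and enable_init_script are hypothetical names, and they do nothing more than the S-to-D rename described above.

```shell
# disable_init_script PATH
# Renames an S-prefixed entry (for example /etc/rc2.d/S99rc.local) so
# that it starts with D and is skipped at boot.
disable_init_script() {
    dir=$(dirname "$1")
    name=$(basename "$1")
    case $name in
        S*) mv "$1" "$dir/D${name#S}" ;;
        *)  echo "$name does not start with S; leaving it alone" >&2
            return 1 ;;
    esac
}

# enable_init_script PATH
# Reverses the rename: D99rc.local becomes S99rc.local again.
enable_init_script() {
    dir=$(dirname "$1")
    name=$(basename "$1")
    case $name in
        D*) mv "$1" "$dir/S${name#D}" ;;
        *)  echo "$name does not start with D; leaving it alone" >&2
            return 1 ;;
    esac
}

# Hypothetical usage:
# disable_init_script /etc/rc2.d/S99rc.local
# enable_init_script /etc/rc2.d/D99rc.local
```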

Reset Passwords

A final system problem that might put you in a rescue mode is the situation where you have forgotten your user’s password or you are taking over a system from a previous administrator and don’t know the user password. In either case it is trivial to reset the password in the recovery shell. Just type passwd along with the name of the user to reset:

# passwd ubuntu
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully

If you get an error that the authentication token lock is busy, you likely forgot to remount the file system read/write, so first type

# mount -o remount,rw /

Then run your password command.
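If you are not sure whether the root file system is currently read-only, you can check before you change anything. Here is a minimal sketch that reads the mount table; root_mount_mode is a name I invented, and it assumes /proc/mounts lists ro or rw among the options for /.

```shell
# root_mount_mode [MOUNTS_FILE]
# Prints "ro" or "rw" for the file system mounted at /. If / is listed
# more than once, the last entry wins.
root_mount_mode() {
    awk '$2 == "/" {
             n = split($4, opt, ",")
             for (i = 1; i <= n; i++)
                 if (opt[i] == "ro" || opt[i] == "rw") mode = opt[i]
         }
         END {print mode}' "${1:-/proc/mounts}"
}

# Hypothetical usage: remount only when needed.
# [ "$(root_mount_mode)" = "ro" ] && mount -o remount,rw /
```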

Once you are finished with any recovery from the root shell, you can type exit to return to the rescue menu, where you can choose to resume the boot process.

Ubuntu Server Recovery CD

While the Ubuntu recovery mode can help you fix certain problems, it requires that GRUB functions and that you can get through at least the beginning phase of the boot process. If your root file system is corrupted or GRUB stops working, you will need some other method to access and repair your server. The good news is that if you still have an Ubuntu Server install CD around, it has a built-in recovery mode. This recovery mode allows you to access the root file system as with the GRUB recovery mode, but since it boots from its own kernel and provides its own set of Linux tools, you can also use it to recover from problems with a root file system.

Unfortunately, the Ubuntu Server recovery CD has its own set of limitations. Essentially you will have access to a BusyBox shell prompt with a limited set of recovery tools. While you can certainly repair file systems and restore GRUB, if you want to do more sophisticated recovery such as deleted file recovery or partition table restoration, you will need a more advanced rescue disc that either includes tools like sleuthkit, gpart, and ddrescue or allows you to install these packages from the live CD. In this section I discuss some of the situations beyond the GRUB recovery mode where you can use the Ubuntu Server recovery CD to repair your system.

Boot into the Recovery CD

To boot into the recovery CD, set your server to boot from the CD-ROM drive and insert the Ubuntu Server install CD. After you choose a language, you will see the standard installer screen. Instead of choosing the install option, use the arrow keys to select “Rescue a broken system” and then hit Enter. This will enter into a special recovery system on the installer.

After the recovery CD boots, you will be prompted with a lot of the same questions you might have seen in a standard server install such as language, keyboard, and time zone questions. Don’t worry; this won’t install over the top of the system (note the Rescue mode title in the top left of the display). Just answer the questions up until you see the prompt to select the root partition. Ideally you will already know which partition is the root partition, but if not I suppose at this point you will need to perform some trial and error until you locate it.

After you choose a root file system, you will see the main recovery operations menu as shown in Figure 12-2. The options are pretty self-explanatory:

Execute a shell in /dev/sda1

This first option will open a shell in your root file system. Here I put /dev/sda1, but this menu will point to the partition you choose. This choice gives you essentially the same recovery options as in the GRUB recovery mode, as you can run any number of commands from inside the root file system such as package managers or other system tools.

Figure 12-2 Recovery operations menu

Execute a shell in the installer environment

The bulk of your recovery work will likely occur from this option. Choose this and you will drop to a BusyBox shell on the install CD itself. The root file system will be mounted under /target so you could potentially edit configuration files from this mode. The advantage to this mode is that it exists outside of the actual root file system, so you can do things such as run fsck on the root partition—something that wouldn’t be allowed if you had booted into the system itself.

Reinstall GRUB boot loader

One of the most common reasons why you might boot into a rescue CD is that GRUB is broken. Without GRUB you can’t boot into the system without some serious Linux kung fu. Choose this option and an automated script will reinstall GRUB onto the disk or partition of your choice. Most people tend to install GRUB on the master boot record, so when you are prompted for a location to install GRUB you will probably choose (hd0). Note that if the rescue CD can’t locate the GRUB configuration files under /boot/grub, this option won’t appear.

Choose a different root file system

This option is pretty self-explanatory. If you happened to choose the wrong root file system, this option will let you change it.

Reboot the system

Here is another self-explanatory option. Once you are finished with your system recovery, choose this option to reboot.

Tip

When you are within either of the shell environments you can type exit to return to the rescue operations menu.

Recover GRUB

I have already mentioned the “Reinstall GRUB boot loader” option from the rescue operations menu. This will reinstall GRUB to the disk or partition of your choice, but sometimes GRUB itself is installed but its configuration file is missing or corrupted. When this happens, instead of a GRUB menu at boot time, you may not see anything at all. To fix this problem, choose the menu option to execute a shell within your root partition. Once there, run

# update-grub

This option will create a new /boot/grub/grub.cfg file based on your available kernels. Once it completes, you can type exit to return to the main menu and reboot the system.

Repair the Root File System

Typically the recovery CD will attempt to mount the root file system if possible. If it can mount the root file system, then you will not be able to unmount it and run any tools such as fsck on it. Of course, if the rescue CD were able to mount the file system, you wouldn’t need to fsck it now, would you? If the root file system is corrupted and the rescue CD can’t mount it, then drop to the installer shell and run

# fsck -y /dev/sda1

Replace /dev/sda1 with the path to your root partition. If fsck complains about a bad superblock, follow the steps in the File System Corruption section under Ubuntu Recovery Mode. Otherwise, depending on how damaged your file system is, you might see fsck output the errors that it finds as it attempts to repair them.

In addition to the specific rescue steps I listed earlier, you should be able to perform all of the recovery steps from the GRUB recovery mode. Just choose the “Execute a shell in /dev/ubuntu/root” (it will replace /dev/ubuntu/root with the root partition you selected) from the recovery operations menu and follow the same steps.

Ubuntu Desktop Live CD

There are certain system rescues you need to perform that require you to boot outside of the server itself. Any system imaging, any fsck of the root partition, or any other task that needs / to be unmounted requires some sort of rescue disc. While I have already mentioned how you can use the Ubuntu Server install CD as a rescue disc, unfortunately you are limited by the tools present on that CD. There are a number of different live CDs available that provide the same set of tools, such as Knoppix and DSL, but since I assume it’s more likely you will have an Ubuntu Desktop install CD around and it doubles as a live CD, I discuss some more advanced recovery techniques you can perform from the CD.

Boot the Live CD

The first step is to boot the live CD into the standard GNOME desktop. Don’t worry if your server doesn’t have a sophisticated video card since basically everything I describe can be done from the command line.

Add the Universe Repository

Once the live CD boots into the desktop, you need to add the universe repository to its list of package repositories. All of the tools I use here come from packages in the universe repository, so either click System > Administration > Software Sources and make sure that the Community-maintained Open Source software (universe) option is checked, or open a terminal (Applications > Accessories > Terminal) and then as root edit /etc/apt/sources.list in your favorite text editor and change

deb http://us.archive.ubuntu.com/ubuntu precise main restricted

to

deb http://us.archive.ubuntu.com/ubuntu precise main restricted universe

Of course, if you are running a different live CD release than Precise, you will see some other release name here, so change the example to suit your Ubuntu live CD. Then from the same terminal run

$ sudo apt-get update

to update the list of available packages. Now you can install the tools for any of the rescue tips requiring apt-get in the rest of the chapter.
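If you prefer to script the sources.list change rather than open an editor, something like the following would do it. Treat it as a sketch under my own assumptions: add_universe is a made-up helper, it appends universe to every line beginning with "deb " that does not already mention it (which may be broader than you want), and it keeps a .bak copy of the original file.

```shell
# add_universe FILE
# Appends " universe" to each "deb " line in FILE that lacks it,
# leaving the original contents in FILE.bak.
add_universe() {
    list=$1
    cp "$list" "$list.bak" || return 1
    awk '/^deb / && !/ universe/ {print $0 " universe"; next}
         {print}' "$list.bak" > "$list"
}

# Hypothetical usage on the live CD (needs root):
# sudo sh -c '. ./add_universe.sh; add_universe /etc/apt/sources.list'
```

Only lines that lack universe are changed, so running it a second time on the edited file leaves every line as it is.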

Recover Deleted Files

It has happened to the best of us. I think every sysadmin has accidentally deleted the wrong file at one point in his or her career. For a long time I thought that once a file was deleted under Linux there was no way it could be recovered, but it turns out that’s not entirely true. When you delete a file on Linux, the file system returns those blocks to the available space. Until another file uses those blocks, the data from the old file is still there and potentially recoverable. The key is to stop writing to that file system as soon as you can once you delete a file. That way you reduce the probability that the data will be overwritten.

In this example I assume you have halted the machine with the deleted file and have booted the Ubuntu live CD. Once the CD boots and you have added the universe repository, use the package manager to install the sleuthkit package or open a terminal and type

$ sudo apt-get install sleuthkit

Sleuth Kit is a set of forensics tools to aid investigation of a break-in on a system. Recovery of deleted files is a valuable thing for a forensics investigation, and Sleuth Kit has provided a pair of tools, fls (forensics ls) and icat (inode cat), that have deleted file recovery features.

For this example we assume that you have accidentally deleted the /etc/shadow file on your root file system /dev/sda1. Because these tools copy recovered files to another file system, you need to make sure that you have enough space to store them. Since /etc/shadow is a small file, the RAM disk used by the live CD is enough to store it, but if you need to restore a large number of files, or files that take up a lot of space, you will want to attach some sort of external storage or NFS share. I store everything under /home/ubuntu/, but if you mounted a USB drive at /media/disk, for instance, just replace occurrences of /home/ubuntu with /media/disk.

The first step is to create a directory to store the fls output and any files you recover. In this example I will call the directory recovery and put it under /home/ubuntu. Once the directory is created, use the fls tool to scan /dev/sda1 for any deleted files and output the results in a text file:

$ mkdir /home/ubuntu/recovery
$ sudo fls -f ext -d -r -p /dev/sda1 > /home/ubuntu/recovery/deleted_files.txt

Since the fls command has to scan through the entire /dev/sda1 partition, it might take some time to complete, depending on the size of the drive. To get more information about each of the fls arguments, you can type man fls in a terminal to see the full manual.

Once the command completes, I can open the deleted_files.txt file in a text editor and I will see a list of files and directories like the following:

d/d * 458:   etc/skel
r/r * 2094:  etc/shadow
r/r * 5423:  etc/wgetrc

The first column tells whether the file in question is a directory (d/d) or a file (r/r). The numerical column tells which inode this particular file uses, and finally you can see the full path to the file in the final column. Since we want to restore the etc/shadow file, we need to locate and copy inode 2094. Sleuth Kit provides the icat tool for this purpose—it is like the cat command only it accepts inodes as arguments. To restore this file, I type
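On a busy file system the list of deleted files can be long, so it may be easier to look up the inode for a path than to scroll for it. The helper below is a sketch of that lookup; inode_for_path is a hypothetical name, and it assumes (as the restore script later in this section does) that fls separates the inode column from the path with a tab.

```shell
# inode_for_path FLS_OUTPUT PATH
# Prints the inode number recorded for PATH in a saved fls listing.
inode_for_path() {
    awk -F '\t' -v want="$2" '
        $2 == want {
            n = split($1, f, " ")          # f[n] looks like "2094:"
            inode = f[n]
            sub(/:$/, "", inode)           # drop the trailing colon
            sub(/\(realloc\)$/, "", inode) # drop a "(realloc)" marker
            print inode
        }' "$1"
}

# Hypothetical usage:
# inode_for_path /home/ubuntu/recovery/deleted_files.txt etc/shadow
```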

$ sudo icat -f ext -r -s /dev/sda1 2094 > /home/ubuntu/recovery/shadow

If the file is indeed recoverable, once this command completes I will see a copy of my shadow file under /home/ubuntu/recovery/shadow. Then I could mount the /dev/sda1 file system from the rescue disk and restore /etc/shadow from here. Now if you wanted to recover more than one file, either you could run this command multiple times and restore files one at a time or you could write a script to do it for you. There are a number of such scripts online, and the following is based off of a script originally found at http://forums.gentoo.org/viewtopic-t-365703.html that I then tidied up and improved:

#!/bin/bash

DISK=/dev/sda1 # disk to scan
RESTOREDIR=/home/ubuntu/recovery # directory to restore to

mkdir -p "$RESTOREDIR"
# Read the fls listing named on the command line, one entry per line
while IFS= read -r line; do
    filetype=$(echo "$line" | awk '{print $1}')  # d/d or r/r
    filenode=$(echo "$line" | awk '{print $3}')  # e.g. 2094: or 2094(realloc):
    filenode=${filenode%:}
    filenode=${filenode%(*}
    filename=$(echo "$line" | cut -f 2)          # path after the tab

    echo "$filename"

    if [ "$filetype" == "d/d" ]; then
      mkdir -p "$RESTOREDIR/$filename"
    else
      # Quote the path so names containing spaces survive
      mkdir -p "$RESTOREDIR/$(dirname "$filename")"
      icat -f ext -r -s "$DISK" "$filenode" > "$RESTOREDIR/$filename"
    fi
done < "$1"

Save this file to /home/ubuntu/restore and change the DISK and RESTOREDIR variables to match the partition you want to scan and the directory you want to restore into, respectively. Then to use the script, give it execute permissions and run it with the path to your complete list of deleted files as an argument:

$ sudo chmod a+x /home/ubuntu/restore
$ sudo /home/ubuntu/restore /home/ubuntu/recovery/deleted_files.txt

The script will then systematically go through all of the files in the deleted_files.txt file and attempt to restore them to RESTOREDIR. It will create directories as necessary as well, so once it is finished you should see a directory structure matching that of your deleted files within RESTOREDIR.

Restore the Partition Table

The partition table is a 64-byte section of the 512 bytes at the beginning of a hard drive known as the master boot record. These 64 bytes contain the settings for any primary or extended partitions you have created on the disk. It’s easy to take the partition table for granted. After all, it does take some effort to erase or corrupt it. Then again, all it would take is an fdisk command on the wrong drive to make a hard drive unreadable by your server.
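Because the table is so small, it costs almost nothing to keep a copy of it. In a classic MBR the 64-byte table sits at byte offset 446, right after the boot code, so dd can lift it out directly. The function below is only a sketch for MBR-style disks (not GPT), and save_partition_table is my own name for it.

```shell
# save_partition_table SOURCE DEST
# Copies the 64-byte partition table (bytes 446 through 509 of the
# master boot record) from SOURCE, a disk or disk image, into DEST.
save_partition_table() {
    dd if="$1" of="$2" bs=1 skip=446 count=64 2>/dev/null
}

# Hypothetical usage (needs root to read the disk):
# sudo sh -c '. ./save_partition_table.sh; save_partition_table /dev/sda /root/sda_table.bin'
```

Store the copy somewhere other than the disk it describes, since a backup on the damaged drive is of little use.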

The good news is that even if a partition table is erased, the data for each partition is still on the disk. All you need to do is figure out where each partition begins and ends and you can reconstruct the partition table and restore your data. Of course, this can be rather difficult to do manually, but Linux has a tool called gpart (short for Guess Partition) that can do the hard work for you.

The way that gpart works is to scan through the entire disk looking for sections that match the beginning or end of a certain type of partition. When it finds these sections, it makes a note of them and moves on. By the time gpart is finished, it has what it believes is a complete partition table for your disk.

Before I go into how to restore a partition table with gpart, it’s worth discussing some of gpart’s limitations. The primary limitation it has is with extended partitions. While gpart is good at finding primary partitions, extended partitions are more difficult to identify, so if your disk has extended partitions you might get incomplete results. Also, gpart sometimes can be slightly off on where a partition begins or (more often) ends. I’ve seen gpart miss the end of a partition by a megabyte or two, but since most of us build partitions back to back, typically these sorts of small errors are easy to correct manually.

To install gpart on the live CD, either use the graphical package manager to install the gpart package or open a terminal and type

$ sudo apt-get install gpart

Once gpart is installed, run it in a terminal as root and pass it the drive to scan as an argument:

$ sudo gpart /dev/sda

Of course, replace /dev/sda with the path to your device. Once gpart is done, it outputs its results to the screen but does not write anything to disk. This way you can examine its output and see if it matches what you expect. Once you approve of the output, run gpart again, only this time with the -W option so it writes its changes to disk:

$ sudo gpart -W /dev/sda /dev/sda

The -W option takes a disk to write to as an argument, which is why you see /dev/sda here twice. Once gpart is finished scanning, you will be prompted to edit its results. In my opinion the gpart editor is a bit more difficult to use than fdisk or cfdisk, so I typically write the changes to disk and then do any minor corrections with fdisk or cfdisk. Remember, you can shift around the partition table and write it to disk without directly impacting your data, so it’s OK to have gpart write an incorrect table that you then follow up and correct.

Rescue Dying Drives

If you have read Chapter 11, Troubleshooting, you will be acquainted with Smartmontools. This package can scan your hard drives and report when any of them appears unhealthy. Of course, what do you do when a hard drive is unhealthy or, worse, is so unhealthy that it will no longer mount? Usually the longer an unhealthy drive runs, the more data is lost, so you want to react quickly. Ubuntu has an excellent tool called ddrescue that you can use to create an image of a drive even if it has numerous errors.

The way that ddrescue works is to scan through a drive bit by bit. When it encounters errors on the drive, it makes a note of them and skips ahead. Since bad blocks are often in clusters, this means that ddrescue potentially skips ahead to good data. Once ddrescue finishes scanning the entire drive, it will divide and conquer the remaining bad block clusters until it has attempted to recover the entire drive. With this algorithm you have the best chance of recovering good data instead of spending all of your time trying to recover a cluster of bad blocks at the beginning of the disk, only to have the drive ultimately fail.

Note: Why Not dd?

The traditional tool that one might use to image a drive under Linux is dd. Unfortunately, dd is not ideally suited for hard drives with errors. By default when dd encounters an error, it will simply exit out of the program. While you can tell dd to ignore errors, doing so means it will simply skip that particular block and not write anything, so that you could end up with an image that is smaller than the original. These reasons, combined with the block cluster skipping algorithm and progress output, make ddrescue the better choice for this task.
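For comparison, if dd is all you have, the usual workaround is to add sync alongside noerror so that each short or unreadable block is padded with zeros and the image keeps its original offsets. The wrapper below is a sketch of that invocation; dd_image_padded is a hypothetical name and the 4096-byte block size is simply my choice.

```shell
# dd_image_padded SOURCE DEST
# Images SOURCE to DEST, continuing past read errors (noerror) and
# padding every short or failed block with zeros (sync).
dd_image_padded() {
    dd if="$1" of="$2" bs=4096 conv=noerror,sync
}

# Hypothetical usage:
# sudo sh -c '. ./dd_image_padded.sh; dd_image_padded /dev/sda1 /media/disk/sda1_dd.img'
```

One side effect of sync is that the final block is padded too, so the image can come out slightly larger than the source. This approach also has none of the skip-ahead or resume behavior of ddrescue.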

To install ddrescue on your Ubuntu live CD, either install the gddrescue package (the Ubuntu package that provides the GNU ddrescue command) using the graphical package manager, or open a terminal and type

$ sudo apt-get install gddrescue

Before you image a dying drive, make sure that you can store it somewhere. The ddrescue tool can image a hard drive or partition to either another hard drive or a file, but you need to make sure that the other device is equal to or greater in size than the drive you are imaging. The great thing about this is that you don’t even necessarily need to connect extra storage to your server. If you have an NFS server with enough capacity, you can mount the NFS share on your live CD and have ddrescue image to that. For this example I assume that you want to image one partition, /dev/sda1, on your server, you have attached an external USB drive to the server, and the desktop has found it and mounted it under /media/disk. To image the drive you simply run ddrescue and list the drive to image and the location to image to as arguments:

$ sudo ddrescue /dev/sda1 /media/disk/sda1_image.img /media/disk/sda1_image_logfile

Replace /dev/sda1 with the partition or complete drive you want to image, and /media/disk/sda1_image.img with the mount point and file you want to image to. If you wanted to image from /dev/sda1 to /dev/sdb1, you would just replace /media/disk/sda1_image.img with /dev/sdb1. Notice that I added a third argument, /media/disk/sda1_image_logfile. The third argument tells ddrescue where to store an optional log file of its progress. With this log file in place you can actually stop ddrescue at any time, and when you start it again it can resume where it left off.

The great thing about ddrescue is that it provides you with a nice progress bar so you can keep track of how much longer it has to go. That, combined with its resume feature, means if you do need to interrupt it for some reason, you know you can go back and complete the job later.
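Since the destination must be at least as large as what you are imaging, it is worth confirming the free space before you start a long run. The following is a rough sketch: has_room_for is a name I made up, it relies on the POSIX output format of df, and it assumes the mount point name contains no spaces.

```shell
# has_room_for BYTES DIRECTORY
# Succeeds if the file system holding DIRECTORY has at least BYTES free.
has_room_for() {
    need_kb=$(( ($1 + 1023) / 1024 ))
    avail_kb=$(df -Pk "$2" | awk 'NR == 2 {print $4}')
    [ "$avail_kb" -ge "$need_kb" ]
}

# Hypothetical usage: blockdev --getsize64 reports a device size in bytes.
# size=$(sudo blockdev --getsize64 /dev/sda1)
# has_room_for "$size" /media/disk || echo "not enough space on /media/disk" >&2
```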

Note: Image Drives or Partitions?

You may have noticed that in my example I chose to image a single partition instead of the entire drive. I did this because partition images are much easier to fsck and mount loopback when you image to a file. Generally speaking, it’s much simpler if you image each partition on a disk one at a time, especially if you image to a file. If you plan to image directly to another drive, then image the entire drive since you can then easily access each partition individually.

Once ddrescue completes, check the image you have created for any file system errors by running fsck on it:

$ sudo fsck -y -C /media/disk/sda1_image.img

Once fsck has completed, you can mount the image loopback and recover any files you need from the disk or, alternatively, you can use a tool like dd to copy this image to yet another drive. To mount the drive loopback, I create a temporary mount point at /mnt/temp and mount the drive there:

$ sudo mkdir /mnt/temp
$ sudo mount -o loop /media/disk/sda1_image.img /mnt/temp

From here I can copy particular files from /mnt/temp to some other storage or otherwise just confirm that the data on the drive is intact. Later I can use a regular imaging tool like dd or even rsync to copy the data from this file back to a partition.