Chapter 19. Troubleshooting Hard Drives and RAID Arrays

This chapter covers the following A+ 220-1001 exam objective:

• 5.3 – Given a scenario, troubleshoot hard drives and RAID arrays.

Hard drives contain the data that we need. So, we depend on our hard drives and RAID arrays to run efficiently every day. For this to happen, the drives need to be healthy. We can keep our drives healthy by carrying out a variety of precautionary measures. But sometimes, our systems can be affected by powers outside of our control, and beyond our planning. Then, they can potentially fail—and then, we have to troubleshoot.

5.3 – Given a scenario, troubleshoot hard drives and RAID arrays.

ExamAlert

Objective 5.3 focuses on troubleshooting common symptoms of hard drives and RAID arrays, such as: read/write failure; slow performance; loud clicking noise; failure to boot; drive not recognized; OS not found; RAID not found; RAID stops working; proprietary crash screens (BSOD/pin wheel); and S.M.A.R.T. errors.

Troubleshooting Hard Drives

Hard drives will fail. It’s not a matter of if; it’s a matter of when, especially when it comes to mechanical drives. The moving parts are bound to fail at some point. Hard drives have an average warranty of 3 years, as is the case with the SATA drives used in this book. It is interesting to note that most drives last around 3 years before failing. But remember, an ounce of prevention is worth a pound of cure, or for those of you using the metric system, 28 grams and 0.45 kg (but that just doesn’t seem to roll off the tongue quite so well!). Either way, by implementing good practices, you can extend the lifespan of a hard drive. So, before we get into troubleshooting hard drives, let’s look at some examples of prevention:

• Turn the computer off when not in use: This can help the lifespan of a magnetic-based drive. By doing this, the hard disk drive is told by the operating system to spin down and enter a “parked” state. It’s kind of like parking a car or placing a record player’s arm on its holder. Turning the computer off when not in use increases the lifespan of just about all its devices (except for the lithium battery). You can also set the computer to hibernate, stand by, or simply set your operating system’s power scheme to turn off hard disks after a certain amount of inactivity, such as 5 minutes. The less the drive is in motion, the longer lifespan it will have. Of course, if you want to take the moving parts out of the equation, you could opt for a solid-state drive, as discussed later in this chapter.

• Clean up the disk: Use a hard drive cleanup program to remove temporary files, clean out the Recycle Bin, and so on. Microsoft includes the Disk Cleanup program in Windows. There are also free cleanup programs available on the Internet (just be careful what you download). By removing the “junk” from the hard drive, there is less data that the drive must sift through, which makes it easier on the drive when it is time to defragment.

• Defragment the drive: Defragmenting, also known as defragging, rearranges the data on a partition or volume so that it is laid out in a contiguous, orderly fashion. You should attempt to defragment the disk every month, maybe more often if you are a power user. Don’t worry: the operating system tells you if defragging is not necessary during the analysis stage. Over time, data is written to the drive, and subsequently erased, over and over again, leaving gaps in the drive space. New data will sometimes be written to multiple areas of the drive, in a broken or fragmented fashion, filling in any blank areas it can find. When this happens, the hard drive has to work much harder to find the data it needs. Logically, data access time is increased. Physically, the drive will be spinning more, starting and stopping more—in general, more mechanical movement. It’s kind of like changing gears excessively with the automatic transmission in your car. The more the drive has to access this fragmented data, the shorter its lifespan becomes due to mechanical wear and tear. But before the drive fails altogether, fragmentation can cause intermittent read/write failures. Defragmenting the drive can be done with Microsoft’s Disk Defragmenter, with the command-line defrag, or with other third-party programs. If using the Disk Defragmenter program, you need 15 percent free space on the volume you want to defrag. If you have less than that, you need to use the command-line option defrag -f. To summarize, the more contiguous the data, the less the hard drive has to work to access that data, thus decreasing the data access time and increasing the lifespan of the drive. While defragmenting works best on magnetic drives, it can also help with solid-state drives, but not to the same extent, or in the same way because of the design differences between the two.

Note

Be careful with defragging; it can wear out a drive if it is done too often, especially when performed on SSDs.

ExamAlert

Know how to troubleshoot failures such as read/write errors using tools; for instance, Optimize Drives/Disk Defragmenter and the defrag command.

• Leave at least 10% of the drive free: If you use up all the space of a drive, its performance and lifespan will decrease greatly. Consider leaving between 10 and 25% of the space on the drive free of data. Some manufacturers add a 10% buffer by design, and some companies have a policy that states drives should never go past 50 or 60% of capacity. This preventive measure applies to HDDs and SSDs.

• Make sure that high-performance drives have good airflow: NVMe drives (such as M.2 and PCIe-based), as well as RAID arrays, can generate a lot of heat. Be sure to have good airflow and adequate cooling, and if at all possible, don’t crowd the drives too closely together.

• Scan the drive with anti-malware: Make sure the computer has an anti-malware program installed. Also known as an endpoint protection platform, it should include antivirus and anti-spyware at the very least. Verify that the software is scheduled to scan the drive at least twice a week. (Manufacturers’ default is usually every day.) The quicker the software finds and quarantines threats, the less chance of physical damage to the hard drive.

It’s the preventive techniques that will save you time, save your users some heartache, and save your organization money.
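The free-space guidelines above (at least 10% free in general, and 15% free for Disk Defragmenter) lend themselves to a quick scripted check. The following is a minimal Python sketch, not a definitive tool; the function names and warning strings are hypothetical, and the thresholds are simply the figures quoted in this section.

```python
import shutil

MIN_FREE_PERCENT = 10.0     # general guideline from the text
DEFRAG_FREE_PERCENT = 15.0  # free space Disk Defragmenter expects, per the text


def percent_free(total_bytes, free_bytes):
    """Return free space as a percentage of total capacity."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * free_bytes / total_bytes


def free_space_warnings(total_bytes, free_bytes):
    """List the guidelines (if any) that this volume currently misses."""
    pct = percent_free(total_bytes, free_bytes)
    warnings = []
    if pct < MIN_FREE_PERCENT:
        warnings.append("less than 10% free")
    if pct < DEFRAG_FREE_PERCENT:
        warnings.append("less than 15% free for Disk Defragmenter")
    return warnings


if __name__ == "__main__":
    # Check the volume that holds the current directory.
    usage = shutil.disk_usage(".")
    print(f"{percent_free(usage.total, usage.free):.1f}% free")
    for warning in free_space_warnings(usage.total, usage.free):
        print("Warning:", warning)
```

Keeping the arithmetic in its own functions means the thresholds can be adjusted to match an organization's policy (for example, the stricter 50 or 60% capacity rule mentioned above) without touching the code that queries the disk.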

Now, let’s get into some of the problems you might encounter concerning hard drives:

• BIOS does not “see” the drive: If the BIOS doesn’t recognize the drive you have installed, you can check a few things. First, make sure the power cable is firmly connected and oriented properly. Next, make sure SATA data cables are fully seated in the ports, and weren’t accidentally installed upside down; if you find one that was, consider replacing it because it might be damaged due to incorrect installation. An OS Not Found error message, or other boot failure, could also be caused by improperly connected drives, or an erroneous BIOS boot order. Finally, check whether a motherboard BIOS update is available; sometimes newer drives require new BIOS code before the BIOS can access them.

• Windows does not “see” a second drive: There are several reasons why Windows might not see a second drive. Maybe a driver needs to be installed for the drive or for its controller. This is more common with newer hard drive technologies. Perhaps the secondary drive needs to be initialized within Disk Management. Or it could be that the drive was not partitioned or formatted. Also try the methods listed in the first bullet.

• Slow reaction time: If the system runs slow, it can be because the drive has become fragmented or has been infected with a virus or spyware. Analyze and defragment the drive. If it is heavily fragmented, the drive can take longer to access the data needed, resulting in slow reaction time. You might be amazed at the difference in performance! If you think the drive might be infected, scan the disk with your anti-malware program to quarantine any possible threats. It’s wise to schedule deep scans of the drive at least twice a week. You will learn more about viruses and spyware in Chapter 32, “Wireless Security, Malware and Social Engineering.” In extreme cases, you might want to move all the data from the affected drive to another drive, being sure to verify the data that was moved. Then format the affected drive and, finally, move the data back. This is common in audio/video environments and when dealing with data drives, but it should not be done to a system drive (meaning a drive that contains the operating system).

• Missing files at startup: If you get a message such as BOOTMGR Is Missing, the file needs to be written back to the hard drive. For more on how to do this, see Chapter 36, “Troubleshooting Microsoft Windows.” In severe cases, this can mean that the drive is physically damaged and needs to be replaced. If this happens, the drive needs to be removed from the computer and connected as a secondary (slave) drive on another system. Then the data must be copied from the damaged drive to a known good drive (which might require a third-party program), and a new drive must be installed in the affected computer. Afterward, the recovered data can be copied to the new drive.

• Other missing/corrupted files: Missing or corrupted files could be the result of hard drive failure, operating system failure, malware infection, user error, and so on. If this happens more than once, be sure to back up the rest of the data on the drive, and then use the preventive methods mentioned previously, especially defragmenting and scanning for malware. You can also analyze the drive’s S.M.A.R.T. data. S.M.A.R.T. stands for Self-Monitoring, Analysis, and Reporting Technology. It is a monitoring system included with almost all hard drives; it creates reporting data which, when enabled in the BIOS, can be accessed within the operating system. You can easily view some basic S.M.A.R.T.-based information in the Windows Command Prompt by using the command: wmic diskdrive get status. Each drive (if S.M.A.R.T.-enabled) will be analyzed: a message of OK means that Windows didn’t find any issues. Any other message, such as Pred Fail (predicted failure), Error, Degraded, or Unknown, should convince you to initiate more analysis. There are also plenty of third-party tools available that can be downloaded from the Internet and are very easy to use. The problem with S.M.A.R.T. data is that it can be unreliable at times due to lack of hardware and driver support within the third-party S.M.A.R.T. application, lack of common interpretation, and incorrectly diagnosed data. Also, a hard drive might be diagnosed as a failing drive when in reality the problem is power surges or another issue.

Note

If a file is written during a power surge (whether originating internally or externally), that file will most likely be placed on the drive in a corrupted fashion, with the associated sector being affected by the power surge. In this case, you should find out two things: (1) whether the power supply has the right capacity for the equipment in the computer, and (2) whether the proper surge suppression/power conditioning equipment is being used. If a drive is making clicking sounds or other strange noises, analysis with S.M.A.R.T. data is not recommended. See the following bullet for more information.

• Noisy drive/lockups: If your SATA magnetic disk drive starts getting noisy, it’s a sure sign of impending drive failure. You might also hear a scratching or grating sound, akin to scratching a record with the record player’s needle. Or the drive might intermittently just stop or lock up with one or more loud audible clicks. You can’t wait in these situations; you need to connect the drive to another computer immediately and copy the data to a good drive. Even then, it might be too late. However, there are some third-party programs available on the Internet that might help recover the data.
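The wmic diskdrive get status check described earlier in this list can be scripted when several machines need to be checked. The Python sketch below is illustrative only: the function names are hypothetical, it assumes the command prints a Status header followed by one line per drive, and it treats any value other than OK as worth a closer look. The wmic tool may be missing on newer Windows builds, so the parsing is kept separate from the call that runs it.

```python
import subprocess


def parse_status_output(text):
    """Return one status string per drive from `wmic diskdrive get status` output."""
    lines = [line.strip() for line in text.splitlines()]
    lines = [line for line in lines if line]        # drop blank lines
    if lines and lines[0].lower() == "status":      # drop the column header
        lines = lines[1:]
    return lines


def drives_needing_attention(statuses):
    """Return (position, status) pairs for every drive that did not report OK."""
    return [(i, s) for i, s in enumerate(statuses) if s.upper() != "OK"]


def query_windows_drives():
    """Run the command on a Windows system where wmic is available."""
    result = subprocess.run(
        ["wmic", "diskdrive", "get", "status"],
        capture_output=True, text=True, check=True,
    )
    return parse_status_output(result.stdout)
```

On a Windows machine, drives_needing_attention(query_windows_drives()) would return an empty list when every drive reports OK. The position is only the order in which the drive appeared in the output, not a disk number. Remember the caveat from the Note above: if a drive is already clicking, copy the data first rather than running diagnostics.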

As I mentioned, hard drives will fail, so it is important to make backups of your data. The backup media of choice will vary depending on the organization. It could be the cloud, a secondary system, recordable DVDs, even USB flash drives. It differs based on the scenario. In some cases, an organization might decide to back up to tape. Remember that RAID arrays are not considered to be backups. They are fault-tolerant ways of storing data. Backup and archiving go beyond the RAID array, and usually incorporate some kind of off-site storage system.
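Whether data is being rescued from a failing drive or written to backup media, the copy is only useful if it matches the original. One common way to verify a copy is to compare checksums. The sketch below is a minimal Python illustration using SHA-256; the function names are hypothetical, and a real backup tool would add logging, retries, and handling for unreadable sectors.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source, destination):
    """Copy one file, then confirm the copy hashes the same as the original."""
    shutil.copy2(source, destination)
    return sha256_of(source) == sha256_of(destination)
```

A matching hash shows that the copy is identical to what was read from the source. It cannot show that the source file was intact to begin with, so it complements regular backups rather than replacing them.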

Troubleshooting RAID Arrays

Sometimes, hardware RAID arrays will fail. They might stop working or the OS could have trouble finding them. If you see an issue like this, check whether the hard drives are securely connected to the controller and that the controller (if an adapter card) is securely connected to the motherboard. Also, if you use a RAID adapter card or external enclosure, and the motherboard also has built-in RAID functionality of its own, make sure you disable the motherboard RAID within the BIOS—it could cause a conflict. Verify that the driver for the RAID device is installed and updated. Finally, check if any of the hard drives or the RAID controller has failed. If a RAID controller built into a motherboard fails, you will have to purchase a RAID adapter card.

Intel-based RAID setups are common as part of server and workstation motherboards, and as separate RAID adapter cards. To configure Intel RAID, a technician needs to press Ctrl+I when the system first boots up, perhaps even before the BIOS on some systems. From there, the RAID array can be configured as shown in Figure 19.1.

Figure 19.1 Intel RAID Configuration Screen

In Figure 19.1 you will see that there is a RAID 1 Mirror, but that the status is “Degraded”. That means that the array is no longer complete: a member drive has failed or has been removed. The listed drive is part of a RAID 1 volume called Data, but the second drive of the mirror is missing, so the mirror is broken. (That’s because I removed it from the system to show this very error.) Look in the listed physical devices for the drive that is 931.5 GB; you will see that it is a member disk, meaning that it is part of an array. A failed array (including any RAID 0 array that loses a single drive) will either result in a loss of access to data, or, if the OS is installed to the array, the OS will not boot. A degraded RAID 1, 5, or 10 array can often keep running on its remaining drives, but it has lost some or all of its redundancy and should be repaired as soon as possible. Either way, the array would have to be repaired or the data would have to be recovered from backup and placed on a new array. Repairing a RAID array could be as simple as reconnecting the physical drives, but it could also mean reconfiguring the array within the RAID utility. Some organizations have a rule: if a RAID array fails, and it is older than 3 years, the array should be decommissioned, and a new array should be created with new drives, after which the data should be recovered from backup.
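How much trouble a missing drive causes depends on the RAID level. The Python sketch below models the general rules as a rough illustration: RAID 0 has no redundancy, RAID 1 survives as long as one mirror copy remains, RAID 5 survives one lost drive, and RAID 10 survives as long as no mirror pair loses both drives. The function name is hypothetical, and the RAID 10 pairing (drives 0 and 1, 2 and 3, and so on) is an assumption for the example; a real controller reports its own layout and status.

```python
def array_state(level, drive_count, failed_drives):
    """Classify an array as 'healthy', 'degraded', or 'failed'.

    failed_drives is a collection of zero-based drive positions.
    """
    failed = set(failed_drives)
    if not failed:
        return "healthy"
    if level == 0:
        # Striping only: any missing drive breaks the whole array.
        return "failed"
    if level == 1:
        # Mirroring: the data survives while at least one copy remains.
        return "degraded" if len(failed) < drive_count else "failed"
    if level == 5:
        # Single parity: exactly one drive can be missing.
        return "degraded" if len(failed) == 1 else "failed"
    if level == 10:
        # Assumed layout: drives (0,1), (2,3), ... form the mirror pairs.
        for first in range(0, drive_count, 2):
            if first in failed and first + 1 in failed:
                return "failed"
        return "degraded"
    raise ValueError(f"unsupported RAID level: {level}")
```

For the two-drive mirror in Figure 19.1 with one drive removed, array_state(1, 2, {1}) returns "degraded", while the same missing drive in a two-drive stripe, array_state(0, 2, {1}), returns "failed".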

Now, let’s say that our RAID functionality is indeed built into our motherboard as it is on the system shown in Figure 19.1. In order to configure a RAID array, we first have to enable RAID in the BIOS. Quite often, that is done by accessing the SATA configuration screen and changing from AHCI to RAID. If you don’t, then you won’t be able to access the RAID utility at bootup. Now take it to the next level. If someone were to reset the BIOS to defaults, then that SATA setting would revert to AHCI, rendering the RAID array useless and non-bootable, and ultimately leading to various error messages. This could also happen after a BIOS flash update. Yet another reason to know the BIOS of your systems!

Note

AHCI stands for Advanced Host Controller Interface, the default setting for SATA drives in many BIOS programs.

One way to check the status of a RAID array is to use S.M.A.R.T. For example, in Figure 19.2 you can see the S.M.A.R.T. information screen for one of the disks in a RAID 1 mirror of a NAS device. This screen gives some meaningful data that requires some analysis, but for quick peace of mind, just check the status column. OKs are good; anything else requires further attention, and could be a precursor to a RAID failure. You will also note in the figure a S.M.A.R.T. Test page where you can do additional testing of the drive, drives, or array. Just be sure to run tests of this nature off-hours!

Figure 19.2 S.M.A.R.T. Information and Status of a NAS Hard Drive

Cram Quiz

Answer these questions. The answers follow the last question. If you cannot answer these questions correctly, consider reading this section again until you can.

1. What should you do first to repair a drive that is acting sluggish?

A. Remove the drive and recover the data.

B. Run Disk Cleanup.

C. Run Disk Defragmenter.

D. Scan for viruses.

2. Which of the following are possible symptoms of hard drive failure? (Select the two best answers.)

A. System lockup

B. Antivirus alerts

C. Failing bootup files

D. Network drive errors

E. BIOS doesn’t recognize the drive

3. You just replaced a SATA hard drive that you suspected had failed. You also replaced the data cable between the hard drive and the motherboard. When you reboot the computer, you notice that the SATA drive is not recognized by the BIOS. What most likely happened to cause this?

A. The drive has not been formatted yet.

B. The BIOS does not support SATA.

C. The SATA port is faulty.

D. The drive is not jumpered properly.

4. You are troubleshooting a SATA hard drive that doesn’t function on a PC. When you try it on another computer, it works fine. You suspect a power issue and decide to take voltage readings from the SATA power connector coming from the power supply. Which of the following readings should you find?

A. 5 V and 12 V

B. 5 V, 12 V, and 24 V

C. 3.3 V, 5 V, and 12 V

D. 3.3 V and 12 V

5. You are troubleshooting a Windows Server that normally boots from a SATA-based RAID 0 array. The message you receive is “missing operating system”. As it turns out, another technician has been updating the BIOS on several of the servers in your organization, including this one. What configured setting needs to be changed? (Select the best answer.)

A. RAID 1

B. AHCI

C. S.M.A.R.T.

D. NVMe

6. What should you do first if your SATA magnetic disk begins to make loud clicking noises? (Select the best answer.)

A. Copy the data to another drive

B. Replace with a new SATA drive

C. Update the UEFI/BIOS

D. Replace the SATA power cable

Cram Quiz Answers

1. C. Attempt to defragment the disk. If it is not necessary, Windows lets you know. Then you can move to other options, such as scanning the drive for viruses.

2. A and C. System lockups and failed boot files or other failing file operations are possible symptoms of hard drive failure. Antivirus alerts tell you that the operating system has been compromised, viruses should be quarantined, and a full scan should be initiated. Sometimes hard drives can fail due to heavy virus activity, but usually if the malware is caught quickly enough, the hard drive should survive. Network drives are separate from the local hard drive; inability to connect to a network drive suggests a network configuration issue. If the BIOS doesn’t recognize the drive, consider a BIOS update.

3. C. Most likely, the SATA port is faulty. It might have been damaged during the upgrade. To test the theory, you would plug the SATA data cable into another port on the motherboard. We can’t format the drive until it has been recognized by the BIOS, which, by the way, should recognize SATA drives if the motherboard has SATA ports! SATA drives don’t use jumpers unless they need to coexist with older IDE drives. Most of today’s drives do not come with jumpers.

4. C. If you test a SATA power cable, you should find 3.3 V (orange wire), 5 V (red wire), and 12 V (yellow wire). If any of these don’t test properly, try another SATA power connector.

5. B. When the BIOS was updated, the SATA setting in the BIOS probably reverted to AHCI. That caused the RAID 0 array to be ignored, and so the OS would not boot, because it is stored on that array. The setting should be changed from AHCI to RAID (or similar name). Now, if this were a RAID 1 mirror, then a copy of the OS would be on each drive, and it might still boot (though you would probably receive a message as to the state of the mirror being degraded or broken). But with RAID 0, the OS is striped across two or more drives; all drives need to be present and accessed via RAID in order for the OS to boot. That’s one of the reasons why the golden rule for many years was to “mirror the OS, and stripe the data”. Phew! Anyway, on to the incorrect answers. RAID 1 is incorrect; there would be no option to set this, because in the scenario we are using RAID 0. S.M.A.R.T. is the monitoring system included in HDDs and SSDs. NVMe (Non-Volatile Memory Express) is the specification for non-volatile storage used by M.2 drives, PCIe card-based drives, and so on. Remember to back up any and all BIOS configurations!

6. A. Don’t hesitate! Copy the data to another drive. Afterward, update the UEFI/BIOS, replace the drive with a new one, and consider a new SATA cable while you are at it.
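As a follow-up to answer 4, multimeter readings rarely land exactly on 3.3 V, 5 V, and 12 V, so it helps to compare them against a tolerance. The Python sketch below is illustrative only: the names are hypothetical, and the plus or minus 5% window is an assumption based on the tolerance commonly specified for these rails in power supply design guides. Check the documentation for your own power supply.

```python
# Nominal SATA power voltages and wire colors, as given in answer 4.
SATA_RAILS = {"orange": 3.3, "red": 5.0, "yellow": 12.0}

# Assumption: +/-5% is the allowed deviation for each of these rails.
TOLERANCE = 0.05


def rail_in_range(nominal, measured, tolerance=TOLERANCE):
    """Return True if the measured voltage is within tolerance of nominal."""
    return abs(measured - nominal) <= nominal * tolerance


def check_readings(readings):
    """Map each wire color in readings to True (in range) or False."""
    return {
        color: rail_in_range(SATA_RAILS[color], volts)
        for color, volts in readings.items()
    }
```

For example, check_readings({"orange": 3.3, "red": 5.4, "yellow": 11.9}) flags only the red (5 V) wire as out of range, which would be a reason to try another SATA power connector.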