Case Study: A successful recovery of a RAID5 array – one failed disk, one foreign disk, 3 online disks, and no hotspare.

PURPOSE

The purpose of this document is to provide a case study of the details of this incident. This is NOT a set of instructions that should be followed or used in troubleshooting other problems. The authors will not be held responsible for any losses that occur because of the information stated below.

We have a high degree of IT knowledge, and we are experts in everything and nothing at the same time. Our position requires that we understand everything, yet we realize that it is impossible to know everything. We do our best with what we have, and beyond that we leave it to the experts and Google Search.

HARDWARE INFORMATION

  • Dell PowerEdge T410 (Purchased October 2009)
    • SBS2008
      • Active Directory
      • Exchange
      • DNS
      • IIS
      • Terminal Server Gateway
    • PERC 6/i RAID controller – RAID5
      • 5 X 250GB – DELL branded Seagate ES2 SATA Hard Drives
        • 1000GB of usable space
      • 3 Logical drives seen by BIOS
        • DISK 1 – 250GB
          • C: Windows Installation
        • DISK 2 – 250GB
          • D: Exchange Data Store Location
        • DISK 3 – 500GB
          • E: User Folder Redirection
    • Backups
      • Acronis Backup and Recovery 2010
        • Complete Disk image – WEEKLY
        • Incremental Backup – DAILY 1AM
      • Windows Backup
        • System State / Exchange – DAILY

TIMELINE

Initial RAID Problem

  • SATURDAY AUGUST 18, 2012 – 10PM
    • Microsoft Forefront Protection (online) reports its message queue backing up; mail is not being delivered to the local server.
    • No action taken
  • SUNDAY AUGUST 19, 2012
    • 9AM
      • RDP does not connect – must go to location
    • 12PM
      • Server stopped at the RAID BIOS, reporting a failed state and a foreign disk detected
        • HDD00 – FAILED
        • HDD01 – OK
        • HDD02 – FOREIGN
        • HDD03 – OK
        • HDD04 – OK
        • HDD05 – EMPTY
      • Imported HDD02 into the RAID configuration as a foreign disk
        • Shutdown server
      • Installed new Western Digital 320GB Blue hard drive in HDD05 as “hotspare”
        • Controller “rebuilt” the failed HDD00 onto “hotspare” HDD05
        • Operation took ~1 hour
        • Shutdown server
      • Installed new Western Digital 320GB Blue hard drive – replacing HDD00
        • Controller “rebuilt” HDD00 from “hotspare” HDD05
        • Operation took ~1 hour
        • Restart server
      • At this point it appeared that the physical RAID issue was corrected.
        • HDD00 – OK
        • HDD01 – OK
        • HDD02 – OK
        • HDD03 – OK
        • HDD04 – OK
        • HDD05 – HOTSPARE

Secondary Problems Caused by Disk Corruption

  • SUNDAY AUGUST 19, 2012
    • 4PM
      • Boot into Windows Server 2008 fails repeatedly with BSOD
        • Stop error c00002e2: Directory Services could not start because…
      • The following steps were found on the internet (apologies to the original author; I don’t remember which website I retrieved this information from). A condensed command sketch follows at the end of these steps.
        • 1. Restart the server and press F8 key, select Directory Services restore mode.
        • 2. Log in with the local administrator username and password (hope you remember what you set it to!).
        • 3. Type cd \windows\system32
        • 4. type NTDSUTIL
        • 5. type activate instance NTDS
        • 6. type files
        • 7. If you encounter an error stating that the Jet engine could not be initialized, exit out of ntdsutil.
        • 8. type cd\
        • 9. type md backupad
        • 10. type cd \windows\ntds
        • 11. type copy ntds.dit c:\backupad
        • 12. type cd \windows\system32
        • 13. type esentutl /g c:\windows\ntds\ntds.dit
        • 14. This performs an integrity check (in our case, the results indicated that the Jet database was corrupt)
        • 15. Type esentutl /p c:\windows\ntds\ntds.dit
        • 16. Agree with the prompt
        • 17. type cd \windows\ntds
        • 18. type move *.log c:\backupad (or just delete the log files)
        • This should complete the repair. To verify that the repair has worked successfully:
        • 1. type cd \windows\system32
        • 2. type ntdsutil
        • 3. type activate instance ntds
        • 4. type files (you should no longer get an error when you do this)
        • 5. type info (file info should now appear correctly)
        • One final step, not sure if it’s required:
        • From the NTDSUTIL command prompt:
        • 1. type Semantic Database Analysis
        • 2. type Go
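      • For reference, the steps above condense to the command sequence below, run from the Directory Services Restore Mode command prompt. This is a sketch of what we ran, not a recommendation; the paths assume a default C:\Windows installation, and the one-line ntdsutil invocation is simply our condensation of the interactive steps.
          rem Keep a copy of the AD database before attempting any repair
          md c:\backupad
          copy c:\windows\ntds\ntds.dit c:\backupad
          rem Integrity check (this is where the Jet corruption showed up for us)
          esentutl /g c:\windows\ntds\ntds.dit
          rem Hard repair of the Jet database - a last resort that can discard data
          esentutl /p c:\windows\ntds\ntds.dit
          rem Move the old transaction logs out of the way
          move c:\windows\ntds\*.log c:\backupad
          rem Verify the database can now be opened (files, then info, should list the files)
          ntdsutil "activate instance ntds" files info quit quit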
    • 6PM
      • Can now successfully boot into Windows
      • The following items fail to start and cause numerous event log errors
        • DNS server role
          • Remove DNS Server role
          • Restart Windows
          • Install DNS Server role
          • Add back DNS configurations manually
        • 5 or 6 Automatic Services fail to start
          • I don’t remember specifically which ones; however, each of them was fixed with permission changes in the registry
          • http://techruckus.com/forum/vista-dhcp-base-filtering-service-access-denied-t-t91.html
            • Windows could not start the DHCP Client service on Local Computer.
              • Error 1079: The account specified for this service is different from the account specified for other services running in the same process.
              • To fix this error, open the Properties for the service and go to the Log On tab. Set it to use This account: Local Service or Network Service, and clear the password boxes. Most of the time Local Service will work, but I found a few that required Network Service. (A command-line sketch of this change appears at the end of this section.)
              • Now we’re getting the Error 5: Access Denied message when starting these services, or they don’t start because of a dependency.
                • Base Filtering Engine (BFE)
                  IPSec Policy Agent
                  Windows Time
                  IKE and AuthIP IPSec Keying Modules
                  DHCP Client
                  Diagnostic Policy Service
                  Network List Service
                  Network Location Awareness
                • NOTE: These services run under local machine accounts, not domain accounts! When selecting the account, change the search location from the forest/domain to the local server
                  • Regarding the Base Filtering Engine (BFE) service, we need to grant the “NT Service\BFE” account the following allow permissions on HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\BFE:
                    Query Value
                    Set Value
                    Create Subkey
                    Enumerate Subkeys
                    Notify
                    Read Control
                  • Regarding the Diagnostic Policy Service (DPS), we need to grant the “NT Service\DPS” account the following allow permissions on HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\DPS:
                    Query Value
                    Set Value
                    Create Subkey
                    Enumerate Subkeys
                    Notify
                    Read Control
                  • Also it was necessary to give the same permissions to HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\WDI\Config
                  • Regarding the Windows Firewall service, we need to grant the “NT Service\mpssvc” account the following allow permissions on HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\mpssvc:
                    Query Value
                    Set Value
                    Create Subkey
                    Enumerate Subkeys
                    Notify
                    Read Control
                  • Also it was necessary to give the same permissions to HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\services\SharedAccess
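      • For reference, the Log On account change from the Error 1079 fix above can also be made from an elevated command prompt with sc.exe instead of the Services console. This is only a sketch of the idea; which built-in account each service needed (Local Service vs Network Service) varied, and the registry permission changes themselves were made by hand in regedit.
          rem Point the DHCP Client service at the built-in Local Service account
          rem and clear the stored password (the same change as the Log On tab fix)
          sc config Dhcp obj= "NT AUTHORITY\LocalService" password= ""
          rem A few services would only start under Network Service; same pattern:
          rem   sc config <ServiceName> obj= "NT AUTHORITY\NetworkService" password= ""
          rem Confirm the configured account, then restart the service
          sc qc Dhcp
          net stop Dhcp && net start Dhcp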

    • 8PM
      • All Windows services appear to be working.
      • All aspects of network, application server, Active Directory, Exchange, and other services appear to be working correctly.
    • 12AM
      • Acronis incremental backup and Windows backup fail to run.

Work Week

  • MONDAY AUGUST 20, 2012 – FRIDAY AUGUST 24, 2012
    • No problems reported by end users.
    • Backup troubleshooting postponed till end of work week.
    • Temporary file backup created using SyncBackSE, just in case (a rough equivalent using built-in tools is sketched below)
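    • For reference, a rough equivalent of that temporary mirror using only built-in Windows tools would look something like the following. The source and destination paths here are hypothetical examples, not our actual layout.
        rem Mirror the redirected user folders to a spare machine, retrying failed files only once
        robocopy E:\Users \\sparepc\TempBackup\Users /MIR /R:1 /W:1 /LOG:C:\Temp\userfolder-mirror.log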

Residual Disk Errors Causing Backups to Fail

  • FRIDAY AUGUST 24, 2012
    • 6PM
      • Backups have been failing all week; we believe residual hard drive corruption is causing them to fail.
        • The Acronis backup identifies a single bad sector that causes the backup to fail: “Failed to read from sector 10,091,183”
        • Windows Backup runs through its check and consistently fails during the file copy stage at 4.77GB.
      • CHKDSK /F run on all Logical Drives multiple times from within Windows.
        • No issues were identified or corrected with the CHKDSK /F
        • Backups continue to fail
      • We decide that HDD02 should also be replaced, since it showed as FOREIGN on the controller during the initial problem. (At this point we were guessing.)
    • 9PM
      • Upon arrival at the location, HDD02 shows as rebuilding from “hotspare” HDD05
        • This means that HDD02 had failed and the “hotspare” had been rebuilt to take the place of the failed drive.
        • We have no record of how long the disk had been failing, when the disk failed, or how many times HDD02 had failed during the week.
      • Installed new Western Digital 320GB Blue hard drive in HDD02
        • Controller “Rebuilt” HDD02 from “hotspare” HDD05
        • Operation took ~1+ hours
        • Restart Server
      • CHKDSK /F run on all Logical Drives multiple times from within Windows.
        • No issues were identified or corrected with the CHKDSK /F
        • Backups continue to fail
    • 11PM
      • At this point we believe that there is still residual hard drive corruption, though the RAID controller should have taken care of any physical hard drive errors.
      • *** We believe a CHKDSK /R would resolve the issue, but many people in online support channels say that running a /R will permanently damage a RAID5 setup.
      • We opt for the safer solution of changing out the remaining drives before turning to more desperate measures.
  • SATURDAY AUGUST 25, 2012
    • 4PM
      • Changing out the remaining 3 hard drives will be a time-consuming process, though ultimately the safer route.
        • RAID BIOS -> Initiate replace HDD01 to “hotspare” HDD05
          • Operation took ~1 hour
          • Shutdown server
        • Installed new Western Digital 320GB Blue hard drive in HDD01
          • Controller “Rebuilt” HDD01 from “hotspare” HDD05
          • Operation took ~1 hour
          • Shutdown server
        • RAID BIOS -> Initiate replace HDD03 to “hotspare” HDD05
          • Operation took ~1 hour
          • Shutdown server
        • Installed new Western Digital 320GB Blue hard drive in HDD03
          • Controller “Rebuilt” HDD03 from “hotspare” HDD05
          • Operation took ~1 hour
          • Shutdown server
        • RAID BIOS -> Initiate replace HDD04 to “hotspare” HDD05
          • Operation took ~1 hour
          • Shutdown server
        • Installed new Western Digital 320GB Blue hard drive in HDD04
          • Controller “Rebuilt” HDD04 from “hotspare” HDD05
          • Operation took ~1 hour
          • Shutdown server
    • 10PM
      • Backups continue to fail.
      • We are faced with two choices:
        • Recovering the C: Logical Disk from the image backup which is a week old.
        • Attempting the CHKDSK /R on the C: logical disk.
      • We attempt CHKDSK /R on C:
        • CHKDSK completed in ~2 hours
        • Found 1 error during stage 4 of 5 (file verification)
    • 12AM
      • PROBLEM RESOLVED, no further actions taken.

SERVICE DELIVERY DOWNTIME

The initial RAID failure occurred during the evening of Saturday, August 18. We were aware of the problem by Sunday morning and began work on it by noon that day. We were able to get the server into working condition Sunday night, before work began Monday morning. We received no reports of problems, downtime, or corrupted files that hindered work during the week. Aside from weekend email delays, the end users experienced a fully functioning, uninterrupted work week.

  • SATURDAY AUGUST 18, 2012 10PM – SUNDAY AUGUST 19, 2012 8PM
    • All services dependent on the server unavailable.
    • Office computers -> no login, no files, no internet
    • Remote VPN Location -> no internet
    • Email -> Forefront queuing mail online for later delivery
  • SUNDAY AUGUST 19, 2012 8PM – FRIDAY AUGUST 24, 2012 9PM
    • All services working
  • FRIDAY AUGUST 24, 2012 9PM – 11PM
    • All services dependent on the server unavailable.
    • Office computers -> no login, no files, no internet
    • Remote VPN Location -> no internet
    • Email -> Forefront queuing mail online for later delivery
  • SATURDAY AUGUST 25, 2012 4PM – 11PM
    • All services dependent on the server unavailable.
    • Office computers -> no login, no files, no internet
    • Remote VPN Location -> no internet
    • Email -> Forefront queuing mail online for later delivery

STANDARD PROCEDURE

In the event of a 2 disk failure in a RAID5 array, standard procedure dictates a complete RAID5 rebuild and a restore from a whole disk backup.

This option was available to us; however, we decided it was not in our best interest because the last successful backup had run Friday at midnight. A complete restore on Sunday would have lost any data created or modified over the weekend, including files and email.

JUSTIFICATION OF OUR PROCEDURE

The array had not suffered a complete 2-disk failure; rather, it had a single failed disk and one foreign disk.

Overall the client suffered minimal downtime and end users saw zero downtime during the workweek.

More effort was perhaps expended on our part by changing out the RAID disks one at a time using the hotspare replace function; however, it was the safest option.

Using the CHKDSK /F and /R options was considered risky on a RAID5 array, and many people on the internet advise against it; however, we decided that because the logical drives are abstracted from the RAID container, running CHKDSK against them would not damage the RAID container itself.

PROBLEM ANALYSIS

The root problem could have been caused in one of 2 areas: the RAID controller or the physical hard drives.

Though I concede the controller may be the issue, my best guess is that the physical drives were the root cause of the problems. Either HDD00, HDD02, or both were in a physically failing state. It is possible that only one of the drives was failing and that its failure caused the other to become degraded. The Dell hard drive power wiring connects every other drive to the same power connector, which means that HDD00, HDD02, and HDD04 were connected to the same power lead.

We will continue to closely monitor the server and RAID configuration to ensure the controller is not at fault.
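
As a sketch of how that monitoring can be done from within Windows rather than the RAID BIOS, the Dell OpenManage Server Administrator command line can report controller, virtual disk, and physical disk state. This assumes OMSA is installed and that the PERC 6/i is controller 0; we did not use these commands as part of the recovery itself.

  rem Overall controller status and properties
  omreport storage controller

  rem State of each physical disk on controller 0 (Online, Failed, Foreign, Rebuilding, ...)
  omreport storage pdisk controller=0

  rem State of the logical/virtual disks (Ready, Degraded, ...)
  omreport storage vdisk controller=0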

CONCLUSIONS

RAID5

RAID5 without a hotspare should no longer be considered a viable redundant array for our purposes. The chance of hard drive failure is too great when the array can only tolerate a single disk failure.

A hotspare provides some padding for current RAID5 arrays. Most likely we will opt for RAID6 in future server implementations; however, it will still be susceptible to the WRITE HOLE problem.

CHKDSK /F and /R ON RAID5 CONTAINERS

In our experience, neither CHKDSK option had ill effects on a working RAID5 container when run from within Windows against the logical drives.

We ran CHKDSK /F multiple times on all 3 logical drives, including C: scheduled on reboot. CHKDSK /F appeared to find and fix multiple problems in the USN Journal every time it was run, though it neither caused new problems nor fixed the backup failures.

The CHKDSK /R was run once on C: and found one bad file in stage 4 of 5 (file verification). The time required to complete CHKDSK seemed similar to that of the same operation on a single drive.

We would foresee problems if CHKDSK were run on individual drives outside the RAID container, or if it were run while the RAID container was in a degraded, rebuilding, foreign, or otherwise non-ideal state.
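
For reference, these are the forms of CHKDSK discussed above, with the drive letters from our layout. Either switch against the boot volume has to be scheduled for the next restart, because C: cannot be locked while Windows is running.

  rem Fix file system errors only (run repeatedly on all three logical drives)
  chkdsk D: /f
  chkdsk E: /f

  rem Answer Y when prompted to schedule the check of C: for the next restart
  chkdsk C: /f

  rem Locate bad sectors and recover readable information (run once on C: at the end)
  chkdsk C: /r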

Acronis Full Disk Incremental Backup

The Acronis full disk backup was not used in the resolution of this problem, but it has been and will remain a necessary component of any server deployment.

The one factor we may change in this implementation, and all other implementations, is the frequency of incremental backups. On our current schedule we make a weekly full backup and a daily incremental backup overnight. We will consider increasing the frequency of the incremental backup to twice daily, at 12PM and 12AM.

Disk Layout and Data Separation

After this incident we realize the importance and necessity of separating the user data and Exchange data stores from the operating system onto different partitions or, better yet, onto different logical disks. This separation allows for simpler, more granular restoration from the Acronis backup images.
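
As an illustration of that separation, the data volumes can be prepared as separate logical disks presented by the controller. The diskpart script below (run with diskpart /s <scriptfile>) is only an example; the disk numbers, labels, and drive letters are hypothetical and do not describe our exact layout.

  rem Format the second and third virtual disks presented by the RAID controller
  rem as dedicated data volumes (disk numbers and labels are illustrative only)
  select disk 1
  create partition primary
  format fs=ntfs label="ExchangeData" quick
  assign letter=D
  select disk 2
  create partition primary
  format fs=ntfs label="UserData" quick
  assign letter=E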

Microsoft Exchange Server

With the availability of Office 365 hosted Exchange solutions, this incident provides additional motivation to migrate to an online solution. Removing Exchange email from the local server considerably reduces the need for high availability and uptime. The local SBS server would then be used only for local Active Directory authentication and file storage.

Offsite Backups

Offsite backups have been under consideration for about a year now. We will implement an online file backup solution as soon as possible.