Posted by Rod Howell on 11/1/2008, 11:05:16
The incident began due to hardware failures in the RAID setup.
RAID stands for "Redundant Array of Independent Disks" and is a technology
that uses two or more hard disk drives together to achieve greater
reliability and performance.
The website is stored twice across the RAID system, on different hard
drives. If one of the hard drives fails, the website continues to run.
The failed hard drive is replaced, and the data that was on it is copied
again from the other drives within the RAID. This is known as rebuilding the
RAID, and it normally happens seamlessly, without any effect on the web hosting
server or the website. Rebuilds are a daily task in the data centers and are
standard for large data storage systems such as those used in the web hosting
environment.
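To make the mirroring and rebuild steps concrete, here is a minimal sketch in Python. It is a toy model, not the hosting provider's actual setup: the class name MirroredArray and its methods are made up for illustration, and each "drive" is just a dictionary held in memory.

```python
class MirroredArray:
    """Toy two-way mirror: every block is written to every healthy drive."""

    def __init__(self, num_drives=2):
        # Each drive maps block number -> data; None marks a failed drive.
        self.drives = [{} for _ in range(num_drives)]

    def write(self, block, data):
        # Store the same block on every drive that is still healthy.
        for drive in self.drives:
            if drive is not None:
                drive[block] = data

    def read(self, block):
        # Serve the read from the first healthy drive.
        for drive in self.drives:
            if drive is not None:
                return drive[block]
        raise IOError("no healthy drives left")

    def fail(self, index):
        # Simulate a hardware failure of one drive.
        self.drives[index] = None

    def rebuild(self, index):
        # Swap in a new drive and copy every block from a healthy mirror.
        source = next(d for d in self.drives if d is not None)
        self.drives[index] = dict(source)


array = MirroredArray()
array.write(0, "index.html")
array.fail(0)
still_served = array.read(0)       # read succeeds from the surviving drive
array.rebuild(0)
rebuilt_copy = array.drives[0][0]  # the new drive holds the data again
```

In this model the rebuild is a plain copy from the surviving drive. That also shows why a rebuild that goes wrong is so damaging: the copy is the only step that restores redundancy.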
In this instance, the failed drive was replaced with a new drive and the RAID
started to rebuild. While this was happening the rebuild process failed, corrupting
all the data within the RAID set. A rebuild failure should not occur.
However, the system administrators do not rely on the RAID system as the only
source of backup. They run a rolling backup of the live system to external backup
servers to ensure that in a case like this they have a restore solution.
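As a rough illustration of the rolling backup idea, the Python sketch below keeps only the most recent few snapshots and restores from the newest one. The class name RollingBackup, the keep parameter, and the snapshot labels are assumptions made for this example. The post does not say how the real backups are scheduled or stored.

```python
from collections import deque


class RollingBackup:
    """Toy rolling backup that keeps only the newest `keep` snapshots."""

    def __init__(self, keep=3):
        # A bounded deque drops the oldest snapshot automatically.
        self.snapshots = deque(maxlen=keep)

    def take(self, label, live_data):
        # Copy the live data so later changes do not alter the snapshot.
        self.snapshots.append((label, dict(live_data)))

    def restore_latest(self):
        # Return a copy of the most recent snapshot.
        label, data = self.snapshots[-1]
        return label, dict(data)


live = {"index.html": "v1"}
backups = RollingBackup(keep=2)
backups.take("mon", live)
live["index.html"] = "v2"
backups.take("tue", live)
live["index.html"] = "v3"
backups.take("wed", live)      # "mon" is dropped, only "tue" and "wed" remain
live["index.html"] = "v4"      # change made after the last backup
live.clear()                   # simulate the live data being corrupted
label, restored = backups.restore_latest()
kept_labels = [name for name, _ in backups.snapshots]
```

Anything written after the newest snapshot (the "v4" change here) is not in the backup, so more frequent snapshots shrink the window of possible loss.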
After the RAID corruption occurred, the engineers analyzed the situation and found
that the only remaining solution was to recover the data from the backup systems.
At this point the RAID was reinitialized so it was ready to receive data. This
process itself took several hours. Once the RAID was reinitialized, the system
administrators restored the data from the rolling backup files, which took several days.
The Data
---------
After restoring the data they turned on the services that deliver the website to
the Internet. A small amount of data loss may have occurred, and has been replaced.
A situation like this is extremely rare. They are working on measures to increase the frequency
of rolling backups and other security measures to make sure the data will stay safe.