I work in IT and one of my job functions is to warehouse the image files of a corporate creative department. Translated… that means I buy a lot of storage. One of the things storage admins look at is the failure rate of the disk drives that make up their SAN environments. The higher the failure rate of a particular drive, the better your chances of having a catastrophic loss… Or in other words, you're restoring from tape if you lose a lot of drives at one time!
MTBF (or mean time between failures) is a standard measurement (in hours) we use to estimate how long a disk drive runs before it fails. The other measurement we use is AFR (or the annualized failure rate), which is expressed as a percent based on the MTBF versus the amount of time the device is powered on and running. A couple of things to note… MTBF is not necessarily a device's useful life. And AFR is not meant to be applied to a single drive; rather, it is the expected failure rate of any given drive within a particular production run (population).
So what does this all mean?
Well, most vendors spec consumer-geared disk drives at about 300,000 hours MTBF. That being said, the key word in MTBF is the M (or mean). So, reading the mean loosely as a midpoint, what we're looking at is about half of the drives in a given population failing within the first 300,000 hours of use.
Translated again… and I got help on this one:
If you had 600,000 drives with 300,000 hour MTBFs, you’d expect to see one drive failure per hour. In a year you’d expect to see 8,760 (the number of hours in a year) drive failures or a 1.46% Annual Failure Rate (AFR) (Harris, 2007).
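For anyone who wants to check the math, here is a minimal Python sketch of the arithmetic in that quote. It assumes the "half the population fails within one MTBF" reading used above, and the function names are my own, made up for illustration. For comparison it also shows the common rule of thumb of dividing powered-on hours per year by MTBF, which treats MTBF as a constant failure rate and so lands on a different number.

```python
HOURS_PER_YEAR = 8760  # 365 days * 24 hours


def failures_per_hour_half_reading(population, mtbf_hours):
    """Assumption from the quote: half the population fails within one MTBF."""
    return (population / 2) / mtbf_hours


def afr_from_failures(failures_per_year, population):
    """AFR as the fraction of the population that fails in one year."""
    return failures_per_year / population


def afr_from_mtbf(mtbf_hours, powered_on_hours=HOURS_PER_YEAR):
    """Rule of thumb (a different convention): powered-on hours over MTBF."""
    return powered_on_hours / mtbf_hours


population = 600_000
mtbf = 300_000

per_hour = failures_per_hour_half_reading(population, mtbf)
per_year = per_hour * HOURS_PER_YEAR
quoted_afr = afr_from_failures(per_year, population)
rule_of_thumb_afr = afr_from_mtbf(mtbf)

print(f"failures per hour: {per_hour}")
print(f"failures per year: {per_year}")
print(f"AFR (quoted reading): {quoted_afr:.2%}")
print(f"AFR (hours / MTBF):   {rule_of_thumb_afr:.2%}")
```

The first three figures match the quote (one failure per hour, 8,760 per year, 1.46%). The rule of thumb gives 2.92% for the same MTBF, so it is worth knowing which reading a spec sheet or article is using before comparing AFR numbers.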
Realizing that this is what a manufacturer quotes as the expected life, one has to ask how that holds up in reality. Well, Google did a bit of research on this and found that their failure rate was much different from the manufacturers'. Why? Because there is no agreed definition of a failure: what a manufacturer considers a failure and what the real world expects of these devices are not the same thing.
In reality many factors will determine whether a drive should remain in production. Call it an IT admin's intuition… Call it that odd clicking sound… Call it taking forever to save a file… Oftentimes we (IT professionals) will replace a drive before it is completely unusable (or reaches the point where we can no longer retrieve data from the device). Did the drive fail? Technically no… Practically yes! If we can't rely on the drive to save and retrieve data, then it has failed for our purposes… I guess some manufacturers don't see it the same way!
Resources:
Harris, R. (2007, February 19). Google's Disk Failure Experience. Retrieved June 3, 2010, from http://storagemojo.com/2007/02/19/googles-disk-failure-experience/