To get your attention, I thought I would start with this statement from the abstract of a new study of solid-state storage devices (SSDs) done by researchers at Ohio State University:
“Our experimental results reveal that thirteen out of fifteen tested SSD devices exhibit surprising failure behaviors under power faults, including bit corruption, shorn writes, unserializable writes, meta-data corruption and total device failure.”
If you are a technical person using or considering the use of SSD devices in your production environment, you might want to download this study.
If you are a CTO or IT manager, you might only need to know what impact this study might have on your disaster recovery plans. If so, read on.
Implications for Disaster Recovery
Does this study mean that you shouldn’t use SSD storage? No, it does not. It only means that you cannot rely on an SSD device any more than you can rely on a spinning platter device. This study, disturbing as it seems at first glance, does not really mean very much in terms of your disaster recovery plans unless those plans are built on the false assumption that certain types of hardware devices won’t fail. The practical importance of this study might be that it spurs you to re-examine your disaster recovery plan.
It is common to think that because SSD devices have no mechanical moving parts, they are less susceptible to failure than spinning platter devices. But this study shows that SSD devices can and do fail. In fact, it suggests that high-end spinning platter devices might be more reliable than SSD devices. Whether you use spinning platter or SSD devices, you should examine your disaster recovery plans to make sure that they take into account the possibility of sudden, total loss of your data on disk due to the failure of any hardware device.
High Availability
Disaster Recovery and High Availability are two different things that are often conflated. Note that this article differentiates between them. High availability certainly depends on device reliability, but the consequences of an availability failure are far less critical than the loss of your data. Complete hardware redundancy and automated failover get you as close as possible to 24/7 availability, but they are expensive, hard to achieve and fallible in terms of data safety.
The users of your database will forgive you if a device failure takes you off-line for a while, even a long while. But if you lose all of their data, they will not be kind to you.
No Power Faults in MY Server Room
Since the SSD failures in this study all occurred under power faults, you might think that you are immune to this kind of damage because you have a top-of-the-line UPS. But consider that Amazon Web Services spent a fortune equipping their data centers. Despite everything that Amazon could do, two of those data centers experienced serious power faults. Have you deployed more reliable power technology than Amazon?
More About the Study
The study included both solid state disks and memory cards. If you are interested in knowing which brands and models of the fifteen SSD devices failed, or more importantly, the two that did not fail, you won’t find that information in this study. The study included fifteen models from five different vendors but neither the vendors nor the models are named. The researchers categorized the devices only as “low end”, “high end” or “Enterprise class” according to the price-per-megabyte. Failures occurred in all of those groups regardless of price.
For comparison, the researchers also ran two spinning platter devices through the same tests. The low-end drive failed but the high-end spinning platter drive did not fail.
This is too little data to make unqualified statements about the results of the comparison, but it does suggest that some high-end spinning platter devices might be more reliable than some SSD devices.
The Bottom Line
Device reliability is certainly relevant to Disaster Recovery. High-quality hardware can significantly reduce the risk of data loss, but it cannot eliminate that risk. Although it makes sense to use the most reliable hardware you can afford, the bottom line is that your disaster recovery plans cannot depend on the reliability of any piece of hardware, whether it be a disk, SSD, UPS or anything else.
Frequent backups and redundant copies of the data in multiple locations are the crucial components of any good Disaster Recovery Plan. They will get you as close as possible to guaranteed data safety.
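For the technically inclined, here is a minimal sketch of the "redundant copies" idea: copy a backup file to several destinations and confirm that each copy matches the original by checksum. The function names (`sha256_of`, `replicate_backup`) and the destination directories are hypothetical, chosen only for illustration. The sketch assumes Python 3 and destinations that are reachable as ordinary file paths. Note also that asking the operating system to flush a file is no guarantee that a device will preserve it through a power fault, which is exactly the kind of failure the study describes.

```python
import hashlib
import os
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replicate_backup(source, destinations):
    """Copy `source` into each destination directory and verify every copy.

    Returns a dict mapping each copied file's path to True if its checksum
    matches the checksum of the source file, False otherwise.
    """
    source = Path(source)
    expected = sha256_of(source)
    results = {}
    for dest_dir in destinations:
        dest_dir = Path(dest_dir)
        dest_dir.mkdir(parents=True, exist_ok=True)
        target = dest_dir / source.name
        shutil.copyfile(source, target)
        # Ask the OS to flush the copy to the device before we trust it.
        with open(target, "rb+") as f:
            os.fsync(f.fileno())
        # Re-read the copy and compare it against the original's checksum.
        results[target] = sha256_of(target) == expected
    return results
```

In a real plan the destinations would sit on separate hardware in separate locations, not in sibling directories on one disk, and any copy that fails verification should be treated as a missing backup.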