Bulletproof Storage
Disk systems will repair themselves or can be left unrepaired for years.
You can fly a two-engine plane with one engine, but how many passengers
would want to be on it?
That's the idea behind "bulletproof storage," a concept that IBM has been
developing for two years and plans to begin unveiling incrementally over
the next one to three years.
IBM's technology initiative deals with fault tolerance in every part of a
storage system: disks, controllers, network cards, power supplies and
software. By building more-robust storage systems whose redundant
components allow replacement of failed parts to be deferred for up to
three years, IBM believes it can also eliminate many of the human errors
that happen when failing components are replaced.
According to Stanley Zaffos, an analyst at Gartner Inc., the bulletproof
storage concept is still five to 10 years away from being broadly
embraced by users. But once it is, storage systems will require less
maintenance and will therefore cost less to maintain.
"We know how to build very reliable code. We use appliances every day that
have software built into them that work forever: your automobile, your
calculator, the disk drive in your PC, your telephone," Zaffos says.
But IBM is looking to attack far more complex systems than telephones or
calculators.
Under its bulletproof initiative, IBM is addressing disk-sector failures,
which become more likely as disk capacity grows. While disk capacities
double every 12 to 18 months, uncorrectable read/write error rates haven't
improved, nor has the probability of an uncorrectable error occurring on
a disk read decreased. There are more sectors on today's disks and,
therefore, a greater chance of an uncorrectable error.
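The arithmetic behind that claim can be sketched in a few lines of Python.
This is only an illustration: it assumes bit errors are independent, and
the error rate used (one uncorrectable error per 1e14 bits read) is a
hypothetical figure chosen for the example, not one given by IBM or
Gartner. The function name p_unrecoverable is likewise made up here.

```python
import math


def p_unrecoverable(bytes_read: float, ber: float) -> float:
    """Probability of at least one uncorrectable error while reading
    bytes_read bytes, assuming independent errors at rate ber per bit."""
    bits = bytes_read * 8
    # Equivalent to 1 - (1 - ber) ** bits, but numerically stable
    # when ber is very small.
    return -math.expm1(bits * math.log1p(-ber))


# Hypothetical per-bit error rate, held constant across disk generations.
ASSUMED_BER = 1e-14

# Capacity doubles each generation while the error rate stays flat.
for capacity_gb in (100, 200, 400, 800):
    p = p_unrecoverable(capacity_gb * 1e9, ASSUMED_BER)
    print(f"{capacity_gb:>4} GB read end to end: P(error) = {p:.4f}")
```

With the per-bit rate held flat, each doubling of capacity roughly doubles
the chance of hitting an uncorrectable error during a full read of the
disk, which is the trend the article describes.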
The answer is to create self-healing capabilities for storage management
software and more-robust RAID configurations.
IBM says that in about a year it will release storage systems that can
survive three simultaneous disk-drive failures in a single array by
introducing additional parity disks into RAID configurations, offering
many times the resiliency of a RAID configuration with two parity disks.
Today, standard systems allow for only two disk failures.
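To see why one more parity disk can mean "many times" the resiliency, here
is a rough Python sketch under a deliberately simple model: every disk in
the array fails independently with the same probability during some
window, and data is lost only when more disks fail than there are parity
disks. The array size and failure probability are hypothetical, and real
failures are often correlated, so treat the output as illustrative only.

```python
from math import comb


def p_data_loss(n_disks: int, parity: int, p_fail: float) -> float:
    """Probability that more than `parity` of n_disks fail, assuming
    independent failures with probability p_fail each (binomial tail)."""
    return sum(
        comb(n_disks, k) * p_fail**k * (1 - p_fail) ** (n_disks - k)
        for k in range(parity + 1, n_disks + 1)
    )


# Hypothetical array: 16 disks, each with a 1% chance of failing
# within the window of interest.
N_DISKS, P_FAIL = 16, 0.01

for parity in (1, 2, 3):
    loss = p_data_loss(N_DISKS, parity, P_FAIL)
    print(f"{parity} parity disk(s): P(data loss) = {loss:.2e}")
```

Under these assumptions, each added parity disk cuts the chance of losing
the array by a large factor, because the array is lost only when one more
disk fails than before.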
But Zaffos argues that 80% of downtime today is caused by user error and
software failures, not hardware failures. He says that the failures
resulting from software are created by complexity and that there is an
almost infinite number of failures that can occur in a complex system.
IBM is addressing those code failures with a software project called
N-Version Programming, where two pieces of code in the same application
save data and then compare the data to ensure that there are no errors.
In N-Version Programming, two copies of data are protected using different
means. One copy might be protected by a standard RAID-5 implementation
coded by Programmer A.
The second copy is protected by a different algorithm coded by Programmer
B. That way, if the first copy gets corrupted due to a particular bug in
the program written by Programmer A, the second copy can be used.
The second copy may have its own bugs, but they will manifest in different
ways at different times, and when they do, the first copy will be the good
one to use. It's like having a second person check the work of the first
and fix any mistakes that turn up.
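The two-copy scheme described above can be sketched as follows. This is a
toy illustration, not IBM's design: CRC-32 and SHA-256 stand in for the
two independently written protection schemes (the article mentions RAID-5
for one of them), and all function and class names are hypothetical.

```python
import hashlib
import zlib
from dataclasses import dataclass


@dataclass
class ProtectedCopy:
    data: bytes
    tag: bytes  # integrity tag produced by one "version" of the code


# Version A (stand-in for Programmer A's scheme): CRC-32.
def protect_a(data: bytes) -> ProtectedCopy:
    return ProtectedCopy(data, zlib.crc32(data).to_bytes(4, "big"))


def verify_a(copy: ProtectedCopy) -> bool:
    return zlib.crc32(copy.data).to_bytes(4, "big") == copy.tag


# Version B (stand-in for Programmer B's scheme): SHA-256.
def protect_b(data: bytes) -> ProtectedCopy:
    return ProtectedCopy(data, hashlib.sha256(data).digest())


def verify_b(copy: ProtectedCopy) -> bool:
    return hashlib.sha256(copy.data).digest() == copy.tag


def write(data: bytes):
    """Store two copies, each protected by a different implementation."""
    return protect_a(data), protect_b(data)


def read(copy_a: ProtectedCopy, copy_b: ProtectedCopy) -> bytes:
    """Return the data from whichever copy still verifies."""
    ok_a, ok_b = verify_a(copy_a), verify_b(copy_b)
    if ok_a and ok_b:
        if copy_a.data != copy_b.data:
            raise IOError("both copies verify but disagree")
        return copy_a.data
    if ok_a:
        return copy_a.data
    if ok_b:
        return copy_b.data
    raise IOError("both copies are corrupted")


copy_a, copy_b = write(b"customer record")
copy_a.data = b"customer rec0rd"  # simulate a bug corrupting copy A
print(read(copy_a, copy_b))       # falls back to copy B
```

The point of using two different implementations is that a bug in one is
unlikely to corrupt the data in the same way and at the same time as a bug
in the other.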
One way IBM plans to detect and correct corrupted data is to create
more-resilient storage software with repairable data structures. The code
checks that certain conditions, which are described in rules, are met. For
example, in a file system with multiple files, the sum of the space taken
by the files plus the free space in the system must equal the total
available space. The code will check this property automatically at
various times and run a repair procedure if the property isn't met.
In this case, the software is neither verifying that the code functions
properly nor inspecting the data contents. It checks properties of the
data structures, and when one isn't met, it knows how to fix them.
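The file-system rule from the example can be sketched in Python as below.
This is a minimal illustration, not IBM's software: the class, its field
names and the repair rule (trust the per-file sizes and recompute the
free-space counter) are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ToyFileSystem:
    total_blocks: int
    files: Dict[str, int] = field(default_factory=dict)  # name -> blocks
    free_blocks: int = 0

    def used_blocks(self) -> int:
        return sum(self.files.values())

    def invariant_holds(self) -> bool:
        # Rule: space taken by files + free space == total space.
        return self.used_blocks() + self.free_blocks == self.total_blocks

    def check_and_repair(self) -> bool:
        """Check the rule; repair the structure if it is violated.
        Returns True when a repair was performed."""
        if self.invariant_holds():
            return False
        if self.used_blocks() > self.total_blocks:
            # This simple rule cannot repair over-allocated file sizes.
            raise ValueError("file sizes exceed total capacity")
        # Assumed repair rule: trust the file sizes, recompute free space.
        self.free_blocks = self.total_blocks - self.used_blocks()
        return True


fs = ToyFileSystem(total_blocks=100, files={"a": 30, "b": 20}, free_blocks=50)
fs.free_blocks = 45            # simulate a corrupted free-space counter
print(fs.check_and_repair())   # the check detects and repairs the counter
print(fs.free_blocks)
```

Note that the check never looks inside the files or at the code that wrote
them. It only tests a property of the bookkeeping structure and restores
that property when it is broken.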
But don't expect to see fruit from N-Version Programming or checkable data
structures for another two to three years.