RAID Concepts


History And Background

RAID (Redundant Array of Inexpensive Disks, as the original authors - David Patterson, Garth Gibson and Randy Katz - called it in 1988) is one of today's many buzzwords. Some view it as the cure for all performance and reliability problems, while others are sceptical, arguing that users and system managers are misled by false assumptions and pretentious marketing slogans. They point out that the enthusiasts treat one significant cost factor far too optimistically: managing a disk array. The truth lies in the middle - as usual.

Let us go back to 1988. Anyone who needed a disk to store really big files had to buy a really big disk. State of the art then was between 500MB and 1GB, in a form factor so large that an IBM XT would have fit easily into the space it needed. The price tag showed something around $30,000, with even more expensive models available. At the same time there were small, cheap disks with capacities between 50MB and 100MB. So, what prevented anyone from simply putting ten of the smaller ones together to form one big disk?

Technically this was not a problem at all. So what was? Reliability! The MTBF (Mean Time Between Failures) of both the big and the little disks was then specified in the 50,000-hour range. But when ten disks are combined into one set, the reliability of that set drops drastically, by an order of magnitude - from several years down to a few months, which is absolutely unacceptable. So some "eggheads" at Berkeley thought about it and came out with the paper mentioned above. Soon after its publication the RAB (RAID Advisory Board) was formed, consisting of representatives of all major players in the computing and storage business, and it has overseen the further development of this technology ever since. Although a lot has changed since then - the SLEDs (Single Large Expensive Disks) are almost extinct, for example, and "Inexpensive" has therefore been replaced by "Independent" - one inherent problem remains:

Who needs what kind of RAID? Since no "level" of RAID is a silver bullet, a well-informed decision can be made only with sufficient knowledge of the underlying concepts. And anyone whose job it is to set up or repair these arrays needs that knowledge even more.

Sure, there are systems on the market that completely shield their inner structure from the user, but these are not suitable for those who want - or must - tune the array to get the maximum for its price. And they cost noticeably more than systems that require some basic knowledge of the working principles.

The RAID Levels

Now, let us have a look at the different RAID "levels" (the second greatest misnomer after the so-called ISA (Industry Standard Architecture), because in industry they prefer more reliable architectures). The original paper mentioned above defined five levels, numbered 1 to 5. Immediately after the RAB took over the development, it added level 0. In 1997/1998 level 6 was added, and level 7 followed. So, what is the difference between them?

The greatest differences among them lie in the price you pay for your net storage capacity and in how many disk losses can be tolerated. Level 0 has no redundancy at all, levels 1 through 5 can tolerate the loss of one disk per set (note that you can configure level 1 to tolerate more than one lost disk, but at a hefty price increase; see below), and levels 6 and 7 can tolerate the loss of more than one disk per set.

Level 0 is called striping. Here several disks are combined so that together they look like one big disk. No redundancy is present, so the loss of one disk results in the loss of all the data, including that on the "surviving" disks. Stripe sets are - when configured correctly - fast and scalable. They are mainly used for temporary storage of intermediate results that can be recalculated in case of a defect, or to hold data that is about to be saved to backup storage.
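
To make the mapping of logical blocks onto the members of a stripe set concrete, here is a minimal sketch in Python; the chunk size, disk count and function name are purely illustrative assumptions, not taken from any real controller.

    # Minimal sketch of RAID 0 address mapping: a logical block number is
    # translated to (disk index, block on that disk). Chunk size and disk
    # count are illustrative values only.

    CHUNK_BLOCKS = 16   # blocks per chunk (hypothetical)
    NUM_DISKS = 4       # disks in the stripe set (hypothetical)

    def map_logical_block(lba: int) -> tuple[int, int]:
        """Return (disk, physical block) for a logical block address."""
        chunk = lba // CHUNK_BLOCKS          # which chunk the block falls into
        offset = lba % CHUNK_BLOCKS          # position inside that chunk
        disk = chunk % NUM_DISKS             # chunks rotate across the disks
        stripe = chunk // NUM_DISKS          # how far down each disk we are
        return disk, stripe * CHUNK_BLOCKS + offset

    if __name__ == "__main__":
        for lba in (0, 15, 16, 63, 64):
            print(lba, "->", map_logical_block(lba))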

Level 1 is called mirroring or shadowing. As the name indicates, two or more disks are physical copies of each other (like clones). The price per megabyte is the price for one disk multiplied by the number of disks in the set. But the advantages are very predictable read performance and superb scalability in environments that mostly read from the mirror set. Because n read requests (n = number of disks in the set) can always be serviced concurrently, a mirror set is a good choice when it comes to ultimate read performance. One more advantage is the possibility of locating the different members (disks) of a mirror set hundreds of kilometers apart. Companies in Kobe, Japan, that used this advantage were operable again only months after the big earthquake, while others simply lost their most valuable assets in their fire-resistant safes (which were and are not designed to withstand the drop of a building onto them, followed by a drowning in water for weeks). The disadvantage, besides the high price for the storage space, is the rather poor write performance. It is always the same as for a single disk; no speedup is possible. (But one can combine level 1 with level 0 to blend some of the advantages of both levels. This gives improved performance, but still keeps a rather high price tag.)
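
The read/write asymmetry of a mirror set can be sketched in a few lines; the round-robin read policy below is just one plausible scheduling choice, and the class and method names are hypothetical.

    # Sketch of why a mirror set scales for reads but not for writes: reads
    # can be spread over all n copies, while every write must go to every copy.
    import itertools

    class MirrorSet:
        def __init__(self, num_disks: int):
            self.num_disks = num_disks
            self._next = itertools.cycle(range(num_disks))

        def read(self, lba: int) -> int:
            # Any copy holds the data; pick the next disk in turn.
            return next(self._next)

        def write(self, lba: int) -> list[int]:
            # The write must be applied to every member of the set.
            return list(range(self.num_disks))

    if __name__ == "__main__":
        m = MirrorSet(3)
        print([m.read(i) for i in range(6)])   # reads fan out over all disks
        print(m.write(42))                     # a write hits all three disks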

Level 2 is completely obsolete; actually, it was never implemented. It uses a Hamming encoding of the data that can correct single-bit errors. But nobody, even in 1988, thought about correcting bad spots on a disk: bad spots are marked bad and the space on the disk is blocked from further access. MS-DOS, in its generosity, even blocks the whole cylinder, not just the block with the bad spot. So, level 2 was killed by the PC before it was ever used.

Level 3 uses parity to build in redundancy. The "chunk size" - the most critical parameter in the configuration of any RAID set (except, of course, level 1) - is here set to a very small value in the range of a few bytes. This results in a very good distribution of accesses across the members of the set and moreover makes the maintenance of the parity information faster than with level 4 or 5. But it suffers from the fact that its performance depends heavily on the configuration. You cannot build sets with 3 or 5 data disks that perform well: for best results you must use a number of data disks that divides evenly into 512 (the size of a disk block in bytes). So, this level is almost never used alone, although there are controllers on the market that allow you to configure a pure level 3 set.
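
The divisibility rule can be illustrated with a small sketch that splits one 512-byte block across the data disks; the function name and the contiguous split are simplifying assumptions, standing in for the byte interleaving a real level 3 controller performs.

    # Sketch of splitting one 512-byte disk block over the data disks of a
    # level 3 set. With 4 data disks each member gets exactly 128 bytes;
    # with 3 or 5 the block cannot be split evenly.
    BLOCK_SIZE = 512

    def split_block(block: bytes, data_disks: int) -> list[bytes]:
        if BLOCK_SIZE % data_disks != 0:
            raise ValueError(f"{data_disks} data disks do not divide {BLOCK_SIZE} evenly")
        part = BLOCK_SIZE // data_disks
        return [block[i * part:(i + 1) * part] for i in range(data_disks)]

    if __name__ == "__main__":
        block = bytes(range(256)) * 2           # a dummy 512-byte block
        print(len(split_block(block, 4)[0]))    # 128 bytes per data disk
        try:
            split_block(block, 3)
        except ValueError as e:
            print(e)                            # 3 data disks: no even split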

Level 4 uses a dedicated disk in the set solely to store the parity information. But this disk is a potential bottleneck, as it must be accessed for each and every write to any of the other disks in the set. In fact each write is not simply a write, but a time-consuming read-modify-write cycle. Huh? Why read something from the disk when I want to write? Because of the parity information! We must keep the parity up to date all the time. The parity, as stored at any moment on the disk, is the parity over all the data disks. When we want to modify data on one of the disks, we first must calculate the data to be overwritten "out" of the current parity (since it contributed to that parity) before we can calculate the new (updated) parity and store the new data to the disk. So, any write results in a tedious
READ old data, READ old parity, CALCULATE old data "out" and new data "in", WRITE new data and finally WRITE new parity
cycle. And, depending on the algorithms used by the RAID controller, a bad choice of that "chunk size" parameter can bring the write performance of a level 4 (and level 5) set down to a screeching halt of 3 to 10 IO operations per second, even though the disks themselves are busy with a couple of hundred IOs per second in the meantime. But the problem of the parity disk as a bottleneck remains. This is addressed by the next level.
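
Because the parity is a simple XOR, the "calculate out / calculate in" step can be sketched in a few lines of Python; the function and variable names are illustrative only.

    # Sketch of the RAID 4/5 small-write cycle described above. XOR is its own
    # inverse, so the old data can be "calculated out" of the old parity and
    # the new data folded in, without touching the other data disks.

    def update_parity(old_data: bytes, new_data: bytes, old_parity: bytes) -> bytes:
        """Return the new parity after overwriting old_data with new_data."""
        return bytes(p ^ od ^ nd for p, od, nd in zip(old_parity, old_data, new_data))

    if __name__ == "__main__":
        d0, d1, d2 = b"\x0f" * 4, b"\xf0" * 4, b"\x33" * 4
        parity = bytes(a ^ b ^ c for a, b, c in zip(d0, d1, d2))

        new_d1 = b"\xaa" * 4
        # READ old data, READ old parity, calculate, WRITE data, WRITE parity:
        new_parity = update_parity(d1, new_d1, parity)

        # The shortcut gives the same result as recomputing over all disks.
        assert new_parity == bytes(a ^ b ^ c for a, b, c in zip(d0, new_d1, d2))
        print(new_parity.hex())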

Level 5 is similar to level 4, but it distributes the parity evenly across the disks in the set in such a way that data and the parity protecting it are never on the same disk. Thus, for the price of more computing power in the controller, the bottleneck of a dedicated parity disk is eliminated. But the tedious write cycle remains. To remedy the situation, modern controllers use a relatively large cache (though not one in the GB range, as one manufacturer does). The controller relies on - or rather bets on - a phenomenon called "data locality". Simply put, this states that the vast majority of data accesses go to a rather small set of disk blocks. So, the controllers gather write requests in their caches until one complete set of chunks has been modified by user writes. Then the controller does not need to read the old data and the old parity, as all the data on all the disks will be overwritten; it calculates the new parity from the data in the cache alone and simply writes it to disk together with the data. As this looks like a blend of level 5 with the inherent advantage of level 3, this strategy is usually called level 53. But it is not a real level that someone can configure: either the controller has a write-back cache and implements this strategy, or it does not.
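
The full-stripe-write shortcut can be sketched as follows; the chunk contents, the parity rotation formula and the function names are illustrative assumptions, since real controllers differ in their layouts.

    # Sketch of the full-stripe write used by write-back caches: once every
    # chunk of a stripe has been collected in the cache, the parity comes from
    # the cached data alone and no old data or old parity has to be read back.
    from functools import reduce

    def full_stripe_parity(chunks: list[bytes]) -> bytes:
        """XOR all cached data chunks of one stripe into the parity chunk."""
        return bytes(reduce(lambda a, b: a ^ b, group) for group in zip(*chunks))

    def parity_disk(stripe: int, num_disks: int) -> int:
        """Rotate the parity chunk across the set (one common RAID 5 layout)."""
        return (num_disks - 1 - stripe) % num_disks

    if __name__ == "__main__":
        stripe = [b"\x11" * 8, b"\x22" * 8, b"\x44" * 8]   # cached data chunks
        print(full_stripe_parity(stripe).hex())            # written with the data
        print([parity_disk(s, 4) for s in range(8)])       # parity moves each stripe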

Level 6, for the first time, defines two different strategies for implementing redundancy. The first is called a "linear array", the second a "field array". The former resembles a level 5 set with two sets of parity information. These two sets are calculated independently (the algorithm is up to the manufacturer of the RAID controller) and stored in such a way that the two parity sets for a given piece of data are never on the same disk, nor on a disk holding data that contributes to that parity. The advantage of an array of this kind is the ability to tolerate the loss of two disks simultaneously. But the write penalty of level 5 is still present and must be mitigated by other means; basically the same tricks as with level 5 can be played here. And they are needed for the "field array" as well. This consists of an array of disks logically arranged in an n×m matrix. Parity is calculated for each row and each column and stored on dedicated disks (not included in the n×m array, but additional to it). This makes the ratio of parity disks to data disks rather high unless a very big array is used. But the clear advantage is that at least n+1 (or m+1, whichever is smaller) disks must fail before data is lost.
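
A toy sketch of the row/column parity idea follows, with a tiny grid of dummy chunk values; real field arrays work per stripe and per disk, and all names here are hypothetical.

    # Sketch of the "field array" idea: data disks arranged as an n x m grid
    # with one parity value per row and one per column (kept on extra disks).

    def row_col_parity(grid: list[list[int]]) -> tuple[list[int], list[int]]:
        """Return (row parities, column parities) for a grid of data chunks."""
        row_par = [0] * len(grid)
        col_par = [0] * len(grid[0])
        for r, row in enumerate(grid):
            for c, chunk in enumerate(row):
                row_par[r] ^= chunk
                col_par[c] ^= chunk
        return row_par, col_par

    if __name__ == "__main__":
        grid = [[0x1A, 0x2B, 0x3C],
                [0x4D, 0x5E, 0x6F]]            # 2 x 3 data disks
        rows, cols = row_col_parity(grid)
        # A single lost chunk can be rebuilt from either its row or its column.
        rebuilt = rows[0] ^ grid[0][1] ^ grid[0][2]
        assert rebuilt == grid[0][0]
        print(rows, cols)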

Now you know enough about RAID concepts to mess up the arrays of your systems - enjoy it (until your boss realizes it and fires you)!


© Paul Elektronik, 1998-2002