cs161 2004 Lecture 8: RAID

Lab 2: Questions?
  Read the source
  Collaboration - don't show anyone your source
  socketpair()
  Subprocesses for reading
    how do they pass the data back? mmap()? piece by piece?
    multiple outstanding requests? how many subprocs?
    (a sketch of the socketpair()/fork()/waitpid() pattern is at the end
    of these notes)
  fix open/stat? fix the supplied code?
  memory / fd rules
  exec/fork/threads? waitpid()?

Questions?

The rename() mystery
  Applications want the "all or nothing" property, not the "old or new"
  (a sketch of the write-then-rename() pattern is at the end of these notes)

Why are we reading this?
  Not just to understand RAID (it's really not that complicated)
  To understand how to attack bottlenecks
  As a case study in an evolving design

So these guys invented RAID, huh?
  No, people were already selling them.
  Good salesmen: they just categorized and codified it.

Amdahl's Law: if you speed up a subsystem that is responsible for only a
fraction of your total time, the speedup applies only to that fraction.
(worked example at the end of these notes)

Level 0: aggregate volumes (no redundancy)
  Pro: multiple outstanding reads & writes
  Con: decreased MTTF (MTTF(disk) / # of disks)
  (striping arithmetic sketch at the end of these notes)

Basic idea to fix reliability: put the disks in groups and add some check
disks to each group. As long as only one disk per group dies at a time,
no data is lost.
  D   = total DATA disks
  G   = DATA disks per group
  C   = CHECK disks per group
  n_G = number of groups

What's the new MTTF?
  MTTF(RAID) = MTTF(Level 0) * 1/prob(second failure during MTTR)
  prob(second failure during MTTR) = MTTR / (MTTF(disk)/(G+C-1))
  (worked example at the end of these notes)

Throughout, the numbers assume: D=100, G=10, MTTF(disk)=30,000hr (3.5yr),
MTTR=1hr.

Workloads:
  Supercomputing: throughput
  Transaction processing: individual small I/Os

Level 1: 500yr
  Mirroring
  Performance is shown as a ratio to what one disk would have done
  Examples: Reads = 2D/S, Writes = D/S
  What is S? (the slowdown from having to wait for the worst of the arms)
  Why is it not shown for small reads/writes?

Level 2: G=10: 50yr; G=25: 12yr
  Hamming code across the disks of a group: G=10,C=4; G=25,C=5
  (note: Hamming CORRECTS, parity only detects)
  Identifies the failing disk, and corrects it
  Large reads/writes = D
  Small reads/writes: divide by G (dismal)

Level 3: 90yr, 40yr (same formula as before; up b/c fewer total disks)
  Assumes identifying the failed disk is easy (the controller reports it),
  so parity alone can correct
  1 check disk per group
  Performance is "the same", but really it improves per disk (so, per $)

Level 4: interleave blocks, not bytes, so a single disk can return a
whole sector on its own
  What did we fix? small reads
  Remaining problem: every write still hits the one check disk

    D1  D2  D3  D4   C
    1d               1c
        2d           1c
            3d       1c
                4d   1c
    5d               2c
        6d           2c
            7d       2c
                8d   2c

Level 5: distribute the check blocks across all the disks
  What did we fix? small writes (the check disk is no longer a bottleneck)
  (parity-update sketch at the end of these notes)

    D1  D2  D3  D4  D5
    1d              1c
        2d          1c
            3d      1c
                4d  1c
    5d          2c
        6d      2c
            7d  2c
                2c  8d

Hardware vs software RAID
  hardware can do the XOR work in parallel (avoids "copying" through the CPU)
  hardware can put every disk behind the RAID / software must boot from somewhere
  bus bandwidth?
  hardware can send only the logical data over the bus (the check traffic
    stays on the "magic bus" inside the RAID box)
  software can keep check data in memory, avoiding a re-read
  so you really want hardware with a cache, located near the disks

Problems?
  Disk failures are not independent
    Identical disks?
    Somehow I still hear story after story about RAID arrays being corrupted
  What about reconstruction time? (this affects MTTR!)
  What about performance during reconstruction?

Conclusions?
  For a small number of disks, RAID 0 seems fine (you still have to back up)
  For 10s of disks: RAID 1 if you can afford it, else RAID 5
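
Sketches and worked examples

Lab 2: a minimal sketch of the socketpair()/fork()/waitpid() pattern for
having a subprocess do work and pass the data back. The message and
structure are made up for illustration; this is not the lab's actual
interface.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int
    main(void)
    {
        int sv[2];

        /* A connected pair of sockets: sv[0] for the parent,
         * sv[1] for the child. */
        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
            perror("socketpair");
            exit(1);
        }

        pid_t pid = fork();
        if (pid < 0) {
            perror("fork");
            exit(1);
        }

        if (pid == 0) {                /* child: does the work */
            close(sv[0]);
            const char *msg = "data produced by the subprocess\n";
            write(sv[1], msg, strlen(msg));
            close(sv[1]);
            _exit(0);
        }

        close(sv[1]);                  /* parent: collect the results */
        char buf[512];
        ssize_t n;
        while ((n = read(sv[0], buf, sizeof buf)) > 0)
            fwrite(buf, 1, n, stdout);
        close(sv[0]);

        int status;
        waitpid(pid, &status, 0);      /* reap the child */
        return 0;
    }

Closing the unused ends matters: the parent's read() only returns EOF
once every descriptor for sv[1] is closed, which is one of the fd rules
mentioned above.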
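
The rename() mystery: the usual way an application gets "all or nothing"
is to write a complete new copy of the file and then rename() it over the
old name, since rename() replaces the destination atomically. A minimal
sketch; the path handling and names are illustrative only.

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Replace the contents of 'path' atomically: a reader sees either
     * the complete old file or the complete new file, never a mix. */
    int
    update_file(const char *path, const char *data, size_t len)
    {
        char tmp[1024];
        snprintf(tmp, sizeof tmp, "%s.tmp", path);

        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;

        /* fsync() forces the new bytes to disk before rename() makes
         * them visible. */
        if (write(fd, data, len) != (ssize_t)len || fsync(fd) < 0) {
            close(fd);
            unlink(tmp);
            return -1;
        }
        if (close(fd) < 0) {
            unlink(tmp);
            return -1;
        }
        return rename(tmp, path);      /* the atomic commit point */
    }

    int
    main(void)
    {
        const char *text = "new contents\n";
        return update_file("demo.txt", text, strlen(text)) < 0;
    }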
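
Amdahl's Law as an equation (the example numbers here are mine, not the
paper's): if a subsystem accounts for a fraction f of total time and you
speed it up by a factor s,

    overall speedup = 1 / ((1-f) + f/s)

E.g. f=0.1, s=10: striping a workload that is only 10% disk-bound across
ten disks gives 1/(0.9 + 0.01) ~= 1.1x overall. Speeding up the disks
only pays off when f is large, i.e. when the disks really are the
bottleneck.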
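
Level 0: a sketch of the usual block-to-disk striping arithmetic (NDISKS
and the printout are made up for illustration):

    #include <stdio.h>

    #define NDISKS 4    /* hypothetical array size */

    /* Consecutive logical blocks go to consecutive disks, so up to
     * NDISKS requests can be outstanding at once -- the Level 0 "pro". */
    int
    main(void)
    {
        for (long b = 0; b < 8; b++)
            printf("logical block %ld -> disk %ld, offset %ld\n",
                   b, b % NDISKS, b / NDISKS);
        return 0;
    }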
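
A worked check of the MTTF formula, for Level 1 mirroring under the
standard assumptions (my arithmetic, matching the 500yr figure above).
Mirroring is G=1, C=1, so D=100 means n_G=100 groups and 200 disks total:

    MTTF(Level 0, 200 disks) = 30,000hr / 200 = 150hr
    prob(mirror dies during MTTR) = 1hr / (30,000hr / (G+C-1)) = 1/30,000
    MTTF(RAID) = 150hr * 30,000 = 4.5 million hr ~= 500yr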
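
Levels 3-5: the check block is just the XOR of the data blocks in its
stripe, which is why a small write can be done as a read-modify-write of
one data block plus one check block (and why, in Level 4, every write
lands on the same check disk). A sketch; the block size is arbitrary:

    #include <stddef.h>
    #include <stdint.h>

    #define BLOCK 512   /* hypothetical block size in bytes */

    /* new parity = old parity XOR old data XOR new data.
     * Small-write cost: read old data + old parity, write new data +
     * new parity -- 4 disk I/Os, two of them on the check block's disk. */
    void
    update_parity(const uint8_t old_data[BLOCK],
                  const uint8_t new_data[BLOCK],
                  uint8_t parity[BLOCK])
    {
        for (size_t i = 0; i < BLOCK; i++)
            parity[i] ^= old_data[i] ^ new_data[i];
    }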