cs161 2004 Lecture 6: Crash Recovery for Disk File Systems

Common theme in system-building:
  You have some persistent state that you're maintaining, reading, and writing.
  Maybe the data is replicated in multiple servers' memory,
    maybe it's on one or more disks.
  You use caching and fancy data structures for efficiency.
  The hard part always turns out to be recovery after a crash.
  Most solutions have a similar flavor:
    *Order* persistent store updates so that there are atomic commit points.
    The recovery procedure can then finish or un-do partial operations.
  And it all has to be fast! Moore's law for storage is not on our side.

Case study: disk file systems.
  Critical for performance and crash recovery of individual machines.
  Interacts with distributed protocols, for both reasons.
  Crash recovery techniques are similar to those in distributed systems.
  Trade-offs are often the same (performance vs. durability).

A file system is a fairly complex abstract data structure (this is for UNIX):
  tree rooted at the root i-node
  directory entries
  file/subdirectory i-nodes
  file blocks
  file block number lists
  I-nodes and directory contents are usually called meta-data,
    as opposed to file contents.

Even at this abstract level there are crash recovery problems:
  What if you crash in the middle of a rename()?

But there is more: these objects live somewhere on the disk.
  [picture with directory, i-node, data...]
  The file system objects have disk block addresses.
  The file system must allocate disk blocks for new i-nodes &c.
    Someone decides where to place i-nodes.
  The file system must release unused blocks after deletion.
    So there must be a list of free disk blocks.

File systems do their own disk layout and free block list.
  1. Known layout helps FS recovery:
     e.g. this range of blocks holds only i-nodes,
     so recovery won't interpret user data as an i-node.
  2. FS recovery may generate the free list as a side effect.
     FS recovery code scans the active i-nodes and keeps a list of their
     file blocks; all other blocks must be free.
     (See the sketch at the end of this section.)

Typical free list implementation:
  A bit-map for i-nodes, a bit-map for file data blocks.
  The FS keeps them in memory for speed.
  It tries to keep the disk versions up to date for fast reboot.
  But (in simple systems) it doesn't try too hard, since they are easy to recover.
  Keep in mind the kind of failure you'll get w/ an out-of-date free list:
    you'll allocate the same block or i-node to two files.

What does recovery do?
  For example, the UNIX fsck program runs at boot-time.
  It descends the file system tree, remembering allocated i-nodes and blocks.
  All other blocks and i-nodes must be free, so fsck reinitializes the free lists.
  It also checks for a block used by two files, file length != number of blocks, &c.
  It may find a problem it can't fix (which file should get the doubly-used block?).
    Then it asks the user.
  Goal: finish or undo operations in progress at the time of the crash.
    Leave the file system in a state that it could have been in
    if the crash had happened slightly earlier or later.
    Or notify the user if that's not possible.

Final ingredient: the kernel's in-memory buffer cache of recently used disk blocks.
  Hugely effective for reads (all those root i-node accesses).
  The result is that the bottleneck is often disk writes.
  So disk caches are also usually write-*back*: they hold dirty blocks.
  The dirty blocks are an additional headache for crash recovery.
  So we will always want to know when they are written to disk, and in what order:
    Delayed write: no particular time.
    Async write: put on the disk driver's queue now, but don't wait.
    Synchronous write: the system call waits for the disk write to finish.
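The free-list reconstruction described above is simple enough to sketch.
Here is a minimal version in C, under an assumed toy on-disk format;
the struct dinode layout, the NINODES/NBLOCKS/NDIRECT constants, and the
read_inode() helper are all hypothetical, not any real file system's.

  #include <stdint.h>
  #include <string.h>

  #define NINODES 1024
  #define NBLOCKS 65536
  #define NDIRECT 12

  struct dinode {
      uint16_t type;            /* 0 means this i-node is free */
      uint32_t addrs[NDIRECT];  /* direct block numbers; 0 = unused */
  };

  /* Hypothetical helper: read i-node inum from the i-node area. */
  extern void read_inode(uint32_t inum, struct dinode *dip);

  uint8_t inode_used[NINODES / 8];  /* reconstructed i-node bitmap */
  uint8_t block_used[NBLOCKS / 8];  /* reconstructed block bitmap */

  static void mark(uint8_t *map, uint32_t n)
  {
      map[n / 8] |= 1 << (n % 8);
  }

  void rebuild_free_lists(void)
  {
      struct dinode dip;

      memset(inode_used, 0, sizeof inode_used);
      memset(block_used, 0, sizeof block_used);

      /* Scan every i-node; blocks reachable from an allocated i-node
         are in use, and everything left unmarked must be free. */
      for (uint32_t i = 0; i < NINODES; i++) {
          read_inode(i, &dip);
          if (dip.type == 0)
              continue;                     /* free i-node */
          mark(inode_used, i);
          for (int j = 0; j < NDIRECT; j++)
              if (dip.addrs[j] != 0)
                  mark(block_used, dip.addrs[j]);
      }

      /* Writing both bitmaps back to disk reinitializes the free
         lists, so stale on-disk bitmaps are harmless after a scan. */
  }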
What are the specific problems?

Example:
  fd = create("d/f", 0666);
  write(fd, "hello", 5);

Ignore reads (they're cached); here are the writes:
  1. i-node free bit-map (get a free i-node for f)
  2. f's i-node (write owner &c)
  3. d's contents (add "f" -> i-number mapping)
  4. d's i-node (longer length, mtime)
  5. block free bit-map (get a free block for f's data)
  6. f's data block
  7. f's i-node (add block to list, update mtime and length)

How fast can we create small files?
  If each write goes to disk (at roughly 10 ms per write),
    that's 70 ms/file, or 14 files/second.
  Pretty slow if you are un-tarring a big tar file.
  If the FS only writes into the disk cache, creates finish very quickly.
  But the cache will eventually fill up with dirty blocks,
    which must be written to disk.
  Then writes 1, 2, 3, 4, 5, and 7 are amortized over many files,
    but write 6 is one per file.
  So a sustained rate of maybe 100 files/second: 7x faster than sync writes.

And unlink:
  unlink("d/f");
  8. d's contents (remove the "f" entry)
  9. d's i-node (mtime)
  10. free i-node bitmap
  11. free block bitmap

Can we recover sensibly with a write-back cache?
  The game: a few dirty blocks are flushed to disk, then a crash, then recovery.
  Example: only write #8 reaches disk; recovery sees an i-node that no
    dirent references, and frees it.
  Example: only write #3 reaches disk; recovery sees a used i-node on the
    free list; ask the user?
  Example: only write #10 reaches disk; recovery sees a used i-node on the
    free list; ask the user?
  These are benign, though annoying for the user.
  Clearly there's a vast number of outcomes. Are they all benign?

Here's the worst:
  unlink("f1");
  create("f2");
  The create happens to re-use the i-node freed by the unlink.
  Only the create's write #3 goes to disk. Crash.
  After re-start, what does recovery see?
  The file system looks correct! Nothing to fix.
  But file f1 actually has file f2's contents!
  A serious *undetected* inconsistency.
  This is *not* a state the file system could have been in
    if the crash had occurred slightly earlier or later.
  And fsck did not notify the user that there was an unfixable problem!

How can we avoid this delete/create inconsistency?
  Observation: we only care about what's visible in the file system tree.
  Goal: an on-disk directory entry must always point to a correct on-disk i-node.
  Unlink rule: remove the dirent *on disk* before freeing the i-node.
  Create rule: initialize the new i-node *on disk* before creating the
    directory entry.
  In general, directory entry writes should be commit points.
    Earlier crashes make no difference to the visible fs: undo (e.g. free the i-node).
    Later crashes can be rolled forward: complete (e.g. update the directory's mtime).
  For example, synchronous disk writes in the order given are sufficient.

For most file system operations, there is some recoverable synchronous order.
  Because the file system is a tree, you can prepare the new sub-tree
    and cause it to appear in the old one with one operation.
  And most operations are "small": they affect just leaves.
  What's the exception? rename().

Can we eliminate some of the sync writes in file creation?
  What ordering constraints can we identify?
    #2 before #3
    #6 before #7
    #3 before #4 (maybe; fsck can correct the length, and mtime matters less)
    #1 and #5 need never occur, since fsck recovers the bitmaps.
  #3 can be deferred, since it is the commit point once #2 has completed.
  So perhaps only #2 and #6 need to be synchronous,
    to force them to occur before #3 and #7.
  Perhaps #3 too, to force it to happen before #4.
  UNIX: #2 and #3 are synchronous, but not #6.
    It defends only its own meta-data, not your file data.
    Use fsync() if you care.
  Performance: two synchronous disk writes per file create, so 50 files/second.
  (A sketch of this ordering follows.)
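Here is what that ordering looks like in code: a minimal sketch of a
safe create, assuming a made-up in-memory i-node type and two
hypothetical helpers (write_inode_sync, add_dirent_sync); the write
numbers refer to the list above.

  #include <stdint.h>

  struct inode {
      uint32_t inum;
      /* ... owner, mtime, length, block list ... */
  };

  /* Hypothetical helpers: each issues a synchronous disk write and
     returns only after the write has reached the disk. */
  extern void write_inode_sync(const struct inode *ip);
  extern void add_dirent_sync(struct inode *dir, const char *name,
                              uint32_t inum);

  void create_ordered(struct inode *dir, const char *name, struct inode *ip)
  {
      /* Write #2 first, synchronously: f's i-node must be initialized
         on disk before any on-disk dirent can point at it. */
      write_inode_sync(ip);

      /* Write #3 is the commit point: once the dirent is on disk, the
         create is visible and recovery rolls it forward.  If we crash
         before this, recovery just frees the orphaned i-node. */
      add_dirent_sync(dir, name, ip->inum);

      /* Writes #1, #4, and #5 can be delayed writes: fsck rebuilds
         the bitmaps and can correct d's length and mtime. */
  }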
What if you want to do better?
  1) Forget reliability (Linux).
  2) Synchronous metadata, asynchronous data (DOS, FFS, etc.).
  3) NVRAM.
  4) Atomic updates (asynchronous log; an operation commits when the
     log says it's ready).
  5) Pass dependencies down to the disk driver.
  6) Have the write-back cache enforce dependencies.
     But dependencies can be circular because of false sharing:
     e.g. a create and a delete in the same directory block,
     or i-nodes in the same block.
  7) Soft updates: track dependencies and temporarily undo them.

Soft updates:
  Track dependencies between writes explicitly, and write in order as needed.
  Rules to enforce:
    Never point to a structure before it has been initialized.
    Never reuse a resource before nullifying all pointers to it.
    Never reset the old pointer to a live resource before the new
      pointer has been set.

Snapshots:
  Let system calls complete.
  Keep a copy-on-write tree.
  NetApp does this (guest lecture).
  No fsck. Why? (See the sketch below.)
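Why does copy-on-write make fsck unnecessary? Because no live block is
ever overwritten: an update writes a new path from the changed leaf up
to the root, then commits with a single synchronous write of the root
pointer. A crash at any earlier point leaves the old tree completely
intact. A minimal sketch, with a made-up node layout and hypothetical
helpers (alloc_block, read_block, write_block, write_root_sync):

  #include <stdint.h>

  #define MAXDEPTH 16

  struct node {
      uint32_t child[16];   /* on-disk block numbers of children */
  };

  extern uint32_t alloc_block(void);   /* never returns a live block */
  extern void read_block(uint32_t bno, struct node *n);
  extern void write_block(uint32_t bno, const struct node *n);
  extern void write_root_sync(uint32_t bno);  /* the one atomic commit */

  /* Replace the leaf reached via path[0..depth-1] from the root with
     new_leaf, copying (never overwriting) every node along the way.
     Assumes depth <= MAXDEPTH. */
  void cow_update(uint32_t root, const int *path, int depth,
                  uint32_t new_leaf)
  {
      uint32_t bno[MAXDEPTH];
      struct node n;
      uint32_t child = new_leaf;

      /* Read the old path top-down to learn each node's block number. */
      bno[0] = root;
      for (int i = 1; i < depth; i++) {
          read_block(bno[i - 1], &n);
          bno[i] = n.child[path[i - 1]];
      }

      /* Copy bottom-up: each copied node points at the new child
         below it; the old tree is untouched. */
      for (int i = depth - 1; i >= 0; i--) {
          read_block(bno[i], &n);
          n.child[path[i]] = child;
          child = alloc_block();       /* fresh block, not a live one */
          write_block(child, &n);      /* can be asynchronous */
      }

      /* The single commit point: crash before this write and you have
         the old tree; crash after, the new tree.  Either way the
         on-disk tree is consistent, so there is nothing for fsck to
         repair. */
      write_root_sync(child);
  }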