cs161 Lecture 23: Clustering, Load-Balancing

Both try to put a single face on multiple machines.
  Load-balancing (generally) chooses which machine handles a request.
  Clustering (generally) gives the impression that you have one machine.
The goal is scalability:
  Performance (add more machines for more performance).
  Manageability (using more machines, how can you keep it simple?).
  Availability (similar problem to RAID -- more machines, must mask faults).

Load-balancing
  Usually considered for read-only cases (or writes localized to one client).
  At any of many layers -- link, IP, HTTP (application, layer 7), DNS.
  Choose based on:
    Lightest load
    Request size (heavy tail)
    Machine that can handle the request
    Static allocation
    Sticky allocation (SSL, most web apps)

Clustering
  Closer to a single-system image.
  More communication during a request.
  Minimize communication, for performance and sanity.

We'll talk about this in the context of Porcupine, a highly scalable,
cluster-based mail service (1999).

- Problem: email service for many users.
  Many users receive lots of email per second
  (a billion messages a day, ~10,000 per second).
  More than one machine can deal with.

- Email operations:
  - deliver msg
    requires updating the user's index of messages and storing the message
  - retrieve message
    look up message and return
    optionally remove message and update the user's index

- Idea: use a cluster of machines to construct a high-performance email server.
  - PCs are inexpensive

- Plan 1: simple partition of users over machines.
  I.e. each user's mail is stored on a particular machine.
  Problems:
  - manual configuration (doesn't scale)
  - one machine fails, many users get no email (low availability)
  - how do outsiders know which machine to send mail to?
  - how do users know which machine to read from?
  - load imbalance (low performance)
    - rebalancing is hard, requires manual intervention
    - adding new servers doesn't automatically improve performance

  Why is load balance so important?
    Some users may be stuck on overloaded machines.
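Plan 1's static partition can be sketched as a fixed hash of the username onto a server list; the server names and the choice of hash here are illustrative assumptions, not anything from the paper:

```python
import hashlib

SERVERS = ["mail0", "mail1", "mail2", "mail3"]  # hypothetical server names

def home_server(user: str) -> str:
    """Plan 1: each user's mail lives on exactly one machine,
    chosen by a fixed hash of the username."""
    h = int(hashlib.sha1(user.encode()).hexdigest(), 16)
    return SERVERS[h % len(SERVERS)]
```

Note that appending a fifth entry to SERVERS changes `h % len(SERVERS)` for most users, which is one way to see why rebalancing is hard and why new servers don't automatically help.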
    Some machines overloaded while others are idle.
    Wasted resources: means we could have handled a higher load.

- Plan 2: Plan 1 plus replication, for availability and performance.
  - replicate each user's mailbox on a few machines (availability)
  - read operations can happen at any replica (good performance)
  - write operations must touch each replica (bad performance)
  - must serialize writes to maintain replication (bad performance)
    create-then-delete had better happen in the right order on all replicas!
  - it's a pain to keep the replicas precisely identical
    what if a replica is down when mail arrives?
    can mail not be delivered? do "replicas" diverge?

- Does an email service have to mimic single-copy semantics for mailboxes?
  No:
  1. It's OK to reorder incoming messages: updates commute.
  2. It's OK to deliver a message twice.
  3. It's OK to undelete a message.
  4. It's OK to present only a subset of messages during a partial failure.
  5. It's OK to change the order from one session to another.
  6. Updates are individually atomic, so no locking.
  I.e. "mailbox" does not need to be a coherent idea.
  It's enough to have just a set of messages.
  The real operations are add/delete message.
  The only real consistency rule is "don't lose a message". All else is optional.

- Plan 3: when a new message arrives, copy it to 2 random servers.
  For availability. Just that message; don't worry about the whole mailbox.
  When a user reads mail,
    ask all reachable servers for the messages they store.
  When a user deletes a mail message,
    send a delete to all servers.

  What's good about Plan 3?
    Highly available.
    Probably load-balanced.
    Automatically starts using new servers.
    No management required.

  What's bad about Plan 3?
    It's expensive to contact every server during mail fetches.
    A user may temporarily miss a message if two servers are down. No big deal.
    Down servers may miss deletes. This might be annoying but not disastrous.

- How does Porcupine fix the problems with Plan 3?
  1.
     Maintain affinity between each user and a few preferred servers:
     the "fragment list".
     So reading mail only needs to contact those servers.
     Need to remember the correct per-user server set despite failures.
     But it can be soft state!
     If all else fails, can ask each server who it holds mail for.
  2. But mail delivery needs to find the fragment list.
     Fragment lists are distributed across servers, using the "user map".
     Every node has a complete copy of the (identical) user map,
     which is again soft state.
  3. What about preserving order?
     Incoming mail is stamped with the current wall-clock time.
     Clocks of different servers are roughly synchronized.
     Messages can be sorted when the user reads mail.

- What happens when mail is delivered?
  Mail arrives on a randomly chosen server.
  The user map finds the server holding the user's fragment list.
  The fragment list tells us where to store/replicate.
  Can store somewhere else
    if the preferred servers are down, out of disk, or overloaded;
    just add the new server to the fragment list.

- How does Porcupine make sure data is replicated properly?
  I.e. what if it can't contact some replica servers (or they're slow)?
  Log the updates, keep trying.

- What happens when a user reads mail?
  Again, the user contacts any server.
  Find the fragment list.
  Gather fragments -- ordinarily only from a few preferred nodes.
  Sort by timestamp, eliminate duplicates.

- What happens during recovery?
  Soft state has to be correct.
  Not just a hint, since it's not cheap to verify.
  If user maps don't agree, a user may have multiple (different) fragment lists.
  If a fragment list isn't correct, a fragment may exist but not be read.
  All live servers must agree on the total set of live servers.
  First, choose a server to coordinate the recovery.
  The coordinator decides which servers are live,
  and computes a new user map -- with minimum change from the old one.
  It tells every node the new user map.
  Then servers help each other reconstruct fragment lists.

- Are these techniques widely applicable?
  I.e. could you use them for e.g. a replicated NFS server?

- How fast *should* such a system be?
  I.e. messages per second per server.
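The read path above (user map lookup, fragment list, gather, timestamp sort, duplicate elimination) can be sketched as below. All names, structures, and the bucketed user map are illustrative assumptions, not Porcupine's actual code:

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class Message:
    msg_id: str       # stable id, used to eliminate duplicate replicas
    timestamp: float  # wall-clock stamp assigned at delivery
    body: str

# Soft state, identical on every node: hash bucket of user -> node
# that manages that user's fragment list.
USER_MAP_BUCKETS = 256

def user_map_bucket(user: str) -> int:
    return int(hashlib.sha1(user.encode()).hexdigest(), 16) % USER_MAP_BUCKETS

def read_mail(user, user_map, fragment_lists, fetch_fragment):
    """Gather a user's messages from its preferred servers.
    fetch_fragment(server, user) returns the messages that server stores."""
    manager = user_map[user_map_bucket(user)]
    preferred = fragment_lists[manager][user]   # the user's fragment list
    seen = {}
    for server in preferred:
        for m in fetch_fragment(server, user):
            seen.setdefault(m.msg_id, m)        # drop duplicate replicas
    return sorted(seen.values(), key=lambda m: m.timestamp)
```

Because the mailbox is just a set of messages, deduplicating by id and sorting by timestamp at read time is all the "consistency" the fetch path needs.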
  Limited by disk seeks? Because messages must be permanently stored.
  One seek per message delivered, retrieved, deleted.
  So three disk I/Os per message.
  Assume 15 ms per disk I/O: 66 I/Os per second,
  so 22 messages per second per server.

  What would we expect if every message were replicated (i.e. two copies)?
  Delivery and deletion are each two seeks.
  Retrieval is still one seek -- we just need one copy.
  Expect a bit more than half as fast, roughly.
  Though surely we could use logs for the replicas...

  Are we really limited by disk seeks?
  I.e. can we run at sequential disk bandwidth instead?
  It's not clear we can have non-seeking delivery *and* retrieval.

- How fast is Porcupine?
  (Does the synthetic workload balance delivery and retrieval? Probably.)
  No replication: 26 messages / server / second.
  Replication: 10 m/s/s.
  Sendmail: 10 m/s/s.
  What do they claim the bottleneck is? Disk seeks.
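The seek arithmetic above can be checked directly (the 15 ms figure is the assumption stated in the notes):

```python
SEEK_MS = 15.0             # assumed time per disk I/O (one seek)
IOS_PER_MESSAGE = 3        # deliver + retrieve + delete, one seek each

ios_per_second = 1000.0 / SEEK_MS                    # ~66.7 I/Os per second
msgs_per_second = ios_per_second / IOS_PER_MESSAGE   # ~22 messages/s/server

# With 2-copy replication: delivery and deletion double, retrieval doesn't.
replicated_ios = 2 + 1 + 2                           # 5 I/Os per message
replicated_msgs = ios_per_second / replicated_ios    # ~13 messages/s/server
```

13.3 vs. 22.2 messages per second matches "a bit more than half as fast" for the replicated case.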