class: center, middle

# Designing Data Intensive Applications
## Chapter 8: Total Chaos
##### By Tarik Eshaq

---
class: center, middle

# Faults and Partial Failures
## A single computer vs a distributed system

---

# Why we can't just crash

- It's online; users have expectations
- High failure rates on commodity nodes
- Network topologies are different from supercomputers': it's unreasonable to build multi-dimensional meshes out of Ethernet and IP
- High likelihood that some part of the system is always broken
- Rolling upgrades without interrupting service
- Geographical distribution: nodes can be far away from each other

---
class: center, middle

# Unreliable Networks
## What's the worst that could happen?

---

# Unreliable Networks
## What's the worst that could happen?

- Request is lost
- Request is waiting in a queue and will be delivered later
- Remote node failed
- Remote node temporarily stopped responding, and will be back
- Remote node processed the request, but the response got lost
- Remote node processed the request, and the response is waiting in a queue

---
class: center, middle

# Unreliable Networks
## You can't tell whether the request was lost, the remote node is down, the response was lost, or the response will eventually arrive
## Timeout!

---

# Network faults in practice
## Networks are shit, basically. They always break in practice

- About 12 network faults a month INSIDE a single data center
- Shit happens: sharks bite undersea cables, network links start working in only one direction
- Someone drives their car into your data center
- A software upgrade can reconfigure the topology, delay network packets, and make clients freak out

---
class: center, middle

# Network faults in practice
## Even if they are unlikely, you need to deal with them

---

# Detecting Faults
## It's kinda important

- Need to kick a dead node out of the cluster
- If a leader dies, need to promote a new one

---

# Detecting Faults
## Why it's hard

- The non-determinism that's core to distributed systems: what if the node is not actually dead?
- Sometimes nodes can tell you when they die
- Otherwise: ACK, ACK, ACK

---

# TIMEOUT ⏲️
## How long to wait? (a sketch follows the queuing slide)

- Too short, and you declare live nodes dead
  - Then the system duplicates that poor node's work, and load takes a hit
  - Cascading failure: overloaded nodes declare each other dead. Look what you've done
- Too long, and users think your service is slow
  - Users hate you, and that node has been dead for a while. Look what you've done

---

# Network Congestion and queuing

- Network switches queue packets against each other, delaying delivery
- The operating system queues incoming requests until the application is ready to handle them
- A virtual machine gets queued while the CPU runs another VM
- TCP's flow control has a limited window, so packets queue up at the sender
- TCP retransmits when ACKs don't arrive, which the application experiences as delay
- A whole lot is going on in a data center that can clog the network switches
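---

# TIMEOUT ⏲️, sketched

A minimal sketch of timeout-based failure detection. The address, port, and the `ping`/`pong` protocol are made up for illustration; only the timeout logic is the point.

```python
import socket

NODE_ADDR = ("10.0.0.5", 7000)   # hypothetical node to probe
TIMEOUT_SECONDS = 1.0            # this whole chapter, in one constant

def probably_alive(addr=NODE_ADDR, timeout=TIMEOUT_SECONDS):
    """True if the node answered in time.

    False only means "no answer within the timeout": the node may be dead,
    slow, paused for GC, or the network may have eaten the request or reply.
    """
    try:
        with socket.create_connection(addr, timeout=timeout) as sock:
            sock.settimeout(timeout)     # bound the wait for the reply too
            sock.sendall(b"ping\n")
            return sock.recv(4) == b"pong"
    except OSError:                      # refused, reset, or timed out
        return False
```

Note the name: the function can only ever say "probably". That's the whole problem.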
---

# Synchronous vs Asynchronous Networks
## How fixed-line telephone networks work

- When making a call, the network establishes a circuit
- A guaranteed bandwidth is allocated to the call
- Key thing: nobody else can use that bandwidth while the call is active
- We get a bounded delay! Because there is no queuing, and bandwidth is allocated all the way along the path

---

# Synchronous vs Asynchronous Networks
## How TCP works

- No bandwidth allocation
- TCP sends a packet whenever bandwidth is available
- TCP does not take up bandwidth from others while idle
- Ethernet and IP use packets, not circuits, so TCP can't lock down the network
- BURSTY traffic all the way: gimme what I want, when I want it, quick

---

# Unreliable Clocks
## They're important; like, time is cool

- Wanna measure how long something took
- Wanna know the absolute time at which something happened

---

# Unreliable Clocks
## Too bad you can never really find out what time it is

- Network hops take time
- Some events you can't even order
- Even the damn clock hardware your computer uses drifts

---

# Monotonic clocks vs Time of day
## Time-of-day clocks

- They... return the date and time
- They synchronize with a group of NTP servers every now and then
- Sometimes you can go back to the past. Fun stuff, time travel!

---

# Monotonic clocks vs Time of day
## Monotonic clocks

- They measure durations
- They never look back
- Useless to compare monotonic clocks on different computers, because they drift differently
- Multiple CPUs can have different clocks with different drifts; operating systems do some magic or some shit here
- NTP servers can nudge a monotonic clock's frequency (slew it), but never make it jump
- Always measure on the same computer, even in a distributed system

---

# Clock Synchronization
## Here we go again

- Jumping to the future and the past because the server lords deem it so
- Careful not to firewall off the NTP servers, otherwise you drift all the way
- BTW, the synchronization? It happens over a network... and guess what? Unbounded delay
- NTP servers are sometimes just wrong: never trust anyone on the internet
- Leap seconds are a thing, ladies and gents, and some systems didn't even know
- VMs have virtualized clocks; geez, imagine switching back and forth, time goes brrrr
- Kids these days know how to reset their devices' clocks

---

# Relying on Synchronized clocks
## Just don't

- Or if you must, make sure your software monitors clock skew and deals with it
- Definitely don't rely on them across multiple nodes. Shit is nasty. If you use last-writer-wins and depend on synchronized clocks, God save you (see the sketch after the process-pauses slide)
- Writes can go poof
- Sequential writes in quick succession vs truly concurrent writes? This distributed system definitely can't tell the difference
- What happens when two nodes write at the exact same timestamp?
- Sometimes a write can get sent to the past! We are time traveling again
- There's always a confidence interval; if you're serious, use it. Google Spanner is good at this
- Transaction IDs for snapshot isolation transactions (yay, throwback) in Google Spanner use the confidence intervals to define causality

---

# Process pauses
## Dammit stop, pls

- Time-based leases exist; a leader is only a leader while it holds one. And they expire!
- But... what happens if the process pauses and the lease expires while the leader still thinks it holds it?
- Bad shit: multiple leaders, an uprising, some French Revolution shit (a fencing-token sketch follows this slide)
- But come on, this shit can't happen...
- Oh yes it can. The JVM loves garbage and will stop the world to gather it all
- VMs sometimes get tired and go to sleep (suspend and resume)
- A user can just close their laptop
- An operating system can context-switch away
- If you're a shitty dev, IO can cause a delay. If you're a great dev, IO can cause a delay. Might as well be a shitty dev
- Oh BTW, page faults and swapping are a thing. Don't worry about it, it'll only fuck up your whole distributed system
- Unix signals are also a thing (hello, SIGSTOP)
- You can sometimes get real-time delay guarantees, but buddy, we're not building a car or an airplane. Plus making that guarantee fucks everything else up
- Sometimes you can just let the garbage pile up (and restart the process before a full GC)
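---

# Last-writer-wins, sketched

A minimal sketch of why LWW on top of skewed time-of-day clocks loses data. The store, the skew values, and the node names are invented for illustration.

```python
import time

store = {}  # key -> (timestamp, value)

def lww_write(key, value, ts):
    """Keep only the write with the highest timestamp."""
    prev = store.get(key)
    if prev is None or ts > prev[0]:
        store[key] = (ts, value)

def clock(skew_seconds):
    """A time-of-day clock that is wrong by a fixed amount."""
    return time.time() + skew_seconds

lww_write("x", "from node A", clock(+2.0))  # A's clock runs 2 s fast
time.sleep(0.1)                             # B really does write AFTER A...
lww_write("x", "from node B", clock(0.0))   # ...with an accurate clock
print(store["x"])                           # still "from node A": B's write went poof
```

---

# Fencing tokens, sketched

A minimal sketch of fencing off a paused ex-leader, assuming a lock service that hands out a strictly increasing token with every lease.

```python
class LockService:
    def __init__(self):
        self.token = 0

    def acquire(self):
        self.token += 1
        return self.token            # fencing token: strictly increasing

class Storage:
    def __init__(self):
        self.max_token_seen = 0
        self.value = None

    def write(self, value, token):
        # Reject any write carrying an older token than one already seen.
        if token < self.max_token_seen:
            raise PermissionError(f"stale token {token} < {self.max_token_seen}")
        self.max_token_seen = token
        self.value = value

locks, disk = LockService(), Storage()
t1 = locks.acquire()                 # client 1 takes the lease...
t2 = locks.acquire()                 # ...pauses (GC!), lease expires, client 2 takes over
disk.write("client 2 was here", t2)  # accepted
try:
    disk.write("client 1 was here", t1)
except PermissionError as e:
    print("zombie fenced off:", e)   # the ex-leader's write is rejected
```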
---
class: center, middle

# What is life?
## Knowledge, Truth and Lies

---

## Knowledge, Truth and Lies
### Truth is defined by the majority

- If the majority think you're dead, off you go
- How to be a real-life zombie
- A zombie can pretend to be a crewmate and mess everybody up
- Miguel's favorite sport

---
class: center, middle

## Knowledge, Truth and Lies
### Byzantine faults
#### Honestly, just don't worry about it

---
class: center, middle

# Like seriously, the book was like "radiation"
## nope

---
class: center, middle

# System Model and Reality

---

# System Model and Reality
## Timing assumptions

- Synchronous model:
  - Bounded network delay, bounded process pauses, bounded clock error... as if life could be that easy
- Partially synchronous model:
  - Behaves like the synchronous model most of the time, but sometimes exceeds the bounds
  - No hard guarantees, but realistic... like a glimmer of hope that stuff works, most of the time
- Asynchronous model:
  - Can't assume anything; you're on your own...

---

# System Model and Reality
## Types of crashes

- Crash-stop
- Crash-recovery
- Byzan... nope

---

# System Model and Reality
## Correctness properties

- Uniqueness
- Monotonic sequence
- Availability

P.S. your choice of timing model matters here

---

# System Model and Reality
## Safety and liveness

- Safety: "nothing bad happens"
- Liveness: "eventually... something good happens"

---

# System Model and Reality
## Safety and liveness (for real this time)

- Safety:
  - If a safety guarantee is broken, we can point at the particular moment in time at which it was broken. The damage cannot be undone. GG
  - They **must** hold in distributed systems, even when everything misbehaves; otherwise shit hits the fan
- Liveness:
  - It may not hold at an exact point in time, but there is always hope 🙄
  - Eeeh, in our system model we're allowed caveats, like assuming crashed nodes eventually recover

---
class: middle, center

# I'm sad
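---

# Bonus: majority rule, sketched

A throwaway sketch of "truth is defined by the majority": a node is declared dead only if a strict majority of observers votes that way. The vote format is invented for illustration.

```python
def declared_dead(votes):
    """votes: one boolean per observer, True = 'I think the node is dead'."""
    return sum(votes) > len(votes) // 2

print(declared_dead([True, True, False]))   # True: 2 of 3 — off you go
print(declared_dead([True, False, False]))  # False: a minority can't kill you
```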