Designing Data Intensive Applications
Chapter 8: Total Chaos
By Tarik Eshaq

Faults and Partial Failures
A single computer vs a distributed system

Why we can't just crash
It's online, users have expectations
High failure rates, commodity nodes
Network topologies are different than on supercomputers; it's unreasonable to build multi-dimensional meshes out of Ethernet and IP
High likelihood that some part of the system is always broken
Rolling upgrades without interrupting service
Geographical distribution, nodes can be far away from each other

Unreliable Networks
What's the worst that could happen?

Unreliable Networks
What's the worst that could happen?
Request is lost
Request is waiting in a queue and will be delivered later
Remote node failed
Remote node temporarily stopped responding, and will be back
Remote node processed the request, but the response got lost
Remote node processed the request, and the response is waiting in a queue

Unreliable Networks
You can't tell whether the request was lost, the remote node is down, the response was lost, or the response will eventually show up
Timeout!
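A minimal sketch of why a timeout tells you so little. Everything here is made up for illustration: `send_request`, `fast_handler`, and `slow_handler` are hypothetical names, and a thread stands in for the remote node. It is not a real RPC library.

```python
import queue
import threading
import time


def send_request(handler, payload, timeout_s):
    """Send payload to a simulated remote node; wait up to timeout_s for a reply."""
    replies = queue.Queue()

    def remote_node():
        replies.put(handler(payload))

    threading.Thread(target=remote_node, daemon=True).start()
    try:
        return ("ok", replies.get(timeout=timeout_s))
    except queue.Empty:
        # Lost request? Dead node? Lost or slow response? No way to tell from here.
        return ("unknown", None)


processed = []  # side effects that really happened on the "remote" side


def fast_handler(payload):
    processed.append(payload)
    return "done: " + payload


def slow_handler(payload):
    processed.append(payload)  # the work happens...
    time.sleep(0.3)            # ...but the reply shows up too late
    return "done: " + payload
```

Calling `send_request(slow_handler, "write-2", timeout_s=0.05)` comes back as `("unknown", None)`, yet `processed` already contains `"write-2"`. The caller saw a timeout, the remote side did the work anyway.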

Network faults in practice
Pretty much networks are shit. They always break in practice
One study found about 12 network faults a month INSIDE a single data center
Shit happens: sharks bite wires, network links start working in only one direction
Someone drives their car into your data center
A software upgrade can cause a reconfiguration of the topology that delays network packets, and clients freak out

Network faults in practice
Even if they are unlikely, you need to deal with them

Detecting Faults
It's kinda important
Need to kick a node out of the network
If a leader dies, need to promote a new one

Detecting Faults
Why it's hard
The non-determinism that's core to distributed systems. What if the node is not dead?
Sometimes nodes can tell you when they die
Otherwise ACK ACK ACK

TIMEOUT ⏲️
How long to wait?
Too short and you declare alive nodes dead
Then the system duplicates that sorry ass node's work, and load goes up
Cascading failure, all nodes declare each other dead. Look what you've done

Too long and users think your service is slow
Users hate you, and that dead ass node has been dead for a while. Look what you've done
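A rough sketch of the trade-off, using a toy heartbeat-based failure detector. `HeartbeatDetector` and `run` are hypothetical names, and time is passed in by hand so the scenario is deterministic. Node "a" is alive but its heartbeats arrive late; node "b" crashes right after its first heartbeat.

```python
class HeartbeatDetector:
    """Suspects a node once no heartbeat has been seen for timeout_s seconds."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}

    def heartbeat(self, node, now):
        self.last_seen[node] = now

    def suspected_dead(self, now):
        return {n for n, seen in self.last_seen.items()
                if now - seen > self.timeout_s}


def run(timeout_s):
    """Replay the same scenario under a given timeout."""
    detector = HeartbeatDetector(timeout_s)
    detector.heartbeat("a", now=0.0)
    detector.heartbeat("b", now=0.0)   # b crashes right after this
    at_1_2 = detector.suspected_dead(now=1.2)
    detector.heartbeat("a", now=1.5)   # a's heartbeat was stuck in a queue
    detector.heartbeat("a", now=3.5)
    at_4_0 = detector.suspected_dead(now=4.0)
    return at_1_2, at_4_0
```

With `run(1.0)` the live node "a" is wrongly suspected at t=1.2. With `run(5.0)` nobody is wrongly suspected, but "b" has been dead for four seconds and is still not detected at t=4.0.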

Network Congestion and queuing
Network switches can delay the delivery of packets by queuing them behind each other
The operating system will queue the incoming request until the application is ready to handle it
A virtual machine can be paused while the CPU runs another VM, so its incoming data gets queued
TCP's flow control limits how fast the sender can send, so packets queue before they even hit the network
TCP retries if it doesn't get ACKs, which delays the application
Lotta shit is going on in a data center that can clog the network switches
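To put numbers on the first point, here is a small sketch of a single switch output port modeled as a FIFO queue. `delivery_delays` is a hypothetical helper, and the model ignores everything except queuing and a fixed per-packet transmit time.

```python
def delivery_delays(arrival_times_ms, transmit_ms):
    """FIFO output port: per-packet delay (queuing + transmission) in ms."""
    delays = []
    port_free_at = 0.0
    for arrived in sorted(arrival_times_ms):
        start = max(arrived, port_free_at)  # wait if the port is still busy
        port_free_at = start + transmit_ms
        delays.append(port_free_at - arrived)
    return delays
```

Four packets that arrive at the same instant, `delivery_delays([0, 0, 0, 0], 1.0)`, see delays of 1, 2, 3, and 4 ms. The same link with spread-out arrivals, `delivery_delays([0, 10, 20], 1.0)`, gives 1 ms each. Same hardware, very different latency, and the sender has no say in it.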

Synchronous vs Asynchronous Networks
How fixed-line telephone networks work
When making a call, the network establishes a circuit
A guaranteed bandwidth is given to the call
Key thing is that nobody else can use this bandwidth while the call is active
We get a bounded delay! Because there's no queuing and bandwidth is allocated all the way

Synchronous vs Asynchronous Networks
How TCP works
No bandwidth allocation
TCP sends a packet whenever bandwidth is available
TCP does not take up bandwidth from others when idle
Ethernet and IP use packets, not circuits, so TCP can't reserve the network
BURSTY traffic all the way, gimme what I want, when I want it, quick.

Unreliable Clocks
They're important, like, time is cool
Wanna measure how long something took
Wanna measure the absolute time something happened

Unreliable Clocks
Too bad you can never really find out what time it is
Network hops take time
Some events you can't even order
Even the fucking hardware your computer uses drifts

Monotonic clocks vs Time of day
Time-of-day clocks
They... return the date and time
They synchronize with a group of servers every now and then
Sometimes you can go back to the past, fun stuff, time travel!

Monotonic clocks vs Time of day
Monotonic clocks
They measure durations
They never look back
Useless to compare monotonic clocks on different computers, because they drift differently
Multiple CPUs can have different clocks, with different drift. Operating systems do some magic or some shit here
NTP can nudge a monotonic clock's frequency faster or slower, but can't make it jump
Always measure on the same computer, even on a distributed system
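In Python the two kinds of clock are `time.time()` (time of day) and `time.monotonic()` (monotonic). A minimal sketch of using the right one for durations; `timed` is a hypothetical helper name.

```python
import time


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds)."""
    start = time.monotonic()           # only meaningful relative to itself
    result = fn(*args)
    elapsed = time.monotonic() - start # never negative, even if NTP steps the wall clock
    return result, elapsed


def wall_clock_timestamp():
    """Seconds since the Unix epoch; can jump backwards when the clock is reset."""
    return time.time()
```

Doing the same subtraction with `time.time()` mostly works, right up until the clock gets stepped in the middle of your measurement and the elapsed time comes out negative.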

Clock Synchronization
Here we go again
Jumping to the future and past because the server lords deem it so
Careful not to firewall the NTP servers, otherwise you drift all the way
BTW, the synchronization? It happens over a network... and guess what? Unbounded delay
NTP servers are sometimes just wrong - never trust anyone on the internet
Leap seconds are a thing ladies and gents, and some systems didn't even know
VMs have virtualized clocks, geez, imagine switching back and forth, time goes brrrr
Kids these days know how to reset their device's clocks
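Why network delay matters for synchronization: a sketch of the usual NTP-style offset estimate from four timestamps. `ntp_estimate` is a hypothetical function name and the numbers are invented; the formula assumes the outbound and return trips take the same time, which is exactly what an unreliable network won't promise.

```python
def ntp_estimate(t0, t1, t2, t3):
    """
    t0: client clock when the request is sent
    t1: server clock when the request arrives
    t2: server clock when the reply is sent
    t3: client clock when the reply arrives
    Returns (estimated_offset, round_trip_delay), same unit as the inputs.
    """
    offset = ((t1 - t0) + (t2 - t3)) / 2
    round_trip = (t3 - t0) - (t2 - t1)
    return offset, round_trip


# Server clock is really 100 ms ahead of the client in both scenarios.
symmetric = ntp_estimate(0, 110, 110, 20)   # 10 ms out, 10 ms back
asymmetric = ntp_estimate(0, 105, 105, 40)  # 5 ms out, 35 ms back
```

In the symmetric case the estimate is exactly 100 ms. In the asymmetric case it comes out as 85 ms: the client is now wrong by 15 ms and has no way to notice. The error is bounded by half the round trip, so a slow or congested network means a sloppier clock.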

Relying on Synchronized clocks
Just don't
Or if you must, make sure your software monitors the clock offset and deals with it
Definitely don't across multiple nodes. Shit is nasty. If you use last writer wins and depend on synchronized clocks, god save you
Writes can go poof
What's the difference between sequential and concurrent writes? This distributed system definitely doesn't know
What happens when two nodes write at the exact same time?
Sometimes a packet can get sent to the past! We are time traveling again
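A small sketch of a write going poof under last writer wins. `LwwRegister` and `replay` are hypothetical names; timestamps are in milliseconds and node A's clock is assumed to run 5 ms fast.

```python
class LwwRegister:
    """Keeps only the value with the highest timestamp."""

    def __init__(self):
        self.value = None
        self.timestamp = float("-inf")

    def write(self, value, timestamp):
        if timestamp > self.timestamp:
            self.value, self.timestamp = value, timestamp
            return True
        return False  # silently dropped, nobody gets an error


def replay():
    """Two causally ordered writes, stamped by nodes with skewed clocks."""
    register = LwwRegister()
    skew_a_ms, skew_b_ms = 5, 0

    # Real time 100: client 1 writes x=1 through node A.
    first_applied = register.write(1, 100 + skew_a_ms)
    # Real time 102: client 2 sees x=1 and writes x=2 through node B.
    second_applied = register.write(2, 102 + skew_b_ms)

    return first_applied, second_applied, register.value
```

`replay()` returns `(True, False, 1)`: the later write carries the smaller timestamp, so it loses, and the register stays at 1.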

There's always a confidence interval; if you're serious, use it. Google Spanner is good at this
Transaction IDs for snapshot isolation (yay throwback) in Google Spanner use the confidence intervals to define causality
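The idea behind using confidence intervals for ordering, as a sketch. `TimeInterval`, `definitely_before`, and `order` are hypothetical names; this is not Spanner's actual API, only the comparison rule: two events are ordered only if their intervals don't overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeInterval:
    """The true time is somewhere in [earliest, latest]."""
    earliest: float
    latest: float


def definitely_before(a, b):
    """True only if a's whole interval ends before b's begins."""
    return a.latest < b.earliest


def order(a, b):
    if definitely_before(a, b):
        return "a before b"
    if definitely_before(b, a):
        return "b before a"
    return "unknown"  # intervals overlap: the clocks can't settle it
```

For example, `order(TimeInterval(10, 12), TimeInterval(13, 15))` is `"a before b"`, while `order(TimeInterval(10, 14), TimeInterval(13, 15))` is `"unknown"`. The tighter the clock uncertainty, the less often you land in the unknown case.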

Process pauses
Dammit stop, pls
Time based leases exist, a leader is only a leader if they have one. And they expire!
But... what happens if a lease expires while the leader still thinks it holds it?
Bad shit. Multiple leaders, an uprising, some french revolution shit
But come on, this shit can't happen...
Fuck you, yes it can. JVM loves garbage and will stop everything to gather it all
VMs sometimes get tired and go to sleep
Some dumbfuck user can close their laptop
An operating system can context switch away
If you're a shitty dev, IO can cause a delay. If you're a great dev, IO can cause a delay. Might as well be a shitty dev.
Oh btw, page faults and page swapping are a thing, don't worry about it, it'll only fuck up your whole distributed system
Unix signals are also a thing

You can sometimes create real-time delay guarantees, but buddy, we're not building a car or airplane. Plus making this guarantee fucks everything else up
Sometimes you can just let the garbage pile up
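A sketch of the lease problem with a hand-cranked clock, so the pause is deterministic. `FakeClock`, `Lease`, `leader_write`, and `demo` are hypothetical names; the `clock.advance(...)` call stands in for a GC pause, a suspended VM, or any of the other pauses listed above.

```python
class FakeClock:
    """Manually advanced clock, standing in for a real monotonic clock."""

    def __init__(self):
        self.now = 0.0

    def advance(self, seconds):
        self.now += seconds


class Lease:
    def __init__(self, clock, duration_s):
        self.clock = clock
        self.expires_at = clock.now + duration_s

    def is_valid(self):
        return self.clock.now < self.expires_at


def leader_write(lease, clock, pause_s):
    """Check the lease, get paused, then write without checking again."""
    if not lease.is_valid():
        return ("refused", None)
    clock.advance(pause_s)                 # the process is frozen here
    lease_valid_at_write = lease.is_valid()
    return ("wrote", lease_valid_at_write)


def demo(pause_s):
    clock = FakeClock()
    lease = Lease(clock, duration_s=10.0)
    return leader_write(lease, clock, pause_s)
```

`demo(0.1)` gives `("wrote", True)`, all good. `demo(15.0)` gives `("wrote", False)`: the check passed, the process froze for 15 seconds, and the write went out after the lease had expired. The paused process has no idea any time passed.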

What is life?
Knowledge, Truth and Lies

Knowledge, Truth and Lies
Truth is defined by the majority
If the majority think you're dead, off you go.
How to be a real-life zombie
A zombie can pretend to be a crewmate and mess everybody up
Miguel's favorite sport
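The majority rule in a few lines, as a sketch. `majority` and `declared_dead` are hypothetical helper names; a quorum here simply means more than half of the nodes.

```python
def majority(n_nodes):
    """Smallest number of votes that is more than half of n_nodes."""
    return n_nodes // 2 + 1


def declared_dead(votes_dead, n_nodes):
    """A node is out once a majority says so, whatever the node itself thinks."""
    return votes_dead >= majority(n_nodes)
```

With 5 nodes, 3 votes are enough: `declared_dead(3, 5)` is true and `declared_dead(2, 5)` is not. Two different majorities of the same cluster always share at least one node, which is why they can't reach conflicting decisions.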

Knowledge, Truth and Lies
Byzantine faults
Honestly, just don't worry about it

Like seriously, the book was like "radiation"
nope

System Model and Reality

System Model and Reality
Timing assumptions
Synchronous model: bounded network delay, bounded process pauses, bounded clock error... as if life could be that easy

Partially synchronous model: behaves like the synchronous model most of the time, but sometimes blows past the bounds on delay, pauses, and clock error. No guarantees, but realistic. Like a glimmer of hope that stuff works, sometimes

Asynchronous model: can't assume anything about timing, you're on your own...

System Model and Reality
Types of crashes
Crash-stop
Crash-recovery
Byzan.. nope

System Model and Reality
Correctness properties
Uniqueness
Monotonic sequence
Availability

PS: your choice of system timing model is important here
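A sketch of what checking these properties could look like, assuming the thing being specified is a service that hands out numbered tokens and that we have a list of tokens in the order the requests completed. `is_unique` and `is_monotonic` are hypothetical helper names.

```python
def is_unique(tokens):
    """Uniqueness: no two requests got the same token."""
    return len(set(tokens)) == len(tokens)


def is_monotonic(tokens):
    """Monotonic sequence: each later request got a strictly larger token."""
    return all(a < b for a, b in zip(tokens, tokens[1:]))
```

`is_monotonic([1, 3, 2])` fails, and you can point at the exact request where it broke. There is no equivalent check for availability: a finite log where a request hasn't been answered yet never proves it won't be answered later.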

System Model and Reality
Safety and liveness
Safety: "Nothing bad happens"
Liveness: "Eventually... something good happens"

System Model and Reality
Safety and liveness (for real this time)
Safety: If a safety guarantee is broken, we can point at the particular point in time at which it was broken. The damage cannot be undone, gg.
Safety properties must always hold in distributed systems, otherwise shit hits the fan

Liveness: It may not hold at an exact point in time, but there is always hope 🙄
Eeeh in our system model, we'll just give up hope and recover

I'm sad
