
Designing Data Intensive Applications

Chapter 8: Total Chaos

By Tarik Eshaq
1 / 32

Faults and Partial Failures

A single computer vs a distributed system

2 / 32

Why we can't just crash

  • It's online, users have expectations
  • High failure rates, commodity nodes
  • Network topologies are different from supercomputers', it's unreasonable to build multi-dimensional meshes out of Ethernet and IP
  • High likelihood some part of the system is always broken
  • Rolling upgrades without interrupting service
  • Geographical distribution, nodes can be far away from each other
3 / 32

Unreliable Networks

What's the worst that could happen?

4 / 32

Unreliable Networks

What's the worst that could happen?

  • Request is lost
  • Request is waiting in a queue and will be delivered later
  • Remote node failed
  • Remote node temporarily stopped responding, and will be back
  • Remote node processed the request, but response got lost
  • Remote node processed the request, and response is waiting in a queue
5 / 32

Unreliable Networks

You can't tell whether the request was lost, the remote node is down, the response was lost, or the response is still on its way

Timeout!
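
A minimal sketch of why the timeout is all the sender ever sees. Everything here is made up for illustration: `Outcome` and `send_request` are hypothetical names that simulate the failure modes from the previous slide, not a real network API.

```python
import enum
from typing import Optional


class Outcome(enum.Enum):
    REQUEST_LOST = "request lost"
    REQUEST_QUEUED = "request stuck in a queue"
    NODE_DEAD = "remote node failed"
    NODE_PAUSED = "remote node temporarily stopped responding"
    RESPONSE_LOST = "response lost"
    RESPONSE_QUEUED = "response stuck in a queue"
    OK = "response arrived in time"


def send_request(outcome: Outcome) -> Optional[str]:
    """Simulated call: returns the response, or None if the timeout fired."""
    if outcome is Outcome.OK:
        return "pong"
    # Every failure mode looks exactly the same from the caller's side.
    return None


def observe(outcome: Outcome) -> str:
    """What the sender can actually see for a given (hidden) outcome."""
    response = send_request(outcome)
    return "got response" if response is not None else "timeout"


# Map each hidden outcome to what the sender observes.
observations = {outcome: observe(outcome) for outcome in Outcome}
```

Six different things went wrong, and the sender got the same single observation for all of them.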

6 / 32

Network faults in practice

Pretty much networks are shit. They always break in practice

  • About 12 network faults a month INSIDE a medium-sized data center, according to one study
  • Shit happens, sharks bite wires, network links start working in only one direction
  • Someone drives their car into your data center
  • A software upgrade can cause a reconfiguration of the topology that delays network packets, and clients freak out
7 / 32

Network faults in practice

Even if they are unlikely, you need to deal with them

8 / 32

Detecting Faults

It's kinda important

  • Need to kick a node out of the network
  • If a leader dies, need to promote a new one
9 / 32

Detecting Faults

Why it's hard

  • The non-determinism that's core to distributed systems. What if the node is not dead?
  • Sometimes nodes can tell you when they die
  • Otherwise ACK ACK ACK
10 / 32

TIMEOUT ⏲️

How long to wait?

  • Too short and you declare alive nodes dead
    • Then the system duplicates that sorry ass node's work, and load goes up
    • Cascading failure, all nodes declare each other dead. Look what you've done
  • Too long and users think your service is slow
    • Users hate you, and that dead ass node has been dead for a while. Look what you've done
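
One way out of picking a magic number is to measure round-trip times and derive the timeout from their spread. The sketch below is an assumption-heavy illustration: the function name, the `k` multiplier, and the floor are all invented here, and real failure detectors are more careful than mean plus a few standard deviations.

```python
import statistics


def adaptive_timeout(rtts_ms, k=4.0, floor_ms=10.0):
    """Timeout = mean RTT + k * standard deviation, never below floor_ms.

    k and floor_ms are hypothetical tuning knobs, not recommended values.
    """
    mean = statistics.fmean(rtts_ms)
    jitter = statistics.pstdev(rtts_ms)
    return max(floor_ms, mean + k * jitter)


calm = [20.0, 21.0, 19.0, 20.0]      # quiet network, little jitter
bursty = [20.0, 80.0, 15.0, 120.0]   # noisy neighbours, lots of jitter

calm_timeout = adaptive_timeout(calm)
bursty_timeout = adaptive_timeout(bursty)
```

A calm network gets a tight timeout, a bursty one gets more slack, and neither value has to be hard-coded.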
11 / 32

Network Congestion and queuing

  • Network switches queue packets when several senders target the same destination, which delays delivery
  • The operating system queues incoming requests while all CPU cores are busy
  • A virtual machine gets paused while the CPU runs another VM, and its incoming data is queued in the meantime
  • TCP's flow control limits how fast a node sends, so packets are queued at the sender too
  • TCP retransmits if it doesn't get ACKs, which delays the application
  • Lotta shit is going on in a data center that can clog the network switches
12 / 32

Synchronous vs Asynchronous Networks

How fixed-line telephone networks work

  • When making a call, the network establishes a circuit
  • A guaranteed bandwidth is given to the call
  • Key thing is that nobody can use this bandwidth while the call is active
  • We get a bounded delay! Because no queuing and bandwidth is allocated all the way
13 / 32

Synchronous vs Asynchronous Networks

How TCP works

  • No bandwidth allocation
  • TCP sends a packet whenever bandwidth is available
  • TCP does not take up bandwidth from others when idle
  • Ethernet and IP use packets, not circuits so TCP can't lock the network
  • BURSTY traffic all the way, gimme what I want, when I want it, quick.
14 / 32

Unreliable Clocks

They're important, like, time is cool

  • Wanna measure how long something took
  • Wanna measure absolute time something happened
15 / 32

Unreliable Clocks

Too bad you can never really find out what time it is

  • Network hops take time
  • Some events you can't even order
  • Even the fucking hardware clock your computer uses drifts
16 / 32

Monotonic clocks vs Time of day

Time-of-day clocks

  • They... return the date and time
  • They synchronize with a group of NTP servers every now and then
  • Sometimes you can go back to the past, fun stuff, time travel!
17 / 32

Monotonic clocks vs Time of day

Monotonic clocks

  • They measure durations
  • They never look back
  • Useless to compare monotonic clocks on different computers, because they skew differently
  • Multiple CPUs can have different clocks, with different skews. Operating systems do some magic or some shit here
  • NTP can still nudge a monotonic clock's frequency, speeding it up or slowing it down, but it can't make it jump
  • Always measure on the same computer, even on a distributed system
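
A small sketch of the difference in Python, where `time.monotonic()` is the monotonic clock and `time.time()` is the time-of-day clock. The `timed` helper is a hypothetical name used only for this example.

```python
import time


def timed(fn, *args):
    """Run fn and return (result, elapsed_seconds) using the monotonic clock."""
    start = time.monotonic()
    result = fn(*args)
    elapsed = time.monotonic() - start  # never negative, even if NTP steps the wall clock
    return result, elapsed


result, elapsed = timed(sum, range(1000))

# The time-of-day clock is fine for "what time is it?", but a difference of two
# time.time() readings can come out negative if the clock is reset in between.
wall_clock_now = time.time()
```

The absolute value of `time.monotonic()` means nothing, and comparing it across machines means even less. Only the difference between two readings on the same machine is useful.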
18 / 32

Clock Synchronization

Here we go again

  • Jumping to the future and past because the server lords deem it so
  • Careful not to firewall the NTP servers, otherwise you skew all the way
  • BTW, the synchronization? it happens over a network... and guess what? unbounded delay
  • NTP servers are sometimes just wrong - never trust anyone on the internet
  • Leap seconds are a thing ladies and gents, and some systems didn't even know
  • VMs have virtualized clocks, geez, imagine switching back and forth, time goes brrrr
  • Kids these days know how to reset their device's clocks
19 / 32

Relying on Synchronized clocks

Just don't

  • Or if you must, make sure your software monitors it and deals with it
  • Definitely don't across multiple nodes. Shit is nasty. If you use last writer wins and depend on synchronized clocks, god save you
    • Writes can go poof
    • What's the difference between writes that happened one after the other and writes that were truly concurrent? This distributed system definitely doesn't know
    • What happens when two writes get the exact same timestamp?
    • Sometimes a packet can get sent to the past! We are time traveling again
  • There's always a confidence interval, if you're serious, use it. Google Spanner is good at this
  • Google Spanner uses the confidence intervals to order transaction IDs for snapshot isolation (yay throwback), which is how it defines causality
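
A sketch of both ideas: last write wins eating a write because of clock skew, and the confidence-interval check that refuses to guess. The `Write` class, the 7 ms uncertainty, and the timestamps are invented for illustration. This is the spirit of the interval idea, not Spanner's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    value: str
    timestamp: float     # local clock reading, in seconds
    uncertainty: float   # assumed bound on clock error, in seconds

    @property
    def earliest(self) -> float:
        return self.timestamp - self.uncertainty

    @property
    def latest(self) -> float:
        return self.timestamp + self.uncertainty


def last_write_wins(a: Write, b: Write) -> Write:
    """Naive LWW: trust the raw timestamps."""
    return a if a.timestamp > b.timestamp else b


def definitely_before(a: Write, b: Write) -> bool:
    """Only claim an order when the two intervals don't overlap."""
    return a.latest < b.earliest


# Node 1's clock runs fast. Node 2 writes x=2 AFTER node 1 in real time,
# but node 2's clock stamps it with a smaller timestamp.
first = Write("x=1", timestamp=100.005, uncertainty=0.007)
second = Write("x=2", timestamp=100.003, uncertainty=0.007)
much_later = Write("x=3", timestamp=100.030, uncertainty=0.007)

winner = last_write_wins(first, second)  # keeps "x=1", the newer "x=2" goes poof
```

With intervals, neither `definitely_before(first, second)` nor `definitely_before(second, first)` holds, so the honest answer is "can't tell". Only `much_later` is far enough away to be ordered with confidence.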
20 / 32

Process pauses

Dammit stop, pls

  • Time based leases exist, a leader is only a leader if they have one. And they expire!
  • but.. what happens if a lease expires while the leader still thinks it holds it?
  • bad shit. multiple leaders, an uprising, some french revolution shit
  • But come on, this shit can't happen...
    • Fuck you, yes it can. JVM loves garbage and will stop everything to gather it all
    • VMs sometimes get tired and go to sleep
    • Some dumbfuck user can close their laptop
    • An operating system can context switch away
    • If you're a shitty dev, IO can cause a delay. If you're a great dev, IO can cause a delay. Might as well be a shitty dev.
    • Oh btw, page faults and page swapping are a thing, don't worry about it, it'll only fuck up your whole distributed system
    • Unix signals are also a thing
  • You can sometimes create real-time delay guarantees, but buddy, we're not building a car or airplane. Plus making this guarantee fucks everything else up
  • Sometimes you can just let the Garbage pile up
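
A sketch of the lease problem with the pause made explicit. `FakeClock`, `Lease`, and `handle_request` are hypothetical names, and the fake clock stands in for a monotonic clock so the pause is reproducible. In real life the pause is invisible to the paused process.

```python
class FakeClock:
    """Hypothetical stand-in for a monotonic clock, in seconds."""

    def __init__(self):
        self.now = 0.0

    def advance(self, seconds):
        self.now += seconds


class Lease:
    def __init__(self, clock, duration):
        self.clock = clock
        self.expires_at = clock.now + duration

    def is_valid(self):
        return self.clock.now < self.expires_at


def handle_request(lease, clock, pause_seconds):
    """Check the lease, then get paused (GC, VM suspend, swap) before acting."""
    if not lease.is_valid():
        return "refused: lease expired"
    clock.advance(pause_seconds)  # the process has no idea this happened
    # By now the check above may be stale, but the code carries on anyway.
    if lease.is_valid():
        return "ok"
    return "acted as leader after the lease expired"


clock = FakeClock()
lease = Lease(clock, duration=10.0)

short_pause = handle_request(lease, clock, pause_seconds=0.1)
long_pause = handle_request(lease, clock, pause_seconds=15.0)
afterwards = handle_request(lease, clock, pause_seconds=0.0)
```

The second call passes the check and then sleeps through the expiry, which is exactly the window where another node can become leader.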
21 / 32

What is life?

Knowledge, Truth and Lies

22 / 32

Knowledge, Truth and Lies

Truth is defined by the majority

  • If the majority think you're dead, off you go.
  • How to be a real-life zombie
  • A zombie can pretend to be a crewmate and mess everybody up
  • Miguel's favorite sport
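
"Defined by the majority" boils down to a quorum count. A minimal sketch, assuming each node reports one boolean vote. The function name and node names are made up.

```python
def majority_says_dead(votes):
    """votes maps node name -> True if that node thinks the suspect is dead."""
    dead_votes = sum(1 for thinks_dead in votes.values() if thinks_dead)
    # Strictly more than half, so a tie is not a quorum.
    return dead_votes > len(votes) // 2


five_nodes = {"a": True, "b": True, "c": True, "d": False, "suspect": False}
four_nodes = {"a": True, "b": True, "c": False, "suspect": False}

verdict_five = majority_says_dead(five_nodes)   # 3 of 5 say dead
verdict_four = majority_says_dead(four_nodes)   # 2 of 4 is only a tie
```

Note that the suspect's own opinion is just one vote. It can feel perfectly alive and still get voted off.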
23 / 32

Knowledge, Truth and Lies

Byzantine faults

Honestly, just don't worry about it

24 / 32

Like seriously, the book was like "radiation"

nope

25 / 32

System Model and Reality

26 / 32

System Model and Reality

Timing assumptions

  • Synchronous Model:
    • bounded network delay, bounded process pauses, bounded clock error... as if life could be that easy
  • Partially synchronous model:
    • behaves like a synchronous system most of the time, but sometimes blows past the bounds - no guarantees, realistic.. like a glimmer of hope that stuff works, sometimes
  • Asynchronous model:
    • can't assume anything, you're on your own...
27 / 32

System Model and Reality

Types of crashes

  • Crash-stop
  • Crash-recover
  • Byzan.. nope
28 / 32

System Model and Reality

Correctness properties

  • Uniqueness
  • Monotonic sequence
  • Availability

ps your choice for system timing model is important here
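
The first two properties are easy to state as checks. This sketch assumes they describe a service that hands out numbers (as in the book's fencing-token example) and that requests run one after another, so the list order is the completion order. Both function names are hypothetical.

```python
def is_unique(tokens):
    """Uniqueness: no two requests got the same token."""
    return len(set(tokens)) == len(tokens)


def is_monotonic(tokens):
    """Monotonic sequence: a request that finished earlier got a smaller token.

    Simplifying assumption: tokens are listed in the order the
    (non-overlapping) requests completed.
    """
    return all(earlier < later for earlier, later in zip(tokens, tokens[1:]))


good_history = [31, 32, 33, 40]
duplicate_history = [31, 32, 32, 40]
backwards_history = [31, 33, 32, 40]
```

Availability can't be checked by looking at a finished history like this, because it is a claim about what eventually happens.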

29 / 32

System Model and Reality

Safety and liveness

  • Safety: "Nothing bad happens"
  • Liveness: "Eventually... something good happens"
30 / 32

System Model and Reality

Safety and liveness (for real this time)

  • Safety:
    • If a safety guarantee is broken we can point at the particular point in time at which it was broken. The damage cannot be undone gg.
    • They must always hold in a distributed system, otherwise shit hits the fan
  • Liveness:
    • It may not hold at an exact point in time, but there is always hope 🙄
    • Eeeh in our system model, we'll just give up hope and recover
31 / 32

I'm sad

32 / 32
