class: center, middle

# Designing Data Intensive Applications
## Chapter 8: Total Chaos
##### By Tarik Eshaq

---
class: center, middle

# Faults and Partial Failures
## A single computer vs a distributed system

---

# Why we can't just crash

- It's online; users have expectations
- High failure rates on commodity nodes
- Network topologies are different from supercomputers': it's unreasonable to build multi-dimensional meshes out of Ethernet and IP
- High likelihood that some part of the system is always broken
- Rolling upgrades without interrupting service
- Geographical distribution: nodes can be far away from each other

---
class: center, middle

# Unreliable Networks
## What's the worst that could happen?

---

# Unreliable Networks
## What's the worst that could happen?

- Request is lost
- Request is waiting in a queue and will be delivered later
- Remote node failed
- Remote node temporarily stopped responding, and will be back
- Remote node processed the request, but the response got lost
- Remote node processed the request, and the response is waiting in a queue

---
class: center, middle

# Unreliable Networks
## You can't tell whether the request was lost, the remote node is down, the response was lost, or the response will eventually arrive
## Timeout!

---

# Network faults in practice
## Networks are shit, basically. They always break in practice

- About 12 network faults a month INSIDE a single data center
- Shit happens: sharks bite undersea cables, network links start working in only one direction
- Someone drives their car into your data center
- A software upgrade can reconfigure the topology, delay network packets, and make clients freak out

---
class: center, middle

# Network faults in practice
## Even if they are unlikely, you need to deal with them

---

# Detecting Faults
## It's kinda important

- Need to kick a dead node out of the cluster
- If a leader dies, need to promote a new one

---

# Detecting Faults
## Why it's hard

- The non-determinism that's core to distributed systems: what if the node is not actually dead?
- Sometimes nodes can tell you when they die
- Otherwise: ACK, ACK, ACK

---

# TIMEOUT ⏲️
## How long to wait? (a sketch follows the queuing slide)

- Too short, and you declare live nodes dead
  - Then the system duplicates that poor node's work, and load takes a hit
  - Cascading failure: overloaded nodes declare each other dead. Look what you've done
- Too long, and users think your service is slow
  - Users hate you, and that node has been dead for a while. Look what you've done

---

# Network Congestion and queuing

- Network switches queue packets against each other, delaying delivery
- The operating system queues incoming requests until the application is ready to handle them
- A virtual machine gets queued while the CPU runs another VM
- TCP's flow control has a limited window, so packets queue up at the sender
- TCP retransmits when ACKs don't arrive, which the application experiences as delay
- A whole lot is going on in a data center that can clog the network switches
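---

# TIMEOUT ⏲️, sketched

A minimal sketch of timeout-based failure detection. The address, port, and the `ping`/`pong` protocol are made up for illustration; only the timeout logic is the point.

```python
import socket

NODE_ADDR = ("10.0.0.5", 7000)   # hypothetical node to probe
TIMEOUT_SECONDS = 1.0            # this whole chapter, in one constant

def probably_alive(addr=NODE_ADDR, timeout=TIMEOUT_SECONDS):
    """True if the node answered in time.

    False only means "no answer within the timeout": the node may be dead,
    slow, paused for GC, or the network may have eaten the request or reply.
    """
    try:
        with socket.create_connection(addr, timeout=timeout) as sock:
            sock.settimeout(timeout)     # bound the wait for the reply too
            sock.sendall(b"ping\n")
            return sock.recv(4) == b"pong"
    except OSError:                      # refused, reset, or timed out
        return False
```

Note the name: the function can only ever say "probably". That's the whole problem.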
---

# Synchronous vs Asynchronous Networks
## How fixed-line telephone networks work

- When making a call, the network establishes a circuit
- A guaranteed bandwidth is allocated to the call
- Key thing: nobody else can use that bandwidth while the call is active
- We get a bounded delay! Because there is no queuing, and bandwidth is allocated all the way along the path

---

# Synchronous vs Asynchronous Networks
## How TCP works

- No bandwidth allocation
- TCP sends a packet whenever bandwidth is available
- TCP does not take up bandwidth from others while idle
- Ethernet and IP use packets, not circuits, so TCP can't lock down the network
- BURSTY traffic all the way: gimme what I want, when I want it, quick

---

# Unreliable Clocks
## They're important; like, time is cool

- Wanna measure how long something took
- Wanna know the absolute time at which something happened

---

# Unreliable Clocks
## Too bad you can never really find out what time it is

- Network hops take time
- Some events you can't even order
- Even the damn clock hardware your computer uses drifts

---

# Monotonic clocks vs Time of day
## Time-of-day clocks

- They... return the date and time
- They synchronize with a group of NTP servers every now and then
- Sometimes you can go back to the past. Fun stuff, time travel!

---

# Monotonic clocks vs Time of day
## Monotonic clocks

- They measure durations
- They never look back
- Useless to compare monotonic clocks on different computers, because they drift differently
- Multiple CPUs can have different clocks with different drifts; operating systems do some magic or some shit here
- NTP servers can nudge a monotonic clock's frequency (slew it), but never make it jump
- Always measure on the same computer, even in a distributed system

---

# Clock Synchronization
## Here we go again

- Jumping to the future and the past because the server lords deem it so
- Careful not to firewall off the NTP servers, otherwise you drift all the way
- BTW, the synchronization? It happens over a network... and guess what? Unbounded delay
- NTP servers are sometimes just wrong: never trust anyone on the internet
- Leap seconds are a thing, ladies and gents, and some systems didn't even know
- VMs have virtualized clocks; geez, imagine switching back and forth, time goes brrrr
- Kids these days know how to reset their devices' clocks

---

# Relying on Synchronized clocks
## Just don't

- Or if you must, make sure your software monitors clock skew and deals with it
- Definitely don't rely on them across multiple nodes. Shit is nasty. If you use last-writer-wins and depend on synchronized clocks, God save you (see the sketch after the process-pauses slide)
- Writes can go poof
- Sequential writes in quick succession vs truly concurrent writes? This distributed system definitely can't tell the difference
- What happens when two nodes write at the exact same timestamp?
- Sometimes a write can get sent to the past! We are time traveling again
- There's always a confidence interval; if you're serious, use it. Google Spanner is good at this
- Transaction IDs for snapshot isolation transactions (yay, throwback) in Google Spanner use the confidence intervals to define causality

---

# Process pauses
## Dammit stop, pls

- Time-based leases exist; a leader is only a leader while it holds one. And they expire!
- But... what happens if the process pauses and the lease expires while the leader still thinks it holds it?
- Bad shit: multiple leaders, an uprising, some French Revolution shit (a fencing-token sketch follows this slide)
- But come on, this shit can't happen...
- Oh yes it can. The JVM loves garbage and will stop the world to gather it all
- VMs sometimes get tired and go to sleep (suspend and resume)
- A user can just close their laptop
- An operating system can context-switch away
- If you're a shitty dev, IO can cause a delay. If you're a great dev, IO can cause a delay. Might as well be a shitty dev
- Oh BTW, page faults and swapping are a thing. Don't worry about it, it'll only fuck up your whole distributed system
- Unix signals are also a thing (hello, SIGSTOP)
- You can sometimes get real-time delay guarantees, but buddy, we're not building a car or an airplane. Plus making that guarantee fucks everything else up
- Sometimes you can just let the garbage pile up (and restart the process before a full GC)
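---

# Last-writer-wins, sketched

A minimal sketch of why LWW on top of skewed time-of-day clocks loses data. The store, the skew values, and the node names are invented for illustration.

```python
import time

store = {}  # key -> (timestamp, value)

def lww_write(key, value, ts):
    """Keep only the write with the highest timestamp."""
    prev = store.get(key)
    if prev is None or ts > prev[0]:
        store[key] = (ts, value)

def clock(skew_seconds):
    """A time-of-day clock that is wrong by a fixed amount."""
    return time.time() + skew_seconds

lww_write("x", "from node A", clock(+2.0))  # A's clock runs 2 s fast
time.sleep(0.1)                             # B really does write AFTER A...
lww_write("x", "from node B", clock(0.0))   # ...with an accurate clock
print(store["x"])                           # still "from node A": B's write went poof
```

---

# Fencing tokens, sketched

A minimal sketch of fencing off a paused ex-leader, assuming a lock service that hands out a strictly increasing token with every lease.

```python
class LockService:
    def __init__(self):
        self.token = 0

    def acquire(self):
        self.token += 1
        return self.token            # fencing token: strictly increasing

class Storage:
    def __init__(self):
        self.max_token_seen = 0
        self.value = None

    def write(self, value, token):
        # Reject any write carrying an older token than one already seen.
        if token < self.max_token_seen:
            raise PermissionError(f"stale token {token} < {self.max_token_seen}")
        self.max_token_seen = token
        self.value = value

locks, disk = LockService(), Storage()
t1 = locks.acquire()                 # client 1 takes the lease...
t2 = locks.acquire()                 # ...pauses (GC!), lease expires, client 2 takes over
disk.write("client 2 was here", t2)  # accepted
try:
    disk.write("client 1 was here", t1)
except PermissionError as e:
    print("zombie fenced off:", e)   # the ex-leader's write is rejected
```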
---
class: center, middle

# What is life?
## Knowledge, Truth and Lies

---

## Knowledge, Truth and Lies
### Truth is defined by the majority

- If the majority think you're dead, off you go
- How to be a real-life zombie
- A zombie can pretend to be a crewmate and mess everybody up
- Miguel's favorite sport

---
class: center, middle

## Knowledge, Truth and Lies
### Byzantine faults
#### Honestly, just don't worry about it

---
class: center, middle

# Like seriously, the book was like "radiation"
## nope

---
class: center, middle

# System Model and Reality

---

# System Model and Reality
## Timing assumptions

- Synchronous model:
  - Bounded network delay, bounded process pauses, bounded clock error... as if life could be that easy
- Partially synchronous model:
  - Behaves like the synchronous model most of the time, but sometimes exceeds the bounds
  - No hard guarantees, but realistic... like a glimmer of hope that stuff works, most of the time
- Asynchronous model:
  - Can't assume anything; you're on your own...

---

# System Model and Reality
## Types of crashes

- Crash-stop
- Crash-recovery
- Byzan... nope

---

# System Model and Reality
## Correctness properties

- Uniqueness
- Monotonic sequence
- Availability

P.S. your choice of timing model matters here

---

# System Model and Reality
## Safety and liveness

- Safety: "nothing bad happens"
- Liveness: "eventually... something good happens"

---

# System Model and Reality
## Safety and liveness (for real this time)

- Safety:
  - If a safety guarantee is broken, we can point at the particular moment in time at which it was broken. The damage cannot be undone. GG
  - They **must** hold in distributed systems, even when everything misbehaves; otherwise shit hits the fan
- Liveness:
  - It may not hold at an exact point in time, but there is always hope 🙄
  - Eeeh, in our system model we're allowed caveats, like assuming crashed nodes eventually recover

---
class: middle, center

# I'm sad
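---

# Bonus: majority rule, sketched

A throwaway sketch of "truth is defined by the majority": a node is declared dead only if a strict majority of observers votes that way. The vote format is invented for illustration.

```python
def declared_dead(votes):
    """votes: one boolean per observer, True = 'I think the node is dead'."""
    return sum(votes) > len(votes) // 2

print(declared_dead([True, True, False]))   # True: 2 of 3 — off you go
print(declared_dead([True, False, False]))  # False: a minority can't kill you
```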