Designing Data Intensive Applications

Chapter 10: Batch Processing

By Tarik Eshaq & Codepiolt
1 / 17

Batch processing

Large input -> jobs -> output

2 / 17

Batch processing using Unix tools

Simple log analysis: finding the 5 most popular articles!

cat /var/log/apache2/access.log | awk '{print $7}' | sort | uniq -c | sort -nr | head -5
  • cat: Reads the log file
  • |: Pipes the output of the previous command into the next command
  • awk '{print $7}': Prints only the 7th whitespace-separated field of each line (the requested URL in this log format)
  • sort: Sorts the lines so that repeated URLs become adjacent
  • uniq -c: Collapses adjacent duplicate lines, prefixing each with a count of how many times it appeared
  • sort -nr: Sorts numerically by that count, in descending order
  • head -5: Prints the first 5 lines (the 5 most requested URLs)

It's flexible and easily extensible!
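
For comparison, here is a rough sketch of the same analysis as a custom program (not from the chapter); it assumes the same log path and Apache log format as the pipeline above:

```python
from collections import Counter

# Count requested URLs and print the 5 most common ones.
# Assumes the URL is the 7th whitespace-separated field, as above.
counts = Counter()
with open("/var/log/apache2/access.log") as log:
    for line in log:
        fields = line.split()
        if len(fields) >= 7:
            counts[fields[6]] += 1  # 7th field (0-indexed: 6)

for url, count in counts.most_common(5):
    print(count, url)
```

Note the difference in approach: the pipeline sorts the URLs and scans for adjacent duplicates, while this sketch keeps an in-memory hash table of counts, which only works while the set of distinct URLs fits in memory.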

3 / 17

The Unix Philosophy

4 / 17

The Unix Philosophy

  • Each program should do one thing, and do it well
  • Expect the output of every program to become the input to another, as yet unknown, program
  • Design and build software to be tried early
  • Use tools in preference to unskilled help, even if you have to build the tools and might throw them away
5 / 17

The Unix Interface

It's file descriptors!

  • They can point to actual files, stdin, TCP sockets, etc.!

    Logic and wiring are separate

  • The program reads from stdin, and writes to stdout. The caller can modify where those point to!

    Transparency and experimentation

  • Input is immutable, meaning running the program multiple times doesn't change the input
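
A minimal sketch of that separation (the script name below is made up): the program contains only the logic, and the caller does the wiring by deciding what stdin and stdout point to.

```python
import sys

# A Unix-style filter: read lines from stdin, write upper-cased lines
# to stdout. Whether those are files, pipes, or a terminal is decided
# by whoever runs the program, not by the program itself.
for line in sys.stdin:
    sys.stdout.write(line.upper())
```

For example, `cat access.log | python3 shout.py > shouted.log` wires the same logic to a file on both ends.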
6 / 17

MapReduce & Distributed File Systems

7 / 17

MapReduce & Distributed File Systems

Think Unix, but now across multiple machines!

| Unix | MapReduce |
| ---- | --------- |
| stdin/stdout | Files in a distributed file system |
| A process | A MapReduce job |
| Runs on one machine | Runs on multiple machines |
| The Unix interface | Mappers and reducers! |

8 / 17

MapReduce

The distributed file system is key

  • It handles fault tolerance
  • Allows storing a tremendous amount of data
  • Expands the ability to parallelize
9 / 17

MapReduce

The MapReduce execution steps

Based on the Simple log analysis example, we can see the following steps:

  1. MapReduce reads a set of input files and breaks each file into records
  2. Mappers read each record and emit key/value pairs. In this case, the key is the URL and the value is blank.
  3. All of the key/value pairs are then sorted by key and written to a temporary file
  4. Reducers read the sorted file, combine the values with the same key, and emit key/value pairs. In this case, the key is the URL and the value is the number of times the URL was seen.

Steps 2 and 4 are done by custom code, everything else is done by a MapReduce framework.
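
A minimal sketch of steps 2 and 4 for the URL-count example, with the framework's sort (step 3) simulated in-process. This follows the stdin/stdout style of Hadoop Streaming rather than the real Hadoop API, and the field position is carried over from the log-analysis slide:

```python
import sys
from itertools import groupby

def mapper(lines):
    # Step 2: emit one (URL, blank value) pair per input record.
    for line in lines:
        fields = line.split()
        if len(fields) >= 7:
            yield fields[6], ""

def reducer(sorted_pairs):
    # Step 4: pairs arrive sorted by key, so all occurrences of a URL
    # are adjacent; count them and emit (URL, count).
    for url, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        yield url, sum(1 for _ in group)

if __name__ == "__main__":
    # Steps 1 and 3, normally done by the framework: read the records,
    # then sort the mapper output by key before handing it to the reducer.
    pairs = sorted(mapper(sys.stdin), key=lambda kv: kv[0])
    for url, count in reducer(pairs):
        print(f"{url}\t{count}")
```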

10 / 17

MapReduce

What's a mapper and a reducer?

  • Mapper: Takes one input record, and emits any number of key/value pairs.
  • Reducer: Takes a key and the collection of all values with that key, and emits output records (often one per key).
11 / 17

Distributed MapReduce

  • Parallelization is achieved by partitioning. The input to a job is typically a directory, and each file in it is one partition.
  • The framework assigns each mapper to a machine that holds a replica of its input partition, putting the computation close to the data.
  • The number of reduce partitions is typically chosen by the job author. The framework hashes each key to determine which reducer receives it, which ensures that records with the same key always go to the same reducer.
  • Reducer machines are responsible for merging the sorted inputs from the mappers (think merging K sorted lists!).
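
A sketch of that key-to-reducer assignment; the function name is made up, and the point is only that the mapping is deterministic, so every record with a given key lands at the same reducer:

```python
import hashlib

def reducer_for(key: str, num_reducers: int) -> int:
    # Hash the key and take it modulo the number of reducers, so the
    # same key always maps to the same reducer partition.
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_reducers

print(reducer_for("/index.html", 4))  # always the same partition for this key
```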
12 / 17

MapReduce Workflows

  • MapReduce doesn't support workflows out of the box.
  • Workflows can be built by setting the output directory of one job to be the input directory of another job.
  • NOT the same as Unix pipes, where small in-memory buffers stream the output of one process into the next.
  • Each job can only start when the previous job has finished.
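
A sketch of what such a workflow amounts to; the directory and function names are made up. The only contract between the two jobs is the directory name, and the second job cannot start until the first has finished writing its files:

```python
def count_urls(input_dir: str, output_dir: str) -> None:
    ...  # first MapReduce job: write per-URL counts into output_dir

def top_pages(input_dir: str, output_dir: str) -> None:
    ...  # second MapReduce job: read those counts, keep the most popular pages

# The "workflow" is just an agreement on directory names: job 2's input
# directory is job 1's output directory.
count_urls("/data/logs", "/data/url-counts")
top_pages("/data/url-counts", "/data/top-pages")
```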
13 / 17

MapReduce's Reduce-Side Join

  • Creating associations between related data
  • Grouping data by the same key (for example, to collect all of the events in a user's session)
  • Handling skew in hot keys
    • Split a hot key's records across several reducers
    • Use an additional MapReduce job to combine the per-reducer results for the hot keys
  • A map-side join is an alternative to a reduce-side join
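
A toy in-memory sketch of a reduce-side join, not the Hadoop API; the sample records are made up. Mappers tag each record with its source, the shuffle groups everything by key, and the reducer sees a user's activity events and profile together:

```python
from collections import defaultdict

activity = [("user1", "viewed /home"), ("user2", "viewed /about")]
profiles = [("user1", {"age": 25}), ("user2", {"age": 31})]

def mapper():
    # Emit (join key, tagged record) pairs from both datasets.
    for user_id, event in activity:
        yield user_id, ("activity", event)
    for user_id, profile in profiles:
        yield user_id, ("profile", profile)

# The framework's shuffle: bring all values for the same key together.
grouped = defaultdict(list)
for key, value in sorted(mapper(), key=lambda kv: kv[0]):
    grouped[key].append(value)

def reducer(user_id, values):
    # Join: attach the user's profile to each of their activity events.
    profile = next(v for tag, v in values if tag == "profile")
    for tag, v in values:
        if tag == "activity":
            yield user_id, v, profile["age"]

for user_id, values in grouped.items():
    for joined in reducer(user_id, values):
        print(joined)
```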
14 / 17

Batch processing outputs

  • Building search indexes (think Google)
  • Producing a key-value store (think machine learning)
  • They follow the Unix philosophy of separating wiring from logic & encouraging experimentation
15 / 17

Batch Processing vs Massively Parallel Processing

  • Fault tolerance model (MapReduce can simply re-run an individual failed task)
  • MapReduce uses more memory
  • MapReduce is more flexible in cases where there is lots of application logic
16 / 17

Graph batch processing, declarative languages, etc!

  • Tarik ran out of time :(
17 / 17
