class: center, middle

# Designing Data Intensive Applications

## Chapter 10: Batch Processing

##### By Tarik Eshaq & Codepiolt

---
class: center, middle

# Batch processing

## Large input -> jobs -> output

---

# Batch processing using Unix tools

Simple log analysis: finding the 5 most popular articles!

```sh
cat /var/log/apache2/access.log | awk '{print $7}' | sort | uniq -c | sort -nr | head -5
```

- `cat`: Reads the file
- `|`: Pipes the output of the previous command into the next command
- `awk`: Splits each line into whitespace-separated fields
- `{print $7}`: Prints the 7th field (the requested URL)
- `sort`: Sorts the output of the previous command
- `uniq`: Removes adjacent duplicate lines
- `-c`: Counts the number of times each line appears
- `sort -nr`: Sorts numerically, in descending order
- `head -5`: Prints the first 5 lines of the output

It's flexible and easily extensible!

---
class: center, middle

# The Unix Philosophy

---

# The Unix Philosophy

- Make each program do one thing, and do it well
- Expect the output of every program to become the input to another program
- Design and build software to be tried early
- Use tools in preference to unskilled help, even if you have to build the tools and throw them away afterwards

---

# The Unix Interface

## It's file descriptors!

- They can point to actual files, stdin/stdout, TCP sockets, etc.!

## Logic and wiring are separate

- The program reads from stdin and writes to stdout; the caller decides where those point!

## Transparency and experimentation

- Input is immutable, so running the program multiple times doesn't change the input

---
class: center, middle

# MapReduce & Distributed File Systems

---

# MapReduce & Distributed File Systems

## Think Unix, but now across multiple machines!

| Unix | MapReduce |
| ---- | --------- |
| stdin/stdout | Files in a distributed file system |
| A process | A MapReduce job |
| Runs on one machine | Runs on multiple machines |
| The Unix interface | Mappers and reducers! |

---

# MapReduce

## The distributed file system is key

- It handles fault tolerance
- It allows storing a tremendous amount of data
- It expands the ability to parallelize

---

# MapReduce

## The MapReduce execution steps

Based on the simple log analysis example, the steps are:

1. MapReduce reads a set of input files and breaks each file into records.
2. Mappers read each record and emit key/value pairs. In this case, the key is the URL and the value is blank.
3. All the mapper output is sorted by key and written to temporary files.
4. Reducers read the temporary files, combine the values belonging to the same key, and emit key/value pairs. In this case, the key is the URL and the value is the number of times the URL was seen.

Steps 2 and 4 are custom code; everything else is handled by the MapReduce framework.

---

# MapReduce

## What's a mapper and a reducer?

- Mapper: Takes a record and emits any number of key/value pairs.
- Reducer: Takes a key and the list of all values with that key, and emits output records (often one per key).

---

# Distributed MapReduce

- Parallelization is achieved by partitioning: the input to a job is typically a directory, and each file in it is one partition.
- The framework tries to assign each mapper to a machine that holds a replica of its input partition (putting the computation near the data).
- The number of reduce partitions is chosen by the job author. The framework hashes each record's key to decide which reducer it goes to, so records with the same key always end up at the same reducer.
- Reducer machines merge the sorted outputs from the mappers (think merging k sorted lists!).

A minimal single-machine sketch of the whole map -> sort -> reduce pipeline follows on the next slide.
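---

# MapReduce

## The log-analysis job as a mapper and a reducer (sketch)

To make this concrete, here is a minimal single-machine sketch in plain Python (not Hadoop or any real framework) of the log-analysis job from earlier. The `mapper`, `reducer`, and `run_mapreduce` names and the sample log lines are invented for illustration; a real framework would spread this work across machines and handle the sorting and shuffling itself.

```python
from itertools import groupby
from operator import itemgetter

# Mapper: one log line in, zero or more (key, value) pairs out.
# The key is the requested URL (awk's $7, i.e. the 7th whitespace-separated
# field) and the value is just a count of 1.
def mapper(line):
    fields = line.split()
    if len(fields) > 6:
        yield fields[6], 1

# Reducer: one key plus all of its values in, output records out.
def reducer(url, counts):
    yield url, sum(counts)

# Toy "framework": map every record, sort by key (the shuffle), group records
# that share a key, and hand each group to the reducer.
def run_mapreduce(records):
    mapped = sorted(
        (pair for record in records for pair in mapper(record)),
        key=itemgetter(0),
    )
    for key, group in groupby(mapped, key=itemgetter(0)):
        yield from reducer(key, (value for _, value in group))

if __name__ == "__main__":
    sample_log = [
        '1.2.3.4 - - [27/Feb/2015:17:55:11 +0000] "GET /index.html HTTP/1.1" 200 3377',
        '1.2.3.4 - - [27/Feb/2015:17:55:12 +0000] "GET /about.html HTTP/1.1" 200 1024',
        '5.6.7.8 - - [27/Feb/2015:17:55:13 +0000] "GET /index.html HTTP/1.1" 200 3377',
    ]
    # Top 5 URLs by hit count, like `sort | uniq -c | sort -nr | head -5`.
    for url, hits in sorted(run_mapreduce(sample_log), key=itemgetter(1), reverse=True)[:5]:
        print(hits, url)
```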
---

# MapReduce Workflows

- MapReduce doesn't support workflows out of the box.
- Workflows can be built by making the output directory of one job the input directory of the next job.
- **NOT** the same as Unix pipes, where small in-memory buffers pass output from one process to the next.
- Each job can only start when the previous job has finished.

---

# MapReduce's Reduce-Side Joins

- Creating associations between related data
- Grouping data by the same key (for example, to collect all the data about a user's session)
- Handling skew in hot keys:
  - Split a hot key's records across several reducers
  - Use a second MapReduce job to combine the per-reducer results for each hot key
- Map-side joins are an alternative to reduce-side joins

A minimal join sketch appears on the last slide of this deck.

---

# Batch processing outputs

- Building search indexes (think Google)
- Producing key-value stores (think machine learning models)
- They follow the Unix philosophy of separating wiring from logic & encouraging experimentation

---

# Batch Processing vs Massively Parallel Processing

- Fault-tolerance model (MapReduce can simply re-run an individual failed task)
- Memory vs. disk: MPP databases keep as much as possible in memory, while MapReduce is eager to write intermediate data to disk
- MapReduce is more flexible when there is a lot of arbitrary application logic

---

# Graph batch processing, declarative languages, etc!

- Tarik ran out of time :(
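---

# MapReduce's Reduce-Side Joins

## A single-machine sketch

As promised a few slides back, here is a minimal sketch in plain Python (not Hadoop or any real framework) of a reduce-side, sort-merge join between two toy inputs: user profiles and activity events, both keyed by user ID. The record layouts, the tagging trick, and every name below are invented for illustration.

```python
from itertools import groupby
from operator import itemgetter

# Toy inputs: user profiles and activity events, both keyed by user ID.
users = [(17, {"name": "Ada"}), (42, {"name": "Grace"})]
events = [(42, "viewed /index.html"), (17, "viewed /about.html"), (42, "viewed /cart")]

# Each mapper tags its records with a source marker so the reducer can tell
# profiles and events apart. Tag 0 sorts before tag 1, so after the shuffle a
# user's profile reaches the reducer before that user's events.
def map_users(records):
    for user_id, profile in records:
        yield user_id, (0, profile)

def map_events(records):
    for user_id, event in records:
        yield user_id, (1, event)

# Reducer: sees every record for one user ID together and joins them.
def join_reducer(user_id, tagged_values):
    profile = None
    for tag, value in tagged_values:
        if tag == 0:
            profile = value                    # the user's profile record
        else:
            yield user_id, profile, value      # one joined row per event

# The "shuffle": sort all mapper output by (key, tag) so records with the same
# user ID are adjacent, profile first (a simple form of secondary sort).
mapped = sorted(list(map_users(users)) + list(map_events(events)),
                key=lambda kv: (kv[0], kv[1][0]))
for user_id, group in groupby(mapped, key=itemgetter(0)):
    for row in join_reducer(user_id, (tagged for _, tagged in group)):
        print(row)
```

Tagging each record with its source and sorting by (key, tag) is what guarantees the reducer sees a user's profile before that user's events, so the join can stream through the events without buffering them.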