class: center, middle # Designing Data Intensive Applications ## Chapter 2: Data Models and Query Languages ##### By Tarik Eshaq and Co-pilot 🐙 --- class: center, middle # Why Data Models matter 🐬 --- class: center, middle # Why Data Models matter 🐬 ## Abstractions all the way down! --- class: center, middle # Why Data Models matter 🐬 ## Application developers `model` the real world using data structures --- class: center, middle # Why Data Models matter 🐬 ## Application developers `model` the real world using data structures ### Data structures are `model`ed and stored using general-purpose data models (e.g., JSON, XML, CSV, etc.) --- class: center, middle # Why Data Models matter 🐬 ## Application developers `model` the real world using data structures ### Data structures are `model`ed and stored using general-purpose data models (e.g., JSON, XML, CSV, etc.) #### The database software you use stores those general purpose data models in terms of bytes in memory or on disk --- class: center, middle # Why Data Models matter 🐬 ## Application developers `model` the real world using data structures ### Data structures are `model`ed and stored using general-purpose data models (e.g., JSON, XML, CSV, etc.) #### The database software you use stores those general purpose data models in terms of bytes in memory or on disk ##### The hardware represents those bytes in terms of electrical currents and pulses of light --- class: center, middle # So many models nowadays 📷 💃 🕺 --- class: center, middle # So many models nowadays 📷 💃 🕺 ## So pick the right model for the job --- class: center, middle # To SQL or not to SQL 🐢 ## The Relational Model --- class: center, middle # The Relational Model ## How it works --- class: center, middle # The Relational Model ## How it works - Data is organized into `relations` or `tables` - Each relation is an unordered collection of `tuples` --- class: center, middle # The Relational Model ## Example Relational table ### The people table | id | name | age | |----|:------:|-----:| | 1 | Tarik | 21 | | 2 | Miguel | 5 | | 3 | Ethan | 69 | --- class: center, middle # The Relational Model ## Example Relational table ### The people table | id | name | age | |----|:------:|-----:| | 1 | Tarik | 21 | | 2 | Miguel | 5 | | 3 | Ethan | 69 | ### The friends table |friend1| friend2| |------:|:------:| | 1 | 2 | | 1 | 3 | | 2 | 3 | --- class: center, middle # The Relational Model ## Lets write a query to find out Ethan's friends ```sql SELECT name FROM people WHERE id IN (SELECT friend2 FROM friends WHERE friend1 = 3 or friend2 = 3)); ``` --- class: center, middle # NoSql 📦 --- class: center, middle # The Document Model ## Motivation --- # The Document Model ## Motivation - Need for greater scalability that relational databases can **easily** provide, when it comes to very high write throughput - Free and open source software FTW - Specialized queries that SQL may not support well - SQL is pretty darn restrictive --- class: center, middle # The Document Model ## What it is - A document is a collection of key-value pairs - Documents are stored in a collection - Documents are identified by a unique key or ID --- class: center, middle # The one-to-many problem 🤔 --- # The one-to-many problem 🤔 - SQL kinda craps out when it comes to one-to-many relationships - You have one of the following three options: - Store the one-to-many relationship in a separate table - Use structured data types to represent the one-to-many relationship (e.g., JSON, XML, CSV, etc.) - Use a structured datatype and encode it into text --- # The one-to-many problem ## SQL <center> | id | name | location | |----|:----:|----------| | 1 | Ethan | New York | </center> ________ <center> | id | job_id | job_title | |----|:------:|:--------:| | 1 | 1 | Software Engineer at Hootsuite | | 1 | 2 | Software Engineer at PDT | | 1 | 3 | Floor mopper at The Miguel Company | --- # The one-to-many problem ## NoSQL ```json { "id": 1, "name": "Ethan", "jobs": [ { "id": 1, "title": "Software Engineer at Hootsuite" }, { "id": 2, "title": "Software Engineer at PDT" }, { "id": 3, "title": "Floor mopper at The Miguel Company" } ], "location": "New York" } ``` --- class: center, middle # EVERYTHING GETS A TABLE 💻 <center> |id | name | location | |----|:----:|----------| | 1 | Ethan | New York | | 2 | Miguel | New York | </center> ______ What if The city of New York was renamed? --- class: center, middle # EVERYTHING GETS A TABLE 💻 |id | name | location | |----|:----:|----------| | 1 | Ethan | 314141414 | | 2 | Miguel | 314141414 | ______ | location_id | location_name | |:------------|:------------:| | 314141414 | Goblins | --- # NoSQL trying to join the cool table <table> <tr> <th> People </th> <th> Locations </th> </tr> <tr> <td> ```json { "id": 1, "name": "Ethan", "jobs": [ { "id": 1, "title": "Software Engineer at Hootsuite" }, { "id": 2, "title": "Software Engineer at PDT" }, { "id": 3, "title": "Floor mopper at The Miguel Company" } ], "location": 314141414 }, { "id": 2, "name": "Miguel", "jobs": [], "location": 314141414 } ``` </td> <td> ```c++ { "location_id": 314141414, "location_name": "Goblins", } ``` </td> </tr> </table> f0 --- # NoSQL joining the table ```javascript // Joins the people table with the locations table // to find out where Ethan and Miguel live db.people.find({ $or: [ { id: 1 }, { id: 2 } ] }, { _id: 0, name: 1, location: 1 }).forEach(function(person) { db.locations.findOne({ location_id: person.location }, { _id: 0, location_name: 1 }, function(location) { console.log(person.name + " lives in " + location.location_name); }); }); ``` ```bash $ node index.js Ethan lives in Goblins Miguel lives in Goblins ``` --- class: center, middle # NoSQL vs SQL - Fight! 🤜 🤛 --- # NoSQL vs SQL - Fight! 🤜 🤛 - Pros for Document Database: - Schema flexibility - Better performance due to locality - Closer to the data structures the used by the application --- # NoSQL vs SQL - Fight! 🤜 🤛 - Pros for Document Database: - Schema flexibility - Better performance due to locality - Closer to the data structures the used by the application - Pros for Relational Database: - Better support for joins - Many-to-many and many-to-one relationships are better supported --- # NoSQL vs SQL - Fight! 🤜 🤛 ## What data model leads to simpler application code? - If your data has a document like structure (with one-to-many relationships), then most likely document database! - If your data has a relational structure (with many-to-many relationships), then most likely relational database! --- # NoSQL vs SQL - Fight! 🤜 🤛 ## Schema Flexibility - Document databases are `schema-on-read`: - The data assumes the structure of the schema, when it is read - The schema is not enforced - Relational databases are `schema-on-write`: - The schema is enforced - The data is assumed to have the structure of the schema when it is written to the database --- # NoSQL vs SQL - Fight! 🤜 🤛 ## Schema Flexibility ### How to change the schema - SQL: ```sql --- Migrate the people table to have a first and last name --- Then move the split the old name column into first and last name columns ALTER TABLE people ADD COLUMN first_name VARCHAR(255) ADD COLUMN last_name VARCHAR(255); --- Split the name column into first and last name columns UPDATE people SET first_name = SUBSTRING(name, 1, LOCATE(' ', name) - 1), last_name = SUBSTRING(name, LOCATE(' ', name) + 1); --- Drop the old name column ALTER TABLE people DROP COLUMN name; ``` --- # NoSQL vs SQL - Fight! 🤜 🤛 ## Schema Flexibility ### How to change the schema - NoSQL: ```js // If the user doesn't have a `first_name` or `last_name` field, // then we can add them by splitting the `name` field if (!person.first_name || !person.last_name) { person.first_name = person.name.split(' ')[0]; person.last_name = person.name.split(' ')[1]; } // Then we can assume the user has a `first_name` and `last_name` field... ``` --- # NoSQL vs SQL - Fight! 🤜 🤛 ## Schema Flexibility ### And what if not all documents have the same structure? 😱 --- # NoSQL vs SQL - Fight! 🤜 🤛 ## Data locality - In the Document model, all the data is stored in the same document, so there is an advantage of loading all of it up - In the Relational model, the data could be split across multiple tables, requiring multiple index lookups - That said, newer technologies have done wonders.. --- class: center, middle # NoSQL and SQL - Unite! 🤝 - SQL started supporting structured data types, like `JSON` and `XML` - MongoDB and others started supporting native joins.. --- class: center, middle # Query Languages and Abstractions --- # Declarative vs Imperative Query Languages - Declarative languages hide the implementation details of the query, and describe only the data that it needs - Imperative languages describe the steps needed to retrieve the data --- # Declarative vs Imperative Query Languages #### SQL ```sql --- We join the location table to get the people --- who live in the Goblins SELECT people.first_name FROM people JOIN locations ON people.location = locations.location_id WHERE locations.location_name = 'Goblins'; ``` #### JavaScript ```javascript // we have to look at the people array and // location array to get the people that live in Goblins const result = []; for (const person of people) { for (const location of locations) { if (person.location === location.location_id && location.location_name === 'Goblins') { result.push(person.first_name); } } } ``` --- # Declarative vs Imperative Query Languages ## MapReduce - We'll talk about MapReduce later (psst Batch Processing!) - It's not either of the two, but it's a nice abstraction. - Consumers write two functions, a map function and a reduce function. The map function is called once for each document in the collection. The reduce function is called once for each key in the collection. --- # Declarative vs Imperative Query Languages ```js // MongoDB mapReduce example // to emit the number of people in each location const res = db.people.mapReduce( // The map function function() { emit(this.location, 1); }, // The reduce function function(key, values) { return Array.sum(values); }, { out: 'people_by_location' } ); // output the results res.find().forEach(function(doc) { const location = db.locations.findOne({ location_id: doc.key }); console.log(doc.value + " people live in " + location.location_name); }); ``` ```zsh $ node index.js 2 people live in Goblins ``` --- class: center, middle # ENTER GRAPH DATA MODEL 📈 📈 📈 --- # ENTER GRAPH DATA MODEL 📈 📈 📈 - Graph data models are a way to represent relationships between data - Graph data models are great for modeling very connected data - In graph data models, relationships are represented by edges and nodes represent entities - Graph data models are often used to represent social networks --- # ENTER GRAPH DATA MODEL 📈 📈 📈 ```javascript // dgraph query to get the people and their locations (Looks like GraphQL!) const query = ` query { people(func: has(location)) { name location { location_name } } } `; const res = await dgraph.newTxn().query(query); console.log(res); ``` --- # ENTER GRAPH DATA MODEL 📈 📈 📈 ```bash $ node index.js { "people": [ { "name": "Ethan", "location": { "location_name": "Goblins" } }, { "name": "Miguel", "location": { "location_name": "Goblins" } } ] } ``` --- # Summary - Deciding on a data model is a big decision - It's important to understand the data models you can choose from - There are a few models that are great for different types of data - There are many many models out there, the following are the most common nowadays - Relational data models - Document data models - Graph data models - The query language matters too, and there are two main types: - Declarative query languages - Imperative query languages --- # PS I didn't cover those: - Network data models - The history of it all - RDF data models - SPARQL - Cypher query languages - Datalog and functional programming (almost)