At Cal, I served as a teaching assistant for CS 186: Databases. Here, I present a high-level overview of what we learn in the class. I link all my supplementary teaching resources at the bottom of this page.
What we learn in this class
As succinctly as possible, here’s my narrative of the class.
Phase 0: Motivations. While Professor Joe Hellerstein introduces the concept of data ubiquity and the ethical problems that follow it, my take is more practical. As humans in an increasingly data-driven world, we should at least understand how to communicate with the systems that represent us in digital form. In this class, we begin our journey with SQL, a declarative language that lets users read from and write to relational databases.
Phase 1: Database Internals. At this point, relational databases should still feel like a black box to us. How is our data stored? How does a database retrieve our data? Thus, we build a database from scratch with records, pages, and files on a bare-metal disk. To improve read performance, we add an index over our data. To improve write performance, we rely on the buffer manager to send changes to disk.
Phase 2: Database Operations. We then revisit the individual operations in our SQL queries from phase 0. How do we sort our data in a query with SORT BY
? How do we hash our data in a query with GROUP BY
? How do we join tables (e.g., __ JOIN __ ON
)? One strategy that repeatedly manifests in this section is the divide-and-conquer algorithm.
How do we consider all these operations and organize them such that the query executes as efficiently as possible? In this chapter, we come to understand how the lower-level internals from phase 1 (e.g., buffer managers and indices) play an immense role in computing the operations in phase 2.
Phase 3: Transactions. The second half of the course revolves around the transaction, a sequence of operations that act as one logical unit. For example, I want to send $50 to my sister, Tabitha. A database considers this one transaction that consists of:
- Op 1: Checking if I have more than $50 in my bank account
- Op 2: Subtracting $50 from my bank account
- Op 3: Adding $50 to Tabitha’s bank account
Businesses want to guarantee four main properties in a given transaction. We denote these properties with a funny acronym, “ACID”:
- Atomicity means all operations in a transaction happen, or none of them do. In an RDBMS, we guarantee atomicity with undo and redo mechanisms
- Consistency says that all defined behaviors and constraints are maintained. We can design and enforce these rules using entity-relation diagrams
- Isolation says that a set of transactions should appear to run in serial, even if we run them concurrently. Ultimately, achieving isolation boils down to a scheduling problem
- Durability means once a transaction completes, there’s no turning back––even when machines fail. We can guarantee durability using a fault-tolerant recovery mechanism called ARIES
Phase 4: Networks and Parallelism. In a world where data to a business is like water to a human, speed and scale rise above everything else. For machines that offer multiple concurrent threads for execution, how can we complete a query operation from phase 2 in parallel (speed)? For businesses operating on massive amounts of data every second, how can we coordinate decisions across multiple machines (scale)? The recovery mechanisms and data structures that we learned in phase 3 provide core assumptions to the consensus protocols in phase 4.
List of resources by topic
Discussion | Resources |
---|---|
Query a database with SQL | Slides (pdf) (recording) |
Organize your data in records, pages, and files | Slides (pdf) (recording) |
Improve read performance with a B+ tree index | Slides (pdf) |
Choose optimal page replacement policies; Sort pedabytes of data | Slides (pdf) |
Hash pedabytes of data | Slides (pdf) |
Understand join algorithms | Slides (pdf) |
Implement your own query optimizer | Slides (pdf) |
Enable concurrent transaction schedules | Slides (pdf) |
Execute a transaction in parallel | Slides |
Recover from machine failure | Slides |
Scaling up with distributed transactions (my favorite topic) | Slides (recording) |
Design your database: entity-relationship diagrams | Slides (recording) |