“Intro to Web Dev” Course
DB Part I

Instructors
AbdelAziz Allam abd.ibrahim.allam@gmail.com Solutions Development Manager, 15 years of experience in the software industry.
Mahmoud Shahin Mahmoudshahin.it@gmail.com Principal Solutions Architect, 15 years of experience in the software industry.

Module 04
DB Part I - The concept. CAP. DB 360 view. DB internals.

Content
• Reminders
• The concept
• DB 360 view
• DB internals
Key outcomes:
• Difference between DB types.
• Internal architecture.

Reminders

Reminder 01 - Scaling
Horizontal scaling (scaling out): adding more instances.
Vertical scaling (scaling up): adding more hardware resources, or moving the app to a larger, more powerful machine.

Reminder 02 – Sharding/Partitioning
Sharding involves splitting and distributing data/events across multiple servers. Shards are stored on multiple machines. This allows larger datasets to be split into smaller chunks and stored across multiple data nodes.

Reminder 03 - Replication
Sample for a simple use case
• Replication increases read performance, either through load balancing or through geo-located query routing.
• Replication increases availability.
• Replication introduces complexity on write-focused workloads, as each write must be copied to every (or some) replicated node(s).

Reminder 03 - Replication
Sample for another use case – RF = 3
Replication factor describes how many copies of your data exist. With a replication factor of 3, for example, when you write, two copies will always be stored, assuming enough nodes are up. When a node is down, writes destined for that node are stashed away and written when it comes back up. [#homework: do a research until we take it during the course!]

Consistency
• Nodes will have the same copies of a replicated data item.
• The same response is given to all identical requests.
• Accuracy, completeness, correctness and reliability of data.
• All reads receive the most recent write.
• A guarantee that every node in a distributed cluster returns the same (most recent) value.

Distributed System Note
A distributed system is a system with multiple components located on different machines that communicate and coordinate actions in order to appear as a single logical system to the end-user.

CAP theorem
Nobody gets everything!
• In distributed systems, data is stored across multiple nodes or servers to ensure fault tolerance, scalability, and reliability.
• The theorem states that in a distributed system, you can achieve at most two out of the three key properties: Consistency, Availability, and Partition Tolerance.

CAP Overview I
It gives insights into the trade-offs that system architects must make when designing distributed systems.
References:
• https://www.ibm.com/topics/cap-theorem
• https://www.geeksforgeeks.org/the-cap-theorem-in-dbms/
• https://medium.com/@gurpreet.singh_89/understanding-the-cap-theorem-consistency-availability-and-partition-tolerance-e7faa5103638

CAP Overview II
Consistency
• Nodes will have the same copies of a replicated data item visible for various transactions.
• A guarantee that every node in a distributed cluster returns the same, most recent, successful write, no matter which node the client connects to.
• Whenever data is written to one node, it must be instantly forwarded or replicated to all the other nodes in the system before the write is deemed ‘successful.’
Availability
• Any client making a request for data gets a response, even if one or more nodes are down.
• All working nodes in the distributed system return a valid response for any request.
Partition tolerance
• The system can continue operating even if the network connecting the nodes has a fault that results in two or more partitions.
• Partition tolerance means that the cluster must continue to work despite any number of communication breakdowns between nodes in the system.
References:
• https://www.ibm.com/topics/cap-theorem

CAP Overview III
CA
• A system that prioritizes Consistency and Availability (CA) aims to provide strong consistency and high availability.
• However, in this scenario, the system might need to sacrifice Partition Tolerance. When a network partition occurs, the system might become unavailable or operate in a limited capacity to ensure data consistency.
CP
• A system that emphasizes Consistency and Partition Tolerance (CP) focuses on maintaining strong consistency even in the presence of network partitions.
• This approach might lead to reduced availability during partitioned scenarios, as some nodes might not be reachable (e.g., banking operations).
AP
• A system that values Availability and Partition Tolerance (AP) aims to remain operational despite network partitions, prioritizing high availability.
• In this case, data consistency might be compromised, as different nodes could have varying data states during partitioned periods (e.g., a new post in social media, comments, etc.).
References:
• https://medium.com/@gurpreet.singh_89/understanding-the-cap-theorem-consistency-availability-and-partition-tolerance-e7faa5103638

CAP Overview IV

Yalla ne3mil DB (let's build a DB)

DB Overview I
The primary job of any database management system is reliably storing data and making it available for users. We use databases as a primary source of data, helping us to share it between the different parts of our applications.

DB Overview II
Create, update, delete, and retrieve records. Database management systems are applications built on top of storage engines, offering a query language, indexing, transactions, and many other useful features.
Every database system has strengths and weaknesses. You can invest some time before you decide on a specific database, to build confidence in its ability to meet your application’s needs.
Your choice of database system may have long-term consequences. If there’s a chance that a database is not a good fit because of performance problems, consistency issues, or operational challenges, it is better to find out early in the development cycle, since it can be nontrivial to migrate to a different system.

Yalla ne3mil new DBMS (let's build a new DBMS) 01
Data Model & Storage
Decide on the data model your DBMS will support: relational, document-oriented, column store, key-value, graph, etc.
Determine how data will be stored: on disk, in memory, or a combination. Design the storage format and consider the trade-offs between different storage mechanisms (row-based, column-based, etc.).

Yalla ne3mil new DBMS 02
Storage Engine
Select the storage engine you plan to build on top of. Understand its APIs, data structures, and how it interacts with storage (disk/memory).
The DBMS can use the storage engine's features (replication, isolation, ACID, etc.).
Integrate your higher-level database engine logic with the underlying storage engine, ensuring seamless interaction and data retrieval/storage.
Yalla ne3mil new DBMS 03
Data Structure & Query Processing
Define how data will be represented, how queries will be processed, and what functionalities will be provided (e.g., indexing, transactions).
Develop the logic and algorithms for managing data structures (e.g., B-trees, hash maps) and processing queries. This involves parsing queries, optimizing them, and executing them against the storage engine.
Manage parsers, optimizers, and query executors to efficiently process and retrieve data based on user queries.

Yalla ne3mil new DBMS 04
Concurrency Control and Transactions
Implement mechanisms for concurrency control. Ensure that multiple users accessing the database concurrently maintain data consistency and integrity.
Manage concurrent access to data and ensure transactional consistency (ACID properties) when multiple users interact with the database simultaneously. Ensure that multiple users accessing the same data simultaneously do not interfere with each other's transactions.
If the underlying storage engine supports transactions, ensure proper handling and support in your database engine layer.

Yalla ne3mil new DBMS 05
Buffer Pool / Caching
Allocate a pool of memory (the “shared” or “buffer” pool). Pages read/fetched from disk are placed in the buffer pool.
It is a dedicated area in memory where the database engine temporarily stores frequently accessed data pages from disk. A buffer pool is a critical component responsible for managing and caching data pages in memory.
Keeping pages in memory enables quicker access to frequently accessed data, reducing the need to repeatedly read from slower disk storage.

Yalla ne3mil new DBMS 06
Error Handling and Logging
Implement error handling mechanisms and logging functionality to ensure proper handling of errors and to maintain a log for debugging and recovery purposes.

Yalla ne3mil new DBMS 07
Security and Access Control
Ensure robust security measures, including authentication, authorization, and encryption, to protect sensitive data stored in the database.

Yalla ne3mil new DBMS 08
Scalability
Design the system to scale horizontally or vertically as data volume increases. Consider optimizations for performance, such as caching, parallel processing, and query optimization.
The first step can be picking appropriate storage engine(s). Think of this checklist as a rough interface for anyone who wants to implement a DBMS (not literally an interface, just a simplification). We definitely won’t re-invent the wheel; we just want to get you thinking about it!!

DB 360 view

Major Categories
• Some sources group DBMSs into three major categories:
• Online transaction processing (OLTP) databases.
• Online analytical processing (OLAP) databases.
• Hybrid transactional and analytical processing (HTAP) databases.

OLTP
• These handle a large number of user-facing requests and transactions. Queries are often predefined and short-lived.
• High-speed data processing and rapid transaction execution in real time.
• Most relational databases are OLTP databases. They organize data in tables consisting of rows and columns. Both MongoDB and Cassandra are also OLTP.
• Use cases: e-commerce, online banking, bookings, inventory management, and more.

OLAP
• These handle complex aggregations.
• OLAP databases are often used for analytics and data warehousing, and are capable of handling complex, long-running ad hoc queries.
• Hadoop and MapReduce are good examples.

HTAP
• These databases combine properties of both OLTP and OLAP stores (“breaking the wall” between OLTP and OLAP).
• Hybrid transaction/analytics processing combines transactions, such as updating a database, with analytics, such as finding likely sales prospects.
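To make the OLTP/OLAP contrast concrete before the comparison that follows, here is a minimal SQL sketch; the orders table and all column names are hypothetical, invented for illustration only:

  -- OLTP style: short, predefined, touches one row by primary key.
  SELECT status, total_amount
  FROM orders
  WHERE order_id = 42;

  -- OLAP style: long-running ad hoc aggregation over many rows.
  SELECT customer_country,
         DATE_FORMAT(created_at, '%Y-%m') AS month,
         SUM(total_amount) AS revenue
  FROM orders
  GROUP BY customer_country, month
  ORDER BY revenue DESC;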
OLTP & OLAP
• OLTP is more of an online transactional system or data storage system, where users perform lots of online transactions against the data store, with ad hoc reads/writes happening on a real-time basis.
• In OLTP, the number of writes is comparatively small, e.g., hotel information: there can be 1 write per second while reads reach hundreds or thousands, so the ratio can be around 1:1000.
• OLAP is more of an offline data store, accessed a number of times in an offline fashion. For example, bulk log files are read and then written back to data files. Common areas where OLAP is used are log jobs, data mining jobs, etc.
• In OLAP, several writes happen simultaneously: we dump data in one shot, i.e., all log files are put into the data store and then we start processing. The data/access pattern is exactly the opposite of an OLTP kind of application. Here, Hadoop or MapReduce is useful.
References:
• https://www.edureka.co/blog/oltp-vs-olap/#:~:text=Some%20of%20the%20common%20areas,for%20analytics%20and%20bulk%20writes.

DB engines classifications 01 (sample)
• Relational: Postgres, Oracle, MySQL, Microsoft SQL Server, MariaDB, Amazon Aurora.
• Non-relational: MongoDB, Cassandra, Couchbase, Azure Cosmos DB, Amazon DynamoDB, Elasticsearch, Neo4j.
• Caching: Redis, Memcached.
• Timeseries: Prometheus, InfluxDB.

DB engines classifications 02 (sample)
• Relational databases: Oracle, MySQL, Microsoft SQL Server, MariaDB, Amazon Aurora.
• Key-value stores: Cassandra, Redis.
• Document-oriented stores: MongoDB, Couchbase.
• Graph databases: Neo4j.

Latency numbers
References:
• https://blog.bytebytego.com/p/ep22-latency-numbers-you-should-know

Memory VS Disk
• In-memory database management systems: store data primarily in memory and use the disk for recovery and logging.
• Disk-based DBMS: hold most of the data on disk and use memory for caching disk contents or as temporary storage.
Both types of systems use the disk to a certain extent, but main-memory databases store their contents almost exclusively in RAM. Accessing memory is also faster than accessing disk.

Row VS Column stores/DBs
• Tables can be partitioned either vertically (storing values belonging to the same column together; (a) shows the values partitioned column-wise) or horizontally (storing values belonging to the same row together; (b) shows the values partitioned row-wise).
• Row stores: store data in records or rows. Their layout is quite close to the tabular data representation, and every row has the same set of fields.
• Column stores: instead of storing data in rows, values for the same column are stored contiguously on disk.

CRUD
• C: Create
• R: Read
• U: Update
• D: Delete

CRUD & HTTP Methods
• Create: POST
• Read: GET
• Update: PUT / PATCH
• Delete: DELETE

DDL & DML
Data Definition Language (DDL)
• Create and modify the structure of database objects in the database.
• Create, modify, and delete database structures, but not data.
Data Manipulation Language (DML)
• INSERT, UPDATE, DELETE.
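A minimal SQL sketch tying DDL, DML, and the CRUD-to-HTTP mapping together; the students table and all names in it are hypothetical:

  -- DDL: defines structure, not data.
  CREATE TABLE students (
    id   INT PRIMARY KEY,
    name VARCHAR(100) NOT NULL
  );
  ALTER TABLE students ADD COLUMN email VARCHAR(255);

  -- DML: manipulates the data inside that structure (maps to CRUD).
  INSERT INTO students (id, name, email)
    VALUES (1, 'Sara', 'sara@example.com');                    -- Create (HTTP POST)
  SELECT * FROM students WHERE id = 1;                         -- Read   (HTTP GET)
  UPDATE students SET email = 'new@example.com' WHERE id = 1;  -- Update (HTTP PUT/PATCH)
  DELETE FROM students WHERE id = 1;                           -- Delete (HTTP DELETE)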
SQL & NoSQL Reminder
SQL
• Data is organized into tables, which are made up of rows and columns. Each row represents a record, while each column represents a data attribute.
• Handles structured data and complex relationships between tables using foreign keys.
• Examples: MySQL, PostgreSQL, and Oracle.
NoSQL
• Offers a schema-less approach, allowing for easy adaptation to changing data requirements.
• Can handle unstructured or semi-structured data, making it ideal for big data applications and real-time data processing.
• Examples: MongoDB, Cassandra, and Redis.
References:
• https://medium.com/@venkatramankannantech/a-comprehensive-guide-to-database-internals-37c8d9ed2407

DBMS components

DB Internals Part I
Storage Engines. ACID. Transactions.

DBMS main parts
Databases are modular systems and consist of multiple parts:
• A transport layer accepting requests.
• A query processor determining the most efficient way to run queries.
• An execution engine carrying out the operations.
• A storage engine storing, retrieving, and managing data in memory and on disk.
References:
• Database Internals book – O'Reilly

DBMS main parts (summarized)
Transport layer. Query processor. Execution engine. Storage engine.
References:
• Database Internals book – O'Reilly

Overview
Database management systems use a client/server model, where database system instances (nodes) take the role of servers, and application instances take the role of clients.
Client requests arrive through the transport subsystem. Requests come in the form of queries, most often expressed in some query language. The transport subsystem is also responsible for communication with other nodes in the database cluster.
Upon receipt, the transport subsystem hands the query over to a query processor, which parses, interprets, and validates it. The parsed query is passed to the query optimizer, which first eliminates impossible and redundant parts of the query, and then attempts to find the most efficient way to execute it based on internal statistics.

Overview
The query is usually presented in the form of an execution plan (or query plan): a sequence of operations that have to be carried out for its results to be considered complete. Since the same query can be satisfied using different execution plans that can vary in efficiency, the optimizer picks the best available plan. The execution plan is carried out by the execution engine.

DBMS sample workflow
In more detail: the optimizer bases its decisions on internal statistics (index cardinality, approximate intersection size, etc.) and data placement (which nodes in the cluster hold the data, and the costs associated with its transfer). It handles the operations required for query resolution, usually presented as a dependency tree, and optimizations such as index ordering, cardinality estimation, and choosing access methods. The execution plan is then carried out by the execution engine, which aggregates the results of local and remote operations.
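You can ask MySQL to show you the optimizer's chosen plan yourself. A minimal sketch, reusing the hypothetical students table from the earlier example (EXPLAIN ANALYZE requires MySQL 8.0.18+):

  -- Ask the optimizer how it would execute the query:
  -- access type, candidate indexes, estimated rows, join order.
  EXPLAIN SELECT name FROM students WHERE email = 'sara@example.com';

  -- MySQL 8.0.18+ can also run the query and report measured costs:
  EXPLAIN ANALYZE SELECT name FROM students WHERE email = 'sara@example.com';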
DB Storage Engine
The storage engine has several components with dedicated responsibilities:
• Transaction Manager: schedules transactions and ensures they cannot leave the database in a logically inconsistent state.
• Access Methods (storage structures): manage access to, and organization of, data on disk. Access methods include heap files and storage structures such as B-Trees.
• Buffer Manager: caches data pages in memory.
• Lock Manager: acquires locks on database objects for the running transactions, ensuring that concurrent operations do not violate physical data integrity.
• Recovery Manager: maintains the operation log and restores the system state in case of a failure.

Storage Engines
• Storage engines are sometimes called embedded databases.
• They are software libraries that DBMSs use to do low-level storing of data on disk.
• The storage engine is a crucial component responsible for managing how data is stored, retrieved, and manipulated within a database management system (DBMS).
• It essentially handles the low-level details of how data is stored on disk or in memory.
• Storage engines such as BerkeleyDB, LevelDB, RocksDB, LMDB, libmdbx, Sophia, HaloDB, InnoDB, MyISAM and Aria, among many others, were developed independently from the database management systems.
• Using pluggable storage engines has enabled database developers to bootstrap database systems using existing storage engines and concentrate on the other subsystems.
References:
• Database Internals book – O'Reilly

DBMS & Storage Engines – high-level view
• MongoDB → WiredTiger: provides concurrency control, compression, and support for both read- and write-heavy workloads.
• Cassandra → RocksDB: optimized for fast storage; a key-value store designed for fast retrieval. It can handle write-heavy workloads and scenarios requiring high-speed ingestion of data.
• MySQL / MariaDB → InnoDB: ACID compliance, transaction support, row-level locking. A reliable choice for general-purpose use; it offers support for transactions, crash recovery, and foreign keys, making it suitable for OLTP (Online Transaction Processing) workloads.
• Oracle → BerkeleyDB.
• Couchbase → Couchbase: a distributed NoSQL document database designed for high-performance read and write operations.

DBMS & Multiple Storage Engines Example
Several database management systems (DBMS) offer support for multiple storage engines, allowing users to choose different underlying mechanisms for storing and managing data based on their specific needs.
• Couchbase: allows users to choose among storage engines (ForestDB, Couchstore, and Magma). These engines have different characteristics related to performance, scalability, and features, allowing users to optimize for specific requirements.
• MariaDB & MySQL: both support multiple storage engines. MySQL historically offered various engines such as InnoDB (transactional), MyISAM (non-transactional), Memory, etc. MariaDB continued this support and introduced additional engines like Aria and TokuDB.
References:
• https://mariadb.com/docs/server/storage-engines/
• https://mariadb.com/kb/en/tokudb/
• https://docs.couchbase.com/cloud/clusters/data-service/storage-engines.html
• https://docs.couchbase.com/server/current/learn/buckets-memory-and-storage/storageengines.html#:~:text=Couchbase%20supports%20two%20different%20backend,best%20suited%20to%20your%20requirements.

DBMS & Storage Engines – high-level view
• Postgres → #Homework: do a research.
• Microsoft SQL Server → #Homework: do a research.

MyISAM VS InnoDB
Do a research about the difference between them.
• Don't forget to highlight why the default engine of MySQL changed from MyISAM to InnoDB.
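To support this research, a minimal MySQL sketch, assuming your MySQL build still ships MyISAM alongside InnoDB:

  -- List the storage engines this MySQL build supports (and the default):
  SHOW ENGINES;

  -- The engine is chosen per table:
  CREATE TABLE t_myisam (id INT PRIMARY KEY, v TEXT) ENGINE = MyISAM;
  CREATE TABLE t_innodb (id INT PRIMARY KEY, v TEXT) ENGINE = InnoDB;

  -- Switching an existing table between engines (rebuilds the table):
  ALTER TABLE t_myisam ENGINE = InnoDB;

Try transactions and concurrent writes against both tables and observe how they differ.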
Try both practically in MySQL
• Use Docker for quickly spinning up MySQL:
  docker run --name training-mysql -p 13306:3306 -e MYSQL_ROOT_PASSWORD=mysqlpwd mysql

MySQL assignment hints 01–04 (screenshots)

DBs internals – Storage Engine – Data structure
• Storage Engine:
  • Manages how data is stored, accessed, and manipulated internally.
  • It is responsible for:
    • Organizing data on disk.
    • Handling queries.
    • Ensuring data integrity.
• Each storage engine has its own way of storing and retrieving data, utilizing different algorithms and data structures.
  • Some use B-trees.
  • Others might use hash tables or other specialized data structures.

DBs internals – Storage Engine – Indexing
• Storage engines manage indexes.
• An index is important for fast data retrieval.
• Indexes organize data in a way that allows the database to locate information efficiently.
• An index can be organized using B-Trees, hash indexes, etc.

DBs internals – Storage – Concurrency control
• Storage engines implement mechanisms to handle multiple users accessing the DB simultaneously.
• They ensure data integrity by managing locks, transactions, and isolation levels.
• They provide support for transactions with ACID properties (Atomicity, Consistency, Isolation, Durability).
Do a research about optimistic VS pessimistic concurrency control and include it in your presentation.

DBs internals – Storage – Caching & buffer
• Storage engines often use caching mechanisms to improve performance.
• They store frequently accessed data in memory buffers, reducing the need to fetch data from disk repeatedly.

DBs internals – Storage – Optimization
• They optimize query execution by deciding how to retrieve and manipulate data efficiently.
• This includes query parsing, optimization, and execution plans.

DBs internals – Storage – File management
• They handle:
  • How data is stored in files on the disk.
  • Allocating space.
  • Managing reads and writes efficiently.

Read this URL and see how they think:
https://www.couchbase.com/blog/magma-next-gen-document-storage-engine/

Ensure you understand the concept of a “pluggable storage engine”
• What is the benefit of it?
• How do you switch between them (if needed)?

Do a research about real-world examples and which DB engines they use!
For each real-world product (Discord, X (AKA Twitter), Meta (AKA Facebook), Instagram, Spotify, Netflix, TikTok, eBay, Walmart, Airbnb, Uber), fill in: used DB engines, why, benefits, drawbacks.

Instructors’ expectations
While doing the previous slide’s assignment:
• If you found yourself checking why some product switched from DB-X to DB-Y, then you are moving in the right direction.
• https://discord.com/blog/how-discord-stores-billions-of-messages
• https://hackernoon.com/discord-went-from-mongodb-to-cassandra-then-scylladb-why (Discord's journey from MongoDB)
• https://www.uber.com/en-SA/blog/postgres-to-mysql-migration/
Read the above links if you did not hit them before!!!

Transactions
A transaction is a sequence of multiple operations performed on a database, all served as a single logical unit of work, taking place wholly or not at all. There is never a case where only half/part of the operations are performed and the results saved.
References:
• https://fauna.com/blog/database-transaction
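A minimal SQL sketch of a transaction as one unit of work; the accounts table is hypothetical, and the same money-transfer flow is walked through on the next slide:

  -- Hypothetical accounts table: (id, balance).
  START TRANSACTION;   -- BEGIN also works in MySQL
  UPDATE accounts SET balance = balance - 100 WHERE id = 'A';
  UPDATE accounts SET balance = balance + 100 WHERE id = 'B';
  COMMIT;              -- both changes become visible together, or...
  -- ROLLBACK;         -- ...if anything failed in between, undo everything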
Begin → Commit / Rollback
A transaction is a collection of queries treated as one unit of work: it starts with BEGIN and ends with either COMMIT or ROLLBACK.

Example of a transaction in action
Consider a banking app where a user wishes to transfer funds from one account to another. The operation’s transaction might look like this (transferring money from BankAccount-A to BankAccount-B):
• BEGIN TRANSACTION.
• Deduct the transfer amount from the source account (BankAccount-A).
• Add the transfer amount to the destination account (BankAccount-B).
• COMMIT – updating the record of the transaction carried out by the customer.
The transaction is rolled back, and the database is restored to its initial state, if any of its operations fail, e.g., if something went wrong when adding the balance to BankAccount-B (the DB went down, as an example).

ACID
ACID is an acronym for the properties used to indicate that the database engine provides atomicity, consistency, isolation, and durability.
• Atomicity: means that a complex database operation is processed as a single instruction even when it is made up of different operations.
• Consistency: means that the data within the database is always kept consistent and that it is not corrupted due to partially performed operations.
• Isolation: allows the database to handle concurrency in the "right way", that is, without having corrupted data from interleaved changes.
• Durability: means that the database engine is supposed to protect the data it contains, even in the case of software and hardware failures, as much as it can.
References:
• Learn PostgreSQL

Atomicity
• Atomicity in terms of a transaction means all or nothing. When a transaction is committed, the database either completes the transaction successfully or rolls it back so that the database returns to its original state.
• For example, in an online ticketing application, a booking may consist of two separate actions that form a transaction: reserving the seat for the customer and paying for the seat. A transaction guarantees that when a booking is completed, both these actions, although independent, happen within the same transaction. If any of the actions fail, the entire transaction is rolled back, and the booking is freed up for another transaction attempting to take it.
References:
• https://fauna.com/blog/database-transaction

Consistency
• Ensures that transactions only make changes to tables in predefined, predictable ways.
• The data must be consistent before and after the transaction.
References:
• https://fauna.com/blog/database-transaction

Isolation
• With multiple concurrent transactions running at the same time, each transaction should be kept independent, without affecting other transactions executing simultaneously.
• Transactions are instead run in parallel, and some form of database locking is utilized to ensure that the result of one transaction does not impact that of another.
References:
• https://fauna.com/blog/database-transaction

Read phenomena
Dirty read
• Happens when a transaction reads data written by another concurrent transaction that has not been committed yet.
• We don’t know if that other transaction will be committed or rolled back.
• We might end up using incorrect data if a rollback happens.
Non-repeatable read
• Happens when a transaction reads the same record twice and sees different values.
• This is because the row has been modified by another transaction that committed after the first read.
Phantom read
• The same query is re-executed, but a different set of rows is returned.
• This can be due to changes made by other recently committed transactions, such as inserting new rows or deleting existing rows that satisfy the search condition of the current transaction’s queries.
Serialization anomaly
• #Homework
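A sketch of how a dirty read can be provoked in MySQL with two concurrent sessions, using the same hypothetical accounts table; READ UNCOMMITTED is the only level that permits it:

  -- Session 1:
  START TRANSACTION;
  UPDATE accounts SET balance = 0 WHERE id = 'A';   -- not committed yet

  -- Session 2:
  SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
  SELECT balance FROM accounts WHERE id = 'A';      -- dirty read: sees 0

  -- Session 1:
  ROLLBACK;   -- session 2 acted on data that officially never existed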
Dirty Read
Dirty read is the state of reading uncommitted data. We are not sure about the consistency of the data that is read, as we don’t know the result of the open transaction(s). After reading the uncommitted data, the open transaction can be completed with either a rollback or a successful commit.
References:
• https://www.sqlshack.com/dirty-reads-and-the-readuncommitted-isolation-level/

Non-Repeatable read
Happens when one transaction reads the same data twice while another transaction updates that data in between the first and second read of the first transaction.
References:
• https://dotnettutorials.net/lesson/non-repeatable-readconcurrency-problem/

Phantom read
Happens when one transaction executes a query twice and gets a different number of rows in the result set. This generally happens when a second transaction inserts some new rows, in between the first and second query execution of the first transaction, that match the WHERE clause of the query executed by the first transaction.
References:
• https://dotnettutorials.net/lesson/phantom-read-concurrencyproblem-sql-server/

• Read, understand, and learn about the “serialization anomaly” that we skipped, and other read phenomena.
• Support your presentation with examples.

Isolation level
Read uncommitted
• Transactions at this level can see data written by other, uncommitted transactions, thus allowing the phenomena above to happen.
• NO ISOLATION.
Read committed
• Transactions can only see data that has been committed by other transactions.
• Because of this, a dirty read is no longer possible.
• Each query sees committed values.
Repeatable read
• A stricter isolation level.
• It ensures that the same SELECT query will always return the same result (it sees the committed values as of the beginning of the transaction).
• This holds even if some other concurrent transactions have committed new changes that satisfy the query.
Serializable
• The highest isolation level.
• Concurrent transactions running at this level are guaranteed to yield the same result as if they were executed sequentially in some order, one after another, without overlapping.
References:
• https://dev.to/techschoolguru/understand-isolation-levels-read-phenomena-in-mysql-postgres-c2e

MySQL isolation level from our container
• Check the transaction isolation level.
• Change the transaction isolation level (see the sketch below).
Try different isolation levels with different CRUD operations! MySQL will be a good option to try. Make two transactions and play normally, following the below URL:
https://dev.to/techschoolguru/understand-isolation-levels-read-phenomena-in-mysql-postgres-c2e
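A minimal sketch for checking and changing the isolation level in MySQL; the variable is transaction_isolation in MySQL 8.0+ (older versions call it tx_isolation):

  -- Check the current isolation level:
  SELECT @@transaction_isolation;          -- this session
  SELECT @@global.transaction_isolation;   -- server default

  -- Change it for the current session only:
  SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ;

  -- Or for all new sessions:
  SET GLOBAL TRANSACTION ISOLATION LEVEL READ COMMITTED;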
Let us recap it well!
(Isolation levels vs. read phenomena matrix; Yes = may occur.)
References:
• https://dotnettutorials.net/lesson/phantom-read-concurrencyproblem-sql-server/

Durability
• Durability means that a successful transaction commit will survive permanently. To accomplish this, an entry is added to the database transaction log for each successful transaction.
• Changes made by committed transactions should be persisted in durable storage.
• The changes of a successful transaction survive even if a system failure occurs.
• Popular durability techniques:
  • WAL (Write-Ahead Log).
  • Append Only File (AOF).
  • Asynchronous snapshot.
References:
• https://fauna.com/blog/database-transaction

ACID Summarized
References:
• https://www.bmc.com/blogs/acid-atomic-consistent-isolated-durable/

ACID Summarized
• ACID transactions ensure the highest possible data reliability and integrity.
• They ensure that your data never falls into an inconsistent state because of an operation that only partially completes.
• For example, without ACID transactions, if you were writing some data to a database table but the power went out unexpectedly, it's possible that only some of your data would have been saved, while some of it would not.
• Now your database is in an inconsistent state that is very difficult and time-consuming to recover from.

DB Internals Part II
Pages. B-Tree. WAL.

DBs internals – Pages
• Databases often use fixed-size pages to store data. Tables, collections, rows, columns, indexes, sequences, documents and more eventually end up as bytes in a page.
• Databases read and write in pages. When you read a row from a table, the database finds the page where the row lives and identifies the file and offset where the page is located on disk.
• The database then asks the OS to read from the file at that particular offset for the length of the page.
• The OS checks its filesystem cache, and if the required data isn’t there, the OS issues the read and pulls the page into memory for the database to consume.
• The smaller the rows, the more rows fit in a single page.
References:
• https://medium.com/p/38cdb2c79eb5

Pages
• MSSQL Server:
  • The disk space allocated to data in a database is logically divided into pages, numbered contiguously from 0 to n. Disk I/O operations are performed at the page level.
  • SQL Server reads or writes whole data pages.
  • All data pages are the same size: 8 KB. This is similar to Oracle as well (its page size is also 8 KB).
  • The index pages contain index references about where the data is.
  • There are system pages that store various metadata about the organization of the data.
References:
• https://medium.com/p/38cdb2c79eb5
• https://learn.microsoft.com/en-us/sql/relational-databases/pages-and-extents-architecture-guide?view=sql-server-ver16

Pages
• MSSQL Server:
  • Data rows are stored on the page serially.
  • Each row offset table contains one entry for each row on the page.
References:
• https://learn.microsoft.com/en-us/sql/relational-databases/pages-and-extents-architecture-guide?view=sql-server-ver16

Pages with Insert, Update, Delete & Select
• Update:
  • When a user updates a row, the database finds the page where the row lives, pulls the page into the buffer pool, updates the row in memory, and persists an entry of the change (WAL) to disk.
  • The page can remain in memory so it may receive more writes before it is finally flushed back to disk (minimizing the number of I/Os).
• Insert:
  • When a user inserts a row, the database … #Homework
• Delete:
  • When a user deletes row(s), the database … #Homework
• Select: #Homework
  • When a user selects row(s), the database …
  • With index: …
  • Without index: …
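To make “databases read and write in pages” concrete on our MySQL container, a one-line sketch; InnoDB's page size defaults to 16 KB (versus the 8 KB SQL Server/Oracle pages above) and is fixed when the server is initialized:

  -- InnoDB's page size, in bytes (16384 by default):
  SHOW VARIABLES LIKE 'innodb_page_size';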
Pages – more details 01
• When a user updates a row:
  • The database finds the page where the row lives.
  • It pulls the page into the buffer pool, updates the row in memory, and persists a journal entry of the change (often called WAL) to disk.
  • The page can remain in memory so it may receive more writes before it is finally flushed back to disk, minimizing the number of I/Os.
• Deletes and inserts work the same way, but implementations may vary.
• Row-store databases write rows and all their attributes one after the other, packed in the page, so that OLTP workloads (especially write workloads) perform better.
• Column-store databases write the rows into pages column by column, so that OLAP workloads that summarize a few fields are more efficient.
• Document-based databases compress documents and store them in pages just like row stores, and graph-based databases persist connectivity in pages such that a page read is efficient for traversing graphs; this can also be tuned for depth-first vs. breadth-first traversal.

Pages – more details 02
• Whether you are storing rows, columns, documents or graphs, the goal is to pack your items into the page such that a page read is effective. The page should give you as much useful information as possible to help with the client-side workload. If you find yourself reading many pages to do tiny little work, consider rethinking your data modeling.
• Each database has a different implementation of what the page looks like and how it is physically stored on disk, but in the end the concept is the same.
• Small pages are faster to read and write. However, the overhead cost of the page header metadata compared to useful data can be higher.
• Larger sizes can minimize metadata overhead and page splits, but at the cost of slower cold reads and writes.

B-tree
• One of the most popular storage structures is a B-Tree.
• Many open source database systems are B-Tree based, and over the years they’ve proven to cover the majority of use cases.
• B-trees are balanced trees, ensuring that the distance from the root node to any leaf node is roughly the same. This balance helps in maintaining consistent search, insert, and delete performance regardless of the size of the tree.

WAL I
• Write-ahead logging requires changes to data to be written to a log before the corresponding data is updated in the main storage. The idea is that before any modifications are made to the actual database or file system, a record of these changes is written to a log file. Once the log entry is successfully written to disk, the corresponding changes can be applied to the main storage. It is an append-only mechanism and ensures data integrity.
• By writing the changes to the log first, the system ensures that even if a crash occurs or power is lost, the modifications are not lost. During recovery, the system can use the log to bring the data back to a consistent state by replaying the logged changes.
• Write-ahead logging is often used in transactional systems to maintain atomicity. Atomicity ensures that a transaction is treated as a single unit: if any part of the transaction fails, the entire transaction is rolled back, and the database remains in a consistent state.
References:
• https://medium.com/@venkatramankannantech/a-comprehensive-guide-to-database-internals-37c8d9ed2407

WAL II
• Start of Transaction: when a transaction begins, the system creates a new log entry to mark the start of the transaction.
• Modifications: as the transaction progresses and data is modified or updated, the changes are recorded in the log file.
• Commit: when the transaction is ready to be committed, a special log entry called a “commit record” is written to the log, indicating that all changes associated with the transaction are now considered durable.
• Apply Changes: after the commit record is successfully written to the log, the changes are applied to the main database or file system. This ensures that data remains consistent.
References:
• https://medium.com/@venkatramankannantech/a-comprehensive-guide-to-database-internals-37c8d9ed2407
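InnoDB's redo log plays the WAL role described above. A small sketch for poking at it in the MySQL container; note that innodb_redo_log_capacity exists only in MySQL 8.0.30+ (older versions use innodb_log_file_size):

  -- Where the write-ahead (redo) log lives and how large it is:
  SHOW VARIABLES LIKE 'innodb_log_group_home_dir';
  SHOW VARIABLES LIKE 'innodb_redo_log_capacity';

  -- Durability knob: 1 = flush the log to disk at every commit (full durability).
  SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit';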
Summary
We have taken:
1. DBMS 360 view (SQL, NoSQL, row, columnar, document, key-value, graph).
2. Internals, including storage engines, B-Tree, WAL, pages, ACID.
3. CAP theorem understanding.

Session Conclusion
1. Assignment Reading
• Storage engines.
• B-Tree.
• LSM.
• WAL.
• Pages in DB engines.
2. Hands on
• Follow what is in the homework slides.
3. Resources
• Add useful resources to our knowledge base.
4. Questions
• Add your valid questions to the GitHub issues.

Thank You
Remember, do your best!! No excuse..