Distributed databases
A brief introduction
(Figure numbers may not be the same as in the book)

Distributed database concepts
• Distributed database (DDB)
  – Collection of multiple logically interrelated databases distributed over a computer network
• Distributed database management system (DDBMS)
  – Software system that manages a distributed database and makes the distribution transparent to the users

Transparency
• Hiding implementation details from the users of the database
• Data organization transparency
  – Location transparency
    • How data is used does not depend on where it is stored
  – Naming transparency
    • Naming is independent of location
• Replication transparency
  – Copies can be kept for availability, performance, and reliability
    • Users are unaware of the existence of these copies
• Fragmentation transparency
  – One table is divided across several locations
  – Horizontal fragmentation
    • Table divided by rows
  – Vertical fragmentation
    • Table divided by columns

Example: Replication and horizontal fragmentation
[Figure]

Reliability and availability
• Two common advantages of distributed databases
• Reliability
  – The probability that a system is running at a certain time point
• Availability
  – The probability that a system is continuously available during a time interval

Advantages of distributed databases
1. Improved ease and flexibility of application development
  – Transparency: Developers do not have to know …
2. Increased reliability and availability
  – Faults are isolated to a single site
3. Improved performance
  – Data localization means less network traffic
  – Parallelism
4. Easier expansion
  – Easy to add more data, processors, etc.
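The horizontal and vertical fragmentation mentioned under fragmentation transparency can be illustrated with a small sketch. The EMPLOYEE relation, its column names, its rows, and all function names below are hypothetical and chosen only for illustration; a real DDBMS does this inside the system, hidden from the user.

```python
# Hypothetical relation; the columns and rows are made up for illustration.
EMPLOYEE = [
    {"ssn": 1, "name": "Ann", "salary": 50000, "site": "Copenhagen"},
    {"ssn": 2, "name": "Bob", "salary": 42000, "site": "Roskilde"},
    {"ssn": 3, "name": "Cem", "salary": 61000, "site": "Copenhagen"},
]


def horizontal_fragment(rows, predicate):
    """Keep whole rows (all columns) that satisfy the predicate."""
    return [dict(r) for r in rows if predicate(r)]


def vertical_fragment(rows, key, columns):
    """Keep only some columns; the primary key is always included."""
    kept = [key] + [c for c in columns if c != key]
    return [{c: r[c] for c in kept} for r in rows]


def rebuild_from_horizontal(*fragments, key):
    """Union of horizontal fragments, ordered by primary key."""
    merged = [r for frag in fragments for r in frag]
    return sorted(merged, key=lambda r: r[key])


def rebuild_from_vertical(left, right, key):
    """Join two vertical fragments on the primary key."""
    right_by_key = {r[key]: r for r in right}
    return [{**l, **right_by_key[l[key]]} for l in left]


# Horizontal fragmentation: rows are split by location.
cph = horizontal_fragment(EMPLOYEE, lambda r: r["site"] == "Copenhagen")
ros = horizontal_fragment(EMPLOYEE, lambda r: r["site"] == "Roskilde")

# Vertical fragmentation: columns are split, the key is kept in both parts.
names = vertical_fragment(EMPLOYEE, "ssn", ["name", "site"])
pay = vertical_fragment(EMPLOYEE, "ssn", ["salary"])

restored_h = rebuild_from_horizontal(cph, ros, key="ssn")
restored_v = rebuild_from_vertical(names, pay, "ssn")
```

Both rebuild functions return the original relation, which is what allows the system to present the fragments to the user as one table. Keeping the primary key in every vertical fragment is what makes the join possible.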
Types of distributed database systems
• Degree of homogeneity
  – Homogeneous: All local DBMSs run identical software
  – Heterogeneous: Local DBMSs run different software
• Autonomy
  – Local autonomy: Local site can function as a standalone DBMS
  – No autonomy: Local site cannot function as a standalone DBMS

Classification of distributed databases
[Figure]

Database system architectures
[Figure]

General architecture
[Figure]

Component architecture of distributed databases
[Figure]

Data fragmentation
• Which site should store which portion of the database?
• Simple fragmentation
  – Each site has a whole relation
• Horizontal fragmentation
  – Subset of rows in each site
    • Sometimes based on location
• Vertical fragmentation
  – Subset of columns in each site
    • Primary key must be in all sites
• Mixed / hybrid fragmentation
  – Horizontal + vertical fragmentation
  – Described by fragmentation schema

Example fragmentation
[Figure]

Example fragmentation, continued
[Figure]

Data replication
• Replication to improve availability
• Fully replicated database
  – All data is replicated to each site
• No replication
  – All data is stored at exactly one site
• Partial replication
  – Some data is replicated to some sites
  – Described by replication schema

Distributed query processing
1. Query mapping
  – Query mapped from SQL to relational algebra using the global conceptual schema
2. Localization
  – Map the query on the global schema to separate queries on the local schemas
  – Using fragmentation and replication information
3. Global query optimization
  – Cost = CPU time + I/O time + communication time
4. Local query optimization
  – Same as in centralized databases

Distributed transaction management, Two-phase commit protocol (2PC)
• Global transaction manager / coordinator
  – Coordinates the results of the local transaction managers
  – All local transaction managers must be able to "commit" before any of them actually does the "commit"
• Two-phase commit protocol (2PC)
  – Phase 1
    • Individual databases tell the coordinator that they have finished their part of the transaction
    • When all individual databases have finished: Coordinator sends "prepare for commit" to all databases
    • Individual databases answer "ready to commit" or "cannot commit"
  – Phase 2
    • If all databases answered "ready to commit", coordinator sends "commit" to all databases
    • If one (or more) databases answered "cannot commit", coordinator sends "abort" to all databases
    • Timeout: if one (or more) databases do not answer within a given amount of time, coordinator sends "abort"

Two-phase commit protocol (2PC)
• Problems with 2PC
  – Coordinator crashes: All participating sites are left waiting
  – No way of knowing whether participating sites really got the "commit" / "abort"

Three-phase commit (3PC)
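As a rough illustration of the two-phase commit protocol (2PC) described earlier, the coordinator's decision rule can be sketched in a few lines. The `Participant` class, the `Vote` enum, and the `two_phase_commit` function are hypothetical names made up for this sketch. It assumes messages are delivered reliably and the coordinator never crashes, so it does not model the problems with 2PC; a missing answer (timeout) is modeled simply as `None`.

```python
from enum import Enum


class Vote(Enum):
    READY = "ready to commit"
    CANNOT = "cannot commit"


class Participant:
    """Hypothetical local database taking part in a global transaction."""

    def __init__(self, name, can_commit=True, responds=True):
        self.name = name
        self.can_commit = can_commit  # would the local commit succeed?
        self.responds = responds      # False models a timeout
        self.state = "active"

    def prepare(self):
        """Answer the coordinator's "prepare for commit" message."""
        if not self.responds:
            return None  # no answer within the time limit
        if self.can_commit:
            self.state = "prepared"
            return Vote.READY
        return Vote.CANNOT

    def finish(self, decision):
        """Apply the coordinator's final "commit" or "abort"."""
        self.state = "committed" if decision == "commit" else "aborted"


def two_phase_commit(participants):
    # Phase 1: send "prepare for commit" and collect the answers.
    votes = [p.prepare() for p in participants]

    # Phase 2: commit only if every participant answered "ready to commit".
    # Any "cannot commit" or missing answer leads to a global abort.
    decision = "commit" if all(v is Vote.READY for v in votes) else "abort"
    for p in participants:
        p.finish(decision)
    return decision


all_ready = two_phase_commit([Participant("A"), Participant("B")])
one_refuses = two_phase_commit([Participant("A"), Participant("B", can_commit=False)])
one_silent = two_phase_commit([Participant("A"), Participant("B", responds=False)])
```

Here `all_ready` is "commit", while `one_refuses` and `one_silent` are both "abort": a single negative or missing answer is enough to abort the transaction at every site.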