Chapter 25, 6e - 24, 5e Distributed Databases

CSE 4701
Prof. Steven A. Demurjian, Sr.
Computer Science & Engineering Department
The University of Connecticut
191 Auditorium Road, Box U-155
Storrs, CT 06269-3155
steve@engr.uconn.edu
http://www.engr.uconn.edu/~steve
(860) 486 - 4818


A portion of these slides is being used with the permission of Dr. Ling Liu, Associate Professor, College of Computing, Georgia Tech.
The remaining slides represent new material.
Chaps25.1
Classical and Distributed Architectures

Classic/Centralized DBMS Dominated the Commercial Market from 1970s Forward
Problems of this Approach
  • Difficult to Scale w.r.t. Performance Gains
    – If DB Overloaded, Replace with a Faster Computer
    – This can Only Go So Far - Disk Bottlenecks
Distributed DBMS have Evolved to Address a Number of Issues
  • Improved Performance
    – Putting Data “Near” Location where it is Needed
    – Replication of Data for Fault Tolerance
    – Vertical and Horizontal Partitioning of DB Tuples
Chaps25.2
Common Features of Centralized DBMS

Data Independence
  • High-Level Representation via Conceptual and External Schemas
  • Physical Representation (Internal Schema) Hidden
Program Independence
  • Multiple Applications can Share Data
  • Views/External Schema Support this Capability
Reduction of Program/Data Redundancy
  • Single, Unique, Conceptual Schema
    – Shared Database
    – Almost No Data Redundancy
  • Controlled Data Access Reduces Inconsistencies
    – Programs Execute with Consistent Results
Chaps25.3
Common Features of Centralized DBMS

Promote Sharing: Automatically Provided via Concurrency Control (CC)
  • No Longer a Programmatic Issue
  • Most DBMSs Offer Locking for Key Shared Data
    – Oracle Allows Locks on Data Items (Attributes)
    – For Example, Controlling Access to a Shared Identifier
Coherent and Central DB Administration
Semantic DB Integrity via the Automatic Enforcement of Data Consistency via Integrity Constraints/Rules
Data Resiliency
  • Physical Integrity of Data in the Presence of Faults and Errors
  • Supported by DB Recovery
Data Security: Control Access by Authorized Users to Sensitive Data
Chaps25.4
Shared Nothing Architecture

In this Architecture, Each DBMS Operates Autonomously
There is No Sharing
  • Three Separate DBMSs on Three Different Computers
Applications/Clients Must Know About the External Schemas of all Three DBMSs for
  • Database Retrieval
  • Client Processing
Complicates Client
  • Different DBMS Platforms (Oracle, Sybase, Informix, ...)
  • Different Access Modes (Query, Embedded, ODBC)
  • Difficult for Software Engineers (SWEs) to Code
Chaps25.5
Difficulty in Access – Manage Multiple APIs

Each Platform has a Different API
  • API1, API2, …, APIn
    – An App Programmer Must Utilize All Three APIs, which could Differ by Programming Language (PL) - C++, C, Java, REST, etc.
    – Any Interactions Across the 3 DBs Must be Handled Programmatically, without DB Capabilities
[Figure: one API per DBMS platform (API1, API2, APIn)]
Chaps25.6
NW Architecture with Centralized DB

High-Speed NWs/WANs Spawned Centralized DB Accessible Worldwide
  • Clients at Any Site can Access Repository
  • Data May be “Far” Away - Increased Access Time
  • In Practice, Each Remote Site Needs only a Portion of the Data in DB1 and/or DB2
  • Inefficient, no Replication w.r.t. Failure
Chaps25.7
Fully Distributed Architecture

The Five Sites (Chicago, SF, LA, NY, Atlanta) each have a “Portion” of the Database - it is Distributed
Replication is Possible for Fault Tolerance
Queries at one Site May Need to Access Data at Another Site (e.g., for a Join)
Increased Transaction Processing Complexity
Chaps25.8
Distributed Database Concepts

A transaction can be executed by multiple networked computers in a unified manner.
A distributed database (DDB) processes a unit of execution (a transaction) in a distributed manner.
A distributed database (DDB) can be defined as
  • A collection of multiple logically related databases distributed over a computer network
A distributed database management system (DDBMS) can be defined as
  • A software system that manages a distributed database while making the distribution transparent to the user
Chaps25.9
Goals of DDBMS

Support User Distribution Across Multiple Sites
  • Remote Access by Users Regardless of Location
  • Distribution and Replication of Database Content
Provide Location Transparency
  • Users Manipulate their Own Data
  • Non-Local Sites “Appear” Local to Any User
Provide Transaction Control Akin to Centralized Case
  • Transaction Control Hides Distribution
  • Concurrency Control (CC) and Serializability Must be Extended
Minimize Communications Cost
  • Optimize Use of Network - a Critical Issue
  • Distributed DB Design Supported by Partitioning (Fragmentation) and Replication
Chaps25.10
Goals of DDBMS

Improve Response Time for DB Access
  • Use a More Sophisticated Load Control for Transaction Processing
  • However, Synchronization Across Sites May Introduce Additional Overhead
System Availability
  • Site Independence in the Presence of Site Failure
  • Subset of Database is Always Available
  • Replication can Keep All Data Available, Even When Multiple Sites Fail
Modularity
  • Incremental Growth with the Addition of Sites
  • Dedicate Sites to Specific Tasks
Chaps25.11
Advantages of DDBMS

There are Four Major Advantages
1. Transparency
  • Distribution/NW Transparency
    – User Doesn’t Know about NW Configuration (Location Transparency)
    – User can Find Object at any Site (Naming Transparency)
  • Replication Transparency (see next slide)
    – User Doesn’t Know Location of Data
    – Replicas are Transparently Accessible
  • Fragmentation Transparency
    – Horizontal Fragmentation (Distribute by Row)
    – Vertical Fragmentation (Distribute by Column)
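As a rough illustration of location and fragmentation transparency, the Python sketch below routes a lookup through a catalog so that the caller never names a site. The catalog layout, the site names, the sample rows, and the function `find_employee` are all hypothetical and are not taken from the slides.

```python
# Hypothetical catalog: maps a logical relation to the sites that hold
# its horizontal fragments. Site names and rows are made up.
CATALOG = {
    "EMPLOYEE": ["site_chicago", "site_ny"],
}

# Stand-in for the data stored at each site, keyed by SSN.
SITE_DATA = {
    "site_chicago": {"EMPLOYEE": {"111": {"name": "Smith", "dno": 5}}},
    "site_ny": {"EMPLOYEE": {"222": {"name": "Wong", "dno": 4}}},
}


def find_employee(ssn):
    """Return the row for ssn; the caller never mentions a site."""
    for site in CATALOG["EMPLOYEE"]:
        row = SITE_DATA[site]["EMPLOYEE"].get(ssn)
        if row is not None:
            return row
    return None


print(find_employee("222"))
```

A real DDBMS would consult fragment predicates in the catalog rather than probing each site in turn; the linear probe simply keeps the sketch short.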
Chaps25.12
Data Distribution and Replication
Chaps25.13
Other Advantages of DDBMS
2. Increased Reliability and Availability
  • Reliability - System Always Running
  • Availability - Data Always Present
  • Achieved via Replication and Distribution
  • Ability to Make Single Query for Entire DDBMS
3. Improved Performance
  • Sites Able to Utilize Data that is Local for Majority of Queries
4. Easier Expansion
  • Improve Performance of Site by
    – Upgrading Processor of Computer
    – Adding Additional Disks
    – Splitting a Site into Two or More Sites
  • Expansion over Time as Business Grows
Chaps25.14
Challenges of DDBMS

Tracking Data - Meta Data More Complex
  • Must Track Distribution (Where is the Data)
  • Vertical and Horizontal (V & H) Fragmentation (How is the Data Split)
  • Replication (Multiple Copies for Consistency)
Distributed Query Processing
  • Optimization, Accessibility, etc., More Complex
  • Block Analysis of Data Size Must also Now Consider the NW Transmission Time
Distributed Transaction Processing (TP)
  • TP Potentially Spans Multiple Sites
  • Submit Query to Multiple Sites
  • Collect and Collate Results
Distributed Concurrency Control Across Nodes
Chaps25.15
Challenges of DDBMS

Replicated Data Management
  • TP Must Choose the Replica to Access
  • Updates Must Modify All Replica Copies
Distributed Database Recovery
  • Recovery of Individual Sites
  • Recovery Across DDBMS
Security
  • Local and Remote Authorization
  • During TP, be Able to Verify Remote Privileges
Distributed Directory Management
  • Meta-Data on Database - Local and Remote
  • Must Maintain Replicas of this - Every Site Tracks the Meta-Data for All Sites
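The two replicated-data rules above (a transaction picks one replica to read, and an update must reach every copy) can be sketched in Python as follows. This is one simple policy, often called read-one/write-all; the slides do not name a policy, and the class `ReplicatedItem`, the site names, and the values are hypothetical.

```python
class ReplicatedItem:
    """One logical data item stored as identical copies at several sites."""

    def __init__(self, sites, value):
        self.copies = {site: value for site in sites}

    def read(self, preferred_site):
        # A read may use any single replica; the caller's local site is
        # preferred when it holds a copy.
        if preferred_site in self.copies:
            return self.copies[preferred_site]
        return next(iter(self.copies.values()))

    def write(self, value):
        # An update must reach every replica so the copies stay consistent.
        for site in self.copies:
            self.copies[site] = value


salary = ReplicatedItem(["chicago", "ny", "la"], 30000)
salary.write(32000)
print(salary.read("ny"))
```

The sketch ignores site failures and concurrent writers, which are exactly the cases that make replicated data management hard in practice.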
Chaps25.16
A Complete Schema with Keys ...
Keys Allow us to Establish Links Between Relations
Chaps25.17
…and Corresponding DB Tables
which Represent Tuples/Instances of Each Relation
[Figure: sample database tables; cell values not recoverable]
Chaps25.18
…with Remaining DB Tables
Chaps25.19
What is Fragmentation?

Fragmentation Divides a DB Across Multiple Sites
Two Types of Fragmentation
  • Horizontal Fragmentation
    – Given a Relation R with n Total Tuples, Spread Entire Tuples Across Multiple Sites
    – Each Site has a Subset of the n Tuples
    – Essentially, Horizontal Fragmentation is a Selection
  • Vertical Fragmentation
    – Given a Relation R with m Attributes and n Total Tuples, Spread the Columns Across Multiple Sites
    – Essentially, Vertical Fragmentation is a Projection
    – Not Generally Utilized in Practice
In Both Cases, Sites can Overlap for Replication
Chaps25.20
Horizontal Fragmentation

A horizontal fragment is a subset of a relation containing the tuples that satisfy a selection condition.
Consider the Employee relation with the condition DNO = 5.
All tuples satisfying this condition form a subset that is a horizontal fragment of the Employee relation.
A selection condition may be composed of several conditions connected by AND or OR.
Derived horizontal fragmentation:
  • The partitioning of a primary relation is applied to secondary relations that are related to it by foreign keys.
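A minimal Python sketch of both ideas, assuming made-up rows and a hypothetical helper `horizontal_fragment`: the primary fragment is a selection on Employee (DNO = 5), and the derived fragment keeps the WORKS_ON tuples whose foreign key points into that fragment. The slides' own WORKS_ON fragments may be defined differently; this only illustrates the mechanism.

```python
# Made-up sample rows; attribute names follow the Employee relation.
EMPLOYEE = [
    {"ssn": "111", "lname": "Smith", "dno": 5},
    {"ssn": "222", "lname": "Wong", "dno": 5},
    {"ssn": "333", "lname": "Zelaya", "dno": 4},
]
# WORKS_ON references EMPLOYEE through the foreign key essn.
WORKS_ON = [
    {"essn": "111", "pno": 1},
    {"essn": "333", "pno": 30},
]


def horizontal_fragment(relation, condition):
    """Selection: keep whole tuples that satisfy the condition."""
    return [row for row in relation if condition(row)]


# Primary fragment: employees of department 5.
emp_d5 = horizontal_fragment(EMPLOYEE, lambda r: r["dno"] == 5)

# Derived fragment: WORKS_ON tuples whose essn matches an employee in emp_d5.
ssns_d5 = {row["ssn"] for row in emp_d5}
works_on_d5 = horizontal_fragment(WORKS_ON, lambda r: r["essn"] in ssns_d5)

print(emp_d5)
print(works_on_d5)
```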
Chaps25.21
Horizontal Fragmentation

Site 2 Tracks All Information Related to Dept. 5
Chaps25.22
Horizontal Fragmentation

Site 3 Tracks All Information Related to Dept. 4
Note that an Employee Could be Listed in Both Cases, if s/he Works on a Project for Both Departments
Chaps25.23
Refined Horizontal Fragmentation

Further Fragment from Site 2 based on the Dept. that the Employee Works in
Notice that G1 + G2 + G3 is the Same as WORKS_ON5; there is no Overlap
Chaps25.24
Refined Horizontal Fragmentation

Further Fragment from Site 3 based on the Dept. that the Employee Works in
Notice that G4 + G5 + G6 is the Same as WORKS_ON4
Note that Some Fragments can be Empty
Chaps25.25
Vertical Fragmentation

A vertical fragment is a subset of a relation created via a subset of its columns.
  • A vertical fragment of a relation contains the values of the selected columns.
  • No selection condition is used in vertical fragmentation.
  • It is a strict vertical slice/partition.
Consider the Employee relation.
  • A vertical fragment of Employee can be created by keeping the values of Name, Bdate, Sex, and Address.
Since there is no condition for creating a vertical fragment:
  • Each fragment must include the primary key attribute of the parent relation Employee.
  • All vertical fragments of a relation are then connected through that key.
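The role of the primary key can be seen in a short Python sketch: each vertical fragment is a projection that carries the key along, and the original tuples are rebuilt by matching fragments on that key. The rows and the helper names `vertical_fragment` and `rejoin` are hypothetical.

```python
# Made-up sample rows; ssn plays the role of the primary key.
EMPLOYEE = [
    {"ssn": "111", "name": "Smith", "bdate": "1965-01-09", "salary": 30000, "dno": 5},
    {"ssn": "222", "name": "Wong", "bdate": "1955-12-08", "salary": 40000, "dno": 5},
]


def vertical_fragment(relation, key, columns):
    """Projection that always carries the primary key along."""
    return [{c: row[c] for c in (key, *columns)} for row in relation]


def rejoin(frag_a, frag_b, key):
    """Rebuild full tuples by matching fragments on the primary key."""
    b_by_key = {row[key]: row for row in frag_b}
    return [{**row, **b_by_key[row[key]]} for row in frag_a]


emp_demo = vertical_fragment(EMPLOYEE, "ssn", ["name", "bdate"])
emp_pay = vertical_fragment(EMPLOYEE, "ssn", ["salary", "dno"])
print(rejoin(emp_demo, emp_pay, "ssn") == EMPLOYEE)
```

Without the key in every fragment there would be no reliable way to pair a row of one fragment with its counterpart in another.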
Chaps25.26
Vertical Fragmentation Example

Partition the Employee Table as Below
Notice Each Vertical Fragment Needs the Key Column
[Figure: vertical fragments EmpDemo and EmpSupvrDept]
Chaps25.27
Homogeneous DDBMS

Homogeneous
  • Identical Software (w.r.t. Database)
  • One DB Product (e.g., Oracle, DB2, Sybase) is Distributed and Available at All Sites
  • Uniformity w.r.t. Administration, Maintenance, Client Access, Users, Security, etc.
  • Interaction by Programmatic Clients is Consistent (e.g., JDBC or ODBC or REST API …)
Chaps25.28
Non-Federated Heterogeneous DDBMS

Non-Federated Heterogeneous
  • Different Software (w.r.t. Database)
  • Multiple DB Products (e.g., Oracle at One Site, MySQL at Another, Sybase, Informix, etc.)
  • Replicated Administration (e.g., Users Need Accounts on Multiple Systems)
  • Varied Programmatic Access - SWEs Must Know All Platforms/Client Software Complicated
  • Very Close to Shared Nothing Architecture
Chaps25.29
Federated DDBMS

Federated
  • Multiple DBMS Platforms Overlaid with a Global Schema View
  • Single External Schema Combines Schemas from all Sites
Multiple Data Models
  • Relational in one Component DBS
  • Object-Oriented in another DBS
  • Hierarchical in a 3rd DBS
Chaps25.30
Federated DBMS Issues

Differences in Data Models
  • Reconcile Relational vs. Object-Oriented Models
  • Each Different Model has Different Capabilities
  • These Differences Must be Addressed in Order to Present a Federated Schema
Differences in Constraints
  • Referential Integrity Constraints in Different DBSs
  • Different Constraints on “Similar” Data
  • Federated Schema Must Deal with these Conflicts
Differences in Query Languages
  • SQL-89, SQL-92 (SQL2), SQL:1999 (SQL3)
  • Specific Types in Different DBMSs (e.g., Oracle BLOBs)
Differences in Key Processing & Timestamping
Chaps25.31
Heterogeneous Distributed Database Systems

Federated: Each site may run a different database system, but data access is managed through a single conceptual schema.
  • The degree of local autonomy is minimal.
  • Each site must adhere to a centralized access policy.
  • There may be a global schema.
Multi-database: There is no single conceptual global schema.
  • For data access, a schema is constructed dynamically as needed by the application software.
[Figure: five sites (Site 1 to Site 5) connected by a communications network, running object-oriented, relational, hierarchical, and network DBMSs on Unix, Linux, and Windows platforms]
Chaps25.32
Query Processing in Distributed Databases: Issues

Cost of transferring data (files and results) over the network.
  • This cost is usually high, so some optimization is necessary.
  • Example relations: Employee at site 1 and Department at site 2
    – Employee at site 1: 10,000 rows. Row size = 100 bytes. Table size = $10^6$ bytes.
      Employee(Fname, Minit, Lname, SSN, Bdate, Address, Sex, Salary, Superssn, Dno)
    – Department at site 2: 100 rows. Row size = 35 bytes. Table size = 3,500 bytes.
      Department(Dname, Dnumber, Mgrssn, Mgrstartdate)
  • Q: For each employee, retrieve the employee name and the name of the department where the employee works.
  • Q: $\pi_{\text{Fname, Lname, Dname}}(\text{Employee} \bowtie_{\text{Dno} = \text{Dnumber}} \text{Department})$
Chaps25.33
Query Processing in Distributed Databases

Result
  • The result of this query will have 10,000 tuples, assuming that every employee is related to a department.
  • Suppose each result tuple is 40 bytes long.
  • The query is submitted at site 3 and the result is sent to this site.
  • Problem: Employee and Department relations are not present at site 3.
Chaps25.34
Query Processing in Distributed Databases

Strategies:
1. Transfer Employee and Department to site 3.
  • Total bytes transferred = 1,000,000 + 3,500 = 1,003,500 bytes.
2. Transfer Employee to site 2, execute the join at site 2, and send the result to site 3.
  • Query result size = 40 * 10,000 = 400,000 bytes. Total bytes transferred = 400,000 + 1,000,000 = 1,400,000 bytes.
3. Transfer Department to site 1, execute the join at site 1, and send the result to site 3.
  • Total bytes transferred = 400,000 + 3,500 = 403,500 bytes.
Optimization criterion: minimizing data transfer.
Preferred approach: strategy 3.
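The arithmetic behind the three strategies can be checked with a few lines of Python. The sizes come from the slides; the variable names and the strategy labels are illustrative only.

```python
EMPLOYEE_BYTES = 10_000 * 100   # 10,000 rows of 100 bytes at site 1
DEPARTMENT_BYTES = 100 * 35     # 100 rows of 35 bytes at site 2
RESULT_BYTES = 10_000 * 40      # 10,000 result tuples of 40 bytes

# Bytes moved over the network when the result is needed at site 3.
costs = {
    "1: ship both relations to site 3": EMPLOYEE_BYTES + DEPARTMENT_BYTES,
    "2: join at site 2, ship result": EMPLOYEE_BYTES + RESULT_BYTES,
    "3: join at site 1, ship result": DEPARTMENT_BYTES + RESULT_BYTES,
}

best = min(costs, key=costs.get)
for name, total in costs.items():
    print(f"{name}: {total:,} bytes")
print("cheapest:", best)
```

The comparison counts only bytes transferred, which is the optimization criterion used on this slide; local processing cost is ignored.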
Chaps25.35
Query Processing in Distributed Databases

Consider the query
  • Q’: For each department, retrieve the department name and the name of the department manager.
Relational Algebra expression:
  • $\pi_{\text{Fname, Lname, Dname}}(\text{Employee} \bowtie_{\text{Mgrssn} = \text{SSN}} \text{Department})$
Chaps25.36
Query Processing in Distributed Databases

The result of this query has 100 tuples, assuming that every department has a manager. The execution strategies are:
1. Transfer Employee and Department to the result site and perform the join at site 3.
  • Total bytes transferred = 1,000,000 + 3,500 = 1,003,500 bytes.
2. Transfer Employee to site 2, execute the join at site 2, and send the result to site 3. Query result size = 40 * 100 = 4,000 bytes.
  • Total bytes transferred = 4,000 + 1,000,000 = 1,004,000 bytes.
3. Transfer Department to site 1, execute the join at site 1, and send the result to site 3.
  • Total bytes transferred = 4,000 + 3,500 = 7,500 bytes.
Preferred strategy: strategy 3.
Chaps25.37
Query Processing in Distributed Databases

Now suppose the result site is site 2. Possible strategies:
1. Transfer Employee to site 2, execute the query, and present the result to the user at site 2.
  • Total bytes transferred = 1,000,000 bytes for both queries Q and Q’.
2. Transfer Department to site 1, execute the join at site 1, and send the result back to site 2.
  • Total bytes transferred for Q = 400,000 + 3,500 = 403,500 bytes and for Q’ = 4,000 + 3,500 = 7,500 bytes.
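The same reasoning generalizes: for any choice of join site, the bytes moved are the relations that must be shipped to that site plus the result, if the join site is not the result site. The Python sketch below applies this to Q and Q’ with the result at site 2; the function `transfer_cost` and the data layout are hypothetical, while the sizes are those given in the slides.

```python
def transfer_cost(join_site, result_site, relations, result_bytes):
    """Bytes moved if the join runs at join_site.

    relations maps a relation name to (home_site, size_in_bytes).
    """
    shipped = sum(size for home, size in relations.values() if home != join_site)
    if join_site != result_site:
        shipped += result_bytes
    return shipped


RELATIONS = {"Employee": (1, 1_000_000), "Department": (2, 3_500)}
RESULT_BYTES = {"Q": 400_000, "Q'": 4_000}

for query, size in RESULT_BYTES.items():
    for join_site in (1, 2):
        cost = transfer_cost(join_site, 2, RELATIONS, size)
        print(f"{query}: join at site {join_site} -> {cost:,} bytes")
```

Calling `transfer_cost(3, 3, RELATIONS, 400_000)` gives the cost of shipping both relations to a third site, matching the first strategy of the earlier slides.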
Chaps25.38
DDBS Concurrency Control and Recovery

Distributed databases encounter a number of concurrency control and recovery problems that are not present in centralized databases, including:
  • Dealing with multiple copies of data items
    – How are they All Updated if Needed?
  • Failure of individual sites
    – How are Queries Restarted or Rerouted?
  • Communication link failure
    – Network Failure
  • Distributed commit
    – How to Know All Updates Done at all Sites?
  • Distributed deadlock
    – How to Detect and Recover?
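One standard answer to the distributed commit question is the two-phase commit protocol, which the slides do not name; it is introduced here only as an illustration. The Python sketch below is deliberately minimal and assumes no timeouts, logging, or crashes. The class `Participant` and the function `two_phase_commit` are hypothetical names.

```python
class Participant:
    """A site taking part in a distributed transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self):
        # Phase 1: vote yes only if the local updates can be made durable.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def finish(self, commit):
        # Phase 2: apply the coordinator's global decision.
        self.state = "committed" if commit else "aborted"


def two_phase_commit(participants):
    """Commit everywhere only if every site votes yes; otherwise abort."""
    votes = [p.prepare() for p in participants]
    decision = all(votes)
    for p in participants:
        p.finish(decision)
    return decision


sites = [Participant("site1"), Participant("site2"), Participant("site3", can_commit=False)]
print(two_phase_commit(sites), [p.state for p in sites])
```

Because one site votes no, the coordinator aborts the transaction at every site, so no site is left with updates the others lack.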
Chaps25.39
The Next Big Challenge

Interoperability
  • Heterogeneous Distributed Databases
  • Heterogeneous Distributed Systems
  • Autonomous Applications
Scalability
  • Rapid and Continuous Growth
  • Amount of Data
  • Variety of Data Types
  • Dealing with personally identifiable information (PII) and personal health information (PHI)
Emergence of Fitness and Health Monitoring Apps
  • Google Fit and Apple HealthKit
  • New Apple ResearchKit for Medical Research
Chaps25.40
Interoperability: A Classic View
[Figure labels: Simple Federation (Local Schemas, Federated Integration, FDB Global Schema); Multiple Nested Federation (Local Schemas, Federations FDB 1 and FDB3, Federated Integration, FDB Global Schema 4)]
Chaps25.41
Java Client with Wrapper to Legacy Application
Interactions Between Java Client and Legacy Appl. via C and RPC
C is the Medium of Info. Exchange
Java Client with C++/C Wrapper
[Figure labels: Java layer (Java Client, Java Application Code, Wrapper, Mapping Classes); native layer (Native Functions (C++), RPC Client Stubs (C)); Network; Legacy Application]
Chaps25.42
COTS and Legacy Appls. to Java Clients
Java is the Medium of Info. Exchange - C/C++ Appls with Java Wrappers
[Figure labels: two stacks, one for a COTS Application and one for a Legacy Application; each has a native layer (Native Functions that Map to the COTS or Legacy Appl) and a Java layer (Java Application Code, Mapping Classes, Java Network Wrapper); Network; Java Clients]
Chaps25.43
Java Client to Legacy App via RDBS
[Figure labels: Legacy Application; Extract and Generate Data; Transform and Store Data; Transformed Legacy Data; Relational Database System (RDS); Updated Data; Java Client]
Chaps25.44
Database Interoperability in the Internet

Technology
  • Web/HTTP, JDBC/ODBC, CORBA (ORBs + IIOP), XML, SOAP, REST API, WSDL
Architecture
Information Broker
  • Mediator-Based Systems
  • Agent-Based Systems
Chaps25.45
JDBC

JDBC API Provides DB Access Protocols for Open, Query, Close, etc.
Different Drivers for Different DB Platforms
[Figure labels: Java Application; JDBC API; Driver Manager; one Driver per platform (Oracle, Access, Sybase)]
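JDBC itself is a Java API. As a loose analogue in Python, the sketch below uses the standard DB-API through the built-in `sqlite3` module to show the same open, query, close pattern; it is not JDBC, and the table and rows are made up for illustration.

```python
import sqlite3

# Open: sqlite3 is the driver here; another DB-API driver would be
# swapped in for a different database platform.
conn = sqlite3.connect(":memory:")
try:
    conn.execute("CREATE TABLE department (dnumber INTEGER PRIMARY KEY, dname TEXT)")
    conn.executemany(
        "INSERT INTO department VALUES (?, ?)",
        [(4, "Administration"), (5, "Research")],
    )
    # Query: parameters are passed separately from the SQL text.
    rows = conn.execute(
        "SELECT dname FROM department WHERE dnumber = ?", (5,)
    ).fetchall()
    print(rows)
finally:
    # Close: release the connection when done.
    conn.close()
```

As with JDBC drivers, the application code stays largely the same when the underlying driver changes, although SQL dialects and parameter styles can differ between platforms.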
Chaps25.46
Connecting a DB to the Web

Web Servers are Stateless
DB Interactions Tend to be Stateful
Invoking a CGI Script on Each DB Interaction is Very Expensive, Mainly Due to the Cost of DB Open
[Figure labels: Browser; Internet; Web Server; CGI Script Invocation or JDBC Invocation; DBMS]
Chaps25.47
Connecting More Efficiently

To Avoid the Cost of Opening the Database, One can Use Helper Processes that Always Keep the Database Open and Outlive the Web Connection
Newly Invoked CGI Scripts Connect to a Preexisting Helper Process
System is Still Stateless
[Figure labels: Browser; Internet; Web Server; CGI Script or JDBC Invocation; Helper Processes; DBMS]
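The helper-process idea can be approximated in a single Python process: the expensive open happens once, and each later request reuses the connection that is already open. This is only a sketch under that simplification; the names `get_connection` and `handle_request` are hypothetical, and an in-memory SQLite database stands in for the DBMS.

```python
import sqlite3

_connection = None   # kept open by the long-lived helper
open_count = 0       # counts how often the expensive open actually runs


def get_connection():
    """Open the database on first use, then hand back the same connection."""
    global _connection, open_count
    if _connection is None:
        _connection = sqlite3.connect(":memory:")
        open_count += 1
    return _connection


def handle_request(sql):
    """Stand-in for one web request served through the helper."""
    return get_connection().execute(sql).fetchall()


for _ in range(3):
    handle_request("SELECT 1")
print(open_count)
```

Three requests trigger a single open, which is the saving the slide describes; a production setup would typically use a connection pool shared across worker processes or threads.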
Chaps25.48
DB-Internet Architecture
[Figure labels: WWW Clients (Netscape, Info. Explore, HotJava); Internet; HTTP Server; DBWeb Dispatcher; DBWeb Gateways]
Chaps25.49
EJB Architecture
Chaps25.50
Technology Push

Computer/Communication Technology (Almost Free)
  • Plenty of Affordable CPU, Memory, Disk, Network Bandwidth
  • Next Generation Internet: Gigabit Now
  • Wireless: Ubiquitous, High Bandwidth
Information Growth
  • Massively Parallel Generation of Information on the Internet and from a New Generation of Sensors
  • Disk Capacity on the Order of Petabytes
Small, Handy Devices to Access Information
The focus is to make information available to users, in the right form, at the right time, in the appropriate place.
Chaps25.51
Research Challenges

Ubiquitous/Pervasive
  • Many computers and information appliances everywhere, networked together
Inherent Complexity:
  • Coping with Latency (Sometimes Unpredictable)
  • Failure Detection and Recovery (Partial Failure)
  • Concurrency, Load Balancing, Availability, Scale
  • Service Partitioning
  • Ordering of Distributed Events
“Accidental” Complexity:
  • Heterogeneity: Beyond the Local Case: Platform, Protocol, Plus All Local Heterogeneity in Spades.
  • Autonomy: Change and Evolve Autonomously
  • Tool Deficiencies: Language Support (Sockets, RPC), Debugging, Etc.
Chaps25.52
Infosphere
Problem: too many sources, too much information
[Figure labels: Internet: Information Jungle; Sensors; Digital Earth; Infopipes; Personalized Filtering & Info. Delivery; Clean, Reliable, Timely Information, Anywhere]
Chaps25.53
Current State-of-Art – Has Mobile Changed This?
[Figure labels: Thin Client; Web Server; Database Server; Mainframe]
Chaps25.54
Infosphere Scenario – Where Does Mobile Fit?
[Figure labels: Infotaps & Fat Clients; Sensors; Variety of Servers; Many Sources; Database Server]
Chaps25.55
Heterogeneity and Autonomy

Heterogeneity:
  • How Much can we Really Integrate?
  • Syntactic Integration
    – Different Formats and Models
    – XML/JSON/RDF/OWL/SQL Query Languages
  • Semantic Interoperability
    – Basic Research on Ontology, Etc.
    – DoD Maps (Grid, True, and Magnetic North)
Autonomy
  • No Central DBA on the Net
  • Independent Evolution of Schema and Content
  • Interoperation is Voluntary
  • Interface Technology: DCOM (Microsoft Standard), CORBA, Etc.
Chaps25.56
Security and Data Quality

Security
  • System Security in the Broad Sense
  • Attacks: Penetrations, Denial of Service
  • System (and Information) Survivability
    – Security Fault Tolerance
    – Replication for Performance, Availability, and Survivability
Data Quality
  • Web Data Quality Problems
    – Local Updates with Global Effects
    – Unchecked Redundancy (Mutual Copying)
    – Registration of Unchecked Information
    – Spam on the Rise
Chaps25.57
Legacy Data Challenge

Legacy Applications and Data
  • Definition: Important and Difficult to Replace
  • Typically, Mainframe Mission Critical Code
  • Most are OLTP and Database Applications
Evolution of Legacy Databases
  • Client-server Architectures
  • Wrappers
  • Expensive and Gradual in Any Case
Chaps25.58
Potential Value Added/Jumping on Bandwagon

Sophisticated Query Capability
  • Combining SQL with Keyword Queries
Consistent Updates
  • Atomic Transactions and Beyond
But Everything has to be in a Database!
  • Only If we Stick with Classic DB Assumptions
Relaxing DB Assumptions
  • Interoperable Query Processing
  • Extended Transaction Updates
Commodity DB Software
  • A Little Help is Still Good If it is Cheap
  • Internet Facilitates Software Distribution
  • Databases as Middleware
Chaps25.59
Concluding Remarks

Goals of Distributed DBS
  • Support User Distribution Across Multiple Sites
  • Provide Location Transparency
  • Provide Transaction Control Akin to Centralized Case
  • Minimize Communications Cost
Advantages of Distributed DBS
  • Transparency
  • Increased Reliability and Availability
  • Improved Performance
  • Easier Expansion
Chaps25.60