Information Resources Management
April 17, 2001

Agenda
  Administrivia
  Database Architectures

Administrivia
  Homework #8

Database Architectures
  Centralized
  Client-Server
  Parallel - single site
  Distributed - multiple sites

Database Architectures
  [Diagram labels: Centralized, Client-Server, Distributed (Parallel); axes: Function, Data]

Centralized
  PC, Mini, or Mainframe
  Single Database
  Single Database Manager
  One or More Users
  Data and Function in One Place

Client-Server
  PCs to Mainframes to Minis
  PC to PC
  Mainframe to Mainframe
  Use Desktop Processing Power
  Better User Interface
  Greater Functionality
  Retain Centralized Control of Data

Client-Server: Basic Model
  [Diagram: Clients send a Request to the Server; the Server returns a Result]

Servers
  Supercomputer
  Mainframe
  Mini
  PC Server
  All retain all data

Client-Server Architecture
  [Diagram labels: Data, Function; Thin, Fat; Client (Front-End), Server (Back-End)]

Functionality
  Presentation
  I/O Processing
  Validation
  Business Rules
  Application Logic
  Data Management
  Validation
  Error Handling

“Thin” Client
  Presentation Services Only
  Accept Input
  Format Output
  Display
  Server does all processing

“Fat” Client
  Presentation
  Validation
  Application Logic - Programs
  Data Management
  Send SQL to Server
  Server is just DBMS

“In Between” Client
  Client
    Presentation
    Some Application Logic
  Server
    Some Application Logic
    Data Management and Services

Benefits of Client-Server
  Use Local Processing Power
  Better User Interface
  Some Functionality if System Down
  Use Sunk Costs of PCs
  Support Reengineering
  Support Intranets
  Flexibility, Scalability, Customizability

Challenges of Client-Server
  Cost of (Upgraded) PCs
  Network Reliance
  Distributing Application Updates
  Management of Complex System
  Problem Identification & Resolution
  Application Partitioning

Other Client-Server Architectures
  Traditional is Two-Tiered (client-server)
  Three-Tiered
    Client - Application Server - DB Server
    (PC - Mini - Mainframe)
    (PC - PC Server - Mainframe)
  Beyond Three
    PC - PC Server - Web Server - Mini - Mainframe

Client-Server vs. Distributed
  Client-Server: Application Distribution
  Distributed: Data Distribution
  Often, “client-server” is used to refer to application distribution, data distribution, or both.

Middleware
  What if
    Multiple databases (sources) need to be accessed from a single client?
    Different kinds of clients?
    Mix of clients and servers?
    Want to take advantage of existing base of applications (legacy systems)?

Middleware
  Fat Clients just send SQL transactions
  Other types of transactions may be needed based on the server (system)

Middleware
  Software that shields applications from the complexity of the operating environment.
  [Diagram: three Clients connect through Middleware to two (Legacy) Systems]

Types of Middleware
  Transaction Processing (TP) Monitor
  Database Middleware
  Remote Procedure Call (RPC)
  Message-Oriented Middleware (MOM)
  Object-Request Brokers (CORBA - ORB)

TP Monitor
  Synchronous - sender must wait
  Queuing
  Message Delivery
  Assured Delivery
  Either Direction

Database Middleware
  Variety of Clients/Platforms
  Variety of Servers/DBMSs/Platforms
  Specific to DB transactions (SQL)

Message-Oriented Middleware (MOM)
  Asynchronous - clients do not wait
  Queues & Queue Management/Recovery
  Message Delivery
  Assured Delivery
  Either Direction
  (like email or EDI, only for transactions)

Advantages of Middleware
  Leverage sunk costs (legacy systems)
  Reduce development cost
  Reduce development time
  Increase responsiveness
  Improve overall systems management
  Consolidate diffuse information

Challenges of Middleware
  Cost
  Session management - Transaction state
  Security
  Network reliance
  Diversity of systems - lack of standards
  Constant technology change
  Availability of talent
  Middleware Management

Parallel and Distributed
  Client-Server is an attempt to improve performance
  Reduce time to execute a transaction
  Parallel
  Reduce time to get the data
  Distributed

Parallel Systems
  Single site for data
  Very Large databases
  Operations performed simultaneously

Parallel Database Architectures
  Shared Memory
  Shared Disk
  Shared Nothing
  Hierarchical

Shared Memory
  [Diagram: processors (P) sharing one memory (M)]

Shared Memory
  Advantages
    Extremely efficient communications
  Disadvantages
    Max of 32/64 processors
    Bus becomes bottleneck

Shared Disk
  [Diagram: processors (P), each with its own memory (M)]

Shared Disk
  Advantages
    No bus bottleneck
    Fault tolerance provided
  Disadvantages
    Disk access becomes bottleneck

Shared Nothing
  [Diagram: processors (P), each with its own memory (M)]

Shared Nothing
  Advantages
    No disk bottleneck
    Highly scalable
  Disadvantages
    High communication overhead/cost
      Between processors
      To another processor’s data

Hierarchical
  [Diagram: processors (P) and memories (M)]

Hierarchical
  Advantages
    Best of all worlds
  Disadvantages
    Worst of all worlds
    Some high communication overhead/cost
      Between subsystems
    Complexity

Distributed Databases
  Client-Server - distribute functionality
  What about distributing data?

Distributed Databases Overview
  Distributed Storage
  Distributed Queries
  Distributed Transactions
  Multidatabase (Middleware)

Distributed Databases
  Multiple locations
  Single logical database
  Several physical databases
  Network connections

Advantages
  Sharing across locations
  Local control
  Availability

Challenges
  Development costs
  People & Equipment
  Testing
  Problem identification & resolution
  Technical expertise
  Network dependence
  Increased processing overhead

Distributed Data Storage
  Replication
  Fragmentation
  Both

Replication
  Data is repeated
  Spectrum of options available
    Temporary replication of specific rows
    Replicate infrequently changed data
    Replicate by site
      Central site - all / each local site their data only
    Full replication
      Everything everywhere

Concerns with Replication
  Availability needed
  Amount of parallelism in reads
  Overhead of updates
    Keeping replicas updated
    Conflicting updates

Fragmentation
  Partitioning
  Divide data into subsets based on need
  Have to be able to pull back together to get original tables

Fragmentation
  Horizontal
    by rows
    specified conditions
  Vertical
    by column
    each requires primary key (or created key)
  Mixed
    by row and column

Fragmentation & Replication
  Repeat as necessary:
    Replicate fragments
    Fragment replicas
  Don’t lose track of what you have and where it is!

Network Transparency
  Distributing data should not require that the user know where or how it’s been distributed.
  The database should be seen as a single entity no matter how fragmented and replicated it becomes.

Network Transparency
  Some DBMSs are starting to provide this level of functionality so transparency exists even at the program level, but in many cases this “transparency” must be programmed into the applications.
  It must always be designed into the database.

Distributed Queries
  How do you query data that is everywhere?
  Efficiency vs. Overhead
    Splitting the query apart
    Keeping track of the data/locations
    Making sure everything gets executed
    Putting the results back together
    Generating network traffic
    Handling partial results

Distributed Queries
  Full replication can avoid the overhead
    Huge increase in update overhead
    Parallel execution no longer possible
    Additional costs of replication

Example
  5 sites - NY, Pgh, Chicago, Dallas, Los Angeles
  Data fragmented by site - no replication
  Query (in Pgh):
    SELECT Name, MAX(Salary) FROM Employee

Option 1 - High Bandwidth
  1. Have all sites send their full employee tables to Pgh.
  2. Build a temporary employee table.
  3. Run the query against this table.

Option 2 - Not So High Bandwidth
  1. Examine the query and determine it can be run separately at each location and the results combined.
  2. Submit just the query to each location.
  3. Wait for the results from each city.
  4. As results return, build a temporary table (5 rows only).
  5. Find the max using the temporary table.

Distributed Transactions
  Transaction Types
  Coordinators
  Commit Protocols
  Concurrency Controls
  Deadlocks

Transaction Types
  Local - transaction only needs local data
  Global - transaction uses non-local data
    My global becomes someone else’s local
  Either type of transaction must still have ACID properties - global is the concern

System Structure
  Things to do:
  1. Process local transactions (transaction manager)
  2. Process and track global transactions (transaction coordinator)

Global Processing
  1. Recognize as global
  2. Break up transaction
  3. Distribute pieces
  4. Assemble results
  5. Coordinate termination
  6. Handle problems

Coordinator of Coordinators
  Coordinate among sites
  Detect problems
  Attempt to fix
  Share status with others

Coordinator Failure
  Backup Coordinator
    receives all messages - maintains state
    monitors coordinator
    automatically takes over if coordinator down
    avoids delays - increases overhead
  Election
    highest pre-assigned number

Commit Protocols
  Two-Phase
  Three-Phase
  All sites must commit or all sites have to rollback
  Replicated data only

Two-Phase Commit
  Phase 1
    Send PREPARE to all sites
    Sites respond READY or ABORT
  Phase 2
    If all sites READY, COMMIT locally - Send COMMITs
    If not READY or time expires, ROLLBACK locally - Send ROLLBACK

Two-Phase Commit
  [Diagram sequence: a Coordinator and three Sites]
  All sites ready
    Site requests commit
    Phase 1: Send PREPARE - all sites
    Phase 1: Sites respond READY
    Phase 2: COMMIT locally
    Phase 2: Send COMMIT - all sites
  A site is not ready
    Phase 1: Site responds ABORT or does not respond
    Phase 2: ROLLBACK locally
    Phase 2: Send ROLLBACK - all sites

Site Failure - Recovery
  COMMIT and ROLLBACK as normal
  If READY only
    Check with coordinator or other sites
    Either COMMIT or ROLLBACK
    If no one found, ROLLBACK

Coordinator Failure
  Ask the sites
    If one has COMMIT, then REDO
    If one has ROLLBACK, then UNDO
    If one doesn’t have READY, UNDO
  If all READY only
    Coordinator must decide
    Sites must wait and locks are held
    “Blocking” occurs

Three-Phase Commit
  Phase 1
    Send PREPARE
    Sites respond READY or ABORT
  Phase 2
    If all sites READY, send PRECOMMIT
    Else, ROLLBACK
    Sites must ACKNOWLEDGE
  Phase 3
    If at least K sites ACKNOWLEDGE, send COMMIT

Coordinator Failure
  Three-Phase Commit prevents blocking
  If coordinator fails
    New coordinator is selected
    Sites queried to determine status
    New coordinator resumes

Network Partitioning
  Network split creates two separate networks
  Each “half” selects a coordinator
  Coordinators make independent decisions
  Result could be different decisions
  Resolution of network problem may create need to resolve database problems

Concurrency Control
  Single Lock Manager
  Multiple Lock Managers

Single Lock Manager
  One site for all locking
  All other sites must go to it
  Can read from anywhere
  Updates must be to all copies
  Advantages: Simple, Easy deadlock detection
  Disadvantages: Bottleneck, Vulnerability

Simple Multiple Lock Mgrs
  Each site locks a unique partition of the data
    non-replicated data
  Advantages: Fairly simple, Reduced bottlenecks
  Disadvantages: Complicated deadlock detection

Majority Protocol
  Each site locks its own data
    replication possible
  Request owner for lock on data that isn’t local
  When multiple owners, n/2 + 1 (majority) must provide the lock
  Advantages: No bottlenecks
  Disadvantages: More messages sent, Complicated deadlock detection, More deadlocks (each gets 1/2)

Biased Protocol
  Reduced form of Majority Protocol
  For a READ, only need any single lock
  For a WRITE, need all locks
  Advantages: No bottlenecks, Reduced traffic
  Disadvantages: Update traffic, Deadlocks

Primary Copy
  Site designated to hold “primary” copy
    Multiple sites
    Replicated Data
  All locks through that site
  Advantages: Fairly simple, Reduced bottlenecks
  Disadvantages: Vulnerability, Complicated deadlock detection

Other Than Locking
  Timestamps
    Centralized generation
    Local generation
  Timestamp tests determine ability to read or write

Deadlocks & Distributed Data
  Centralized
    One Site
  Distributed
  Centralized - same advantages and disadvantages as other centralized control (database or locking)

Distributed Deadlock Detection
  Each site tracks all transactions accessing its own data
  Dummy transaction for transactions that originated here but are executing elsewhere
  If deadlock found that includes dummy transaction
    Must send deadlock information to other sites
    They check for deadlock
    May have to pass on to another site

Homework #9
  Continuing with the Carnegie Library
  Client/Server
  Distributed Database
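
Distributed Deadlock Detection - A Sketch
  The distributed deadlock detection steps above can be illustrated with a small wait-for graph. This is a minimal sketch, not a definitive implementation. It assumes each site keeps its graph as a dictionary where an edge A -> B means "A waits for B", that the slide's dummy transaction is a single node (called T_ext here), and that a site forwards only its real edges to the next site. The names T_ext, find_cycle, without_ext, check_site, and merge are hypothetical and chosen for illustration.

```python
from typing import Dict, List, Optional, Set

EXT = "T_ext"  # hypothetical name for the slide's "dummy transaction"
Graph = Dict[str, Set[str]]  # edge A -> B means "A waits for B"


def find_cycle(graph: Graph) -> Optional[List[str]]:
    """Return the nodes of one cycle in the wait-for graph, or None."""
    visiting: List[str] = []  # current depth-first path
    done: Set[str] = set()    # nodes already proven cycle-free

    def visit(node: str) -> Optional[List[str]]:
        if node in visiting:
            return visiting[visiting.index(node):]
        if node in done:
            return None
        visiting.append(node)
        for nxt in sorted(graph.get(node, ())):
            cycle = visit(nxt)
            if cycle:
                return cycle
        visiting.pop()
        done.add(node)
        return None

    for start in sorted(graph):
        cycle = visit(start)
        if cycle:
            return cycle
    return None


def without_ext(graph: Graph) -> Graph:
    """Drop the dummy transaction and every edge that touches it."""
    return {a: {b for b in bs if b != EXT} for a, bs in graph.items() if a != EXT}


def check_site(graph: Graph) -> str:
    """Classify a site's graph: real deadlock, possible deadlock, or nothing."""
    if find_cycle(without_ext(graph)):
        return "deadlock"  # cycle among real transactions only
    if find_cycle(graph):
        return "forward"   # cycle passes through the dummy: ask another site
    return "none"


def merge(local: Graph, received: Graph) -> Graph:
    """Combine a site's own graph with edges received from another site."""
    merged = {node: set(edges) for node, edges in local.items()}
    for node, edges in received.items():
        merged.setdefault(node, set()).update(edges)
    return merged


# Site A: remote T3 waits for T1, T1 waits for T2, T2 is waiting elsewhere.
site_a: Graph = {EXT: {"T3"}, "T3": {"T1"}, "T1": {"T2"}, "T2": {EXT}}
# Site B: remote T2 waits for T3, T3 is waiting elsewhere.
site_b: Graph = {EXT: {"T2"}, "T2": {"T3"}, "T3": {EXT}}

print(check_site(site_a))                              # cycle includes the dummy
print(check_site(merge(site_b, without_ext(site_a))))  # after forwarding A -> B
```

  In this example neither site can decide alone, because each local cycle runs through the dummy transaction. Once site A's real edges reach site B, the merged graph contains the cycle T1 -> T2 -> T3 -> T1 with no dummy in it, so site B can declare the deadlock. With more sites, the same check-and-forward step would repeat from site to site.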