Distributed Streams - University of Warwick

CS346: Advanced Databases
Alexandra I. Cristea
A.I.Cristea@warwick.ac.uk
Distributed Databases, BASE, CAP & NoSQL
Outline
 What are distributed databases?
 Architectural choices
 ACID vs BASE
 Consistency, Availability, Partition tolerance: CAP
 NoSQL systems
Chapter: “Distributed Databases” in Elmasri and Navathe (chapter 25 in 6th Edition)
Why?
 As data gets larger, must move to distributed data management
 Tech companies (Google, Facebook etc.) rely on distributed data
Distributed Databases
 When data gets large and processing is slow, use distribution
– A distributed database (DDB) managed by a distributed DBMS
– Goal: split the processing into smaller pieces and spread them
 DDB technology combines databases with OS/Networks
– Manage concurrent access to replicated data
 DDB is quite different to e.g. the world-wide web
– Similarities: many machines, distributed around the world
– Different: each website is (mostly) independent of others
   Facebook and YouTube are managed independently
– However: many large websites use DDB technology
   Facebook can be seen as a massive distributed database
Distributed Databases: Pros and Cons
 DDB can be (in principle) more available
– If one machine fails, others can take over
 DDB can (in principle) be faster
– Parallelize computation, combine results
 DDB is (in principle) easier to expand
– Just add more machines/storage
 “In principle” isn’t always the case
– DDB is more complicated to manage
– Performance/availability may worsen in unpredictable ways
Additional functionality of DDB
The DDB has additional or expanded roles to perform:
 Keeping track of data distribution: where’s my data?
 Distributed query processing: break up a query into pieces
 Distributed transaction management: data items are distributed
 Replicated data management: keep distributed copies of the data consistent
 Distributed database recovery: manage machine failures
 Security: manage security of distributed data
 Distributed catalog management: keep the metadata
We saw some of these issues in Hadoop/MapReduce
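As an illustration of distributed query processing ("break up a query into pieces"), here is a minimal Python sketch of an average-salary query evaluated across three sites. The site names, the salary figures and the function names are made up for this example, and threads stand in for separate machines. Each site returns only a partial (sum, count), and a coordinator combines the partials.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each "site" holds a horizontal fragment of the salaries.
SITES = {
    "site_a": [30000, 42000],
    "site_b": [51000],
    "site_c": [28000, 39000, 45000],
}


def local_sum_count(salaries):
    """Sub-query run at one site: return a partial (sum, count)."""
    return sum(salaries), len(salaries)


def distributed_average(sites):
    """Coordinator: send the sub-query to every site, then combine the partials."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(local_sum_count, sites.values()))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


average_salary = distributed_average(SITES)
```

Shipping (sum, count) rather than a local average matters: averaging the per-site averages would give the wrong answer when sites hold different numbers of tuples.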
Distributed Architectures
 Many possible levels of sharing:
– Shared memory: multiple processors (cores) share disk, memory
– Shared disk: multiple cores share disk, but have separate memory
– Shared nothing: no common storage, communicate over network
 ‘Shared nothing’ is the model for large distributed systems
– Hadoop follows a shared nothing architecture
 Shared nothing pros and cons:
– Can be slower: network is slower than local disk (is it? fibre is fast)
– Easy to expand: add more machines to the network
– Allows fragmentation (sharding): breaking the database into pieces
Fragmentation and Replication
 How to split the data up among sites?
– Horizontal fragmentation: subset of tuples on each machine
   E.g. break up the EMPLOYEE relation by Dno
– Vertical fragmentation: different columns on each machine
   Name, Bdate, Address on one, Ssn, Salary, Dno on another
– Mixed: break up by both horizontal and vertical
 How to replicate data around the system?
– No replication: a unique copy exists
– Fully replicated: data is copied everywhere
– Partial replication: in between these two extremes
   E.g. HDFS, default number of replicas is 3
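The two fragmentation styles can be sketched in a few lines of Python. The column names follow the EMPLOYEE example above; the rows and the function names are invented for illustration. One assumption goes beyond the slide: the key Ssn is kept in both vertical fragments so that the original tuples can be rebuilt by a join.

```python
# Toy EMPLOYEE relation: column names from the slide, rows made up.
EMPLOYEE = [
    {"Ssn": "111", "Name": "Ann", "Bdate": "1980-01-01", "Address": "Leeds",
     "Salary": 30000, "Dno": 1},
    {"Ssn": "222", "Name": "Bob", "Bdate": "1975-06-30", "Address": "York",
     "Salary": 42000, "Dno": 2},
    {"Ssn": "333", "Name": "Cat", "Bdate": "1990-11-12", "Address": "Hull",
     "Salary": 51000, "Dno": 1},
]


def horizontal_fragments(rows, column):
    """Horizontal: whole tuples grouped by the value of one column."""
    fragments = {}
    for row in rows:
        fragments.setdefault(row[column], []).append(row)
    return fragments


def vertical_fragment(rows, columns, key="Ssn"):
    """Vertical: project onto some columns, always keeping the key (assumption)."""
    keep = [key] + [c for c in columns if c != key]
    return [{c: row[c] for c in keep} for row in rows]


def rejoin(left, right, key="Ssn"):
    """Rebuild full tuples by joining two vertical fragments on the key."""
    by_key = {row[key]: row for row in right}
    return [{**row, **by_key[row[key]]} for row in left]


by_dept = horizontal_fragments(EMPLOYEE, "Dno")              # one fragment per Dno
personal = vertical_fragment(EMPLOYEE, ["Name", "Bdate", "Address"])
payroll = vertical_fragment(EMPLOYEE, ["Salary", "Dno"])
rebuilt = rejoin(personal, payroll)
```

Mixed fragmentation would apply `vertical_fragment` to each of the lists in `by_dept`.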
ACID vs BASE systems
 Recall the ACID properties of transactions
– Atomicity, Consistency, Isolation, Durability
 Not every system requires this level of guarantee
– Can trade off guarantees for performance
 “BASE”: Basically Available, Soft-State, Eventually Consistent (coined by Eric Brewer, founder of Inktomi, 2000)
– A weaker set of requirements
– Drop consistency and isolation to improve availability, performance
– Suits distributed settings without much competition for resources
 ACID vs BASE is a spectrum of possible design points
– “Real internet systems are a mixture of ACID and BASE subsystems”
CAP concepts
 Consistency: all processes/transactions see the same data
– Equivalent to having a single, up-to-date copy of the data
– Not easy to provide, hence much effort on concurrency
 Availability: is the system up and responsive to requests?
– All processes can find some version of the data they need
– Formally: does every request receive a response (allowing fails)
 Partition-tolerance: what happens when the network breaks?
– Network partition: something breaks and the network divides
   E.g. a router fails/crashes: messages can’t traverse the router
– Does the system still operate even if messages are lost?
Points of Comparison
 Consistency: strong (ACID) or weak consistency (BASE)?
– Weak: processes can see operations in different orders
– Weak: synchronization points bring processes into agreement
 Eventual consistency: system eventually reaches a consistent state
– If no new updates are made to an item, then eventually all reads will give the same value
 Compared to ACID, the BASE approach:
– Focuses more on availability of resources
– Tolerates approximate answers rather than exact
– Is more aggressive (optimistic concurrency control)
– Aims to be simpler, faster
– Provides ‘best effort’ rather than guarantees
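A small Python sketch of eventual consistency follows. It assumes one particular reconciliation rule, last-writer-wins on a timestamp, with a pairwise state exchange (often called anti-entropy). The class and function names are made up for this example, and real systems use more careful schemes.

```python
class Replica:
    """One copy of the data; each key maps to (timestamp, value)."""

    def __init__(self, name):
        self.name = name
        self.store = {}

    def write(self, key, value, timestamp):
        # Last-writer-wins: keep the version with the newest timestamp.
        current = self.store.get(key)
        if current is None or timestamp > current[0]:
            self.store[key] = (timestamp, value)

    def read(self, key):
        entry = self.store.get(key)
        return None if entry is None else entry[1]


def anti_entropy(a, b):
    """Synchronization point: both replicas end up with the newest versions."""
    for key, (ts, value) in list(a.store.items()):
        b.write(key, value, ts)
    for key, (ts, value) in list(b.store.items()):
        a.write(key, value, ts)


r1, r2 = Replica("r1"), Replica("r2")
r1.write("x", "old", timestamp=1)
r2.write("x", "new", timestamp=2)   # a later update reaches only r2
stale_read = r1.read("x")           # r1 still answers with the older value
anti_entropy(r1, r2)                # no further updates: replicas now agree
```

Before the exchange the two replicas answer differently (weak consistency). Once updates stop and the replicas have synchronized, every read returns the same value.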
The CAP Conjecture / Theorem
 Brewer made a famous “CAP conjecture” in 2000
– Consistency, Availability, Partition Tolerance: pick any two
– I.e. it is impractical to build a distributed system with all three
 Lynch and Gilbert “proved” a CAP theorem in 2002
– For a specific set of distributed scenarios
 An example of a ‘pick two’ (from three) choice
– For university: Good grades, enough sleep or a social life
– For products: fast, good or cheap
 Last time: distributed databases, ACID vs. BASE, CAP
 Next time: CAP continuation; NoSQL
Consequences of CAP Theorem
[Figure: triangle with corners Consistency, Availability, Partition Tolerance]
Obtain different results from different choices:
 Forfeit partition tolerance (obtain consistency and availability)
– E.g. traditional centralized DBMS
 Forfeit availability (obtain partition tolerance and consistency)
– E.g. distributed databases, protocols based on majority agreement
 Forfeit consistency (obtain partition tolerance and availability)
– E.g. Emerging NoSQL systems
 These concepts cut across many aspects of computer science:
– The OS and network provide availability, but no consistency
– Databases are better at consistency than availability
– Distributed databases want both
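The last two choices can be contrasted in a toy Python model of three replicas during a network partition. This is a sketch under simplifying assumptions, not a real protocol: the "CP" and "AP" mode labels, the class and its methods are invented for illustration, and a write is applied instantly to every reachable replica.

```python
class Cluster:
    def __init__(self, nodes, mode):
        self.mode = mode                       # "CP" or "AP" (labels assumed here)
        self.data = {n: None for n in nodes}   # one replica value per node
        self.groups = [set(nodes)]             # a single group means no partition

    def partition(self, *groups):
        """Split the network: nodes can only reach others in their own group."""
        self.groups = [set(g) for g in groups]

    def reachable(self, node):
        return next(g for g in self.groups if node in g)

    def write(self, node, value):
        peers = self.reachable(node)
        has_majority = len(peers) > len(self.data) // 2
        if self.mode == "CP" and not has_majority:
            return False                       # refuse: stay consistent, lose availability
        for peer in peers:
            self.data[peer] = value            # accept: update every reachable replica
        return True


cp = Cluster("ABC", "CP")
cp.partition("A", "BC")
cp_minority_ok = cp.write("A", 1)   # minority side must refuse the write
cp_majority_ok = cp.write("B", 2)   # majority side can still make progress

ap = Cluster("ABC", "AP")
ap.partition("A", "BC")
ap_minority_ok = ap.write("A", 1)   # both sides accept writes...
ap_majority_ok = ap.write("B", 2)   # ...so the replicas now disagree
```

In the majority-agreement ("CP") run, node A stays unavailable for writes until the partition heals. In the "AP" run every request is answered, but A holds 1 while B and C hold 2, so the copies must be reconciled later.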
NoSQL systems
 NoSQL systems drop support for the full relational model
– Do not provide same level of reliability/availability
– Do not necessarily support rich languages like SQL
– Aim to have simpler design, better scaling via distribution
 Often support analysis via query language or MapReduce on top
– Systems primarily support data storage and retrieval
CS910 Foundations of Data Analytics
Types of NoSQL systems
 Key-value store: stores and retrieves data in the form (key, value)
– E.g. store demographic data (values) for each user (by key)
– Data is distributed, and replicated for resilience, e.g. Memcached
 Column store: stores data organized by column (instead of row)
– Allows faster access to particular entries when data is sparse
– Implemented in HBase (database component of Hadoop system)
 Document store: to store and retrieve document data
– E.g. to store information for very large websites (Amazon, eBay)
– Each “document” can be an arbitrary collection of information
– Examples include MongoDB and CouchDB (Apache Cassandra is usually classed as a wide-column store)
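To make the key-value idea concrete, here is a minimal in-memory Python sketch that distributes keys over several nodes by hashing and keeps a second copy on the next node. The class name, the placement rule and the `failed` parameter are assumptions made for this example and do not describe how any particular product works.

```python
import hashlib


class TinyKVStore:
    def __init__(self, num_nodes=4, replicas=2):
        self.nodes = [{} for _ in range(num_nodes)]   # each dict stands in for a machine
        self.replicas = replicas

    def owners(self, key):
        """Hash the key to pick a first node; replicas go on the following nodes."""
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        first = int.from_bytes(digest[:8], "big") % len(self.nodes)
        return [(first + i) % len(self.nodes) for i in range(self.replicas)]

    def put(self, key, value):
        for idx in self.owners(key):
            self.nodes[idx][key] = value

    def get(self, key, failed=()):
        """Read from the first owner that is not in the set of failed nodes."""
        for idx in self.owners(key):
            if idx not in failed:
                return self.nodes[idx].get(key)
        raise KeyError(f"no live replica holds {key!r}")


store = TinyKVStore()
store.put("user:42", {"age": 31, "city": "Coventry"})   # demographic data by user key
primary = store.owners("user:42")[0]
value_after_failure = store.get("user:42", failed={primary})
```

The read still succeeds when the first owner is marked as failed, because the second copy answers. That is the resilience the replication bullet refers to.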
NoSQL systems: pros and cons
 NoSQL systems are highly popular at the moment
– Scale to truly massive amounts of data
– Allow analytics on top via MapReduce/Hadoop
– Can be very fast to retrieve data
 But they also have limitations
– Systems still under development, hard to make use of
– Some quite primitive: just provide data storage/retrieval
– Currently have to write and debug code to implement applications
– Can be overkill when your data is not massive
Summary
 Motivations for Distributed Databases
 Architectural choices for distributed databases
 What is shared? How much replication?
 ACID/BASE (Basically Available, Soft-State, Eventually Consistent)
 Consistency, Availability, Partition tolerance: CAP
 Pick any two
 NoSQL systems: key-value, column, document store
Recommended reading: Brewer’s PODC’00 Keynote
Chapter: “Distributed Databases” in Elmasri and Navathe