Database paradigms for large-scale data processing Stratis D. Viglas School of Informatics

Database paradigms for large-scale data
processing
Stratis D. Viglas
School of Informatics
University of Edinburgh
15 March 2010
Stratis D. Viglas
slide 1 of 16
www.inf.ed.ac.uk
What is a database?
[Figure. Source: top 15 web definitions (by Google); words with frequency > 1, excluding “database”]
The (R)DBMS view of things
• Data is organised in tables of rows
• Each table has a schema
• All rows of the table conform to the schema
• Queries expressed in SQL
• Declarative query language, procedural processing
• Query optimisation
• Complete control of storage
• What is stored where and how
• How it is accessed and in what mode
• Various abstractions to help with design and implementation
• Operators, iterators (query engine)
• Transactions, locks (persistence, concurrency)
• Mature technology, 30+ years in the making
Seemed to be enough for a while...
Parallel databases
• Seminal work in the area
• The Grace database machine (partition-based algorithms and
hash joins)
• The Gamma parallel DBMS (parallelisation of operator data flow)
• Both effectively assumed a shared-nothing architecture
• Conceptually: each node in the system with its own memory
and disk
• (Gamma) Two key operations to parallelise the flow
• Split to direct records to nodes
• Merge to combine records across nodes
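The two operations can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Gamma's implementation: the function names `split` and `merge`, the hash-based routing policy, and the `id` field are all hypothetical, and the "nodes" are plain in-memory lists.

```python
from typing import Callable, Hashable, Iterable, List


def split(records: Iterable[dict], n_nodes: int,
          key: Callable[[dict], Hashable]) -> List[List[dict]]:
    """Direct each record to one of n_nodes (assumed policy: hash of its key)."""
    nodes: List[List[dict]] = [[] for _ in range(n_nodes)]
    for record in records:
        nodes[hash(key(record)) % n_nodes].append(record)
    return nodes


def merge(streams: Iterable[Iterable[dict]]) -> List[dict]:
    """Combine the record streams of all nodes into a single stream."""
    merged: List[dict] = []
    for stream in streams:
        merged.extend(stream)
    return merged


# Hypothetical data: ten records routed to three nodes, then recombined.
records = [{"id": i} for i in range(10)]
per_node = split(records, 3, key=lambda r: r["id"])
combined = merge(per_node)
```

Any operator can sit between the two calls and work on its own node's records only, which is what makes the shared-nothing assumption workable.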
The parallelism success story
• DBMSs are one of the few successful applications of
parallelism
• Teradata, Tandem vs. Thinking Machines, KSR, ...
• Every major DBMS vendor has some parallel server
• Workstation manufacturers now depend on parallel DB server
sales
• Reasons for success
• Bulk-processing (partition parallelism)
• Natural pipelining
• Inexpensive hardware can do the trick
• Users/app-programmers do not need to think in parallel
The “Holy Grail” of parallel DB performance
Speed-up: more resources means proportionally less time for a given amount of data
[Plot: throughput against level of parallelism, with the ideal line marked]
Scale-up: if resources increase in proportion to the increase in data size, time is constant
[Plot: response against level of parallelism, with the ideal line marked]
Both were largely achieved by Gamma
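The two metrics can be written down as simple ratios. The sketch below uses the usual textbook formulation (elapsed time on the small system divided by elapsed time on the large one); the function names and all timings are hypothetical and are not measurements from Gamma.

```python
def speed_up(time_small_system: float, time_large_system: float) -> float:
    """Same data, more resources. Ideal value equals the resource ratio."""
    return time_small_system / time_large_system


def scale_up(time_small_data_small_system: float,
             time_large_data_large_system: float) -> float:
    """Data and resources grow together. Ideal value is 1.0 (constant time)."""
    return time_small_data_small_system / time_large_data_large_system


# Hypothetical timings in seconds.
# 1 node takes 100 s, 4 nodes take 25 s on the same data: linear speed-up.
linear_speed_up = speed_up(100.0, 25.0)
# 4x the data on 4x the nodes still takes 100 s: ideal scale-up.
ideal_scale_up = scale_up(100.0, 100.0)
```

A speed-up below the resource ratio, or a scale-up below 1.0, indicates overhead such as start-up cost, interference, or skew.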
Automatic data partitioning
Range
• Good for equi-joins
• Range-queries
• Good for aggregation
Hash
• Good for equi-joins
• No range-queries
• Problematic with skew
Round-robin
• Indifferent for equi-joins
• Range-queries complicated
• Load-balanced
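A minimal sketch of the three partitioning schemes follows. The function names, the boundary values, and the sample keys are assumptions made for illustration; a real system partitions records on disk across nodes rather than integers in lists.

```python
import bisect
from typing import List


def range_partition(keys: List[int], boundaries: List[int]) -> List[List[int]]:
    """Partition i holds the keys between boundary i-1 and boundary i."""
    parts: List[List[int]] = [[] for _ in range(len(boundaries) + 1)]
    for k in keys:
        parts[bisect.bisect_right(boundaries, k)].append(k)
    return parts


def hash_partition(keys: List[int], n: int) -> List[List[int]]:
    """Equal keys always land together, but neighbouring keys are scattered."""
    parts: List[List[int]] = [[] for _ in range(n)]
    for k in keys:
        parts[hash(k) % n].append(k)
    return parts


def round_robin_partition(keys: List[int], n: int) -> List[List[int]]:
    """Placement ignores the key, so partition sizes differ by at most one."""
    parts: List[List[int]] = [[] for _ in range(n)]
    for i, k in enumerate(keys):
        parts[i % n].append(k)
    return parts


keys = [1, 5, 12, 18, 25, 7]
by_range = range_partition(keys, [10, 20])
by_hash = hash_partition(keys, 3)
by_round_robin = round_robin_partition(keys, 3)

# Skew: if every record carries the same key, hashing fills a single partition.
skewed = hash_partition([42] * 6, 3)
```

With range partitioning, a query for keys between 10 and 20 touches one partition only; with hashing or round-robin it has to visit all of them.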
Dataflow parallelisation: split and merge
[Figure: a sequential plan in which Join consumes Scan A and Scan B to produce C, and its parallelised counterpart in which several Join instances consume scans of the partitions A1, A2, A3 and B1, B2 to produce C]
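One way to parallelise a join with split and merge is sketched below: both inputs are split on the join key so that matching records meet in the same partition, each partition pair is joined locally, and the results are merged. This is an assumption-laden illustration and does not reproduce the exact plan in the figure; the function names, the hash-based split, and the sample relations are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

Row = Dict[str, object]


def split_by_key(records: List[Row], n: int, key: str) -> List[List[Row]]:
    """Split: route each record to a partition chosen by hashing its join key."""
    parts: List[List[Row]] = [[] for _ in range(n)]
    for r in records:
        parts[hash(r[key]) % n].append(r)
    return parts


def local_hash_join(a_part: List[Row], b_part: List[Row], key: str) -> List[Row]:
    """Join one pair of partitions: build a hash table on A, probe it with B."""
    table = defaultdict(list)
    for ra in a_part:
        table[ra[key]].append(ra)
    return [{**ra, **rb} for rb in b_part for ra in table.get(rb[key], [])]


def parallel_join(a: List[Row], b: List[Row], n: int, key: str) -> List[Row]:
    """Split both inputs, join partition pairs, merge the partial results."""
    out: List[Row] = []
    # Each iteration is independent and could run on its own node.
    for a_part, b_part in zip(split_by_key(a, n, key), split_by_key(b, n, key)):
        out.extend(local_hash_join(a_part, b_part, key))
    return out


A = [{"k": 1, "a": "x"}, {"k": 2, "a": "y"}, {"k": 3, "a": "z"}]
B = [{"k": 2, "b": "p"}, {"k": 3, "b": "q"}, {"k": 3, "b": "r"}, {"k": 4, "b": "s"}]
joined = parallel_join(A, B, 3, "k")
```

The loop runs sequentially here; the point is that no iteration needs data held by another.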
Data warehouses
• First problem of (R)DBMSs: large-scale analytics
• What if you don’t want to filter?
• Worse yet, what if you don’t necessarily know what you’re
looking for?
• Type of processing for which SQL was not (necessarily)
designed
• Complex interactions between data
• Data mining, association rules
• Normalisation not a good thing
• Different way of organising data, different query operations
Typical organisation
[Figure: operational systems (transactional DBs) and external data pass through transformation with DBMS tools into a data warehouse organised as a star schema (a FACTS table with dimensions); data marts and the warehouse feed reporting/analysis/mining and analytics apps]
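A star schema can be illustrated with Python's built-in `sqlite3` module. Everything in this sketch is hypothetical (table names, columns, and data); it only shows the shape of the design: one fact table referencing small dimension tables, queried by joining and aggregating rather than filtering.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables describe the "who/what/where" of each fact.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, city TEXT);
    -- The fact table holds the measurements, keyed by the dimensions.
    CREATE TABLE fact_sales  (product_id INTEGER, store_id INTEGER, amount REAL);

    INSERT INTO dim_product VALUES (1, 'books'), (2, 'music');
    INSERT INTO dim_store   VALUES (1, 'Edinburgh'), (2, 'Glasgow');
    INSERT INTO fact_sales  VALUES (1, 1, 10.0), (1, 1, 5.0), (2, 1, 7.0), (1, 2, 3.0);
""")

# Analytical query: total sales per category and city, touching every fact row.
rows = conn.execute("""
    SELECT p.category, s.city, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_store   s ON s.store_id   = f.store_id
    GROUP BY p.category, s.city
    ORDER BY p.category, s.city
""").fetchall()
conn.close()
```

Note that the dimension attributes are deliberately not normalised further, in line with the point above that normalisation is not a good thing for this workload.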
Storage models
• Horizontal vs. vertical partitioning
• Fixed relational schema does not mean data should be stored
in rows
• Or, a table per file
• Column-oriented storage
• Turn storage approach on its head
• Column per file
• Huge I/O savings in highly projective queries
• Efficient main-memory evaluation algorithms
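The I/O argument for column-oriented storage can be made concrete with a toy comparison. The sketch assumes a hypothetical four-attribute table and counts "fields read" as a stand-in for I/O; real systems read pages, not individual values, so the numbers are illustrative only.

```python
# Row layout: each tuple is stored contiguously.
rows = [
    (1, "Ada", 36, "Edinburgh"),
    (2, "Bob", 41, "Glasgow"),
    (3, "Cyd", 29, "Leeds"),
]
names = ("id", "name", "age", "city")

# Column layout: one list (think: one file) per attribute.
columns = {n: [row[i] for row in rows] for i, n in enumerate(names)}


def project_from_rows(rows, index):
    """Projecting one attribute still fetches every field of every row."""
    values, fields_read = [], 0
    for row in rows:
        fields_read += len(row)
        values.append(row[index])
    return values, fields_read


def project_from_columns(columns, name):
    """Projecting one attribute reads only that attribute's column."""
    values = list(columns[name])
    return values, len(values)


ages_row, read_row = project_from_rows(rows, names.index("age"))
ages_col, read_col = project_from_columns(columns, "age")
```

Both layouts return the same answer; the column layout reads a quarter of the fields here, and the gap widens as the table gets wider and the query more projective.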
Horizontal and vertical partitioning
[Figure: a relational table stored as a file of disk pages, split by rows under horizontal partitioning and by columns under vertical partitioning]
SQL vs. NoSQL
• What if some of the fundamental database assumptions are
dropped?
• Lack of central control
• No schema – or a rudimentary one (e.g., key-value stores, RDF
triples)
• Only eventual data consistency
• Who needs ACID anyway if we operate in batch?
• Functional over declarative
• Generality through simplicity over expressiveness through
optimisation
Disclaimer
I’m quite skeptical of why we need something new to address these
issues – but I’m also quite biased
Map/Reduce
• Input is a distributed big table of key-value records
• list(⟨k, v⟩)
• Two basic functions
• Map: apply a mapping function to the key of each record,
mapping it to some other key, then group the results according
to key values
• map(list(⟨k, v⟩)) → list(⟨k1, list(v1)⟩)
• Reduce: for each group, reduce its elements to a single value
• reduce(list(⟨k1, list(v1)⟩)) → list(v2)
• Massive parallelisation: each map and reduce application in
parallel
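The two functions can be sketched with a word count, the customary example. This is a sequential, single-process illustration under stated assumptions: the names `map_phase`, `reduce_phase`, `emit_words`, and `count` are hypothetical, the input records are made up, and here each reduced value is a (word, total) pair. In a real deployment each map and reduce application would run in parallel on different nodes.

```python
from collections import defaultdict
from typing import Callable, Iterable, List, Tuple


def map_phase(records: Iterable[Tuple[object, object]],
              map_fn: Callable) -> List[Tuple[object, list]]:
    """Apply map_fn to each record and group the emitted values by new key."""
    groups = defaultdict(list)
    for k, v in records:
        for k1, v1 in map_fn(k, v):
            groups[k1].append(v1)
    return list(groups.items())


def reduce_phase(groups: Iterable[Tuple[object, list]],
                 reduce_fn: Callable) -> list:
    """Reduce each group of values to a single value."""
    return [reduce_fn(k1, values) for k1, values in groups]


def emit_words(doc_id, text):
    # The document id is replaced by a new key: the word itself.
    return [(word, 1) for word in text.split()]


def count(word, ones):
    return (word, sum(ones))


documents = [(1, "a b a"), (2, "b c")]
grouped = map_phase(documents, emit_words)
word_counts = reduce_phase(grouped, count)
```

The grouping step inside `map_phase` plays the role of redistributing intermediate results so that all values for one key reach the same reducer.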
Map/Reduce dataflow
[Figure: Input → Split → Map 1 ... Map N → intermediate results Int 1 ... Int N → Merge / Sort / Redistribute → Reduce 1 ... Reduce N → Output]
Looks awfully familiar, doesn’t it?
The need for scale
• SQL DBs for vertical scaling
• NoSQL DBs for horizontal scaling
• SQL for filtering
• Map/Reduce for large-scale analytics
Just one question
Since the goals are common and the dataflow paradigms similar,
why can’t we all just get along?