Carey_Talk

advertisement
Data Management Research Forecast:
Hot With an Increasing
Chance of Clouds...
Michael Carey
Information Systems Group
CS Department
UC Irvine
Cloud-Related DB Bandwagons
• MapReduce and Hadoop
– Parallel programming for dummies
– But now Hive, Pig, Scope, Jaql, …
 MapReduce is the new runtime
• DFSs and HDFS (and CSS)
– Scalable, self-managed, Really Big Files
– But now BigTable, HBase, …
 HDFS (or CSS) is the new file storage
• Key-value stores
– Charter members of the “NoSQL movement”
– Includes S3, Dynamo, HBase, Cassandra, …
 Key-value stores are the new record managers
1
Perhaps Our Brains Are Clouded!
Thou Shalt
MapReduce!
Thou Shalt
Worship DFS
& BigTable
• Some of the possible symptoms include…
– A number of us are focused on tweaking Hadoop
– Most of us are compiling our declarative data analysis
languages into Hadoop jobs (map/reduce/combine)
– Some of us are building “cloud database systems” as a
layer on top of “givens” such as HDFS
– Some other possible signs of “cloud dementia” include…
• We keep forgetting where we put our schemas
• We don’t remember working on parallel DBs in the 80’s
• We’re about to learn some hard lessons all over again…
2
DB-ers: Let’s Do This Stuff “Right”!
• In my opinion:
– Yes, the OS/DS folks out-scaled us (oops!)
– But, we’d be “nuts” to simply build on their
current (i.e., first-generation) foundation
• Let’s bring our DB experience to the party…!
– Recall important past architectural lessons
• I’ll give some quick examples of this…
– Plus, it can be “windy” in the stratosphere
• I’ll give one quick example of this too…
3
Current Data Layers
Cloud DBMS?
Some data model?
Some query language?
HBase
Sets of records/columns
Key-based retrieval
Range partitioned
HDFS
Ordered byte streams
Append only
Randomly partitioned
4
…
…
5
6
7
Let’s Do This Stuff “Right”! (cont.)
• Identify the lessons, then do it “right”
– Open-source S/W on commodity H/W
– Non-monolithic software components
– Equal opportunity data access (external sources)
– Fault-tolerant query execution (for data analysis)
– Little pre-planning or DBA-type work required
– Tolerant of flexible / nested / absent schemas
– Types and declarative languages (duh…! )
8
What If We’d Meant To Do This?
• What is the “right” basis for analyzing
and managing the data of the future?
– Storage and data distribution layers?
– Query runtime layer (and division of labor)?
– ….
• Let’s explore how to build new information
management systems to live in the cloud that…
–
–
–
–
–
–
Scale to thousands of nodes (and beyond)
Make effective use of shared (and elastic) resources
Execute queries in the face of partial failures
Seamlessly support external data access
Don’t require five-star wizard administrators
….
9
In Short…
• Ask not what cloud software can do for you, but
what you can do for cloud software…!
– We are, after all, the data experts!
– We just have to remember our roots and bring our
experiences and past lessons to bear on cloud issues
• We’re asking this question at UCI (and UCSD/UCR):
– ASTERIX: A new data platform to ingest, store, manage,
index, query, analyze, and publish vast quantities of
semi-structured information
– Hyracks: A new foundation for scalable data analysis –
both a runtime system for ASTERIX and an extensible
parallel data analysis platform in its own right
– Open-source sharing planned for both (eventually)
10
The ASTERIX Project
• Semistructured data management
Semistructured
Data Management
– Core work exists
– XML & XQuery, JSON, …
– Time to parallelize and scale out
• Parallel database systems
– Research quiesced in mid-1990’s Parallel Database
Systems
– Renewed industrial interest
– Time to scale up and de-schema-tize
Data-Intensive
Computing
• Data-intensive computing
– MapReduce and Hadoop quite popular
– Language efforts even more popular (Pig, Hive, Jaql, …)
– Ripe for parallel DB ideas (e.g., for query processing)
and support for stored, indexed data sets
11
ASTERIX Project Overview
Data loads & feeds
from external sources
(XML, JSON, …)
AQL queries &
scripting requests
and programs
Data publishing
to external
sources and apps
Hi-Speed Interconnect
CPU(s)
CPU(s)
CPU(s)
Main
Memory
Main
Memory
Main
Memory
Disk
Disk
Disk
ADM
Data
ADM
Data
ADM
Data
ASTERIX Goal:
To ingest, digest,
persist, index,
manage, query,
analyze, and
publish massive
quantities of
semistructured
information…
(ADM =
ASTERIX
Data Model;
AQL =
ASTERIX
Query
Language)
12
ASTERIX User Model
• Data model (ADM) with
open and closed types
• Query language (AQL)
for nested and semistructured querying
for $user in dataset(‘User’)
where some $i in $user.interests
satisfies $i =“movies”
return {“name”: $user.name };
• Support for both stored
and external datasets
13
Hyracks Runtime
• Partitioned-parallel platform for data-intensive computing
• Job = dataflow DAG of operators and connectors
– Operators consume/produce partitions of data
– Connectors repartition/route data between operators
• Hyracks vs. the “competition”
– Based on time-tested parallel database principles
– vs. Hadoop: More flexible model, less “pessimistic”implementation
– vs. Dryad: Supports data as a first-class citizen
14
ASTERIX Research Issue Sampler
• Semistructured data modeling
– Open/closed types, type evolution, relationships, ….
– Efficient physical storage scheme(s)
• Scalable storage and indexing
– Self-managing scalable partitioned datasets
– Ditto for indexes (hash, range, spatial, fuzzy; combos)
• Large scale parallel query processing
–
–
–
–
Division of labor between compiler and runtime
Query plan decision-making timing and basis
Model-independent complex object algebra (AQUA)
Fuzzy matching as well as exact-match queries
• Multiuser workload management (scheduling)
– Uniformly cited: Facebook, Yahoo!, eBay, Teradata, ….
15
Closing Thoughts
• IMO, this is the most exciting time to be a data
management researcher in the last 20-30 years!
– Data coming out our ears (and other openings) thanks to
WWW, Twitter, FaceBook, ... (so-called “data exhaust”)
– Data models and query languages totally up for grabs
(witness the “NoSQL movement”)
– System layers and architectures also up for grabs ()
• Many interesting challenges to Cloud our thinking
–
–
–
–
Opportunities for DB, OS, and PL research to converge
Important to learn from the past as we plan for the future
Resource management at scale is one of the biggies
Consistency models / programming models are another
16
Download