Big Data Building Blocks

Big Data: Analytics Platforms
Donald Kossmann
Systems Group, ETH Zurich
http://systems.ethz.ch
1
Why Big Data?
• because bigger is smarter
– answer tough questions
• because we can
– push the limits and good things will happen
2
bigger = smarter?
• Yes!
– tolerate errors
– discover the long tail and corner cases
– machine learning works much better
• But!
– more data, more error (e.g., semantic heterogeneity)
– with enough data you can prove anything
– still need humans to ask right questions
4
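The "with enough data you can prove anything" point can be illustrated with a small simulation. This is a minimal sketch, not taken from the talk: the function names, the sample size of 30, and the use of Gaussian noise are all assumptions. Every feature is pure noise, unrelated to the target, yet the strongest correlation found by chance can only grow as more features are examined.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_chance_correlation(n_features, n_samples=30, seed=0):
    """Largest |correlation| between a random target and n_features
    random features. All values are noise (hypothetical setup)."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_features):
        # Each feature is drawn independently of the target.
        feature = [rng.gauss(0, 1) for _ in range(n_samples)]
        best = max(best, abs(pearson(feature, target)))
    return best


if __name__ == "__main__":
    # Same seed, so each larger run extends the smaller one.
    for k in (10, 100, 1000):
        print(k, round(strongest_chance_correlation(k), 3))
```

Because the seed is fixed, the run with 1000 features contains the 10-feature run as a prefix, so its maximum cannot be smaller. A human still has to decide whether a reported correlation answers a sensible question.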
Fundamental Problem of Big Data
• There is no ground truth
– gets more complicated with self-fulfilling prophecies
• e.g., stock market predictions change behavior of people
• e.g., Web search engines determine behavior of people
• Hard to debug: takes human out of the loop
– Example: How to play lottery in Napoli
• Step 1: You visit “oracles” who predict numbers to play
• Step 2: You visit “interpreters” who explain predictions
• Step 3: After you lose, “analysts” tell you that the “oracles” and “interpreters” were right and that it was your fault.
– [Luciano de Crescenzo: Thus Spake Bellavista]
6
Why Big Data?
• because bigger is smarter
– answer tough questions
• because we can
– push the limits and good things will happen
7
Because we can… Really?
• Yes!
– all data is digitally born
– storage capacity is increasing
– counting is embarrassingly parallel
• But,
– data grows faster than energy on chip
– value / cost tradeoff unknown
– ownership of data unclear (aggregate vs. individual)
• I believe that all these “buts” can be addressed
9
Utility & Cost Functions of Data
[Figure: two panels, Utility vs. Noise / Error and Cost vs. Noise / Error]
10
Utility & Cost Functions of Data
[Figure: Utility and Cost plotted against Noise / Error, with curated, random, and malicious data marked on each panel]
11
Best Utility/Cost Tradeoff
[Figure: Utility and Cost plotted against Noise / Error, with malicious data marked on each panel]
12
What is good enough?
[Figure: Utility and Cost plotted against Noise / Error, with curated data marked on each panel]
13
What about platforms?
• Relational Databases
– great for 20% of the data
– not great for 80% of the data
• Hadoop
– great for nothing
– good enough for (almost) everything (if tweaked)
14
Why is Hadoop so popular?
• availability: open source and free
• proven technology: nothing new & simple
• works for all data and queries
• branding: the big guys use it
• it has the right abstractions
– MR abstracts “counting” (= machine learning)
• it is an ecosystem; it is NOT a platform
– HDFS, HBase, Hive, Pig, Zookeeper, SOLR, Mahout, …
– relational database systems
– turned into a platform depending on app / problem
15
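The claim that MR abstracts “counting” can be sketched in a few lines. This is an illustrative, single-process sketch and not Hadoop's API: `map_phase`, `shuffle`, `reduce_phase`, and `map_reduce` are hypothetical names chosen here. It shows why counting parallelizes so easily: each record is mapped independently, and partial results combine by addition.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map: emit one (key, 1) pair per word in a record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: fold all values for one key into a single count."""
    return key, sum(values)


def map_reduce(records):
    # Each record is mapped independently, so this step could run on
    # separate machines; here it is a plain loop for illustration.
    mapped = chain.from_iterable(map_phase(r) for r in records)
    return dict(reduce_phase(k, vs) for k, vs in shuffle(mapped).items())


if __name__ == "__main__":
    logs = ["GET /search", "GET /book", "POST /book"]  # made-up records
    print(map_reduce(logs))
```

A real framework adds what this sketch omits: distribution of the map and reduce tasks, data locality, and fault tolerance.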
Example: Amadeus Log Service
• HDFS for compressed logs
• HBase to index by timestamp and session id
• SOLR for full text search
• Hadoop (MR) for usage stats & disasters
• Oracle to store meta-data (e.g., user information)
• Disclaimer: under construction & evaluation!!!
– current production system is proprietary
16
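The division of labor in the log service can be sketched as one facade over several stores. Everything below is an assumption made for illustration: `LogService` and its methods are hypothetical, and each store is an in-memory stand-in rather than a real HDFS, HBase, or SOLR client. The MR usage statistics and the Oracle metadata store are left out.

```python
import gzip
from collections import defaultdict


class LogService:
    """Hypothetical facade; every store is an in-memory stand-in."""

    def __init__(self):
        self.archive = []                    # stand-in for HDFS (compressed logs)
        self.by_session = defaultdict(list)  # stand-in for an HBase index
        self.by_term = defaultdict(set)      # stand-in for a SOLR text index

    def ingest(self, timestamp, session_id, message):
        # Store the compressed record once, then index its position.
        position = len(self.archive)
        self.archive.append(gzip.compress(message.encode("utf-8")))
        self.by_session[session_id].append((timestamp, position))
        for term in message.lower().split():
            self.by_term[term].add(position)
        return position

    def read(self, position):
        """Decompress one archived record."""
        return gzip.decompress(self.archive[position]).decode("utf-8")

    def session(self, session_id):
        """Messages of one session, ordered by timestamp."""
        entries = sorted(self.by_session.get(session_id, []))
        return [self.read(pos) for _, pos in entries]

    def search(self, term):
        """Messages containing the term, in ingestion order."""
        positions = sorted(self.by_term.get(term.lower(), ()))
        return [self.read(pos) for pos in positions]
```

The design choice shown is that the raw data lives in one place and each index holds only positions, so a component can be swapped without rewriting the others.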
Some things Hadoop got wrong?
• performance: huge start-up time & overheads
• productivity: e.g., joins, configuration knobs
• SLAs: no response time guarantees, no real time
• Essentially ignored 40 years of DB research 
17
Some things Hadoop got right
• scales without (much) thinking
• moves the computation to the data
• fault tolerance, load balance, …
18
How to improve on Hadoop
• Option 1: Push our knowledge into Hadoop?
– implement joins, recursion, …
• Option 2: Push Hadoop into RDBMS?
– build a Hadoop-enabled database system
• Option 3: Build new Hadoop components
– real-time, etc.
• Option 4: Patterns to compose components
– log service, machine learning, …
– but, do not build a “super-Hadoop”
19
Conclusion
• Focus on the “because we can…” part
– help data scientists to make everything work
• Stick to our guns
– develop clever algorithms & data structures
– develop modeling tools and languages
– develop abstractions for data, errors, failures, …
– develop “glue”; get the plumbing right
• Package our results correctly
– find the right abstractions (=> APIs of building blocks)
20