Big Data: Analytics Platforms
Donald Kossmann
Systems Group, ETH Zurich
http://systems.ethz.ch

Why Big Data?
• because bigger is smarter
  – answer tough questions
• because we can
  – push the limits and good things will happen

bigger = smarter?
• Yes!
  – tolerate errors
  – discover the long tail and corner cases
  – machine learning works much better
• But!
  – more data, more error (e.g., semantic heterogeneity)
  – with enough data you can prove anything
  – still need humans to ask the right questions

Fundamental Problem of Big Data
• There is no ground truth
  – gets more complicated with self-fulfilling prophecies
    • e.g., stock market predictions change the behavior of people
    • e.g., Web search engines determine the behavior of people
• Hard to debug: takes the human out of the loop
  – Example: how to play the lottery in Napoli
    • Step 1: You visit “oracles” who predict numbers to play.
    • Step 2: You visit “interpreters” who explain the predictions.
    • Step 3: After you have lost, “analysts” tell you that the “oracles” and “interpreters” were right and that it was your fault.
  – [Luciano De Crescenzo: Thus Spake Bellavista]

Why Big Data? (revisited)
• because bigger is smarter
  – answer tough questions
• because we can
  – push the limits and good things will happen

Because we can… Really?
• Yes!
  – all data is digitally born
  – storage capacity is increasing
  – counting is embarrassingly parallel
• But,
  – data grows faster than energy on chip
  – value / cost tradeoff unknown
  – ownership of data unclear (aggregate vs. individual)
• I believe that all these “buts” can be addressed

Utility & Cost Functions of Data
[Figures: utility and cost plotted as functions of noise/error, with curves for curated, random, and malicious data]

Best Utility/Cost Tradeoff
[Figure: utility and cost as functions of noise/error, highlighting the malicious case]

What is good enough?
[Figure: utility and cost as functions of noise/error, highlighting the curated case]

What about platforms?
• Relational databases
  – great for 20% of the data
  – not great for 80% of the data
• Hadoop
  – great for nothing
  – good enough for (almost) everything (if tweaked)

Why is Hadoop so popular?
• availability: open source and free
• proven technology: nothing new & simple
• works for all data and queries
• branding: the big guys use it
• it has the right abstractions
  – MR abstracts “counting” (= machine learning); see the word-count sketch below
• it is an eco-system – it is NOT a platform
  – HDFS, HBase, Hive, Pig, Zookeeper, SOLR, Mahout, …
  – relational database systems
  – turned into a platform depending on the app / problem

Example: Amadeus Log Service
• HDFS for compressed logs
• HBase to index by timestamp and session id
• SOLR for full-text search
• Hadoop (MR) for usage stats & disasters
• Oracle to store meta-data (e.g., user information)
• Disclaimer: under construction & evaluation!!!
  – current production system is proprietary
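The HBase part of the Amadeus example can be made concrete with a small sketch. The row-key design, the table name ("logs"), and the column family ("raw") below are illustrative assumptions, not the actual Amadeus schema; the code assumes the HBase 1.x client API.

```java
// A minimal sketch (not the actual Amadeus design) of indexing log records in
// HBase by session id and timestamp. Table "logs" and column family "raw" are
// hypothetical names chosen for the example.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class LogIndexer {

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table logs = connection.getTable(TableName.valueOf("logs"))) {

      String sessionId = "s-4711";            // hypothetical session id
      long timestamp = System.currentTimeMillis();
      String line = "GET /fare-search ...";   // hypothetical log line

      // Row key = sessionId + reversed timestamp, so a prefix scan on a
      // session returns its events newest-first.
      byte[] rowKey = Bytes.add(
          Bytes.toBytes(sessionId),
          Bytes.toBytes(Long.MAX_VALUE - timestamp));

      Put put = new Put(rowKey);
      put.addColumn(Bytes.toBytes("raw"), Bytes.toBytes("line"), Bytes.toBytes(line));
      logs.put(put);
    }
  }
}
```

In such a design, HBase only serves the point and range lookups by session and time; full-text queries would still go through SOLR and bulk analytics through MapReduce, as listed on the slide.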
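The claim that "MR abstracts counting" (and that counting is embarrassingly parallel) is easiest to see in the classic word-count job. The sketch below uses the standard Hadoop mapreduce API; the class name and input/output paths are placeholders.

```java
// The classic MapReduce word count: mappers emit (term, 1) pairs over disjoint
// input splits with no coordination, reducers just sum, which is why
// "counting" parallelizes so naturally.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Emit a partial count of 1 for every token in this line.
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      // The framework groups all partial counts per term; summing is the
      // only logic the reducer needs.
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class);   // local pre-aggregation on each node
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Reusing the reducer as a combiner keeps the shuffle small, which is what makes this counting pattern (and, by extension, many aggregation-style machine-learning workloads) cheap to scale out.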
Some things Hadoop got wrong
• performance: huge start-up time & overheads
• productivity: e.g., joins, configuration knobs
• SLAs: no response time guarantees, no real time
• essentially ignored 40 years of DB research

Some things Hadoop got right
• scales without (much) thinking
• moves the computation to the data
• fault tolerance, load balance, …

How to improve on Hadoop
• Option 1: Push our knowledge into Hadoop?
  – implement joins, recursion, … (a join sketch follows the Conclusion below)
• Option 2: Push Hadoop into RDBMS?
  – build a Hadoop-enabled database system
• Option 3: Build new Hadoop components
  – real-time, etc.
• Option 4: Patterns to compose components
  – log service, machine learning, …
  – but do not build a “super-Hadoop”

Conclusion
• Focus on the “because we can…” part
  – help data scientists to make everything work
• Stick to our guns
  – develop clever algorithms & data structures
  – develop modeling tools and languages
  – develop abstractions for data, errors, failures, …
  – develop “glue”; get the plumbing right
• Package our results correctly
  – find the right abstractions (=> APIs of building blocks)
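As an illustration of Option 1 above (pushing relational operators such as joins into Hadoop), here is a sketch of a reduce-side ("repartition") join in plain MapReduce. The input schemas (customers and orders CSV files) and field positions are assumptions made for the example.

```java
// A reduce-side join: both inputs are repartitioned on the join key (custId),
// records are tagged with their side, and the reducer emits the cross product
// per key. Input layouts are assumed: customers(custId,name), orders(orderId,custId,...).
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RepartitionJoin {

  // Tag each customer record with "C" so the reducer can tell the sides apart.
  public static class CustomerMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] f = value.toString().split(",", 2);     // custId,name
      if (f.length < 2) return;
      ctx.write(new Text(f[0]), new Text("C," + f[1]));
    }
  }

  // Tag each order record with "O"; the join key (custId) is the second field.
  public static class OrderMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] f = value.toString().split(",", 3);     // orderId,custId,...
      if (f.length < 2) return;
      ctx.write(new Text(f[1]), new Text("O," + f[0]));
    }
  }

  // All records with the same custId meet here; emit one output row per pair.
  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text custId, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      List<String> customers = new ArrayList<>();
      List<String> orders = new ArrayList<>();
      for (Text v : values) {
        String s = v.toString();
        if (s.startsWith("C,")) customers.add(s.substring(2));
        else orders.add(s.substring(2));
      }
      for (String name : customers) {
        for (String orderId : orders) {
          ctx.write(custId, new Text(orderId + "\t" + name));
        }
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "repartition join");
    job.setJarByClass(RepartitionJoin.class);
    MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, CustomerMapper.class);
    MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, OrderMapper.class);
    job.setReducerClass(JoinReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(job, new Path(args[2]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

This is exactly the kind of boilerplate that higher-level layers in the eco-system (e.g., Hive or Pig) generate from a declarative query, which is the productivity gap the "joins, configuration knobs" bullet above points at.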