Big Data: Analytics Platforms
Donald Kossmann
Systems Group, ETH Zurich
http://systems.ethz.ch

Why Big Data?
• because bigger is smarter
  – answer tough questions
• because we can
  – push the limits and good things will happen

bigger = smarter?
• Yes!
  – tolerate errors
  – discover the long tail and corner cases
  – machine learning works much better
• But!
  – more data, more error (e.g., semantic heterogeneity)
  – with enough data you can prove anything
  – still need humans to ask the right questions

Fundamental Problem of Big Data
• There is no ground truth
  – gets more complicated with self-fulfilling prophecies
    • e.g., stock market predictions change the behavior of people
    • e.g., Web search engines determine the behavior of people
• Hard to debug: takes the human out of the loop
  – Example: how to play the lottery in Napoli
    • Step 1: You visit “oracles” who predict numbers to play.
    • Step 2: You visit “interpreters” who explain the predictions.
    • Step 3: After you have lost, “analysts” tell you that the “oracles” and “interpreters” were right and that it was your fault.
  – [Luciano De Crescenzo: Thus Spake Bellavista]

Why Big Data? (revisited)
• because bigger is smarter
  – answer tough questions
• because we can
  – push the limits and good things will happen

Because we can… Really?
• Yes!
  – all data is digitally born
  – storage capacity is increasing
  – counting is embarrassingly parallel
• But,
  – data grows faster than energy on chip
  – value / cost tradeoff unknown
  – ownership of data unclear (aggregate vs. individual)
• I believe that all these “buts” can be addressed

Utility & Cost Functions of Data
[Figures: utility and cost plotted as functions of noise/error, with curves for curated, random, and malicious data]

Best Utility/Cost Tradeoff
[Figure: utility and cost as functions of noise/error, highlighting the malicious case]

What is good enough?
[Figure: utility and cost as functions of noise/error, highlighting the curated case]

What about platforms?
• Relational databases
  – great for 20% of the data
  – not great for 80% of the data
• Hadoop
  – great for nothing
  – good enough for (almost) everything (if tweaked)

Why is Hadoop so popular?
• availability: open source and free
• proven technology: nothing new & simple
• works for all data and queries
• branding: the big guys use it
• it has the right abstractions
  – MR abstracts “counting” (= machine learning); see the word-count sketch below
• it is an eco-system – it is NOT a platform
  – HDFS, HBase, Hive, Pig, Zookeeper, SOLR, Mahout, …
  – relational database systems
  – turned into a platform depending on the app / problem

Example: Amadeus Log Service
• HDFS for compressed logs
• HBase to index by timestamp and session id
• SOLR for full-text search
• Hadoop (MR) for usage stats & disasters
• Oracle to store meta-data (e.g., user information)
• Disclaimer: under construction & evaluation!!!
  – current production system is proprietary
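The HBase part of the Amadeus example can be made concrete with a small sketch. The row-key design, the table name ("logs"), and the column family ("raw") below are illustrative assumptions, not the actual Amadeus schema; the code assumes the HBase 1.x client API.

```java
// A minimal sketch (not the actual Amadeus design) of indexing log records in
// HBase by session id and timestamp. Table "logs" and column family "raw" are
// hypothetical names chosen for the example.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class LogIndexer {

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table logs = connection.getTable(TableName.valueOf("logs"))) {

      String sessionId = "s-4711";            // hypothetical session id
      long timestamp = System.currentTimeMillis();
      String line = "GET /fare-search ...";   // hypothetical log line

      // Row key = sessionId + reversed timestamp, so a prefix scan on a
      // session returns its events newest-first.
      byte[] rowKey = Bytes.add(
          Bytes.toBytes(sessionId),
          Bytes.toBytes(Long.MAX_VALUE - timestamp));

      Put put = new Put(rowKey);
      put.addColumn(Bytes.toBytes("raw"), Bytes.toBytes("line"), Bytes.toBytes(line));
      logs.put(put);
    }
  }
}
```

In such a design, HBase only serves the point and range lookups by session and time; full-text queries would still go through SOLR and bulk analytics through MapReduce, as listed on the slide.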
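The claim that "MR abstracts counting" (and that counting is embarrassingly parallel) is easiest to see in the classic word-count job. The sketch below uses the standard Hadoop mapreduce API; the class name and input/output paths are placeholders.

```java
// The classic MapReduce word count: mappers emit (term, 1) pairs over disjoint
// input splits with no coordination, reducers just sum, which is why
// "counting" parallelizes so naturally.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Emit a partial count of 1 for every token in this line.
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      // The framework groups all partial counts per term; summing is the
      // only logic the reducer needs.
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class);   // local pre-aggregation on each node
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Reusing the reducer as a combiner keeps the shuffle small, which is what makes this counting pattern (and, by extension, many aggregation-style machine-learning workloads) cheap to scale out.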
Some things Hadoop got wrong
• performance: huge start-up time & overheads
• productivity: e.g., joins, configuration knobs
• SLAs: no response time guarantees, no real time
• essentially ignored 40 years of DB research

Some things Hadoop got right
• scales without (much) thinking
• moves the computation to the data
• fault tolerance, load balance, …

How to improve on Hadoop
• Option 1: Push our knowledge into Hadoop?
  – implement joins, recursion, … (a join sketch follows the Conclusion below)
• Option 2: Push Hadoop into RDBMS?
  – build a Hadoop-enabled database system
• Option 3: Build new Hadoop components
  – real-time, etc.
• Option 4: Patterns to compose components
  – log service, machine learning, …
  – but do not build a “super-Hadoop”

Conclusion
• Focus on the “because we can…” part
  – help data scientists to make everything work
• Stick to our guns
  – develop clever algorithms & data structures
  – develop modeling tools and languages
  – develop abstractions for data, errors, failures, …
  – develop “glue”; get the plumbing right
• Package our results correctly
  – find the right abstractions (=> APIs of building blocks)
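As an illustration of Option 1 above (pushing relational operators such as joins into Hadoop), here is a sketch of a reduce-side ("repartition") join in plain MapReduce. The input schemas (customers and orders CSV files) and field positions are assumptions made for the example.

```java
// A reduce-side join: both inputs are repartitioned on the join key (custId),
// records are tagged with their side, and the reducer emits the cross product
// per key. Input layouts are assumed: customers(custId,name), orders(orderId,custId,...).
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RepartitionJoin {

  // Tag each customer record with "C" so the reducer can tell the sides apart.
  public static class CustomerMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] f = value.toString().split(",", 2);     // custId,name
      if (f.length < 2) return;
      ctx.write(new Text(f[0]), new Text("C," + f[1]));
    }
  }

  // Tag each order record with "O"; the join key (custId) is the second field.
  public static class OrderMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] f = value.toString().split(",", 3);     // orderId,custId,...
      if (f.length < 2) return;
      ctx.write(new Text(f[1]), new Text("O," + f[0]));
    }
  }

  // All records with the same custId meet here; emit one output row per pair.
  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text custId, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      List<String> customers = new ArrayList<>();
      List<String> orders = new ArrayList<>();
      for (Text v : values) {
        String s = v.toString();
        if (s.startsWith("C,")) customers.add(s.substring(2));
        else orders.add(s.substring(2));
      }
      for (String name : customers) {
        for (String orderId : orders) {
          ctx.write(custId, new Text(orderId + "\t" + name));
        }
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "repartition join");
    job.setJarByClass(RepartitionJoin.class);
    MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, CustomerMapper.class);
    MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, OrderMapper.class);
    job.setReducerClass(JoinReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(job, new Path(args[2]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

This is exactly the kind of boilerplate that higher-level layers in the eco-system (e.g., Hive or Pig) generate from a declarative query, which is the productivity gap the "joins, configuration knobs" bullet above points at.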