Big Data Processing, 2014/15
Lecture 1: Introduction

Claudia Hauff (Web Information Systems)
ti2736b-ewi@tudelft.nl

Course organisation
• 2 lectures a week (Tuesday, Thursday) for 7 weeks
• Content:
  • Data streaming algorithms
  • MapReduce (the bulk of the course)
  • Approaches to iterative big-data problems
• One "free" lecture in January (suggestions?)
  • e.g. big data visualisations, an industry talk, etc.
• Course style: interactive when possible

Course content (small changes are possible)
• Introduction
• Data streams 1 & 2
• The MapReduce paradigm
• Looking behind the scenes of MapReduce: HDFS & scheduling
• Algorithm design for MapReduce
• A high-level language for MapReduce: Pig 1 & 2
• MapReduce is not a database, but HBase nearly is
• Let's iterate a bit: graph algorithms & Giraph
• How does all of this work together? ZooKeeper/YARN

Assignments
• A mix of pen-and-paper and programming
• Individual work
• 5 lab sessions on Hadoop & co. between Nov. 24 and Jan. 5, on Mondays 8.45am-12.45pm
• Lab attendance is not compulsory
• Two lab assistants will help you with your lab work (Kilian Grashoff and Robert Carosi)
• Assignments are handed out on Thursdays and due the following Thursday (6 in total)

Grading & exams
• Course grade: 25% assignments, 75% final exam
• Assignments and quizzes?? Yes!
• Weekly quizzes: up to 1 additional grade point
  • +0.5 if you answer >50% correctly
  • +1 if you answer >75% correctly
• You can pass without handing in the assignments (not recommended!)
• Final exam: January 28, 2-5pm (MC & open questions)
• Resit: April 15, 2-5pm (MC & open questions)

Questions?
• Questions are always welcome
• Contact preferably via mail: ti2736b-ewi@tudelft.nl
• Questions and answers may be posted on Blackboard if they are relevant for others as well
• If the pace is too slow/fast, please let me know

Reading material (lectures)
• Recommended materials are either book chapters or papers, usually available through the TU Delft campus network
• The course covers a recent topic; no single book is available that covers all angles
• MapReduce: Data-Intensive Text Processing with MapReduce by Lin and Dyer, Morgan & Claypool Publishers, 2010. http://lintool.github.io/MapReduceAlgorithms/

Reading material (assignments)
• A book on Java if you are not yet comfortable with it
• Hadoop: Hadoop: The Definitive Guide by Tom White, O'Reilly Media, 2012 (3rd edition)

Big Data

Course objectives
• Explain the ideas behind the "big data buzz"
• Understand and describe the three different paradigms covered in class
• Code productively in one of the most important big data software frameworks we have to date: Hadoop (and the tools building on it)
• Transform big data problems into sensible algorithmic solutions

Today's learning objectives
• Explain and recognise the V's of big data in use case scenarios
• Explain the main differences between data streaming and MapReduce algorithms
• Identify the correct approach (streaming vs. MapReduce) to be taken in an application setting
What is "big data"?
• A buzzword with fuzzy boundaries:
  • "Massive amounts of diverse, unstructured data produced by high-performance applications."
  • "Data too large & complex to be effectively handled by standard database technologies currently found in most organisations."
• Requires novel infrastructure to support storage and processing

Large-scale computing is not new
• Weather forecasting has been a long-term scientific challenge
• Supercomputers were already used in the 1970s
• Equation crunching
(Image: ECMWF)

Big data processing
• So-called big data technologies are about discovering patterns (in semi-/unstructured data)
• The main focus is on how to make computations on big data feasible, i.e. without a supercomputer
• We use cluster(s) of commodity hardware
• Next quarter: the Data Mining course (which will probably use Hadoop) focuses on how to discover those patterns

Just an academic exercise?
• Cloud computing: "anything running inside a browser that gathers and stores user-generated content"
• Utility computing
  • Computing as a metered service
  • A "cloud user" buys any amount of computing power from a "cloud provider" (pay-per-use)
  • Virtual machine instances
  • IaaS: infrastructure as a service
• Amazon Web Services is the dominant provider

Just an academic exercise? No!
You can run your own big data experiments! (See the AWS EC2 pricing.)

Progress often driven by industry
• The development of big data standards & (open source) software is commonly driven by companies such as Google, Facebook, Twitter, Yahoo! …
• Why do they care about big data? More data means more knowledge, and more knowledge leads to
  • better customer engagement
  • fraud prevention
  • new products
  … and thus to more money.

Big data analytics: IBM pitch
https://www.youtube.com/watch?v=1RYKgj-QK4I

A concrete example: big data vs. small data
• Task: confusion set disambiguation (then vs. than; to vs. two vs. too)
• With little training data, only a complex algorithm does reasonably well and simple algorithms fail; but with a lot of data, even simple algorithms approach perfect quality
(Figure: learning curves; quality from poor to perfect on the y-axis, amount of training data from little to a lot on the x-axis)
Scaling to very very large corpora for natural language disambiguation. M. Banko and E. Brill, 2001.
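This effect is easy to play with. Below is a minimal, deliberately naive sketch of confusion set disambiguation in the spirit of the simplest learner in the paper: pick the candidate whose (previous word, candidate, next word) context occurs most often in a training corpus. Everything here (the class name ConfusionSet, the toy corpus, the method names) is invented for illustration; the actual experiments used corpora of up to a billion words, which is exactly why such a simple method can work so well.

```java
import java.util.*;

/** Toy confusion-set disambiguator: choose the candidate whose
 *  (previous word, candidate, next word) context was seen most
 *  often during training. Corpus and names are made up. */
public class ConfusionSet {

    static Map<String, Integer> contextCounts = new HashMap<>();

    static void train(String corpus) {
        String[] w = corpus.toLowerCase().split("\\s+");
        for (int i = 1; i < w.length - 1; i++)  // count all trigrams
            contextCounts.merge(w[i - 1] + " " + w[i] + " " + w[i + 1],
                                1, Integer::sum);
    }

    /** Return the candidate with the highest context count. */
    static String disambiguate(String prev, String next,
                               String... candidates) {
        String best = candidates[0];
        int bestCount = -1;
        for (String c : candidates) {
            int count = contextCounts.getOrDefault(
                    prev + " " + c + " " + next, 0);
            if (count > bestCount) { bestCount = count; best = c; }
        }
        return best;
    }

    public static void main(String[] args) {
        train("he is taller than his brother and older than his sister "
            + "first this and then that");
        // "taller ___ his": the counts favour "than"
        System.out.println(disambiguate("taller", "his", "then", "than"));
    }
}
```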
The 3 V's
• Volume: large amounts of data
• Variety: data comes in many different forms from diverse sources
• Velocity: the content is changing quickly

The 5 V's
• Volume: large amounts of data
• Variety: data comes in many different forms from diverse sources
• Velocity: the content is changing quickly
• Value: data alone is not enough; how can value be derived from it?
• Veracity: can we trust the data? How accurate is it?

The 7 V's (the 3/5 V's are the most commonly used)
• Volume: large amounts of data
• Variety: data comes in many different forms from diverse sources
• Velocity: the content is changing quickly
• Value: data alone is not enough; how can value be derived from it?
• Veracity: can we trust the data? How accurate is it?
• Validity: ensure that the interpreted data is sound
• Visibility: data from diverse sources needs to be stitched together

Question: how do these attributes apply to the use case of Flickr?

Instantiations

Terminology
• Batch processing: running a series of computer programs without human intervention
• Near real-time: a brief delay between the data becoming available and it being processed
• Real-time: a guaranteed response time between the data becoming available and it being processed

Terminology contd.
• Structured data (well-defined fields): the standard in the past
• Semi-structured data
• Unstructured data (by humans, for humans): the most common today

Unstructured text
• To get value out of unstructured text we need to impose structure automatically:
  • Parse the text
  • Extract meaning from it (can be easy or difficult)
• The amount of data we create is more than doubling every two years; most new data is unstructured or at most semi-structured
• Text is not everything: images, video, audio, etc.

Extracting meaning
(Figure: examples ranging from easy to difficult)

IBM Watson: How it works
https://www.youtube.com/watch?v=_Xcmh1LQB9I
8 minutes, worth your time even if not in class. Contains a nice piece about the use of unstructured text.

Examples of Volume & Velocity: Twitter
• >500 million tweets a day
• On average >5,700 tweets a second
• Peaks of >100,000 tweets a second
  • Super Bowl
  • US election
  • New Year's Eve
  • Football World Cup (672M tweets in total for #WorldCup)
• Messages are instantly accessible for search
• Messages are used in post-hoc analyses to gather insights

#WorldCup
http://bl.ocks.org/anonymous/raw/0c64880b3a791dffb6e4/

Examples of Volume: the Large Hadron Collider @ CERN
• The world's largest particle accelerator, buried in a tunnel of 27km circumference
• An enormous number of sensors register the passage of particles
• 40 million events/second (1MB of raw data per event)
• Generates 15 Petabytes a year (15-25 million GB)

Examples of Velocity: targeted advertising on the Web
• US revenues in 2013: ~$40 billion
• Advertisers usually pay per click
• For each search request, search engines decide
  • whether to show an ad
  • which ad to show
• Users are willing to wait at best 2 seconds for their search results
• Feedback loop via user clicks, user searches, mouse movements, etc.

More interactions/data on the Web
• YouTube: 4 billion views a day; one hour of video uploaded every second
• Facebook: 483 million daily active users (Dec. 2011); 300 Petabytes of data
• Google: >1 billion searches per day (March 2011)
• Google processed 100 Terabytes of data per day in 2004 and 20 Petabytes of data per day in 2008
• Internet Archive: contains 2 Petabytes of data, grows by 20 Terabytes per month (2011)

Use case: movie recommendations
Question: how can we come up with a movie suggestion for Frank? Think about data-intensive and non-intensive approaches.
(Figure: a user-movie rating matrix for Bob, Alice, Tom, Jane, Frank and Joe, almost entirely filled with unknowns)
• Bob likes X-MEN.
• Jane dislikes X-MEN.
• Should we suggest X-MEN to Frank?

Use case contd.
• Ignore the data, use experts instead (movie reviewers); assumes no large subscriber/reviewer divergence
• Use all the data but ignore individual preferences; assumes that most users are close to the average
• Lump people into preference groups based on shared likes/dislikes; compute a group-based average score per movie (a toy sketch of this idea follows below)
• Focus computational effort on the difficult movies (some movies are universally liked/disliked)
A whole research field is concerned with this question: recommender systems.

Use case contd.
• Netflix Prize: an open competition for the best collaborative filtering algorithm to predict user ratings for films, based on previous ratings (>100 million ratings by 0.5 million users for ~17,000 movies)
• The first competitor to improve over Netflix's baseline by 10% receives $1,000,000
• The competition started in 2006; the prize money was paid out in 2009 (the winner was 20 minutes quicker than the runner-up, with the same performance!)
• Many research teams competed: innovation driven by industry again!
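As promised above, a minimal sketch of the preference-group idea: predict Frank's verdict on a movie from the average rating of the users who never disagree with him on commonly rated movies. The class name, the tiny in-memory data set, and the crude "never disagrees" group definition are all invented for illustration; real collaborative filtering (including the Netflix Prize systems) is far more sophisticated.

```java
import java.util.*;

/** Toy group-based recommender: predict a user's rating for a movie
 *  from users who agree with them on all movies they both rated. */
public class GroupRecommender {

    // rating: +1 = likes, -1 = dislikes
    static Map<String, Map<String, Integer>> ratings = new HashMap<>();

    static void rate(String user, String movie, int r) {
        ratings.computeIfAbsent(user, u -> new HashMap<>()).put(movie, r);
    }

    /** Average rating for 'movie' among users whose known ratings
     *  never contradict 'user' (a crude preference group). */
    static double groupScore(String user, String movie) {
        Map<String, Integer> mine = ratings.get(user);
        double sum = 0; int n = 0;
        for (Map.Entry<String, Map<String, Integer>> e : ratings.entrySet()) {
            if (e.getKey().equals(user)) continue;
            Map<String, Integer> theirs = e.getValue();
            boolean agrees = true;
            for (Map.Entry<String, Integer> m : mine.entrySet())
                if (theirs.containsKey(m.getKey())
                        && !theirs.get(m.getKey()).equals(m.getValue()))
                    agrees = false;
            if (agrees && theirs.containsKey(movie)) {
                sum += theirs.get(movie); n++;
            }
        }
        return n == 0 ? 0 : sum / n;  // > 0 suggests "recommend"
    }

    public static void main(String[] args) {
        rate("Bob", "X-MEN", +1);  rate("Bob", "Alien", +1);
        rate("Jane", "X-MEN", -1); rate("Jane", "Alien", -1);
        rate("Frank", "Alien", +1);  // Frank agrees with Bob so far
        System.out.println(groupScore("Frank", "X-MEN")); // 1.0 -> suggest
    }
}
```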
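For context: the Netflix Prize scored submissions by root mean squared error (RMSE), so "improve by 10%" meant reducing the RMSE of the predicted ratings by 10% relative to Netflix's own baseline. The metric itself is a one-liner; the helper class below is ours, not Netflix code.

```java
/** Root mean squared error between predicted and actual ratings,
 *  the metric used to score Netflix Prize submissions. */
public class Rmse {

    static double rmse(double[] predicted, double[] actual) {
        double sq = 0;
        for (int i = 0; i < predicted.length; i++) {
            double d = predicted[i] - actual[i];
            sq += d * d;  // accumulate squared errors
        }
        return Math.sqrt(sq / predicted.length);
    }

    public static void main(String[] args) {
        System.out.println(rmse(new double[]{4.2, 3.1},
                                new double[]{4.0, 3.0}));
    }
}
```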
Example of Variety: restaurant locator
Question: what kind of data do you need for this task?
• Task: given a person's location, list the top five restaurants in the neighbourhood
• Required data:
  • A world map
  • A list of all restaurants in the world (opening hours, GPS coordinates, menu, special offers)
  • Reviews/ratings
  • Optional: social media stream(s)
• The data is continuously changing (restaurants close, new ones open, data formats change, etc.)

Society can benefit too, not just companies
• Accurate predictions of natural disasters and diseases
• Better responses to disasters
  • Timely & effective decisions
  • Provide resources where they are needed the most
• Complete disease/genomics databases to enable biomedical discoveries
• Accurate models to support forecasting of ecosystem developments

Idea: earthquake warnings
• Social sensors: users (humans) on Twitter, Facebook, Instagram, i.e. portals with real-time posting abilities
• Earthquake occurs → people in the area tweet about it → warn people further away
• Question: do you think this is possible? If so, what are the challenges?
• Challenges: how to detect that a tweet is about an actual earthquake, which earthquake it is about, and where the centre is

Idea: earthquake warnings contd.
It is possible!
• Goal: the warning should reach people earlier than the seismic waves
• Travel times of seismic waves: 3-7 km/s; arrival time of a wave 100km away: ~20 seconds
• Performance of an existing Twitter-based system (Sakaki et al., 2010) compared to the traditional earthquake warning system

A brief introduction to Streaming & MapReduce

A brief introduction to Streaming

Data streaming scenario
• Continuous and rapid input of data (a "stream of data")
• Limited memory to store the data: less than linear in the input size
• Limited time to process each data item: sequential access
• Algorithms have one (or very few) passes over the data
• Can be approached from a practical or a mathematical point of view: metric embedding, pseudo-random computations, sparse approximation theory … We go for the practical setup!

Data streaming example
Stream: 3 6 5 4 1 8 2. What is the missing number?
• A stream of n numbers; a permutation of 1 to n with one number missing; we are allowed one pass over the data
• Solution 1: memorise all numbers seen so far; memory requirement: n bits (impractical for large n)
• Solution 2: compute the closed-form sum of all numbers from 1 to n, i.e. n(n+1)/2, and subtract the seen numbers; memory requirement: 2 log n bits (see the sketch below)

Data streaming contd.
Stream: 3 4 9 4 9 8 2. What is the average? What is the median? (one pass over the data, and we can only store 3 numbers)
• Average: can be computed by keeping track of two numbers (the sum and the count of numbers seen)
• Median: sample data points, but how? (see the sampling sketch below)

Data streaming contd.
• Typically, simple functions of the stream are computed and used as input to other algorithms:
  • Median
  • Number of distinct elements
  • Longest increasing sequence
  • …
• Closed-form solutions are rare
• Common approaches approximate the true value: sampling, hashing
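The sketch referenced above: Solution 2 of the missing-number example in code. One sequential pass, and only two long values of state regardless of n; the class and method names are ours.

```java
/** One-pass solution to the missing-number puzzle: the numbers
 *  1..n arrive in arbitrary order with exactly one missing.
 *  Memory use is two long values, independent of n. */
public class MissingNumber {

    static long findMissing(int n, int[] stream) {
        long expected = (long) n * (n + 1) / 2;  // closed-form sum of 1..n
        long seen = 0;
        for (int x : stream)  // single sequential pass
            seen += x;
        return expected - seen;
    }

    public static void main(String[] args) {
        // the stream from the slide: n = 8, and 7 is missing
        System.out.println(findMissing(8, new int[]{3, 6, 5, 4, 1, 8, 2}));
    }
}
```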
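And the promised answer to "sample, but how?": reservoir sampling, a classic one-pass technique (not spelled out on the slide) that maintains a uniform random sample of fixed size k in O(k) memory; sorting the sample and taking the middle element gives a median estimate. A minimal sketch:

```java
import java.util.*;

/** Reservoir sampling: after i items have been processed, every item
 *  seen so far sits in the k-slot reservoir with probability k/i,
 *  which is what makes the sample uniform. */
public class Reservoir {

    static int[] sample(int k, Iterable<Integer> stream) {
        int[] reservoir = new int[k];
        Random rnd = new Random();
        int i = 0;
        for (int x : stream) {
            if (i < k) {
                reservoir[i] = x;             // fill the reservoir first
            } else {
                int j = rnd.nextInt(i + 1);   // uniform in [0, i]
                if (j < k) reservoir[j] = x;  // replace with prob. k/(i+1)
            }
            i++;
        }
        return reservoir;
    }

    public static void main(String[] args) {
        // the stream from the slide; keep 3 numbers, take the middle
        int[] s = sample(3, List.of(3, 4, 9, 4, 9, 8, 2));
        Arrays.sort(s);
        System.out.println("median estimate: " + s[1]);
    }
}
```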
A brief introduction to MapReduce

MapReduce is an industry standard. Hadoop is the open-source implementation of the MapReduce framework.

QCon 2013 (San Francisco)
"QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference …"

Industry is moving fast
QCon 2014 (San Francisco)

MapReduce
• Designed for batch processing over large data sets
• No limits on the number of passes, memory or time
• A programming model for distributed computations, inspired by the functional programming paradigm

MapReduce example
WordCount: given an input text, determine the frequency of each word. The "Hello World" of the MapReduce realm.
Input text: The dog walks around the house. The dog is in the house.
(Figure: the input is split into parts text 1-3; each part goes to a Mapper, which emits (word, 1) pairs such as (the,1), (dog,1), (walks,1), (house,1); the pairs are sorted and grouped; a Reducer adds up the counts per word: dog: 2, walks: 1, house: 2, …)

MapReduce example contd.
We implement the Mapper and the Reducer. Hadoop (and other tools) are responsible for the "rest".

Summary
• What are the characteristics of "big data"?
• Example use cases of big data
• Hopefully a convincing argument why you should care
• A brief introduction to data streams and MapReduce

Reading material
Required reading: none.
Recommended reading: Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information by Jules Berman. Chapters 1, 14 and 15.

THE END