COP 6727: Advanced Database Systems
Spring 2013
Dr. Tao Li
Florida International University

Student Self-Introduction
• Name
  – I will try to remember your names, but if you have a long name, please let me know what I should call you
• Anything you want us to know

COP6727 2

Course Overview
• Meeting time
  – Tuesday and Thursday, 12:30pm – 1:45pm
• Office hours:
  – Thursday 2:30pm – 4:30pm or by appointment
• Course Webpage:
  – http://www.cs.fiu.edu/~taoli/class/CAP6727S13/index.html

Course Objectives
• This is an advanced database course
  – Students should have already taken COP5725
• Assumes knowledge of the fundamental concepts of relational databases
• Covers the core principles and techniques of data and information management
• Discusses advanced techniques that can be applied to traditional database systems to efficiently support new and emerging applications

Tentative Topics
• Query processing and optimization
• Transaction management
• Database tuning
• Data stream systems
• Spatial databases
• XML
• Information retrieval and Web data management
• Scalable data processing
• Readings in recent developments in database systems and applications
  – SQL vs. non-SQL databases
  – Nearest neighbor queries
  – High-dimensional indexing
  – Database retrieval and ranking
  – Stream processing
  – Big Data
  – Incremental and online query processing
  – Mobile databases

Assignments and Grading
• Reading/Written Assignments
• Programming Projects
• Midterm Exam
• Final Project/Presentations
• Class attendance is mandatory
• Evaluation will be a subjective process
  – Effort is a very important component
• Regular In-class Students
  – Quizzes and Class Participation: 5%
  – Midterm Exam: 30%
  – Final Project: 30%
  – Assignments and Projects: 35%
• Online Students
  – Midterm Exam: 30%
  – Final Project: 30%
  – Homework Assignments: 40%

Text and References
Raghu Ramakrishnan and Johannes Gehrke. Database Management Systems. Third Edition, McGraw Hill, 2003.
ISBN: 0-07-246563-8. Links to Textbook Homepage.
In addition, course materials will be drawn from recent research literature.

Lecture 1 & 2
• Lecture 1 & 2: Introduction to MapReduce
(Most of the slides are adapted from Bill Graham, Spiros Papadimitriou, and Cloudera Tutorials)

Outline
• Motivation for MapReduce
• What is MapReduce?
• What is Hadoop?
• What is Hive?

Motivation for MapReduce
• The Big Data
• How to handle big data?

The Big Data
• Big data is everywhere
• Documents
  – Blogs (77 million Tumblr and 56.6 million WordPress blogs as of 2012), microblogs, news, reviews
• Images
  – Instagram, Flickr (more than 6 billion images)
• Videos
  – YouTube, all broadcast
• Others
  – Maps (Google Maps)
  – Human genome
  – Aeronautics and space data

Another view on “big”
• 2008: Google processes 20 PB a day
• 2009: Facebook has 2.5 PB user data + 15 TB/day
• 2009: eBay has 6.5 PB user data + 50 TB/day
• 2011: Yahoo! has 180-200 PB of data
• 2012: Facebook ingests 500 TB/day

Why do we care about this data?
• Modeling and predicting information flow
• Recommending/predicting links in social networks
• Relevance classification / information filtering
• Sentiment analysis and opinion mining
• Topic modeling and evolution
• Measuring influence in social networks
• Concept mapping
• Search
• …

Big data analysis
• Scalability (with reasonable cost)
  – Algorithmic improvements
  – Intuitive way: divide and conquer

Divide and Conquer

Challenges
• Parallel processing is complicated
  – How do we assign tasks to workers?
  – What if we have more tasks than slots?
  – What happens when tasks fail?
  – How do you handle distributed synchronization?
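The divide-and-conquer idea, and the question of what happens when there are more tasks than slots, can be illustrated on a single machine. The following is only a minimal sketch in Python, not part of the original slides: the function names (`split_into_chunks`, `count_words`, `parallel_word_total`) and the sample lines are hypothetical, and a thread pool stands in for a cluster of workers. It does not address task failures or distributed synchronization, which are exactly the parts a framework such as MapReduce takes over.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, chunk_size):
    """Divide: cut the input into fixed-size chunks (the 'tasks')."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def count_words(chunk):
    """Work done by one worker: count the words in its chunk of lines."""
    return sum(len(line.split()) for line in chunk)


def parallel_word_total(records, chunk_size=2, num_workers=2):
    """Conquer: run the chunks on a small worker pool, then combine the results."""
    chunks = split_into_chunks(records, chunk_size)
    # With more chunks than workers, the executor queues the extra tasks
    # until a worker ("slot") becomes free.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_counts = list(pool.map(count_words, chunks))
    return sum(partial_counts)


# Hypothetical sample input: five lines, split into three chunks for two workers.
lines = [
    "big data is everywhere",
    "divide and conquer",
    "map and reduce",
    "hello hadoop",
    "hello hive",
]
total = parallel_word_total(lines)
print(total)
```

Here three chunks compete for two workers, so one chunk waits in the queue; the final `sum` is the "combine" step that merges the partial results.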
Challenges (cont'd)
• Data storage is not trivial
  – Traditional database is not reliable
• Data volumes are massive
• Reliably storing PBs of data is challenging
  – Disk/hardware/network failures
  – Probability of a failure event increases with the number of machines
• For example:
  – 1000 hosts, each with 10 disks, and a disk lasts 3 years
  – How many failures per day?
  – Roughly 10,000 disks / (3 × 365 days) ≈ 9 disk failures per day, assuming failures are spread evenly over the disk lifetime

What is MapReduce?
• A programming model for expressing distributed computations at a massive scale
• An execution framework for organizing and performing such computations
• An open-source implementation called Hadoop

Workflow of Large Data Problem
• Map
  – Iterate over a large number of records
  – Extract something of interest from each
• Shuffle and sort intermediate results
• Reduce
  – Aggregate intermediate results
  – Generate final output

MapReduce paradigm
• Implement two functions:
    Map(k1, v1) -> list(k2, v2)
    Reduce(k2, list(v2)) -> list(v3)
• Framework handles everything else*
• Values with the same key go to the same reducer

MapReduce Flow

An Example

MapReduce paradigm (cont'd)
• There’s more!
• Partitioners decide which key goes to which reducer
  – partition(k’, numPartitions) -> partNumber
  – Divides the key space into chunks for parallel reducers
  – Default is hash-based
• Combiners can combine mapper output before sending it to the reducer
  – Reduce(k2, list(v2)) -> list(v3)

MapReduce Flow

MapReduce additional details
• Reduce starts after all mappers complete
• Mapper output gets written to disk
• Intermediate data can be copied sooner
• Reducer gets keys in sorted order
• Keys are not sorted across reducers
• Global sort requires 1 reducer or smart partitioning

MapReduce is good at
• Embarrassingly parallel algorithms
• Summing, grouping, filtering, joining
• Off-line batch jobs on massive data sets
• Analyzing an entire large dataset

MapReduce can do
• Iterative jobs (e.g., PageRank, K-means clustering)
  – Each iteration must read/write data to disk
  – I/O and latency cost of an iteration is high

MapReduce is not good at
• Jobs that need shared state/coordination
  – Tasks are shared-nothing
  – Shared state requires a scalable state store
• Low-latency jobs
• Jobs on small datasets
• Finding individual records

Summary of MapReduce
• Simple programming model
• Scalable, fault-tolerant
• Ideal for (pre-)processing large volumes of data

What is Hadoop?
• Hadoop is an open-source implementation based on GFS and MapReduce from Google
• Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. (2003) The Google File System
• Jeffrey Dean and Sanjay Ghemawat. (2004) MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004

Hadoop provides
• Redundant, fault-tolerant data storage
• Parallel computation framework
• Job coordination

Hadoop Stack

Who uses Hadoop?
• Yahoo!
• Facebook
• Last.fm
• Rackspace
• Digg
• Apache Nutch
• ...
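To make the MapReduce pieces from the preceding slides concrete (map, combiner, hash-based partitioner, shuffle and sort, reduce), here is a minimal single-process sketch in Python. It is not Hadoop code and does not use any Hadoop API: every name in it (`map_fn`, `reduce_fn`, `partition`, `combine`, `run_job`) and the sample documents are assumptions chosen for illustration, and word count is used as the job because it is the classic example.

```python
from collections import defaultdict
import zlib


def map_fn(doc_id, text):
    """Map(k1, v1) -> list(k2, v2): emit (word, 1) for every word."""
    return [(word.lower(), 1) for word in text.split()]


def reduce_fn(word, counts):
    """Reduce(k2, list(v2)) -> list(v3): sum the counts for one word."""
    return [sum(counts)]


def partition(key, num_partitions):
    """Hash-based partitioner: the same key always maps to the same reducer."""
    # crc32 is used instead of hash() so that results are stable across runs.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def combine(pairs):
    """Combiner: pre-aggregate one mapper's output before it is shuffled."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return list(local.items())


def run_job(documents, num_reducers=2):
    """Simulate map -> combine -> partition/shuffle -> sort -> reduce in memory."""
    # One bucket of {key: [values]} per simulated reducer.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for doc_id, text in documents.items():
        for key, value in combine(map_fn(doc_id, text)):
            buckets[partition(key, num_reducers)][key].append(value)
    output = {}
    for bucket in buckets:
        # Each reducer sees its own keys in sorted order;
        # keys are not globally sorted across reducers.
        for key in sorted(bucket):
            output[key] = reduce_fn(key, bucket[key])[0]
    return output


# Hypothetical sample input: document id -> text.
docs = {"d1": "the quick fox", "d2": "the lazy dog the end"}
result = run_job(docs)
print(result)
```

In this sketch the combiner turns the two occurrences of "the" in `d2` into a single `("the", 2)` pair before the shuffle, which is the saving in transferred data that combiners are meant to provide. A real cluster would run mappers and reducers on different machines and spill intermediate data to disk, which this sketch deliberately ignores.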
HDFS
• The Hadoop Distributed File System
• Redundant storage
• Designed to reliably store data using commodity hardware
• Designed to expect hardware failures
• Intended for large files
• Designed for batch inserts

Some Concepts about HDFS
• Files are stored as a collection of blocks
• Blocks are 64 MB chunks of a file (configurable)
• Blocks are replicated on 3 nodes (configurable)
• The NameNode (NN) manages metadata about files and blocks
• The SecondaryNameNode (SNN) holds a backup of the NN data
• DataNodes (DN) store and serve blocks

Write

Read

If a DataNode fails
• DNs check in with the NN to report health
• Upon failure, the NN orders DNs to replicate under-replicated blocks

Jobs and Tasks in Hadoop
• Job: a user-submitted map and reduce implementation to apply to a data set
• Task: a single mapper or reducer task
  – Failed tasks get retried automatically
  – Tasks run local to their data, ideally
• JobTracker (JT) manages job submission and task delegation
• TaskTrackers (TT) ask for work and execute tasks

Architecture

How to handle failed tasks?
• JT will retry failed tasks up to N attempts
• After N failed attempts for a task, the job fails
• Some tasks are slower than others
• Speculative execution: the JT starts multiple instances of the same task
• The first one to complete wins; the others are killed

Data locality
• Move computation to the data
• Moving data between nodes has a cost
• Hadoop tries to schedule tasks on nodes with the data
• When that is not possible, the TT has to fetch data from a DN

Hadoop execution environment
• Local machine (standalone or pseudo-distributed)
• Virtual machine
• Cloud (e.g., Amazon EC2)
• Own cluster

Demo: word count
• Demo

Homework
• Write a Hadoop program to index the words within the text document dataset
  – Example:
• Input:
  – Doc1: Hello World!
  – Doc2: Hello Java!
• Expected output:
  – Hello \t Doc1 Doc2
  – World \t Doc1
  – Java \t Doc2
• Due: beginning of class on 01/10
• If you have any questions, email Jingxuan Li (jli003@cs.fiu.edu)

Login Info
• Below is the login information for our Hadoop cluster
  – Server: datamining-node03.cs.fiu.edu
  – Username: dbstudent, password: ******* (announced during class)
  – Accessing the working directory in HDFS (do not modify or remove the other directories!):
      hadoop fs -ls /user/dbstudent
• Input dataset for the homework (everyone will be working on this dataset, so do not modify it!):
    /user/dbstudent/dataset
• Output directory (including the source code and the indexing results) format:
    /user/dbstudent/output-PID

What is Hive?
• Data warehousing tool on top of Hadoop
• Originally developed at Facebook
  – Now a Hadoop sub-project
• Data warehouse infrastructure
  – Execution: MapReduce
  – Storage: HDFS files
• Large datasets, e.g., Facebook daily logs
  – 30GB (Jan’08), 200GB (Mar’08), 15+TB (2009)
• HiveQL: SQL-like query language

Motivation
• Missing components when using Hadoop MapReduce jobs to process data
  – Command-line interface for “end users”
  – Ad-hoc query support
    – … without writing full MapReduce jobs
  – Schema information

Hive Applications
• Log processing
• Text mining
• Document indexing
• Customer-facing business intelligence (e.g., Google Analytics)
• Predictive modeling, hypothesis testing

Hive Components
• Shell: allows interactive queries, like a MySQL shell connected to a database
  – Also supports web and JDBC clients
• Driver: session handles, fetch, execute
• Compiler: parse, plan, optimize
• Execution engine: DAG of stages (M/R, HDFS, or metadata)
• Metastore: schema, location in HDFS

Data Model
• Tables
  – Typed columns (int, float, string, date, boolean)
  – Also list and map types (for JSON-like data)
• Partitions
  – e.g., to range-partition tables by date
• Buckets
  – Hash partitions within ranges (useful for
sampling, join optimization)

Metastore
• Database: namespace containing a set of tables
• Holds table definitions (column types, physical layout)
• Partition data
• Uses JPOX ORM for implementation; can be stored in Derby, MySQL, and many other relational databases

Physical Layout
• Warehouse directory in HDFS
  – e.g., /home/hive/warehouse
• Tables are stored in subdirectories of the warehouse
  – Partitions and buckets form subdirectories of tables
• Actual data is stored in flat files
  – Control-character-delimited text, or SequenceFiles
  – With a custom SerDe, can use an arbitrary format

Useful command examples
• Start Hive:
    bin/hive
• Show all the tables:
    SHOW TABLES;
• Create a new table:
    CREATE TABLE shakespeare (freq INT, word STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;
• Load data into the table:
    LOAD DATA INPATH "shakespeare_freq" INTO TABLE shakespeare;

Useful command examples (cont'd)
• Select data:
    SELECT * FROM shakespeare WHERE freq > 100 SORT BY freq ASC LIMIT 10;
• Join:
    INSERT OVERWRITE TABLE merged
    SELECT s.word, s.freq, k.freq
    FROM shakespeare s JOIN kjv k ON (s.word = k.word)
    WHERE s.freq >= 1 AND k.freq >= 1;

Summary of Hive
• Supports rapid iteration of ad-hoc queries
• Can perform complex joins with minimal code
• Scales to handle much more data than many similar systems

References
• White, T., Hadoop: The Definitive Guide, 2012
• http://hadoop.apache.org/
• http://hive.apache.org/
• MapReduce tutorial: http://hadoop.apache.org/docs/r0.20.2/mapred_tutorial.html#Example%3A+WordCount+v1.0
• Bill Graham, http://blogs.ischool.berkeley.edu/i290-abdts12/files/2012/08/BillGraham_IntroToHadoop_Aug30.pdf
• Spiros Papadimitriou, Jimeng Sun, and Rong Yan, http://cs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/07-1.pdf
• Cloudera, http://blog.cloudera.com/wp-content/uploads/2010/01/6IntroToHive.pdf

Exercises
• To be announced