Introduction to Hive Liyin Tang liyintan@usc.edu Outline Motivation Overview Data Model / Metadata Architecture Performance Cons and Pros Application Related Work 4/13/2015 Introduction to Hive 2 Motivation Realtime Hadoop Cluster Web Servers Oracle RAC Scribe MidTier Hadoop Hive Warehouse Scribe Writers MySQL http://hadoopblog.blogspot.com/2009/06/hdfs-scribe-integration.html 4/13/2015 Introduction to Hive 3 Motivation Limitation of MR Have to use M/R model Not Reusable Error prone For complex jobs: Multiple stage of Map/Reduce functions Just like ask dev to write specify physical execution plan in the database 4/13/2015 Introduction to Hive 4 Overview Intuitive Make the unstructured data looks like tables regardless how it really lay out SQL based query can be directly against these tables Generate specify execution plan for this query What’s Hive A data warehousing system to store structured data on Hadoop file system Provide an easy query these data by execution Hadoop MapReduce plans 4/13/2015 Introduction to Hive 5 Data Model Tables Basic type columns (int, float, boolean) Complex type: List / Map ( associate array) Partitions Buckets CREATE TABLE sales( id INT, items ARRAY<STRUCT<id:INT,name:STRING> ) PARITIONED BY (ds STRING) CLUSTERED BY (id) INTO 32 BUCKETS; SELECT id FROM sales TABLESAMPLE (BUCKET 1 OUT OF 32) 4/13/2015 Introduction to Hive 6 Metadata Database namespace Table definitions schema info, physical location In HDFS Partition data ORM Framework All the metadata can be stored in Derby by default Any database with JDBC can be configed 4/13/2015 Introduction to Hive 7 Architecture Map Reduce Web UI + Hive CLI + JDBC/ODBC User-defined Map-reduce Scripts HDFS Browse, Query, DDL Hive QL MetaStore UDF/UDAF substr sum average Parser Planner Execution Thrift API Optimizer SerDe CSV Thrift Regex FileFormats TextFile SequenceFile RCFile http://www.slideshare.net/cloudera/hw09-hadoop-development-atfacebook-hive-and-hdfs 8 Performance GROUP BY operation Efficient execution plans based on: Data skew: how evenly distributed data across a number of physical nodes bottleneck VS load balance Partial aggregation: Group the data with the same group by value as soon as possible In memory hash-table for mapper Earlier than combiner 4/13/2015 Introduction to Hive 9 Performance JOIN operation Traditional Map-Reduce Join Early Map-side Join very efficient for joining a small table with a large table Keep smaller table data in memory first Join with a chunk of larger table data each time Space complexity for time complexity 7/20/2010 Introduction to Hive 10 Performance Ser/De Describe how to load the data from the file into a representation that make it looks like a table; Lazy load Create the field object when necessary Reduce the overhead to create unnecessary objects in Hive Java is expensive to create objects Increase performance 7/20/2010 Introduction to Hive 11 Hive – Performance Date SVN Revision Major Changes Query A Query B Query C 2/22/2009 746906 Before Lazy Deserialization 83 sec 98 sec 183 sec 2/23/2009 747293 Lazy Deserialization 40 sec 66 sec 185 sec 3/6/2009 751166 Map-side Aggregation 22 sec 67 sec 182 sec 4/29/2009 770074 Object Reuse 21 sec 49 sec 130 sec 6/3/2009 781633 Map-side Join * 21 sec 48 sec 132 sec 8/5/2009 801497 Lazy Binary Format * 21 sec 48 sec 132 sec QueryA: SELECT count(1) FROM t; QueryB: SELECT concat(concat(concat(a,b),c),d) FROM t; QueryC: SELECT * FROM t; map-side time only (incl. GzipCodec for comp/decompression) * These two features need to be tested with other queries. http://www.slideshare.net/cloudera/hw09-hadoop-development-atfacebook-hive-and-hdfs Pros Pros A easy way to process large scale data Support SQL-based queries Provide more user defined interfaces to extend Programmability Efficient execution plans for performance Interoperability with other database tools 4/13/2015 Introduction to Hive 13 Cons Cons No easy way to append data Files in HDFS are immutable Future work Views / Variables More operator In/Exists semantic More future work in the mail list 4/13/2015 Introduction to Hive 14 Application Log processing Daily Report User Activity Measurement Data/Text mining Machine learning (Training Data) Business intelligence Advertising Delivery Spam Detection 7/20/2010 Introduction to Hive 15 Related Work Parallel databases: Gamma, Bubba, Volcano Google: Sawzall Yahoo: Pig IBM: JAQL Microsoft: DradLINQ , SCOPE 7/20/2010 Introduction to Hive 16 Reference [1] A.Thusoo et al. Hive: a warehousing solution over a map-reduce framework. Proceedings of VLDB09', 2009. [2] Hadoop 2009: http://www.slideshare.net/cloudera/hw09-hadoopdevelopment-at-facebook-hive-and-hdfs [4] Facebook Data Team: http://www.slideshare.net/zshao/hive-datawarehousing-analytics-on-hadoop-presentation [3] Cloudera: http://www.cloudera.com/videos/introduction_to_hi ve 7/20/2010 Introduction to Hive 17 Q&A Thank you Back up Hive Components Shell Interface: Like the MySQL shell Driver: Session handles, fetch, exeucition Complier: Prarse,plan,optimzie Execution Engine: DAG stage Run map or reduce 7/20/2010 Introduction to Hive 20 Motivation MapReduce Motivation Data processing: > 1 TB Massively parallel Locality Fault Tolerant 7/20/2010 Introduction to Hive 21 Hive Usage hive> show tables; hive> create table SHAKESPEARE (freq INT,word STRING) row format delimited fields terminated by ‘\t’ stored as textfile hive> load data inpath “shakespeare_freq” into table shakespeare; Introduction to Hive 22 Hive Usage hive> load data inpath “shakespeare_freq” into table shakespeare; hive> select * from shakespeare where freq>100 sort by freq asc limit 10; Introduction to Hive 23 Hive Usage @ Facebook Statistics per day: 4 TB of compressed new data added per day 135TB of compressed data scanned per day 7500+ Hive jobs on per day Hive simplifies Hadoop: ~200 people/month run jobs on Hadoop/Hive Analysts (non-engineers) use Hadoop through Hive 95% of jobs are Hive Jobs http://www.slideshare.net/cloudera/hw09-hadoop-developmentat-facebook-hive-and-hdfs 7/20/2010 Introduction to Hive 24