Berkeley Data Analysis Stack: Shark, Bagel

Previous presentation summary
• Mesos, Spark, Spark Streaming
• New apps: AMP-Genomics, Carat, …

The BDAS stack
• Application — data processing: in-memory processing; trading off time, quality, and cost
• Data management — storage: efficient data sharing across frameworks
• Resource — infrastructure management: share infrastructure across frameworks (multi-programming for datacenters)

Spark example: log mining
Load error messages from a log into memory, then interactively search for various patterns:

  lines = spark.textFile("hdfs://...")
  errors = lines.filter(_.startsWith("ERROR"))
  messages = errors.map(_.split('\t')(2))
  cachedMsgs = messages.cache()

  cachedMsgs.filter(_.contains("foo")).count
  cachedMsgs.filter(_.contains("bar")).count

The driver turns each parallel operation on the cached RDD into tasks and ships them to the workers; each worker caches its blocks of the transformed RDD (Cache 1–3 over Blocks 1–3) and returns its partial results.
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data).

Logistic regression performance

  val data = spark.textFile(...).map(readPoint).cache()
  var w = Vector.random(D)
  for (i <- 1 to ITERATIONS) { … }
  println("Final w: " + w)

• Without caching (data re-read from disk each pass): 127 s / iteration
• With the cached RDD: first iteration 174 s (loads, parses, and caches the data), further iterations 6 s
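The slide leaves the body of the training loop elided. The sketch below fills it in the way the classic Spark logistic regression example does; the Point case class, the readPoint parser, the feature count D, the input path, and the use of plain Array[Double] in place of the slide's Vector class are assumptions for illustration only, assuming input lines of the form "label f1 f2 … fD" with labels in {-1, +1}.

  import org.apache.spark.{SparkConf, SparkContext}
  import scala.math.exp
  import scala.util.Random

  object LogisticRegressionSketch {
    // One labelled point: y in {-1, +1}, x holds the D features (assumed layout).
    case class Point(x: Array[Double], y: Double)

    def readPoint(line: String): Point = {
      val cols = line.trim.split("\\s+").map(_.toDouble)
      Point(cols.tail, cols.head)
    }

    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("LR"))
      val D = 10              // number of features -- assumption for the sketch
      val ITERATIONS = 10

      // cache() keeps the parsed points in memory, so every iteration after the
      // first one reads from RAM instead of re-scanning the file on HDFS.
      val data = sc.textFile("hdfs://.../points").map(readPoint).cache()

      var w = Array.fill(D)(2 * Random.nextDouble() - 1)
      for (i <- 1 to ITERATIONS) {
        // Batch gradient of the logistic loss: one parallel map, one reduce.
        val gradient = data.map { p =>
          val margin = w.zip(p.x).map { case (wi, xi) => wi * xi }.sum
          val scale = (1.0 / (1.0 + exp(-p.y * margin)) - 1.0) * p.y
          p.x.map(_ * scale)
        }.reduce((a, b) => a.zip(b).map { case (ai, bi) => ai + bi })
        w = w.zip(gradient).map { case (wi, gi) => wi - gi }
      }
      println("Final w: " + w.mkString(" "))
    }
  }

Each iteration is a single map + reduce over the cached points, which is why only the first iteration pays the I/O and parsing cost in the numbers above.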
Hive: components
• Interfaces: CLI, management web UI, Thrift API, browsing
• Driver: HiveQL parser, planner, and execution of queries and DDL
• SerDe: Thrift, Jute, JSON, …
• MetaStore
• Execution on MapReduce, storage on HDFS

Hive data model

  Hive entity      Sample metastore entity   Sample HDFS location
  Table            T                         /wh/T
  Partition        date=d1                   /wh/T/date=d1
  Bucketing        userid column             /wh/T/date=d1/part-0000 … /wh/T/date=d1/part-1000 (hashed on userid)
  External table   extT                      /wh2/existing/dir (arbitrary location)

Hive/Shark flowchart: insert into a table
Two ways to do this:
1. Load from an "external table": query the external table for each "bucket" and write that bucket to HDFS.
2. Load "buckets" directly: the user is responsible for creating the buckets.

First create the target table (this creates the table directory):

  CREATE TABLE page_view(viewTime INT, userid BIGINT,
      page_url STRING, referrer_url STRING,
      ip STRING COMMENT 'IP Address of the User')
  COMMENT 'This is the page view table'
  PARTITIONED BY(dt STRING, country STRING)
  STORED AS SEQUENCEFILE;

Loading via an external table then takes three steps.

Step 1 — declare a staging table over the raw files:

  CREATE EXTERNAL TABLE page_view_stg(viewTime INT, userid BIGINT,
      page_url STRING, referrer_url STRING,
      ip STRING COMMENT 'IP Address of the User',
      country STRING COMMENT 'country of origination')
  COMMENT 'This is the staging page view table'
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '44' LINES TERMINATED BY '12'
  STORED AS TEXTFILE
  LOCATION '/user/data/staging/page_view';

Step 2 — copy the raw data into the staging location:

  hadoop dfs -put /tmp/pv_2008-06-08.txt /user/data/staging/page_view

Step 3 — query the staging table and write the result into a partition of the target table:

  FROM page_view_stg pvs
  INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
  SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip
  WHERE pvs.country = 'US';

Hive operators, SerDe, and ObjectInspector
• Hive operators run inside the mappers and reducers. They manipulate rows as "hierarchical objects" through ObjectInspectors; a SerDe converts between hierarchical objects and the Writables that Hadoop's FileFormat / serialization layer reads from and writes to files on HDFS (streams of records such as thrift_record<…>) and map output files.
• A hierarchical object can be represented in three ways:
  – Java object: an object of a Java class
  – Standard: ArrayList for struct and array, HashMap for map
  – LazyObject: a lazily-deserialized wrapper around the raw Writable
• Writable examples: Text('1.0 3 54') (UTF-8 encoded) and BytesWritable(\x3F\x64\x72\x00)
• User scripts see rows as plain text (e.g. "1.0 3 54", "0.2 1 33", "2.2 8 212", "0.7 2 22"); user-defined SerDes are applied per row.

SerDe, ObjectInspector, and TypeInfo
• A SerDe deserializes a Writable into a hierarchical object and serializes it back; getOI returns the ObjectInspector that describes the objects it produces.
• ObjectInspectors report types (getType) and give access to the data: getStructField / getFieldOI for structs, getMapValue / getMapValueOI for maps; the corresponding TypeInfo objects describe the types themselves (string, int, map, list, struct, …).
• Example: the Writable Text('a=av:b=bv 23 1:2=4:5 abcd') (or, for a binary SerDe, BytesWritable(\x3F\x64\x72\x00)) deserializes to the hierarchical object

    List( HashMap("a" -> "av", "b" -> "bv"),
          23,
          List( List(1, null), List(2, 4), List(5, null) ),
          "abcd" )

  which corresponds to the Java classes

    class HO {
      HashMap<String, String> a;
      Integer b;
      List<ClassC> c;
      String d;
    }
    class ClassC {
      Integer a;
      Integer b;
    }
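To make the "Standard" representation concrete, here is a minimal sketch that builds the example row above out of java.util.ArrayList and java.util.HashMap and reads a field back. The access at the end only illustrates what an ObjectInspector enables (navigating the data without the HO / ClassC classes); it is not Hive's ObjectInspector API, and the names are made up for the sketch.

  import java.util.{ArrayList => JArrayList, HashMap => JHashMap, Map => JMap}

  object StandardObjectSketch {
    // Helper: a two-field "struct" (a, b) stored as an ArrayList, matching ClassC.
    private def pair(a: Integer, b: Integer): JArrayList[Integer] = {
      val l = new JArrayList[Integer]()
      l.add(a)
      l.add(b)
      l
    }

    def main(args: Array[String]): Unit = {
      val a = new JHashMap[String, String]()          // map<string,string>
      a.put("a", "av")
      a.put("b", "bv")

      val c = new JArrayList[JArrayList[Integer]]()   // list<struct<a:int, b:int>>
      c.add(pair(1, null))
      c.add(pair(2, 4))
      c.add(pair(5, null))

      val row = new JArrayList[AnyRef]()              // the HO struct itself
      row.add(a)
      row.add(Integer.valueOf(23))
      row.add(c)
      row.add("abcd")

      // ObjectInspector-style access: treat field 0 as a map and look up "a",
      // without ever touching a dedicated Java class for the row.
      val field0 = row.get(0).asInstanceOf[JMap[String, String]]
      println(field0.get("a"))                        // prints: av
    }
  }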
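The "LazyObject" representation defers parsing until a field is actually touched. Below is a toy sketch of that idea, assuming a simple character-delimited row format; Hive's real LazySimpleSerDe and Lazy* classes are considerably more involved, so this only shows the concept.

  // LazyRow keeps the raw delimited text and splits it into fields only when a
  // field is first requested, then caches the parsed form.
  class LazyRow(raw: String, fieldDelim: Char) {
    private var fields: Array[String] = null               // null = not parsed yet

    def getField(i: Int): String = {
      if (fields == null) fields = raw.split(fieldDelim)   // parse on first access
      if (i < fields.length) fields(i) else null
    }
  }

  object LazyRowDemo {
    def main(args: Array[String]): Unit = {
      // The slide's space-delimited example row "1.0 3 54".
      val row = new LazyRow("1.0 3 54", ' ')
      println(row.getField(2))            // prints: 54 (parsing happened just now)
    }
  }

Rows that are filtered out early never pay the deserialization cost, which is the point of the lazy form.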
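Finally, returning to the bucketed layout in the Hive data model above (/wh/T/date=d1/part-0000 … /wh/T/date=d1/part-1000, hashed on userid): a row is routed to one bucket file by hashing the bucketing column modulo the number of buckets. The sketch below only illustrates that idea; the hash function, the bucket count, and the file naming here are assumptions, not Hive's actual implementation.

  object BucketSketch {
    // part-0000 .. part-1000 on the slide => 1001 buckets (illustrative only).
    val numBuckets = 1001

    // Route a row to a bucket file under its partition directory by hashing
    // the bucketing column.
    def bucketFile(userid: Long): String = {
      val bucket = (userid.hashCode & Int.MaxValue) % numBuckets
      f"/wh/T/date=d1/part-$bucket%04d"
    }

    def main(args: Array[String]): Unit = {
      Seq(12345L, 67890L, 424242L).foreach { id =>
        println(s"userid $id -> ${bucketFile(id)}")
      }
    }
  }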