Berkeley Data Analysis Stack

Shark, Bagel
Previous Presentation Summary
• Mesos, Spark, Spark Streaming
Stack layers, top to bottom:
• Application: new apps such as AMP-Genomics, Carat, …
• Data Processing: in-memory processing; trade-offs between time, quality, and cost
• Data Management (Storage): efficient data sharing across frameworks
• Resource Management (Infrastructure): share infrastructure across frameworks (multi-programming for datacenters)
Spark Example: Log Mining
• Load error messages from a log into memory, then interactively search for various patterns
lines = spark.textFile("hdfs://...")           // Base RDD
errors = lines.filter(_.startsWith("ERROR"))   // Transformed RDD
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()                  // Cached RDD
cachedMsgs.filter(_.contains("foo")).count     // Parallel operation
cachedMsgs.filter(_.contains("bar")).count
. . .
[Diagram: the Driver sends tasks to three Workers and receives results; each Worker holds one block (Block 1-3) and one cache (Cache 1-3).]
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
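The filter, map, cache, and query steps of the log-mining example can be sketched without a cluster. This is a plain-Python stand-in, not Spark code: the sample log lines and the helper `count_containing` are made up for illustration, and a Python list plays the role of the cached RDD.

```python
# Plain-Python stand-in for the log-mining pipeline (no Spark, no RDDs).
# The sample log lines below are invented for illustration.
log_lines = [
    "ERROR\t2012-01-01\tfoo failed to start",
    "INFO\t2012-01-01\tservice healthy",
    "ERROR\t2012-01-02\tbar timed out",
    "ERROR\t2012-01-03\tfoo lost connection",
]

# filter: keep only error lines, like lines.filter(_.startsWith("ERROR"))
errors = (line for line in log_lines if line.startswith("ERROR"))

# map: keep the third tab-separated field, like _.split('\t')(2)
messages = (line.split("\t")[2] for line in errors)

# cache: materialize once so that later queries reuse the in-memory list
cached_msgs = list(messages)


def count_containing(pattern):
    """Count cached messages that contain the given pattern."""
    return sum(1 for msg in cached_msgs if pattern in msg)


foo_count = count_containing("foo")
bar_count = count_containing("bar")
print(foo_count, bar_count)
```

The generators are lazy, so nothing is read until `list(messages)` runs. Each later query only scans the in-memory list, which is the effect the slide attributes to caching.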
Logistic Regression Performance
val data = spark.textFile(...).map(readPoint).cache()
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
  …
}
println("Final w: " + w)
[Chart: running time. Labels: 127 s / iteration; first iteration 174 s, further iterations 6 s]
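The slide elides the loop body. As an assumption, the sketch below fills it with a standard batch gradient step for logistic regression with labels in {-1, +1}; this is not necessarily what the slide used. The data set, `D`, `ITERATIONS`, and the helpers `dot` and `predict` are all hypothetical, and the in-memory list stands in for the cached data set that every iteration rescans.

```python
import math
import random

random.seed(0)

D = 2            # number of features (assumed)
ITERATIONS = 50  # assumed iteration count

# Hypothetical in-memory data set: (features, label), label in {-1, +1}.
data = [
    ([1.0, 2.0], 1), ([2.0, 1.5], 1), ([1.5, 2.5], 1),
    ([-1.0, -2.0], -1), ([-2.0, -1.0], -1), ([-1.5, -0.5], -1),
]

# Random starting weights, in the spirit of Vector.random(D)
w = [random.uniform(-1.0, 1.0) for _ in range(D)]


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


for _ in range(ITERATIONS):
    # Assumed loop body: sum the logistic-loss gradient over all points
    gradient = [0.0] * D
    for x, y in data:
        scale = (1.0 / (1.0 + math.exp(-y * dot(w, x))) - 1.0) * y
        for j in range(D):
            gradient[j] += scale * x[j]
    # Gradient step with step size 1
    w = [wj - gj for wj, gj in zip(w, gradient)]


def predict(x):
    """Classify a point by the sign of w . x."""
    return 1 if dot(w, x) > 0 else -1


print("Final w:", w)
```

Every iteration scans the same `data` list. When that list is already in memory, iterations after the first skip the load cost, which matches the shape of the timings on the slide.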
HIVE: Components
• Interfaces: Hive CLI (browsing, queries, DDL), Mgmt. Web UI, Thrift API
• Hive QL: Parser, Planner, Execution
• SerDe: Thrift, Jute, JSON, ..
• MetaStore
• Underlying systems: Map Reduce, HDFS
Data Model
Hive Entity        | Sample Metastore Entity | Sample HDFS Location
Table              | T                       | /wh/T
Partition          | date=d1                 | /wh/T/date=d1
Bucketing column   | userid                  | /wh/T/date=d1/part-0000 … /wh/T/date=d1/part-1000 (hashed on userid)
External Table     | extT                    | /wh2/existing/dir (arbitrary location)
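The mapping from Hive entities to HDFS locations can be sketched as three small path builders. This is an illustration only: the function names are hypothetical, `userid % num_buckets` stands in for whatever hash Hive applies to the bucketing column, and the four-digit file name simply follows the sample paths in the table.

```python
def table_path(warehouse, table):
    """A table maps to a directory under the warehouse root."""
    return f"{warehouse}/{table}"


def partition_path(warehouse, table, key, value):
    """A partition maps to a key=value subdirectory of the table."""
    return f"{table_path(warehouse, table)}/{key}={value}"


def bucket_path(warehouse, table, key, value, userid, num_buckets):
    """A bucket maps to one file inside the partition directory.

    Assumption: the bucket is userid modulo num_buckets; the real hash
    function may differ.
    """
    bucket = userid % num_buckets
    return f"{partition_path(warehouse, table, key, value)}/part-{bucket:04d}"


print(table_path("/wh", "T"))
print(partition_path("/wh", "T", "date", "d1"))
print(bucket_path("/wh", "T", "date", "d1", userid=37, num_buckets=32))
```

An external table does not follow this scheme: its location is whatever directory the table definition names.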
Hive/Shark flowchart (Insert into table)
Two ways to do this.
1. Load from “external table”. Query the external table for each “bucket” and write that bucket to HDFS.
2. Load “Buckets” directly. The user is responsible for creating buckets.
CREATE TABLE page_view(
  viewTime INT,
  userid BIGINT,
  page_url STRING,
  referrer_url STRING,
  ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;
Creates the table directory.
Step 1
CREATE EXTERNAL TABLE page_view_stg(
  viewTime INT,
  userid BIGINT,
  page_url STRING,
  referrer_url STRING,
  ip STRING COMMENT 'IP Address of the User',
  country STRING COMMENT 'country of origination')
COMMENT 'This is the staging page view table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '44' LINES TERMINATED BY '12'
STORED AS TEXTFILE
LOCATION '/user/data/staging/page_view';
Step 2
hadoop dfs -put /tmp/pv_2008-06-08.txt /user/data/staging/page_view
Step 3
FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, pvs.ip
WHERE pvs.country = 'US';
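What the Step 3 statement does can be sketched in plain Python: select the staging rows for one country, keep the data columns, and direct them to the matching partition directory. The staging rows, the relative `page_view/...` directory, and the function `insert_overwrite` are hypothetical; the sketch assumes only the five data columns declared for `page_view` are selected.

```python
# Hypothetical staging rows, standing in for the page_view_stg table.
staging_rows = [
    {"viewTime": 1, "userid": 10, "page_url": "/a", "referrer_url": "/",
     "ip": "10.0.0.1", "country": "US"},
    {"viewTime": 2, "userid": 11, "page_url": "/b", "referrer_url": "/a",
     "ip": "10.0.0.2", "country": "CA"},
    {"viewTime": 3, "userid": 12, "page_url": "/c", "referrer_url": "/b",
     "ip": "10.0.0.3", "country": "US"},
]

# Data columns of page_view; dt and country are partition columns and
# are encoded in the directory name instead of being stored per row.
DATA_COLUMNS = ["viewTime", "userid", "page_url", "referrer_url", "ip"]


def insert_overwrite(rows, dt, country):
    """Return the target partition directory and the rows written to it."""
    partition_dir = f"page_view/dt={dt}/country={country}"
    selected = [
        tuple(row[col] for col in DATA_COLUMNS)
        for row in rows
        if row["country"] == country   # WHERE pvs.country = 'US'
    ]
    return partition_dir, selected


partition_dir, written = insert_overwrite(staging_rows, "2008-06-08", "US")
print(partition_dir, written)
```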
Hive: SerDe and ObjectInspector in the data flow
[Diagram: File on HDFS, FileFormat / Hadoop Serialization, Stream, Writable, SerDe, Hierarchical Object, ObjectInspector, Hive Operator (Mapper), Hive Operator (Reducer), Map Output File, User Script, File on HDFS]
• Writable examples: Text('1.0 3 54') // UTF8 encoded; BytesWritable(\x3F\x64\x72\x00); thrift_record<…>
• Hierarchical Object kinds:
  - Java Object: object of a Java class
  - Standard Object: use ArrayList for struct and array, use HashMap for map
  - LazyObject: lazily deserialized
• User-defined SerDes per row
• Sample rows streamed through a user script:
  1.0 3 54
  0.2 1 33
  2.2 8 212
  0.7 2 22
SerDe, ObjectInspector and TypeInfo
Java classes for the example row:
class HO {
  HashMap<String, String> a;
  Integer b;
  List<ClassC> c;
  String d;
}
class ClassC {
  Integer a;
  Integer b;
}
Writable (serialized row): Text('a=av:b=bv 23 1:2=4:5 abcd') or BytesWritable(\x3F\x64\x72\x00)
SerDe: deserialize (Writable to Hierarchical Object), serialize (Hierarchical Object to Writable), getOI
Hierarchical Object:
List (
  HashMap("a" -> "av", "b" -> "bv"),
  23,
  List(List(1,null),List(2,4),List(5,null)),
  "abcd"
)
ObjectInspectors:
• ObjectInspector1: getType returns struct; getStructField, getFieldOI
• ObjectInspector2: getType returns map; getMapValue, getMapValueOI
• ObjectInspector3: getType returns string (for the String Object "av")
TypeInfo: struct of (map of string to string, int, list, string)
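The round trip between the Text row and the hierarchical object can be sketched in plain Python. The delimiters are read off the sample row and are an assumption: a space between fields, ':' between map entries or list items, '=' between a key and value or between the two struct fields. The function names are hypothetical and this is not Hive's SerDe API; a struct with a missing second field deserializes to None, matching List(1,null).

```python
def parse_map(text):
    """'a=av:b=bv' -> {'a': 'av', 'b': 'bv'}"""
    return dict(entry.split("=", 1) for entry in text.split(":"))


def parse_struct_list(text):
    """'1:2=4:5' -> [[1, None], [2, 4], [5, None]]"""
    items = []
    for item in text.split(":"):
        parts = item.split("=")
        first = int(parts[0])
        second = int(parts[1]) if len(parts) > 1 else None
        items.append([first, second])
    return items


def deserialize(row):
    """Turn one text row into a nested list/dict object."""
    a, b, c, d = row.split(" ")
    return [parse_map(a), int(b), parse_struct_list(c), d]


def serialize(obj):
    """Inverse of deserialize: nested object back to one text row."""
    a, b, c, d = obj
    map_text = ":".join(f"{k}={v}" for k, v in a.items())
    list_text = ":".join(str(x) if y is None else f"{x}={y}" for x, y in c)
    return " ".join([map_text, str(b), list_text, d])


row = "a=av:b=bv 23 1:2=4:5 abcd"
obj = deserialize(row)
print(obj)
print(serialize(obj))
```

In Hive the operators would not touch this nested object directly. They would go through the ObjectInspector returned by getOI, which is what lets the same operator code work over Java objects, standard objects, and lazy objects.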