HBase and Bigtable Storage
Xiaoming Gao
Judy Qiu
Hui Li
Outline
• HBase and Bigtable Storage
• HBase Use Cases
• Hands-on: Load a CSV file into an HBase table with MapReduce
• Demo Search Engine System with MapReduce
Technologies (Hadoop/HDFS/HBase/Pig)
HBase Introduction
• HBase is an open source, distributed, sorted
map modeled after Google’s BigTable
• HBase is built on Hadoop:
– Fault tolerance
– Scalability
– Batch processing with MapReduce
• HBase uses HDFS for storage
HBase Cluster Architecture
• Tables split into regions and served by region servers
• Regions vertically divided by column families into “stores”
• Stores saved as files on HDFS
Data Model: A Big Sorted Map
• A Big Sorted Map
• Not a relational database; no SQL
• Tables consist of rows, each of which has a primary key (row key)
• Each row has any number of columns:
SortedMap<RowKey, List<SortedMap<Column, List<(Value, Timestamp)>>>>
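As a rough illustration of this logical model (plain Java collections, not the HBase API; the class name is hypothetical), the nested sorted-map view looks like this:

import java.util.NavigableMap;
import java.util.TreeMap;

// Conceptual sketch only: row key -> column -> (timestamp -> value),
// with every level kept in sorted order, i.e. "a big sorted map".
public class BigSortedMapSketch {
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, byte[]>>> rows =
            new TreeMap<>();
}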
HBase vs. RDBMS
• Data layout: row-oriented (RDBMS) vs. column-family-oriented (HBase)
• Indexes: on rows and columns (RDBMS) vs. on the row key only (HBase)
• Hardware requirement: large arrays of fast, expensive disks (RDBMS) vs. commodity hardware (HBase)
• Max data size: TBs (RDBMS) vs. ~1 PB (HBase)
• Read/write throughput: thousands of queries/second (RDBMS) vs. millions of queries/second (HBase)
• Query language: SQL with joins and grouping (RDBMS) vs. Get/Put/Scan (HBase)
• Ease of use: relational data modeling, easy to learn (RDBMS) vs. a sorted map with a significant learning curve, though communities and tools are growing (HBase)
When to Use HBase
• Dataset Scale
– Indexing huge numbers of web pages or large genome data sets
– Data mining over large social media data sets
• Read/Write Scale
– Reads and writes are distributed as tables are distributed across nodes
– Writes are extremely fast and require no index updates
• Batch Analysis
– Queries that would be massive and convoluted in SQL can be executed in parallel as MapReduce jobs
Use Cases
• Facebook Analytics
– Real-time counters of URLs shared and preferred links
• Twitter
– 25 TB of messages every month
• Mozilla
– Stores crash reports, 2.5 million per day
Programming with HBase
1. HBase shell
– Scan, List, Create
2. Native Java API (see the sketch after this list)
– Get(byte[] row, byte[] column, long ts, int version)
3. Non-Java Clients
– Thrift server (Ruby, C++, PHP)
– REST server
4. HBase MapReduce API
– hbase.mapreduce.TableMapper;
– hbase.mapreduce.TableReducer;
5. High Level Interface
– Pig, Hive
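A minimal sketch of the native Java client API (item 2 above), using the classic pre-1.0 HTable/Put/Get calls that the rest of this tutorial relies on; the table name 'csv2hbase' and column family 'f1' match the hands-on example later:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class NativeApiSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "csv2hbase");   // classic (pre-1.0) client API
        Put put = new Put(Bytes.toBytes("row1"));       // write one cell
        put.add(Bytes.toBytes("f1"), Bytes.toBytes("q1"), Bytes.toBytes("v1"));
        table.put(put);
        Get get = new Get(Bytes.toBytes("row1"));       // read it back
        Result result = table.get(get);
        System.out.println(Bytes.toString(result.getValue(Bytes.toBytes("f1"), Bytes.toBytes("q1"))));
        table.close();
    }
}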
Hands-on HBase MapReduce Programming
• HBase MapReduce API
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
Hands-on: load CSV file into HBase
table with MapReduce
• CSV stands for comma-separated values
• CSV is a common file format in many scientific fields, such as flow cytometry in bioinformatics
Hands-on: load CSV file into HBase
table with MapReduce
• Main entry point of the program
public static void main(String[] args) throws Exception {
Configuration conf = HBaseConfiguration.create();
String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
if(otherArgs.length != 2) {
System.err.println("Wrong number of arguments: " + otherArgs.length);
System.err.println("Usage: <csv file> <hbase table name>");
System.exit(-1);
}//end if
Job job = configureJob(conf, otherArgs);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}//main
Hands-on: load CSV file into HBase
table with MapReduce
• Configure HBase MapReduce job
public static Job configureJob(Configuration conf, String [] args) throws IOException {
Path inputPath = new Path(args[0]);
String tableName = args[1];
Job job = new Job(conf, tableName);
job.setJarByClass(CSV2HBase.class);
FileInputFormat.setInputPaths(job, inputPath);
job.setInputFormatClass(TextInputFormat.class);
job.setMapperClass(CSV2HBase.class);
TableMapReduceUtil.initTableReducerJob(tableName, null, job);
job.setNumReduceTasks(0);
return job;
}//public static Job configure
Hands-on: load CSV file into HBase
table with MapReduce
• The map function
public void map(LongWritable key, Text line, Context context) throws IOException {
  // Input is a CSV file. Each map() call receives one line; the key is the byte offset of the line.
  // Each line is comma-delimited: row,family,qualifier,value
  String[] values = line.toString().split(",");
  if (values.length != 4) {
    return;
  }
  byte[] row = Bytes.toBytes(values[0]);
  byte[] family = Bytes.toBytes(values[1]);
  byte[] qualifier = Bytes.toBytes(values[2]);
  byte[] value = Bytes.toBytes(values[3]);
  Put put = new Put(row);
  put.add(family, qualifier, value);
  try {
    context.write(new ImmutableBytesWritable(row), put);
  } catch (InterruptedException e) {
    e.printStackTrace();
  }
  if (++count % checkpoint == 0) {
    context.setStatus("Emitting Put " + count);
  }
} // end map()
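The map() above uses count and checkpoint fields that are not shown on the slide; a minimal sketch of the assumed enclosing mapper class (the checkpoint value here is a placeholder):

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CSV2HBase extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    private static final int checkpoint = 100;  // assumed status-reporting interval
    private long count = 0;                     // number of Puts emitted so far
    // ... the map() method shown above goes here, along with the main() and configureJob() driver methods
}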
Hands-on: steps to load CSV file into HBase
table with MapReduce
1. Check the HBase installation in the Ubuntu sandbox
   1. http://salsahpc.indiana.edu/ScienceCloud/virtualbox_appliance_guide.html
   2. echo $HBASE_HOME
2. Start the Hadoop and HBase clusters
   1. start-all.sh
   2. start-hbase.sh
3. Create an HBase table with the specified data schema
   1. hbase shell
   2. create 'csv2hbase', 'f1'
4. Compile the program with Ant
   1. cd hbasetutorial
   2. ant
5. Upload input.csv into HDFS
   1. hadoop dfs -mkdir input
   2. hadoop dfs -copyFromLocal input.csv input/input.csv
6. Run the program:
   /bin/hadoop jar dist/lib/cglHBaseSummerSchool.jar iu.pti.hbaseapp.CSV2HBase input/input.csv csv2hbase
7. Check the inserted records in the HBase table
   1. hbase shell
   2. scan 'csv2hbase'
Hands-on: load CSV file into HBase
table with MapReduce
Extension: set HBase table as Input
Use TableInputFormat and TableMapReduceUtil to read an HTable as the input of a MapReduce job
public static Job configureJob(Configuration conf, String[] args) throws IOException {
  String tableName = args[0];      // table to read from (also used to name the index tables)
  String columnFamily = args[1];   // column family that holds the indexed fields
  conf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(new Scan()));
  conf.set(TableInputFormat.INPUT_TABLE, tableName);
  conf.set("index.tablename", tableName);
  conf.set("index.familyname", columnFamily);
  String[] fields = new String[args.length - 2];
  for (int i = 0; i < fields.length; i++) {
    fields[i] = args[i + 2];
  }
  conf.setStrings("index.fields", fields);
  Job job = new Job(conf, tableName);
  job.setJarByClass(IndexBuilder.class);
  job.setMapperClass(Map.class);
  job.setNumReduceTasks(0);
  job.setInputFormatClass(TableInputFormat.class);
  job.setOutputFormatClass(MultiTableOutputFormat.class);
  return job;
}
Extension: write output to HBase table
public static class Map extends
    Mapper<ImmutableBytesWritable, Result, ImmutableBytesWritable, Writable> {
  private byte[] family;
  private HashMap<byte[], ImmutableBytesWritable> indexes;

  protected void map(ImmutableBytesWritable rowKey, Result result, Context context)
      throws IOException, InterruptedException {
    for (java.util.Map.Entry<byte[], ImmutableBytesWritable> index : indexes.entrySet()) {
      byte[] qualifier = index.getKey();
      ImmutableBytesWritable tableName = index.getValue();
      byte[] value = result.getValue(family, qualifier);
      if (value != null) {
        // Index row key = cell value; the indexed cell points back to the original row key.
        Put put = new Put(value);
        put.add(INDEX_COLUMN, INDEX_QUALIFIER, rowKey.get());
        context.write(tableName, put);
      }
    }
  }
}
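The family and indexes fields above are initialized elsewhere; a minimal sketch of an assumed setup() method (inside the Map class above) that reads the values placed in the configuration by configureJob(), using a hypothetical '<table>-<field>' naming scheme for the index tables:

@Override
protected void setup(Context context) throws IOException, InterruptedException {
  Configuration configuration = context.getConfiguration();
  String tableName = configuration.get("index.tablename");
  String[] fields = configuration.getStrings("index.fields");
  family = Bytes.toBytes(configuration.get("index.familyname"));
  indexes = new HashMap<byte[], ImmutableBytesWritable>();
  for (String field : fields) {
    // one index table per indexed field (hypothetical name: <table>-<field>)
    indexes.put(Bytes.toBytes(field),
        new ImmutableBytesWritable(Bytes.toBytes(tableName + "-" + field)));
  }
}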
Big Data Challenge
• Peta = 10^15
• Tera = 10^12
• Giga = 10^9
• Mega = 10^6
Search Engine System with
MapReduce Technologies
1. Search Engine System for Summer School
2. Gives an example of how to use MapReduce technologies to address the big data challenge
3. Uses Hadoop/HDFS/HBase/Pig
4. Indexed 656K web pages (540 MB in size) selected from the ClueWeb09 data set
5. Calculates ranking values for 2 million web sites
Architecture for SESSS
[Architecture diagram: PHP-script Web UI on the Apache server of the Salsa Portal; Thrift client and Thrift server; Apache Lucene inverted indexing system; ranking system driven by Pig/Hive scripts; HBase tables (1. inverted index table, 2. page rank table) on a Hadoop cluster on FutureGrid]
Demo Search Engine System for
Summer School
build-index-demo.exe (build index with HBase)
pagerank-demo.exe (compute page rank with Pig)
http://salsahpc.indiana.edu/sesss/index.php
High Level Language: Pig Latin
Hui Li
Judy Qiu
Some material adapted from slides by Adam Kawa, the 3rd meeting of WHUG, June 21, 2012
What is Pig
• A framework for analyzing large unstructured and semi-structured data sets on top of Hadoop
– The Pig engine parses and compiles Pig Latin scripts into MapReduce jobs that run on Hadoop
– Pig Latin is a simple but powerful data-flow language similar to scripting languages
Motivation of Using Pig
• Faster development
– Fewer lines of code (writing Pig is like writing SQL queries)
– Code re-use (Pig libraries, Piggybank)
• One test: find the top 5 most frequent words
– 10 lines of Pig Latin vs. 200 lines of Java
– 15 minutes in Pig Latin vs. 4 hours in Java
[Bar charts comparing Java and Pig Latin for the word-count test: lines of code and development time in minutes]
Word Count using MapReduce
Pig Performance vs. MapReduce
• PigMix: a benchmark suite for comparing Pig and raw MapReduce performance
Word Count using Pig
Lines = LOAD 'input/hadoop.log' AS (line: chararray);
Words = FOREACH Lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
Groups = GROUP Words BY word;
Counts = FOREACH Groups GENERATE group, COUNT(Words) AS cnt;
Results = ORDER Counts BY cnt DESC;
Top5 = LIMIT Results 5;
STORE Top5 INTO '/output/top5words';
Who uses Pig for What
• 70% of production jobs at Yahoo (tens of thousands per day)
• Twitter, LinkedIn, eBay, AOL, …
• Used to
– Process web logs
– Build user behavior models
– Process images
– Build maps of the web
– Do research on raw data sets
Pig Tutorial
• Accessing Pig
• Basic Pig knowledge: (Word Count)
– Pig Data Types
– Pig Operations
– How to run Pig Scripts
• Advanced Pig features: (Kmeans Clustering)
– Embedding Pig within Python
– User Defined Function
Accessing Pig
• Accessing approaches:
– Batch mode: submit a script directly
– Interactive mode: Grunt, the Pig shell
– PigServer Java class, a JDBC-like interface (see the sketch below)
• Execution modes:
– Local mode: pig -x local
– MapReduce mode: pig -x mapreduce
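A minimal sketch of the PigServer interface, assuming the word-count script from the earlier slide and a local-mode run (output path is illustrative):

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigServerSketch {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.LOCAL);   // or ExecType.MAPREDUCE
        pig.registerQuery("Lines = LOAD 'input.txt' AS (line:chararray);");
        pig.registerQuery("Words = FOREACH Lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        pig.registerQuery("Groups = GROUP Words BY word;");
        pig.registerQuery("Counts = FOREACH Groups GENERATE group, COUNT(Words);");
        pig.store("Counts", "wordcount_output");         // runs the pipeline and writes the result
    }
}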
Pig Data Types
• Concepts: fields, tuples, bags, relations
– A field is a piece of data
– A tuple is an ordered set of fields
– A bag is a collection of tuples
– A relation is a bag
• Simple types
– int, long, float, double, boolean, null, chararray, bytearray
• Complex types (see the sketch after this slide)
– Tuple → a row in a database
• (0002576169, Tome, 21, "Male")
– Data Bag → a table or view in a database
{(0002576169, Tome, 21, "Male"),
(0002576170, Mike, 20, "Male"),
(0002576171, Lucy, 20, "Female"), … }
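A minimal sketch of how tuples and bags look in Pig's Java API (the classes used later when writing UDFs); the sample values mirror the slide above:

import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class PigTypesSketch {
    public static void main(String[] args) throws Exception {
        // A tuple is an ordered set of fields, like a row.
        Tuple row = TupleFactory.getInstance().newTuple();
        row.append("0002576169");
        row.append("Tome");
        row.append(21);
        row.append("Male");

        // A bag is a collection of tuples, like a table or view.
        DataBag bag = BagFactory.getInstance().newDefaultBag();
        bag.add(row);
        System.out.println(bag);   // prints the bag in {(...)} form
    }
}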
Pig Operations
• Loading data
– LOAD loads input data
– Lines = LOAD 'input/access.log' AS (line: chararray);
• Projection
– FOREACH … GENERATE … (similar to SELECT)
– Takes a set of expressions and applies them to every record
• Grouping
– GROUP collects together records with the same key
• Dump/Store
– DUMP displays results on the screen; STORE saves results to the file system
• Aggregation
– AVG, COUNT, COUNT_STAR, MAX, MIN, SUM
How to run Pig Latin scripts
• Local mode
– The local host and local file system are used
– Neither Hadoop nor HDFS is required
– Useful for prototyping and debugging
• MapReduce mode
– Runs on a Hadoop cluster and HDFS
• Batch mode: run a script directly
– pig -x local my_pig_script.pig
– pig -x mapreduce my_pig_script.pig
• Interactive mode: use the Pig shell (Grunt) to run scripts
– grunt> Lines = LOAD '/input/input.txt' AS (line:chararray);
– grunt> Unique = DISTINCT Lines;
– grunt> DUMP Unique;
Hands-on: Word Count using Pig Latin
1. cd pigtutorial/pig-hands-on/
2. tar -xf pig-wordcount.tar
3. cd pig-wordcount
4. pig -x local
5. grunt> Lines = LOAD 'input.txt' AS (line: chararray);
6. grunt> Words = FOREACH Lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
7. grunt> Groups = GROUP Words BY word;
8. grunt> counts = FOREACH Groups GENERATE group, COUNT(Words);
9. grunt> DUMP counts;
Sample: Kmeans using Pig Latin
A method of cluster analysis which aims to partition n
observations into k clusters in which each observation
belongs to the cluster with the nearest mean.
Assignment step: Assign each observation to the cluster
with the closest mean
Update step: Calculate the new means as the centroids of the observations in each cluster.
Reference: http://en.wikipedia.org/wiki/K-means_clustering
Kmeans Using Pig Latin
PC = Pig.compile("""register udf.jar
DEFINE find_centroid FindCentroid('$centroids');
raw = load 'student.txt' as (name:chararray, age:int, gpa:double);
centroided = foreach raw generate gpa, find_centroid(gpa) as centroid;
grouped = group centroided by centroid;
result = foreach grouped generate group, AVG(centroided.gpa);
store result into 'output';
""")
Kmeans Using Pig Latin
# MAX_ITERATION, tolerance, v (the number of centroids), initial_centroids and
# last_centroids are assumed to be defined earlier in the script (not shown on the slide)
while iter_num < MAX_ITERATION:
    PCB = PC.bind({'centroids': initial_centroids})
    results = PCB.runSingle()
    iter = results.result("result").iterator()
    centroids = [None] * v
    distance_move = 0.0
    # get the new centroids of this iteration and how far they moved since the last iteration
    for i in range(v):
        tuple = iter.next()
        centroids[i] = float(str(tuple.get(1)))
        distance_move = distance_move + fabs(last_centroids[i] - centroids[i])
    distance_move = distance_move / v
    if distance_move < tolerance:
        converged = True
        break
    ……
Embedding Python scripts with Pig Statements
• Pig Latin does not support flow-control statements: if/else, while loops, for loops, etc.
• The Pig embedding API can leverage all language features provided by Python, including control flow:
– Loops and exit criteria
– Similar to the database embedding API
– Easier parameter passing
• JavaScript is available as well
• The framework is extensible; any JVM implementation of a language could be integrated
User Defined Function
• What is a UDF?
– A way to perform an operation on a field or fields
– Called from within a Pig script
– Currently all written in Java (see the sketch below)
• Why use a UDF?
– You need to do more than grouping or filtering
– Filtering itself is actually a UDF
– You may be more comfortable in Java than in SQL/Pig Latin
• Example: registering and defining the UDF used in the Kmeans script:
P = Pig.compile("""register udf.jar
DEFINE find_centroid FindCentroid('$centroids');
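A minimal sketch of what a FindCentroid-style EvalFunc could look like; this is an illustrative guess, not necessarily the tutorial's udf.jar implementation. It assumes the centroids arrive as a colon-separated constructor argument such as '0.0:1.0:2.0:3.0' and returns the centroid nearest to the gpa value:

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class FindCentroid extends EvalFunc<Double> {
    private final double[] centroids;

    public FindCentroid(String initialCentroids) {
        String[] parts = initialCentroids.split(":");
        centroids = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            centroids[i] = Double.parseDouble(parts[i]);
        }
    }

    @Override
    public Double exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        double gpa = ((Number) input.get(0)).doubleValue();
        double best = centroids[0];
        for (double c : centroids) {
            // keep the centroid with the smallest distance to this gpa
            if (Math.abs(c - gpa) < Math.abs(best - gpa)) {
                best = c;
            }
        }
        return best;
    }
}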
Hands-on Run Pig Latin Kmeans
1. export PIG_CLASSPATH=/opt/pig/lib/jython-2.5.0.jar
2. hadoop dfs -copyFromLocal input.txt ./input.txt
3. pig -x mapreduce kmeans.py
4. pig -x local kmeans.py
Hands-on Run Pig Latin Kmeans
2012-07-14 14:51:24,636 [main] INFO org.apache.pig.scripting.BoundScript - Query to run:
register udf.jar
DEFINE find_centroid FindCentroid('0.0:1.0:2.0:3.0');
raw = load 'student.txt' as (name:chararray, age:int, gpa:double);
centroided = foreach raw generate gpa, find_centroid(gpa) as centroid;
grouped = group centroided by centroid;
result = foreach grouped generate group, AVG(centroided.gpa);
store result into 'output';
Input(s): Successfully read 10000 records (219190 bytes) from:
"hdfs://iw-ubuntu/user/developer/student.txt"
Output(s): Successfully stored 4 records (134 bytes) in:
"hdfs://iw-ubuntu/user/developer/output“
last centroids: [0.371927835052,1.22406743491,2.24162171881,3.40173705722]
References:
1. http://pig.apache.org (Pig official site)
2. http://en.wikipedia.org/wiki/K-means_clustering
3. Docs: http://pig.apache.org/docs/r0.9.0
4. Papers: http://wiki.apache.org/pig/PigTalksPapers
5. http://en.wikipedia.org/wiki/Pig_Latin
6. Slides by Adam Kawa, the 3rd meeting of WHUG, June 21, 2012
• Questions?
Acknowledgement