Hadoop and its Real-world Applications

Xiaoxiao Shi, Guan Wang
Experience: worked at Yahoo! in summer 2010,
developing Hadoop-based machine learning models.
Contents
• Motivation of Hadoop
• History of Hadoop
• The current applications of Hadoop
• Programming examples
• Research with Hadoop
• Conclusions
Motivation of Hadoop
• How do you scale up applications?
– Run jobs processing 100’s of terabytes of data
– Takes 11 days to read on 1 computer
• Need lots of cheap computers
– Fixes speed problem (15 minutes on 1000 computers), but…
– Reliability problems
• In large clusters, computers fail every day
• Cluster size is not fixed
• Need common infrastructure
– Must be efficient and reliable
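The figures above can be checked with a back-of-the-envelope calculation. The sketch below assumes the work splits perfectly evenly across machines with no coordination overhead, which is an idealization rather than a measurement. The class and method names are hypothetical.

```java
// Back-of-the-envelope check of the numbers on the slide.
// Assumption: work splits evenly and there is no coordination overhead.
class ScaleOutEstimate {
    // Minutes needed when `singleMachineDays` of work is spread over `machines`.
    static double minutesOnCluster(double singleMachineDays, int machines) {
        if (machines < 1) {
            throw new IllegalArgumentException("machines must be >= 1");
        }
        double totalMinutes = singleMachineDays * 24 * 60;
        return totalMinutes / machines;
    }

    public static void main(String[] args) {
        // 11 days of reading on one computer, spread over 1000 computers.
        System.out.println(minutesOnCluster(11, 1000) + " minutes");
    }
}
```

Under these assumptions, 11 days comes to roughly 16 minutes on 1000 machines, in line with the "15 minutes" quoted above.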
Motivation of Hadoop
• Open Source Apache Project
• Hadoop Core includes:
– Distributed File System - distributes data
– Map/Reduce - distributes application
• Written in Java
• Runs on
– Linux, Mac OS/X, Windows, and Solaris
– Commodity hardware
Fun Fact of Hadoop
"The name my kid gave a stuffed yellow elephant. Short, relatively easy to
spell and pronounce, meaningless, and not used elsewhere: those are my
naming criteria. Kids are good at generating such. Googol is a kid’s term."
---- Doug Cutting, Hadoop project creator
History of Hadoop
• 2004: “Map-reduce” (“It is an important technique!”)
• Doug Cutting extended Apache Nutch
• The great journey begins…
History of Hadoop
• Yahoo! became the primary contributor in 2006
History of Hadoop
• Yahoo! deployed large scale science clusters in 2007.
• Tons of Yahoo! Research papers emerge:
– WWW
– CIKM
– SIGIR
– VLDB
– ……
• Yahoo! began running major production jobs in Q1 2008.
• Nowadays…
Nowadays…
• When you visit Yahoo!, you are interacting with data processed with Hadoop!
– Content Optimization
– Search Index
– Machine Learning (e.g. spam filters)
– Ads Optimization
– Content Feed Processing
Nowadays…
• Yahoo! has ~20,000 machines running Hadoop
• The largest clusters are currently 2000 nodes
• Several petabytes of user data (compressed, unreplicated)
• Yahoo! runs hundreds of thousands of jobs every month
Nowadays…
• Who uses Hadoop?
• Amazon/A9
• AOL
• Facebook
• Fox Interactive Media
• Google
• IBM
• New York Times
• Powerset (now Microsoft)
• Quantcast
• Rackspace/Mailtrust
• Veoh
• Yahoo!
• More at http://wiki.apache.org/hadoop/PoweredBy
Nowadays (job market on Nov 15th)…
• Software Developer Intern - IBM - Somers, NY +3 locations - Agile development - Big data / Hadoop /
data analytics a plus
• Software Developer - IBM - San Jose, CA +4 locations - include Hadoop-powered distributed parallel data
processing system, big data analytics ... multiple technologies, including Hadoop
It is important
• Details…
Nowadays…
• Hadoop Core
– Distributed File System
– MapReduce Framework
• Pig (initiated by Yahoo!)
– Parallel Programming Language and Runtime
• HBase (initiated by Powerset)
– Table storage for semi-structured data
• Zookeeper (initiated by Yahoo!)
– Coordinating distributed systems
• Hive (initiated by Facebook)
– SQL-like query language and metastore
HDFS
Hadoop's Distributed File System (HDFS) is designed to reliably store
very large files across machines in a large cluster. It is
inspired by the Google File System. HDFS stores each file as a
sequence of blocks; all blocks in a file except the last are the
same size. Blocks belonging to a file are replicated for fault
tolerance. The block size and replication factor are configurable
per file. Files in HDFS are "write once" and have strictly one
writer at any time.
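To make the block layout concrete, the sketch below cuts a file length into fixed-size blocks, where only the last block may be shorter. It is plain Java for illustration and does not call the HDFS API. The class name, the method name, and the 300 MB / 128 MB figures in `main` are assumptions chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: mimics how a file is cut into equal-sized blocks.
// This is not the HDFS implementation.
class BlockLayoutSketch {
    // Returns the size in bytes of each block for a file of `fileLength` bytes.
    static List<Long> blockSizes(long fileLength, long blockSize) {
        if (fileLength < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("need fileLength >= 0 and blockSize > 0");
        }
        List<Long> sizes = new ArrayList<>();
        long remaining = fileLength;
        while (remaining > 0) {
            long size = Math.min(blockSize, remaining); // last block may be shorter
            sizes.add(size);
            remaining -= size;
        }
        return sizes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // A 300 MB file with 128 MB blocks: two full blocks and one 44 MB block.
        System.out.println(blockSizes(300 * mb, 128 * mb));
    }
}
```

With a replication factor of, say, 3, each of these blocks would be stored on three different datanodes.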
Hadoop Distributed File System – Goals:
• Store large data sets
• Cope with hardware failure
• Emphasize streaming data access
Typical Hadoop Structure
• Commodity hardware
– Linux PCs with 4 local disks
• Typically in a 2-level architecture
– 40 nodes/rack
– Uplink from rack is 8 gigabit
– Rack-internal is 1 gigabit all-to-all
Hadoop structure
• Single namespace for entire cluster
– Managed by a single namenode.
– Files are single-writer and append-only.
– Optimized for streaming reads of large files.
• Files are broken into large blocks.
– Typically 128 MB
– Replicated to several datanodes, for reliability
• Client talks to both namenode and datanodes
– Data is not sent through the namenode.
– Throughput of file system scales nearly linearly with
the number of nodes.
• Access from Java, C, or command line.
Hadoop Structure
• Java and C++ APIs
– Java works with objects, C++ with bytes
• Each task can process data sets larger than RAM
• Automatic re-execution on failure
– In a large cluster, some nodes are always slow or
flaky
– Framework re-executes failed tasks
• Locality optimizations
– Map-Reduce queries HDFS for locations of input data
– Map tasks are scheduled close to the inputs when
possible
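The idea of automatic re-execution can be sketched as a simple retry loop. This is a minimal illustration, not Hadoop's scheduler: the class name, the method name, and the simulated flaky task are all hypothetical, and the real framework also decides which node to retry on.

```java
import java.util.function.Supplier;

// Illustrative retry loop; Hadoop's actual task scheduling is far more involved.
class RetrySketch {
    // Runs `task`, re-executing it after a failure, up to `maxAttempts` times.
    static <T> T runWithRetries(Supplier<T> task, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.get();
            } catch (RuntimeException e) {
                last = e; // remember the failure and try again
            }
        }
        throw last; // every attempt failed
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Hypothetical flaky task: fails twice, then succeeds.
        String result = runWithRetries(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("simulated node failure");
            }
            return "done";
        }, 4);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```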
Example of Hadoop Programming
• Word Count:
• “I like parallel computing. I also took courses
on parallel computing… …”
– Parallel: 2
– Computing: 2
– I: 2
– Like: 1
– ……
Example of Hadoop Programming
• Intuition: design <key, value>
• Assume each node will process a paragraph…
• Map:
– What is the key?
– What is the value?
• Reduce:
– What to collect?
– What to reduce?
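One possible answer to the questions above: in Map, the key is a word and the value is 1. In Reduce, collect all values emitted for the same word and sum them. The sketch below simulates these steps in plain Java without any Hadoop classes. All names are hypothetical, and lower-casing and splitting on non-letters are assumptions made for the example.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of the map and reduce steps for word count.
// No Hadoop classes are used; all names here are hypothetical.
class WordCountSketch {
    // Map: one paragraph in, a list of <word, 1> pairs out.
    static List<Map.Entry<String, Integer>> map(String paragraph) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : paragraph.toLowerCase().split("[^a-z]+")) {
            if (!token.isEmpty()) {
                pairs.add(new SimpleEntry<>(token, 1));
            }
        }
        return pairs;
    }

    // Reduce: all values collected for one key in, their sum out.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    // Driver: map each paragraph, group values by key, then reduce each group.
    static Map<String, Integer> run(List<String> paragraphs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String p : paragraphs) {
            for (Map.Entry<String, Integer> pair : map(p)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList(
            "I like parallel computing.",
            "I also took courses on parallel computing.")));
    }
}
```

Each paragraph stands in for the input of one node. In a real cluster the grouping step is done by the framework between Map and Reduce.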
Word Count Example
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

// Map: emit <word, 1> for every token in the input line.
public class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable ONE = new IntWritable(1);

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> out,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      out.collect(new Text(itr.nextToken()), ONE);
    }
  }
}
Word Count Example
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

// Reduce: sum the counts collected for each word.
public class ReduceClass extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> out,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    out.collect(key, new IntWritable(sum));
  }
}
Word Count Example
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCount {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setMapperClass(MapClass.class);
    conf.setCombinerClass(ReduceClass.class); // pre-sum counts on the map side
    conf.setReducerClass(ReduceClass.class);
    FileInputFormat.setInputPaths(conf, args[0]);            // input path
    FileOutputFormat.setOutputPath(conf, new Path(args[1])); // output path
    conf.setOutputKeyClass(Text.class);          // out keys are words (strings)
    conf.setOutputValueClass(IntWritable.class); // values are counts
    JobClient.runJob(conf);
  }
}
Hadoop in Yahoo!
• Database for Search Assist™ is built using Hadoop.
• 3 years of log-data
• 20-steps of map-reduce
                    Before Hadoop    After Hadoop
Time                26 days          20 minutes
Language            C++              Python
Development Time    2-3 weeks        2-3 days
Related research on Hadoop
All just this year! 2011!
• Conference Tutorials:
– KDD Tutorial: “Modeling with Hadoop”, KDD 2011 (top conference in data mining)
– Strata Tutorial: “How to Develop Big Data Applications for Hadoop”
– OSCON Tutorial: “Introduction to Hadoop”
• Papers:
– Scalable distributed inference of dynamic user interests for behavioral targeting. KDD 2011: 114-122
– Yucheng Low, Deepak Agarwal, Alexander J. Smola: Multiple domain user personalization. KDD 2011: 123-131
– Shuang-Hong Yang, Bo Long, Alexander J. Smola, Hongyuan Zha, Zhaohui Zheng: Collaborative competitive filtering: learning recommender using context of user choice. SIGIR 2011: 295-304
– Srinivas Vadrevu, Choon Hui Teo, Suju Rajan, Kunal Punera, Byron Dom, Alexander J. Smola, Yi Chang, Zhaohui Zheng: Scalable clustering of news search results. WSDM 2011: 675-684
– Shuang-Hong Yang, Bo Long, Alexander J. Smola, Narayanan Sadagopan, Zhaohui Zheng, Hongyuan Zha: Like like alike: joint friendship and interest propagation in social networks. WWW 2011: 537-546
– Amr Ahmed, Alexander J. Smola: WWW 2011 invited tutorial overview: latent variable models on the internet. WWW (Companion Volume) 2011: 281-282
– Daniel Hsu, Nikos Karampatziakis, John Langford, Alexander J. Smola: Parallel Online Learning. CoRR abs/1103.4204 (2011)
– Neethu Mohandas, Sabu M. Thampi: Improving Hadoop Performance in Handling Small Files. ACC 2011: 187-194
– Tomasz Wiktor Wlodarczyk, Yi Han, Chunming Rong: Performance Analysis of Hadoop for Query Processing. AINA Workshops 2011: 507-513
– ……
• For more information:
– http://hadoop.apache.org/
– http://developer.yahoo.com/hadoop/
• Who uses Hadoop?
– http://wiki.apache.org/hadoop/PoweredBy