Paula Ta-Shma, IBM Haifa Research
Big Data and Map Reduce
Paula Ta-Shma
IBM Haifa Research
Storage Systems
1/5/2013
“Advanced Topics on Storage Systems” - Spring 2013, Tel-Aviv University
http://www.eng.tau.ac.il/semcom
Outline
• Historical Context behind Map Reduce
• What is Big Data?
• The Map Reduce Framework
• Connections with Storage Cloud
Historical Context
• Relational Database Management Systems (RDBMS)
– Researched in the 70s, products in the 80s and beyond
– Relational (tabular) data model
– Query Language: SQL
- Efficient Query Processing: Indexing, Query Evaluation Strategies
– Transactions, Consistency
– Concurrency Control
– Security and Authorization
– Can be implemented on top of file systems
- Provide a higher level of abstraction and functionality than file systems
• Example Use Cases
– Banking, Stock Trading, Personnel Management, Inventory Management, Manufacturing Data, etc.
– The list is very long

Accounts
Name     Balance ($)
Bob         5000.00
Alice       -389.27
Fred        -800.00
Alice    2980000.00

SELECT Name
FROM Accounts
GROUP BY Name
HAVING SUM(Balance) < 0
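
The Accounts query from this slide can be tried out directly. The sketch below is illustrative only: it assumes Python's built-in sqlite3 module as a stand-in RDBMS, and the helper name overdrawn_names and the column types are assumptions, not part of the original slides. It loads the four rows shown in the Accounts table and runs the same SELECT.

```python
import sqlite3

# Rows copied from the Accounts table on the slide.
ACCOUNTS = [
    ("Bob", 5000.00),
    ("Alice", -389.27),
    ("Fred", -800.00),
    ("Alice", 2980000.00),
]

def overdrawn_names(rows):
    """Return names whose balances sum to less than zero (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    # Column types are an assumption; the slide only gives column names.
    conn.execute("CREATE TABLE Accounts (Name TEXT, Balance REAL)")
    conn.executemany("INSERT INTO Accounts VALUES (?, ?)", rows)
    cursor = conn.execute(
        "SELECT Name FROM Accounts GROUP BY Name HAVING SUM(Balance) < 0"
    )
    names = [name for (name,) in cursor.fetchall()]
    conn.close()
    return names

if __name__ == "__main__":
    print(overdrawn_names(ACCOUNTS))  # prints ['Fred']
```

Only Fred is returned: Alice has one negative account, but GROUP BY sums her two balances before HAVING filters, and the total is positive.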
Historical Context Cont.
• Business Intelligence
– Extract value from large amounts of data
– Banking use case example
- Identify and actively retain and pursue profitable customers
- Analyze the performance of sales personnel, tellers and account managers
- etc.
– Massive query processing to analyze data across multiple dimensions
- Requires read access to large amounts of data
- Typically long running queries, can interfere with transactions
– Work on a snapshot of data
- Deployed as physically separate Data Warehousing systems
- Mission critical
- Data warehousing products appeared in the early 90s
New Requirements in Internet Era




• Massive amounts of data
• Unstructured (e.g. text) and semi-structured (e.g. XML) data
• Analysis capabilities beyond what is possible in SQL
• LOW COST ($$$)
– Hardware
- Capital Expenses: Use commodity hardware, scale out instead of scale up.
- Operational Expenses: Make it easy to manage hardware which will fail often. Treat the failure case as the norm, with automatic failover.
– Software
- Capital Expenses: DBMS software is complex and expensive; transactions, concurrency control, etc. are not needed for many tasks.
- Operational Expenses: Make it easy to write ‘queries’ on a distributed infrastructure.
Map Reduce
• Invented by Google
– Inspired by the map and reduce functions of functional programming languages
– Seminal paper: Dean, Jeffrey & Ghemawat, Sanjay (OSDI 2004), "MapReduce: Simplified Data Processing on Large Clusters"
• Used at Google to completely regenerate Google's index of the World Wide Web
– It replaced the old ad hoc programs that updated the index and ran the various analyses.
• Uses:
– distributed pattern-based searching, distributed sorting, web link-graph reversal, term-vector per host, web access log stats, inverted index construction, document clustering, machine learning, statistical machine translation
• Hadoop:
– Open source implementation that follows Google’s published design
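
To make the map and reduce idea concrete, here is a minimal single-process sketch of inverted index construction, one of the uses listed above. It is not Google's or Hadoop's implementation: the function names (map_phase, shuffle, reduce_phase, run_job) are hypothetical, and everything runs in memory, whereas a real framework distributes the map and reduce calls across many machines.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Map: emit a (word, doc_id) pair for every word in one document."""
    for word in text.lower().split():
        yield word, doc_id

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(word, doc_ids):
    """Reduce: collapse the values for one key into a sorted, de-duplicated list."""
    return word, sorted(set(doc_ids))

def run_job(documents):
    """Run map, shuffle, and reduce in sequence over a dict of doc_id -> text."""
    pairs = [
        pair
        for doc_id, text in documents.items()
        for pair in map_phase(doc_id, text)
    ]
    groups = shuffle(pairs)
    return dict(reduce_phase(word, ids) for word, ids in groups.items())

if __name__ == "__main__":
    docs = {"d1": "big data", "d2": "big storage"}
    print(run_job(docs))
```

The user writes only map_phase and reduce_phase; the grouping step in between is what the framework provides, which is why the same two-function pattern covers tasks as different as sorting and log statistics.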
Source: IBM InfoSphere BigInsights slides, by Bruce Brown
https://www-950.ibm.com/events/wwe/grp/grp004.nsf/vLookupPDFs/Bruce%20Brown%20-%20BigInsights-1-16-12-external/$file/Bruce%20Brown%20-%20BigInsights-1-16-12-external.pdf
Map Reduce In Detail
• Map Reduce material taken from Distributed Systems Course, MapReduce lecture by Paul Krzyzanowski
– http://www.seas.gwu.edu/~gparmer/courses/f12_3411/distrib-5-mapreduce.pdf
HDFS Architecture
Source: http://hadoop.apache.org/docs/r1.0.4/hdfs_design.html
Integrating Hadoop with Object Storage
• Implement Hadoop FileSystem API
• Leave MapReduce framework unchanged
– => no changes needed for user applications
– => work with Hadoop based technologies
- Hive, Pig Latin, HBase, Jaql, and others
[Diagram: Applications (HBase, Jaql, …) run on Hadoop Map Reduce, which invokes the Hadoop FileSystem API (create, open, close, read, write, seek, get block locations, …). The API is implemented by the Hadoop Distributed File System (HDFS), S3FileSystem, and CDMI FileSystem.]
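
The layering on this slide can be sketched in a few lines. The code below is an illustration under stated assumptions, not the real Hadoop FileSystem API (which is a Java class with many more operations): the names FileSystem, BlockFS, ObjectStoreFS, and count_words are hypothetical, and both back ends are simple in-memory stand-ins. It shows why a job written against the interface needs no changes when the storage underneath is swapped.

```python
from abc import ABC, abstractmethod

class FileSystem(ABC):
    """Hypothetical, heavily reduced stand-in for a file system interface."""

    @abstractmethod
    def write(self, path: str, data: str) -> None: ...

    @abstractmethod
    def read(self, path: str) -> str: ...

class BlockFS(FileSystem):
    """Stand-in for an HDFS-like back end: files keyed by full path."""

    def __init__(self):
        self._files = {}

    def write(self, path, data):
        self._files[path] = data

    def read(self, path):
        return self._files[path]

class ObjectStoreFS(FileSystem):
    """Stand-in for an object store back end: objects keyed by (bucket, key)."""

    def __init__(self):
        self._objects = {}

    @staticmethod
    def _split(path):
        # Treat the first path component as the bucket, the rest as the object key.
        bucket, _, key = path.lstrip("/").partition("/")
        return bucket, key

    def write(self, path, data):
        self._objects[self._split(path)] = data

    def read(self, path):
        return self._objects[self._split(path)]

def count_words(fs: FileSystem, path: str) -> dict:
    """A 'job' that only talks to the interface, never to a concrete back end."""
    counts = {}
    for word in fs.read(path).split():
        counts[word] = counts.get(word, 0) + 1
    return counts

if __name__ == "__main__":
    for fs in (BlockFS(), ObjectStoreFS()):
        fs.write("/logs/day1.txt", "get put get")
        print(type(fs).__name__, count_words(fs, "/logs/day1.txt"))
```

Because count_words depends only on the abstract interface, adding a third back end means writing one more subclass and leaving the job code alone, which is the point of the "no changes needed for user applications" bullet.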
Amazon Elastic Map Reduce
Source: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-what-is-emr.html
The End