Slides for COP5992 - Florida International University

COP 6727:
Advanced Database Systems
Spring 2013
Dr. Tao Li
Florida International University
Student Self-Introduction
• Name
– I will try to remember your names. If you have a long name, please let me know what I should call you.
• Anything you want us to know
COP6727
2
Course Overview
• Meeting time
– Tuesday and Thursday, 12:30pm – 1:45pm
• Office hours:
– Thursday, 2:30pm – 4:30pm, or by appointment
• Course Webpage:
– http://www.cs.fiu.edu/~taoli/class/CAP6727S13/index.html
Course Objectives
• This is an advanced database course
– You should already have taken COP5725
• Assumes knowledge of the fundamental concepts of relational databases
• Covers the core principles and techniques of data and information management
• Discusses advanced techniques that can be applied to traditional database systems to efficiently support emerging applications
Tentative Topics
• Query processing and optimization
• Transaction management
• Database tuning
• Data stream systems
• Spatial databases
• XML
• Information retrieval and Web data management
• Scalable data processing
• Readings in recent developments in database systems and applications
– SQL vs. non-SQL database
– Nearest neighbor queries
– High-dimensional indexing
– Database retrieval and ranking
– Stream processing
– Big Data
– Incremental and online query processing
– Mobile database
Assignments and Grading
• Reading/Written Assignments
• Programming Projects
• Midterm Exam
• Final Project/Presentations
• Class attendance is mandatory.
• Evaluation will be a subjective process
– Effort is a very important component
• Regular In-class Students
– Quizzes and Class Participation: 5%
– Midterm Exam: 30%
– Final Project: 30%
– Assignments and Projects: 35%
• Online Students
– Midterm Exam: 30%
– Final Project: 30%
– Homework Assignments: 40%
Text and References
Raghu Ramakrishnan and Johannes Gehrke. Database Management Systems. Third Edition, McGraw Hill, 2003. ISBN: 0-07-246563-8. Links to Textbook Homepage.
In addition, course materials will be drawn from recent research literature.
Lectures 1 & 2
• Lectures 1 & 2: Introduction to MapReduce
(Most of the slides are adapted from Bill Graham, Spiros Papadimitriou, and Cloudera tutorials)
Outline
• Motivation for MapReduce
• What is MapReduce?
• What is Hadoop?
• What is Hive?
Motivation for MapReduce
• The Big Data
• How to handle big data?
The Big Data
• Big data is everywhere
• Documents
– Blogs (77 million Tumblr and 56.6 million WordPress blogs as of 2012), microblogs, news, reviews
• Images
– Instagram, Flickr (more than 6 billion images)
• Videos
– YouTube, all broadcast
• Others
– Maps (Google Maps)
– Human genome
– Aeronautics and space data
Another view on “big”
• 2008: Google processes 20 PB a day
• 2009: Facebook has 2.5 PB of user data + 15 TB/day
• 2009: eBay has 6.5 PB of user data + 50 TB/day
• 2011: Yahoo! has 180-200 PB of data
• 2012: Facebook ingests 500 TB/day
Why do we care about those data?
• Modeling and predicting information flow
• Recommend/predict links in social networks
• Relevance classification / information filtering
• Sentiment analysis and opinion mining
• Topic modeling and evolution
• Measuring influence in social networks
• Concept mapping
• Search
• …
Big data analysis
• Scalability (with reasonable cost)
– Algorithmic improvements
– The intuitive approach: divide and conquer
Divide and Conquer
Challenges
• Parallel processing is complicated
– How do we assign tasks to workers?
– What if we have more tasks than slots?
– What happens when tasks fail?
– How do we handle distributed synchronization?
Challenges – Cont’d
• Data storage is not trivial
– Traditional databases are not reliable enough at this scale
• Data volumes are massive
• Reliably storing PBs of data is challenging
– Disk/hardware/network failures
– The probability of a failure event increases with the number of machines
• For example:
– 1000 hosts, each with 10 disks; a disk lasts 3 years
– How many failures per day?
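A rough back-of-the-envelope answer to the question above can be sketched as follows. It assumes that failures are spread evenly over each disk's 3-year lifetime and that a year has 365 days; both are simplifying assumptions, not figures from the slides.

```python
# Back-of-the-envelope estimate; assumes failures are spread uniformly
# over each disk's lifetime (a simplification, not a measured rate).
hosts = 1000
disks_per_host = 10
disk_lifetime_days = 3 * 365  # 3 years, ignoring leap days

total_disks = hosts * disks_per_host
expected_failures_per_day = total_disks / disk_lifetime_days

print(f"{total_disks} disks, about {expected_failures_per_day:.1f} failures per day")
```

Under these assumptions the cluster loses roughly nine disks every day, which is why storage at this scale has to treat failure as the normal case.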
What is MapReduce?
• A programming model for expressing distributed computations at a massive scale
• An execution framework for organizing and performing such computations
• An open-source implementation called Hadoop
Workflow of Large Data Problem
Map
• Iterate over a large number of records
• Extract something of interest from each
Shuffle and sort intermediate results
Reduce
• Aggregate intermediate results
• Generate final output
MapReduce paradigm
• Implement two functions:
Map(k1, v1) -> list(k2, v2)
Reduce(k2, list(v2)) -> list(v3)
• Framework handles everything else*
• Values with the same key go to the same reducer
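The two signatures above can be illustrated with a small single-process simulation of word count. This is only a sketch of the programming model, not Hadoop code; the names `map_fn`, `reduce_fn`, and `run_job` and the sample documents are made up for illustration.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """Map(k1, v1) -> list(k2, v2): emit (word, 1) for every word."""
    return [(word.lower(), 1) for word in text.split()]

def reduce_fn(word, counts):
    """Reduce(k2, list(v2)) -> list(v3): sum the counts for one word."""
    return [sum(counts)]

def run_job(records):
    # Shuffle: group all intermediate values by key, so that values
    # with the same key end up in the same reduce call.
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    # Sort the keys, then reduce each group.
    return {k2: reduce_fn(k2, groups[k2]) for k2 in sorted(groups)}

# Hypothetical input records: (document id, document text).
docs = [("doc1", "the quick brown fox"), ("doc2", "the lazy dog")]
result = run_job(docs)
print(result)
```

In a real cluster, the grouping step in `run_job` is the part the framework performs across machines; the user only writes the two functions.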
MapReduce Flow
An Example
MapReduce paradigm – Cont’d
• There’s more!
• Partitioners decide which key goes to which reducer
– partition(k’, numPartitions) -> partNumber
– Divides the key space into chunks for parallel reducers
– Default is hash-based
• Combiners can combine Mapper output before sending it to the reducer
– Reduce(k2, list(v2)) -> list(v3)
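A minimal sketch of both ideas, again as a local simulation rather than Hadoop code. The hash function used here (`zlib.crc32`) is only a stand-in chosen so that results are repeatable; it is not the hash Hadoop uses. The function names and sample data are hypothetical.

```python
import zlib
from collections import defaultdict

def partition(key, num_partitions):
    """partition(k', numPartitions) -> partNumber, hash-based."""
    # crc32 is a stand-in hash, used so the result is stable across runs.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def combine(pairs):
    """Pre-aggregate one mapper's (word, count) output before the shuffle."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())

# Hypothetical output of a single word-count mapper.
mapper_output = [("a", 1), ("b", 1), ("a", 1), ("a", 1)]
combined = combine(mapper_output)  # fewer pairs to send over the network

# Route each combined pair to one of the reducers.
num_reducers = 4
buckets = defaultdict(list)
for word, count in combined:
    buckets[partition(word, num_reducers)].append((word, count))

print(combined)
```

The combiner here shrinks four intermediate pairs to two. This works for word count because addition is associative and commutative; a combiner is not safe for every reduce function.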
MapReduce Flow
MapReduce additional details
• Reduce starts after all mappers complete
• Mapper output gets written to disk
• Intermediate data can be copied sooner
• Reducer gets keys in sorted order
• Keys not sorted across reducers
• Global sort requires 1 reducer or smart partitioning
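One reading of "smart partitioning" is range partitioning: each reducer receives a contiguous range of keys, so concatenating the reducers' sorted outputs in order gives a globally sorted result. The sketch below simulates this locally; the split points and the function name `range_partition` are hypothetical.

```python
import bisect

def range_partition(key, split_points):
    """Return the index of the reducer whose key range contains `key`."""
    # split_points must be sorted. Reducer i receives the keys that are
    # >= split_points[i - 1] and < split_points[i].
    return bisect.bisect_right(split_points, key)

keys = ["pear", "apple", "zebra", "mango", "kiwi", "banana"]
split_points = ["h", "q"]  # hypothetical boundaries, giving 3 reducers

reducers = [[] for _ in range(len(split_points) + 1)]
for k in keys:
    reducers[range_partition(k, split_points)].append(k)

# Each reducer sorts only its own keys.
outputs = [sorted(r) for r in reducers]

# Concatenating the reducer outputs in order yields a global sort.
globally_sorted = [k for out in outputs for k in out]
print(globally_sorted)
```

With a hash-based partitioner the same concatenation would not be sorted, because neighboring keys land on arbitrary reducers.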
MapReduce is good at
• Embarrassingly parallel algorithms
• Summing, grouping, filtering, joining
• Off-line batch jobs on massive data sets
• Analyzing an entire large dataset
MapReduce can do
• Iterative jobs (e.g., PageRank, K-means clustering)
– Each iteration must read/write data to disk
– IO and latency cost of an iteration is high
MapReduce is not good at
• Jobs that need shared state/coordination
– Tasks are shared-nothing
– Shared state requires a scalable state store
• Low-latency jobs
• Jobs on small datasets
• Finding individual records
Summary of MapReduce
• Simple programming model
• Scalable, fault-tolerant
• Ideal for (pre-)processing large volumes of data
What is Hadoop?
• Hadoop is an open-source implementation based on GFS and MapReduce from Google
• Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. (2003) The Google File System
• Jeffrey Dean and Sanjay Ghemawat. (2004) MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004
Hadoop provides
• Redundant, fault-tolerant data storage
• Parallel computation framework
• Job coordination
Hadoop Stack
Who uses Hadoop?
• Yahoo!
• Facebook
• Last.fm
• Rackspace
• Digg
• Apache Nutch
• ...
HDFS
• The Hadoop Distributed File System
• Redundant storage
• Designed to reliably store data using commodity hardware
• Designed to expect hardware failures
• Intended for large files
• Designed for batch inserts
Some Concepts about HDFS
• Files are stored as a collection of blocks
• Blocks are 64 MB chunks of a file (configurable)
• Blocks are replicated on 3 nodes (configurable)
• The NameNode (NN) manages metadata about files and blocks
• The SecondaryNameNode (SNN) holds a periodic checkpoint of the NN metadata (it is not a hot standby)
• DataNodes (DN) store and serve blocks
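The block and replication figures above translate into simple arithmetic. The sketch below uses the slide's defaults (64 MB blocks, 3 replicas); the 1000 MB file size and the function name `block_usage` are made up for illustration.

```python
import math

BLOCK_SIZE_MB = 64  # default from the slide (configurable)
REPLICATION = 3     # default from the slide (configurable)

def block_usage(file_size_mb):
    """Return (blocks, stored block replicas, raw storage in MB) for one file."""
    num_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    # The last block holds only the remaining bytes, so raw storage is
    # the file size times the replication factor.
    return num_blocks, num_blocks * REPLICATION, file_size_mb * REPLICATION

blocks, replicas, raw_mb = block_usage(1000)  # hypothetical 1000 MB file
print(blocks, replicas, raw_mb)
```

For this hypothetical file the NameNode tracks 16 blocks, and the DataNodes together hold 48 block replicas.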
Write
Read
If a DataNode fails
• DNs check in with the NN to report health
• Upon failure, the NN orders DNs to replicate under-replicated blocks
Jobs and Tasks in Hadoop
• Job: a user-submitted map and reduce implementation to apply to a data set
• Task: a single mapper or reducer task
– Failed tasks get retried automatically
– Tasks run local to their data, ideally
• JobTracker (JT) manages job submission and task delegation
• TaskTrackers (TT) ask for work and execute tasks
Architecture
How to handle failed tasks?
• JT will retry failed tasks up to N attempts
• After N failed attempts for a task, the job fails
• Some tasks are slower than others
• Speculative execution: the JT starts multiple copies of the same task
• The first one to complete wins; the others are killed
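The retry rule above (up to N attempts, then the job fails) can be sketched in a few lines. This is a local illustration of the policy only; it does not model the JobTracker or speculative execution, and `run_with_retries` and `flaky_task` are hypothetical names.

```python
def run_with_retries(task, max_attempts):
    """Run `task` until it succeeds or `max_attempts` is exhausted."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(attempt), attempt
        except RuntimeError:
            continue  # this attempt failed: try again
    # All attempts failed, so the whole job is declared failed.
    raise RuntimeError(f"task failed {max_attempts} times; job fails")

def flaky_task(attempt):
    # Hypothetical task that fails on its first two attempts.
    if attempt < 3:
        raise RuntimeError("simulated task failure")
    return "ok"

result, attempts_used = run_with_retries(flaky_task, max_attempts=4)
print(result, attempts_used)
```

With `max_attempts=2` the same task would exhaust its attempts and the call would raise, which corresponds to the job failing.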
Data locality
• Move computation to the data
• Moving data between nodes has a cost
• Hadoop tries to schedule tasks on nodes with the data
• When that is not possible, the TT has to fetch data from a DN
Hadoop execution environment
• Local machine (standalone or pseudo-distributed)
• Virtual machine
• Cloud (e.g., Amazon EC2)
• Own cluster
Demo: word count
• Demo
Homework
• Write a Hadoop program to index the words within the text document dataset
– Example:
• Input:
– Doc1: Hello World!
– Doc2: Hello Java!
• Expected output:
– Hello \t Doc1 Doc2
– World \t Doc1
– Java \t Doc2
• Due: beginning of class on 01/10
• If you have any questions, email Jingxuan Li (jli003@cs.fiu.edu)
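To make the expected output format concrete, the example input above can be run through a small local simulation of the map, shuffle, and reduce steps. This is a plain Python sketch, not a Hadoop program; the tokenization rule (letters only) and the function names are assumptions.

```python
import re
from collections import defaultdict

def map_fn(doc_id, text):
    """Emit (word, doc_id) for every word in the document."""
    # Assumed tokenization: runs of letters, so "World!" becomes "World".
    return [(word, doc_id) for word in re.findall(r"[A-Za-z]+", text)]

def reduce_fn(word, doc_ids):
    """Collect the distinct documents containing the word, in sorted order."""
    return sorted(set(doc_ids))

docs = [("Doc1", "Hello World!"), ("Doc2", "Hello Java!")]

# Shuffle: group document ids by word.
groups = defaultdict(list)
for doc_id, text in docs:
    for word, d in map_fn(doc_id, text):
        groups[word].append(d)

index = {word: reduce_fn(word, ids) for word, ids in groups.items()}
lines = [f"{word}\t{' '.join(ids)}" for word, ids in index.items()]
print("\n".join(lines))
```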
Login Info
• Below is the login information for our Hadoop cluster
– Server: datamining-node03.cs.fiu.edu
– User: dbstudent, password: ******* (announced in class)
– To access the working directory in HDFS (do not modify or remove the other directories!): hadoop fs -ls /user/dbstudent
• Input dataset for the homework (everyone will be working on this dataset, so do not modify it!): /user/dbstudent/dataset
• Output directory (including the source code and the indexing results) format: /user/dbstudent/output-PID
What is Hive?
• Data warehousing tool on top of Hadoop
• Originally developed at Facebook
– Now a Hadoop sub-project
• Data warehouse infrastructure
– Execution: MapReduce
– Storage: HDFS files
• Large datasets, e.g. Facebook daily logs
– 30GB (Jan’08), 200GB (Mar’08), 15+TB (2009)
• Hive QL: SQL-like query language
Motivation
• Missing components when using Hadoop MapReduce jobs to process data
– Command-line interface for “end users”
– Ad-hoc query support (without writing full MapReduce jobs)
– Schema information
Hive Applications
• Log processing
• Text mining
• Document indexing
• Customer-facing business intelligence (e.g., Google Analytics)
• Predictive modeling, hypothesis testing
Hive Components
• Shell: allows interactive queries, like a MySQL shell connected to a database
– Also supports web and JDBC clients
• Driver: session handles, fetch, execute
• Compiler: parse, plan, optimize
• Execution engine: DAG of stages (M/R, HDFS, or metadata)
• Metastore: schema, location in HDFS
Data Model
• Tables
– Typed columns (int, float, string, date, boolean)
– Also list and map types (for JSON-like data)
• Partitions
– e.g., to range-partition tables by date
• Buckets
– Hash partitions within ranges (useful for sampling, join optimization)
Metastore
• Database: namespace containing a set of tables
• Holds table definitions (column types, physical layout)
• Partition data
• Uses JPOX ORM for implementation; can be stored in Derby, MySQL, and many other relational databases
Physical Layout
• Warehouse directory in HDFS
– e.g., /home/hive/warehouse
• Tables stored in subdirectories of the warehouse directory
– Partitions and buckets form subdirectories of tables
• Actual data stored in flat files
– Control char-delimited text, or SequenceFiles
– With custom SerDe, can use arbitrary format
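As an illustration of the control-character-delimited text format mentioned above, the sketch below parses rows whose fields are separated by Ctrl-A (`\x01`). The delimiter choice, the sample rows, and the function name `parse_row` are assumptions made for the example; a real table may use a different delimiter or a custom SerDe.

```python
FIELD_DELIM = "\x01"  # Ctrl-A, assumed here as the field delimiter

def parse_row(line, column_types):
    """Split one delimited text line and cast each field to its column type."""
    fields = line.rstrip("\n").split(FIELD_DELIM)
    return tuple(cast(value) for cast, value in zip(column_types, fields))

# Hypothetical rows for a table with columns (freq INT, word STRING).
raw_lines = ["25848\x01the\n", "23031\x01I\n"]
rows = [parse_row(line, (int, str)) for line in raw_lines]
print(rows)
```

The column types come from the table definition held in the metastore; the flat files themselves carry no schema.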
Useful command examples
• Start Hive: bin/hive
• Show all the tables: SHOW TABLES
• Create a new table: CREATE TABLE shakespeare (freq INT, word STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE
• Load data into the table: LOAD DATA INPATH 'shakespeare_freq' INTO TABLE shakespeare
Useful command examples – Cont’d
• Select data: SELECT * FROM shakespeare WHERE freq > 100 SORT BY freq ASC LIMIT 10
• Join: INSERT OVERWRITE TABLE merged SELECT s.word, s.freq, k.freq FROM shakespeare s JOIN kjv k ON (s.word = k.word) WHERE s.freq >= 1 AND k.freq >= 1
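To connect the join query back to MapReduce, here is a local sketch of one common way to express an equi-join as map, shuffle, and reduce steps (a reduce-side join). Whether Hive picks this plan for a given query is not stated in the slides, and the sample rows are invented.

```python
from collections import defaultdict

# Hypothetical (word, freq) rows for the two tables in the query.
shakespeare = [("love", 2000), ("thou", 5000), ("zounds", 3)]
kjv = [("love", 300), ("thou", 4000), ("begat", 200)]

# Map: apply the WHERE filters and tag each row with its source table,
# keyed by the join column (word).
tagged = [(w, ("s", f)) for w, f in shakespeare if f >= 1]
tagged += [(w, ("k", f)) for w, f in kjv if f >= 1]

# Shuffle: group the tagged rows by word.
groups = defaultdict(list)
for word, tagged_freq in tagged:
    groups[word].append(tagged_freq)

# Reduce: pair every shakespeare row with every kjv row for the same word.
merged = []
for word in sorted(groups):
    s_freqs = [f for tag, f in groups[word] if tag == "s"]
    k_freqs = [f for tag, f in groups[word] if tag == "k"]
    merged += [(word, sf, kf) for sf in s_freqs for kf in k_freqs]

print(merged)
```

Words that appear in only one table produce no output row, which matches the inner-join semantics of the query.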
Summary of Hive
• Supports rapid iteration of ad-hoc queries
• Can perform complex joins with minimal code
• Scales to handle much more data than many similar systems
References
• White, T., Hadoop: The Definitive Guide, 2012
• http://hadoop.apache.org/
• http://hive.apache.org/
• MapReduce tutorial: http://hadoop.apache.org/docs/r0.20.2/mapred_tutorial.html#Example%3A+WordCount+v1.0
• Bill Graham, http://blogs.ischool.berkeley.edu/i290-abdts12/files/2012/08/BillGraham_IntroToHadoop_Aug30.pdf
• Spiros Papadimitriou, Jimeng Sun, and Rong Yan, http://cs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/07-1.pdf
• Cloudera, http://blog.cloudera.com/wp-content/uploads/2010/01/6IntroToHive.pdf
Exercises
• To be announced