Data Management
CS910: Foundations of Data Analytics
Graham Cormode
G.Cormode@warwick.ac.uk
Objectives
 Study different methods for data management
 Compare databases, data warehouses, and NoSQL systems
– Understand which are suitable for what kinds of data and analysis
 Understand the MapReduce model for massive data analysis
 See how MapReduce can be applied to graph computations
 See streaming data systems
CS910 Foundations of Data Analytics
Flat Files
 Many small (and not so small) data sets are stored in simple form
– Flat files: each row is a record (example), with all its attributes
 Flat files can be easy to work with
– Many software tools can load them directly
– Manipulate using command-line (Unix) tools
 Flat files have their limitations
– Limited support for indexing to allow fast access
– No explicit way to encode links across files
– No support for updates, concurrency control, error management
 So large systems use more powerful data management tools
Database Management Systems (DBMS)
 Historically, much data has been stored in DBMSs
– DBMS: Database Management System
– DBMS: a system to manage storage of and access to data
– Database: the logical organization of the data
– Each database is formed of structured tables
 Each table corresponds to the kind of data set studied so far
 Databases commonly follow the relational model
– Each table describes a relation: a collection of records with values
– An underlying mathematical theory describes how data is manipulated
 Relational algebra: operations based on first-order logic
– Analyze data using the Structured Query Language (SQL)
Typical uses for database systems
 Personnel / payroll records (salary, address, job type)
 Bank records (account data, transactions)
 Library system (catalogue, borrower, loan records)
 Supply chain management (item types, supplies, depots)
 Supermarket sales (loyalty cards, purchases)
 Hotel booking system (customers, rooms, bookings)
 …
Example eCommerce Database Design
 Underlined attributes are keys
– A key uniquely identifies an item
 A dotted underline indicates “foreign keys”
– Links to a key in another relation
 Order_ID and Product_ID together form the key for Order_Line
Example eCommerce Database

Customer:
Cust_ID | Customer_Name | Address       | City        | ZIP
C1      | Fred Bloggs   | 123 Elm St    | Springfield | 12345
C2      | Jane Doe      | 29 Oak Ln     | Springfield | 23456
C4      | Chen Xiaoming | 14 Beech Rd   | Smallville  | 34567
C5      | Wanjiku       | 29 Acacia Ave | Gotham      | 45678

Order:
Order_ID | Order_Date | Cust_ID
O2       | 2/2/12     | C2
O4       | 6/6/13     | C4

Order_Line:
Order_ID | Product_ID | Quantity
O2       | P22        | 1
O2       | P19        | 12
O2       | P123       | 2
O4       | P19        | 3

Product:
Product_ID | Product_Description | Product_Finish | Standard_Price | On_Hand
P19        | Widget              | Matt           | $9.98          | 180
P22        | Sprocket            | Matt           | $1000          | 5
P123       | Gadget              | Gloss          | $24.75         | 167
Structured Query Language (SQL)
 Access and analyze data via the Structured Query Language (SQL)
 A declarative language: describe the result, not how to compute it
 Basic usage: recover information about records in the data
– SELECT Cust_ID, Customer_Name
  FROM Customer
  WHERE City = 'Springfield';

  Result:
  Cust_ID | Customer_Name
  C1      | Fred Bloggs
  C2      | Jane Doe

– SELECT * FROM Order_Line, Product
  WHERE Order_Line.Product_ID = Product.Product_ID;

  Result:
  Order_ID | Product_ID | Quantity | Product_Description | Product_Finish | Standard_Price | On_Hand
  O2       | P22        | 1        | Sprocket            | Matt           | $1000          | 5
  O2       | P19        | 12       | Widget              | Matt           | $9.98          | 180
  O2       | P123       | 2        | Gadget              | Gloss          | $24.75         | 167
  O4       | P19        | 3        | Widget              | Matt           | $9.98          | 180
Aggregation in SQL
(Using the Product, Order, and Order_Line tables above.)
 Aggregation with the functions MIN, MAX, SUM, COUNT, AVG
– SELECT COUNT(*) FROM Order_Line
  WHERE Order_ID = 'O2';
 Result: 3 (the number of lines in this order)
– SELECT AVG(Standard_Price) FROM Product
  WHERE Product_Finish = 'Matt';
 Result: $504.99 (the average price of Matt-finish items)
– SELECT Order_ID, SUM(Standard_Price * Quantity) AS "Total"
  FROM Order_Line, Product
  WHERE Order_Line.Product_ID = Product.Product_ID
  GROUP BY Order_ID;
 Result: the total cost of each order

  Order_ID | Total
  O2       | $1169.26
  O4       | $29.94
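The aggregation queries above can be tried directly using Python's built-in sqlite3 module. This sketch loads a fragment of the example database into an in-memory database and runs the GROUP BY query; for simplicity, prices are stored as plain numbers rather than "$…" strings:

```python
import sqlite3

# Recreate a fragment of the example database in memory
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Product (Product_ID TEXT, Standard_Price REAL)")
con.execute("CREATE TABLE Order_Line (Order_ID TEXT, Product_ID TEXT, Quantity INT)")
con.executemany("INSERT INTO Product VALUES (?, ?)",
                [("P19", 9.98), ("P22", 1000), ("P123", 24.75)])
con.executemany("INSERT INTO Order_Line VALUES (?, ?, ?)",
                [("O2", "P22", 1), ("O2", "P19", 12),
                 ("O2", "P123", 2), ("O4", "P19", 3)])

# The aggregation query from the slide: total cost of each order
rows = con.execute("""
    SELECT Order_ID, SUM(Standard_Price * Quantity) AS Total
    FROM Order_Line, Product
    WHERE Order_Line.Product_ID = Product.Product_ID
    GROUP BY Order_ID
""").fetchall()
print(rows)  # totals for orders O2 and O4
```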
Pros and Cons of DBMS systems
 Using a DBMS for data management has several advantages:
– Many systems implement a DBMS (SQL Server, Oracle, MySQL)
– Able to process complex queries over large data
 Easily handle millions of records
– Can represent and process most data in the relational model
– Resilient to failure (ensure consistency via logging, locking)
 But some limitations:
– Limited support for analytics (classification, regression, etc.)
 Almost impossible to express in SQL
– Does not scale well to truly massive data
– Can carry a lot of overhead for analytic tasks
Data Warehouses
 Data warehouses were introduced to handle large data stores
– Typically, historical data for business analytics purposes
– Kept separate from the organization’s “live” operational database
 “A data warehouse is a subject-oriented, integrated, time-variant, non-volatile collection of data” (W. Inmon, “father of the data warehouse”)
– Subject-oriented: focused on one topic (e.g. all sales records)
– Integrated: data brought together from many sources, cleaned
– Time-variant: covers a long history of data (e.g. the last decade)
– Non-volatile: only periodically updated, not “live” data
 Data warehouse products from Oracle, IBM, Microsoft, Teradata
OLAP and Data Cubes
 Warehouses often support Online Analytical Processing (OLAP)
– A multidimensional view of data instead of tables
– Represents data as a data cube
– Explored by aggregating or refining dimensions in the data
Aggregating Multidimensional Data
 E.g. sales volume as a function of product, month, and region
– Dimensions: Product, Location, Time
 Hierarchical summarization paths:
– Product: Industry → Category → Product
– Location: Region → Country → City → Office
– Time: Year → Quarter → Month → Day (with Day also rolling up via Week)
A Sample Data Cube
 [Figure: a 3-D data cube with dimensions Product (TV, PC, DVD), Date (1Qtr–4Qtr), and Country (U.S.A., Canada, Mexico); “sum” cells aggregate along each dimension, up to * (all). E.g. one aggregate cell gives the total annual sales of TVs in the U.S.A.]
OLAP Operations
 Roll up (drill-up): summarize data
– By climbing up a hierarchy or by dimension reduction
 Drill down (roll down): inverse of roll-up
– From a higher-level summary to a lower-level summary or detailed data, or by introducing new dimensions
 Slice and dice: project and select
– Zoom in on a particular value, or drop some attributes
 Apply aggregation on a given dimension
– Count, Sum, Min, Max, Average, Variance, Median, Mode
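As a small illustration of roll-up, the following sketch aggregates a toy fact table from the City level to the Country level of the Location hierarchy; the table contents and the `city_to_country` mapping are invented for the example:

```python
from collections import defaultdict

# Toy fact table: (product, city, month) -> sales volume (invented data)
sales = {
    ("TV", "Chicago", "2012-01"): 5,
    ("TV", "Chicago", "2012-02"): 7,
    ("TV", "Toronto", "2012-01"): 3,
    ("PC", "Chicago", "2012-01"): 4,
}

city_to_country = {"Chicago": "U.S.A.", "Toronto": "Canada"}

def roll_up(fact_table):
    """Roll up the Location dimension from City to Country."""
    out = defaultdict(int)
    for (product, city, month), volume in fact_table.items():
        # Replace each city by its country, summing matching cells
        out[(product, city_to_country[city], month)] += volume
    return dict(out)

rolled = roll_up(sales)
print(rolled[("TV", "U.S.A.", "2012-01")])  # 5
```

A drill down is the inverse: it cannot be computed from the rolled-up table alone, which is why the warehouse keeps the detailed data.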
Data Warehouses: Pros and Cons
 Data warehouses have many strengths for data analytics:
– Support fast exploration and aggregation of data
– Designed to handle very large data sets (TBs / billions of records)
– Additional software supports analytics on top
 Clustering, regression, classification
 However, they have their limitations:
– Can be costly to maintain (time-consuming to clean and load data)
– May be difficult to apply new analytic tools on top
– Do not stretch to truly massive data sets in some environments
NoSQL systems
 For truly massive data, the overhead of DBMS/warehouse systems can be too high to allow useful analysis
 NoSQL systems drop support for the full relational model
– Do not provide the same level of reliability/availability
– Do not necessarily support rich languages like SQL
– Aim for a simpler design and better scaling via distribution
 Systems primarily support data storage and retrieval
– Often support analysis via a query language or MapReduce on top
Types of NoSQL systems
 Key-value store: stores and retrieves data in the form (key, value)
– E.g. store demographic data (values) for each user (by key)
– Data is distributed, and replicated for resilience; e.g. Memcached
 Column store: stores data organized by column (instead of by row)
– Allows faster access to particular entries when data is sparse
– Implemented in HBase (the database component of the Hadoop system)
 Document store: stores and retrieves document data
– E.g. to store information for very large websites (Amazon, eBay)
– Each “document” can be an arbitrary collection of information
– Examples include MongoDB (Apache Cassandra, sometimes listed here, is closer to a column store)
NoSQL systems: pros and cons
 NoSQL systems are highly popular at the moment
– Scale to truly massive amounts of data
– Allow analytics on top via MapReduce/Hadoop
– Can be very fast at retrieving data
 But they also have limitations
– Systems are still under development, and can be hard to use
– Some are quite primitive: they just provide data storage/retrieval
– Currently have to write and debug code to implement analytics
– Can be overkill when your data is not massive
Exercises
 Consider a data set you are familiar with
– E.g. the data you plan to use for your project
 How would it be represented in a DBMS? In a data warehouse? In a NoSQL system?
– Could you express the queries you want to answer in SQL?
– Could you express the analysis you want to do using OLAP tools?
 Suppose the data set was 100 times larger: what system would best cope with this scale? What about a million times larger?
Massive Data Analysis
 Many modern examples of very large data sets
– Data from scientific experiments (Large Hadron Collider)
– Activity data on social networks, email, phone calls
– Sensors throughout urban environments, the Internet of Things
 We must perform analytics on this massive data:
– Scientific research (monitor environment, species)
– System management (spot faults, drops, failures)
– Customer research (association rules, new offers)
– Revenue protection (phone fraud, service abuse)
 Else, why even measure this data?
MapReduce and Big Data
 MapReduce is a popular paradigm for analyzing massive data
– When the data is much too big for one machine
– Allows the parallelization of computations over many machines
 Introduced by Jeffrey Dean and Sanjay Ghemawat in the early 2000s
– For fun, check out the “Jeff Dean facts”:
https://www.quora.com/What-are-all-the-Jeff-Dean-facts
– The MapReduce model is implemented by the MapReduce system at Google
– The open-source Apache Hadoop implements the same ideas
 Allows a large computation to be distributed over many machines
– Brings the computation to the data, not vice versa
– The system manages data movement, machine failures, errors
– The user just has to specify what to do with each piece of data
Motivating MapReduce
 Many computations over big data follow a common outline:
– The data is formed of many (many) simple records
– Iterate over each record and extract a value
– Group together intermediate results with the same properties
– Aggregate these groups to get final results
– Possibly, repeat this process with different functions
 The MapReduce framework abstracts this outline
– Iterate over records = Map
– Aggregate the groups = Reduce
What is MapReduce?
 MapReduce draws inspiration from functional programming
– Map: apply the “map” function to every piece of data
– Reduce: form the mapped data into groups and apply a function to each
 Designed for efficiency
– Process the data in whatever order it is stored, avoiding random access
 Random access can be very slow over large data
– Split the computation over many machines
 Can Map the data in parallel, and Reduce each group in parallel
Programming in MapReduce
 Data is assumed to be in the form of (key, value) pairs
– E.g. (key = “CS910”, value = “Foundations of Data Analytics”)
– E.g. (key = “111-222-3333”, value = “(male, 29 years, married…)”)
 Abstract view of programming MapReduce. Specify:
– Map function: take a (k, v) pair, output some number of (k’, v’) pairs
– Reduce function: take all (k’, v’) pairs with the same key k’, and output a new set of (k’’, v’’) pairs
– The “type” of the output (key, value) pairs can differ from the input
 Many other options/parameters in practice:
– Can specify a “partition” function for how to map k’ to reducers
– Can specify a “combine” function that aggregates the output of Map
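This abstract model can be sketched as a single-machine simulator (the helper names here are invented for illustration, not part of any real MapReduce system's API), shown running the classic word-count example:

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Single-machine simulation of MapReduce: apply map_fn to each
    (k, v) record, shuffle the intermediate pairs by key, then apply
    reduce_fn to each group."""
    groups = defaultdict(list)
    for k, v in records:
        for k2, v2 in map_fn(k, v):        # Map phase
            groups[k2].append(v2)          # Shuffle: group by new key
    out = []
    for k2, values in groups.items():      # Reduce phase, one group at a time
        out.extend(reduce_fn(k2, values))
    return out

# Example: word count over (document_id, text) records
docs = [("d1", "big data big systems"), ("d2", "big analytics")]
counts = dict(run_mapreduce(
    docs,
    map_fn=lambda k, text: [(w, 1) for w in text.split()],
    reduce_fn=lambda word, ones: [(word, sum(ones))],
))
print(counts["big"])  # 3
```

Note that the output key type (word) differs from the input key type (document id), as the model allows.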
MapReduce schematic
 [Figure: input pairs (k1, v1) … (k6, v6) pass through parallel map tasks, producing intermediate pairs such as (a, 1), (b, 2), (c, 3), (c, 6), (a, 5), (c, 2), (b, 7), (c, 8); a “Shuffle and Sort” step aggregates values by key, giving a → (1, 5), b → (2, 7), c → (2, 3, 6, 8); parallel reduce tasks then produce the outputs (r1, s1), (r2, s2), (r3, s3).]
MapReduce and Graphs
 MapReduce is a powerful way of handling big graph data
– Graph: a network of nodes linked by edges
– Many big graphs: the web, (social network) friendships, citations
 Often have millions of nodes, billions of edges
 Facebook: > 1 billion nodes, 100 billion edges
 Many complex calculations over graphs
– Rank the importance of nodes (for search)
– Predict which links will be added soon / suggest links
– Label nodes based on classification over graphs
 MapReduce allows computation over big graphs
– Represent each edge as a value in a key-value pair
MapReduce example: compute degree
 The degree of a node is the number of edges incident on it
– Here, assume undirected edges
 Example graph: nodes A, B, C, D with edges (E1, (A, B)), (E2, (A, C)), (E3, (A, D)), (E4, (B, C))
 To compute degree in MapReduce:
– Map: for edge (E, (v, w)), output (v, 1) and (w, 1)
– Reduce: for (v, (c1, c2, …, cn)), output (v, Σi ci)
 On the example: Map emits (A, 1), (B, 1), (A, 1), (C, 1), (A, 1), (D, 1), (B, 1), (C, 1); the shuffle groups these as (A, (1, 1, 1)), (B, (1, 1)), (C, (1, 1)), (D, (1)); Reduce outputs (A, 3), (B, 2), (C, 2), (D, 1)
 Advanced: could use “combine” to compute partial sums
– E.g. Combine((A, 1), (A, 1), (B, 1)) = ((A, 2), (B, 1))
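The degree computation above can be sketched in Python, with a simple driver loop standing in for a real system's shuffle (the graph is the example from the slide):

```python
from collections import defaultdict

# The example graph: edge id -> endpoints
edges = {"E1": ("A", "B"), "E2": ("A", "C"),
         "E3": ("A", "D"), "E4": ("B", "C")}

def map_degree(edge_id, endpoints):
    v, w = endpoints
    return [(v, 1), (w, 1)]            # emit one count per endpoint

def reduce_degree(node, counts):
    return (node, sum(counts))         # total degree of the node

groups = defaultdict(list)
for edge_id, endpoints in edges.items():
    for node, c in map_degree(edge_id, endpoints):
        groups[node].append(c)         # shuffle: group counts by node

degrees = dict(reduce_degree(n, cs) for n, cs in groups.items())
print(degrees)  # {'A': 3, 'B': 2, 'C': 2, 'D': 1}
```

A combiner would pre-sum the counts emitted by each mapper, shrinking the data moved in the shuffle without changing the result.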
MapReduce Analytics
 MapReduce is well-suited to computing counts/sums over data
– Hence, good for building a naïve Bayes classifier. Why?
 Naïve Bayes classifier: defined by conditional probabilities
– Conditional probabilities are computed from counts of values
– Common case: few attributes (10s – 100s), much data (10^6 – 10^12 records)
– Also applies to Bayesian networks, if the model is given
 Other classifiers/regression may not be so easy
– E.g. Support Vector Machines: need to solve a complex optimization
– E.g. linear regression: use MapReduce to compute (XᵀX) and (Xᵀy)
 Exercise: design a MapReduce scheme for the matrix product
 (XᵀX) is d × d in size, so it can be inverted on a single machine
– Doing analytics in MapReduce is an active research topic
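To see why naïve Bayes fits the model, this sketch counts (class, attribute, value) triples with the map/reduce pattern; conditional probabilities such as P(value | class) then follow by division. The records and attribute names are invented for illustration:

```python
from collections import defaultdict

# Invented (key, value) records: (record_id, (class_label, attributes))
records = [
    ("r1", ("spam", {"has_link": "yes"})),
    ("r2", ("spam", {"has_link": "yes"})),
    ("r3", ("ham",  {"has_link": "no"})),
]

def map_nb(rid, record):
    label, attrs = record
    out = [(("CLASS", label), 1)]            # count class frequency
    for attr, val in attrs.items():
        out.append(((label, attr, val), 1))  # count value within class
    return out

counts = defaultdict(int)
for rid, rec in records:
    for key, c in map_nb(rid, rec):          # shuffle + combine-style sum
        counts[key] += c

# P(has_link = yes | spam) = count(spam, has_link, yes) / count(spam)
p = counts[("spam", "has_link", "yes")] / counts[("CLASS", "spam")]
print(p)  # 1.0
```

Every quantity the classifier needs is a sum over records, which is exactly what Reduce (and a combiner) computes cheaply.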
PageRank in MapReduce
 PageRank is well-suited to computation in MapReduce
– Computing the product of M with the vector r
– M is sparse: only represent/transmit the non-zero entries
– Each iteration of MapReduce computes r_{i+1} from r_i
 Consider as a MapReduce computation
– Initial input: (nodeid v, (r_i[v]; edge list for v))
 Initialize: r_0[v] = 1/n, compute the degree d[v] of each node
– Map: for each edge (v, w) in the edge list of v, emit (w, r_i[v]/d[v])
– Reduce: for each w, sum up the r_i[v]/d[v] values to get R
 Emit (w, βR + (1−β)/n), where β is the damping factor
– Iterate for some fixed number of rounds of MapReduce
 Could test for convergence by measuring |r_{i+1} − r_i|
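A minimal sketch of this iteration, on an invented 3-node graph and run on a single machine (a real deployment would distribute the Map and Reduce steps):

```python
from collections import defaultdict

beta = 0.85                                      # damping factor
adj = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}  # edge lists (invented graph)
n = len(adj)
r = {v: 1.0 / n for v in adj}                    # r0[v] = 1/n

def iterate(r):
    sums = defaultdict(float)
    for v, out_edges in adj.items():             # Map: emit (w, r[v]/d[v])
        share = r[v] / len(out_edges)
        for w in out_edges:
            sums[w] += share                     # Reduce: sum to get R
    # Emit (w, beta*R + (1-beta)/n) for every node
    return {v: beta * sums[v] + (1 - beta) / n for v in adj}

for _ in range(20):                              # fixed number of rounds
    r = iterate(r)
print(round(sum(r.values()), 6))                 # total mass is preserved
```

Each call to `iterate` corresponds to one full MapReduce round, which is why limiting the number of rounds matters in practice.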
Clustering in MapReduce
 The main steps in k-means clustering are quite parallel
– Assign each point to its closest cluster: Map
 Assume k is not too big, so the k centroids are known to each mapper
– Compute the new cluster centroid from its points: Reduce
 Need to keep a running sum and count of points
 Suitable for a combine function
 Each iteration requires a full “round” of MapReduce
– Can be slow if many iterations are needed
– Each “round” can take ~10–20 minutes on massive data
– May want to do some quick, coarse clustering of the data to start
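One k-means round in this style can be sketched as follows, using invented 1-D points and k = 2 centroids; Map assigns each point to its nearest centroid, and Reduce averages each group via a running sum and count:

```python
from collections import defaultdict

points = [1.0, 2.0, 9.0, 11.0, 1.5]                # invented 1-D data
centroids = [0.0, 10.0]                            # current k = 2 centroids

def map_assign(p):
    # Nearest centroid index (the centroids are known to every mapper)
    j = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
    return (j, (p, 1))                             # (sum contribution, count)

def reduce_centroid(j, pairs):
    total = sum(p for p, _ in pairs)               # running sum of points
    count = sum(c for _, c in pairs)               # running count of points
    return (j, total / count)                      # new centroid = mean

groups = defaultdict(list)
for p in points:
    j, pc = map_assign(p)
    groups[j].append(pc)                           # shuffle by centroid index

new_centroids = dict(reduce_centroid(j, ps) for j, ps in groups.items())
print(new_centroids)  # {0: 1.5, 1: 10.0}
```

Because the (sum, count) pairs are associative, a combiner can pre-aggregate them on each mapper, exactly as the slide suggests.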
Exercises
 Suppose you want to implement the k-center furthest-point clustering in MapReduce. How would you do so?
– What would the Map function compute?
– What would the Reduce function compute?
– Would a Combine function reduce the cost?
– How many rounds of MapReduce would you use?
 Suppose you want to implement the k-nearest neighbour classifier in MapReduce. How would you do so?
– What Map and Reduce functions would you use?
– What are the advantages and disadvantages of this approach?
Hadoop
 Hadoop is an open-source implementation of MapReduce
– First developed by Doug Cutting (named after a toy elephant)
– Currently managed by the Apache Software Foundation
– Google’s original implementations are not publicly available
 Many tools/products are implemented on Hadoop
– HBase: a (non-relational) distributed database
– Hive: data warehouse infrastructure developed at Facebook
– Pig: a high-level language that compiles to Hadoop, from Yahoo
– Mahout: machine learning algorithms in Hadoop
 Hadoop is widely used in technology-based businesses:
– Facebook, LinkedIn, Twitter, IBM, Amazon, Adobe, eBay
– Offered as part of: Amazon EC2, Cloudera, Microsoft Azure
Spark
 Hadoop has been criticized: the batch + disk approach can be slow
– Sometimes outperformed by well-engineered single-threaded code:
http://www.frankmcsherry.org/graph/scalability/cost/2015/02/04/COST2.html
 A next generation of big data computation is emerging: Spark
– Developed by the AMPLab at UC Berkeley, now in Apache
– Sits on top of distributed data stores (HDFS, Cassandra)
– Provides support for SQL, graph data analysis, machine learning
– Programmable in more languages (Java, Python, Scala, R)
 Can be many times faster than Hadoop
– Keeps more data in memory rather than on disk
Streaming Data Analysis
 The data management tools so far have dealt with stored data
– Some data is so large that it is not feasible to store it all
– Sometimes we want to process data “live” as it is observed
 “Streams of data”: high-volume sources of data
– E.g. results from massive scientific experiments (LHC)
– E.g. telecommunications data
 Streaming data analysis aims to deal with such data
– Streaming data algorithms / streaming data analytics
– Streaming data management systems
Example: Network Data
 Networks are sources of massive data:
– The metadata per hour per router is gigabytes
 Fundamental problem of data stream analysis: too much information to store or transmit
 Process data as it arrives, in one pass: the data stream model
– Sometimes give only approximate answers instead of exact ones
 Many smart algorithms for data analysis
– Approximately count how many different items have been seen
– Find which items are seen most often
– Summarize massive feature vectors
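As one example of the "find which items are seen most often" task, the Misra-Gries algorithm makes a single pass over the stream using only k counters; it is a sketch of the idea rather than a production implementation, and the stream here is invented:

```python
def misra_gries(stream, k):
    """One-pass frequent-items summary using at most k counters.
    Any item occurring more than len(stream)/(k+1) times is
    guaranteed to survive in the counters."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1            # known item: count it
        elif len(counters) < k:
            counters[item] = 1             # spare counter: start tracking
        else:
            for key in list(counters):     # no room: decrement all counters
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]      # free counters that hit zero
    return counters                        # candidate frequent items

stream = ["a", "b", "a", "c", "a", "a", "d", "a"]
print(misra_gries(stream, k=2))            # "a" survives with a large count
```

The memory used is independent of the stream length, which is exactly what the data stream model demands.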
Streaming Data Systems
 Allow queries over streaming data in a suitable language
– Streaming generalizations of SQL
 Compose operators on streaming data
– Apply a selection predicate to find matching tuples in the stream
– Apply functions, e.g. summarization of stream data
 Allow combination of streaming data with stored data
– Often effective to only consider a recent window of the stream data
 Some streaming data systems:
– Microsoft StreamInsight
– IBM InfoSphere Streams
– Apache Storm
Storm
 The Storm system is built out of spouts and bolts
– Spout: handles an incoming stream of data
– Bolt: transforms the stream it receives, producing a new stream
– Streams are sequences of simple tuples
 Examples of possible bolts:
– Select tuples matching a pattern
– Count up the number of tuples received
– Join matching tuples from two input streams
– Track the number of unique tuples seen in a sliding window
 Application: track real-time purchases on an e-commerce site
– Can e.g. map each sale to a category via a database look-up
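The last bolt example, tracking unique tuples in a sliding window, can be sketched as a plain Python class (a toy stand-in for a bolt, not Storm's actual API):

```python
from collections import deque

class SlidingUniqueBolt:
    """Counts distinct items among the last window_size items seen."""
    def __init__(self, window_size):
        self.window = deque()              # items currently in the window
        self.counts = {}                   # item -> occurrences in window
        self.window_size = window_size

    def process(self, item):
        self.window.append(item)
        self.counts[item] = self.counts.get(item, 0) + 1
        if len(self.window) > self.window_size:
            old = self.window.popleft()    # expire the oldest item
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]
        return len(self.counts)            # uniques in the current window

bolt = SlidingUniqueBolt(window_size=3)
for item in ["x", "y", "x", "z", "z"]:
    n = bolt.process(item)
print(n)  # 2: the last three items are "x", "z", "z"
```

In a real topology this logic would sit inside a bolt's execute method, consuming tuples emitted by an upstream spout.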
Summary of Data Management
 Many ways to manage data in big systems
– Flat files, databases, data warehouses, NoSQL systems
 Different models have different data-access paradigms
– SQL (DBMS), OLAP (warehouse), store/retrieve (NoSQL)
 MapReduce is a popular model for very large data analysis
– It has been applied to handling large graph data
 Many big computations are possible in MapReduce
– PageRank, classification, clustering, link prediction
 Online data may be best processed by streaming analysis
Further reading
 Databases and data warehouses: Chapters 4 (Data Warehouses) and 5 (Data Cube) of “Data Mining: Concepts and Techniques” (Han, Kamber, Pei)
 NoSQL systems: “NoSQL: An Overview of NoSQL Databases”
 MapReduce and PageRank: “Data-Intensive Information Processing Applications” (lectures 1, 3 and 5)
 Streaming: “Introduction to Stream Data Management”, Nauman Chaudry, Chapter 1 of “Stream Data Management”, Springer