CS910: Foundations of Data Analytics
Graham Cormode
G.Cormode@warwick.ac.uk

Data Management

Objectives
– Study different methods for data management
– Compare databases, data warehouses, and NoSQL systems: understand which are suitable for what kinds of data and analysis
– Understand the MapReduce model for massive data analysis
– See how MapReduce can be applied to graph computations
– See streaming data systems

Flat Files
– Many small (and not so small) data sets are stored in simple form
  – Flat files: each row is a record (example), with all its attributes
– Flat files can be easy to work with
  – Many software tools can load them directly
  – Manipulate using command line (unix) tools
– Flat files have their limitations
  – Limited support for indexing to allow fast access
  – No explicit way to encode links across files
  – No support for updates, concurrency control, error management
– So large systems use more powerful data management tools

Database Management Systems (DBMS)
– Historically, much data has been stored in DBMSs
  – DBMS (Database Management System): a system to manage storage of and access to data
  – Database: the logical organization of the data
– Each database is formed of structured tables
  – Each table corresponds to the kind of data set studied so far
– Databases commonly follow the relational model
  – Each table describes a relation: a collection of records with values
  – An underlying mathematical theory describes how data is manipulated: relational algebra, with operations based on first-order logic
– Analyze data using the Structured Query Language (SQL)

Typical uses for database systems
– Personnel / payroll records (salary, address, job type)
– Bank records (account data, transactions)
– Library system (catalogue, borrower, loan records)
– Supply chain management (item types, supplies, depots)
– Supermarket sales (loyalty cards, purchases)
– Hotel booking system (customers, rooms, bookings)
– …

Example eCommerce Database Design
– Underlined attributes are keys: a key uniquely identifies an item
– A dotted underline indicates a "foreign key": a link to a key in another relation
– Order_ID and Product_ID together form the key for Order_Line

Example eCommerce Database

Customer:
  Cust_ID  Customer_Name  Address        City         ZIP
  C1       Fred Bloggs    123 Elm St     Springfield  12345
  C2       Jane Doe       29 Oak Ln      Springfield  23456
  C4       Chen Xiaoming  14 Beech Rd    Smallville   34567
  C5       Wanjiku        29 Acacia Ave  Gotham       45678

Order:
  Order_ID  Order_Date  Cust_ID
  O2        2/2/12      C2
  O4        6/6/13      C4

Order_Line:
  Order_ID  Product_ID  Quantity
  O2        P22         1
  O2        P19         12
  O2        P123        2
  O4        P19         3

Product:
  Product_ID  Product_Description  Product_Finish  Standard_Price  On_Hand
  P19         Widget               Matt            $9.98           180
  P22         Sprocket             Matt            $1000           5
  P123        Gadget               Gloss           $24.75          167

Structured Query Language (SQL)
– Access and analyze data via the Structured Query Language (SQL)
– A declarative language: describe the result, not how to compute it
– Basic usage: recover information about records in the data

  SELECT Cust_ID, Customer_Name
  FROM Customer
  WHERE City = 'Springfield';

– Result:
  Cust_ID  Customer_Name
  C1       Fred Bloggs
  C2       Jane Doe

  SELECT *
  FROM Order_Line, Product
  WHERE Order_Line.Product_ID = Product.Product_ID;

– Result:
  Order_ID  Product_ID  Quantity  Product_Description  Product_Finish  Standard_Price  On_Hand
  O2        P22         1         Sprocket             Matt            $1000           5
  O2        P19         12        Widget               Matt            $9.98           180
  O2        P123        2         Gadget               Gloss           $24.75          167
  O4        P19         3         Widget               Matt            $9.98           180
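These queries can be tried directly in any relational system. As a minimal sketch, Python's built-in sqlite3 module can hold the example Customer table in memory and run the first query (the column types here are assumptions for illustration):

  import sqlite3

  # In-memory database holding the example Customer table
  conn = sqlite3.connect(":memory:")
  conn.execute("""CREATE TABLE Customer (
      Cust_ID TEXT PRIMARY KEY, Customer_Name TEXT,
      Address TEXT, City TEXT, ZIP TEXT)""")
  conn.executemany(
      "INSERT INTO Customer VALUES (?, ?, ?, ?, ?)",
      [("C1", "Fred Bloggs", "123 Elm St", "Springfield", "12345"),
       ("C2", "Jane Doe", "29 Oak Ln", "Springfield", "23456"),
       ("C4", "Chen Xiaoming", "14 Beech Rd", "Smallville", "34567"),
       ("C5", "Wanjiku", "29 Acacia Ave", "Gotham", "45678")])

  # Declarative query: which customers are in Springfield?
  for row in conn.execute(
          "SELECT Cust_ID, Customer_Name FROM Customer "
          "WHERE City = 'Springfield'"):
      print(row)   # ('C1', 'Fred Bloggs') then ('C2', 'Jane Doe')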
Aggregation in SQL
– Aggregate with the functions MIN, MAX, SUM, COUNT, AVG (examples below use the Order_Line and Product tables from the previous slide)

  SELECT COUNT(*) FROM Order_Line
  WHERE Order_ID = 'O2';

– Result: 3 (the number of lines in this order)

  SELECT AVG(Standard_Price) FROM Product
  WHERE Product_Finish = 'Matt';

– Result: $504.99 (the average price of matt items)

  SELECT Order_ID, SUM(Standard_Price * Quantity) AS "Total"
  FROM Order_Line, Product
  WHERE Order_Line.Product_ID = Product.Product_ID
  GROUP BY Order_ID;

– Result: the cost of each order
  Order_ID  Total
  O2        $1169.26
  O4        $29.94

Pros and Cons of DBMS systems
– Using a DBMS for data management has several advantages:
  – Many systems implement a DBMS (SQL Server, Oracle, MySQL)
  – Able to process complex queries over large data: easily handle millions of records
  – Can represent and process most data in the relational model
  – Resilient to failure (ensure consistency via logging and locking)
– But there are some limitations:
  – Limited support for analytics (classification, regression etc.), which are almost impossible to express in SQL
  – Does not scale well to truly massive data
  – Can impose a lot of overhead for analytic tasks

Data Warehouses
– Data warehouses were introduced to handle large data stores
  – Typically, historical data kept for business analytics
  – Kept separate from the organization's "live" operational database
– "A data warehouse is a subject-oriented, integrated, time-variant, non-volatile collection of data" – W. Inmon, "father of the data warehouse"
  – Subject-oriented: focused on one topic (e.g. all sales records)
  – Integrated: data brought together from many sources, and cleaned
  – Time-variant: covers a long history of data (e.g. the last decade)
  – Non-volatile: only periodically updated, not "live" data
– Data warehouse products from Oracle, IBM, Microsoft, Teradata

OLAP and Data Cubes
– Warehouses often support Online Analytical Processing (OLAP)
  – A multidimensional view of data instead of tables
  – Represents the data as a data cube
  – Explored by aggregating or refining dimensions of the data

Aggregating Multidimensional Data
– E.g. sales volume as a function of product, month, and region
– Dimensions: Product, Location, Time, each with a hierarchical summarization path:
  – Product: Industry → Category → Product
  – Location: Region → Country → City → Office
  – Time: Year → Quarter → Month → Day (or Year → Week → Day)

A Sample Data Cube
[Figure: a 3-D data cube of sales with dimensions Product (TV, PC, DVD), Time (1Qtr–4Qtr), and Country (U.S.A., Canada, Mexico), each dimension extended with a marginal "sum"; one marginal cell gives, e.g., the total annual sales of TVs in the U.S.A., and the corner cell sums over all (*) dimensions.]
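A small cube like this can be emulated in pandas for experimentation; a minimal sketch with made-up sales figures, where pivot_table's margins=True adds the "sum" row and column from the figure:

  import pandas as pd

  # Toy sales records: one row per (product, quarter, country) observation
  sales = pd.DataFrame({
      "product": ["TV", "TV", "PC", "PC", "DVD", "DVD"],
      "quarter": ["1Qtr", "2Qtr", "1Qtr", "2Qtr", "1Qtr", "2Qtr"],
      "country": ["U.S.A."] * 3 + ["Canada"] * 3,
      "amount":  [100, 120, 80, 90, 60, 70],
  })

  # One face of the cube: roll up (sum out) country, keep product x quarter;
  # margins=True adds the marginal sums
  face = pd.pivot_table(sales, values="amount", index="product",
                        columns="quarter", aggfunc="sum", margins=True)
  print(face)

  # Drill down: reintroduce the country dimension
  cube = pd.pivot_table(sales, values="amount",
                        index=["country", "product"], columns="quarter",
                        aggfunc="sum")
  print(cube)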
OLAP Operations
– Roll up (drill-up): summarize data by climbing up a hierarchy or by dimension reduction
– Drill down (roll down): the inverse of roll-up; move from a higher-level summary to a lower-level summary or detailed data, or introduce new dimensions
– Slice and dice: project and select; zoom in on a particular value, or drop some attributes
– Apply aggregation on a given dimension: Count, Sum, Min, Max, Average, Variance, Median, Mode

Data Warehouses: Pros and Cons
– Data warehouses have many strengths for data analytics:
  – Support fast exploration and aggregation of data
  – Designed to handle very large data sets (TBs / billions of records)
  – Additional software supports analytics on top: clustering, regression, classification
– However, they have their limitations:
  – Can be costly to maintain (time-consuming to clean and load data)
  – May be difficult to apply new analytic tools on top
  – Do not stretch to truly massive data sets in some environments

NoSQL systems
– For truly massive data, the overhead of DBMS/warehouse systems can be too high to allow useful analysis
– NoSQL systems drop support for the full relational model
  – Do not provide the same level of reliability/availability
  – Do not necessarily support rich languages like SQL
– Aim for a simpler design and better scaling via distribution
  – Primarily support data storage and retrieval
  – Often support analysis via a query language or MapReduce on top

Types of NoSQL systems
– Key-value store: stores and retrieves data in the form (key, value)
  – E.g. store demographic data (values) for each user (by key)
  – Data is distributed, and replicated for resilience, e.g. Memcached (a toy sketch of this pattern follows the pros and cons below)
– Column store: stores data organized by column (instead of by row)
  – Allows faster access to particular entries when the data is sparse
  – Implemented in HBase (the database component of the Hadoop system) and Apache Cassandra
– Document store: stores and retrieves document data
  – Each "document" can be an arbitrary collection of information
  – E.g. to store information for very large websites (Amazon, eBay)
  – Examples include MongoDB

NoSQL systems: pros and cons
– NoSQL systems are highly popular at the moment
  – Scale to truly massive amounts of data
  – Allow analytics on top via MapReduce/Hadoop
  – Can be very fast at retrieving data
– But they also have limitations
  – Systems are still under development, and can be hard to use
  – Some are quite primitive: they just provide data storage/retrieval
  – Currently you have to write and debug code to implement analytics
  – Can be overkill when your data is not massive
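To make the key-value pattern concrete, here is a toy in-memory sketch of hash partitioning with replication. This stand-in is purely illustrative and is not the API of Memcached or any real system:

  from hashlib import md5

  class ToyKVStore:
      """Toy key-value store: partitions keys across 'nodes' (dicts)
      and writes each value to REPLICAS consecutive nodes for resilience."""
      REPLICAS = 2

      def __init__(self, num_nodes=4):
          self.nodes = [dict() for _ in range(num_nodes)]

      def _home(self, key):
          # Hash the key to pick its primary node
          return int(md5(key.encode()).hexdigest(), 16) % len(self.nodes)

      def put(self, key, value):
          h = self._home(key)
          for i in range(self.REPLICAS):        # write to the replicas too
              self.nodes[(h + i) % len(self.nodes)][key] = value

      def get(self, key):
          h = self._home(key)
          for i in range(self.REPLICAS):        # fall back to a replica
              node = self.nodes[(h + i) % len(self.nodes)]
              if key in node:
                  return node[key]
          return None

  store = ToyKVStore()
  store.put("user:111-222-3333", {"sex": "male", "age": 29, "married": True})
  print(store.get("user:111-222-3333"))

Note the design choice: there is no schema and no query language, just get and put, which is what lets such systems distribute and scale so easily.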
Exercises
– Consider a data set you are familiar with, e.g. the data you plan to use for your project
– How would it be represented in a DBMS? In a data warehouse? In a NoSQL system?
  – Could you express the queries you want to answer in SQL?
  – Could you express the analysis you want to do using OLAP tools?
– Suppose the data set was 100 times larger: which system would best cope with this scale? What about a million times larger?

Massive Data Analysis
– There are many modern examples of very large data sets
  – Data from scientific experiments (Large Hadron Collider)
  – Activity data on social networks, email, phone calls
  – Sensors throughout urban environments, Internet of Things
– We must perform analytics on this massive data:
  – Scientific research (monitor environment, species)
  – System management (spot faults, drops, failures)
  – Customer research (association rules, new offers)
  – Revenue protection (phone fraud, service abuse)
  – Otherwise, why even measure this data?

MapReduce and Big Data
– MapReduce is a popular paradigm for analyzing massive data
  – Used when the data is much too big for one machine
  – Allows computations to be parallelized over many machines
– Introduced by Jeffrey Dean and Sanjay Ghemawat in the early 2000s
  – For fun, check out the "Jeff Dean facts": https://www.quora.com/What-are-all-the-Jeff-Dean-facts
  – The MapReduce model is implemented by the MapReduce system at Google
  – The open-source Apache Hadoop implements the same ideas
– Allows a large computation to be distributed over many machines
  – Brings the computation to the data, not vice-versa
  – The system manages data movement, machine failures, errors
  – The user just has to specify what to do with each piece of data

Motivating MapReduce
– Many computations over big data follow a common outline:
  – The data is formed of many (many) simple records
  – Iterate over each record and extract a value
  – Group together intermediate results with the same properties
  – Aggregate these groups to get the final results
  – Possibly, repeat this process with different functions
– The MapReduce framework abstracts this outline
  – Iterate over records = Map
  – Aggregate the groups = Reduce

What is MapReduce?
– MapReduce draws inspiration from functional programming
  – Map: apply the "map" function to every piece of data
  – Reduce: form the mapped data into groups and apply a function to each
– Designed for efficiency
  – Process the data in whatever order it is stored, avoiding random access, which can be very slow over large data
  – Split the computation over many machines: Map the data in parallel, and Reduce each group in parallel
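The canonical illustration is word count. Here is a minimal single-machine sketch that simulates the Map, shuffle, and Reduce phases; real systems distribute exactly these steps across machines:

  from collections import defaultdict

  def map_fn(key, value):
      # value is a line of text; emit (word, 1) for each word
      for word in value.split():
          yield (word, 1)

  def reduce_fn(key, values):
      # key is a word; values are all the 1s emitted for it
      yield (key, sum(values))

  def run_mapreduce(records, map_fn, reduce_fn):
      groups = defaultdict(list)
      for k, v in records:                  # Map phase
          for k2, v2 in map_fn(k, v):
              groups[k2].append(v2)         # Shuffle: group values by key
      out = []
      for k2, vs in groups.items():         # Reduce phase
          out.extend(reduce_fn(k2, vs))
      return out

  lines = [(0, "the quick brown fox"), (1, "the lazy dog the end")]
  print(run_mapreduce(lines, map_fn, reduce_fn))
  # [('the', 3), ('quick', 1), ('brown', 1), ('fox', 1),
  #  ('lazy', 1), ('dog', 1), ('end', 1)]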
Programming in MapReduce
– Data is assumed to be in the form of (key, value) pairs
  – E.g. (key = "CS910", value = "Foundations of Data Analytics")
  – E.g. (key = "111-222-3333", value = "(male, 29 years, married…)")
– Abstract view of programming MapReduce. Specify:
  – A Map function: take a (k, v) pair, output some number of (k', v') pairs
  – A Reduce function: take all (k', v') pairs with the same key k', and output a new set of (k'', v'') pairs
  – The "type" of the output (key, value) pairs can be different from the input
– Many other options/parameters in practice:
  – Can specify a "partition" function for how to assign k' values to reducers
  – Can specify a "combine" function that pre-aggregates the output of Map

MapReduce schematic
[Figure: MapReduce schematic. Mappers process input pairs (k1,v1)…(k6,v6) and emit intermediate pairs such as (a,1), (b,2), (c,3), (c,6), (a,5), (c,2), (b,7), (c,8). Shuffle and sort aggregates the values by key: a→(1,5), b→(2,7), c→(2,3,6,8). Reducers then produce one output pair per key: (r1,s1), (r2,s2), (r3,s3).]

MapReduce and Graphs
– MapReduce is a powerful way of handling big graph data
  – Graph: a network of nodes linked by edges
– There are many big graphs: the web, (social network) friendships, citations
  – Often millions of nodes and billions of edges
  – Facebook: > 1 billion nodes, 100 billion edges
– Many complex calculations are performed over graphs
  – Rank the importance of nodes (for search)
  – Predict which links will be added soon / suggest links
  – Label nodes based on classification over graphs
– MapReduce allows computation over big graphs
  – Represent each edge as a value in a (key, value) pair

MapReduce example: compute degree
– The degree of a node is the number of edges incident on it (here, assume undirected edges)
– Example graph: nodes A, B, C, D with edges (A,B), (A,C), (A,D), (B,C)
– To compute degrees in MapReduce (sketched in code below):
  – Map: for edge (E, (v, w)), output (v, 1) and (w, 1)
    Input (E1,(A,B)), (E2,(A,C)), (E3,(A,D)), (E4,(B,C)) maps to (A,1), (B,1), (A,1), (C,1), (A,1), (D,1), (B,1), (C,1)
  – Shuffle groups these as (A,(1,1,1)), (B,(1,1)), (C,(1,1)), (D,(1))
  – Reduce: for (v, (c1, c2, …, cn)), output (v, c1 + c2 + … + cn), giving (A,3), (B,2), (C,2), (D,1)
– Advanced: could use "combine" to compute partial sums
  – E.g. Combine((A,1), (A,1), (B,1)) = ((A,2), (B,1))
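The degree computation drops straight into the same simulated framework as the word count sketch above (the shuffle helper is repeated here so the sketch stands alone):

  from collections import defaultdict

  def run_mapreduce(records, map_fn, reduce_fn):
      groups = defaultdict(list)
      for k, v in records:                       # Map, then shuffle by key
          for k2, v2 in map_fn(k, v):
              groups[k2].append(v2)
      return [out for k2, vs in groups.items()   # Reduce each group
              for out in reduce_fn(k2, vs)]

  def map_fn(edge_id, endpoints):
      v, w = endpoints           # an undirected edge contributes 1
      yield (v, 1)               # to the degree of each endpoint
      yield (w, 1)

  def reduce_fn(node, counts):
      yield (node, sum(counts))  # degree = number of incident edges

  edges = [("E1", ("A", "B")), ("E2", ("A", "C")),
           ("E3", ("A", "D")), ("E4", ("B", "C"))]
  print(run_mapreduce(edges, map_fn, reduce_fn))
  # [('A', 3), ('B', 2), ('C', 2), ('D', 1)]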
MapReduce Analytics
– MapReduce is well-suited to computing counts and sums over data
  – Hence, good for building a naïve Bayes classifier. Why?
– A naïve Bayes classifier is defined by conditional probabilities
  – These conditional probabilities are computed from counts of values
  – Common case: few attributes (10s – 100s), much data (10^6 – 10^12 records)
  – Also applies to Bayesian networks, if the model is given
– Other classifiers/regression methods may not be so easy
  – E.g. Support Vector Machines: need to solve a complex optimization
  – E.g. linear regression: use MapReduce to compute (X^T X) and (X^T y)
    – Exercise: design a MapReduce scheme for the matrix product
    – (X^T X) is d × d in size, so it can be inverted on a single machine
– Doing analytics in MapReduce is an active research topic

PageRank in MapReduce
– PageRank is well-suited to computation in MapReduce
  – Each iteration computes the product of the matrix M with the vector r
  – M is sparse: only represent/transmit the non-zero entries
  – Each round of MapReduce computes r_{i+1} from r_i
– As a MapReduce computation, with damping factor β:
  – Initial input: (node id v, (r_i[v]; edge list of v))
  – Initialize: r_0[v] = 1/n, and compute the degree d[v] of each node
  – Map: for each edge (v, w) in the edge list of v, emit (w, r_i[v]/d[v])
  – Reduce: for each w, sum up the r_i[v]/d[v] values to get R, and emit (w, β·R + (1−β)/n)
– Iterate for some fixed number of rounds of MapReduce
  – Could instead test for convergence by measuring |r_{i+1} − r_i|
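A minimal sketch of these rounds on a made-up 4-node graph, assuming the conventional damping factor β = 0.85 and a teleport share of (1−β)/n per node:

  from collections import defaultdict

  BETA = 0.85   # damping factor; 1 - BETA is the "teleport" probability

  # Adjacency lists of a toy directed graph (made up for illustration)
  edges = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
  n = len(edges)
  r = {v: 1.0 / n for v in edges}            # r_0[v] = 1/n

  def map_fn(v, payload):
      rank, out_links = payload              # (r_i[v], edge list of v)
      for w in out_links:                    # share current rank equally
          yield (w, rank / len(out_links))   # among out-neighbours

  def reduce_fn(w, contributions):
      R = sum(contributions)
      yield (w, BETA * R + (1 - BETA) / n)   # damped update

  for _ in range(20):                        # fixed number of rounds
      groups = defaultdict(list)
      for v in edges:                        # Map + shuffle
          for w, c in map_fn(v, (r[v], edges[v])):
              groups[w].append(c)
      r = {w: val for w, vals in groups.items()
           for _, val in reduce_fn(w, vals)}
      for v in edges:                        # nodes with no in-links still
          r.setdefault(v, (1 - BETA) / n)    # receive the teleport share
  print(r)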
Clustering in MapReduce
– The main steps in k-means clustering are quite parallel
– Assign each point to its closest cluster: Map
  – Assume k is not too big, so all k centroids are known to each mapper
– Compute the new cluster centroids from the points: Reduce
  – Need to keep a running sum and count of the points; also suitable for a combine function
– Each iteration requires a full "round" of MapReduce
  – Can be slow if many iterations are needed: each round can take ~10–20 minutes on massive data
  – May want to start with some quick, coarse clustering of the data

Exercises
– Suppose you want to implement k-center furthest-point clustering in MapReduce. How would you do so?
  – What would the Map function compute?
  – What would the Reduce function compute?
  – Would a Combine function reduce the cost?
  – How many rounds of MapReduce would you use?
– Suppose you want to implement the k-nearest-neighbour classifier in MapReduce. How would you do so?
  – What Map and Reduce functions would you use?
  – What are the advantages and disadvantages of this approach?

Hadoop
– Hadoop is an open-source implementation of MapReduce
  – First developed by Doug Cutting (named after a toy elephant)
  – Currently managed by the Apache Software Foundation
  – Google's original implementations are not publicly available
– Many tools/products are implemented on Hadoop
  – HBase: (non-relational) distributed database
  – Hive: data warehouse infrastructure, developed at Facebook
  – Pig: high-level language that compiles to Hadoop jobs, from Yahoo
  – Mahout: machine learning algorithms in Hadoop
– Hadoop is widely used in technology-based businesses:
  – Facebook, LinkedIn, Twitter, IBM, Amazon, Adobe, eBay
  – Offered as part of: Amazon EC2, Cloudera, Microsoft Azure

Spark
– Hadoop has been criticized: the batch + disk approach can be slow
  – A well-engineered single-threaded implementation is sometimes faster: http://www.frankmcsherry.org/graph/scalability/cost/2015/02/04/COST2.html
– A next generation of big data computation has emerged: Spark
  – Developed by the AMPLab at UC Berkeley, now in Apache
  – Sits on top of distributed data stores (HDFS, Cassandra)
  – Provides support for SQL, graph data analysis, machine learning
  – Programmable in more languages (Java, Python, Scala, R)
  – Can be many times faster than Hadoop: keeps more data in memory rather than on disk

Streaming Data Analysis
– The data management tools so far have dealt with stored data
  – Some data is so large it is not feasible to store it all
  – Sometimes we want to process data "live" as it is observed
– "Streams of data": high-volume sources of data
  – E.g. results from massive scientific experiments (LHC)
  – E.g. telecommunications data
– Streaming data analysis aims to deal with such data
  – Streaming data algorithms / streaming data analytics
  – Streaming data management systems

Example: Network Data
– Networks are sources of massive data: the metadata alone amounts to gigabytes per hour per router
– The fundamental problem of data stream analysis: too much information to store or transmit
  – Process data as it arrives, in one pass: the data stream model
  – Sometimes give only approximate answers instead of exact ones
– Many smart algorithms for data analysis
  – Approximately count how many different items have been seen
  – Find which items are seen most often (sketched below)
  – Summarize massive feature vectors
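As a flavour of such one-pass algorithms, here is a minimal sketch of the Misra-Gries frequent-items summary (not named on the slide, but a standard choice): with at most k−1 counters, every item occurring more than n/k times in a stream of length n is guaranteed to survive, alongside some false positives that a second pass can remove:

  def misra_gries(stream, k):
      """One pass, at most k-1 counters: any item occurring more than
      n/k times in a stream of n items survives in the counter set."""
      counters = {}
      for item in stream:
          if item in counters:
              counters[item] += 1
          elif len(counters) < k - 1:
              counters[item] = 1
          else:
              # No room: decrement every counter, dropping those at zero
              for key in list(counters):
                  counters[key] -= 1
                  if counters[key] == 0:
                      del counters[key]
      return counters  # candidate heavy hitters

  stream = ["a", "b", "a", "c", "a", "b", "a", "d", "a", "b"]
  print(misra_gries(stream, k=3))  # 'a' (5 of 10 items) is sure to survive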
Streaming Data Systems
– Allow queries over streaming data in a suitable language
  – Streaming generalizations of SQL
– Compose operators over streaming data
  – Apply a selection predicate to find matching tuples in the stream
  – Apply functions, e.g. summarization of stream data
  – Allow combination of streaming data with stored data
  – Often effective to consider only a recent window of the stream
– Some streaming data systems:
  – Microsoft StreamInsight
  – IBM InfoSphere Streams
  – Apache Storm

Storm
– The Storm system is built out of spouts and bolts
  – Spout: handles an incoming stream of data
  – Bolt: transforms the stream it receives, producing a new stream
  – Streams are sequences of simple tuples
– Examples of possible bolts:
  – Select tuples matching a pattern
  – Count up the number of tuples received
  – Join matching tuples from two input streams
  – Track the number of unique tuples seen in a sliding window
– Application: track real-time purchases on an e-commerce site
  – Can, e.g., map each sale to a category via a database look-up

Summary of Data Management
– There are many ways to manage data in big systems: flat files, databases, data warehouses, NoSQL systems
– Different models have different data-access paradigms: SQL (DBMS), OLAP (warehouse), store/retrieve (NoSQL)
– MapReduce is a popular model for very large data analysis, and has been applied to handle large graph data
– Many big computations are possible in MapReduce: PageRank, classification, clustering, link prediction
– Online data may be best processed by streaming analysis

Further reading
– Databases and data warehouses: Chapters 4 (Data Warehouse) and 5 (Data Cube) of "Data Mining: Concepts and Techniques" (Han, Kamber, Pei)
– NoSQL systems: "NoSQL: An Overview of NoSQL Databases"
– MapReduce and PageRank: "Data-Intensive Information Processing Applications" (lectures 1, 3 and 5)
– Streaming: Nauman Chaudry, "Introduction to Stream Data Management", Chapter 1 of "Stream Data Management", Springer