COSC 526 Class 5: Streaming Data and Analytics
Arvind Ramanathan
Computational Science & Engineering Division
Oak Ridge National Laboratory, Oak Ridge
Ph: 865-576-7266; E-mail: ramanathana@ornl.gov

Projects
• Three main areas:
  – Applications
  – Algorithms
  – Infrastructure
• Datasets:
  – Healthcare claims (> 20 GB compressed)
  – Twitter (1 TB compressed, one year of the feed)
  – Molecular simulations (> 1 TB compressed)
  – Cybersecurity (> 60 GB)
  – Bioinformatics (> 6 TB compressed)
  – Imaging (> 2 TB compressed)

Applications
• Can you find patients with similar phenotypes from their treatment patterns?
• Can you group the behaviors of individuals based on their tweet patterns?
• Can you predict time-points at which an attack may occur within a network?
• Can you piece together hundreds of millions of base pairs against a reference genome?
• Can you connect the cognitive processes associated with decision making?

Infrastructure
• Scalable platform to host multiple types of datasets
• GPU implementations of various algorithms (ICA, hierarchical eigensolvers, tensor toolbox, etc.)
• Scalable platform to host related datasets for co-analysis
• Efficient transfer of data between CPU and GPU for specific ML algorithms

Algorithms
• Machine learning algorithm implementations for CPU/GPU resources
• How do we exploit Intel's co-processors for ML algorithm design?
• Multi-way multi-task learning

Part 0: Introducing Streaming Data

Distributed Counting: Stream and Sort Counting
[Diagram: Machine A runs the counting logic over examples (example 1, example 2, example 3, …) and emits "increment C[x] by D" messages (C[x1] += D1, C[x1] += D2, …) through standardized message-routing logic; this stage is trivial to parallelize. Machine B sorts the messages; easy to parallelize. Machines C1..CK run the logic that combines the sorted counter updates.]
• Can we do this as more data becomes available?
• Do we have to update any of the logic to achieve this?
• What are the necessary updates to our approach?

Is Naïve Bayes a Streaming Algorithm?
• Yes!
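The "increment C[x] by D" counter handling sketched above can be written in a few lines of Python. This is a minimal illustration; the message format and names are ours, not from the slides:

```python
from collections import defaultdict

def stream_count(messages):
    """Consume an iterable of (key, delta) messages and maintain
    running counters: C[x] += D for each message, as in the
    stream-and-sort counting example."""
    counts = defaultdict(int)
    for key, delta in messages:
        counts[key] += delta
    return dict(counts)

# Hypothetical message stream of counter updates.
msgs = [("x1", 1), ("x2", 1), ("x1", 2)]
print(stream_count(msgs))  # {'x1': 3, 'x2': 1}
```

Because each update only touches one counter, the same loop can run incrementally as more data arrives, which is what makes the approach streaming-friendly.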
• In ML, we call this online learning:
  – Allows us to model the data as it becomes available
  – Adapts slowly as the data streams in
• How are we doing this?
  – NB: slow updates to the model
  – Learn from the training data
  – For every example in the stream, we update the model (learning rate)

What Attributes of Data Have We Seen?
[Concept map: large feature sets, streaming data, high-dimensional data, graph data, linked to techniques such as Map/Reduce, stream & sort, request & answer, counting, moment estimation, spectral methods, locality-sensitive hashing, link analysis, querying streams, filtering streams, clustering, PageRank/SimRank, Rocchio's algorithm, dimensionality reduction, community detection.]

What Can Hadoop/MapReduce Do?
Can do (pros):
• High throughput: batch-optimized
• Assumes data is resident (somewhere, some place)
• Independent of storage and (somewhat) of language; flexible and fault-tolerant
Cannot do (cons):
• Streaming/infinite data: data is getting updated by the second
• Decision making is "slow", but decision making needs to be fast!
• Single fixed data flow
• Not I/O efficient
Lee, K.Y., et al. SIGMOD Record 2011, Vol. 40 (4).

Streaming Data
• Data streams are "infinite" and "continuous":
  – Distribution(s) can evolve over time
• Managing streaming data is important:
  – Input is coming extremely fast!
  – We don't see all of the data
  – (Sometimes) input rates are controlled externally (e.g., Google queries, Twitter and Facebook status updates)
• Key question in streaming data: how do we make critical calculations about the stream using limited (secondary) memory?

General Stream Processing Model
[Diagram: several streams entering over time, e.g., "1, 5, 2, 7, 0, 9, 3", "a, r, v, t, y, h, b", "0, 0, 1, 0, 1, 1, 0"; each stream is composed of elements/tuples. A processor answers standing queries and ad-hoc queries and produces output, using limited working storage backed by archival storage.]
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

Requirements of Stream Processing
• Real-time, fault-tolerant, scalable processing
• Guaranteed delivery
• Handling of data failures
• Flexible data input (incoming listeners) and standing queries on the outgoing side
[Diagram: processor with limited working storage and archival storage, sitting between incoming listeners and the destination.]

Stream Model
• Inputs enter at a rapid rate:
  – O(million/min) events
  – at one or more ports
• Elements are usually called tuples
• How do we make critical calculations about the stream with limited secondary memory?
• We cannot store all of the data stream!

Questions We Are Interested In
• Sampling from a data stream: construct a random sample
• Queries over a sliding window: number of items of type x in the last k elements of the stream
• Filtering a data stream: select elements with property x from the stream
• Counting distinct elements: number of distinct elements in the last k elements of the stream
• Estimating moments
• Finding frequent items

Example Applications
• Finding relevant products using clickstreams: Amazon, Google, and others want to surface relevant products based on user interactions
• Trending topics on social media
• Health monitoring using sensor networks
• Monitoring molecular dynamics simulations
A. Ramanathan, J.-O. Yoo and C.J. Langmead, On-the-fly identification of conformational sub-states from molecular dynamics simulations. J. Chem. Theory Comput. (2011), Vol. 7 (3): 778-789.
A. Ramanathan, P.K. Agarwal, M.G. Kurnikova, and C.J. Langmead, An online approach for mining collective behaviors from molecular dynamics simulations. J. Comp. Biol. (2010), Vol. 17 (3): 309-324.

Sampling from a Data Stream
• Practical consideration: we cannot store all of the data, but we can store a sample!
• Two problems:
  1. Sample a fixed proportion of elements in the stream (say 1 in 10)
  2. Maintain a random sample of fixed size over a potentially infinite stream
• Desirable property: at any time k, we would like a random sample of s elements such that each of the k elements seen so far has equal probability of being sampled

Part I: Sampling from a Data Stream

Sampling a Fixed Proportion
• Scenario: a search engine query stream
  – Stream: <user, query, time>
  – Why is this interesting? We can answer questions like "how often did a user have to run the same query in a single day?"
  – We only have storage for 1/10th of the query stream
• Simple solution:
  – Generate a random integer in [0..9] for each query
  – Store the query if the integer is 0; otherwise discard it

What Is Wrong with Our Simple Approach?
• Suppose each user issues s queries once and d queries twice, for (s + 2d) total queries
• What fraction of queries by an average search engine user are duplicates? d/(s+d)
• If we keep only 10% of the queries:
  – The sample will contain s/10 of the singleton queries and 2d/10 of the duplicate queries at least once
  – But only d/100 pairs of duplicates: d/100 = 1/10 ∙ 1/10 ∙ d
  – Of the d "duplicates", 18d/100 appear exactly once: 18d/100 = ((1/10 ∙ 9/10) + (9/10 ∙ 1/10)) ∙ d
  – So the fraction appearing twice in the sample is (d/100) divided by (d/100 + s/10 + 18d/100), i.e., d/(10s + 19d)

Instead of Queries, Let's Sample Users…
• Pick 1/10th of the users and take all of their searches into the sample
• Use a hash function that hashes the user name or user ID uniformly into 10 buckets
• Why is this a better solution?

Generalized Solution
• Stream of tuples with keys:
  – The key is some subset of each tuple's components; e.g., if the tuple is (user, search, time), the key is user
  – The choice of key depends on the application
• To get a sample of an a/b fraction of the stream:
  – Hash each tuple's key uniformly into b buckets
  – Pick the tuple if its hash value is at most a
• How to generate a 30% sample?
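The generalized hash-based sampling above can be sketched in Python. This is a minimal illustration under the slide's scheme (hash the key into b buckets, keep the first a); the function and variable names are ours:

```python
import hashlib

def keep_tuple(key, a, b):
    """Hash the tuple's key uniformly into b buckets (0..b-1) and
    keep the tuple if it lands in one of the first a buckets,
    giving an a/b sample of the stream."""
    # A stable cryptographic hash approximates a uniform hash.
    bucket = int(hashlib.sha1(key.encode("utf-8")).hexdigest(), 16) % b
    return bucket < a

# A 30% sample of a query stream keyed by user: keep buckets 0, 1, 2 of 10.
stream = [("alice", "query1"), ("bob", "query2"), ("alice", "query3")]
sample = [t for t in stream if keep_tuple(t[0], a=3, b=10)]
```

Because the decision depends only on the key, either all of a user's queries are kept or none are, which is what preserves per-user duplicate statistics in the sample.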
• Answer: hash into b = 10 buckets, and take the tuple if it hashes to one of the first 3 buckets

Maintaining a Fixed-Size Sample
• Suppose we need to maintain a random sample S of exactly s tuples
  – e.g., due to a main-memory size constraint
• We don't know the length of the stream in advance
• Suppose at time n we have seen n items; each item should be in the sample S with equal probability s/n
• How to think about the problem: say s = 2 and the stream is a x c y z k c d e g …
  – At n = 5, each of the first 5 tuples is included in S with equal probability
  – At n = 7, each of the first 7 tuples is included in S with equal probability
• An impractical solution would be to store all n tuples seen so far and pick s of them at random

Fixed-Size Sample
• Algorithm (a.k.a. reservoir sampling):
  – Store the first s elements of the stream in S
  – Suppose we have seen n-1 elements, and now the nth element arrives (n > s):
    • With probability s/n, keep the nth element; otherwise discard it
    • If we keep the nth element, it replaces one of the s elements in S, picked uniformly at random
• Claim: this algorithm maintains a sample S with the desired property
  – After n elements, the sample contains each element seen so far with probability s/n

Proof: By Induction
• We prove this by induction:
  – Assume that after n elements, the sample contains each element seen so far with probability s/n
  – We need to show that after seeing element n+1, the sample contains each element seen so far with probability s/(n+1)
• Base case:
  – After we see n = s elements, the sample S has the desired property: each of the n = s elements is in the sample with probability s/s = 1
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

Proof: By Induction (continued)
• Inductive hypothesis: after n elements, the sample S contains each element seen so far with probability s/n
• Now element n+1 arrives
• Inductive step: for an element already in S, the probability that the algorithm keeps it in S is

  (1 - s/(n+1)) + (s/(n+1)) ∙ ((s-1)/s) = n/(n+1)

  (either element n+1 is discarded, or it is kept but this element is not the one it replaces)
• So a tuple that was in S at time n (with probability s/n) stays in S at time n+1 with probability n/(n+1)
• Hence the probability that a tuple is in S at time n+1 is (n/(n+1)) ∙ (s/n) = s/(n+1)

Part II: Querying Using a Sliding Window

Sliding Window
• Queries are not about individual values, but about a window of length N: the N most recent elements received
• What do we do if N is so large that the window cannot fit into memory?
  – A single stream may be so large that it does not fit into memory
  – Multiple streams together may not fit into memory

Example 1: A Single Stream May Not Fit into Memory
[Figure: snapshots of a molecule at times T0, T100, T6000, and per-time-step tables of atomic coordinates (x1, y1, z1), (x2, y2, z2), …, (xN, yN, zN).]
• Molecular dynamics (MD) datasets are often so large that they don't fit into memory (> several terabytes)
• Questions are not about a "snapshot" but about "time windows":
  – Which regions of the molecule are moving together?
  – At which time-points are the spatio-temporal patterns of motion changing?

Example 2: Multiple Streams Won't Fit into Memory
• Computer networks: information streaming from a variety of computers on a network: <source, port, destination, no. of packets>
• Sensor networks: multiple sensors reporting at different timescales

Running Example: The Amazon Stream
• For every product X we keep a 0/1 stream:
  – 0: not sold in the nth transaction
  – 1: sold in the nth transaction
• How many times have we sold X in the last k sales?
  010011011101010110110110  (past → future)

Counting Bits
• Given a stream of 0s and 1s: how many 1s are in the last k bits (k ≤ N)?
• Simple solution:
  – Store only the most recent N bits (e.g., N = 6)
  – When a new bit comes in, discard the (N+1)th bit

What's the Problem?
• There is no exact answer without storing the actual window: we need at least N bits of storage to get the exact answer
• Amazon sells approximately 400 products every second:
  – 24 × 3600 × 400 ≈ 34 million products a day
  – With N = 1 billion, we quickly run out of memory

Can We Approximate Our Solution?
• With really large N, an approximate solution is acceptable
• What if we assume that the bit stream is uniformly distributed?
  – Initialize two counters: S, the number of 1s from the beginning of the stream, and Z, the number of 0s from the beginning of the stream
  – Assuming uniformity, the number of 1s in the last N bits is approximately N ∙ S/(S+Z)
• What is the caveat in our algorithm? Are the distributions really uniform?

The Datar-Gionis-Indyk-Motwani (DGIM) Algorithm
• Does not assume uniformity
• We store only O(log² N) bits per stream
• The solution gives an approximate answer:
  – Never off by more than 50%
  – The error can be reduced to any fraction ε > 0 with a more complicated algorithm (and more bits, but still O(log² N))

A Simple Extension of the Previous Idea: Exponential Windows
• Summarize exponentially increasing regions of the stream, looking backward
• Drop small regions if they begin at the same point as a larger region
[Figure: a window of width 16 containing 6 1s; region counts 10, 4, 3, 2, 2, 1, 1, 0 over the stream 010011100010100100010110110111001010110011010, with the oldest region's contribution marked "6 ?".]
• We can reconstruct the count of the last N bits, except that we are not sure how many of the last 6 1s are included in the window

What Works? (And What Does Not)
What works:
• Stores only O(log² N) bits
• Updates are easy!
• The error is no larger than the number of 1s in the unknown area
What does not work:
• Assumes that 1s are fairly evenly distributed
• If all the 1s are at the end of the unknown area, you cannot bound the error

DGIM: Summarize Blocks with a Specific Number of 1s and a Timestamp
• We now vary not only the window size, but also the block size!
• When there are few 1s in the window, block sizes stay small, so errors are small
• Each bit in the stream has a timestamp (1, 2, …), recorded modulo N, for a total of log₂ N bits per timestamp
[Example stream: 1001010110001011010101010101011010101010101110101010111010100010110010]

DGIM: Buckets
• Divide the window into buckets, each recording:
  1. The timestamp of its right (most recent) end, stored in O(log N) bits
  2. The number of 1s in the bucket, stored in O(log log N) bits
• Important constraint on buckets: the number of 1s in a bucket is a power of 2!

Representing a Data Stream by Buckets
• There are either one or two buckets with the same power-of-2 number of 1s
• Buckets do not overlap in timestamps
• Buckets are sorted by size: earlier buckets are not smaller than later buckets
• Buckets disappear when their end-time is more than N time units in the past

What Does a Bucketized Stream Look Like?
[Figure: at least 1 bucket of size 16 (partially beyond the window), 2 of size 8, 2 of size 4, 1 of size 2, and 2 of size 1 over the stream 1001010110001011010101010101011010101010101110101010111010100010110010.]

How Do We Update the Buckets?
• When a new bit comes in, drop the last (oldest) bucket if its end-time is more than N time units before the current time
• If the current bit is 0: no other changes are needed
• If the current bit is 1:
  – Create a new bucket of size 1 for just this bit, with end timestamp = current time
  – If there are now three buckets of size 1, combine the oldest two into a bucket of size 2
  – If there are now three buckets of size 2, combine the oldest two into a bucket of size 4
  – And so on…

Running Through an Example
• Current state of the stream: 1001010110001011010101010101011010101010101110101010111010100010110010
• A bit of value 1 arrives: 0010101100010110101010101010110101010101011101010101110101000101100101
• Two size-1 buckets get merged into a size-2 bucket
• The next 1 arrives and a new size-1 bucket is created; then a 0 arrives, then a 1: 0101100010110101010101010110101010101011101010101110101000101100101101
• Buckets get merged again, giving the final state of the buckets

Querying the Bucketized Stream
• To estimate the number of 1s in the most recent N bits:
  1. Sum the sizes of all buckets but the last (note: "size" means the number of 1s in the bucket)
  2. Add half the size of the last bucket
• Remember: we do not know how many 1s of the last bucket are still within the wanted window

Part III: Counting Distinct Elements

Problem Definition
• The data stream consists of elements chosen from a universe of N elements
• Maintain a count of the number of distinct items seen so far
• One obvious approach: maintain the set of elements seen so far, e.g., keep a hash table of all the distinct elements

Examples
• How many unique products did we sell tonight?
• How many unique requests came in to a website today?
• How many different words did we find on a website?
  – An unusually large number of words could be indicative of spam

Now, Let's Constrain Ourselves to Limited Storage…
• How do we estimate (in an unbiased manner) the number of distinct elements seen? The Flajolet-Martin approach:
  – Pick a hash function h that maps each of the N elements to at least log₂ N bits
  – For each stream element a, let r(a) be the number of trailing 0s in h(a), i.e., the position of the first 1 counting from the right
    • E.g., if h(a) = 12, then 12 is 1100 in binary, so r(a) = 2
  – Record R = maxₐ r(a) over all the items a seen so far
  – The estimated number of distinct elements is 2^R

Part III (continued): Counting Item Sets

Problem Definition
• Given a stream, which items appear more than s times in the given window?
• One approach:
  – Think of the stream as one binary stream per item: 1 if the item is present, 0 if it is not
  – Use DGIM to estimate the counts of 1s for all items

Exponentially Decaying Windows
• A heuristic for selecting likely frequent items (or itemsets)
• Example: what are "currently" the most popular movies?
• Instead of computing the raw counts of the last N elements, we need a "smooth aggregate" over the whole stream

What Do We Mean by a "Smoothed" Aggregate?
• Suppose a stream has elements a1, a2, …, at, where a1 is the first element to arrive and at is the current element
• Let c be a small constant (10⁻⁶ to 10⁻⁹)
• We define the exponentially decaying window sum to be

  Σ_{i=1..t} a_i ∙ (1-c)^(t-i)

• When a new item a_{t+1} arrives: multiply the current sum by (1-c) and add a_{t+1}

Now, How Do We Count Items?
• If each a_i is an item, we can compute the characteristic function of each possible item x as an exponentially decaying window:

  Σ_{i=1..t} δ_i ∙ (1-c)^(t-i),  where δ_i = 1 if a_i = x and 0 otherwise

• Imagine that for each item x we have a binary stream (1 if x appears, 0 otherwise)
• When a new item x appears:
  – Multiply all counts by (1-c)
  – Add +1 to the count of x
• Call this sum the "weight of x"

Sliding vs. Decaying Windows
• A property to remember: the sum over all weights is 1/c
[Figure: exponentially decaying weights compared with a sliding window of length 1/c.]

A Practical Application: Creating a Storyboard from Molecular Dynamics (MD) Simulations
• We run really long time-scale simulations: 10⁻¹⁵ s (femtosecond) time-steps up to 10⁻⁶ s (microsecond) total
• Datasets can typically range from a few GB to O(100) TB!
  – How do we find "events" of interest within these datasets?
  – How do we organize a "storyboard" of events for biologists to interpret the results?

Using Higher-Order Moments to Capture Events in MD Simulations
[Figure]

Putting It All Together…
[Figure]

Part IV: Engineering for Streaming Data
• Apache Kafka
• Apache Storm

Why Do We Need to Engineer?
• Supporting data integration: making "all" data available in a uniform way for all services and systems
• This is challenging:
  – The event-data fire-hose: log data comes in various forms!
  – An explosion of specialized data systems: e.g., OLAP, databases, etc.
[Figure: maturity pyramid, from availability/collection (data capture, access, availability, etc.) through quality (accuracy, validity, completeness, etc.) and usability (data management & reporting) to evolution (dynamic predictive models).]
Jay Kreps, LinkedIn Engineering Blog: http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying

Apache Kafka
• A distributed messaging system
[Figure: the stream-processing requirements diagram from earlier — real-time, fault-tolerant, scalable processing with guaranteed delivery, failure handling, and flexible data input.]

Three Components of a Messaging System
• Publish/subscribe model
• Message producer, consumer, and broker
  – Durable message: one that is held on to if a consumer becomes unavailable
  – Persistent message: one whose delivery is guaranteed
• Both kinds of message cost!
  – Storing metadata about billions of messages creates overhead
  – The storage structure (e.g., B-trees) can impose restrictions

Messaging with Apache Kafka
[Figure: applications, front ends, and services publish through Apache Kafka to Hadoop clusters, real-time monitoring, databases, etc.]
• Explicitly distributed: producers, consumers, and brokers all run on a cluster and co-operate as a logical group
• Built by LinkedIn Engineering to handle its activity stream

Anatomy of Kafka: Topics
• Topic: a category/feed to which messages are published
  – An ordered, immutable sequence of messages that is contiguously appended to
• Consumers subscribe to the topics they would like to read

Apache Storm
• Topologies: basically a graph, where
  – "processes"/"jobs" are nodes
  – "data flows" are edges
• Data flows are essentially data streams:
  – Spout: a source of a data stream, e.g., connecting to the Twitter API and emitting a tweet stream
  – Bolt: processes one or more input streams and emits yet another stream!
• Topologies run forever! (Unless you stop them.)

Comparing MapReduce and Storm
MapReduce/Hadoop:
• Runs MapReduce jobs
• Built to leverage batch processing: ordering is unimportant
• Very simple interface, tied to Java
• Massive distributed processing is the end goal
Storm:
• Runs topologies
• Built to leverage streaming applications: ordering is important
• More complex (spouts and bolts); has interfaces in Java, Python, Scala, etc.
• Massive volume and velocity processing is the end goal

Where Can We Combine the Two?
[Figure: a network of switches, Switch 1 through Switch 4.]
• Which switch has an authentic problem?
• How can we predict which switch will go down in the near future?
• Prediction is not easy:
  – Monitor "online"
  – Build the model "offline"
[Figure: event collector → event store (HDFS) → analytic model (NB), with validation of event sequences feeding back into monitoring.]

Summary of Streaming Data Analytics
• Data management issues are very different: streaming data storage and processing can force our analytic algorithms to think differently
• Example applications:
  – Sampling a data stream
  – Counting distinct items / frequent items
  – Querying with sliding windows
• Engineering perspective: Apache Storm and Kafka
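As a closing illustration of the Flajolet-Martin distinct-count estimator covered in Part III, here is a minimal Python sketch. The function names and the choice of hash are ours; any hash that spreads elements uniformly over enough bits will do:

```python
import hashlib

def r(value):
    """r(a): number of trailing 0s in h(a), i.e., the position of the
    first 1 bit counting from the right (we treat r(0) as 0)."""
    if value == 0:
        return 0
    return (value & -value).bit_length() - 1

def fm_estimate(stream):
    """Flajolet-Martin: hash each element, record R = max r(h(a)) over
    all elements seen, and estimate the distinct count as 2**R."""
    R = 0
    for a in stream:
        h = int(hashlib.sha1(str(a).encode("utf-8")).hexdigest(), 16)
        R = max(R, r(h))
    return 2 ** R

# The slides' example: h(a) = 12 = 1100 in binary, so r(a) = 2.
assert r(12) == 2
```

Note that only R needs to be stored, so the memory footprint is tiny regardless of stream length; in practice many independent hash functions are combined to reduce the variance of the 2^R estimate.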