Mining Data Streams
Some of these slides are based on the Stanford Mining Massive Data Sets course slides at http://www.stanford.edu/class/cs246/

Infinite Data: Data Streams
• Until now, we assumed that the data is finite
  – Data is crawled and stored
  – Queries are issued against an index
• However, sometimes data is infinite (it is constantly being created) and therefore becomes too big to be stored
  – Examples of data streams?
  – Later: How can we query such data?

Sensor Data
• Millions of sensors
• Each sending data every (fraction of a) second
• Analyze trends over time
LOBO Observatory: http://www.mbari.org/lobo/Intro.html

Image Data
• Satellites send terabytes of data a day
• Cameras have lower resolution, but there are more of them…
Image credits: http://www.defenceweb.co.za/images/stories/AIR/Air_new/satellite.jpg, http://en.wikipedia.org/wiki/File:Three_Surveillance_cameras.jpg

Internet and Web Traffic
• Streams of HTTP requests, IP addresses, etc., can be used to monitor traffic for suspicious activity
• Streams of Google queries can be used to determine query trends, or to understand disease spread
  – Can the spread of the flu be predicted accurately by Google queries?

Social Updates
• Twitter
• Facebook
• Instagram
• Flickr
• …
Image credit: http://blog.socialmaximizer.com/wpcontent/uploads/2013/02/social_media.jpg

The Stream Model
• Tuples in the stream enter at a rapid rate
• We have a fixed amount of working memory
• We have a fixed (but larger) archival storage
• How do you make critical calculations quickly, i.e., using only working memory?

The Stream Model: More Details
[Figure: a stream processor. Several streams enter (e.g., …1, 5, 2, 7, 0, 9, 3; …a, r, v, t, y, h, b; …0, 0, 1, 0, 1, 1, 0), each composed of tuples/elements. Standing queries and ad-hoc queries run against the processor, which produces output using limited working storage, backed by archival storage.]

Stream Queries
• Standing queries versus ad-hoc queries
  – Usually store a sliding window for ad-hoc queries
• Examples:
  – Output an alert when an element > 25 appears
  – Compute a running average of the last 20 elements
  – Compute a running average of the elements from the past 48 hours
  – Maximum value ever seen
  – Compute the number of different elements seen
• Are these easy or hard to answer?

Constraints and Issues
• Data appears rapidly
  – Elements must be processed in real time (= in main memory), or are lost
• We may be willing to get an approximate answer, instead of an exact answer, to save time
• Hashing will be useful, as it adds randomness to algorithms!

Sampling Data in a Stream

Goal
• Create a sample of the stream
• Answer queries over the sample and have the answers be representative of the entire stream
• Two different problems:
  – Sample a fixed proportion of the elements (e.g., 1 in 10)
  – Maintain a random sample of fixed size (e.g., at any time k, we should have a random sample of s elements)

Motivating Example
• Search engine stream of tuples
  – (user, query, time)
• We have room to store 1/10 of the stream
• Query: What fraction of the typical user’s queries were repeated over the past month?
• Ideas?

Solution Attempt
• For each tuple, choose a random number from 0 to 9
• Store tuples for which the random value was 0
• On average, per user, we store 1/10 of their queries
• Can this be used to answer our question on repeated queries? (See the sketch below.)
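A minimal Python sketch of this attempt, assuming hypothetical (user, query, time) tuples; it keeps a tuple exactly when its random digit is 0:

```python
import random

def naive_sample(stream):
    """Keep each (user, query, time) tuple iff a random digit 0-9 is 0.
    Stores ~1/10 of all tuples, but samples each occurrence of a query
    independently, which is what breaks the duplicate-query estimate."""
    sample = []
    for user, query, time in stream:
        if random.randint(0, 9) == 0:
            sample.append((user, query, time))
    return sample
```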
Wrong!
• Suppose in the past month, a user issued x queries once and d queries twice (in total, x + 2d queries)
• Correct answer: d/(x+d)
• We have a 10% sample, so it contains x/10 of the singleton queries; of the d duplicated queries:
  – d/100 appear as pairs in the sample: (1/10) · (1/10) · d
  – 18d/100 appear exactly once: ((1/10 · 9/10) + (9/10 · 1/10)) · d
• We will give the following wrong answer:

  $\frac{d/100}{x/10 + d/100 + 18d/100} = \frac{d}{10x + 19d}$

Solution
• Pick 1/10 of the users, and take all their searches into the sample
• Use a hash function that hashes the user name (or user id) into 10 buckets
  – Store the data if the user is hashed into bucket 0
• How would you store a fixed fraction a/b of the stream?

Variant of the Problem
• Store a random sample of fixed size (e.g., at any time k, we should have a random sample of s elements)
  – Each of the k elements seen so far should have the same probability of being in the sample
  – That is, each of the k elements seen so far should have probability s/k of being in the sample
• Ideas?

Reservoir Sampling
1. Store the first s elements of the stream in S
2. Suppose we have seen n-1 elements, and now the n-th arrives (n > s)
  – With probability s/n, keep the n-th element; otherwise, discard it
  – If we keep the n-th element, then randomly choose one of the elements already in S to discard
Claim: After n elements, the sample contains each element seen so far with probability s/n

Proof by Induction
• Inductive claim: If, after n elements, the sample S contains each element seen so far with probability s/n, then, after n+1 elements, the sample S contains each element with probability s/(n+1)
• Base case: n = s; each of the first n elements has probability s/n = 1 of being in the sample

Proof by Induction (cont)
• Inductive step: For an element already in the sample, the probability that it remains when element n+1 arrives is

  $\left(1-\frac{s}{n+1}\right) + \frac{s}{n+1}\cdot\frac{s-1}{s} = \frac{n}{n+1}$

  (first term: element n+1 is discarded; second term: element n+1 is kept, but our element survives the random eviction)
• At time n, a tuple is in the sample with probability s/n, so at time n+1 its probability of being in the sample is (s/n) · (n/(n+1)) = s/(n+1). (A code sketch follows.)
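A short Python sketch of reservoir sampling as described above; the function name and stream format are ours:

```python
import random

def reservoir_sample(stream, s):
    """Maintain a uniform random sample S of size s: after n elements,
    each element seen so far is in S with probability s/n."""
    S = []
    for n, x in enumerate(stream, start=1):
        if n <= s:
            S.append(x)                    # store the first s elements
        elif random.random() < s / n:      # keep the n-th element w.p. s/n...
            S[random.randrange(s)] = x     # ...evicting a uniform victim
    return S
```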
Filtering Stream Data

The Problem
• We would like to filter the stream according to some criterion
  – If it is a simple property of the tuple (e.g., < 10): Easy!
  – If it requires a lookup in a large set that does not fit in memory: Hard!
• Example: Stream of URLs (for crawling). Filter out URLs already “seen”
• Example: Stream of emails. We have addresses believed to be non-spam. Filter out “spam” (perhaps for further processing)

Filtering Example
• Suppose we have 1 billion “allowed” email addresses
• Each email address takes ~20 bytes
• We have 1GB of available main memory
  – We cannot store a hash table of allowed emails in main memory!
• View main memory as a bit array
  – We have 8 billion bits available

Simple Filtering Solution
• Let h be a hash function from email addresses to 8 billion bit positions
• Hash each allowed email, and set the corresponding bit to 1
• There are 1 billion email addresses, so about 1/8 of the bits will be 1 (perhaps fewer, due to hash collisions!)
  – Given a new email, compute its hash. If the bit is 1, let the email through. If it is 0, consider it spam
  – There will be false positives! (about 1/8 of the spam gets through)

Bloom Filter
• A Bloom filter consists of:
  – An array of n bits, all initialized to 0
  – A collection of hash functions h1,…,hk. Each function maps key values to positions in 1,…,n
  – A set S of m key values
• Goal: Allow all stream elements whose keys are in S. Reject all others.

Using a Bloom Filter
• To set up the filter:
  – Take each key K in S and set bit hi(K) to 1, for all hash functions h1,…,hk
• To use the filter:
  – Given a new key K in the stream, compute hi(K) for all hash functions h1,…,hk
  – If all of these bits are 1, allow the element. If at least one of these bits is 0, reject the element
• Can we have false negatives? Can we have false positives?

Probability of a False Positive
• Model: Throwing darts at targets
  – Suppose we have x targets and y darts
  – Any dart is equally likely to hit any target
  – How many targets will be hit at least once?
• Probability that a given dart does not hit a given target: (x-1)/x
• Probability that none of the darts hits a given target:

  $\left(\frac{x-1}{x}\right)^{y} = \left(1-\frac{1}{x}\right)^{x\cdot\frac{y}{x}}$

Probability of a False Positive (cont)
• As x goes to infinity, it is well known that

  $\lim_{x\to\infty}\left(1-\frac{1}{x}\right)^{x} = \frac{1}{e}$

• So, if x is large, we have

  $\left(1-\frac{1}{x}\right)^{x\cdot\frac{y}{x}} \approx \left(\frac{1}{e}\right)^{y/x} = e^{-y/x}$

Back to the Running Example
• There are 1 billion email addresses (= 1 billion darts)
• There are 8 billion bits (= 8 billion targets)
• For now, assume a single hash function
• Probability that a given bit is not hit:

  $e^{-1{,}000{,}000{,}000/8{,}000{,}000{,}000} = e^{-1/8} \approx 0.8825$

• So the probability that a bit is hit is approximately 0.1175

Now, Multiple Hash Functions
• Suppose we use k hash functions
• The number of targets is n
• The number of darts is km
• Probability that a bit remains 0: $e^{-km/n}$
• Choose k so as to have enough 0 bits
• The probability of a false positive is now

  $\left(1-e^{-km/n}\right)^{k}$

(A minimal Bloom filter sketch follows.)
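Here is a minimal Bloom-filter sketch in Python. The slides do not prescribe concrete hash functions; salted MD5 is an assumption made for illustration.

```python
import hashlib

class BloomFilter:
    def __init__(self, n, k):
        self.n, self.k = n, k               # n bits, k hash functions
        self.bits = bytearray((n + 7) // 8)

    def _positions(self, key):
        # h_i(key): MD5 salted with the function index i (our choice)
        for i in range(self.k):
            h = int(hashlib.md5(f"{i}:{key}".encode()).hexdigest(), 16)
            yield h % self.n

    def add(self, key):
        for p in self._positions(key):      # set bit h_i(key) to 1, for all i
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        # all k bits set => allow (possible false positive); any 0 => reject
        return all(self.bits[p // 8] >> (p % 8) & 1
                   for p in self._positions(key))
```

As a design note: for n bits and m keys, the standard (not derived on these slides) choice k ≈ (n/m) ln 2 minimizes the false-positive rate; with n = 8m, that gives k ≈ 5 or 6.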
Counting Distinct Stream Elements

Problem
• The stream elements are drawn from some universal set
• Estimate the number of different elements we have seen, from the beginning or from a specific point in time
• Why is this hard?

Applications
• How many different words appear in the Web pages at a site?
  – Why is this interesting?
• How many different Web pages does a customer request in a week?
  – Why is this interesting?
• How many distinct products have we sold in the last month?

Obvious Solution
• Keep all elements seen so far in a hash table
• When an element appears, check whether it is in the hash table
• But what if there are too many elements to store in the hash table?
• Or too many streams to store each in a hash table?
• Solution: Use a small amount of memory, and estimate the correct value
  – Limit the probability that the error is large

Flajolet-Martin Algorithm: Intuition
• We will hash elements of the stream to bit strings
  – Must choose the bit-string length so that there are more possible bit strings than members of the universal set
  – Example: IPv4 addresses form a universal set of size 2^(4·8); we need strings of length at least 32 bits
• The more elements there are, the more likely we are to see “unusual” hash values

Flajolet-Martin Algorithm: Intuition (cont)
• Pick a hash function h that maps the universe of N elements to at least log2 N bits
• For each stream element a, let r(a) be the number of trailing 0-s in h(a)
  – r(a) = the position of the rightmost 1, counting from 0
  – Example: h(a) = 12; 12 is 1100 in binary, so r(a) = 2
• Use R to denote the maximum tail length seen so far
• Estimate the number of distinct elements as 2^R

Why It Works: Very Rough and Heuristic
• h(a) hashes a with equal probability to any of N values
• Then h(a) is a sequence of log2 N bits, where a 2^(-r) fraction of all a-s have a tail of r zeros
  – About 50% of a-s hash to ***0
  – About 25% of a-s hash to **00
  – So, if the longest tail is R = 2, then we have probably seen about 4 distinct items so far
• In general, we need to hash about 2^r distinct items before we see one with a zero-suffix of length r

Why It Works: More Formally
• Let m be the number of distinct items
• We will show that the probability of finding a tail of r zeros:
  – Goes to 1 if m is much greater than 2^r
  – Goes to 0 if m is much smaller than 2^r
• So, 2^R will be around m!

Why It Works: More Formally (cont)
• What is the probability that a given h(a) ends in at least r 0-s?
  – h(a) hashes elements uniformly at random
  – The probability that a random number ends in at least r 0-s is 2^(-r)
• The probability of not seeing a tail of length r among m elements: (1-2^(-r))^m

Why It Works: More Formally (cont)
• The probability of not finding a tail of length r is:

  $\left(1-2^{-r}\right)^{m} = \left(\left(1-2^{-r}\right)^{2^{r}}\right)^{m2^{-r}} \approx e^{-m2^{-r}}$

  – If m << 2^r, then this probability tends to 1, so the probability of finding a tail of length r tends to 0
  – If m >> 2^r, then this probability tends to 0, so the probability of finding a tail of length r tends to 1
• So, 2^R will be around m!

Why It Doesn’t Work
• E[2^R] is infinite!
  – The probability halves when R → R+1, but the value doubles
• The workaround uses many hash functions hi and many samples of Ri
  – Details omitted (but see the sketch below)
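A rough Python sketch of the basic Flajolet-Martin idea with several hash functions; the combining of the per-hash estimates (e.g., medians of averages) is omitted here, as on the slides, and the salted-MD5 hashing is our assumption:

```python
import hashlib

def tail_length(n):
    """Number of trailing 0 bits of n (0 for n == 0, a degenerate case)."""
    return (n & -n).bit_length() - 1 if n else 0

def fm_estimates(stream, num_hashes=16):
    """Track R = maximum tail length per hash function; estimate 2**R each."""
    R = [0] * num_hashes
    for a in stream:
        for i in range(num_hashes):
            h = int(hashlib.md5(f"{i}:{a}".encode()).hexdigest(), 16)
            R[i] = max(R[i], tail_length(h))
    return [2 ** r for r in R]
```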
Counting Ones in a Window

Problem
• We have a window of length N on a binary stream
  – N is too big to store
• At all times, we want to be able to answer: “How many 1-s are there in the last k bits?” (k ≤ N)
• Example: For each spam mail seen, we emitted a 1. Now we want to always know how many of the last million emails were spam
• Example: For each tweet seen, we emitted a 1 if it is anti-Semitic. Now we want to always know how many of the last billion tweets were anti-Semitic

Cost of Exact Counts
• For a precise count, we must store all N bits of the window
  – This is easy to show, even if we can only ask about the number of 1-s in the entire window
  – Suppose we use j < N bits to store information
  – Since 2^j < 2^N, there are two different bit sequences that are represented in the same manner
  – Suppose the sequences agree on their last k-1 bits, but differ on the k-th from the right
  – Follow the window by N-k 1-s
  – We remain with the same representation for the two different strings; but each window now consists of the last k bits of its original sequence followed by N-k 1-s, so the two windows contain a different number of 1-s!

The Datar-Gionis-Indyk-Motwani Algorithm (DGIM Algorithm)
• Uses O(log^2 N) bits and gets an estimate whose error is no more than 50%
  – Later, we improve this to get a better estimate
• Assume: each bit has a timestamp (i.e., the position at which it arrives)
  – Represent timestamps modulo N; each requires log N bits
  – Store the total number of bits ever seen, modulo N

DGIM Algorithm
• Divide the window into buckets, each consisting of:
  – The timestamp of its right (most recent) end
  – The size of the bucket = the number of 1-s in the bucket. This number must be a power of 2
• Bucket representation: log N bits for the timestamp + log log N bits for the size
  – We know that the size is some 2^j, so we only store j; j is at most log N and so needs log log N bits

Rules for Representing a Stream by Buckets
1) The right end of a bucket always has a 1
2) Every position with a 1 is in some bucket
3) No position is in more than one bucket
4) There are one or two buckets of any given size, up to some maximum size
5) All sizes must be powers of 2
6) Buckets cannot decrease in size as we move to the left

Example: Bucketized Stream
…1011011000101110110010110
[Buckets, from left to right: one of size 8, two of size 4, one of size 2, two of size 1]
• Observe that all of the DGIM rules hold

Space Requirements
• Each bucket requires O(log N) bits
  – We saw this earlier
• For a window of length N, there can be at most N 1-s
  – If the largest bucket has size 2^j, then j cannot be more than log N
  – There are at most 2 buckets of each size, from 2^(log N) down to 1, i.e., O(log N) buckets
  – Total space for buckets: O(log^2 N)

Query Answering
• Given k ≤ N, we want to know how many of the last k bits were 1-s
  – Find the bucket b with the earliest timestamp that includes at least some of the last k bits
  – Estimate the number of 1-s as the sum of the sizes of the buckets to the right of b, plus half the size of b

Example
…1011011000101110110010110
[One bucket of size 8, two of size 4, one of size 2, two of size 1]
• What would be your estimate if k = 10?
• What if k = 15?

How Good an Estimate Is This?
• Suppose that the leftmost bucket b included has size 2^j. Let c be the correct answer.
  – Suppose the estimate is less than c: In the worst case, all the 1-s in the leftmost bucket are in the query range. So the estimate misses half of bucket b, i.e., 2^(j-1)
• Then c is at least 2^j
• Actually, c is at least 2^(j+1)-1, since there is at least one bucket of each of the sizes 2^(j-1), 2^(j-2), …, 1
• So our estimate is at least 50% of c

How Good an Estimate Is This? (cont)
• Suppose that the leftmost bucket b included has size 2^j. Let c be the correct answer.
  – Suppose the estimate is more than c: In the worst case, only the rightmost 1 in the leftmost bucket is in the query range, and there is only one bucket of each size less than 2^j
• Then c = 1 + 2^(j-1) + 2^(j-2) + … + 1 = 2^j
• Our estimate was 2^(j-1) + 2^(j-1) + 2^(j-2) + … + 1 = 2^j + 2^(j-1) - 1
• So our estimate is at most 50% more than c

Maintaining the DGIM Conditions
• Suppose we have a window of length N represented by buckets satisfying the DGIM conditions. Then a new bit comes in:
  – Check the leftmost bucket. If its timestamp is not later than the current timestamp minus N, remove the bucket
  – If the new bit is 0, do nothing
  – If the new bit is 1, create a new bucket of size 1
    • If there are now only 2 buckets of size 1, stop
    • Otherwise, merge the two earliest buckets of size 1 into a bucket of size 2
      – If there are now only 2 buckets of size 2, stop
      – Otherwise, merge the two earliest buckets of size 2 into a bucket of size 4
      – Etc.
• Time: at most log N merges, since there are only log N different sizes

Example: Updating Buckets
[Figure (slide by Jure Leskovec, Mining Massive Datasets): six successive snapshots of a bucketized bit stream, showing buckets being created and merged as new bits arrive]

Example
…1011011000101110110010110
[One bucket of size 8, two of size 4, one of size 2, two of size 1]
• What happens if the next 3 bits are 1, 1, 1? (A code sketch of the update and query steps follows.)
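A simplified Python sketch of the DGIM update and query steps, assuming buckets are kept newest-first as (timestamp, size) pairs and timestamps are plain integers rather than values modulo N:

```python
def dgim_update(buckets, bit, ts, N):
    """Process one bit arriving at timestamp ts over a window of length N."""
    if buckets and buckets[-1][0] <= ts - N:   # leftmost bucket expired
        buckets.pop()
    if bit == 0:
        return
    buckets.insert(0, (ts, 1))                 # new bucket of size 1
    size = 1
    while True:                                # cascade of merges
        idx = [i for i, (_, s) in enumerate(buckets) if s == size]
        if len(idx) <= 2:
            return
        i, j = idx[-2], idx[-1]                # the two earliest of this size
        buckets[i] = (buckets[i][0], 2 * size) # keep the later right-end time
        del buckets[j]
        size *= 2

def dgim_query(buckets, k, ts):
    """Estimate the number of 1-s among the last k bits."""
    total = last = 0
    for t, size in buckets:                    # newest first
        if t <= ts - k:
            break
        total, last = total + size, size
    return total - last // 2                   # count half of leftmost bucket
```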
Reducing the Error
• Instead of allowing 1 or 2 buckets of each size, allow r-1 or r buckets of each size 1, 2, 4, 8, … (for some integer r > 2)
  – For the smallest and largest sizes present, we allow any number of buckets, from 1 to r
  – Use a propagation (merging) algorithm similar to the one before
• Buckets are smaller, so there is a tighter bound on the error
  – One can prove that the error is at most 1/r

Counting Frequent Items

Problem
• Given a stream, which items are currently popular?
  – E.g., popular movie tickets, popular items being sold on Amazon, etc.
  – “Popular” = appears more than s times in the window
• Possible solution:
  – Derive one binary stream per item: at each time point, emit “1” if the item appears in the original stream and “0” otherwise
  – Use DGIM to estimate the count of 1-s for each item

Example
• Original stream: 1, 2, 1, 1, 3, 2, 4
  – Stream for 1: 1, 0, 1, 1, 0, 0, 0
  – Stream for 2: 0, 1, 0, 0, 0, 1, 0
  – Stream for 3: 0, 0, 0, 0, 1, 0, 0
  – Stream for 4: 0, 0, 0, 0, 0, 0, 1
• Problem with this approach? Too many streams!

Exponentially Decaying Windows
• A heuristic for selecting frequent items
  – Gives more weight to more recent popular items
  – Instead of computing the count over the last N elements, compute a smooth aggregation over the entire stream
• If the stream is a1, a2, … and we are taking the sum over the stream, the answer at time t is defined as

  $\sum_{i=1}^{t} a_i (1-c)^{t-i}$

  where c is a tiny constant, ~10^(-6)

Exponentially Decaying Windows (cont)
• When a new element a(t+1) arrives: (1) multiply the current sum by (1-c), and (2) add a(t+1)

Example
• Stream for 1: 1, 0, 1, 1, 0, 0, 0
• Stream for 4: 0, 0, 0, 0, 0, 0, 1
• Suppose c = 10^(-6)
• What is the running score $\sum_{i=1}^{t} a_i (1-c)^{t-i}$ for each stream?

Back to Our Problem
• For each distinct item x, we compute the running score of the stream defined by the characteristic function of the item
  – The stream that has a 1 whenever the item appears and a 0 otherwise:

  $\sum_{i=1}^{t} \delta_i^x (1-c)^{t-i}$, where $\delta_i^x = 1$ if $a_i = x$ and 0 otherwise

Retaining the Running Scores
• Each time we see some item x in the stream:
  1) Multiply all running scores by (1-c)
  2) Add 1 to the running score corresponding to x (create a new running score with value 1 if there is none)
• How much memory do we need for the running scores?

Property of Decaying Windows
• Remember, for each item x, we have a running score $\sum_{i=1}^{t} \delta_i^x (1-c)^{t-i}$
• Summing over all running scores, we get

  $\sum_{x}\sum_{i=1}^{t} \delta_i^x (1-c)^{t-i} = \sum_{i=1}^{t} (1-c)^{t-i} \le \frac{1}{c}$

Memory Requirements
• Suppose we want to find items with weight greater than 1/2
• Since the sum of all scores is at most 1/c, there can’t be more than 2/c items with weight 1/2 or more!
• So 2/c is a limit on the number of scores being maintained at any time
  – (To enforce this, a score can be dropped once it falls below the 1/2 threshold)
  – For other weight requirements, we would get a different bound in a similar manner
• Think about it: How would you choose c? (A sketch of one update step follows.)
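A Python sketch of one update step of this heuristic; the dict representation and the threshold parameter are our choices (the slides use a threshold of 1/2):

```python
def decay_update(scores, item, c=1e-6, threshold=0.5):
    """On each arriving item: decay every running score by (1-c), drop
    scores below the threshold, and add 1 to the arriving item's score."""
    for x in list(scores):
        scores[x] *= (1 - c)
        if scores[x] < threshold:
            del scores[x]
    scores[item] = scores.get(item, 0.0) + 1.0
```

Decaying every score literally on every arrival costs O(#scores) per element, but with the 1/2 threshold at most ~2/c scores are retained at any time, matching the bound above.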