COSC 526 Class 5
Streaming Data and Analytics
Arvind Ramanathan
Computational Science & Engineering Division
Oak Ridge National Laboratory, Oak Ridge
Ph: 865-576-7266
E-mail: ramanathana@ornl.gov
Projects
2
Projects
• Three main areas:
– Applications
– Algorithms
– Infrastructure
• Datasets:
– Healthcare claims (> 20 GB compressed)
– Twitter (1 TB compressed, one year of feed)
– Molecular simulations (> 1 TB compressed)
– Cybersecurity (>60 GB)
– Bioinformatics (>6 TB compressed)
– Imaging (> 2 TB compressed)
3
Applications
• Can you find patients with similar phenotypes
knowing their treatment patterns?
• Can you group behaviors of individuals based on
their tweet patterns?
• Can you predict time-points where an attack may
occur within a network?
• Can you piece together hundreds of millions of
base pairs against a reference genome?
• Can you characterize the cognitive processes
associated with decision making?
4
Infrastructure
• Scalable platform to host multiple types of
datasets
• GPU processing and implementation of various
algorithms (ICA, hierarchical eigensolvers, tensor
toolbox, etc.)
• Scalable platform to host related datasets for co-analysis processes
• Efficient transfer of data between CPU/GPU for
specific ML algorithms
5
Algorithms
• Machine learning algorithm implementation for
CPU/GPU resources
• How do we exploit Intel’s co-processors for doing
ML algorithm design?
• Multi-way Multi-task learning
6
Part 0: Introducing Streaming Data
7
Distributed Counting → Stream and Sort Counting

[Diagram: Machine A runs the counting logic over examples (example 1, example 2, example 3, …) and emits "increment C[x] by D" messages through standardized message-routing logic; the messages are sorted (C[x1] += D1, C[x1] += D2, …); Machine B — or machines C1..CK in parallel — runs the logic to combine counters. The counting side is trivial to parallelize; the combining side is easy to parallelize.]

• Can we do this as more data becomes available?
• Do we have to update any of the logic to achieve this?
• What are the necessary updates for our approach?
8
Is Naïve Bayes a Streaming Algorithm?
• Yes!
• In ML, we call this Online Learning
– Allows us to update the model as data becomes available
– Adapts slowly as the data streams in
• How are we doing this?
– NB: slow updates to the model
– Learn from the training data
– For every example in the stream, we update the model
(with a learning rate)
9
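To make the online-learning view concrete, below is a minimal sketch (not code from the lecture) of Naïve Bayes whose counters are updated one example at a time as the stream arrives; the bag-of-words task, the vocabulary size, and the Laplace smoothing constant are illustrative assumptions.

```python
import math
from collections import defaultdict

class StreamingNB:
    """Naive Bayes counters updated one example at a time (online learning)."""

    def __init__(self, vocab_size=100_000, alpha=1.0):
        self.doc_counts = defaultdict(int)    # C[Y=y]: documents per class
        self.word_counts = defaultdict(int)   # C[Y=y, W=w]: word counts per class
        self.token_totals = defaultdict(int)  # C[Y=y, W=*]: tokens per class
        self.n_docs = 0
        self.vocab_size = vocab_size          # assumed fixed in advance
        self.alpha = alpha                    # Laplace smoothing

    def update(self, words, label):
        """One counter pass per streamed example; no retraining from scratch."""
        self.doc_counts[label] += 1
        self.n_docs += 1
        for w in words:
            self.word_counts[(label, w)] += 1
            self.token_totals[label] += 1

    def predict(self, words):
        best, best_lp = None, float("-inf")
        for y in self.doc_counts:
            lp = math.log(self.doc_counts[y] / self.n_docs)
            denom = self.token_totals[y] + self.alpha * self.vocab_size
            for w in words:
                lp += math.log((self.word_counts[(y, w)] + self.alpha) / denom)
            if lp > best_lp:
                best, best_lp = y, lp
        return best

nb = StreamingNB()
nb.update(["cheap", "pills"], "spam")    # examples arrive one at a time
nb.update(["meeting", "notes"], "ham")
print(nb.predict(["cheap", "meeting"]))
```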
What Attributes of Data have we seen?
[Concept map relating data attributes to techniques covered in the course. Attributes: Large Feature Sets, Streaming Data, High Dimensional, Graph Data. Techniques: Map/Reduce, Stream & Sort, Request & Answer Counting, Rocchio's Algorithm, Moment Estimation, Querying Streams, Filtering Streams, Locality-Sensitive Hashing, Clustering, Dimensionality Reduction, Link Analysis, PageRank/SimRank, Spectral Methods, Community Detection.]
10
What can Hadoop/MapReduce do?

Can Do (Pros):
• High throughput: batch optimized
• Independent of storage, (somewhat) independent of language, flexible and fault-tolerant

Cannot Do (Cons):
• Streaming/infinite data: assumes data is resident (somewhere, some place), but streaming data is getting updated by the second
• Decision making is "slow", while streaming decisions need to be fast!
• Single fixed data flow
• Not I/O efficient

Lee, K.-Y., et al., SIGMOD Record, 2011, Vol. 40(4)
11
Streaming Data
• Data streams are “infinite” and “continuous”
– Distribution(s) can evolve over time
• Managing streaming data is important:
– Input is coming extremely fast!
– We don’t see all of the data
– (sometimes) input rates are limited (e.g., Google
queries, Twitter and Facebook status updates)
• Key question in streaming data:
– How do we make critical calculations about the stream
using limited (secondary) memory?
12
General Stream Processing Model

[Diagram: streams enter over time, each composed of elements/tuples (e.g., … 1, 5, 2, 7, 0, 9, 3; … a, r, v, t, y, h, b; … 0, 0, 1, 0, 1, 1, 0). A Processor serves Standing Queries and Ad-Hoc Queries and produces Output, using Limited Working Storage backed by Archival Storage.]
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive
Datasets, http://www.mmds.org
13
Requirements of Stream Processing

[Diagram: the stream-processing model annotated with requirements — real-time, fault-tolerant, scalable processing at the Processor; guaranteed delivery to destinations (outgoing); flexible data input and listeners that handle data failures (incoming); plus Standing Queries, Limited Working Storage, and Archival Storage.]
14
Stream model
• Inputs enter at a rapid rate:
– O(million/min) events
– at one or more ports
• Elements are usually called tuples
• How do we make critical calculations about the
stream with limited secondary memory?
• We cannot store all of the data stream!
15
Questions that we are interested in
• Sampling from a data stream:
– construct a random sample
• Queries over a sliding window:
– Number of items of type x in the last k elements of the stream
• Filtering a data stream:
– Select elements with property x from the stream
• Counting distinct elements:
– Number of distinct elements in the last k elements of the stream
• Estimating Moments
• Finding frequent items
16
Example Applications
• Finding relevant products using
clickstreams
– Amazon, Google, and others want to surface
relevant products based on user interactions
• Trending topics on Social media
• Health monitoring using sensor
networks
• Monitoring molecular dynamics
simulations
A. Ramanathan, J.-O. Yoo, and C. J. Langmead, On-the-fly identification of conformational sub-states from molecular dynamics simulations. J. Chem. Theory Comput. (2011), Vol. 7(3): 778-789.
A. Ramanathan, P. K. Agarwal, M. G. Kurnikova, and C. J. Langmead, An online approach for mining collective behaviors from molecular dynamics simulations. J. Comp. Biol. (2010), Vol. 17(3): 309-324.
17
Sampling from a Data stream
• Practical consideration: we cannot store all of the
data, but we can store a sample!
• Two problems:
1. Sample a fixed proportion of elements in the
stream (say 1 in 10)
2. Maintain a random sample of fixed size over a
potentially infinite stream
• Desirable property: at any time k, we would like a
random sample of s elements in which each of the
k elements seen so far has equal probability of
being sampled…
18
Part I: Sampling from a data stream
19
Sampling a fixed proportion
• Scenario: Let’s look at a search engine query stream
– Stream: <user, query, time>
– Why is this interesting?
– Answer questions like “how often did a user have to run the same
query in a single day?”
– We only have storage for 1/10th of the query stream
• Simple solution:
– Generate a random integer in [0..9] for each query
– Store the query if the integer is 0, otherwise discard
20
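A minimal sketch of this simple approach, assuming a stream of hypothetical (user, query, time) tuples:

```python
import random

def sample_fixed_proportion(stream, keep_one_in=10):
    """Keep a query iff a random integer in [0..keep_one_in-1] comes up 0."""
    for user, query, t in stream:
        if random.randint(0, keep_one_in - 1) == 0:
            yield (user, query, t)

stream = [("alice", "weather", 1), ("bob", "news", 2), ("alice", "weather", 3)]
print(list(sample_fixed_proportion(stream)))
```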
What is wrong with our simple approach?
• Suppose each user issues s queries once and d queries twice
= (s + 2d) total queries
• What fraction of queries by an average search engine
user are duplicates?
d/(s+d)
• If we keep only 10% of the queries:
– The sample will contain s/10 of the singleton queries and 2d/10 of the duplicate queries at least once
– But only d/100 pairs of duplicates:
• d/100 = 1/10 ∙ 1/10 ∙ d
– Of the d "duplicates", 18d/100 appear exactly once:
• 18d/100 = ((1/10 ∙ 9/10) + (9/10 ∙ 1/10)) ∙ d
– So the fraction appearing twice is (d/100) divided by (d/100 + s/10 + 18d/100):
• d/(10s + 19d), not d/(s + d)
21
Instead of queries, let’s sample users…
• Pick 1/10th of users and take all their searches in
the sample
• Use a hash function that hashes the
user name or user ID uniformly into 10 buckets
• Why is this a better solution?
22
Generalized Solution
• Stream of tuples with keys:
– Key is some subset of each tuple’s components
• e.g., tuple is (user, search, time); key is user
– Choice of key depends on application
• To get a sample of a/b fraction of the stream:
– Hash each tuple’s key uniformly into b buckets
– Pick the tuple if its hash value is at most a
Hash table with b buckets, pick the tuple if its hash value is at most a.
How to generate a 30% sample?
Hash into b=10 buckets, take the tuple if it hashes to one of the first 3 buckets
23
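A sketch of the generalized, key-based sampler; md5 here stands in for whatever uniform hash the system actually uses:

```python
import hashlib

def in_sample(key, a, b):
    """Keep a/b of the stream: hash the tuple's key into b buckets and
    keep keys landing in the first a buckets. Same key -> same decision."""
    h = int(hashlib.md5(str(key).encode()).hexdigest(), 16) % b
    return h < a

# 30% sample of users: b = 10 buckets, keep buckets 0, 1, 2.
stream = [("alice", "q1", 1), ("bob", "q2", 2), ("alice", "q3", 3)]
kept = [t for t in stream if in_sample(t[0], a=3, b=10)]
print(kept)   # all of a sampled user's queries stay together
```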
Maintaining a fixed size sample
• Suppose we need to maintain a random
sample S of size exactly s tuples
– E.g., main memory size constraint
• Don’t know length of stream in advance
• Suppose at time n we have seen n items
– Each item is in the sample S with equal prob. s/n
How to think about the problem: say s = 2
Stream: a x c y z k c d e g…
At n= 5, each of the first 5 tuples is included in the sample S with equal prob.
At n= 7, each of the first 7 tuples is included in the sample S with equal prob.
Impractical solution would be to store all the n tuples seen
so far and out of them pick s at random
24
Fixed Size Sample
• Algorithm (a.k.a. Reservoir Sampling)
– Store all the first s elements of the stream to S
– Suppose we have seen n-1 elements, and now
the nth element arrives (n > s)
• With probability s/n, keep the nth element, else discard it
• If we picked the nth element, then it replaces one of the
s elements in the sample S, picked uniformly at random
• Claim: This algorithm maintains a sample S
with the desired property:
– After n elements, the sample contains each element
seen so far with probability s/n
25
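A sketch of reservoir sampling as described above:

```python
import random

def reservoir_sample(stream, s):
    """After n elements, each element is in S with probability s/n."""
    S = []
    for n, x in enumerate(stream, start=1):
        if n <= s:
            S.append(x)                    # store the first s elements
        elif random.random() < s / n:      # keep the nth with prob. s/n
            S[random.randrange(s)] = x     # evict a uniformly chosen victim
    return S

print(reservoir_sample(range(1000), s=5))
```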
Proof: By Induction
• We prove this by induction:
– Assume that after n elements, the sample contains each element
seen so far with probability s/n
– We need to show that after seeing element n+1 the sample
maintains the property
• Sample contains each element seen so far with probability
s/(n+1)
• Base case:
– After we see n=s elements the sample S has the desired property
• Each out of n=s elements is in the sample with probability s/s =
1
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
26
Proof: By Induction
• Inductive hypothesis: After n elements, the sample S contains
each element seen so far with prob. s/n
• Now element n+1 arrives
• Inductive step: for an element already in S, the probability that the
algorithm keeps it in S when element n+1 arrives is:

(1 − s/(n+1)) + (s/(n+1)) ∙ ((s−1)/s) = n/(n+1)

where the first term is the probability that element n+1 is discarded, and
the second is the probability that element n+1 is not discarded but the
element in the sample is not the one picked for replacement.
• So, at time n, tuples in S were there with prob. s/n
• Over time n → n+1, a tuple stays in S with prob. n/(n+1)
• So the prob. a tuple is in S at time n+1 = (n/(n+1)) ∙ (s/n) = s/(n+1)
27
Part II: Querying using a Sliding Window
28
Sliding Window
• Queries are not about individual values, but
about a window of length N:
– the window holds the N most recent elements received
• What do we do if N is so large that the window cannot
fit into memory?
– a single stream is so large that it cannot fit into memory
– multiple streams together are so large that they don't fit into
memory
29
Example 1: A single stream may not fit into memory

[Diagram: molecular dynamics snapshots at T0, T100, …, T6000; each time step (Time = 0, 1, …, T−1, T) is a table of atomic coordinates (x1, y1, z1), (x2, y2, z2), …, (xN, yN, zN).]

• Which regions of the molecule are moving together?
• At which time points are the spatio-temporal patterns of motion changing?
• Often MD datasets are so large that they don't fit into memory (> several terabytes)
• Questions are not about a "snapshot" but about "time windows"
30
Example 2: Multiple streams, so they won't fit into memory

[Images: howstuffworks.com, fujitsu.com]
• Computer networks: information streaming from a variety of
computers on a network: <source, port, destination, no. of
packets>
• Sensor networks: multiple sensors reporting at different timescales…
31
Running example: Amazon stream
• For every product X we keep a 0/1 stream:
– 0: X not sold in the nth transaction
– 1: X sold in the nth transaction
• How many times have we sold X in the last k
sales?
010011011101010110110110
Past
32
Future
Counting Bits
• Given a stream of 0s and 1s:
– How many 1s are in the last k bits?
– k≤N
• Simple solution:
– Store only the most recent N bits…
010011011101010110110110
Past → Future
– When a new bit comes in, discard the (N+1)st oldest bit (here N = 6)
33
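A sketch of this exact baseline, which pays O(N) memory to keep the whole window:

```python
from collections import deque

class ExactWindowCounter:
    """Exact count of 1s in the last N bits by storing all N bits."""
    def __init__(self, N):
        self.window = deque(maxlen=N)   # oldest bit drops automatically
        self.ones = 0

    def add(self, bit):
        if len(self.window) == self.window.maxlen:
            self.ones -= self.window[0]   # the bit about to fall out
        self.window.append(bit)
        self.ones += bit

    def count(self):
        return self.ones

c = ExactWindowCounter(N=6)
for b in [0, 1, 0, 0, 1, 1, 0, 1, 1, 1]:
    c.add(b)
print(c.count())   # 1s among the last 6 bits
```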
What’s the problem?
• No exact answer without storing the actual
window!
– We need to store at least N bits to get the exact
answer
• Amazon sells approximately 400 products every
second: 24 × 3600 × 400 ≈ 34 million products
a day
– Let's say N = 1 billion!
– We quickly run out of memory
34
Can we approximate our solution?
• With
really
large N, an approximate solution is
Initialize
2 counters:
• S: number of 1s from the
acceptable
beginning of the stream
• Z: number
of 0s fromthat
the the bit stream is really
• What
if we assume
beginning
of the stream
uniformly
distributed?
N
0 1 0 Assuming
0 1 1 1 0 0 0uniformity:
10100100010110110111001010110011010
Past
Future
• How many 1s in the last N
• How
do we count the number of 1s in the last N
bits?
bits?
S
N.
S+Z
• What is the caveat in our algorithm?
– Are the distributions really uniform?
35
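A sketch of this two-counter approximation under the uniformity assumption:

```python
class UniformityEstimate:
    """Two global counters; if 1s were uniformly spread, the last N bits
    would contain about N * S / (S + Z) ones."""
    def __init__(self):
        self.S = 0   # 1s since the beginning of the stream
        self.Z = 0   # 0s since the beginning of the stream

    def add(self, bit):
        if bit:
            self.S += 1
        else:
            self.Z += 1

    def estimate_last(self, N):
        total = self.S + self.Z
        return N * self.S / total if total else 0.0
```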
Datar-Gionis-Indyk-Motwani (DGIM)
Algorithm
• Does not assume uniformity
• We store only O(log² N) bits per stream
• Solution gives an approximate answer:
– Never off by more than 50%
– Error can be reduced to any fraction ε > 0 with a more
complicated algorithm (and more bits, still
O(log² N))
36
Simple extension to the previous idea: Use
Exponential Windows
• Summarize exponentially increasing regions of
the stream, looking backward
• Drop small regions if they begin at the same
point as a larger region

[Diagram: over the stream 010011100010100100010110110111001010110011010, backward-looking regions of exponentially increasing width carry 1-counts 0, 1, 1, 2, 2, 3, 4, 10; the window of width 16 has 6 1s, and a "?" marks the region only partially inside the window N.]

We can reconstruct the count of the last N bits, except we
are not sure how many of the last 6 1s are included in the N
37
What works? (And what does not)

What works?
• Stores only O(log² N) bits
• Updates are easy!
• Error is no larger than the number of 1s in the unknown area

What does not work?
• Assumes that 1s are fairly evenly distributed
• If all the 1s are at the end of the unknown area, you cannot bound the error

[Diagram: the same exponential-window picture, with the 6 known 1s and a "?" over the unknown area at the window boundary N.]
38
DGIM: Summarize blocks with specific number
of 1s and associate a time-stamp
• We now vary not only the window size, but also
the block size!
• When there are few 1s in the window, block
sizes stay small, so errors are small
1001010110001011010101010101011010101010101110101010111010100010110010
N
• Each bit has a timestamp starting with 1, 2, …,
recorded modulo N
– [log₂ N bits suffice to represent a timestamp within a window]
39
DGIM: Buckets
• Divide the window into buckets, each storing:
1. The timestamp of its right (most recent) end
2. The number of 1s in the bucket
• (1) is stored in O(log N) bits; (2) in O(log log N) bits
• Important constraint on buckets:
– The number of 1s must be a power of 2!
1001010110001011010101010101011010101010101110101010111010100010110010
N
40
Representing a data stream by buckets
• Either one or two buckets with the same power-of-2 number of 1s
• Buckets do not overlap in timestamps
• Buckets are sorted by size
– Earlier buckets are not smaller than later buckets
• Buckets disappear when their
end-time is > N time units in the past
41
How does a bucketized stream look?

[Diagram: over the stream 1001010110001011010101010101011010101010101110101010111010100010110010, the window N is covered by: at least 1 bucket of size 16 (partially beyond the window), 2 of size 8, 2 of size 4, 1 of size 2, and 2 of size 1.]
• Either one or two buckets with the same power-of-2
number of 1s
• Buckets do not overlap in timestamps
• Buckets are sorted by size
42
How do we update our bucket?
• When a new bit comes in, drop the last (oldest)
bucket if its end-time is prior to N time units
before the current time
• If the current bit is 0: no other changes are
needed
• If the current bit is 1:
– Create a new bucket of size 1, for just this bit
• End timestamp = current time
– If there are now three buckets of size 1, combine the oldest two into a bucket
of size 2
– If there are now three buckets of size 2,
combine the oldest two into a bucket of size 4
– And so on …
43
Running through an example
Current state of the stream:
1001010110001011010101010101011010101010101110101010111010100010110010
Bit of value 1 arrives:
0010101100010110101010101010110101010101011101010101110101000101100101
Two orange (size-1) buckets get merged into a yellow (size-2) bucket:
0010101100010110101010101010110101010101011101010101110101000101100101
Next bit 1 arrives and a new orange (size-1) bucket is created; then 0 comes, then 1:
0101100010110101010101010110101010101011101010101110101000101100101101
Buckets get merged…
0101100010110101010101010110101010101011101010101110101000101100101101
State of the buckets after merging:
0101100010110101010101010110101010101011101010101110101000101100101101
44
Querying our bucketized stream
• To estimate the number of 1s in the most recent N bits:
1. Sum the sizes of all buckets but the last (oldest)
(note: "size" means the number of 1s in the bucket)
2. Add half the size of the last bucket
• Remember: we do not know how many 1s of the last
bucket are still within the wanted window
45
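Putting the bucket rules and the query together, here is a compact DGIM sketch (a simplified illustration; timestamps are kept absolute rather than modulo N for readability):

```python
class DGIM:
    """Approximate count of 1s in the last N bits with O(log^2 N) memory.
    Buckets are [end_timestamp, size] pairs with power-of-2 sizes and at
    most two buckets of each size, stored newest first."""

    def __init__(self, N):
        self.N = N
        self.t = 0
        self.buckets = []

    def add(self, bit):
        self.t += 1
        # Drop the oldest bucket once its end-time is N or more steps old.
        if self.buckets and self.buckets[-1][0] <= self.t - self.N:
            self.buckets.pop()
        if bit != 1:
            return
        self.buckets.insert(0, [self.t, 1])
        size = 1
        while True:   # cascade merges: three of a size -> merge oldest two
            idxs = [i for i, b in enumerate(self.buckets) if b[1] == size]
            if len(idxs) <= 2:
                break
            j = idxs[-2]              # newer of the two oldest same-size buckets
            self.buckets[j][1] *= 2   # its end-time is already the newer one
            del self.buckets[idxs[-1]]
            size *= 2

    def count(self):
        """Sum all bucket sizes but the oldest; add half the oldest."""
        if not self.buckets:
            return 0
        sizes = [b[1] for b in self.buckets]
        return sum(sizes[:-1]) + sizes[-1] / 2

d = DGIM(N=10)
for bit in [int(ch) for ch in "010011100010100100"]:
    d.add(bit)
print(d.count())   # estimate of 1s in the last 10 bits
```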
Part III: Counting Distinct Elements
46
Problem Definition
• A data stream consists of elements chosen from a
universe of size N
• Maintain a count of the number of distinct elements seen
so far
One Obvious Approach:
• Maintain the set of elements seen so far
• Keep a hashtable of all the distinct elements seen
so far
47
Example
• How many unique products did we sell tonight?
• How many unique requests on a website came in
today?
• How many different words did we find on a
website?
– Unusually large number of words could be indicative of
spam
48
Now, let’s constrain ourselves with limited
storage…
• How to estimate (in an unbiased manner) the
number of distinct elements seen?
– Flajolet-Martin Approach
• Pick a hash function h that maps each of the N
elements to at least log₂ N bits
– For each stream element a, let r(a) be the number of trailing 0s in h(a)
– r(a) = position of the first 1 counting from the right
• E.g., say h(a) = 12; 12 is 1100 in binary, so r(a) = 2
– Record R = the maximum r(a) seen so far: R = maxₐ r(a) over all items a
– Estimated number of distinct elements = 2^R
49
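A sketch of Flajolet-Martin as stated above; md5 is a stand-in hash, and in practice many independent hash functions (here, salts) are combined to reduce variance:

```python
import hashlib

def trailing_zeros(x):
    """r(a): position of the first 1 bit counting from the right."""
    if x == 0:
        return 0   # convention; an all-zero hash is vanishingly rare
    return (x & -x).bit_length() - 1

def flajolet_martin(stream, salt=""):
    """Estimate the distinct count as 2**R, R = max trailing zeros seen."""
    R = 0
    for a in stream:
        h = int(hashlib.md5((salt + str(a)).encode()).hexdigest(), 16)
        R = max(R, trailing_zeros(h))
    return 2 ** R

print(flajolet_martin(["x", "y", "x", "z", "y", "w"]))   # ~4 distinct
```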
Part IV: Counting Item Sets
50
Problem Definition
• Given a stream, which items appear more than s
times in the given window?

One Approach:
• Think of the stream as one binary stream per item
• 1 if the item is present; 0 if it is not
• Use DGIM to estimate the count of 1s for each item

[Diagram: the exponential-window/bucket picture from before, applied to one item's binary stream over window N.]
51
Exponentially Decaying Windows
• A heuristic for selecting likely frequent item(sets)
• Example:
– What are “currently” the most popular movies?
• Instead of computing the raw counts of the last N
elements, we need a “smooth aggregate” on the
whole stream
52
What do we mean by a "smoothed" aggregate?
• If a stream has a1, a2, …, at, where a1 is the first
element to arrive and at is the current element
• Let c be a small constant (10⁻⁶ to 10⁻⁹)
• We define the exponentially decaying window (as of time t) to be:

sum over i = 1..t of aᵢ ∙ (1 − c)^(t−i)

• When a new item a_{t+1} arrives:
– multiply the current sum by (1 − c) and add a_{t+1}
53
Now, how do we count items…
• If each aᵢ is an item, we can compute the characteristic weight of each
possible item x as an exponentially decaying window:

sum over i = 1..t of δᵢ ∙ (1 − c)^(t−i), where δᵢ = 1 if aᵢ = x and 0 otherwise

• Imagine that for each item x we have a binary stream (1 if x appears; 0
otherwise)
• When a new item x appears:
– Multiply all counts by (1 − c)
– Add +1 to the count of x
• Call this sum the "weight of x"
54
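A sketch of these per-item decaying weights; rather than literally multiplying every count by (1 − c) per arrival, it decays lazily at read time, which is mathematically equivalent:

```python
class DecayingCounts:
    """weight(x) at time t = sum over arrivals of x of (1-c)**(t - i),
    computed lazily per item instead of rescaling every count per arrival."""

    def __init__(self, c=1e-6):
        self.c = c
        self.t = 0
        self.last_seen = {}   # item -> time of its last update
        self.weight = {}      # item -> weight as of last_seen[item]

    def add(self, x):
        self.t += 1
        dt = self.t - self.last_seen.get(x, self.t)
        # Decay the stored weight forward to now, then add this arrival.
        self.weight[x] = self.weight.get(x, 0.0) * (1 - self.c) ** dt + 1.0
        self.last_seen[x] = self.t

    def get(self, x):
        dt = self.t - self.last_seen.get(x, self.t)
        return self.weight.get(x, 0.0) * (1 - self.c) ** dt

w = DecayingCounts(c=0.1)
for movie in ["A", "A", "B", "A", "C", "C"]:
    w.add(movie)
print(w.get("A"), w.get("C"))   # "currently popular" weights
```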
Sliding vs. Decaying Windows
• A property to remember: the sum of the weights over all items is at most 1/c
(since the geometric series (1 − c)⁰ + (1 − c)¹ + … sums to 1/c)
55
A practical application: Creating a story
board from Molecular Dynamics (MD)
Simulation
• We run really long time-scale simulations:
– 10⁻¹⁵ s (femtosecond) time steps out to 10⁻⁶ s (microsecond) trajectories
• Data sets can typically range from a few GB to
O(100) TB!!
– How to find “events” of interest within these datasets?
– How to organize a “storyboard” of events for biologists
to interpret results?
56
Using Higher-order Moments to capture
events in MD simulations
57
Putting it all together…
58
Part V: Engineering for Streaming Data
• Apache Kafka
• Apache Storm
59
Why do we need to engineer?
• Supporting data integration:
– Making “all” data available in a
uniform way for all its services
and systems
• This is challenging:
– Event data fire-hose: Log Data
comes in various forms!!
– Explosion of specialized data
systems: e.g., OLAP,
databases, etc.
[Figure: a hierarchy of data needs — Evolution (dynamic predictive models); Usability (data management & reporting); Quality (accuracy, validity, completeness, etc.); Availability/Collection (data capture, access, availability, etc.).]

Jay Kreps, LinkedIn Engineering Blog: http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
60
Apache Kafka
• Distributed Messaging System

[Diagram: the stream-processing requirements picture from slide 14 — real-time, fault-tolerant, scalable processing; guaranteed delivery; flexible data input; listeners; handling data failures — which Kafka is designed to address.]
61
Three components of a messaging system
• Publish/Subscribe model
• Message Producer, Consumer, Broker
– Durable message: one that is held on to if a consumer
becomes unavailable
– Persistent message: one whose delivery is guaranteed
• Both kinds of messages cost!
– Storing metadata about billions of messages creates
overhead
– Storage structures (e.g., B-trees) can impose
restrictions
62
Messaging with Apache Kafka
[Diagram: Application 1, Front End 2, and Service 3 publish into Apache Kafka; Hadoop clusters, real-time monitoring, databases, etc. consume from it.]
• Explicitly distributed:
– Producers, Consumers, Brokers all run on a cluster
that co-operate as a logical group
• LinkedIn Engineering: to handle activity stream
63
Anatomy of Kafka: Topics
• Topic: a category/feed to which messages are
published
– An ordered, immutable sequence of messages that is
contiguously appended to
• Consumers subscribe to topics that they would
like to read
64
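For flavor, a minimal producer/consumer sketch; it assumes the third-party kafka-python client (pip install kafka-python) and a broker at localhost:9092, and the topic name is made up:

```python
from kafka import KafkaProducer, KafkaConsumer

# Publish a message to a (hypothetical) activity-stream topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("activity-stream", b"user:42 clicked product:1234")
producer.flush()

# Subscribe and read messages from the beginning of the topic.
consumer = KafkaConsumer("activity-stream",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest")
for msg in consumer:
    print(msg.topic, msg.offset, msg.value)   # ordered within a partition
```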
Apache Storm
• Topologies: basically a graph, where:
– "processes"/"jobs" are nodes
– "data flows" are edges
• Data flows are essentially data
streams:
– Spout: a source of a data stream,
e.g., connecting to the Twitter API and
emitting a tweet stream
– Bolt: processes one or more input
streams (from spouts or other bolts)
and emits yet another stream!
65
Topologies run forever!
[unless you stop them]
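To illustrate the topology idea only (this is not the Storm API), a toy spout and two bolts wired as a graph of generator stages:

```python
# Conceptual toy: a spout is a stream source; bolts consume one stream
# and emit another; chaining them is the "topology" graph.
def tweet_spout():
    for text in ["storm is neat", "kafka feeds storm", "storm storm"]:
        yield text                          # spout: emits a tweet stream

def split_bolt(stream):
    for text in stream:
        for word in text.split():
            yield word                      # bolt: emits a word stream

def count_bolt(stream):
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
        yield word, counts[word]            # bolt: emits running counts

for pair in count_bolt(split_bolt(tweet_spout())):
    print(pair)
```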
Comparing MapReduce & Storm

MapReduce/Hadoop:
• Runs MapReduce jobs
• Built to leverage batch processing: ordering is unimportant
• Very simple interface: tied to Java
• Massive distributed processing is the end goal

Storm:
• Runs topologies
• Built to leverage streaming applications: ordering is important
• More complex (spouts and bolts); has interfaces in Java, Python, Scala, etc.
• Massive volume & velocity processing is the end goal
66
Where can we combine the two?

• Which switch has an authentic problem?
• How to predict which switch will go down in
the near future?
• Prediction is not easy:
– Monitor "online"
– Build model "offline"

[Diagram: Switches 1–4 emit events → Event Collector → Event Store (HDFS); event sequences are validated against an analytic model (NB) built offline.]
67
Summary of Streaming Data Analytics
• Data Management issues are very different:
– Streaming data storage and processing constraints force
our analytic algorithms to work differently
• Example applications:
– Sampling a data stream
– Counting distinct items/frequent items
– Querying with sliding windows
• Engineering Perspective:
– Apache Storm and Kafka
68