Massive Data Streams
in Graph Theory and
Computational Geometry
Ph.D. Dissertation Defense
Jian Zhang
Advisor: Joan Feigenbaum
Committee: Ravi Kannan
Avi Silberschatz
Sampath Kannan (UPenn)
Support: NSF grants 0105337 and 0331548
Talk Outline

Streaming computational model

Overview of results

Approximate graph distances in the streaming
model

Future research directions
June 15, 2005
J. Zhang - Ph.D. Dissertation Defense
2
Data Streams

A data stream is a sequence of data elements:
a_1, a_2, …, a_n.



Data elements have different forms in different
applications.
Stream of stock prices: scalar value
Stream of IP packets: tuple
The semantics of the data elements are also
different in different applications.
Streaming Computational Model


Sequential access to the input stream
Order of data elements in the stream is not controlled by the
algorithm and may be adversarial.
[Figure: the working space and the stream]

Algorithms may perform pre- or post-processing without
access to the data stream.
Features of Streaming Algorithms

Small working space compared to the stream
length n
Polylog n
n
Small number of passes over the stream
One pass
Constant number of passes
Fast per-data-element processing time
Sliding-Window Model

A variation of streaming

Data stream is a time series and may be infinite.

Consider the n most recent data elements.

As time progresses, new data elements arrive,
and old data elements expire.

The deletion of old data elements is implicit.
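
The window semantics can be sketched in Python as follows. This is only an illustration of implicit expiry: it stores the whole window, so it is not a small-space algorithm, and the class name `SlidingWindow` is hypothetical.

```python
from collections import deque


class SlidingWindow:
    """Keep the n most recent elements of a possibly infinite stream."""

    def __init__(self, n):
        # deque(maxlen=n) drops the oldest element automatically on append,
        # which models the implicit deletion of expired elements.
        self.items = deque(maxlen=n)

    def push(self, x):
        self.items.append(x)

    def contents(self):
        return list(self.items)


window = SlidingWindow(3)
for x in [5, 1, 4, 2, 8]:
    window.push(x)
print(window.contents())  # the three most recent elements: [4, 2, 8]
```

No explicit delete is ever issued; an element expires simply because a newer one arrived.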
Why Streaming ?

Data streams occur in real systems.


IP-traffic flow
Need to distinguish the working space from the
data storage.



Storage devices: large capacity but slow access
Working space: small capacity but fast random access
We want to restrict random access to the mass
storage but still see every element of the input set at
least once.
Earlier Work on Streaming

Despite the restrictions of the model, a lot can
be done, e.g.:





L_p norms [FKSV02, Indyk00]
histograms [GKS01]
clustering [GMMO00]
Much of the work focuses on computing
statistics.
Often the working-space size is restricted to
polylog space.
Talk Outline

Streaming computational model

Overview of results

Approximate graph distances in the streaming
model

Future research directions
Dissertation Contributions

Investigate important problem domains.



Computational geometry problems
Graph problems
Show the importance of a more relaxed model.


Sublinear space instead of polylog space
Multiple passes
There are problems that are provably hard in the
restricted model but feasible in the more relaxed model.
Results on Geometric Problems (1)
[ Feigenbaum-S. Kannan-Zhang ]

Exact computation is hard using sublinear
space.
Computing the exact Diameter, Closest Pair, or
Convex Hull requires Ω(n) bits of space, where n is
the number of points in the stream.

Approximation is feasible.
We give a one-pass, ε-approximation, streaming
algorithm for diameter. The algorithm needs storage
for O(1/ε) points and processes each point in
O(log(1/ε)) time.
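
To make the idea of one-pass approximation concrete, here is a minimal Python sketch of a simpler, standard technique, not the algorithm from the talk: track the extreme points along k fixed directions in the plane. It stores 2k points and spends O(k) time per point (not the O(log(1/ε)) quoted above). The returned estimate D' satisfies cos(π/(2k))·D ≤ D' ≤ D for the true diameter D, so k = O(1/√ε) directions suffice for a (1−ε) guarantee. The class name `DirectionalDiameter` is hypothetical.

```python
import math


class DirectionalDiameter:
    """One-pass diameter estimate for points in the plane."""

    def __init__(self, k):
        # k unit directions spread evenly over the half-circle [0, pi).
        self.dirs = [(math.cos(math.pi * i / k), math.sin(math.pi * i / k))
                     for i in range(k)]
        # Per direction: (projection, point) of the lowest and highest point.
        self.lo = [None] * k
        self.hi = [None] * k

    def add(self, p):
        for i, (dx, dy) in enumerate(self.dirs):
            s = p[0] * dx + p[1] * dy
            if self.lo[i] is None or s < self.lo[i][0]:
                self.lo[i] = (s, p)
            if self.hi[i] is None or s > self.hi[i][0]:
                self.hi[i] = (s, p)

    def estimate(self):
        # Largest distance between the two extreme points of any direction.
        best = 0.0
        for lo, hi in zip(self.lo, self.hi):
            if lo is not None:
                best = max(best, math.dist(lo[1], hi[1]))
        return best


pts = [(0, 0), (3, 4), (1, 1), (2, -1)]
sketch = DirectionalDiameter(k=8)
for p in pts:
    sketch.add(p)
est = sketch.estimate()
print(est)
```

The estimate never exceeds the true diameter because it is always a distance between two points that actually appeared in the stream.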
Results on Geometric Problems (2)

We give an ε-approximation algorithm that maintains the
diameter in the sliding-window model.

The algorithm uses O((1/ε)·log^3 n·log R) bits of space,
where R is the largest diameter attained in any window.
The amortized processing time for each point is O(log n).

We show that Ω((1/ε)·log n·log R) bits of space are required for
such an approximation.
Graph Stream

Consider an undirected graph G = (V, E), with
V = {v_1, v_2, …, v_n} and E = {e_1, e_2, …, e_m}.

A graph stream is a sequence of edges in E.
[Figure: a graph on vertices 1 to 5 and its edge stream]
(4,5) (2,3) (1,3) (3,5) (1,2) (2,4) (1,5) (3,4)

Edges arrive in arbitrary order in the stream.

More general than adjacency matrices or adjacency lists
Results on Graph Problems (1)
[ Feigenbaum-S. Kannan-McGregor-Suri-Zhang ]

Many problems require Ω(n) bits of space.
Graph distances (even approximation), Connectivity testing,
Planarity testing …

Consider streaming algorithms that use O(n·polylog n)
space and O(1) passes. In such a model, we can
compute or approximate:


Spanning trees
Graph distances
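
As an illustration of what fits in this model, a spanning forest can be built in one pass with a union-find structure that keeps O(n) words, i.e., O(n log n) bits. The sketch below is a standard construction written for this note, not code from the cited paper; the function name `spanning_forest` and the 1..n vertex numbering are assumptions.

```python
def spanning_forest(n, edge_stream):
    """One pass over the edges; keeps only a parent array and the forest."""
    parent = list(range(n + 1))  # vertices are numbered 1..n

    def find(x):
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    forest = []
    for u, v in edge_stream:
        ru, rv = find(u), find(v)
        if ru != rv:
            # The edge joins two components, so it belongs to the forest.
            parent[ru] = rv
            forest.append((u, v))
    return forest


stream = [(4, 5), (2, 3), (1, 3), (3, 5), (1, 2), (2, 4), (1, 5), (3, 4)]
forest = spanning_forest(5, stream)
print(forest)  # [(4, 5), (2, 3), (1, 3), (3, 5)]
```

Each edge is examined once and then either kept or forgotten, so the order in which edges arrive affects which forest is returned but not its size.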
Results on Graph Problems (2)
[ Elkin-Zhang ]
We give a randomized streaming algorithm that
approximates graph distances:





(1+ε, β)-approximation: Our algorithm outputs {δ(u,v)} s.t.
δ(u,v) ≤ (1+ε)·dist_G(u,v) + β, where dist_G(u,v) is the true
distance between vertices u and v.
The algorithm uses O(n^{1+1/k}) space.
Processing time per edge is O(n^{1/k}).
Needs multiple passes.
1/k and ε are arbitrarily small parameters. β and the
number of passes are functions of k and 1/ε.
Results on Graph Problems (3)
[ Feigenbaum-S. Kannan-McGregor-Suri-Zhang ]

We give a one-pass, streaming algorithm for approximating graph
distances.




(2t+1)-approximation: the output δ(u,v) satisfies δ(u,v) ≤ (2t+1)·dist_G(u,v)
O(t·n^{1+1/t}·log n) space
Processing time per edge: O(t^2·n^{1/t}·log n)
Needs one pass.
For t = log n, this gives a one-pass, O(log n)-approximation algorithm using
n·polylog n space and polylog n time per edge.

Lower bound: The space complexity of one-pass, t-approximation is
Ω(n^{1+1/t}).
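
The t = log n case can be checked by direct substitution into the bounds on this slide. The calculation below assumes the logarithm is base 2; the slides do not state the base, and another base changes only the constants.

```latex
n^{1/t} = n^{1/\log_2 n} = 2
\quad\Longrightarrow\quad
\begin{aligned}
\text{space: } & O\!\left(t \cdot n^{1+1/t} \cdot \log n\right)
  = O\!\left(\log n \cdot 2n \cdot \log n\right) = O\!\left(n \log^2 n\right),\\
\text{time per edge: } & O\!\left(t^2 \cdot n^{1/t} \cdot \log n\right)
  = O\!\left(\log^3 n\right),\\
\text{approximation factor: } & 2t + 1 = 2\log_2 n + 1 = O(\log n).
\end{aligned}
```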
Publications

J. Feigenbaum, S. Kannan, and J. Zhang, “Computing Diameter
in the Streaming and Sliding-Window Models,” Algorithmica 41
(2005), pp. 25-41

J. Feigenbaum, S. Kannan, A. McGregor, S. Suri, and J. Zhang,
“On Graph Problems in a Semi-Streaming Model,” ICALP 2004,
pp. 531-543. Journal version to appear in Theoretical Computer
Science.

M. Elkin and J. Zhang, “Efficient Algorithms for Constructing
(1+ε,β)-Spanners in the Distributed and Streaming Models,”
PODC 2004, pp. 160-168

J. Feigenbaum, S. Kannan, A. McGregor, S. Suri, and J. Zhang,
“Graph Distances in the Streaming Model: The Value of Space,”
SODA 2005, pp. 745-754
Other Results in Thesis

Streaming-space requirement can be reduced by
annotating the stream.
J. Feigenbaum, S. Kannan, and J. Zhang, “Annotation and Computational
Geometry in the Streaming Model,” Yale University Technical Report
YALEU/DCS/TR-1249, 2003

Using streaming algorithms to detect BGP-update
anomalies.
J. Zhang, J. Rexford, and J. Feigenbaum, “Learning-Based Anomaly
Detection in BGP Updates,” to appear in SIGCOMM Workshop on Mining
Network Data 2005
Talk Outline

Streaming computational model

Overview of results

Approximate graph distances in the streaming
model

Future research directions
Shortest-Path Distances

Distance is the length of the shortest path.

Fundamental problem in graph theory

Many algorithms and approximations

Most of them use BFS-like subroutines, which
are hard to adapt to the streaming model.
The “Sketch” Approach




A two-stage approach
First stage: While going through the stream,
construct a small sketch of the input graph.
Second stage: Compute the distance using the
sketch, without further access to the stream.
Perform BFS-like computations in the second
stage.
Graph Spanners as Sketches

Edge subgraph H of a graph G, s.t., for any pair of vertices
u and v, their distance in H, dist_H(u,v), is not far from their
distance in G, dist_G(u,v).

Multiplicative spanner [t-spanner]:
dist_H(u,v) ≤ t·dist_G(u,v).
Spanners are sparse. A t-spanner has O(n^{1+1/t}) edges.


Reduce streaming graph distance to streaming spanner
construction.

BFS-like subroutines are used in most existing spanner
constructions.
Streaming Spanner Construction




For each incoming edge, decide whether it should be in
the spanner.
If the edge causes a cycle of length ≤ t, do not put the
edge in the spanner.
This gives a t-spanner, because there is a path P of
length < t connecting the two endpoints of any discarded
edge.
This spanner is sparse.
Thm [Bollobás78]: A graph whose girth is larger than k can only
have O(n^{1+2/(k-1)}) edges.

Need to know: For an incoming edge, does the path P
exist?
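
The rule above can be sketched naively in Python for unweighted graphs by answering the "does P exist?" question with a depth-bounded BFS over the spanner built so far. This is a baseline for illustration, not the algorithm from the talk: the BFS per edge is exactly the cost that the cluster-based approach is meant to avoid. The function names `within_hops` and `greedy_spanner` are hypothetical.

```python
from collections import deque


def within_hops(adj, source, target, max_hops):
    """Bounded BFS: is target reachable from source in at most max_hops edges?"""
    if source == target:
        return True
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in adj.get(node, ()):
            if nxt == target:
                return True
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return False


def greedy_spanner(edge_stream, t):
    adj = {}
    kept = []
    for u, v in edge_stream:
        # Adding (u, v) closes a cycle of length <= t exactly when the
        # spanner already has a u-v path with at most t - 1 edges.
        if within_hops(adj, u, v, t - 1):
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
        kept.append((u, v))
    return kept


stream = [(4, 5), (2, 3), (1, 3), (3, 5), (1, 2), (2, 4), (1, 5), (3, 4)]
spanner = greedy_spanner(stream, t=3)
print(spanner)  # [(4, 5), (2, 3), (1, 3), (3, 5), (2, 4)]
```

With t = 3, the edges (1,2), (1,5), and (3,4) are discarded because each already has a two-edge path between its endpoints in the spanner.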
Partial Solution: Clusters (1)


A cluster is a subset of vertices and a small-diameter
spanning tree built on these vertices.
[Figure: a cluster, with an intra-cluster edge marked]
Partial Solution: Clusters (2)

[Figure: inter-cluster edges]
Bollobás’s result no longer applies. Need to control the
number of clusters (i.e., make it … ).
Summary of the One-Pass Algorithm


Use a vertex-labeling scheme to construct the clusters.
Structure of the algorithm:


In the pre-processing phase, generate a multi-level set of labels.
Go through the stream; for each edge:
According to the current assignment of labels to vertices,
decide whether to put this edge in the spanner.
Depending on the type of edge, possibly assign more labels
to one of its endpoints.
Next, an example with t = log n
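
The pre-processing phase might look like the following Python sketch. The promotion rule is an assumption, not something the slides state: each level-i label (i, b) is promoted to a level-(i+1) label (i+1, b) independently with probability `p`, and `p = 0.5` is chosen only for illustration. The function name `generate_labels` and its parameters are hypothetical.

```python
import random


def generate_labels(n, num_levels, p=0.5, seed=None):
    """Return a list of label sets, one per level, built before the stream."""
    rng = random.Random(seed)
    # Level 0 has one label per vertex: (0, 1), (0, 2), ..., (0, n).
    levels = [[(0, i) for i in range(1, n + 1)]]
    for level in range(1, num_levels + 1):
        # Assumed rule: promote each label of the previous level with
        # probability p, keeping its second component.
        promoted = [(level, base) for (_, base) in levels[-1]
                    if rng.random() < p]
        levels.append(promoted)
    return levels


levels = generate_labels(n=12, num_levels=2, seed=0)
for i, labels in enumerate(levels):
    print("Level", i, labels)
```

Because this step uses only n and random bits, it needs no access to the stream, which matches the model's allowance for pre-processing.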
Labels
[Figure: the label hierarchy]
Level 2: (2,2) (2,7)
Level 1: (1,2) (1,4) (1,7) (1,11)
Level 0: (0,1) (0,2) (0,3) (0,4) (0,5) (0,6) (0,7) (0,8) (0,9) (0,10) (0,11) (0,12)

log n/2 levels
w.h.p., there are … top-level labels.
Semantics of labels:
The set of vertices assigned the same top-level label forms
a cluster.
The set of vertices assigned the same lower-level label
forms a “pre-cluster.”
Initial Label Assignment
[Figure: the label hierarchy (levels 0 to 2) drawn above the vertices v1, v2, …, v12]
On arrival of an edge

Already know what to do with:



Intra-cluster/pre-cluster edges
Inter-cluster edges
Edges connecting pre-clusters: the sticky edges


They are added to the spanner.
They may lead to new label assignment and cluster
growth.
“Good” Neighbor (1)
[Figure: vertex v and its neighbor u, with labels (3,2), (3,2), (2,2), (2,2), (1,6), (1,2), (0,6), (0,2); annotation: “Has marked labels”]
“Good” Neighbor (2)
[Figure: vertices v and u with the clusters C(3,2), C(2,2), C(1,6), C(1,2)]
“Bad” Neighbor
[Figure: vertex v and its neighbor u, with labels (1,6) and (3,2); annotation: “No marked labels”]
Properties of the Clusters

Small diameter

Number of clusters bounded by …

Do not need to cover the whole graph with clusters,
but the uncovered subgraph is sparse.
The uncovered subgraph consists of sticky edges,
and there are not too many of them.
Sticky Edges are Rare
[Figure: vertex v and its neighbors u1, u2, u3, u4]
Sequence of neighbors of v: u1, u2, u3, u4, …

A neighbor is good with probability at least ½.

After seeing at most log n/2 good neighbors, v will be assigned a top-level label and be included in a cluster. No more sticky edges for v.
The number of sticky edges can be bounded by the length of the
shortest prefix in the above sequence that contains log n/2 good
neighbors.
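
A short calculation makes this bound concrete. It assumes, beyond what the slide states, that neighbors are good independently of one another, each with probability p ≥ 1/2. Here X is the length of the shortest prefix containing r = (log n)/2 good neighbors; X, r, and p are names introduced for this note.

```latex
\mathbb{E}[X] \;\le\; \frac{r}{1/2} \;=\; 2r \;=\; \log n,
\qquad\text{where } r = \tfrac{1}{2}\log n \text{ and } p \ge \tfrac{1}{2}.
```

Under this assumption, v is incident to at most log n sticky edges in expectation.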

Talk Outline

Streaming computational model

Overview of results

Approximate graph distances in the streaming
model

Future research directions
Summary




We investigated two important problem domains.
Exact computation is hard; approximation may
be feasible.
For some problems, particularly graph problems,
considering a more general model is important,
because polylog space is too restrictive.
Constructing a sketch of non-numerical input is
an important tool in streaming-algorithm design.
Future Research Directions

Geometric problems:



High-dimensional geometric problems
Sliding-window with flexible size
Graph problems:

Dynamic graph problems