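BURSTY AND HIERARCHICAL STRUCTURE IN STREAMS
Jon Kleinberg
ACM '02

OUTLINE
Introduction
Preliminary
Automaton model
Experiment
Conclusion

INTRODUCTION
Motivation
There is a large body of related work on streams: text mining, topic detection and tracking, and visualization.
Can the "bursts" in a stream be useful?

PRELIMINARY
Consider n + 1 messages that arrive over a period of time of length T.
If the messages were evenly spaced, the gaps between them would have size $\hat{g} = T/n$.
Gaps are modeled with an exponential density function $f(x) = \alpha e^{-\alpha x}$, with rate $\alpha > 0$.

AUTOMATON MODEL
Two-state model
A probabilistic automaton A has two states, $q_0$ and $q_1$, which we can think of as "low" and "high".
$q_0$: $f_0(x) = \alpha_0 e^{-\alpha_0 x}$
$q_1$: $f_1(x) = \alpha_1 e^{-\alpha_1 x}$, with $\alpha_1 > \alpha_0$
A begins in state $q_0$. Before each message is emitted, A changes state with probability $p \in (0, 1)$.
For n + 1 messages with gaps $x = (x_1, x_2, \ldots, x_n)$ and a state sequence $q = (q_{i_1}, \ldots, q_{i_n})$, the gaps have density
$$f_q(x_1, \ldots, x_n) = \prod_{t=1}^{n} f_{i_t}(x_t).$$
Let b denote the number of state transitions in the sequence q, that is, the number of indices $t \in \{0, \ldots, n-1\}$ with $q_{i_t} \neq q_{i_{t+1}}$ (taking $i_0 = 0$, since A begins in $q_0$).
The probability of q given x is
$$\Pr[q \mid x] = \frac{1}{Z} \left(\frac{p}{1-p}\right)^{b} (1-p)^{n} \prod_{t=1}^{n} f_{i_t}(x_t),$$
where Z is the normalizing constant $\sum_{q'} \Pr[q'] \, f_{q'}(x)$.
Taking negative logarithms,
$$-\ln \Pr[q \mid x] = b \ln\left(\frac{1-p}{p}\right) + \sum_{t=1}^{n} -\ln f_{i_t}(x_t) - n \ln(1-p) + \ln Z.$$
The last two terms do not depend on q, so maximizing $\Pr[q \mid x]$ is equivalent to minimizing the cost
$$c(q \mid x) = b \ln\left(\frac{1-p}{p}\right) + \sum_{t=1}^{n} -\ln f_{i_t}(x_t).$$

Infinite-state model
Bursts of greater and greater intensity are associated with gaps smaller and smaller than $\hat{g}$.
State $q_i$ has rate $\alpha_i = \hat{g}^{-1} s^{i}$, where $s > 1$ is a parameter, and density $f_i(x) = \alpha_i e^{-\alpha_i x}$.
For every i and j, there is a cost $\tau(i, j)$ associated with a state transition from $q_i$ to $q_j$.
When $j > i$, moving from $q_i$ to $q_j$ incurs a cost of $(j - i)\gamma \ln n$, where $\gamma > 0$ is a parameter; when $j < i$, the cost is 0.
This automaton, with its associated parameters s and $\gamma$, is denoted $A^{*}_{s,\gamma}$.
The cost of a state sequence q, with $i_0 = 0$, is
$$c(q \mid x) = \sum_{t=0}^{n-1} \tau(i_t, i_{t+1}) + \sum_{t=1}^{n} -\ln f_{i_t}(x_t).$$

Computing a minimum-cost state sequence
The task is to find a state sequence $q = (q_{i_1}, \ldots, q_{i_n})$ in $A^{*}_{s,\gamma}$ that minimizes the cost $c(q \mid x)$. Such a sequence is called optimal.
For a natural number k, keep only the states $q_0, q_1, \ldots, q_{k-1}$ of $A^{*}_{s,\gamma}$ and denote the resulting k-state automaton by $A^{k}_{s,\gamma}$. For k = 2 this is a two-state automaton $A^{2}_{s,\gamma}$.
Let $q^{*} = (q_{l_1}, \ldots, q_{l_n})$ be an optimal state sequence in $A^{k}_{s,\gamma}$.
Let $q = (q_{i_1}, \ldots, q_{i_n})$ be an arbitrary state sequence in $A^{*}_{s,\gamma}$.
The goal is to show that $c(q^{*} \mid x) \le c(q \mid x)$, so that $q^{*}$ is also optimal in $A^{*}_{s,\gamma}$.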
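As an aside before the proof, the following is a minimal sketch of how a minimum-cost state sequence in the k-state automaton might be computed. It uses a standard Viterbi-style dynamic program, which is an assumption here, since the slides do not spell out the algorithm. The function names (`emission_cost`, `transition_cost`, `sequence_cost`, `min_cost_sequence`) are hypothetical, T is taken to be the sum of the gaps, and the default values of s and γ are arbitrary illustrative choices.

```python
import math


def emission_cost(i, gap, g_hat, s):
    """-ln f_i(gap), where f_i has rate alpha_i = s**i / g_hat."""
    alpha = s ** i / g_hat
    return alpha * gap - math.log(alpha)


def transition_cost(i, j, gamma, n):
    """tau(i, j): (j - i) * gamma * ln n when moving up, 0 otherwise."""
    return (j - i) * gamma * math.log(n) if j > i else 0.0


def sequence_cost(states, gaps, s, gamma):
    """Cost c(q|x) of a given state sequence, starting from q_0."""
    n = len(gaps)
    g_hat = sum(gaps) / n  # assumption: T is the sum of the gaps
    total, prev = 0.0, 0
    for i, gap in zip(states, gaps):
        total += transition_cost(prev, i, gamma, n)
        total += emission_cost(i, gap, g_hat, s)
        prev = i
    return total


def min_cost_sequence(gaps, k, s=2.0, gamma=1.0):
    """Return (states, cost) minimizing c(q|x) over states q_0..q_{k-1}."""
    n = len(gaps)
    g_hat = sum(gaps) / n
    # best[j]: minimum cost of a prefix ending in state j.
    # Before the first message the automaton is in q_0.
    best = [0.0] + [math.inf] * (k - 1)
    back = []  # back[t][j]: best predecessor of state j at step t
    for gap in gaps:
        new_best, pointers = [], []
        for j in range(k):
            i_min = min(
                range(k),
                key=lambda i: best[i] + transition_cost(i, j, gamma, n),
            )
            new_best.append(
                best[i_min]
                + transition_cost(i_min, j, gamma, n)
                + emission_cost(j, gap, g_hat, s)
            )
            pointers.append(i_min)
        best = new_best
        back.append(pointers)
    # Trace the optimal sequence backwards from the cheapest final state.
    j = min(range(k), key=lambda i: best[i])
    cost = best[j]
    states = []
    for pointers in reversed(back):
        states.append(j)
        j = pointers[j]
    states.reverse()
    return states, cost
```

For example, `min_cost_sequence([8.0, 1.0, 1.0, 8.0], k=3)` returns a list of four state indices together with its cost. The dynamic program does O(n k^2) work, which is only possible because the state set has been cut down to k states.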
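The proof considers two cases.
If q does not contain any state of index greater than k - 1, then q is itself a state sequence in $A^{k}_{s,\gamma}$, and the inequality follows from the fact that $q^{*}$ is an optimal state sequence in $A^{k}_{s,\gamma}$.
Otherwise, truncate q: let $q' = (q_{i'_1}, \ldots, q_{i'_n})$, where $i'_t = \min(i_t, k-1)$.
Since $q'$ is a state sequence in $A^{k}_{s,\gamma}$, and since $q^{*}$ is an optimal state sequence for this automaton, it follows that
$$c(q^{*} \mid x) \le c(q' \mid x) \le c(q \mid x).$$
The second inequality holds because truncation never increases the transition costs and, provided k is large enough that $\alpha_{k-1}$ is at least the reciprocal of the smallest gap, does not increase the terms $-\ln f_{i_t}(x_t)$ either.

EXPERIMENT

CONCLUSION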