Inferring Mixtures of Markov Chains
Tuğkan Batu, Sudipto Guha, Sampath Kannan
University of Pennsylvania

An Example: Browsing habits
• You read sports (S) and cartoons (C). You are equally likely to read either, and you do not remember what you read last.
• You'd expect a "random" sequence: SCSSCSSCSSCCSCCCSSSSCSC…

Suppose there were two
• I like health (H) and entertainment (E).
• I always read the entertainment page first and then the health page.
• The sequence would be EHEHEHEHEHEHEH…

Two readers, one log file
• If there is one log file…
• Assume there is no correlation between us.
SECHSSECSHESCSSHCCESCHCCSESHESSHECSHCE…
Is there enough information to tell that there are two people browsing? What are they browsing? How are they browsing?

Clues in stream?
• Yes, somewhat.
SECHSSECSHESCSSHCCESCHCCSESHESSHECSHCE
• H and E have a special relationship.
• They cannot belong to different (uncorrelated) people.
• Not clear about S and C.
Suppose there were 3 uncorrelated persons…

Markov Chains as Stochastic Sources
[Figure: a seven-state Markov chain (states 1 to 7) with transition probabilities on its edges.]
Output sequence: 1 4 7 7 1 2 5 7 ...

Markov chains on S, E, C, H
Modeled by…
[Figure: two chains. On S and C, every transition has probability 1/2. On E and H, E goes to H and H goes to E with probability 1.]
Their interleaving cannot be Markovian.

Another example
• Consider network traffic logs…
• Malicious attacks were made.
• Can you tell apart the pattern of attack from the log?
• Intrusion detection, log validation, etc…

Yet another example
• Consider a genome sequence.
• Each genome sequence has "coding" regions and "non-coding" regions.
  – (Separate) Markov chains (usually higher order) are used to model these two regions.
• Can we predict anything about such regions?

The origins of the problem
• Two or more probabilistic processes.
• We are observing interleaved behavior.
• We do not know which state belongs to which process: a cold start.

The Problem
MC1: ... 1 3 2 5 1 4 ...
MC2: ... 2 6 7 3 1
Observe: ... 2 6 1 3 2 7 5 3 1 4 1 ...
Infer: MC1 & MC2

How About ?
MC1: ... 1 3 2 5 1 4 ...
MC2: ... 2 6 7 3 1
[Figure: the outputs of MC1 and MC2 feed into a gate function.]
How powerful is this function? Clearly a powerful function can produce arbitrary sequences…

Power of the Gate function
• A powerful gate function can encode powerful models: hidden or hierarchical Markov models…
• Assume a simple (k-way) coin flip for now.

Streaming Model(s)
... 10111010000110100111010010101101100111011100001101001010010... → Processor
• Processor memory is small (polylog?) compared to input size.
• One or more passes, but data is read left-to-right in each pass.
• Input order is adversarial or "natural".
For our problem we assume:
• The stream is polynomially long in the number of states n of each Markov chain (perhaps a stream of length O(n^6) is needed).
• Nonzero probabilities are bounded away from 0.
• Space available is some small polynomial in the number of states.

Related Work
• [Freund & Ron] considered the gate function to be a "special" Markov chain and the individual processes to be distributions.
• Mixture analysis [Duda & Hart]
• Mixtures of Bayesian networks, DAG models [Thiesson et al.]
• Mixtures of Gaussians [Dasgupta, Arora & Kannan]
• Complexity of learning HMMs [Abe & Warmuth]
• Hierarchical Markov models [Kervrann & Heitz]

The old example
SECHSSECSHEHSECSSHCCESCHCCSESHESSHECSH
• No "HH".
• No "HSH" but "HEH".
• The logic: if E were in a different chain, then we should also see "HH".

A few definitions
• T[u]: probability of seeing ……u…… in the stream
• T[uv]: probability of seeing ……uv…… in the stream
• T[uv]/T[u] = probability of v after u
• S(u): stationary probability of u (in its chain)
• a_u: mixing probability of the chain of u
Remark. We have approximations to T and S.

Assumption
Assume that the stream is generated by Markov chains (number unknown to us) that have disjoint state spaces.
Remark. Once we figure out the state spaces, the rest is simple.

Inference Idea 1
• Warm-up: if T[uv] = 0, then u and v are in the same chain.
• Idea: if u, v are in different chains, v will follow u with frequency a_v S(v).
Lemma. If T[uv] ≠ T[u] T[v], then u, v are in the same chain.
Proof. If u, v are in different chains, then T[uv] = T[u] a_v S(v) = T[u] T[v], since T[v] = a_v S(v).
• So, in the first phase, we grow components based on this rule.
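The Phase 1 rule can be tried on the browsing example. The following is a minimal sketch, not the authors' implementation: it simulates the two readers with a fair coin flip as the gate, estimates T[u] and T[uv] by counting, and merges u and v whenever the estimated T[uv] differs from T[u] T[v] by more than a fixed tolerance. The function names, the tolerance `tol`, and the stream length are assumptions made for illustration; the slides only state that approximations to T are available.

```python
import random
from collections import Counter
from itertools import product


def interleaved_stream(n, seed=0):
    """Simulate the two readers sharing one log (hypothetical helper).

    Reader 1 picks S or C uniformly at random; reader 2 alternates E, H.
    A fair coin (the gate) decides which reader emits the next symbol.
    """
    rng = random.Random(seed)
    next_eh = "E"
    out = []
    for _ in range(n):
        if rng.random() < 0.5:
            out.append(rng.choice("SC"))
        else:
            out.append(next_eh)
            next_eh = "H" if next_eh == "E" else "E"
    return out


def phase1_components(stream, tol=0.02):
    """Group states u, v whenever estimated T[uv] differs from T[u]*T[v].

    `tol` is an assumed slack for sampling noise, not a value from the slides.
    """
    n = len(stream)
    single = Counter(stream)
    pairs = Counter(zip(stream, stream[1:]))
    t1 = {u: c / n for u, c in single.items()}

    # Union-find over the observed states.
    parent = {u: u for u in t1}

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    for u, v in product(t1, repeat=2):
        t2 = pairs[(u, v)] / (n - 1)
        if abs(t2 - t1[u] * t1[v]) > tol:
            parent[find(u)] = find(v)

    groups = {}
    for u in t1:
        groups.setdefault(find(u), []).append(u)
    return sorted(sorted(g) for g in groups.values())


if __name__ == "__main__":
    print(phase1_components(interleaved_stream(200_000)))
```

On this simulated stream the sketch should place E and H in one component and leave S and C as singletons, because T[EH] is about 1/8 while T[E] T[H] is about 1/16, whereas every pair involving S or C has T[uv] close to T[u] T[v].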
What do we have after Idea 1?
• If we have not "resolved" u & v, then T[uv] = T[u] T[v].
• Either u, v are in different chains, or M_uv = S(v) (where M_uv is the transition probability from u to v in their chain), so that T[uv] = T[u] a_v M_uv = T[u] a_v S(v) = T[u] T[v].

End of Phase 1
• We have a set of components of vertices.
• But further collapsing is possible.
[Figure: the S, C, H, E example after Phase 1.]

Inference Idea 2
• Consider u, v already in the same component and z in a separate component. State z is in the same chain if and only if T[uzv] = T[u] T[z] T[v].
Now we can complete collapsing components.

At the end
• Either we will resolve all edges incident to all chains, or we have some singleton components such that for each pair u, v, T[u] T[v] = T[uv]; equivalently, M_uv = S(v).
Hence, the next-state distribution (for any state) is S.

The Old Example
The components of S and C will be left unmerged.
[Figure: the S, C, H, E example, with S and C as separate components.]
This is no bug!

More Precisely
• If we have two competing hypotheses, then the likelihood of observing the string is exactly equal under both hypotheses.
• In other words, we have two competing models which are equivalent.

More General Mixing Processes
• Up to now: i.i.d. coin flips for mixing.
• We can also handle the case where the next chain is chosen depending on the last output (i.e., each state has its own "next-chain" distribution).
  – e.g., Web logs: at some pages you click sooner; at others you read before clicking.

Intersecting State Sets
We need two assumptions:
1. There are two Markov chains.
2. There exists a state w that belongs to exactly one chain such that, for all v, M_wv > S(v) or M_wv = 0.
• Using analogous inference rules and state w as a reference point, we can infer the underlying Markov chains.

Open Questions
• Remove/relax assumptions for intersecting state spaces.
• Hardness results?
• Reduce stream length? Sample more frequently, but lose independence of samples... is there a more sophisticated argument?
• Some form of "hidden" Markov model? Rather than seeing a stream of states, we see a stream of a function of states.
  – Difficulty: identical labels for states.
  – CAUTION: inferring a single hidden Markov model is hard.
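Returning to Inference Idea 2, the triple test can also be checked on the browsing example. This is a minimal sketch under stated assumptions, not the authors' code: the simulated stream, the function names, and the tolerance `tol` are all hypothetical, and the test is only exercised on the case where z lies outside the chain of u and v.

```python
import random
from collections import Counter


def interleaved_stream(n, seed=1):
    """Simulate the two readers sharing one log (hypothetical helper).

    Reader 1 picks S or C uniformly at random; reader 2 alternates E, H.
    A fair coin decides which reader emits the next symbol.
    """
    rng = random.Random(seed)
    next_eh = "E"
    out = []
    for _ in range(n):
        if rng.random() < 0.5:
            out.append(rng.choice("SC"))
        else:
            out.append(next_eh)
            next_eh = "H" if next_eh == "E" else "E"
    return out


def same_chain_by_idea2(stream, u, v, z, tol=0.005):
    """Return True when estimated T[uzv] is within tol of T[u]*T[z]*T[v].

    u and v are assumed to be already in one component; `tol` is an
    assumed slack for sampling noise, not a value from the slides.
    """
    n = len(stream)
    freq = Counter(stream)
    triples = Counter(zip(stream, stream[1:], stream[2:]))
    t_uzv = triples[(u, z, v)] / (n - 2)
    expected = (freq[u] / n) * (freq[z] / n) * (freq[v] / n)
    return abs(t_uzv - expected) <= tol


if __name__ == "__main__":
    stream = interleaved_stream(200_000)
    for z in "SC":
        print(z, same_chain_by_idea2(stream, "E", "H", z))
```

With u = E and v = H, the pattern "ESH" occurs with frequency about 1/32, while T[E] T[S] T[H] is about 1/64, so the test should report that S (and likewise C) is not in the chain of E and H.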