Estimation, Prediction, and Classification over Large Alphabets
A. Orlitsky, UCSD
with J. Acharya, P. Santhanam, H. Das, K. Viswanathan, S. Pan, J. Zhang
Monday, June 21, 2010

Two Probability Estimation Problems
  Probability of each element
    P(Slovenia) = .3, P(U.S.) = .7
    P(Black) = .3, P(Blue) = .2, P(Brown) = .4, P(Green) = .1
  Probability multiset
    {.7, .3}
    {.4, .3, .2, .1}
  Estimate using sample
  Patterns

Patterns
  Replace each symbol by its order of appearance
  Sequence: US, Slovenia, US, US, Slovenia
  Pattern:  1, 2, 1, 1, 2
  Sequence: Brown, Blue, Brown, Green, Brown, Black, Blue
  Pattern:  1, 2, 1, 3, 1, 4, 2
  b, g, b → 1, 2, 1
  [symbols not recovered] → 1, 2, 3, 2
  Capture structure and frequencies, ignore symbols

Applications
  Probability multiset {.5, .3, .1, .05, .05}
  Who cares?
    Population: species, gene mutations, sensors
    Coverage: experiments, virus strains
    Robustness: capacity, memory, buffer
    Security: unlikely events
    Information: entropy
    Statistics
    Everyone

Martianese
  Martian UFO has landed
  First message: @#@#@@#...@#@ (60 @'s, 40 #'s)
  Statisticians analyze Martianese
  Assuming independence they determine
    2 symbols: one has probability .6, the other .4
    → {.6, .4}
  Methodology: empirical frequency or maximum likelihood

Maximum likelihood
  Distribution maximizing P(x)
  x = @#@#@@#...@#@ (60 @, 40 #)
  p(@) = p, p(#) = 1-p
  $P(x) = p^{60}(1-p)^{40}$ (independence)
  $P'(x) = 60p^{59}(1-p)^{40} - 40p^{60}(1-p)^{39} = p^{59}(1-p)^{39}\,[60(1-p) - 40p] = 0$
  Maximized by p(@) = p = .6, p(#) = .4 → {.6, .4}
  Agrees with empirical frequency, always
  20 @, 50 #, 30 & → {.2, .5, .3} = {.5, .3, .2}
  Works well for small alphabets, poorly for large

Marchitianese
  講座非常悶。。。不過請報以大笑 ("The lecture is very boring... but please respond with loud laughter")
  100 symbols, all distinct
  Sequence maximum likelihood
    Alphabet size? 100
    Each symbol probability? 1/100
    Poor model, better: large, possibly ∞ alphabet
  Observation: Values don't matter, consider pattern
  Pattern maximum likelihood
    Pattern: 1 2 3 4 . . . 98 99 100 (all different)
    Maximized by: ?
    Why? Smaller α/β? Different!
    Better for very large α/β

Sequences v. patterns
  Sequence independently, identically distributed (iid)
  Induces probability on patterns
  ψ: pattern, e.g. 121
  P(ψ) = P(observing ψ) = P(all outcomes with pattern ψ)
  Example: Alphabet {a, b, c}, p(a) = p(b) = p(c) = 1/3
    1 sample:  P(1) = 1
    2 samples: P(11) = P(aa, bb, cc) = 3 · 1/9 = 1/3 = P(second = first)
               P(12) = P(ab, ac, ba, bc, ca, cb) = 6 · 1/9 = 2/3 = P(second ≠ first)
  Pattern not distributed iid → Different math

Pattern maximum likelihood
  Given a pattern, e.g. 1213
  Find highest probability P̂, and achieving distribution
  That's it!
  Length 1: P̂(1) = 1, any distribution
  Length 2: P̂(11) = 1, constant distributions
            P̂(12) = 1, large distributions
  Length 3: P̂(111) = 1, P̂(123) = 1
            P̂(112) = P̂(121) = P̂(122) = ?
  Flip 3 coins: h, h, t
    Sequence ML → {2/3, 1/3}
    Pattern ML → ?
  P̂(112) = 1/4
  ≥: Coin, P(h) = P(t) = .5
     P(112) = P(hht ∨ tth) = P(hht) + P(tth) = 1/8 + 1/8 = 1/4
  ≤: $P(112) = \sum_x p^2(x)\,(1-p(x)) = \sum_x p(x)\,[\,p(x)(1-p(x))\,] \le \sum_x p(x)\cdot\tfrac{1}{4} = \tfrac{1}{4}$
  h, h, t: Sequence ML: {2/3, 1/3}, Pattern ML: {.5, .5}
  Which more logical?
    Not p(h) & p(t), just multiset
    3 flips, can't get 1.5 each → 2 & 1 closest to uniform
    {2/3, 1/3} & others → 3 same more likely → 2 & 1 less likely

Theoretical Results
  Skip

Experiments

Uniform
  500 symbols, 350 samples
  2x6, 3x4, 13x3, 63x2, 161x1, 258x0
  242 appeared, 258 did not
  [Figure: Underlying, SML (sequence ML), and PML (pattern ML) distributions; vertical axis 10^-4 to 10^-2, horizontal axis 200 to 1000]

U[500], 350x, 12 experiments
  [Figure: U[500], 350 samples]

Uniform
  500 symbols, 700 samples
  3x6, 2x5, 19x4, 62x3, 114x2, 182x1, 118x0
  [Figure: Underlying, SML, and PML distributions; vertical axis 10^-4 to 10^-3, horizontal axis 200 to 1000]

U[500], 700x, 12 experiments
  [Figure: U[500], 700 samples]

Staircase
  15K elements, 5 steps, ~3x
  30K samples
  Observe 8,882 elts, 6,118 missing
  1x66, 2x62, 2x60, ..., 1990x2, 4323x1, 6118x0
  [Figure: Underlying, SML, and PML distributions; vertical axis 10^-7 to 10^-3, horizontal axis 2000 to 14000]

Zipf
  Underlies many natural phenomena
  p_i = C/i, i = 100…15,000
  30,000 samples
  Observe 9,047 elts, 5,953 missing
  1x71, 1x64, 1x61, ..., 1991x2, 4236x1, 5953x0
  [Figure: Underlying, SML, and PML distributions; vertical axis 10^-7 to 10^-3, horizontal axis 2000 to 14000]
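
The experiment slides summarize each sample by a profile such as 2x6, 3x4, 13x3, 63x2, 161x1, 258x0, meaning 2 symbols appeared 6 times, 3 symbols appeared 4 times, and so on, with 258 symbols never seen. A minimal Python sketch of how such a profile could be computed from a sample follows. The function names `profile` and `format_profile` and the random seed are illustrative assumptions, not taken from the talk, and the drawn sample will not reproduce the exact counts on the slides.

```python
import random
from collections import Counter


def profile(sample):
    """Map each multiplicity to the number of distinct symbols that
    appeared exactly that many times, largest multiplicity first."""
    symbol_counts = Counter(sample)              # symbol -> times seen
    by_multiplicity = Counter(symbol_counts.values())
    return dict(sorted(by_multiplicity.items(), reverse=True))


def format_profile(sample, alphabet_size=None):
    """Render a profile in the slides' 'NxM' notation (N symbols, M times each).

    If alphabet_size is given, also report how many symbols were never seen
    as 'Kx0'. The parameter name is a hypothetical choice for this sketch.
    """
    parts = [f"{n}x{m}" for m, n in profile(sample).items()]
    if alphabet_size is not None:
        unseen = alphabet_size - len(set(sample))
        parts.append(f"{unseen}x0")
    return ", ".join(parts)


if __name__ == "__main__":
    # Assumed setup mirroring the Uniform slide: 500 symbols, 350 samples.
    rng = random.Random(0)
    sample = rng.choices(range(500), k=350)
    print(format_profile(sample, alphabet_size=500))
```

The profile discards which symbol had which count, so it carries the same information about the probability multiset that the pattern does.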
1990 Census - Last names
  Name        Frequency (%)  Cumulative (%)  Rank
  SMITH       1.006           1.006          1
  JOHNSON     0.810           1.816          2
  WILLIAMS    0.699           2.515          3
  JONES       0.621           3.136          4
  BROWN       0.621           3.757          5
  DAVIS       0.480           4.237          6
  MILLER      0.424           4.660          7
  WILSON      0.339           5.000          8
  MOORE       0.312           5.312          9
  TAYLOR      0.311           5.623          10
  ...
  AMEND       0.001          77.478          18835
  ALPHIN      0.001          77.478          18836
  ALLBRIGHT   0.001          77.479          18837
  AIKIN       0.001          77.479          18838
  ACRES       0.001          77.480          18839
  ZUPAN       0.000          77.480          18840
  ZUCHOWSKI   0.000          77.481          18841
  ZEOLLA      0.000          77.481          18842
  18,839 names, 77.48% of population, ~230 million

1990 Census - Last names
  18,839 last names based on ~230 million
  35,000 samples, observed 9,813 names
  1x466, 1x343, 1x296, ..., 2067x2, 5108x1, 9026x0
  [Figure: Underlying, SML, and PML distributions; vertical axis 10^-6 to 10^-2, horizontal axis 0.5 x 10^4 to 2 x 10^4]

Population estimation
  Malayan butterflies [Fisher Corbet Williams 43]
    Maximum likelihood doesn't work
    Poisson distribution, Gamma prior
  New species [Good Toulmin 56]
  Mutations, Strains
  Sensors
  Many others

Shakespeare
  899,306 words, 27,689 distinct (half appear once)
  [76]: In equal size: 11,430 new words
  [85, Taylor]: "Shall I die?" 429 words, 9 new: joying, scanty, speck, ...
  [87]: Expect 6.93 new
  Pattern maximum likelihood: 6.98 new

# new symbols
  Zipf distribution over 15K elements
  Sample 30K times
  Estimate: # new symbols in sample of size λ · 30K
  Good-Toulmin: λ < 1
  Estimate PML & predict
    Extends to λ > 1
    Applies to other predictions
  [Figure: Expected # new symbols (0 to 6000) versus λ (0 to 4); curves: Underlying, Good-Toulmin, PML; C/i, i∈[100,...,15099], 30000 samples, profile 1x71, 1x64, 1x61, ..., 1991x2, 4236x1, 5953x0]

Probability of Specific Elements

Can Patterns Work?
  US, Slovenia, US, US → 1, 2, 1, 1
  Lost connection to original?
  No: 1 - US, 2 - Slovenia
  After 1, 2, 1, 1, estimate probabilities of 1, 2, 3
  Probabilities of observed symbols & NEW

Text classification
  Train: Documents pre-classified into topics
    Business, technology, politics, sports, ...
  Test: New document
    Find topic
  Existing methods: Many
    Nearest Neighbor, Boosting, ...
    Support vector machines (SVM)
  Current approach
    Plain probability estimation

Comparison
  Methods: Nearest neighbor, Laplace, SVM, Patterns
  Data sets: Newsgroups, Reuters R52, CADE
  Platform: Rainbow (CMU, UMass)

  Method     R52    Newsgroups  CADE
  Laplace    88.43  80.92       57.27
  Nrst nbr   83.22  75.93       51.20
  SVM        93.57  82.84       52.84
  Patterns   91.98  82.92       59.24

Thank You!
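
As a closing recap, the two definitions the talk builds on (the pattern of a sequence, and the probability an iid distribution induces on a pattern) can be sketched in a few lines of Python. This is a brute-force illustration under stated assumptions, not the authors' algorithm: the names `pattern` and `pattern_probability` are hypothetical, and the enumeration over all sequences is only feasible for tiny alphabets and short patterns.

```python
from fractions import Fraction
from itertools import product


def pattern(seq):
    """Replace each symbol by its order of first appearance (1-based)."""
    first_seen = {}
    out = []
    for symbol in seq:
        if symbol not in first_seen:
            first_seen[symbol] = len(first_seen) + 1
        out.append(first_seen[symbol])
    return tuple(out)


def pattern_probability(psi, dist):
    """P(psi) = total probability of all iid sequences whose pattern is psi.

    dist maps symbols to Fraction probabilities. Brute force: enumerates
    every sequence of length len(psi) over the support of dist.
    """
    psi = tuple(psi)
    total = Fraction(0)
    for seq in product(list(dist), repeat=len(psi)):
        if pattern(seq) == psi:
            prob = Fraction(1)
            for symbol in seq:
                prob *= dist[symbol]
            total += prob
    return total


if __name__ == "__main__":
    print(pattern(["US", "Slovenia", "US", "US", "Slovenia"]))  # (1, 2, 1, 1, 2)

    uniform3 = {s: Fraction(1, 3) for s in "abc"}
    print(pattern_probability((1, 1), uniform3))  # 1/3
    print(pattern_probability((1, 2), uniform3))  # 2/3

    fair = {"h": Fraction(1, 2), "t": Fraction(1, 2)}
    biased = {"h": Fraction(2, 3), "t": Fraction(1, 3)}
    print(pattern_probability((1, 1, 2), fair))    # 1/4
    print(pattern_probability((1, 1, 2), biased))  # 2/9
```

The last two lines reproduce the coin example: for the pattern 112 the fair coin {.5, .5} attains the bound 1/4, while the sequence-ML multiset {2/3, 1/3} gives only 2/9.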