Estimation, Prediction, and Classification over Large Alphabets
A. Orlitsky, UCSD
with J. Acharya, H. Das, S. Pan, P. Santhanam, K. Viswanathan, J. Zhang
Monday, June 21, 2010
Two Probability Estimation Problems
Probability of each element
P(Slovenia) = .3, P(U.S.) = .7
P(Black) = .3, P(Blue) = .2, P(Brown) = .4, P(Green) = .1
Probability multiset
{.7, .3}
{.4, .3, .2, .1}
Estimate using sample
Patterns
Patterns
Replace each symbol by its order of appearance
Sequence: US, Slovenia, US, US, Slovenia
Pattern: 1, 2, 1, 1, 2
Sequence: Brown, Blue, Brown, Green, Brown, Black, Blue
Pattern: 1, 2, 1, 3, 1, 4, 2
b,g,b → 1, 2, 1
→ 1, 2, 3, 2
Capture structure and frequencies, ignore symbols
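The replacement rule above can be sketched in a few lines of Python. This is only an illustration: the function name `pattern` and the list-based input are assumptions, not something the slides define.

```python
def pattern(sequence):
    """Replace each symbol by the order in which it first appeared (1-based)."""
    first_seen = {}  # symbol -> rank of its first appearance
    result = []
    for symbol in sequence:
        if symbol not in first_seen:
            first_seen[symbol] = len(first_seen) + 1
        result.append(first_seen[symbol])
    return result


print(pattern(["US", "Slovenia", "US", "US", "Slovenia"]))
# [1, 2, 1, 1, 2]
print(pattern(["Brown", "Blue", "Brown", "Green", "Brown", "Black", "Blue"]))
# [1, 2, 1, 3, 1, 4, 2]
```

Two sequences with the same pattern are treated as the same observation, which is what "ignore symbols" means here.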
Applications
Probability multiset {.5, .3, .1, .05, .05}
Who cares?
Population: species, gene mutations, sensors
Coverage: experiments, virus strains
Robustness: capacity, memory, buffer
Security: unlikely events
Information: entropy
Statistics
Everyone
Martianese
Martian UFO has landed
First message: @#@#@@#...@#@ (60 @'s, 40 #'s)
Statisticians analyze Martianese
Assuming independence, they determine:
2 symbols: one has probability .6, the other .4
{.6, .4}
Methodology
Empirical frequency or maximum likelihood
Maximum likelihood
Distribution maximizing P(x)
x = @#@#@@#...@#@ (60 @, 40 #)
p(@) = p, p(#) = 1-p
$P(x) = p^{60}(1-p)^{40}$ (independence)
$P'(x) = 60\,p^{59}(1-p)^{40} - 40\,p^{60}(1-p)^{39} = p^{59}(1-p)^{39}\,[\,60(1-p) - 40p\,] = 0$
Maximized by p(@) = p = .6, p(#) = .4 → {.6, .4}
Agrees with empirical frequency, always
20 @, 50 #, 30 & → {.2, .5, .3} = {.5, .3, .2}
Works well for small alphabets, poorly for large
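A minimal Python sketch of the sequence maximum-likelihood estimate for the Martianese message. The function names and the grid search are illustrative assumptions. The grid only double-checks numerically that p = .6 maximizes the likelihood of 60 @'s and 40 #'s.

```python
import math
from collections import Counter


def sequence_ml(sample):
    """Sequence ML estimate: the empirical frequency of each symbol."""
    counts = Counter(sample)
    n = len(sample)
    return {symbol: c / n for symbol, c in counts.items()}


def ml_multiset(sample):
    """Probability multiset of the ML estimate, largest first."""
    return sorted(sequence_ml(sample).values(), reverse=True)


def log_likelihood(p, n_at=60, n_hash=40):
    """log of p^60 (1-p)^40 for the two-symbol message."""
    return n_at * math.log(p) + n_hash * math.log(1 - p)


martianese = "@" * 60 + "#" * 40
print(ml_multiset(martianese))                      # [0.6, 0.4]
print(ml_multiset("@" * 20 + "#" * 50 + "&" * 30))  # [0.5, 0.3, 0.2]

# Numerical check on a grid of candidate values p = 0.001, ..., 0.999.
grid = [k / 1000 for k in range(1, 1000)]
best = max(grid, key=log_likelihood)
print(best)                                         # 0.6
```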
Marchi
tia nese
講座非常悶。。。不過請報以大笑 ("The lecture is very boring... but please respond with loud laughter")
100 symbols, all distinct
Sequence maximum likelihood
Alphabet size 100
Each symbol probability 1/100
Poor model, better: large, possibly ∞ alphabet
Observation: Values don't matter, consider pattern
Pattern maximum likelihood
Pattern: 1 2 3 4 . . . 98 99 100
All different
Maximized by: ?
Why?
Smaller α/β?
Different!
Better for very large α/β
Sequences v. patterns
Sequence: independent, identically distributed (iid)
Induces probability on patterns
ψ: pattern, e.g. 121
P(ψ) = P(observing ψ) = P(all outcomes with pattern ψ)
Example: Alphabet {a, b, c}
p(a) = p(b) = p(c) = 1/3
1 sample: P(1) = 1
2 samples: P(11) = P(aa, bb, cc) = 3 * 1/9 = 1/3 = P(second = first)
P(12) = P(ab, ac, ba, bc, ca, cb) = 6 * 1/9 = 2/3 = P(second ≠ first)
Pattern not distributed iid → Different math
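The induced pattern probability can be checked by brute force for small cases. The sketch below sums the probabilities of all sequences that share a pattern. The function names are assumptions, and the enumeration is exponential in the pattern length, so it is only meant for toy examples like the one above.

```python
from fractions import Fraction
from itertools import product


def pattern(sequence):
    """Replace each symbol by its order of first appearance (1-based)."""
    first_seen = {}
    return [first_seen.setdefault(s, len(first_seen) + 1) for s in sequence]


def pattern_probability(psi, distribution):
    """P(psi): total probability of all iid sequences whose pattern is psi.

    distribution maps each symbol to its probability (as a Fraction).
    """
    total = Fraction(0)
    for outcome in product(distribution, repeat=len(psi)):
        if pattern(outcome) == list(psi):
            prob = Fraction(1)
            for symbol in outcome:
                prob *= distribution[symbol]
            total += prob
    return total


uniform3 = {s: Fraction(1, 3) for s in "abc"}
print(pattern_probability([1], uniform3))     # 1
print(pattern_probability([1, 1], uniform3))  # 1/3
print(pattern_probability([1, 2], uniform3))  # 2/3
```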
Pattern maximum likelihood
Given a pattern, e.g. 1213
Find highest probability $\hat{P}$, and achieving distribution
That's it!
Length 1: $\hat{P}(1) = 1$, any distribution
Length 2: $\hat{P}(11) = 1$, constant distributions
$\hat{P}(12) = 1$, large distributions
Length 3: $\hat{P}(111) = 1$
$\hat{P}(123) = 1$
$\hat{P}(112) = \hat{P}(121) = \hat{P}(122) = {?}$
Flip 3 coins: h, h, t
Sequence ML → {2/3, 1/3}
Pattern ML → ?
$\hat{P}(112) = 1/4$
≥: Coin, P(h) = P(t) = .5
P(112) = P(hht ∨ tth) = P(hht) + P(tth) = 1/8 + 1/8 = 1/4
≤: $P(112) = \sum_x p^2(x)\,(1-p(x)) = \sum_x p(x)\,[\,p(x)(1-p(x))\,] \le \sum_x p(x)\cdot\tfrac{1}{4} = \tfrac{1}{4}$
h,h,t: Sequence ML: {2/3, 1/3}, Pattern ML: {.5, .5}
Which is more logical?
Not p(h) & p(t), just multiset
3 flips, can't get 1.5 each → 2&1 closest to uniform
{2/3, 1/3} & others → 3 same more likely → 2&1 less likely
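A short numerical check of the two bounds, using the formula P(112) = Σx p²(x)(1-p(x)) from the slide. The function name `prob_112` and the choice of test distributions are assumptions made for illustration.

```python
from fractions import Fraction


def prob_112(multiset):
    """P(112) for an iid source with the given probability multiset."""
    return sum(p * p * (1 - p) for p in multiset)


fair_coin = [Fraction(1, 2), Fraction(1, 2)]
sequence_ml = [Fraction(2, 3), Fraction(1, 3)]

print(prob_112(fair_coin))    # 1/4
print(prob_112(sequence_ml))  # 2/9

# No two-symbol distribution {p, 1-p} on a grid of 101 values exceeds 1/4.
grid = [Fraction(k, 100) for k in range(101)]
largest = max(prob_112([p, 1 - p]) for p in grid)
print(largest)                # 1/4
```

The sequence ML multiset {2/3, 1/3} gives the observed pattern probability 2/9, which is lower than the 1/4 achieved by {.5, .5}.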
Theoretical Results
Skip
Experiments
Uniform
500 symbols, 350 samples
2x6, 3x4, 13x3, 63x2, 161x1, 258x0
242 appeared, 258 did not
[Figure: Underlying, SML, and PML distributions; log-scale vertical axis from 10^-4 to 10^-2, horizontal axis from 200 to 1000]
[Figure: U[500], 350 samples, 12 experiments]
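Summaries such as "2x6, 3x4, ..., 258x0" list how many symbols appeared a given number of times (2 symbols appeared 6 times, and so on). A minimal sketch of how such a summary could be computed from a sample follows. The function names, the `alphabet_size` parameter, and the random seed are assumptions, and the random draw will not reproduce the slide's exact counts.

```python
import random
from collections import Counter


def profile(sample, alphabet_size=None):
    """Map each multiplicity to the number of symbols seen that many times."""
    multiplicities = Counter(sample)          # symbol -> times seen
    prof = Counter(multiplicities.values())   # multiplicity -> number of symbols
    if alphabet_size is not None:
        # Symbols of a known alphabet that never appeared.
        prof[0] = alphabet_size - len(multiplicities)
    return dict(sorted(prof.items(), reverse=True))


def format_profile(prof):
    """Render in the slides' 'count x multiplicity' notation."""
    return ", ".join(f"{count}x{mult}" for mult, count in prof.items())


print(format_profile(profile("abracadabra")))  # 1x5, 2x2, 2x1

# 350 draws from a uniform distribution over 500 symbols.
random.seed(0)
sample = [random.randrange(500) for _ in range(350)]
print(format_profile(profile(sample, alphabet_size=500)))
```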
Uniform
500 symbols
350 samples: 2x6, 3x4, 13x3, 63x2, 161x1, 258x0 (242 appeared, 258 did not)
700 samples: 3x6, 2x5, 19x4, 62x3, 114x2, 182x1, 118x0
[Figure: Underlying, SML, and PML distributions for 350 and 700 samples; log-scale vertical axes, horizontal axes from 200 to 1000]
[Figure: U[500], 700 samples, 12 experiments]
Staircase
15K elements, 5 steps, ~3x
30K samples
Observe 8,882 elts
6,118 missing
1x66, 2x62, 2x60, ..., 1990x2, 4323x1, 6118x0
[Figure: Underlying, SML, and PML distributions; log-scale vertical axis from 10^-7 to 10^-3, horizontal axis from 2000 to 14000]
Zipf
Underlies many natural phenomena
pi=C/i, i=100…15,000
30,000 samples
Observe 9,047 elts
5,953 missing
1x71, 1x64, 1x61, ..., 1991x2, 4236x1, 5953x0
[Figure: Underlying, SML, and PML distributions; log-scale vertical axis from 10^-7 to 10^-3, horizontal axis from 2000 to 14000]
1990 Census - Last names
Name        Frequency (%)   Cumulative (%)   Rank
SMITH       1.006           1.006            1
JOHNSON     0.810           1.816            2
WILLIAMS    0.699           2.515            3
JONES       0.621           3.136            4
BROWN       0.621           3.757            5
DAVIS       0.480           4.237            6
MILLER      0.424           4.660            7
WILSON      0.339           5.000            8
MOORE       0.312           5.312            9
TAYLOR      0.311           5.623            10
...
AMEND       0.001           77.478           18835
ALPHIN      0.001           77.478           18836
ALLBRIGHT   0.001           77.479           18837
AIKIN       0.001           77.479           18838
ACRES       0.001           77.480           18839
ZUPAN       0.000           77.480           18840
ZUCHOWSKI   0.000           77.481           18841
ZEOLLA      0.000           77.481           18842
18,839 names
77.48% of population
~230 million
1990 Census - Last names
18,839 last names based on ~230 million
35,000 samples, observed 9,813 names
1x466, 1x343, 1x296, ..., 2067x2, 5108x1, 9026x0
[Figure: Underlying, SML, and PML distributions; log-scale vertical axis from 10^-6 to 10^-2, horizontal axis from 0.5 to 2 (x 10^4)]
Population estimation
Malayan butterflies [Fisher Corbet Williams 43]
Maximum likelihood doesn’t work
Poisson distribution, Gamma prior
New species [Good Toulmin 56]
Mutations, Strains
Sensors
Many others
Shakespeare
899,306 words, 27,689 distinct (half appear once)
[76]: In a new sample of equal size: 11,430 new words
[85, Taylor]: "Shall I die?"
429 words, 9 new: joying, scanty, speck, ...
[87]: Expect 6.93 new
Pattern maximum likelihood: 6.98 new
# new symbols
Zipf distribution over 15K elements
Sample 30K times
Estimate: # new symbols in sample of size λ * 30K
Good-Toulmin: λ < 1
Estimate PML & predict
Extends to λ > 1
Applies to other predictions
C/i, i∈[100,...,15099], 30000 samples
1x71, 1x64, 1x61, ..., 1991x2, 4236x1, 5953x0
[Figure: Expected # new symbols against λ (0 to 4) for Underlying, Good-Toulmin, and PML; vertical axis from 0 to 6000]
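For reference, the textbook Good-Toulmin series estimates the number of new symbols in λ·n further samples as Σ_{i≥1} (-1)^(i+1) λ^i φ_i, where φ_i is the number of symbols seen exactly i times in the first n samples. The sketch below implements that series only. The function name, the dictionary input, and the toy counts are assumptions, and the talk may use a different variant.

```python
def good_toulmin(prevalences, lam):
    """Good-Toulmin estimate of new symbols in lam * n further samples.

    prevalences maps a multiplicity i >= 1 to the number of symbols
    that appeared exactly i times in the original n samples.
    """
    return sum(
        (-1) ** (i + 1) * lam ** i * phi
        for i, phi in prevalences.items()
        if i >= 1
    )


# Toy counts (hypothetical): 3 symbols seen once, 1 symbol seen twice.
toy = {1: 3, 2: 1}
print(good_toulmin(toy, 0.5))  # 1.25
print(good_toulmin(toy, 1.0))  # 2.0
```

Because the terms grow like λ^i, the alternating series becomes unstable for λ > 1, which matches the slide's remark that Good-Toulmin is used for λ < 1.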
Probability of Specific Elements
Can Patterns Work?
US, Slovenia, US, US -> 1, 2, 1, 1
Lost connection to original?
No
1 - US, 2 - Slovenia
After 1, 2, 1, 1, estimate probabilities of 1, 2, 3
Probabilities of observed symbols & NEW
Text classification
Train: Documents pre-classified into topics
Business, technology, politics, sports,...
Test: New document
Find topic
Existing methods: Many
Nearest Neighbor, Boosting, ...
Support vector machines (SVM)
Current approach
Plain probability estimation
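To make "plain probability estimation" concrete, here is a minimal sketch of classification with the Laplace (add-one) estimator, one of the baselines compared in the talk. It is not the pattern-based estimator. The function names, the toy documents, and the assumption of equal topic priors are all illustrative.

```python
import math
from collections import Counter


def train(docs_by_topic):
    """Count words per topic; docs_by_topic maps topic -> list of token lists."""
    counts = {
        topic: Counter(word for doc in docs for word in doc)
        for topic, docs in docs_by_topic.items()
    }
    vocab = {word for c in counts.values() for word in c}
    return counts, vocab


def classify(tokens, counts, vocab):
    """Pick the topic with the highest add-one smoothed log-likelihood."""
    def score(topic):
        c = counts[topic]
        total = sum(c.values())
        # Laplace estimate: (count + 1) / (total + vocabulary size).
        return sum(
            math.log((c[w] + 1) / (total + len(vocab)))
            for w in tokens
            if w in vocab  # words never seen in training are skipped
        )
    return max(counts, key=score)


# Hypothetical toy training set.
training = {
    "sports": [["goal", "match", "team"], ["team", "win"]],
    "business": [["market", "stock", "profit"], ["stock", "deal"]],
}
counts, vocab = train(training)
print(classify(["team", "goal"], counts, vocab))     # sports
print(classify(["stock", "market"], counts, vocab))  # business
```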
Comparison
Methods: Nearest neighbor, Laplace, SVM, Patterns
Data sets: Newsgroups, Reuters R52, CADE
Platform: Rainbow (CMU, UMass)
                   R52     Newsgroups   CADE
Laplace            88.43   80.92        57.27
Nearest neighbor   83.22   75.93        51.20
SVM                93.57   82.84        52.84
Patterns           91.98   82.92        59.24
Thank You!