Introduction
Le Song
Machine Learning I
CSE 6740, Fall 2013
What is Machine Learning (ML)?
Study of algorithms that improve their performance at some task with experience.
Common in industrial-scale problems
13 million Wikipedia pages
800 million users
6 billion photos
24 hours of video uploaded per minute
> 1 trillion webpages
Organizing Images
Image databases
What are the desired outcomes?
What are the inputs (data)?
What are the learning paradigms?
Visualize Image Relations
Each image has thousands or millions of pixels.
What are the desired outcomes?
What are the inputs (data)?
What are the learning paradigms?
Organizing documents
Reading, digesting, and categorizing a vast text database is too much for humans!
We want:
What are the desired outcomes?
What are the inputs (data)?
What are the learning paradigms?
Weather Prediction
Predict numeric values: 40 °F, wind NE at 14 km/h, humidity 83%
What are the desired outcomes?
What are the inputs (data)?
What are the learning paradigms?
Face Detection
What are the desired outcomes?
What are the inputs (data)?
What are the learning paradigms?
Understanding brain activity
What are the desired outcomes?
What are the inputs (data)?
What are the learning paradigms?
Product Recommendation
What are the desired outcomes?
What are the inputs (data)?
What are the learning paradigms?
Handwritten digit recognition/text annotation
Inter-character dependency
Inter-word dependency
“Aoccdrnig to a sudty at Cmabrigde Uinervtisy, it deosn’t mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae. The rset can be a ttoal mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe.”
What are the desired outcomes?
What are the inputs (data)?
What are the learning paradigms?
Similar problem: speech recognition
Models: Hidden Markov Models
Audio signals → text: “Machine Learning is the preferred method for speech recognition …”
Similar problem: bioinformatics
Where is the gene?
[Figure: a long stretch of DNA sequence (a, c, g, t) with the phrase “whereisthegene” hidden inside it, illustrating how hard it is to spot a gene by eye.]
Spam Filtering
What are the desired outcomes?
What are the inputs (data)?
What are the learning paradigms?
Similar problem: webpage classification
Company homepage vs. University homepage
Robot Control
Now cars can find their own way!
What are the desired outcomes?
What are the inputs (data)?
What are the learning paradigms?
Nonlinear classifier
Nonlinear decision boundaries vs. linear SVM decision boundaries
Nonconventional clusters
Need more advanced methods, such as kernel methods or spectral clustering, to work
Syllabus
Covers a number of the most commonly used machine learning algorithms, in sufficient detail about how they work.
Organization
Unsupervised learning (data exploration)
Clustering, dimensionality reduction, density estimation, novelty detection
Supervised learning (predictive models)
Classification, regression
Complex models (dealing with nonlinearity, combining models, etc.)
Kernel methods, graphical models, boosting
Prerequisites
Probabilities
Distributions, densities, marginalization, conditioning, …
Basic statistics
Moments, classification, regression, maximum likelihood estimation
Algorithms
Dynamic programming, basic data structures, complexity
Programming
Mostly your choice of language, but Matlab will be very useful
The class will be fast-paced
Ability to deal with abstract mathematical concepts
Textbooks
Pattern Recognition and Machine Learning, Chris Bishop
The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Trevor Hastie, Robert Tibshirani, Jerome Friedman
Machine Learning, Tom Mitchell
Grading
6 assignments (60%)
Approximately 1 assignment every 4 lectures
Start early
Midterm exam (20%)
Final exam (20%)
Project for advanced students
Can be used to replace exams
Based on the student's experience and the lecturer's interests.
Homeworks
Zero credit after each deadline
All homeworks must be handed in, even for zero credit
Collaboration
You may discuss the questions
Each student writes their own answers
List on your homework anyone with whom you collaborated
Each student must write their own code for the programming part
Staff
Instructor: Le Song, Klaus 1340
TAs: Joonseok (Klaus 1305), Nan Du (Klaus 1305)
Guest Lecturer: TBD
Administrative Assistant: Mimi Haley, Klaus 1321
Mailing list: mlcda2013@gmail.com
More information:
http://www.cc.gatech.edu/~lsong/teaching/CSE6740fall13.html
Today
Probabilities
Independence
Conditional Independence
Random Variables (RV)
Data may contain many different attributes
Age, grade, color, location, coordinate, time …
Upper-case for random variables (e.g. 𝑋, π‘Œ), lower-case for values (e.g. π‘₯, 𝑦)
𝑃(𝑋) for distribution, 𝑝(𝑋) for density
π‘‰π‘Žπ‘™(𝑋) = possible values of random variable 𝑋
For discrete (categorical): βˆ‘_{𝑖=1}^{|π‘‰π‘Žπ‘™(𝑋)|} 𝑃(𝑋 = π‘₯𝑖) = 1
For continuous: ∫_{π‘‰π‘Žπ‘™(𝑋)} 𝑝(𝑋 = π‘₯) 𝑑π‘₯ = 1
𝑃(π‘₯) ≥ 0
Shorthand: 𝑃(π‘₯) for 𝑃(𝑋 = π‘₯)
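As a quick illustration of these normalization conditions (my own sketch, not from the slides; assumes NumPy), one can check that a discrete distribution sums to 1 and that a density integrates to roughly 1:

```python
import numpy as np

# Discrete RV: P(X = x_i) over Val(X) = {0, 1, 2} must sum to 1.
p_discrete = np.array([0.2, 0.5, 0.3])
print(p_discrete.sum())                      # 1.0

# Continuous RV: a density p(x) must integrate to 1 over Val(X).
# Standard normal density, integrated numerically on a fine grid.
xs = np.linspace(-10.0, 10.0, 10001)
dx = xs[1] - xs[0]
p = np.exp(-xs**2 / 2) / np.sqrt(2 * np.pi)
print((p * dx).sum())                        # approximately 1.0
```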
Interpretations of probability
Frequentists
𝑃(π‘₯) is the frequency of π‘₯ in the limit
Many arguments against this interpretation
What is the frequency of the event “it will rain tomorrow?”
Subjective interpretation
𝑃(π‘₯) is my degree of belief that π‘₯ will happen
What does “degree of belief” mean?
If 𝑃(π‘₯) = 0.8, then I am willing to bet accordingly
For this class, we do not need to commit to a particular interpretation.
Conditional probability
After we have seen π‘₯, how likely do we think 𝑦 is to happen?
𝑃(𝑦|π‘₯) means 𝑃(π‘Œ = 𝑦|𝑋 = π‘₯)
A conditional distribution is really a family of distributions
For each 𝑋 = π‘₯, it is a distribution 𝑃(π‘Œ|π‘₯)
Two of the most important rules: I. The chain rule
𝑃(𝑦, π‘₯) = 𝑃(𝑦|π‘₯) 𝑃(π‘₯)
More generally:
𝑃(π‘₯1, π‘₯2, … , π‘₯π‘˜) = 𝑃(π‘₯1) 𝑃(π‘₯2|π‘₯1) β‹― 𝑃(π‘₯π‘˜|π‘₯π‘˜−1, … , π‘₯2, π‘₯1)
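The chain rule can be checked numerically on a toy joint table (a minimal sketch of my own, assuming NumPy; the table values are made up):

```python
import numpy as np

# Toy joint P(X, Y) over binary X (rows) and binary Y (columns); values made up.
P_xy = np.array([[0.10, 0.30],
                 [0.20, 0.40]])

P_x = P_xy.sum(axis=1)               # P(X), marginalizing out Y
P_y_given_x = P_xy / P_x[:, None]    # P(Y | X = x), one row per x

# Chain rule: P(y, x) = P(y | x) P(x), recovered entry by entry.
print(np.allclose(P_y_given_x * P_x[:, None], P_xy))   # True
```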
Two of the most important rules: II. Bayes’ rule
𝑃(𝑦|π‘₯) = 𝑃(π‘₯|𝑦) 𝑃(𝑦) / 𝑃(π‘₯) = 𝑃(π‘₯, 𝑦) / βˆ‘_{𝑦′ ∈ π‘‰π‘Žπ‘™(π‘Œ)} 𝑃(π‘₯, 𝑦′)
posterior = likelihood × prior / normalization constant
More generally, with an additional variable 𝑧:
𝑃(𝑦|π‘₯, 𝑧) = 𝑃(π‘₯|𝑦, 𝑧) 𝑃(𝑦|𝑧) / 𝑃(π‘₯|𝑧)
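A minimal sketch of Bayes’ rule as “multiply likelihood by prior, then normalize” (my own illustration with made-up numbers, assuming NumPy):

```python
import numpy as np

# Two hypotheses y with prior P(y), and the likelihood P(x | y) of one observed x.
prior = np.array([0.5, 0.5])         # P(y)      (made-up numbers)
likelihood = np.array([0.9, 0.2])    # P(x | y)  (made-up numbers)

joint = likelihood * prior           # P(x, y) = P(x | y) P(y)
posterior = joint / joint.sum()      # P(y | x); joint.sum() is the normalization constant P(x)
print(posterior)                     # [0.818..., 0.181...]
```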
Independence
𝑋 and π‘Œ independent, if 𝑃(π‘Œ|𝑋) = 𝑃(π‘Œ)
𝑃 π‘Œ 𝑋 = 𝑃 π‘Œ → (𝑋 ⊥ π‘Œ)
Proposition: 𝑋 and π‘Œ independent if and only if
𝑃(𝑋, π‘Œ) = 𝑃(𝑋)𝑃(π‘Œ)
31
Conditional independence
Independence is rarely true; conditional independence is more prevalent
𝑋 and π‘Œ are conditionally independent given 𝑍 if 𝑃(π‘Œ|𝑋, 𝑍) = 𝑃(π‘Œ|𝑍), denoted (𝑋 ⊥ π‘Œ | 𝑍)
(𝑋 ⊥ π‘Œ | 𝑍) if and only if 𝑃(𝑋, π‘Œ|𝑍) = 𝑃(𝑋|𝑍) 𝑃(π‘Œ|𝑍)
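Conditional independence can be verified numerically by comparing 𝑃(𝑋, π‘Œ|𝑧) with 𝑃(𝑋|𝑧)𝑃(π‘Œ|𝑧) for every 𝑧 (a sketch of my own, with a joint constructed to satisfy the property):

```python
import numpy as np

# Build a joint P(X, Y, Z) in which X and Y are independent given Z (values made up).
P_z = np.array([0.4, 0.6])                   # P(Z)
P_x_given_z = np.array([[0.9, 0.1],          # P(X | Z = z), one row per z
                        [0.3, 0.7]])
P_y_given_z = np.array([[0.2, 0.8],          # P(Y | Z = z), one row per z
                        [0.5, 0.5]])

# P(x, y, z) = P(x | z) P(y | z) P(z), stored as an array indexed [z, x, y].
P_xyz = P_z[:, None, None] * P_x_given_z[:, :, None] * P_y_given_z[:, None, :]

for z in range(2):
    P_xy_given_z = P_xyz[z] / P_xyz[z].sum()             # P(X, Y | z)
    product = np.outer(P_x_given_z[z], P_y_given_z[z])   # P(X | z) P(Y | z)
    print(np.allclose(P_xy_given_z, product))            # True for every z
```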
Joint distribution, marginalization
Two random variables – Grades (G) & Intelligence (I)
𝑃(𝐺, 𝐼):
          I = VH   I = H
  G = A    0.70     0.10
  G = B    0.15     0.05
For 𝑛 binary variables, the table (multiway array) gets really big: 𝑃(𝑋1, 𝑋2, … , 𝑋𝑛) has 2^𝑛 entries!
Marginalization – compute the marginal over a single variable:
𝑃(𝐺 = 𝐡) = 𝑃(𝐺 = 𝐡, 𝐼 = 𝑉𝐻) + 𝑃(𝐺 = 𝐡, 𝐼 = 𝐻) = 0.15 + 0.05 = 0.2
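The marginalization above can be reproduced directly from the table (a small NumPy sketch of my own, not part of the slides):

```python
import numpy as np

# Joint P(G, I): rows are G in {A, B}, columns are I in {VH, H}.
P_GI = np.array([[0.70, 0.10],
                 [0.15, 0.05]])

P_G = P_GI.sum(axis=1)    # marginalize out I
print(P_G[1])             # P(G = B) = 0.15 + 0.05 = 0.2
```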
Marginalization – the general case
Compute the marginal distribution 𝑃(𝑋𝑖) from 𝑃(𝑋1, 𝑋2, … , 𝑋𝑖, 𝑋𝑖+1, … , 𝑋𝑛):
𝑃(𝑋1, 𝑋2, … , 𝑋𝑖) = βˆ‘_{π‘₯𝑖+1, … , π‘₯𝑛} 𝑃(𝑋1, 𝑋2, … , 𝑋𝑖, π‘₯𝑖+1, … , π‘₯𝑛)
𝑃(𝑋𝑖) = βˆ‘_{π‘₯1, … , π‘₯𝑖−1} 𝑃(π‘₯1, … , π‘₯𝑖−1, 𝑋𝑖)
For binary variables, this requires summing over 2^(𝑛−1) terms!
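In code, marginalizing a multiway array down to one variable is a sum over all other axes (my own sketch; the random joint is just for illustration):

```python
import numpy as np

n = 10
rng = np.random.default_rng(0)

# A random joint over n binary variables, stored as a 2 x 2 x ... x 2 array.
P = rng.random((2,) * n)
P /= P.sum()

i = 3                                             # the marginal we want, P(X_i)
other_axes = tuple(k for k in range(n) if k != i)
P_Xi = P.sum(axis=other_axes)                     # each entry sums 2**(n-1) terms
print(P_Xi, P_Xi.sum())                           # a length-2 marginal summing to 1
```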
Example problem
Estimate the probability πœƒ of landing heads for a biased coin
Given a sequence of 𝑁 independently and identically distributed (iid) flips
E.g., 𝐷 = {π‘₯1, π‘₯2, … , π‘₯𝑁} = {1, 0, 1, … , 0}, π‘₯𝑖 ∈ {0, 1}
Model: 𝑃(π‘₯|πœƒ) = πœƒ^π‘₯ (1 − πœƒ)^(1−π‘₯), i.e. 𝑃(π‘₯|πœƒ) = πœƒ for π‘₯ = 1 and 1 − πœƒ for π‘₯ = 0
Likelihood of a single observation π‘₯𝑖: 𝑃(π‘₯𝑖|πœƒ) = πœƒ^π‘₯𝑖 (1 − πœƒ)^(1−π‘₯𝑖)
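A minimal sketch of this Bernoulli model and the likelihood of an iid dataset (my own illustration; the data and the value of πœƒ are made up):

```python
import numpy as np

def bernoulli_likelihood(x, theta):
    """P(x | theta) = theta^x * (1 - theta)^(1 - x) for x in {0, 1}."""
    return theta**x * (1 - theta)**(1 - x)

D = np.array([1, 0, 1, 1, 0, 1])                  # made-up sequence of flips
theta = 0.6                                       # made-up parameter value
# For iid flips, the likelihood of the dataset is the product over observations.
print(np.prod(bernoulli_likelihood(D, theta)))    # theta^4 * (1 - theta)^2
```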
Frequentist Parameter Estimation
Frequentists think of a parameter as a fixed, unknown constant, not a random variable
Hence different “objective” estimators, instead of Bayes’ rule
These estimators have different properties, such as being “unbiased”, “minimum variance”, etc.
A very popular estimator is the maximum likelihood estimator (MLE), which is simple and has good statistical properties
πœƒ_𝑀𝐿𝐸 = argmax_πœƒ 𝑃(𝐷|πœƒ) = argmax_πœƒ ∏_{𝑖=1}^{𝑁} 𝑃(π‘₯𝑖|πœƒ)
MLE for Biased Coin
Objective function: the log likelihood
𝑙(πœƒ; 𝐷) = log 𝑃(𝐷|πœƒ) = log[πœƒ^π‘›β„Ž (1 − πœƒ)^𝑛𝑑] = π‘›β„Ž log πœƒ + (𝑁 − π‘›β„Ž) log(1 − πœƒ)
We need to maximize this w.r.t. πœƒ
Take the derivative w.r.t. πœƒ and set it to zero:
βˆ‚π‘™/βˆ‚πœƒ = π‘›β„Ž/πœƒ − (𝑁 − π‘›β„Ž)/(1 − πœƒ) = 0  ⇒  πœƒ_𝑀𝐿𝐸 = π‘›β„Ž/𝑁, or equivalently πœƒ_𝑀𝐿𝐸 = (1/𝑁) βˆ‘_𝑖 π‘₯𝑖
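The closed-form MLE can be checked against a direct numerical maximization of the log likelihood (a sketch of my own, with made-up flips):

```python
import numpy as np

D = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])    # made-up flips: n_h = 7, N = 10
N, n_h = len(D), D.sum()

theta_mle = n_h / N                              # closed form n_h / N
print(theta_mle)                                 # 0.7

# Numerical check: evaluate the log likelihood on a grid and take the argmax.
thetas = np.linspace(0.001, 0.999, 999)
loglik = n_h * np.log(thetas) + (N - n_h) * np.log(1 - thetas)
print(thetas[np.argmax(loglik)])                 # approximately 0.7
```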
Bayesian Parameter Estimation
Bayesians treat the unknown parameter as a random variable, whose distribution can be inferred using Bayes’ rule:
𝑃(πœƒ|𝐷) = 𝑃(𝐷|πœƒ) 𝑃(πœƒ) / 𝑃(𝐷) = 𝑃(𝐷|πœƒ) 𝑃(πœƒ) / ∫_πœƒ 𝑃(𝐷|πœƒ) 𝑃(πœƒ) π‘‘πœƒ
The crucial equation can be written in words:
posterior = likelihood × prior / marginal likelihood
For iid data, the likelihood is
𝑃(𝐷|πœƒ) = ∏_{𝑖=1}^{𝑁} 𝑃(π‘₯𝑖|πœƒ) = ∏_{𝑖=1}^{𝑁} πœƒ^π‘₯𝑖 (1 − πœƒ)^(1−π‘₯𝑖) = πœƒ^(βˆ‘_𝑖 π‘₯𝑖) (1 − πœƒ)^(βˆ‘_𝑖 (1−π‘₯𝑖)) = πœƒ^#β„Žπ‘’π‘Žπ‘‘π‘  (1 − πœƒ)^#π‘‘π‘Žπ‘–π‘™π‘ 
The prior 𝑃(πœƒ) encodes our prior knowledge about the domain
Different priors 𝑃(πœƒ) will end up with different estimates 𝑃(πœƒ|𝐷)!
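Before specializing to the conjugate Beta prior on the next slide, the posterior can be computed generically on a grid of πœƒ values by multiplying likelihood by prior and normalizing (a sketch of my own, using a flat prior purely for illustration):

```python
import numpy as np

D = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])     # made-up flips
n_h = D.sum()
n_t = len(D) - n_h

thetas = np.linspace(0.001, 0.999, 999)          # grid of candidate theta values
prior = np.ones_like(thetas)                     # a flat prior, for illustration
likelihood = thetas**n_h * (1 - thetas)**n_t     # P(D | theta) at each grid point

posterior = likelihood * prior
posterior /= posterior.sum()                     # normalize on the grid
print(thetas[np.argmax(posterior)])              # posterior mode, about 0.7 here
```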
Bayesian estimation for biased coin
Prior over πœƒ: the Beta distribution
𝑃(πœƒ; 𝛼, 𝛽) = [Γ(𝛼 + 𝛽) / (Γ(𝛼) Γ(𝛽))] πœƒ^(𝛼−1) (1 − πœƒ)^(𝛽−1)
(Γ(π‘₯ + 1) = π‘₯ Γ(π‘₯); for a positive integer π‘₯, Γ(π‘₯ + 1) = π‘₯!)
Posterior distribution of πœƒ:
𝑃(πœƒ|π‘₯1, … , π‘₯𝑁) = 𝑃(π‘₯1, … , π‘₯𝑁|πœƒ) 𝑃(πœƒ) / 𝑃(π‘₯1, … , π‘₯𝑁) ∝ πœƒ^π‘›β„Ž (1 − πœƒ)^𝑛𝑑 Β· πœƒ^(𝛼−1) (1 − πœƒ)^(𝛽−1) = πœƒ^(π‘›β„Ž+𝛼−1) (1 − πœƒ)^(𝑛𝑑+𝛽−1)
𝛼 and 𝛽 are hyperparameters and correspond to the number of “virtual” heads and tails (pseudo counts)
Bayesian Estimation for Bernoulli
Posterior distribution of πœƒ:
𝑃(πœƒ|π‘₯1, … , π‘₯𝑁) = 𝑃(π‘₯1, … , π‘₯𝑁|πœƒ) 𝑃(πœƒ) / 𝑃(π‘₯1, … , π‘₯𝑁) ∝ πœƒ^π‘›β„Ž (1 − πœƒ)^𝑛𝑑 Β· πœƒ^(𝛼−1) (1 − πœƒ)^(𝛽−1) = πœƒ^(π‘›β„Ž+𝛼−1) (1 − πœƒ)^(𝑛𝑑+𝛽−1)
Posterior mean estimation:
πœƒ_π΅π‘Žπ‘¦π‘’π‘  = ∫ πœƒ 𝑃(πœƒ|𝐷) π‘‘πœƒ = 𝐢 ∫ πœƒ Β· πœƒ^(π‘›β„Ž+𝛼−1) (1 − πœƒ)^(𝑛𝑑+𝛽−1) π‘‘πœƒ = (π‘›β„Ž + 𝛼) / (𝑁 + 𝛼 + 𝛽)
Prior strength: 𝐴 = 𝛼 + 𝛽
𝐴 can be interpreted as the size of an imaginary (pseudo-count) dataset
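A minimal sketch of the Beta–Bernoulli update and this posterior-mean estimator (my own illustration; the data and hyperparameters are made up):

```python
import numpy as np

D = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])      # made-up flips: n_h = 7, n_t = 3
N, n_h = len(D), D.sum()
n_t = N - n_h

alpha, beta = 2.0, 2.0                            # made-up hyperparameters (pseudo counts)

# Conjugacy: Beta(alpha, beta) prior + Bernoulli data -> Beta(alpha + n_h, beta + n_t) posterior.
alpha_post, beta_post = alpha + n_h, beta + n_t
print(alpha_post, beta_post)                      # Beta(9, 5) posterior

theta_bayes = (n_h + alpha) / (N + alpha + beta)  # posterior mean
theta_mle = n_h / N
print(theta_bayes, theta_mle)                     # 0.642..., 0.7: the prior pulls the estimate toward 0.5
```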