Introduction
Le Song
Machine Learning I, CSE 6740, Fall 2013

What is Machine Learning (ML)?
The study of algorithms that improve their performance at some task with experience.

Machine learning is common to industrial-scale problems:
13 million Wikipedia pages
800 million users
6 billion photos
24 hours of video uploaded per minute
> 1 trillion webpages

For each of the example applications below, consider three questions: What are the desired outcomes? What are the inputs (data)? What are the learning paradigms?

Organizing Images
Image databases.

Visualizing Image Relations
Each image has thousands or millions of pixels.

Organizing Documents
Reading, digesting, and categorizing a vast text database is too much for humans.

Weather Prediction
Predict numeric values, e.g., temperature 40 F, wind NE at 14 km/h, humidity 83%.

Face Detection

Understanding Brain Activity

Product Recommendation

Handwritten Digit Recognition / Text Annotation
Both inter-character dependency and inter-word dependency matter. For example:
"Aoccdrnig to a sudty at Cmabrigde Uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae. The rset can be a ttoal mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe."
Similar problem: speech recognition
Map audio signals to text ("Machine Learning is the preferred method for speech recognition ...").
Models: Hidden Markov Models.

Similar problem: bioinformatics
Where is the gene?
[The slide shows several thousand bases of genomic DNA (a/c/g/t) with the phrase "whereisthegene" hidden inside the sequence.]

Spam Filtering

Similar problem: webpage classification
Company homepage vs. university homepage.

Robot Control
Now cars can find their own way!

Nonlinear Classifiers
Linear SVM decision boundaries vs. nonlinear decision boundaries.

Nonconventional Clusters
Clusters that are not linearly separable need more advanced methods, such as kernel methods or spectral clustering.
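As a small, hedged illustration of the linear-vs-nonlinear point above (not part of the original slides; the dataset and parameters are arbitrary choices), here is a sketch using scikit-learn that compares a linear SVM with an RBF-kernel SVM on data that no straight line can separate:

```python
# Illustrative sketch (assumes scikit-learn and NumPy are installed).
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: the classes are not linearly separable.
X, y = make_circles(n_samples=400, noise=0.1, factor=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ["linear", "rbf"]:
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    print(kernel, "test accuracy:", clf.score(X_test, y_test))

# Expect the linear SVM to perform near chance level, while the RBF kernel
# separates the two rings almost perfectly.
```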
Syllabus
Cover a number of the most commonly used machine learning algorithms in sufficient detail to understand their mechanisms.
Organization:
Unsupervised learning (data exploration): clustering, dimensionality reduction, density estimation, novelty detection.
Supervised learning (predictive models): classification, regression.
Complex models (dealing with nonlinearity, combining models, etc.): kernel methods, graphical models, boosting.

Prerequisites
Probability: distributions, densities, marginalization, conditioning, ...
Basic statistics: moments, classification, regression, maximum likelihood estimation.
Algorithms: dynamic programming, basic data structures, complexity.
Programming: mostly your choice of language, but Matlab will be very useful.
The class will be fast paced and requires the ability to deal with abstract mathematical concepts.

Textbooks
Pattern Recognition and Machine Learning, Chris Bishop.
The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Trevor Hastie, Robert Tibshirani, Jerome Friedman.
Machine Learning, Tom Mitchell.

Grading
6 assignments (60%): approximately 1 assignment every 4 lectures; start early.
Midterm exam (20%).
Final exam (20%).
Project for advanced students: can be used to replace the exams; based on student experience and lecturer interests.

Homeworks
Zero credit after each deadline, but all homeworks must be handed in, even for zero credit.
Collaboration: you may discuss the questions, but each student writes their own answers and their own code for the programming part, and must list on the homework anyone with whom they collaborated.

Staff
Instructor: Le Song, Klaus 1340.
TAs: Joonseok (Klaus 1305), Nan Du (Klaus 1305).
Guest lecturer: TBD.
Administrative assistant: Mimi Haley, Klaus 1321.
Mailing list: mlcda2013@gmail.com
More information: http://www.cc.gatech.edu/~lsong/teaching/CSE6740fall13.html

Today
Probabilities, independence, conditional independence.

Random Variables (RV)
Data may contain many different attributes: age, grade, color, location, coordinate, time, ...
Upper case for random variables (e.g., X, Y), lower case for values (e.g., x, y).
P(X) denotes a distribution, p(X) a density.
Val(X) = the set of possible values of random variable X.
For discrete (categorical) X: Σ_{i=1}^{|Val(X)|} P(X = x_i) = 1.
For continuous X: ∫_{Val(X)} p(X = x) dx = 1, with p(x) ≥ 0.
Shorthand: P(x) for P(X = x).

Interpretations of probability
Frequentist: P(x) is the frequency of x in the limit. There are many arguments against this interpretation; for instance, what is the frequency of the event "it will rain tomorrow"?
Subjective: P(x) is my degree of belief that x will happen. But what does "degree of belief" mean? If P(x) = 0.8, then I am willing to bet accordingly.
For this class, the choice of interpretation does not matter.

Conditional probability
After we have seen x, how likely do we think y is?
P(y | x) means P(Y = y | X = x).
A conditional distribution is a family of distributions: for each value X = x, P(Y | x) is a distribution over Y.

Two of the most important rules: I. The chain rule
P(y, x) = P(y | x) P(x).
More generally:
P(x_1, x_2, ..., x_n) = P(x_1) P(x_2 | x_1) ⋯ P(x_n | x_{n−1}, ..., x_2, x_1).

Two of the most important rules: II. Bayes rule
posterior = likelihood × prior / normalization constant:
P(y | x) = P(x | y) P(y) / P(x), where P(x) = Σ_{y ∈ Val(Y)} P(x, y) is the normalization constant.
More generally, with an additional variable z:
P(y | x, z) = P(x | y, z) P(y | z) / P(x | z).

Independence
X and Y are independent if P(X | Y) = P(X); we write (X ⊥ Y).
Proposition: X and Y are independent if and only if P(X, Y) = P(X) P(Y).

Conditional independence
Independence is rarely true; conditional independence is more prevalent.
X and Y are conditionally independent given Z if P(X | Y, Z) = P(X | Z); we write (X ⊥ Y | Z).
(X ⊥ Y | Z) if and only if P(X, Y | Z) = P(X | Z) P(Y | Z).

Joint distribution, marginalization
Two random variables: Grade (G) and Intelligence (I).

P(G, I):       I = VH    I = H
    G = A       0.70      0.10
    G = B       0.15      0.05

For n binary variables, the table (a multiway array) gets really big: P(X_1, X_2, ..., X_n) has 2^n entries!
Marginalization computes the marginal over a single variable, e.g.,
P(G = B) = P(G = B, I = VH) + P(G = B, I = H) = 0.15 + 0.05 = 0.2.
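As a concrete companion to the table above (an illustrative sketch, not part of the original slides), the joint distribution can be stored as a small array, and marginals and conditionals fall out by summing and dividing:

```python
import numpy as np

# Joint distribution P(G, I) from the slide:
# rows index G in {A, B}, columns index I in {VH, H}.
P_GI = np.array([[0.70, 0.10],   # G = A
                 [0.15, 0.05]])  # G = B

# Marginalize out I to get P(G): sum each row.
P_G = P_GI.sum(axis=1)
print("P(G=A) =", P_G[0], " P(G=B) =", P_G[1])   # 0.8 and 0.2

# Condition on G = B: P(I | G=B) = P(G=B, I) / P(G=B).
P_I_given_B = P_GI[1] / P_G[1]
print("P(I=VH|G=B) =", P_I_given_B[0], " P(I=H|G=B) =", P_I_given_B[1])  # 0.75 and 0.25
```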
Marginalization: the general case
Compute the marginal distribution P(X_i) from the joint P(X_1, X_2, ..., X_i, X_{i+1}, ..., X_n):
P(X_1, X_2, ..., X_i) = Σ_{x_{i+1}, ..., x_n} P(X_1, X_2, ..., X_i, x_{i+1}, ..., x_n)
P(X_i) = Σ_{x_1, ..., x_{i−1}} P(x_1, ..., x_{i−1}, X_i)
With binary variables, this means summing over 2^{n−1} terms!

Example problem
Estimate the probability θ of landing heads using a biased coin, given a sequence of N independently and identically distributed (iid) flips, e.g., D = {x_1, x_2, ..., x_N} = {1, 0, 1, ..., 0}, x_i ∈ {0, 1}.
Model: p(x | θ) = θ^x (1 − θ)^{1−x}, i.e., p(x | θ) = θ for x = 1 and 1 − θ for x = 0.
Likelihood of a single observation x_i: p(x_i | θ) = θ^{x_i} (1 − θ)^{1−x_i}.

Frequentist Parameter Estimation
Frequentists think of a parameter as a fixed, unknown constant, not a random variable; hence they use various "objective" estimators instead of Bayes' rule.
These estimators have different properties, such as being "unbiased", "minimum variance", etc.
A very popular estimator is the maximum likelihood estimator (MLE), which is simple and has good statistical properties:
θ̂ = argmax_θ P(D | θ) = argmax_θ Π_{i=1}^{N} p(x_i | θ).

MLE for the Biased Coin
Objective function: the log likelihood
ℓ(θ; D) = log P(D | θ) = log ( θ^{n_h} (1 − θ)^{n_t} ) = n_h log θ + n_t log(1 − θ),
where n_h is the number of heads and n_t = N − n_h is the number of tails.
We maximize this with respect to θ by setting the derivative to zero:
∂ℓ/∂θ = n_h / θ − (N − n_h) / (1 − θ) = 0  ⇒  θ_MLE = n_h / N = (1/N) Σ_i x_i.

Bayesian Parameter Estimation
Bayesians treat the unknown parameter as a random variable, whose distribution can be inferred using Bayes rule:
p(θ | D) = p(D | θ) p(θ) / p(D) = p(D | θ) p(θ) / ∫ p(D | θ) p(θ) dθ.
The crucial equation can be written in words: posterior = likelihood × prior / marginal likelihood.
For iid data, the likelihood is
p(D | θ) = Π_{i=1}^{N} p(x_i | θ) = Π_{i=1}^{N} θ^{x_i} (1 − θ)^{1−x_i} = θ^{Σ_i x_i} (1 − θ)^{Σ_i (1 − x_i)} = θ^{#heads} (1 − θ)^{#tails}.
The prior p(θ) encodes our prior knowledge of the domain; different priors p(θ) lead to different estimates p(θ | D)!

Bayesian estimation for the biased coin
Prior over θ: the Beta distribution
p(θ; α, β) = Γ(α + β) / (Γ(α) Γ(β)) · θ^{α−1} (1 − θ)^{β−1}.
(For integer x, Γ(x + 1) = x Γ(x) = x!.)
Posterior distribution of θ:
p(θ | x_1, ..., x_N) = p(x_1, ..., x_N | θ) p(θ) / p(x_1, ..., x_N) ∝ θ^{n_h} (1 − θ)^{n_t} · θ^{α−1} (1 − θ)^{β−1} = θ^{n_h + α − 1} (1 − θ)^{n_t + β − 1}.
α and β are hyperparameters and correspond to the number of "virtual" heads and tails (pseudo counts).

Bayesian Estimation for Bernoulli
The posterior is again of Beta form: p(θ | x_1, ..., x_N) ∝ θ^{n_h + α − 1} (1 − θ)^{n_t + β − 1}.
Posterior mean estimate:
θ_Bayes = ∫ θ p(θ | D) dθ = C ∫ θ · θ^{n_h + α − 1} (1 − θ)^{n_t + β − 1} dθ = (n_h + α) / (N + α + β).
Prior strength: A = α + β; A can be interpreted as the size of an imaginary dataset.
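To make the two estimators concrete, here is a small illustrative sketch (not from the slides; the simulated data, the true θ, and the prior hyperparameters are arbitrary choices) that computes the MLE n_h / N and the Bayesian posterior mean (n_h + α) / (N + α + β) for a simulated biased coin:

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 0.7                          # true (unknown) probability of heads
N = 20
D = rng.binomial(1, theta_true, size=N)   # iid coin flips, 1 = heads, 0 = tails

n_h = D.sum()                             # number of heads
n_t = N - n_h                             # number of tails

# Maximum likelihood estimate: theta_MLE = n_h / N.
theta_mle = n_h / N

# Bayesian posterior mean with a Beta(alpha, beta) prior:
# the posterior is Beta(n_h + alpha, n_t + beta), whose mean is
# (n_h + alpha) / (N + alpha + beta).
alpha, beta = 2.0, 2.0                    # "virtual" heads and tails (pseudo counts)
theta_bayes = (n_h + alpha) / (N + alpha + beta)

print("heads:", n_h, "out of", N)
print("theta_MLE   =", theta_mle)
print("theta_Bayes =", theta_bayes)
# With few flips the pseudo counts pull the Bayesian estimate toward 0.5;
# as N grows, both estimates converge to the true theta.
```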