Machine Learning with EM

闫宏飞 (Yan Hongfei)
School of Electronics Engineering and Computer Science, Peking University
7/24/2012
http://net.pku.edu.cn/~course/cs402/2012
Jimmy Lin, University of Maryland
SEWM Group

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license.
See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

Today's Agenda
• Introduction to statistical models
• Expectation maximization
• Apache Mahout

Introduction to statistical models
• Until the 1990s, text processing relied on rule-based systems
• Advantages
  – More predictable
  – Easy to understand
  – Easy to identify errors and fix them
• Disadvantages
  – Extremely labor-intensive to create
  – Not robust to out-of-domain input
  – No partial output or analysis when failure occurs

Introduction to statistical models
• A better strategy is to use data-driven methods
• Basic idea: learn from a large corpus of examples of what we wish to model (training data)
• Advantages
  – More robust to the complexities of real-world input
  – Creating training data is usually cheaper than creating rules
    • Even easier today thanks to Amazon Mechanical Turk
    • Data may already exist for independent reasons
• Disadvantages
  – Systems often behave differently than expected
  – Hard to understand the reasons for errors or to debug them

Introduction to statistical models
• Learning from training data usually means estimating the parameters of the statistical model
• Estimation is usually carried out via machine learning
• Two kinds of machine learning algorithms
• Supervised learning
  – Training data consists of inputs and their respective outputs (labels)
  – Labels are usually created via expert annotation (expensive)
  – Difficult to annotate when predicting more complex outputs
• Unsupervised learning
  – Training data consists of inputs only; no labels
  – One example of such an algorithm: Expectation Maximization (EM)

EM Algorithm

What is MLE?
• Given
  – A sample X = {X_1, ..., X_n}
  – A vector of parameters θ
• We define
  – Likelihood of the data: P(X | θ)
  – Log-likelihood of the data: L(θ) = log P(X | θ)
• Given X, find
  θ_ML = argmax_θ L(θ)

MLE (cont.)
• Often we assume that the X_i are independent and identically distributed (i.i.d.):
  θ_ML = argmax_θ L(θ)
       = argmax_θ log P(X | θ)
       = argmax_θ log P(X_1, ..., X_n | θ)
       = argmax_θ log Π_i P(X_i | θ)
       = argmax_θ Σ_i log P(X_i | θ)
• Depending on the form of P(x | θ), solving this optimization problem can be easy or hard.

An easy case
• Assumptions
  – A coin comes up heads with probability p and tails with probability 1 − p.
  – Observation: we toss the coin N times; the result is a sequence of Hs and Ts containing m Hs.
• What is the value of p based on MLE, given this observation?

An easy case (cont.)
  L(θ) = log P(X | θ) = log [ p^m (1 − p)^(N−m) ] = m log p + (N − m) log(1 − p)
  dL(θ)/dp = d[ m log p + (N − m) log(1 − p) ]/dp = m/p − (N − m)/(1 − p) = 0
  ⇒ p = m/N
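As a quick numeric check of the closed-form result p = m/N (not part of the original slides; the toss data and class name are made up for illustration), a minimal sketch that counts heads and confirms, by a coarse grid search over the log-likelihood, that m/N is indeed the maximizer:

  /** Minimal sketch: MLE for a Bernoulli (coin) parameter p. */
  public class CoinMLE {
    public static void main(String[] args) {
      // Hypothetical observation: 10 tosses, 7 heads (true = H).
      boolean[] tosses = {true, true, false, true, true, false, true, true, false, true};
      int n = tosses.length;
      int m = 0;
      for (boolean h : tosses) if (h) m++;

      // Closed-form MLE derived above: p = m / N.
      double pHat = (double) m / n;
      System.out.printf("Closed-form MLE: p = %.2f%n", pHat);

      // Sanity check: the log-likelihood m*log(p) + (N-m)*log(1-p)
      // is largest at (approximately) p = m/N on a coarse grid.
      double bestP = 0, bestLL = Double.NEGATIVE_INFINITY;
      for (double p = 0.01; p < 1.0; p += 0.01) {
        double ll = m * Math.log(p) + (n - m) * Math.log(1 - p);
        if (ll > bestLL) { bestLL = ll; bestP = p; }
      }
      System.out.printf("Grid-search maximizer: p = %.2f%n", bestP);
    }
  }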
EM: Basic Concepts

Basic setting in EM
• X is a set of data points: the observed data.
• θ is a parameter vector.
• EM is a method to find θ_ML, where
  θ_ML = argmax_θ L(θ) = argmax_θ log P(X | θ)
• Calculating P(X | θ) directly is hard.
• Calculating P(X, Y | θ) is much simpler, where Y is "hidden" (or "missing") data.

The basic EM strategy
• Z = (X, Y)
  – Z: complete data ("augmented data")
  – X: observed data ("incomplete" data)
  – Y: hidden data ("missing" data)

The log-likelihood function
• L is a function of θ, with X held constant:
  L(θ | X) = L(θ) = P(X | θ)
  l(θ) = log L(θ) = log P(X | θ)
       = Σ_{i=1}^n log P(x_i | θ)
       = Σ_{i=1}^n log Σ_y P(x_i, y | θ)

The iterative approach for MLE
  θ_ML = argmax_θ L(θ) = argmax_θ l(θ) = argmax_θ Σ_{i=1}^n log Σ_y P(x_i, y | θ)
• In many cases, we cannot find the solution directly.
• An alternative is to find a sequence θ^0, θ^1, ..., θ^t, ... such that
  l(θ^0) ≤ l(θ^1) ≤ ... ≤ l(θ^t) ≤ ...

Deriving a lower bound on l(θ) − l(θ^t)
  l(θ) − l(θ^t) = log P(X | θ) − log P(X | θ^t)
    = Σ_{i=1}^n log Σ_y P(x_i, y | θ) − Σ_{i=1}^n log Σ_y P(x_i, y | θ^t)
    = Σ_{i=1}^n log [ Σ_y P(x_i, y | θ) / Σ_{y'} P(x_i, y' | θ^t) ]
    = Σ_{i=1}^n log Σ_y [ P(x_i, y | θ^t) / Σ_{y'} P(x_i, y' | θ^t) ] · [ P(x_i, y | θ) / P(x_i, y | θ^t) ]
    = Σ_{i=1}^n log Σ_y P(y | x_i, θ^t) · P(x_i, y | θ) / P(x_i, y | θ^t)
    = Σ_{i=1}^n log E_{P(y | x_i, θ^t)} [ P(x_i, y | θ) / P(x_i, y | θ^t) ]
    ≥ Σ_{i=1}^n E_{P(y | x_i, θ^t)} [ log ( P(x_i, y | θ) / P(x_i, y | θ^t) ) ]    (by Jensen's inequality, below)

Jensen's inequality
• If f is convex:  E[f(g(x))] ≥ f(E[g(x)])
• If f is concave: E[f(g(x))] ≤ f(E[g(x)])
• log is a concave function, so E[log p(x)] ≤ log E[p(x)].

Maximizing the lower bound
  θ^(t+1) = argmax_θ Σ_{i=1}^n E_{P(y | x_i, θ^t)} [ log ( P(x_i, y | θ) / P(x_i, y | θ^t) ) ]
          = argmax_θ Σ_{i=1}^n Σ_y P(y | x_i, θ^t) log ( P(x_i, y | θ) / P(x_i, y | θ^t) )
          = argmax_θ Σ_{i=1}^n Σ_y P(y | x_i, θ^t) log P(x_i, y | θ)
          = argmax_θ Σ_{i=1}^n E_{P(y | x_i, θ^t)} [ log P(x_i, y | θ) ]
• The last expression is the Q function.

The Q function
• Define the Q function (a function of θ):
  Q(θ, θ^t) = E[ log P(X, Y | θ) | X, θ^t ] = E_{P(Y | X, θ^t)} [ log P(X, Y | θ) ]
            = Σ_Y P(Y | X, θ^t) log P(X, Y | θ)
            = Σ_{i=1}^n Σ_y P(y | x_i, θ^t) log P(x_i, y | θ)
  – Y is a random vector.
  – X = (x_1, x_2, ..., x_n) is a constant (vector).
  – θ^t is the current parameter estimate and is a constant (vector).
  – θ is the free variable (vector) that we wish to adjust.
• The Q function is the expected value of the complete-data log-likelihood log P(X, Y | θ) with respect to Y, given X and θ^t.

The inner loop of the EM algorithm
• E-step: calculate
  Q(θ, θ^t) = Σ_{i=1}^n Σ_y P(y | x_i, θ^t) log P(x_i, y | θ)
• M-step: find
  θ^(t+1) = argmax_θ Q(θ, θ^t)

L(θ) is non-decreasing at each iteration
• The EM algorithm produces a sequence θ^0, θ^1, ..., θ^t, ...
• It can be proved that
  l(θ^0) ≤ l(θ^1) ≤ ... ≤ l(θ^t) ≤ ...

The inner loop of the Generalized EM algorithm (GEM)
• E-step: calculate
  Q(θ, θ^t) = Σ_{i=1}^n Σ_y P(y | x_i, θ^t) log P(x_i, y | θ)
• M-step: find any θ^(t+1) such that
  Q(θ^(t+1), θ^t) ≥ Q(θ^t, θ^t)
  (GEM only requires improving Q, not maximizing it.)
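To make the E-step/M-step inner loop above concrete, here is a small self-contained sketch of EM for a two-coin mixture: each trial of tosses is generated by one of two biased coins, and the hidden variable y is which coin produced the trial. The trial counts, initial guesses, class name, and the uniform prior over the two coins are illustrative assumptions, not taken from the slides.

  /** Minimal EM sketch for a two-coin mixture (hypothetical data). */
  public class TwoCoinEM {
    public static void main(String[] args) {
      int[] heads = {5, 9, 8, 4, 7};       // heads per trial (made-up data)
      int tosses = 10;                     // tosses per trial
      double thetaA = 0.6, thetaB = 0.5;   // initial guesses theta^0

      for (int iter = 0; iter < 20; iter++) {
        double sumHeadsA = 0, sumTossesA = 0, sumHeadsB = 0, sumTossesB = 0;

        // E-step: posterior P(coin = A | trial, theta^t), assuming a
        // uniform prior over which coin generated each trial.
        for (int h : heads) {
          double likeA = Math.pow(thetaA, h) * Math.pow(1 - thetaA, tosses - h);
          double likeB = Math.pow(thetaB, h) * Math.pow(1 - thetaB, tosses - h);
          double wA = likeA / (likeA + likeB);
          double wB = 1 - wA;

          // Accumulate expected counts of heads and tosses for each coin.
          sumHeadsA += wA * h;  sumTossesA += wA * tosses;
          sumHeadsB += wB * h;  sumTossesB += wB * tosses;
        }

        // M-step: maximize Q(theta, theta^t) by re-estimating each bias
        // as (expected heads) / (expected tosses).
        thetaA = sumHeadsA / sumTossesA;
        thetaB = sumHeadsB / sumTossesB;
      }
      System.out.printf("thetaA = %.3f, thetaB = %.3f%n", thetaA, thetaB);
    }
  }

Each pass performs one E-step and one M-step; by the argument above, the observed-data log-likelihood never decreases from one iteration to the next.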
Recap of the EM algorithm
• Idea #1: find θ that maximizes the likelihood of the training data:
  θ_ML = argmax_θ L(θ) = argmax_θ log P(X | θ)
• Idea #2: there is no analytical solution, so use an iterative approach: find a sequence θ^0, θ^1, ..., θ^t, ... such that
  l(θ^0) ≤ l(θ^1) ≤ ... ≤ l(θ^t) ≤ ...
• Idea #3: find θ^(t+1) that maximizes a tight lower bound of l(θ) − l(θ^t):
  l(θ) − l(θ^t) ≥ Σ_{i=1}^n E_{P(y | x_i, θ^t)} [ log ( P(x_i, y | θ) / P(x_i, y | θ^t) ) ]    (a tight lower bound)
• Idea #4: equivalently, find θ^(t+1) that maximizes the Q function:
  θ^(t+1) = argmax_θ Σ_{i=1}^n E_{P(y | x_i, θ^t)} [ log ( P(x_i, y | θ) / P(x_i, y | θ^t) ) ]
          = argmax_θ Σ_{i=1}^n E_{P(y | x_i, θ^t)} [ log P(x_i, y | θ) ]    (the Q function)

The EM algorithm
• Start with an initial estimate θ^0.
• Repeat until convergence:
  – E-step: calculate
    Q(θ, θ^t) = Σ_{i=1}^n Σ_y P(y | x_i, θ^t) log P(x_i, y | θ)
  – M-step: find
    θ^(t+1) = argmax_θ Q(θ, θ^t)

An EM Example
• (Worked example of the E-step and M-step, shown as figures in the original slides.)

Apache Mahout
Industrial Strength Machine Learning
May 2008

Current Situation
• Large volumes of data are now available
• Platforms now exist to run computations over large datasets (Hadoop, HBase)
• Sophisticated analytics are needed to turn data into information people can use
• Active research community and proprietary implementations of "machine learning" algorithms
• The world needs scalable implementations of ML under an open license - ASF

History of Mahout
• Summer 2007
  – Developers needed scalable ML
  – Mailing list formed
• Community formed
  – Apache contributors
  – Academia & industry
  – Lots of initial interest
• Project formed under Apache Lucene
  – January 25, 2008

Current Code Base
• Matrix & vector library
  – Memory-resident sparse & dense implementations
• Clustering
  – Canopy
  – K-Means
  – Mean Shift
• Collaborative filtering
  – Taste
• Utilities
  – Distance measures
  – Parameters

Under Development
• Naïve Bayes
• Perceptron
• PLSI/EM
• Genetic Programming
• Dirichlet Process Clustering
• Clustering examples
• Hama (Incubator) for very large arrays

Appendix
• Sean Owen, Robin Anil, Ted Dunning, and Ellen Friedman, Mahout in Action, Manning Publications, October 2011.
• The following walkthrough is from "Mahout Hands On" by Ted Dunning and Robin Anil, OSCON 2011, Portland.

Step 1 – Convert dataset into a Hadoop Sequence File
• Dataset: http://www.daviddlewis.com/resources/testcollections/reuters21578/reuters21578.tar.gz
• Download (8.2 MB) and extract the SGML files:
  $ mkdir -p mahout-work/reuters-sgm
  $ cd mahout-work/reuters-sgm && tar xzf ../reuters21578.tar.gz && cd .. && cd ..
• Extract content from SGML to text files:
  $ bin/mahout org.apache.lucene.benchmark.utils.ExtractReuters mahout-work/reuters-sgm mahout-work/reuters-out

Step 1 – Convert dataset into a Hadoop Sequence File
• Use the seqdirectory tool to convert the text files into a Hadoop sequence file:
  $ bin/mahout seqdirectory \
      -i mahout-work/reuters-out \
      -o mahout-work/reuters-out-seqdir \
      -c UTF-8 -chunk 5

Hadoop Sequence File
• A sequence of records, where each record is a <Key, Value> pair:
  <Key1, Value1>
  <Key2, Value2>
  ...
  <KeyN, ValueN>
• Key and Value need to be of class org.apache.hadoop.io.Text
  – Key = record name, file name, or unique identifier
  – Value = content as a UTF-8 encoded string
• TIP: Dump data from your database directly into Hadoop sequence files (see next slide)

Writing to Sequence Files
  Configuration conf = new Configuration();
  FileSystem fs = FileSystem.get(conf);
  Path path = new Path("testdata/part00000");

  // Writer for <Text, Text> records.
  SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, path, Text.class, Text.class);
  for (int i = 0; i < MAX_DOCS; i++) {
    // Key = document id, Value = document content; documents(i) stands in
    // for however your application fetches the i-th document.
    writer.append(new Text(documents(i).Id()), new Text(documents(i).Content()));
  }
  writer.close();
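As a complement to the writer above (not part of the original slides), here is a minimal sketch for spot-checking a sequence file by reading the <Text, Text> records back with Hadoop's SequenceFile.Reader; the path and class name are placeholders matching the writer sketch.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;

  public class DumpSequenceFile {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      Path path = new Path("testdata/part00000");  // placeholder path

      SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
      Text key = new Text();
      Text value = new Text();
      while (reader.next(key, value)) {
        // Key = document id, Value = UTF-8 content, as written above.
        String content = value.toString();
        System.out.println(key + " => " + content.substring(0, Math.min(60, content.length())));
      }
      reader.close();
    }
  }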
Generate Vectors from Sequence Files
• Steps
  1. Compute the dictionary
  2. Assign an integer to each word
  3. Compute feature weights
  4. Create a vector for each document using the word-integer mapping and the feature weights
• Or simply run:
  $ bin/mahout seq2sparse

Generate Vectors from Sequence Files
• $ bin/mahout seq2sparse \
      -i mahout-work/reuters-out-seqdir/ \
      -o mahout-work/reuters-out-seqdir-sparse-kmeans
• Important options
  – N-grams
  – Lucene analyzer for tokenizing
  – Feature pruning
    • Min support
    • Max document frequency
    • Min LLR (for n-grams)
  – Weighting method
    • TF vs. TF-IDF
    • lp-norm
    • Log-normalize length

Start K-Means clustering
• $ bin/mahout kmeans \
      -i mahout-work/reuters-out-seqdir-sparse-kmeans/tfidf-vectors/ \
      -c mahout-work/reuters-kmeans-clusters \
      -o mahout-work/reuters-kmeans \
      -dm org.apache.mahout.distance.CosineDistanceMeasure \
      -cd 0.1 -x 10 -k 20 -ow
• Things to watch out for
  – Number of iterations
  – Convergence delta
  – Distance measure
  – Creating assignments

Inspect clusters
• $ bin/mahout clusterdump \
      -s mahout-work/reuters-kmeans/clusters-9 \
      -d mahout-work/reuters-out-seqdir-sparse-kmeans/dictionary.file-0 \
      -dt sequencefile -b 100 -n 20
• Typical output:
  :VL-21438{n=518 c=[0.56:0.019, 00:0.154, 00.03:0.018, 00.18:0.018, ...
  Top Terms:
    iran     => 3.1861672217321213
    strike   => 2.567886952727918
    iranian  => 2.133417966282966
    union    => 2.116033937940266
    said     => 2.101773806290277
    workers  => 2.066259451354332
    gulf     => 1.9501374918521601
    had      => 1.6077752463145605
    he       => 1.5355078004962228
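A closing note (not from the slides): the kmeans run above measures distance with CosineDistanceMeasure, i.e. one minus the cosine similarity between two tf-idf vectors x and y,

  d(x, y) = 1 − (x · y) / (‖x‖ ‖y‖)

so document length is factored out and only term-weight direction matters; the convergence delta (-cd 0.1) and the iteration cap (-x 10) then control when the iterations stop.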