Machine Learning and Statistical Analysis

Jong Youl Choi
Computer Science Department
(jychoi@cs.indiana.edu)

Social Bookmarking
Socialized
Bookmarks
Tags
1

Motivations
 Social indexing or collaborative annotation
 Collect knowledge from people
 Extract information

Challenges
 Vast amount of data
 Very dynamic
 Unsupervised data
2

Principles of Machine Learning
 Bayes’ theorem and maximum likelihood

Machine Learning Algorithms
 Clustering analysis
 Dimension reduction
 Classification

Parallel Computing
 General parallel computing architecture
 Parallel algorithms
3

Definition
Algorithms or techniques that enable a computer (machine) to
“learn” from data. Related to many areas such as data mining,
statistics, and information theory.

Algorithm Types
 Unsupervised learning
 Supervised learning
 Reinforcement learning

Topics
 Models
▪ Artificial Neural Network (ANN)
▪ Support Vector Machine (SVM)
 Optimization
▪ Expectation-Maximization (EM)
▪ Deterministic Annealing (DA)
4

Posterior probability of θi, given X
 P(θi | X) = P(X | θi) P(θi) / P(X)
 θi ∈ Θ : parameter
 X : observations
 P(θi) : prior (or marginal) probability
 P(X | θi) : likelihood

Maximum Likelihood (ML)
 Used to find the most plausible θi ∈ Θ, given X
 Computing the maximum likelihood (ML) or log-likelihood (sketch below)
 Optimization problem
5
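A minimal Python sketch of the posterior and maximum-likelihood computation above, assuming a discrete set of candidate parameters and Bernoulli (coin-flip) observations; the parameter values, prior, and data are illustrative only.

```python
import numpy as np

# Candidate parameters theta_i in Theta (hypothetical coin biases).
thetas = np.array([0.3, 0.5, 0.7])
prior = np.array([1/3, 1/3, 1/3])          # P(theta_i)
X = np.array([1, 1, 0, 1, 1])              # observations (coin flips)

# Likelihood P(X | theta_i) for i.i.d. Bernoulli observations.
likelihood = np.array([t**X.sum() * (1 - t)**(len(X) - X.sum()) for t in thetas])

# Posterior P(theta_i | X) = P(X | theta_i) P(theta_i) / P(X).
posterior = likelihood * prior / np.sum(likelihood * prior)

# Maximum likelihood: pick the theta_i with the largest (log-)likelihood.
theta_ml = thetas[np.argmax(np.log(likelihood))]
print(posterior, theta_ml)
```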

Problem
Estimate hidden parameters θ = {μ, σ}
from the given data drawn from
k Gaussian distributions

Gaussian distribution
 N(x | μ, σ²) = (2πσ²)^(–1/2) exp(–(x – μ)² / (2σ²))

Maximum Likelihood
(Mitchell, 1997)
 With a Gaussian (P = N), solve either by brute force or by a numerical method (sketch below)
6
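A small sketch of ML estimation for a single Gaussian, contrasting a brute-force grid search over the log-likelihood with the closed-form estimates; the data, grid ranges, and seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=500)   # synthetic data from one Gaussian

def log_likelihood(mu, sigma, x):
    # Sum over observations of log N(x | mu, sigma^2).
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2))

# Brute-force search over a parameter grid ...
mus, sigmas = np.linspace(0, 4, 81), np.linspace(0.5, 3, 51)
grid = [(log_likelihood(m, s, x), m, s) for m in mus for s in sigmas]
_, mu_bf, sigma_bf = max(grid)

# ... versus the closed-form ML estimates (sample mean and standard deviation).
mu_ml, sigma_ml = x.mean(), x.std()
print(mu_bf, sigma_bf, mu_ml, sigma_ml)
```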

Problems in ML estimation
 Observation X is often not complete
 A latent (hidden) variable Z exists
 Hard to explore the whole parameter space

Expectation-Maximization algorithm
 Objective : find the ML estimate of θ over the latent distribution P(Z | X, θ)
 Steps (sketched below)
0. Init – choose a random θold
1. E-step – compute the expectation over P(Z | X, θold)
2. M-step – find the θnew that maximizes the expected likelihood
3. Go to step 1 after updating θold ← θnew
7
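A minimal sketch of EM for a one-dimensional mixture of k = 2 Gaussians, following the E-step/M-step loop above; the synthetic data, initialization, and fixed iteration count are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data drawn from k = 2 Gaussians (means/sigmas chosen arbitrarily).
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 200)])

k = 2
mu = rng.choice(x, k)                 # 0. Init: random theta_old
sigma = np.ones(k)
pi = np.ones(k) / k

for _ in range(100):
    # 1. E-step: responsibilities P(Z | X, theta_old).
    dens = (pi / (np.sqrt(2 * np.pi) * sigma)) * \
           np.exp(-(x[:, None] - mu)**2 / (2 * sigma**2))
    resp = dens / dens.sum(axis=1, keepdims=True)

    # 2. M-step: theta_new maximizing the expected log-likelihood.
    nk = resp.sum(axis=0)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((resp * (x[:, None] - mu)**2).sum(axis=0) / nk)
    pi = nk / len(x)
    # 3. theta_old <- theta_new; repeat (fixed number of iterations here).

print(mu, sigma, pi)
```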

Definition
Grouping unlabeled data into clusters in order to infer hidden
structures or information

Dissimilarity measurement
 Distance : Euclidean(L2), Manhattan(L1), …
 Angle : Inner product, …
 Non-metric : Rank, Intensity, …

Types of Clustering
 Hierarchical
▪ Agglomerative or divisive
 Partitioning
▪ K-means, VQ, MDS, …
(MATLAB help page)
8
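A short sketch of the dissimilarity measures listed above (Euclidean, Manhattan, and an angle-based measure); the two vectors are arbitrary examples.

```python
import numpy as np

a, b = np.array([1.0, 2.0, 3.0]), np.array([2.0, 0.0, 4.0])

l2 = np.linalg.norm(a - b)                              # Euclidean (L2) distance
l1 = np.abs(a - b).sum()                                # Manhattan (L1) distance
cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))   # angle-based (inner product)
print(l2, l1, cos)
```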

Find K partitions with the total
intra-cluster variance minimized

Iterative method (sketch below)
 Initialization : randomized centers yi
 Assignment of each x to its nearest yi (yi fixed)
 Update of each yi to its cluster mean (x fixed)

Problem?
 Can get trapped in local minima
(MacKay, 2003)
9
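A minimal K-means sketch following the assignment/update iteration above; the function name, fixed iteration count, and synthetic two-cluster data are illustrative assumptions (no empty-cluster handling).

```python
import numpy as np

def kmeans(x, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    y = x[rng.choice(len(x), k, replace=False)]      # initialization: random centers y_i
    for _ in range(iters):
        # Assignment step: each x goes to its nearest center (y_i fixed).
        labels = np.argmin(((x[:, None, :] - y)**2).sum(axis=2), axis=1)
        # Update step: recompute each y_i as the mean of its cluster (x fixed).
        y = np.array([x[labels == j].mean(axis=0) for j in range(k)])
    return y, labels

rng = np.random.default_rng(2)
data = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
centers, labels = kmeans(data, k=2)
print(centers)
```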

Deterministically avoid local minima
 No stochastic process (random walk)
Tracing the global solution by changing the
level of randomness

Statistical Mechanics
(Maxima and Minima, Wikipedia)
 Gibbs distribution P(Ex) = exp(–Ex / T) / Z
 Helmholtz free energy F = D – TS
▪ Average energy D = <Ex>
▪ Entropy S = – Σx P(Ex) ln P(Ex)
▪ F = – T ln Z

In DA, we minimize F
10

Analogy to physical annealing process
 Control energy (randomness) by temperature (high → low)
 Starting with a high temperature (T → ∞)
▪ Soft (or fuzzy) association probability
▪ Smooth cost function with one global minimum
 Lowering the temperature (T → 0)
▪ Hard association
▪ Revealing full complexity; clusters emerge

Minimization of F, using E(x, yj) = ||x – yj||²
Iteratively (see the sketch below),
 P(yj | x) = exp(–||x – yj||² / T) / Σk exp(–||x – yk||² / T)
 yj = Σx P(yj | x) x / Σx P(yj | x)
11
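A sketch of deterministic-annealing clustering with squared-distance energy: soft Gibbs associations at each temperature, center updates that lower F, then cooling toward hard assignments. The initial temperature, cooling rate, inner-loop count, and the small random perturbation of the centers are assumptions.

```python
import numpy as np

def da_clustering(x, k, t_init=10.0, t_min=0.01, cooling=0.9, seed=0):
    rng = np.random.default_rng(seed)
    # Start near the data mean; at high T the centers are effectively merged.
    y = x.mean(axis=0) + 0.01 * rng.standard_normal((k, x.shape[1]))
    T = t_init
    while T > t_min:
        for _ in range(20):
            # Soft (Gibbs) association P(y_j | x) ~ exp(-||x - y_j||^2 / T).
            d2 = ((x[:, None, :] - y)**2).sum(axis=2)
            p = np.exp(-(d2 - d2.min(axis=1, keepdims=True)) / T)
            p /= p.sum(axis=1, keepdims=True)
            # Centers minimizing the free energy F at this temperature.
            y = (p[:, :, None] * x[:, None, :]).sum(axis=0) / p.sum(axis=0)[:, None]
        T *= cooling            # lower the temperature: soft -> hard association
    return y

rng = np.random.default_rng(3)
data = np.vstack([rng.normal(-3, 1, (150, 2)), rng.normal(3, 1, (150, 2))])
print(da_clustering(data, k=2))
```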

Definition
Process of transforming high-dimensional data into low-dimensional
data to improve accuracy, aid understanding, or remove noise.

Curse of dimensionality
 Complexity (volume) grows exponentially
as extra dimensions are added

Types
(Koppen, 2000)
 Feature selection : Choose representatives (e.g., filter,…)
 Feature extraction : Map to lower dim. (e.g., PCA, MDS, … )
12


Finding a map of principal components
(PCs) of the data into an orthogonal space,
such that
y = W x, where W ∈ R^(d×h) (h ≪ d)
PCs – Variables with the largest variances
 Orthogonality
 Linearity – Optimal least mean-square error

Limitations?
 Strict linearity
 Assumes a specific data distribution
 Large-variance assumption
13
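A minimal PCA sketch via SVD of the centered data: the rows of W are orthogonal directions of largest variance, giving the least mean-square-error linear projection; data treated as rows, so W is applied per sample. Data and dimensions are illustrative.

```python
import numpy as np

def pca(x, h):
    # Center the data, then take the top-h principal components (largest variance).
    xc = x - x.mean(axis=0)
    # Rows of vt are orthogonal directions ordered by decreasing variance.
    _, _, vt = np.linalg.svd(xc, full_matrices=False)
    W = vt[:h]                 # projection matrix, h << d
    return xc @ W.T            # y = W x applied to every (centered) sample

rng = np.random.default_rng(4)
x = rng.standard_normal((200, 10)) @ rng.standard_normal((10, 10))  # correlated data
y = pca(x, h=2)
print(y.shape)                 # (200, 2)
```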

Like PCA, reduction of dimension by y = R x, where R is a
random matrix with i.i.d. columns and R ∈ R^(d×p) (p ≪ d)

Johnson-Lindenstrauss lemma
 When projecting onto a randomly selected subspace, distances
are approximately preserved
Generating R
 Hard to obtain orthogonalized R
 Gaussian R
 Simple approach (sketch below) :
choose rij ∈ {+√3, 0, –√3} with probabilities 1/6, 4/6, 1/6,
respectively

14
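A sketch of random projection with the simple sparse matrix above (entries ±√3 or 0, no orthogonalization); data stored as rows, so the projection is applied as x R. The 1/√p scaling, which keeps pairwise distances approximately preserved in the Johnson-Lindenstrauss sense, and the sizes are assumptions.

```python
import numpy as np

def random_projection(x, p, seed=0):
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    # Sparse random matrix: entries +sqrt(3), 0, -sqrt(3)
    # with probabilities 1/6, 4/6, 1/6 (unit-variance entries).
    R = rng.choice([np.sqrt(3), 0.0, -np.sqrt(3)],
                   size=(d, p), p=[1/6, 4/6, 1/6])
    # Scale by 1/sqrt(p) so distances are approximately preserved.
    return x @ R / np.sqrt(p)

rng = np.random.default_rng(5)
x = rng.standard_normal((100, 1000))
y = random_projection(x, p=50)
print(y.shape)                 # (100, 50)
```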

Dimension reduction that preserves the distance proximities
observed in the original data set

Loss functions
 Inner product
 Distance
 Squared distance

Classical MDS: minimizing STRAIN, given the dissimilarity matrix Δ
 From Δ, find the inner product matrix B (double centering)
 From B, recover the coordinates X′ (i.e., B = X′X′ᵀ)
15
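A compact sketch of the two classical MDS steps above: double centering of the squared dissimilarities to get B, then recovering coordinates from the top eigenpairs of B = X′X′ᵀ. The dissimilarity matrix is computed from random points for illustration.

```python
import numpy as np

def classical_mds(delta, dim=2):
    # delta: matrix of pairwise dissimilarities (distances).
    n = delta.shape[0]
    # Double centering: B = -1/2 * J (delta^2) J, with J = I - (1/n) 11^T.
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (delta**2) @ J
    # Recover coordinates X' from B = X' X'^T via its top eigenpairs.
    w, v = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:dim]
    return v[:, idx] * np.sqrt(np.maximum(w[idx], 0))

rng = np.random.default_rng(6)
x = rng.standard_normal((30, 5))
delta = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)
print(classical_mds(delta).shape)      # (30, 2)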

SMACOF : minimizing STRESS
 Majorization – for a complex f(x),
find an auxiliary simple g(x, y) s.t. g(x, y) ≥ f(x) and g(y, y) = f(y)
 Majorization for STRESS
(Cox, 2001)
 Minimize tr(Xᵀ B(Y) Y), known as the Guttman transform (sketch below)
16
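A sketch of SMACOF for the unweighted STRESS, where each iteration minimizes the majorizer by the Guttman transform X = B(Y)Y / n; the random initialization and fixed iteration count are assumptions.

```python
import numpy as np

def smacof(delta, dim=2, iters=200, seed=0):
    n = delta.shape[0]
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, dim))           # random initial configuration
    for _ in range(iters):
        # Pairwise distances of the current configuration.
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
        # B(X) from the majorizing function of STRESS (unweighted case).
        with np.errstate(divide="ignore", invalid="ignore"):
            B = np.where(D > 0, -delta / D, 0.0)
        np.fill_diagonal(B, 0.0)
        np.fill_diagonal(B, -B.sum(axis=1))
        # Guttman transform: the minimizer of the majorizer.
        X = B @ X / n
    return X

rng = np.random.default_rng(7)
pts = rng.standard_normal((25, 4))
delta = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
print(smacof(delta).shape)                      # (25, 2)
```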

Competitive and unsupervised learning process for
clustering and visualization

Learning
 Choose the most similar model vector mj for the input xi
 Update the winner and its
neighbors by
mk = mk + α(t) h(t) (xi – mk)
α(t) : learning rate
h(t) : neighborhood (shrinking over time)
Result : similar data end up closer together in the model space
17
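A sketch of the SOM update on a 1-D grid of model vectors: find the winner, then pull the winner and its neighbors toward the input, with a decreasing learning rate and shrinking neighborhood. The grid size, schedules for α(t) and the neighborhood, and the data are assumptions.

```python
import numpy as np

def som_1d(x, n_models=10, epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    m = rng.standard_normal((n_models, x.shape[1]))      # model vectors m_j
    for t in range(epochs):
        alpha = 0.5 * (1 - t / epochs)                   # learning rate alpha(t), decreasing
        radius = max(1, int(n_models / 2 * (1 - t / epochs)))   # neighborhood, shrinking
        for xi in x[rng.permutation(len(x))]:
            # Choose the best-matching model vector for x_i.
            winner = np.argmin(((m - xi)**2).sum(axis=1))
            # Update the winner and its neighbors on the 1-D model grid:
            # m_k <- m_k + alpha(t) * (x_i - m_k).
            lo, hi = max(0, winner - radius), min(n_models, winner + radius + 1)
            m[lo:hi] += alpha * (xi - m[lo:hi])
    return m

rng = np.random.default_rng(8)
data = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
print(som_1d(data).shape)                                # (10, 2)
```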

Definition
A procedure that divides data into a given set of categories,
based on a training set, in a supervised way

Generalization vs. specialization
 Hard to achieve both (underfitting vs. overfitting)
 Avoid overfitting (overtraining) – see the sketch below
▪ Early stopping
▪ Holdout validation
▪ K-fold cross-validation
▪ Leave-one-out cross-validation
(Figure: training error vs. validation error curves)
(Overfitting, Wikipedia)
18
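A sketch of K-fold cross-validation as one way to detect overfitting: each fold is held out once as the validation set and the validation error is averaged. The `fit`/`predict` callables here are a trivial majority-class placeholder, purely for illustration.

```python
import numpy as np

def k_fold_error(x, y, k, fit, predict, seed=0):
    # K-fold cross-validation: hold each fold out once as the validation set
    # and average the validation error.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(x[train], y[train])
        errors.append(np.mean(predict(model, x[val]) != y[val]))
    return np.mean(errors)

# Placeholder learner: predict the majority class seen in training.
fit = lambda x, y: int(round(y.mean()))
predict = lambda model, x: np.full(len(x), model)
rng = np.random.default_rng(9)
x, y = rng.standard_normal((100, 3)), rng.integers(0, 2, 100)
print(k_fold_error(x, y, k=5, fit=fit, predict=predict))
```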

Perceptron : a computational unit with a binary threshold
(Figure: inputs feed a weighted sum followed by a threshold activation function)
Abilities
 Linearly separable decision surface
 Represents Boolean functions (AND, OR, NOT)
 Network (Multilayer) of perceptrons
 Various network architectures and capabilities
(Jain, 1996)
19
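A minimal perceptron sketch: a weighted sum passed through a binary threshold, with mistake-driven weight updates, trained here on the linearly separable AND function. The learning rate, epoch count, and tiny random initialization are assumptions.

```python
import numpy as np

def perceptron_train(x, y, epochs=50, lr=0.1, seed=0):
    # y in {+1, -1}; binary threshold unit: o = sign(w . x + b).
    rng = np.random.default_rng(seed)
    w, b = rng.standard_normal(x.shape[1]) * 0.01, 0.0
    for _ in range(epochs):
        for xi, yi in zip(x, y):
            o = 1.0 if xi @ w + b > 0 else -1.0   # weighted sum + threshold activation
            if o != yi:                            # update only on mistakes
                w += lr * yi * xi
                b += lr * yi
    return w, b

# A perceptron can represent AND: output +1 only when both inputs are 1.
x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, +1])
w, b = perceptron_train(x, y)
print([(1 if xi @ w + b > 0 else -1) for xi in x])   # expected: [-1, -1, -1, 1]
```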

Learning weights – random initialization and updating

Error-correction training rules
 Difference between training data and output: E(t,o)
 Gradient descent (Batch learning)
▪ With E = Σi Ei, update the weights once per pass using the full gradient
 Stochastic approach (On-line learning)
▪ Update using the gradient of each training example's error

Various error functions
 Adding a weight-regularization term (λ Σ wi²) to avoid overfitting
 Adding a momentum term (α Δwi(n–1)) to expedite convergence (sketch below)
20
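A sketch of on-line (stochastic) gradient descent for a linear unit with an L2 regularization term and a momentum term, as described above; the learning rate, regularization strength, momentum coefficient, and synthetic data are assumptions.

```python
import numpy as np

def sgd_linear(x, y, epochs=100, lr=0.01, lam=1e-3, beta=0.9, seed=0):
    # Stochastic (on-line) learning: update after each example,
    # with L2 regularization (lam * sum w_i^2) and momentum (beta * previous step).
    rng = np.random.default_rng(seed)
    w = np.zeros(x.shape[1])
    velocity = np.zeros_like(w)
    for _ in range(epochs):
        for i in rng.permutation(len(x)):
            err = x[i] @ w - y[i]                    # per-example error (o - t)
            grad = err * x[i] + lam * w              # gradient of E_i plus regularizer
            velocity = beta * velocity - lr * grad   # momentum: reuse the previous step
            w += velocity
    return w

rng = np.random.default_rng(10)
x = rng.standard_normal((200, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = x @ true_w + 0.1 * rng.standard_normal(200)
print(sgd_linear(x, y))                              # roughly recovers true_w
```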

Q: How to draw the optimal linear
separating hyperplane?
 A: Maximizing margin

Margin maximization
 The distance between H+1 and H–1 is 2 / ||w||
 Thus, ||w|| should be minimized
(Figure: margin between the separating hyperplanes H+1 and H–1)
21

Constrained optimization problem
 Given a training set {xi, yi} (yi ∈ {+1, –1}):
 Minimize : ½||w||², subject to yi(w · xi + b) ≥ 1

Lagrangian equation with saddle points
 Minimized w.r.t. the primal variables w and b
 Maximized w.r.t. the dual variables αi (all αi ≥ 0)
 xi with αi > 0 (not αi = 0) is called a support vector (SV)
22

Soft Margin (Non-separable case)
 Slack variables ξi (with penalty parameter C)
 Optimization with the additional constraint 0 ≤ αi ≤ C

Non-linear SVM
 Map the non-linearly separable input space to a feature space
 Kernel function k(x, y) = <Φ(x), Φ(y)>
 Kernel classifier with support vectors si (see the sketch below)
(Figure: mapping from input space to feature space)
23
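A small demonstration of a soft-margin, non-linear SVM on data that is not linearly separable in the input space, assuming scikit-learn is available; the RBF kernel, C, gamma, and the ring-shaped data are illustrative choices.

```python
import numpy as np
from sklearn.svm import SVC   # assumes scikit-learn is installed

# Non-linearly separable data: one class inside a ring of the other.
rng = np.random.default_rng(11)
r = np.concatenate([rng.uniform(0, 1, 200), rng.uniform(2, 3, 200)])
theta = rng.uniform(0, 2 * np.pi, 400)
X = np.c_[r * np.cos(theta), r * np.sin(theta)]
y = np.concatenate([np.full(200, -1), np.full(200, +1)])

# RBF kernel k(x, y) = exp(-gamma ||x - y||^2) acts as the inner product
# in a feature space where a separating hyperplane exists; C controls
# the soft-margin penalty on the slack variables.
clf = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X, y)

# The decision function depends only on the support vectors s_i (alpha_i > 0).
print(len(clf.support_vectors_), clf.predict([[0.0, 0.0], [2.5, 0.0]]))
```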

Memory Architecture
Shared Memory
 Symmetric Multiprocessor (SMP)
 OpenMP, POSIX threads (pthreads), MPI
 Easy to manage but expensive
Distributed Memory
 Commodity, off-the-shelf processors
 MPI
 Cost effective but hard to maintain
(Barney, 2007)

Decomposition Strategy
 Task – E.g., Word, IE, …
 Data – scientific problem
 Pipelining – Task + Data
(Barney, 2007)
24

Shrinking
 Recall : only support vectors (αi > 0) are
used in the SVM optimization
 Predict whether each data point is an SV or a non-SV
 Remove non-SVs from problem space

Parallel SVM
 Partition the problem
 Merge data hierarchically
 Each unit finds support vectors
 Loop until convergence (serial sketch below)
(Graf, 2005)
25
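A serial, single-pass sketch of the partition-and-merge idea behind the cascade SVM (Graf, 2005), assuming scikit-learn is available: train an SVM on each partition, keep only its support vectors, merge them, and train a final SVM on the much smaller merged set. The full cascade loops hierarchically until convergence; the partition count, linear kernel, and data are assumptions.

```python
import numpy as np
from sklearn.svm import SVC   # assumes scikit-learn is installed

def cascade_svm(X, y, n_parts=4, C=1.0):
    # Shuffle, then split the problem into partitions.
    idx = np.random.default_rng(0).permutation(len(X))
    parts = np.array_split(idx, n_parts)
    sv_idx = []
    for p in parts:
        # Each unit finds its own support vectors (alpha_i > 0); non-SVs are dropped.
        clf = SVC(kernel="linear", C=C).fit(X[p], y[p])
        sv_idx.append(p[clf.support_])
    # Merge the surviving support vectors and train the final SVM on them.
    keep = np.concatenate(sv_idx)
    return SVC(kernel="linear", C=C).fit(X[keep], y[keep])

rng = np.random.default_rng(12)
X = np.vstack([rng.normal(-2, 1, (200, 2)), rng.normal(2, 1, (200, 2))])
y = np.concatenate([np.full(200, -1), np.full(200, +1)])
model = cascade_svm(X, y)
print(len(model.support_vectors_))
```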
26