Machine Learning and Statistical Analysis

Jong Youl Choi
Computer Science Department
(jychoi@cs.indiana.edu)

Social Bookmarking
Socialized
Bookmarks
Tags
1

Motivations
 Social indexing or collaborative annotation
 Collect knowledge from people
 Extract information

Challenges
 Vast amount of data
 Very dynamic
 Unsupervised data
2

Principles of Machine Learning
 Bayes’ theorem and maximum likelihood

Machine Learning Algorithms
 Clustering analysis
 Dimension reduction
 Classification

Parallel Computing
 General parallel computing architecture
 Parallel algorithms
3

Definition
Algorithms or techniques that enable a computer (machine) to
“learn” from data. Related to many areas such as data mining,
statistics, and information theory.

Algorithm Types
 Unsupervised learning
 Supervised learning
 Reinforcement learning

Topics
 Models
▪ Artificial Neural Network (ANN)
▪ Support Vector Machine (SVM)
 Optimization
▪ Expectation-Maximization (EM)
▪ Deterministic Annealing (DA)
4

Posterior probability of θi, given X
 P(θi | X) = P(X | θi) P(θi) / P(X)
 θi ∈ Θ : parameter
 X : observations
 P(θi) : prior (or marginal) probability
 P(X | θi) : likelihood

Maximum Likelihood (ML)
 Used to find the most plausible θi ∈ Θ, given X
 Computing the maximum likelihood (ML) or log-likelihood (sketch below)
 Optimization problem
5
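A minimal Python sketch of the posterior and maximum-likelihood computation above, assuming a discrete set of candidate parameters and Bernoulli (coin-flip) observations; the parameter values, prior, and data are illustrative only.

```python
import numpy as np

# Candidate parameters theta_i in Theta (hypothetical coin biases).
thetas = np.array([0.3, 0.5, 0.7])
prior = np.array([1/3, 1/3, 1/3])          # P(theta_i)
X = np.array([1, 1, 0, 1, 1])              # observations (coin flips)

# Likelihood P(X | theta_i) for i.i.d. Bernoulli observations.
likelihood = np.array([t**X.sum() * (1 - t)**(len(X) - X.sum()) for t in thetas])

# Posterior P(theta_i | X) = P(X | theta_i) P(theta_i) / P(X).
posterior = likelihood * prior / np.sum(likelihood * prior)

# Maximum likelihood: pick the theta_i with the largest (log-)likelihood.
theta_ml = thetas[np.argmax(np.log(likelihood))]
print(posterior, theta_ml)
```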

Problem
Estimate hidden parameters θ = {μ, σ}
from the given data drawn from
k Gaussian distributions

Gaussian distribution
 N(x | μ, σ²) = (2πσ²)^(–1/2) exp(–(x – μ)² / (2σ²))

Maximum Likelihood
(Mitchell, 1997)
 With a Gaussian (P = N), solve either by brute force or by a numerical method (sketch below)
6
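A small sketch of ML estimation for a single Gaussian, contrasting a brute-force grid search over the log-likelihood with the closed-form estimates; the data, grid ranges, and seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=500)   # synthetic data from one Gaussian

def log_likelihood(mu, sigma, x):
    # Sum over observations of log N(x | mu, sigma^2).
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2))

# Brute-force search over a parameter grid ...
mus, sigmas = np.linspace(0, 4, 81), np.linspace(0.5, 3, 51)
grid = [(log_likelihood(m, s, x), m, s) for m in mus for s in sigmas]
_, mu_bf, sigma_bf = max(grid)

# ... versus the closed-form ML estimates (sample mean and standard deviation).
mu_ml, sigma_ml = x.mean(), x.std()
print(mu_bf, sigma_bf, mu_ml, sigma_ml)
```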

Problems in ML estimation
 Observation X is often not complete
 A latent (hidden) variable Z exists
 Hard to explore the whole parameter space

Expectation-Maximization algorithm
 Objective : find the ML estimate of θ over the latent distribution P(Z | X, θ)
 Steps (sketched below)
0. Init – choose a random θold
1. E-step – compute the expectation over P(Z | X, θold)
2. M-step – find the θnew that maximizes the expected likelihood
3. Go to step 1 after updating θold ← θnew
7
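A minimal sketch of EM for a one-dimensional mixture of k = 2 Gaussians, following the E-step/M-step loop above; the synthetic data, initialization, and fixed iteration count are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data drawn from k = 2 Gaussians (means/sigmas chosen arbitrarily).
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 200)])

k = 2
mu = rng.choice(x, k)                 # 0. Init: random theta_old
sigma = np.ones(k)
pi = np.ones(k) / k

for _ in range(100):
    # 1. E-step: responsibilities P(Z | X, theta_old).
    dens = (pi / (np.sqrt(2 * np.pi) * sigma)) * \
           np.exp(-(x[:, None] - mu)**2 / (2 * sigma**2))
    resp = dens / dens.sum(axis=1, keepdims=True)

    # 2. M-step: theta_new maximizing the expected log-likelihood.
    nk = resp.sum(axis=0)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((resp * (x[:, None] - mu)**2).sum(axis=0) / nk)
    pi = nk / len(x)
    # 3. theta_old <- theta_new; repeat (fixed number of iterations here).

print(mu, sigma, pi)
```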

Definition
Grouping unlabeled data into clusters in order to infer hidden
structures or information

Dissimilarity measurement
 Distance : Euclidean(L2), Manhattan(L1), …
 Angle : Inner product, …
 Non-metric : Rank, Intensity, …

Types of Clustering
 Hierarchical
▪ Agglomerative or divisive
 Partitioning
▪ K-means, VQ, MDS, …
(MATLAB help page)
8
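A short sketch of the dissimilarity measures listed above (Euclidean, Manhattan, and an angle-based measure); the two vectors are arbitrary examples.

```python
import numpy as np

a, b = np.array([1.0, 2.0, 3.0]), np.array([2.0, 0.0, 4.0])

l2 = np.linalg.norm(a - b)                              # Euclidean (L2) distance
l1 = np.abs(a - b).sum()                                # Manhattan (L1) distance
cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))   # angle-based (inner product)
print(l2, l1, cos)
```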

Find K partitions with the total
intra-cluster variance minimized

Iterative method (sketch below)
 Initialization : randomized centers yi
 Assignment of each x to its nearest yi (yi fixed)
 Update of each yi to its cluster mean (x fixed)

Problem?
 Can get trapped in local minima
(MacKay, 2003)
9
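A minimal K-means sketch following the assignment/update iteration above; the function name, fixed iteration count, and synthetic two-cluster data are illustrative assumptions (no empty-cluster handling).

```python
import numpy as np

def kmeans(x, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    y = x[rng.choice(len(x), k, replace=False)]      # initialization: random centers y_i
    for _ in range(iters):
        # Assignment step: each x goes to its nearest center (y_i fixed).
        labels = np.argmin(((x[:, None, :] - y)**2).sum(axis=2), axis=1)
        # Update step: recompute each y_i as the mean of its cluster (x fixed).
        y = np.array([x[labels == j].mean(axis=0) for j in range(k)])
    return y, labels

rng = np.random.default_rng(2)
data = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
centers, labels = kmeans(data, k=2)
print(centers)
```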

Deterministically avoid local minima
 No stochastic process (random walk)
Tracing the global solution by changing the
level of randomness

Statistical Mechanics
(Maxima and Minima, Wikipedia)
 Gibbs distribution P(Ex) = exp(–Ex / T) / Z
 Helmholtz free energy F = D – TS
▪ Average energy D = <Ex>
▪ Entropy S = – Σx P(Ex) ln P(Ex)
▪ F = – T ln Z

In DA, we minimize F
10

Analogy to physical annealing process
 Control energy (randomness) by temperature (high → low)
 Starting with a high temperature (T → ∞)
▪ Soft (or fuzzy) association probability
▪ Smooth cost function with one global minimum
 Lowering the temperature (T → 0)
▪ Hard association
▪ Revealing full complexity; clusters emerge

Minimization of F, using E(x, yj) = ||x – yj||²
Iteratively (see the sketch below),
 P(yj | x) = exp(–||x – yj||² / T) / Σk exp(–||x – yk||² / T)
 yj = Σx P(yj | x) x / Σx P(yj | x)
11
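A sketch of deterministic-annealing clustering with squared-distance energy: soft Gibbs associations at each temperature, center updates that lower F, then cooling toward hard assignments. The initial temperature, cooling rate, inner-loop count, and the small random perturbation of the centers are assumptions.

```python
import numpy as np

def da_clustering(x, k, t_init=10.0, t_min=0.01, cooling=0.9, seed=0):
    rng = np.random.default_rng(seed)
    # Start near the data mean; at high T the centers are effectively merged.
    y = x.mean(axis=0) + 0.01 * rng.standard_normal((k, x.shape[1]))
    T = t_init
    while T > t_min:
        for _ in range(20):
            # Soft (Gibbs) association P(y_j | x) ~ exp(-||x - y_j||^2 / T).
            d2 = ((x[:, None, :] - y)**2).sum(axis=2)
            p = np.exp(-(d2 - d2.min(axis=1, keepdims=True)) / T)
            p /= p.sum(axis=1, keepdims=True)
            # Centers minimizing the free energy F at this temperature.
            y = (p[:, :, None] * x[:, None, :]).sum(axis=0) / p.sum(axis=0)[:, None]
        T *= cooling            # lower the temperature: soft -> hard association
    return y

rng = np.random.default_rng(3)
data = np.vstack([rng.normal(-3, 1, (150, 2)), rng.normal(3, 1, (150, 2))])
print(da_clustering(data, k=2))
```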

Definition
Process of transforming high-dimensional data into low-dimensional
data to improve accuracy, aid understanding, or remove noise.

Curse of dimensionality
 Complexity (volume) grows exponentially
as extra dimensions are added

Types
(Koppen, 2000)
 Feature selection : Choose representatives (e.g., filter,…)
 Feature extraction : Map to lower dim. (e.g., PCA, MDS, … )
12


Finding a map of principal components
(PCs) of the data into an orthogonal space,
such that
y = W x, where W ∈ R^(d×h) (h ≪ d)
PCs – Variables with the largest variances
 Orthogonality
 Linearity – Optimal least mean-square error

Limitations?
 Strict linearity
 Assumes a specific data distribution
 Large-variance assumption
13
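A minimal PCA sketch via SVD of the centered data: the rows of W are orthogonal directions of largest variance, giving the least mean-square-error linear projection; data treated as rows, so W is applied per sample. Data and dimensions are illustrative.

```python
import numpy as np

def pca(x, h):
    # Center the data, then take the top-h principal components (largest variance).
    xc = x - x.mean(axis=0)
    # Rows of vt are orthogonal directions ordered by decreasing variance.
    _, _, vt = np.linalg.svd(xc, full_matrices=False)
    W = vt[:h]                 # projection matrix, h << d
    return xc @ W.T            # y = W x applied to every (centered) sample

rng = np.random.default_rng(4)
x = rng.standard_normal((200, 10)) @ rng.standard_normal((10, 10))  # correlated data
y = pca(x, h=2)
print(y.shape)                 # (200, 2)
```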

Like PCA, reduction of dimension by y = R x, where R is a
random matrix with i.i.d. columns and R ∈ R^(d×p) (p ≪ d)

Johnson-Lindenstrauss lemma
 When projecting onto a randomly selected subspace, distances
are approximately preserved
Generating R
 Hard to obtain orthogonalized R
 Gaussian R
 Simple approach (sketch below) :
choose rij ∈ {+√3, 0, –√3} with probabilities 1/6, 4/6, 1/6,
respectively

14
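A sketch of random projection with the simple sparse matrix above (entries ±√3 or 0, no orthogonalization); data stored as rows, so the projection is applied as x R. The 1/√p scaling, which keeps pairwise distances approximately preserved in the Johnson-Lindenstrauss sense, and the sizes are assumptions.

```python
import numpy as np

def random_projection(x, p, seed=0):
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    # Sparse random matrix: entries +sqrt(3), 0, -sqrt(3)
    # with probabilities 1/6, 4/6, 1/6 (unit-variance entries).
    R = rng.choice([np.sqrt(3), 0.0, -np.sqrt(3)],
                   size=(d, p), p=[1/6, 4/6, 1/6])
    # Scale by 1/sqrt(p) so distances are approximately preserved.
    return x @ R / np.sqrt(p)

rng = np.random.default_rng(5)
x = rng.standard_normal((100, 1000))
y = random_projection(x, p=50)
print(y.shape)                 # (100, 50)
```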

Dimension reduction that preserves the distance proximities
observed in the original data set

Loss functions
 Inner product
 Distance
 Squared distance

Classical MDS: minimizing STRAIN, given the dissimilarity matrix Δ
 From Δ, find the inner product matrix B (double centering)
 From B, recover the coordinates X′ (i.e., B = X′X′ᵀ)
15
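A compact sketch of the two classical MDS steps above: double centering of the squared dissimilarities to get B, then recovering coordinates from the top eigenpairs of B = X′X′ᵀ. The dissimilarity matrix is computed from random points for illustration.

```python
import numpy as np

def classical_mds(delta, dim=2):
    # delta: matrix of pairwise dissimilarities (distances).
    n = delta.shape[0]
    # Double centering: B = -1/2 * J (delta^2) J, with J = I - (1/n) 11^T.
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (delta**2) @ J
    # Recover coordinates X' from B = X' X'^T via its top eigenpairs.
    w, v = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:dim]
    return v[:, idx] * np.sqrt(np.maximum(w[idx], 0))

rng = np.random.default_rng(6)
x = rng.standard_normal((30, 5))
delta = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)
print(classical_mds(delta).shape)      # (30, 2)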

SMACOF : minimizing STRESS
 Majorization – for a complex f(x),
find an auxiliary simple g(x, y) s.t. g(x, y) ≥ f(x) and g(y, y) = f(y)
 Majorization for STRESS
(Cox, 2001)
 Minimize tr(Xᵀ B(Y) Y), known as the Guttman transform (sketch below)
16
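A sketch of SMACOF for the unweighted STRESS, where each iteration minimizes the majorizer by the Guttman transform X = B(Y)Y / n; the random initialization and fixed iteration count are assumptions.

```python
import numpy as np

def smacof(delta, dim=2, iters=200, seed=0):
    n = delta.shape[0]
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, dim))           # random initial configuration
    for _ in range(iters):
        # Pairwise distances of the current configuration.
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
        # B(X) from the majorizing function of STRESS (unweighted case).
        with np.errstate(divide="ignore", invalid="ignore"):
            B = np.where(D > 0, -delta / D, 0.0)
        np.fill_diagonal(B, 0.0)
        np.fill_diagonal(B, -B.sum(axis=1))
        # Guttman transform: the minimizer of the majorizer.
        X = B @ X / n
    return X

rng = np.random.default_rng(7)
pts = rng.standard_normal((25, 4))
delta = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
print(smacof(delta).shape)                      # (25, 2)
```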

Competitive and unsupervised learning process for
clustering and visualization

Learning
 Choose the most similar model vector mj for the input xi
 Update the winner and its
neighbors by
mk = mk + α(t) h(t) (xi – mk)
α(t) : learning rate
h(t) : neighborhood (shrinking over time)
Result : similar data end up closer together in the model space
17
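A sketch of the SOM update on a 1-D grid of model vectors: find the winner, then pull the winner and its neighbors toward the input, with a decreasing learning rate and shrinking neighborhood. The grid size, schedules for α(t) and the neighborhood, and the data are assumptions.

```python
import numpy as np

def som_1d(x, n_models=10, epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    m = rng.standard_normal((n_models, x.shape[1]))      # model vectors m_j
    for t in range(epochs):
        alpha = 0.5 * (1 - t / epochs)                   # learning rate alpha(t), decreasing
        radius = max(1, int(n_models / 2 * (1 - t / epochs)))   # neighborhood, shrinking
        for xi in x[rng.permutation(len(x))]:
            # Choose the best-matching model vector for x_i.
            winner = np.argmin(((m - xi)**2).sum(axis=1))
            # Update the winner and its neighbors on the 1-D model grid:
            # m_k <- m_k + alpha(t) * (x_i - m_k).
            lo, hi = max(0, winner - radius), min(n_models, winner + radius + 1)
            m[lo:hi] += alpha * (xi - m[lo:hi])
    return m

rng = np.random.default_rng(8)
data = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
print(som_1d(data).shape)                                # (10, 2)
```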

Definition
A procedure that divides data into a given set of categories,
based on a training set, in a supervised way

Generalization vs. specialization
 Hard to achieve both (underfitting vs. overfitting)
 Avoid overfitting (overtraining) – see the sketch below
▪ Early stopping
▪ Holdout validation
▪ K-fold cross-validation
▪ Leave-one-out cross-validation
(Figure: training error vs. validation error curves)
(Overfitting, Wikipedia)
18
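A sketch of K-fold cross-validation as one way to detect overfitting: each fold is held out once as the validation set and the validation error is averaged. The `fit`/`predict` callables here are a trivial majority-class placeholder, purely for illustration.

```python
import numpy as np

def k_fold_error(x, y, k, fit, predict, seed=0):
    # K-fold cross-validation: hold each fold out once as the validation set
    # and average the validation error.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(x[train], y[train])
        errors.append(np.mean(predict(model, x[val]) != y[val]))
    return np.mean(errors)

# Placeholder learner: predict the majority class seen in training.
fit = lambda x, y: int(round(y.mean()))
predict = lambda model, x: np.full(len(x), model)
rng = np.random.default_rng(9)
x, y = rng.standard_normal((100, 3)), rng.integers(0, 2, 100)
print(k_fold_error(x, y, k=5, fit=fit, predict=predict))
```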

Perceptron : a computational unit with a binary threshold
(Figure: inputs feed a weighted sum followed by a threshold activation function)
Abilities
 Linearly separable decision surface
 Represents Boolean functions (AND, OR, NOT)
 Network (Multilayer) of perceptrons
 Various network architectures and capabilities
(Jain, 1996)
19
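A minimal perceptron sketch: a weighted sum passed through a binary threshold, with mistake-driven weight updates, trained here on the linearly separable AND function. The learning rate, epoch count, and tiny random initialization are assumptions.

```python
import numpy as np

def perceptron_train(x, y, epochs=50, lr=0.1, seed=0):
    # y in {+1, -1}; binary threshold unit: o = sign(w . x + b).
    rng = np.random.default_rng(seed)
    w, b = rng.standard_normal(x.shape[1]) * 0.01, 0.0
    for _ in range(epochs):
        for xi, yi in zip(x, y):
            o = 1.0 if xi @ w + b > 0 else -1.0   # weighted sum + threshold activation
            if o != yi:                            # update only on mistakes
                w += lr * yi * xi
                b += lr * yi
    return w, b

# A perceptron can represent AND: output +1 only when both inputs are 1.
x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, +1])
w, b = perceptron_train(x, y)
print([(1 if xi @ w + b > 0 else -1) for xi in x])   # expected: [-1, -1, -1, 1]
```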

Learning weights – random initialization and updating

Error-correction training rules
 Difference between training data and output: E(t,o)
 Gradient descent (Batch learning)
▪ With E = Σi Ei, update the weights once per pass using the full gradient
 Stochastic approach (On-line learning)
▪ Update using the gradient of each training example's error

Various error functions
 Adding a weight-regularization term (λ Σ wi²) to avoid overfitting
 Adding a momentum term (α Δwi(n–1)) to expedite convergence (sketch below)
20
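A sketch of on-line (stochastic) gradient descent for a linear unit with an L2 regularization term and a momentum term, as described above; the learning rate, regularization strength, momentum coefficient, and synthetic data are assumptions.

```python
import numpy as np

def sgd_linear(x, y, epochs=100, lr=0.01, lam=1e-3, beta=0.9, seed=0):
    # Stochastic (on-line) learning: update after each example,
    # with L2 regularization (lam * sum w_i^2) and momentum (beta * previous step).
    rng = np.random.default_rng(seed)
    w = np.zeros(x.shape[1])
    velocity = np.zeros_like(w)
    for _ in range(epochs):
        for i in rng.permutation(len(x)):
            err = x[i] @ w - y[i]                    # per-example error (o - t)
            grad = err * x[i] + lam * w              # gradient of E_i plus regularizer
            velocity = beta * velocity - lr * grad   # momentum: reuse the previous step
            w += velocity
    return w

rng = np.random.default_rng(10)
x = rng.standard_normal((200, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = x @ true_w + 0.1 * rng.standard_normal(200)
print(sgd_linear(x, y))                              # roughly recovers true_w
```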

Q: How to draw the optimal linear
separating hyperplane?
 A: Maximizing margin

Margin maximization
 The distance between H+1 and H–1 is 2 / ||w||
 Thus, ||w|| should be minimized
(Figure: margin between the separating hyperplanes H+1 and H–1)
21

Constrained optimization problem
 Given a training set {xi, yi} (yi ∈ {+1, –1}):
 Minimize : ½||w||², subject to yi(w · xi + b) ≥ 1

Lagrangian equation with saddle points
 Minimized w.r.t. the primal variables w and b
 Maximized w.r.t. the dual variables αi (all αi ≥ 0)
 xi with αi > 0 (not αi = 0) is called a support vector (SV)
22

Soft Margin (Non-separable case)
 Slack variables ξi (with penalty parameter C)
 Optimization with the additional constraint 0 ≤ αi ≤ C

Non-linear SVM
 Map the non-linearly separable input space to a feature space
 Kernel function k(x, y) = <Φ(x), Φ(y)>
 Kernel classifier with support vectors si (see the sketch below)
(Figure: mapping from input space to feature space)
23
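A small demonstration of a soft-margin, non-linear SVM on data that is not linearly separable in the input space, assuming scikit-learn is available; the RBF kernel, C, gamma, and the ring-shaped data are illustrative choices.

```python
import numpy as np
from sklearn.svm import SVC   # assumes scikit-learn is installed

# Non-linearly separable data: one class inside a ring of the other.
rng = np.random.default_rng(11)
r = np.concatenate([rng.uniform(0, 1, 200), rng.uniform(2, 3, 200)])
theta = rng.uniform(0, 2 * np.pi, 400)
X = np.c_[r * np.cos(theta), r * np.sin(theta)]
y = np.concatenate([np.full(200, -1), np.full(200, +1)])

# RBF kernel k(x, y) = exp(-gamma ||x - y||^2) acts as the inner product
# in a feature space where a separating hyperplane exists; C controls
# the soft-margin penalty on the slack variables.
clf = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X, y)

# The decision function depends only on the support vectors s_i (alpha_i > 0).
print(len(clf.support_vectors_), clf.predict([[0.0, 0.0], [2.5, 0.0]]))
```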

Memory Architecture
Shared Memory
 Symmetric Multiprocessor (SMP)
 OpenMP, POSIX threads (pthreads), MPI
 Easy to manage but expensive
Distributed Memory
 Commodity, off-the-shelf processors
 MPI
 Cost effective but hard to maintain
(Barney, 2007)

Decomposition Strategy
 Task – E.g., Word, IE, …
 Data – scientific problem
 Pipelining – Task + Data
(Barney, 2007)
24

Shrinking
 Recall : only support vectors (αi > 0) are
used in the SVM optimization
 Predict whether each data point is an SV or a non-SV
 Remove non-SVs from problem space

Parallel SVM
 Partition the problem
 Merge data hierarchically
 Each unit finds support vectors
 Loop until convergence (serial sketch below)
(Graf, 2005)
25
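A serial, single-pass sketch of the partition-and-merge idea behind the cascade SVM (Graf, 2005), assuming scikit-learn is available: train an SVM on each partition, keep only its support vectors, merge them, and train a final SVM on the much smaller merged set. The full cascade loops hierarchically until convergence; the partition count, linear kernel, and data are assumptions.

```python
import numpy as np
from sklearn.svm import SVC   # assumes scikit-learn is installed

def cascade_svm(X, y, n_parts=4, C=1.0):
    # Shuffle, then split the problem into partitions.
    idx = np.random.default_rng(0).permutation(len(X))
    parts = np.array_split(idx, n_parts)
    sv_idx = []
    for p in parts:
        # Each unit finds its own support vectors (alpha_i > 0); non-SVs are dropped.
        clf = SVC(kernel="linear", C=C).fit(X[p], y[p])
        sv_idx.append(p[clf.support_])
    # Merge the surviving support vectors and train the final SVM on them.
    keep = np.concatenate(sv_idx)
    return SVC(kernel="linear", C=C).fit(X[keep], y[keep])

rng = np.random.default_rng(12)
X = np.vstack([rng.normal(-2, 1, (200, 2)), rng.normal(2, 1, (200, 2))])
y = np.concatenate([np.full(200, -1), np.full(200, +1)])
model = cascade_svm(X, y)
print(len(model.support_vectors_))
```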
26