
Database Mining
CSCI 4390/6390
Lecture 5: Convex Problems and Probability
Continued
Wei Liu
IBM T. J. Watson Research Center
Sep 9, 2014
1
Overview

Convex Problems

Probability Distributions
2
Convex Problems

Convex function & convex set

Local & global optima

Convex quadratic form

Least squares

Linear regression
3
Convex Set
Let $S$ be a vector space. A set $C \subseteq S$ is convex if, for all $x, y \in C$ and all $t \in [0,1]$, the point $(1-t)x + ty$ also belongs to $C$.
Every point on the line segment connecting $x$ and $y$ is in $C$. This implies that a convex set in a real topological vector space is path-connected.
[Figure: examples of a convex set and a nonconvex set]
4
Convex Function
Let $C$ be a convex set in a vector space, and let $f: C \to \mathbb{R}$ be a function.
$f$ is convex if $f(tx + (1-t)y) \le t f(x) + (1-t) f(y)$ for all $x, y \in C$ and all $t \in [0,1]$.
$f$ is strictly convex if $f(tx + (1-t)y) < t f(x) + (1-t) f(y)$ for all $x \ne y$ in $C$ and all $t \in (0,1)$.
A function $f$ is (strictly) concave if $-f$ is (strictly) convex.
Examples: $f(x) = x^2$ is strictly convex; $f(x) = |x|$ is convex but not strictly convex; $f(x) = \log x$ is concave on $(0, \infty)$.
5
Convex Function
Geometric interpretation: for any $x, y$ and $t \in [0,1]$, the point $f(tx + (1-t)y)$ on the graph of $f$ is no higher than the linearly interpolated point on the chord between $f(x)$ and $f(y)$.
6
Convex Function
A commonly used sufficient condition for judging convexity is the second-order (Hessian) test:
1) If a function $f$ is twice continuously differentiable over a domain $D$ and its Hessian matrix $\nabla^2 f(x) = \big[\partial^2 f / \partial x_i \partial x_j\big]$ is positive semi-definite for all $x \in D$, then $f$ is convex over $D$.
2) If a function $f$ is twice continuously differentiable over a domain $D$ and its Hessian matrix $\nabla^2 f(x)$ is positive definite for all $x \in D$, then $f$ is strictly convex over $D$.
In the 1-D case, $f''(x) \ge 0$ over $D$ implies $f$ is convex, and $f''(x) > 0$ over $D$ implies $f$ is strictly convex.
Examples: $f(x) = x^2$ is strictly convex over $\mathbb{R}$ (since $f'' = 2 > 0$); $f(x) = x^3$ is convex over $[0, \infty)$ but not over $\mathbb{R}$.
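In practice, the Hessian test is often checked numerically through eigenvalues: a symmetric matrix is positive semi-definite iff all its eigenvalues are nonnegative. Below is a minimal NumPy sketch; the helper name `is_psd`, the tolerance, and the example function are illustrative choices, not from the slides.

```python
import numpy as np

def is_psd(H, tol=1e-10):
    """Check whether a (symmetrized) Hessian matrix H is positive semi-definite."""
    H = 0.5 * (H + H.T)               # symmetrize to guard against round-off
    return bool(np.all(np.linalg.eigvalsh(H) >= -tol))

# Hessian of f(x1, x2) = x1^2 + x1*x2 + x2^2, a convex (in fact strictly convex) function
H = np.array([[2.0, 1.0],
              [1.0, 2.0]])
print(is_psd(H))                                 # True -> f is convex
print(bool(np.all(np.linalg.eigvalsh(H) > 0)))   # True -> f is strictly convex
```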
7
Convex Problem
A convex optimization problem is one whose objective function $f$ is a convex function and whose feasible (constraint) set $C$ is a closed convex set:
$\min_{x \in C} f(x)$
[Figure: a convex objective f minimized over a convex feasible set C]
Fact: For a convex problem, any minimizer is a globally optimal solution; if multiple globally optimal solutions exist, they must have the same objective value.
8
Convex Problems

Convex function & convex set

Local & global optima

Convex quadratic form

Least squares

Linear regression
9
Local Optimum
A local optimum of an optimization problem is a solution that is optimal (specifically, minimal) within a neighboring set of candidate solutions: $x^*$ is a local minimum if $f(x^*) \le f(x)$ for all feasible $x$ with $\|x - x^*\| \le \epsilon$, where $\epsilon > 0$ is the neighborhood radius.
A necessary condition: for any continuously differentiable objective function $f$, if $x^*$ is a local optimum (in the interior of the feasible set), then $\nabla f(x^*) = 0$.
10
Global Optimum
A global optimum is a point in a given domain $D$ that attains the lowest objective value (the global minimum). It is the optimal solution among all possible solutions, not just those in a particular neighborhood.
In practice: search for all locally optimal solutions (smooth and non-smooth), evaluate the objective values at these local optima and at the boundary points, and pick a globally optimal solution.
11
Local & Global Optima
Solve a convex problem: $\min_{x \in C} f(x)$
[Figure: a convex objective f over a convex feasible set C, with a globally optimal solution marked]
Fact: For any convex problem, once a local optimum is found, it
is immediately a global optimum.
12
Convex Problems

Convex function & convex set

Local & global optima

Convex quadratic form

Least squares

Linear regression
13
Convex Quadratic Form
The convex quadratic form is perhaps one of the simplest convex problems, and it has been studied thoroughly:
$f(x) = x^\top A x + b^\top x + c$
Quadratic term: $A$ is a positive semi-definite matrix. Linear term: $b$ is a constant vector. $c$ is a constant.
Fact: Minimizing any function in the convex quadratic form subject to a closed convex constraint set (e.g., $x \in \mathbb{R}^n$) must lead to a global optimum whenever the minimum is attained.
If $A$ is asymmetric, apply a symmetrizing step: replace $A$ by $\frac{1}{2}(A + A^\top)$, which leaves $x^\top A x$ unchanged.
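As a quick sanity check, the unconstrained minimizer of a convex quadratic can be computed by solving the linear system given by the zero-gradient condition ($2Ax + b = 0$ for symmetric $A$ under the $f(x) = x^\top A x + b^\top x + c$ convention used here). A minimal NumPy sketch with made-up data:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])            # symmetric positive definite quadratic term
b = np.array([-1.0, 4.0])
c = 5.0

x_star = np.linalg.solve(2.0 * A, -b)            # zero gradient: 2*A*x + b = 0
f_star = x_star @ A @ x_star + b @ x_star + c    # global minimum value
print(x_star, f_star)
```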
14
Convex Problems

Convex function & convex set

Local & global optima

Convex quadratic form

Least squares

Linear regression
15
Least Squares
The Least Squares problem falls into the convex quadratic form:
$\min_{x \in \mathbb{R}^n} \|Ax - b\|^2 = x^\top A^\top A x - 2 b^\top A x + b^\top b$
where $A \in \mathbb{R}^{m \times n}$ is a general matrix and $b \in \mathbb{R}^m$ is a constant vector.
The quadratic term matrix $A^\top A$ is positive semi-definite for any $A$, since $x^\top A^\top A x = \|Ax\|^2 \ge 0$.
If $\mathrm{Rank}(A) = n$, then the Least Squares problem has a unique global optimum $x^* = (A^\top A)^{-1} A^\top b$, which is obtained by setting the gradient to zero, i.e., $A^\top A x = A^\top b$.
Otherwise the problem is ill-posed and has many solutions; one usually runs a quadratic programming solver, e.g., an interior-point method.
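A minimal NumPy sketch of the least-squares solution, comparing the normal-equation formula $x^* = (A^\top A)^{-1} A^\top b$ with NumPy's built-in solver; the random data below is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 3))          # tall matrix, full column rank with probability 1
b = rng.normal(size=50)

# Normal equations: A^T A x = A^T b
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Numerically preferable SVD-based solver
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(x_normal, x_lstsq))   # True
```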
16
Regularized Least Squares
The Regularized Least Squares problem excludes the ill-posed case and makes the problem uniquely solvable:
$\min_{x \in \mathbb{R}^n} \|Ax - b\|^2 + \lambda \|x\|^2$
where $A \in \mathbb{R}^{m \times n}$ is a general matrix, $b \in \mathbb{R}^m$ is a constant vector, and $\lambda > 0$ is the regularization parameter.
The quadratic term matrix $A^\top A + \lambda I$ is positive definite for any $A$ (the sum of a PD and a PSD matrix is PD).
Fact: The Regularized Least Squares problem is strictly convex and has a unique global optimum $x^* = (A^\top A + \lambda I)^{-1} A^\top b$, which is still obtained by setting the gradient to zero.
17
Convex Problems

Convex function & convex set

Local & global optima

Convex quadratic form

Least squares

Linear regression
18
Linear Regression
In its regularized form, linear regression is also known as Ridge Regression; it is actually a regularized least squares problem. A linear regression task is to find a linear regression function $f(x) = w^\top x + b$ such that any input data sample is mapped to its desired output.
Input $x$: single or multiple observation variable(s).
Output $y$: a response value (one-dimensional).
Formulate it as the following problem: $\min_{w, b} \sum_{i=1}^{N} \big( w^\top x_i + b - y_i \big)^2 + \lambda \|w\|^2$
A trick to absorb the bias: append a constant 1 to each input and write $\tilde{x} = [x; 1]$, $\tilde{w} = [w; b]$, so that $f(x) = \tilde{w}^\top \tilde{x}$.
19
Linear Regression
Predict the responses of unseen samples in a linear manner.
One-variable input: fitting a line in $\mathbb{R}^2$.
Multi-variable input: fitting a (hyper)plane in $\mathbb{R}^{d+1}$.
20
Linear Regression
Write the data matrix $X = [\tilde{x}_1, \ldots, \tilde{x}_N]^\top$ and the response vector $y = [y_1, \ldots, y_N]^\top$, and then solve the linear regression problem as
$\min_{\tilde{w}} \|X\tilde{w} - y\|^2 + \lambda \|\tilde{w}\|^2$,
which has a unique global optimum $\tilde{w}^* = (X^\top X + \lambda I)^{-1} X^\top y$.
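A minimal ridge-regression sketch using the bias-absorption trick and the closed-form optimum above; the synthetic data, the true weights, and $\lambda = 0.1$ are my own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 100, 3
X_raw = rng.normal(size=(N, d))
y = X_raw @ np.array([1.5, -2.0, 0.5]) + 0.7 + 0.1 * rng.normal(size=N)

X = np.hstack([X_raw, np.ones((N, 1))])   # append a constant 1 to absorb the bias b
lam = 0.1

# Unique global optimum: w* = (X^T X + lam * I)^(-1) X^T y
w_star = np.linalg.solve(X.T @ X + lam * np.eye(d + 1), X.T @ y)
print(w_star)                             # close to [1.5, -2.0, 0.5, 0.7]
```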
21
Summary of Convex Problems

Convex function & convex set = convex problem

Local & global optima => any local opt is a global opt

Convex quadratic form => simplest convex problem

Least squares => regularized least squares has unique opt

Linear regression => widespread in statistics/ML/DM
22
Probability Distributions

Discrete random variables

Continuous random variables

Central limit theorem
23
Parametric Distributions

Basic building blocks: model and handle a parametric probability distribution $p(x \mid \theta)$.

Need to determine: the parameters $\theta$ given the observation data $D = \{x_1, \ldots, x_N\}$.

Representation and optimization: $\theta^* = \arg\max_\theta\, p(D \mid \theta)$?
24
Probability Distributions

Discrete random variables

Continuous random variables

Central limit theorem
25
Discrete Random Variables (1)

Coin flipping: heads = 1, tails = 0 (binary random variable $x \in \{0, 1\}$), with $p(x = 1 \mid \mu) = \mu$.

Bernoulli Distribution: $\mathrm{Bern}(x \mid \mu) = \mu^x (1-\mu)^{1-x}$, with $\mathbb{E}[x] = \mu$ and $\mathrm{var}[x] = \mu(1-\mu)$.
26
Discrete Random Variables (2)

$N$ coin flips; the number of heads $m$ is a discrete random variable in $\{0, 1, \ldots, N\}$.

Binomial Distribution: $\mathrm{Bin}(m \mid N, \mu) = \binom{N}{m} \mu^m (1-\mu)^{N-m}$, with $\mathbb{E}[m] = N\mu$ and $\mathrm{var}[m] = N\mu(1-\mu)$.
27
Binomial Distribution
28
Parameter Estimation
Maximum Likelihood (ML) for Bernoulli parameter estimation.
Given $D = \{x_1, \ldots, x_N\}$, the log likelihood is $\ln p(D \mid \mu) = \sum_{n=1}^{N} \big( x_n \ln \mu + (1 - x_n) \ln(1-\mu) \big)$.
Maximize by setting the gradient to 0: $\mu_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} x_n = \frac{m}{N}$, where $m$ is the number of heads.
29
Parameter Estimation
Example: if all $N$ observed tosses land heads ($m = N$), then $\mu_{\mathrm{ML}} = 1$.
Prediction: all future tosses will land heads up forever.
This is overfitting to the small-sized $D$!
It calls for Maximum A Posteriori (MAP) estimation (a simple form of Bayesian inference).
30
Beta Distribution
Regard the parameter $\mu$ as a random variable, and assume a prior probability distribution:
$\mathrm{Beta}(\mu \mid a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\, \mu^{a-1} (1-\mu)^{b-1}$, with $\mathbb{E}[\mu] = \frac{a}{a+b}$.
31
Bayesian Bernoulli
MAP parameter estimation: combining the likelihood with the Beta prior gives
$p(\mu \mid D, a, b) \propto \mu^{m + a - 1} (1-\mu)^{N - m + b - 1}$,
i.e., the posterior is again a Beta distribution, $\mathrm{Beta}(\mu \mid a + m,\; b + N - m)$.
The Beta distribution provides the conjugate prior for the Bernoulli distribution.
32
Beta Distribution
Choose proper constant parameters $a, b$; then $\mu_{\mathrm{MAP}} = \frac{m + a - 1}{N + a + b - 2}$ is the global maximum (mode) of the posterior Beta distribution.
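A minimal sketch contrasting the ML, MAP, and posterior-mean estimates of the Bernoulli parameter under a Beta$(a, b)$ prior; the simulated coin, $N = 10$, and the hyperparameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
mu_true = 0.7
N = 10
D = rng.random(N) < mu_true          # N coin flips, heads = True
m = int(D.sum())                     # number of heads

a, b = 2.0, 2.0                      # Beta prior hyperparameters

mu_ml = m / N                                   # maximum likelihood estimate
mu_map = (m + a - 1) / (N + a + b - 2)          # mode of posterior Beta(a+m, b+N-m)
mu_post_mean = (m + a) / (N + a + b)            # posterior mean = predictive p(heads)

print(mu_ml, mu_map, mu_post_mean)
```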
33
Prior ∙ Likelihood = Posterior
34
Properties of the Posterior Beta
As the size of the data set, $N$, increases, the posterior mean $\mathbb{E}[\mu \mid D] = \frac{m+a}{N+a+b}$ approaches the ML estimate $\frac{m}{N}$, and the posterior variance shrinks toward 0.
35
Prediction under the Posterior
What is the probability that the next coin toss will land heads up?
$p(x = 1 \mid D) = \int_0^1 p(x = 1 \mid \mu)\, p(\mu \mid D)\, d\mu = \mathbb{E}[\mu \mid D] = \frac{m + a}{N + a + b}$
36
Multinomial Variables
1-of-K coding scheme: each observation is a $K$-dimensional binary vector $x$ with exactly one element equal to 1, e.g., $x = (0, 0, 1, 0, 0, 0)^\top$. Then $p(x \mid \mu) = \prod_{k=1}^{K} \mu_k^{x_k}$, with $\mu_k \ge 0$ and $\sum_{k=1}^{K} \mu_k = 1$.
37
ML Parameter Estimation
Given $D = \{x_1, \ldots, x_N\}$, the likelihood is $p(D \mid \mu) = \prod_{n=1}^{N} \prod_{k=1}^{K} \mu_k^{x_{nk}} = \prod_{k=1}^{K} \mu_k^{m_k}$, where $m_k = \sum_{n} x_{nk}$ counts the observations of category $k$.
To ensure $\sum_k \mu_k = 1$, introduce a Lagrange multiplier $\lambda$ and maximize
$\sum_{k=1}^{K} m_k \ln \mu_k + \lambda \Big( \sum_{k=1}^{K} \mu_k - 1 \Big)$,
which yields $\mu_k^{\mathrm{ML}} = \frac{m_k}{N}$.
38
The Multinomial Distribution
In contrast to the binomial distribution, the multinomial distribution is
$\mathrm{Mult}(m_1, \ldots, m_K \mid \mu, N) = \frac{N!}{m_1!\, m_2! \cdots m_K!} \prod_{k=1}^{K} \mu_k^{m_k}$, with $\sum_{k=1}^{K} m_k = N$.
39
The Dirichlet Distribution
The Dirichlet distribution is the conjugate prior for the multinomial distribution:
$\mathrm{Dir}(\mu \mid \alpha) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_K)} \prod_{k=1}^{K} \mu_k^{\alpha_k - 1}$, where $\alpha_0 = \sum_{k=1}^{K} \alpha_k$.
40
Bayesian Multinomial (1)
Compute the posterior distribution: $p(\mu \mid D, \alpha) \propto p(D \mid \mu)\, p(\mu \mid \alpha) \propto \prod_{k=1}^{K} \mu_k^{\alpha_k + m_k - 1}$.
It turns out that the posterior is still a Dirichlet distribution: $p(\mu \mid D, \alpha) = \mathrm{Dir}(\mu \mid \alpha_1 + m_1, \ldots, \alpha_K + m_K)$.
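A minimal sketch of the Dirichlet-multinomial update; the counts and prior parameters below are illustrative:

```python
import numpy as np

K = 3
alpha = np.array([1.0, 1.0, 1.0])     # Dirichlet prior parameters
m = np.array([12, 5, 3])              # observed category counts m_k, with N = m.sum()

alpha_post = alpha + m                # posterior is Dir(mu | alpha + m)

mu_ml = m / m.sum()                               # ML estimate m_k / N
mu_post_mean = alpha_post / alpha_post.sum()      # posterior mean (predictive probabilities)

print(mu_ml, mu_post_mean)
```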
41
Bayesian Multinomial (2)
42
Probability Distributions

Discrete random variables

Continuous random variables

Central limit theorem
43
The Gaussian Distribution
Single variable: $\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\Big( -\frac{(x-\mu)^2}{2\sigma^2} \Big)$
Multi-variable ($D$-dimensional $x$): $\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\!\Big( -\frac{1}{2} (x-\mu)^\top \Sigma^{-1} (x-\mu) \Big)$
44
First-Order Moment of the Multivariate Gaussian
$\mathbb{E}[x] = \int \mathcal{N}(x \mid \mu, \Sigma)\, x \, dx = \mu$
45
Second-Order Moment of the Multivariate Gaussian
$\mathbb{E}[x x^\top] = \mu \mu^\top + \Sigma$, hence $\mathrm{cov}[x] = \mathbb{E}\big[(x - \mu)(x - \mu)^\top\big] = \Sigma$
46
Bayes’ Theorem for Gaussian Variables
Bayesian linear regression (the linear Gaussian model):
$x$ marginal: $p(x) = \mathcal{N}(x \mid \mu, \Lambda^{-1})$
$y$ conditioned on $x$: $p(y \mid x) = \mathcal{N}(y \mid Ax + b, L^{-1})$
We can infer:
$y$ marginal: $p(y) = \mathcal{N}(y \mid A\mu + b,\; L^{-1} + A \Lambda^{-1} A^\top)$
$x$ conditioned on $y$: $p(x \mid y) = \mathcal{N}\big(x \mid \Sigma \{A^\top L (y - b) + \Lambda \mu\},\; \Sigma\big)$, where $\Sigma = (\Lambda + A^\top L A)^{-1}$.
Fact: Given that the first two distributions are Gaussian, the other two are Gaussian as well.
47
Maximum Likelihood for the Gaussian (1)
Given i.i.d. data $X = \{x_1, \ldots, x_N\}$, the log likelihood function is given by
$\ln p(X \mid \mu, \Sigma) = -\frac{ND}{2} \ln(2\pi) - \frac{N}{2} \ln |\Sigma| - \frac{1}{2} \sum_{n=1}^{N} (x_n - \mu)^\top \Sigma^{-1} (x_n - \mu)$.
Set the derivative of the log likelihood with respect to $\mu$ to zero and solve to obtain $\mu_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} x_n$.
Similarly, by setting the derivative with respect to $\Sigma$ to zero, we obtain $\Sigma_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{\mathrm{ML}})(x_n - \mu_{\mathrm{ML}})^\top$.
48
Maximum Likelihood for the Gaussian (2)
Under the true distribution: $\mathbb{E}[\mu_{\mathrm{ML}}] = \mu$, but $\mathbb{E}[\Sigma_{\mathrm{ML}}] = \frac{N-1}{N} \Sigma$ [biased estimation].
Hence adjust: $\tilde{\Sigma} = \frac{1}{N-1} \sum_{n=1}^{N} (x_n - \mu_{\mathrm{ML}})(x_n - \mu_{\mathrm{ML}})^\top$ [unbiased estimation].
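A minimal NumPy sketch of ML estimation for a multivariate Gaussian, comparing the biased and unbiased covariance estimates; the true parameters and sample size are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
mu_true = np.array([1.0, -2.0])
Sigma_true = np.array([[2.0, 0.3],
                       [0.3, 0.5]])
X = rng.multivariate_normal(mu_true, Sigma_true, size=500)   # N x D data matrix
N = X.shape[0]

mu_ml = X.mean(axis=0)
diff = X - mu_ml
Sigma_ml = (diff.T @ diff) / N              # biased ML estimate
Sigma_unbiased = (diff.T @ diff) / (N - 1)  # same as np.cov(X, rowvar=False)

print(mu_ml)
print(Sigma_ml)
print(Sigma_unbiased)
```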
49
Sequential ML Estimation
Incrementally update the ML estimate of the mean when the $N$th data point $x_N$ arrives:
$\mu_{\mathrm{ML}}^{(N)} = \mu_{\mathrm{ML}}^{(N-1)} + \frac{1}{N} \big( x_N - \mu_{\mathrm{ML}}^{(N-1)} \big)$
i.e., old estimate plus a correction $\big(x_N - \mu_{\mathrm{ML}}^{(N-1)}\big)$ given $x_N$, scaled by the correction weight $\frac{1}{N}$; this can be computed fast, without revisiting earlier data.
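A minimal sketch of the sequential update, verified against the batch mean; the synthetic stream is illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
data = rng.normal(loc=3.0, scale=2.0, size=1000)

mu = 0.0
for n, x_n in enumerate(data, start=1):
    mu += (x_n - mu) / n          # old estimate + correction weight * correction

print(mu, data.mean())            # the two agree
```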
50
Bayesian Inference for the Gaussian (1)
Assume $\sigma^2$ is known. Given i.i.d. data $X = \{x_1, \ldots, x_N\}$, the likelihood function for $\mu$ is given by
$p(X \mid \mu) = \prod_{n=1}^{N} p(x_n \mid \mu) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\!\Big( -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 \Big)$.
This has a Gaussian shape as a function of $\mu$, but it is not a probability distribution over $\mu$.
51
Bayesian Inference for the Gaussian (2)
Combined with a Gaussian prior over $\mu$, $p(\mu) = \mathcal{N}(\mu \mid \mu_0, \sigma_0^2)$,
this gives the posterior $p(\mu \mid X) \propto p(X \mid \mu)\, p(\mu)$.
Completing the square over $\mu$, we see that $p(\mu \mid X) = \mathcal{N}(\mu \mid \mu_N, \sigma_N^2)$ …
52
Bayesian Inference for the Gaussian (3)
… where
$\mu_N = \frac{\sigma^2}{N\sigma_0^2 + \sigma^2}\, \mu_0 + \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\, \mu_{\mathrm{ML}}$ and $\frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}$.
Note: as $N \to \infty$, $\mu_N \to \mu_{\mathrm{ML}}$ and $\sigma_N^2 \to 0$.
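A minimal sketch of the closed-form posterior $\mathcal{N}(\mu \mid \mu_N, \sigma_N^2)$ for the Gaussian mean with known variance; the prior hyperparameters and data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
sigma2 = 1.0                       # known noise variance
mu0, sigma0_2 = 0.0, 10.0          # Gaussian prior N(mu | mu0, sigma0^2)

X = rng.normal(loc=2.5, scale=np.sqrt(sigma2), size=20)
N = len(X)
mu_ml = X.mean()

mu_N = (sigma2 / (N * sigma0_2 + sigma2)) * mu0 \
     + (N * sigma0_2 / (N * sigma0_2 + sigma2)) * mu_ml
sigma_N2 = 1.0 / (1.0 / sigma0_2 + N / sigma2)

print(mu_ml, mu_N, sigma_N2)       # mu_N is pulled slightly toward the prior mean
```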
53
Bayesian Inference for the Gaussian (4)
Example: the posterior over $\mu$ for $N$ = 0, 1, 2 and 10 observed data points; the posterior sharpens as $N$ grows.
54
Bayesian Inference for the Gaussian (5)
Sequential Estimation
The posterior obtained after observing $N-1$ data points acts as the prior when we observe the $N$th data point.
55
Bayesian Inference for the Gaussian (6)
Now assume $\mu$ is known. The likelihood function for the precision $\lambda = 1/\sigma^2$ is given by
$p(X \mid \lambda) = \prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \lambda^{-1}) \propto \lambda^{N/2} \exp\!\Big( -\frac{\lambda}{2} \sum_{n=1}^{N} (x_n - \mu)^2 \Big)$.
This has a Gamma shape as a function of $\lambda$.
56
Bayesian Inference for the Gaussian (7)

The Gamma distribution: $\mathrm{Gam}(\lambda \mid a, b) = \frac{1}{\Gamma(a)}\, b^a \lambda^{a-1} e^{-b\lambda}$, with $\mathbb{E}[\lambda] = \frac{a}{b}$ and $\mathrm{var}[\lambda] = \frac{a}{b^2}$.
57
Bayesian Inference for the Gaussian (8)
Now we combine the Gamma prior $\mathrm{Gam}(\lambda \mid a_0, b_0)$ with the likelihood function for $\lambda$ to obtain
$p(\lambda \mid X) \propto \lambda^{a_0 - 1} \lambda^{N/2} \exp\!\Big( -b_0 \lambda - \frac{\lambda}{2} \sum_{n=1}^{N} (x_n - \mu)^2 \Big)$,
which amounts to the posterior $\mathrm{Gam}(\lambda \mid a_N, b_N)$ with $a_N = a_0 + \frac{N}{2}$ and $b_N = b_0 + \frac{1}{2} \sum_{n=1}^{N} (x_n - \mu)^2$.
58
Bayesian Inference for the Gaussian (9)
If both $\mu$ and $\lambda$ are unknown, the joint likelihood function is given by
$p(X \mid \mu, \lambda) = \prod_{n=1}^{N} \Big( \frac{\lambda}{2\pi} \Big)^{1/2} \exp\!\Big( -\frac{\lambda}{2} (x_n - \mu)^2 \Big)$.
We need a prior with the same functional dependence on $\mu$ and $\lambda$ (more difficult).
59
Bayesian Inference for the Gaussian (10)
Multivariate conjugate priors:
1. $\mu$ unknown, $\Lambda$ known: $p(\mu)$ is Gaussian.
2. $\Lambda$ unknown, $\mu$ known: $p(\Lambda)$ is Wishart, $\mathcal{W}(\Lambda \mid W, \nu) \propto |\Lambda|^{(\nu - D - 1)/2} \exp\!\big( -\frac{1}{2} \mathrm{Tr}(W^{-1} \Lambda) \big)$.
3. $\mu$ and $\Lambda$ unknown: $p(\mu, \Lambda)$ is Gaussian-Wishart, $p(\mu, \Lambda \mid \mu_0, \beta, W, \nu) = \mathcal{N}\big(\mu \mid \mu_0, (\beta \Lambda)^{-1}\big)\, \mathcal{W}(\Lambda \mid W, \nu)$.
60
Mixtures of Gaussians (1)

Any probability distribution can be approximated by a mixture of multiple (possibly infinitely many) Gaussians.
[Figure: a single Gaussian fit vs. a mixture of two Gaussians fit to the same data]
61
Mixtures of Gaussians (2)
Combine simple models into a complex model:
$p(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)$
where each $\mathcal{N}(x \mid \mu_k, \Sigma_k)$ is a component and $\pi_k$ is its mixing coefficient; the mixing coefficients are nonnegative and sum to 1: $\pi_k \ge 0$, $\sum_{k=1}^{K} \pi_k = 1$.
[Figure: a mixture with K = 3 components]
62
Mixtures of Gaussians (3)
63
Mixtures of Gaussians (4)

Determining the parameters $\pi$, $\mu$, and $\Sigma$ using ML estimation:
$\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \Big( \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \Big)$
This is a log of a sum, so there are no closed-form solutions.

Feasible solution: use standard iterative numerical optimization methods or the expectation-maximization (EM) algorithm (to be studied in the clustering lectures).
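A minimal sketch that evaluates a one-dimensional Gaussian-mixture density (evaluation only, since ML fitting has no closed form); the mixture parameters and the helper name `gmm_density` are made up for illustration:

```python
import numpy as np

def gmm_density(x, pis, mus, sigmas):
    """Mixture density p(x) = sum_k pi_k N(x | mu_k, sigma_k^2) for 1-D points x."""
    x = np.asarray(x)[:, None]                       # shape (n, 1), broadcast over components
    comp = np.exp(-0.5 * ((x - mus) / sigmas) ** 2) / (np.sqrt(2 * np.pi) * sigmas)
    return comp @ pis                                # weighted sum over the K components

pis = np.array([0.5, 0.3, 0.2])                      # K = 3 mixing coefficients, sum to 1
mus = np.array([-2.0, 0.0, 3.0])
sigmas = np.array([0.5, 1.0, 0.8])

x = np.linspace(-5, 6, 5)
print(gmm_density(x, pis, mus, sigmas))
```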
64
Probability Distributions

Discrete random variables

Continuous random variables

Central limit theorem
65
Central Limit Theorem
The distribution of the sum or the arithmetic mean of $N$ i.i.d. random variables becomes increasingly close to a Gaussian as $N$ grows large enough. (It is related to, but distinct from, the law of large numbers.)
Example: the mean of $N$ uniform $[0,1]$ random variables.
66
Central Limit Theorem
Given i.i.d. random variables $x_1, \ldots, x_N$, each of which has mean $\mu$ and variance $\sigma^2$, then for sufficiently large $N$ we approximately have
$\bar{x} = \frac{1}{N} \sum_{n=1}^{N} x_n \;\sim\; \mathcal{N}\Big( \mu, \frac{\sigma^2}{N} \Big)$.
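A minimal simulation of the central limit theorem for means of uniform $[0,1]$ variables, checking the empirical mean and variance against the $\mathcal{N}(\mu, \sigma^2/N)$ approximation above; the choices of $N$ and the number of trials are illustrative:

```python
import numpy as np

rng = np.random.default_rng(6)
N = 30                                        # variables per mean
trials = 100_000
means = rng.random((trials, N)).mean(axis=1)  # arithmetic means of N uniforms

mu, sigma2 = 0.5, 1.0 / 12.0                  # mean and variance of Uniform[0, 1]
print(means.mean(), mu)                       # ~0.5
print(means.var(), sigma2 / N)                # ~ sigma^2 / N
```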
67
Summary of Probability Distributions

Assume a proper probability distribution for discrete or
continuous random variable(s).

ML & MAP parameter estimation (for relatively simple distributions, closed-form solutions can be obtained).

A Gaussian distribution is usually a safe assumption, due to the Central Limit Theorem.
68
Summary of Probability Distributions
ML estimation: $\theta_{\mathrm{ML}} = \arg\max_\theta\, p(D \mid \theta)$
MAP estimation: $\theta_{\mathrm{MAP}} = \arg\max_\theta\, p(\theta \mid D) = \arg\max_\theta\, p(D \mid \theta)\, p(\theta)$, where $p(\theta)$ is the prior.
69
Courtesy of Christopher M. Bishop: some of the slides about probability are based on his slides for his book "Pattern Recognition and Machine Learning".
70