Database Mining CSCI 4390/6390
Lecture 5: Convex Problems and Probability Continued
Wei Liu, IBM T. J. Watson Research Center
Sep 9, 2014

Overview
Convex Problems
Probability Distributions

Convex Problems
Convex function & convex set
Local & global optima
Convex quadratic form
Least squares
Linear regression

Convex Set
Let S be a vector space. A set C in S is convex if, for all x and y in C and all t in [0, 1], the point (1 - t)x + ty also belongs to C.
Every point on the line segment connecting x and y is in C.
This implies that a convex set in a real topological vector space is path-connected.
(Figures: a convex set; a nonconvex set.)

Convex Function
Let C be a convex set in a vector space, and let f: C -> R be a function. f is convex if, for all x, y in C and all t in [0, 1],
  f((1 - t)x + ty) <= (1 - t) f(x) + t f(y).
f is strictly convex if the inequality is strict for all x != y and all t in (0, 1).
A function f is (strictly) concave if -f is (strictly) convex.

Convex Function
(Figure.) The value of f at an interior point is no higher than the linearly interpolated point.

Convex Function
A commonly used sufficient condition for judging convexity:
1) If a function f is twice continuously differentiable over a domain D and its Hessian matrix is positive semi-definite, then f is convex over D.
2) If a function f is twice continuously differentiable over a domain D and its Hessian matrix is positive definite, then f is strictly convex over D.
In the 1-d case, f''(x) >= 0 gives convex and f''(x) > 0 gives strictly convex.

Convex Problem
A convex optimization problem minimizes a convex objective function f over a feasible (constraint) set C that is a closed convex set:
  min_{x in C} f(x).
Fact: whenever a convex problem has an optimal solution, it is a globally optimal solution, and if multiple globally optimal solutions exist they must have the same objective value.

Local Optimum
A local optimum of an optimization problem is a solution x* that is optimal (specifically, minimal) within a neighboring set of candidate solutions:
  f(x*) <= f(x) for all x with ||x - x*|| <= \delta,
where \delta > 0 is the neighborhood radius.
A necessary condition: for any continuously differentiable objective function f, if x* is a local optimum, then \nabla f(x*) = 0.

Global Optimum
A global optimum is a selection x* from a given domain D that provides the lowest objective value (the global minimum). It is the optimal solution among all possible solutions, not just those in a particular neighborhood.
Practice: search for all locally optimal solutions (smooth and nonsmooth), evaluate the objective values at these local optima and at the boundary points, and pick the globally optimal solution.

Local & Global Optima
Solving a convex problem min_{x in C} f(x) yields a globally optimal solution.
Fact: for any convex problem, once a local optimum is found, it is immediately a global optimum.

Convex Quadratic Form
The convex quadratic form is arguably one of the simplest convex problems and has been studied extensively:
  f(x) = x^T A x + b^T x + c,
where the quadratic term has a positive semi-definite matrix A, the linear term has a constant vector b, and c is a constant.
Fact: minimizing any function in the convex quadratic form subject to a convex constraint set must lead to a global optimum.
If A is asymmetric, apply a symmetrization step (replace A with (A + A^T)/2, which leaves x^T A x unchanged).

Least Squares
The Least Squares problem falls into the convex quadratic form:
  min_x ||Ax - b||^2,
where A in R^{m x n} is a general matrix and b is a constant vector.
Expanding gives x^T (A^T A) x - 2 b^T A x + b^T b, and A^T A is positive semi-definite for any A. A small numeric check follows.
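To make the claim concrete, here is a minimal numeric sketch, assuming NumPy is available; the matrix A, vector b, and test point x are made-up illustration data. It checks the quadratic-form expansion, the positive semi-definiteness of A^T A, and the zero-gradient condition at the closed-form minimizer discussed next.

```python
# Minimal sketch (assumes NumPy): the least-squares objective ||Ax - b||^2
# expands to the convex quadratic form x^T (A^T A) x - 2 b^T A x + b^T b,
# and its Hessian 2 A^T A is positive semi-definite for any A.
# The matrix A, vector b, and point x below are made-up illustration data.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))   # a generic 6x3 matrix (rank 3 almost surely)
b = rng.standard_normal(6)
x = rng.standard_normal(3)

direct = np.sum((A @ x - b) ** 2)                        # ||Ax - b||^2
quadratic = x @ (A.T @ A) @ x - 2 * b @ (A @ x) + b @ b  # expanded quadratic form
print(np.isclose(direct, quadratic))                     # True

eigvals = np.linalg.eigvalsh(A.T @ A)                    # eigenvalues of A^T A
print(np.all(eigvals >= -1e-10))                         # PSD: all eigenvalues >= 0

# Closed-form minimizer when rank(A) = n: x* = (A^T A)^{-1} A^T b
x_star = np.linalg.solve(A.T @ A, A.T @ b)
print(np.allclose(A.T @ (A @ x_star - b), 0))            # gradient 2 A^T(Ax* - b) is zero
```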
If rank(A) = n, then the Least Squares problem has a unique global optimum
  x* = (A^T A)^{-1} A^T b,
which is acquired by setting the gradient to zero, i.e., 2 A^T (Ax - b) = 0.
Otherwise the problem is ill-posed and has many solutions; one usually runs a quadratic programming solver, e.g., an interior-point method.

Regularized Least Squares
The Regularized Least Squares problem excludes the ill-posed case and makes the problem uniquely solvable:
  min_x ||Ax - b||^2 + \lambda ||x||^2,
where A is a general matrix, b is a constant vector, and \lambda > 0 is the regularization parameter.
The matrix A^T A + \lambda I is positive definite for any A (the sum of a PD and a PSD matrix is PD).
Fact: the Regularized Least Squares problem is strictly convex and has a unique global optimum
  x* = (A^T A + \lambda I)^{-1} A^T b,
which is still acquired by setting the gradient to zero.

Linear Regression
Also known as Ridge Regression; it is actually a regularized least squares problem.
A linear regression task is to find a linear regression function f(x) = w^T x + b such that any input data sample is mapped to its desired output.
Input x: single or multiple observation variable(s).
Output y: a response value (one-dimensional).
A trick: append a constant 1 to x so that the bias b is absorbed into the weight vector, i.e., f(x) = w^T [x; 1].

Linear Regression
Predict the responses of unseen samples in a linear manner.
One-variable input: fitting a line in R^2.
Multi-variable input: fitting a (hyper)plane.

Linear Regression
Write the data matrix X = [x_1^T; ...; x_N^T] and the response vector y = (y_1, ..., y_N)^T, and then solve the linear regression problem
  min_w ||Xw - y||^2 + \lambda ||w||^2,
which has the unique global optimum w* = (X^T X + \lambda I)^{-1} X^T y.

Summary of Convex Problems
Convex function & convex set => convex problem
Local & global optima => any local optimum is a global optimum
Convex quadratic form => among the simplest convex problems
Least squares => regularized least squares has a unique optimum
Linear regression => widespread in statistics/ML/DM

Probability Distributions
Discrete random variables
Continuous random variables
Central limit theorem

Parametric Distributions
Basic building block: model and handle a parametric probability distribution p(x | \theta).
Need to determine: the parameters \theta, given the observation data D.
Representation and optimization: \theta* = argmax_\theta p(D | \theta)?

Discrete Random Variables (1)
Coin flipping: heads = 1 / tails = 0 (binary random variable x).
Bernoulli distribution: p(x | \mu) = \mu^x (1 - \mu)^{1 - x}, with p(x = 1 | \mu) = \mu.

Discrete Random Variables (2)
N coin flips: the number of heads m (a discrete random variable in {0, 1, ..., N}) follows the
Binomial distribution: Bin(m | N, \mu) = \binom{N}{m} \mu^m (1 - \mu)^{N - m}.

Binomial Distribution
(Figure.)

Parameter Estimation
Maximum Likelihood (ML) for the Bernoulli parameter. Given D = {x_1, ..., x_N},
  p(D | \mu) = \prod_n \mu^{x_n} (1 - \mu)^{1 - x_n}.
Maximize by setting the gradient of the log likelihood to zero:
  \mu_{ML} = (1/N) \sum_n x_n = m / N, where m is the number of heads.

Parameter Estimation
Example: if every observed toss lands heads up, then \mu_{ML} = 1, and the prediction is that all future tosses will land heads up forever.
This is overfitting to a small-sized D! It calls for Maximum A Posteriori (MAP) estimation (also referred to as Bayesian inference).

Beta Distribution
Regard the parameter \mu as a random variable, and assume a prior probability distribution
  Beta(\mu | a, b) = \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} \mu^{a - 1} (1 - \mu)^{b - 1}.

Bayesian Bernoulli
MAP parameter estimation: with m heads and l = N - m tails, the posterior is again a Beta distribution,
  p(\mu | D) \propto \mu^{m + a - 1} (1 - \mu)^{l + b - 1}, i.e., Beta(\mu | m + a, l + b).
The Beta distribution provides the conjugate prior for the Bernoulli distribution.

Beta Distribution
Choose proper constant parameters a, b; then the mode of the posterior Beta distribution,
  \mu_{MAP} = (m + a - 1) / (m + l + a + b - 2),
is the global maximum of the posterior.

Prior ∙ Likelihood = Posterior
(Figure.)

Properties of the Posterior Beta
As the size of the data set, N, increases, the posterior Beta distribution becomes sharper (its variance shrinks) and its mean approaches the ML estimate.

Prediction under the Posterior
What is the probability that the next coin toss will land heads up?
  p(x = 1 | D) = \int_0^1 \mu \, p(\mu | D) d\mu = E[\mu | D] = (m + a) / (m + l + a + b).
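A minimal numeric sketch of the Beta-Bernoulli reasoning above, assuming NumPy; the coin-toss data D and the prior hyperparameters a, b are made-up illustrations. It compares the ML estimate, the MAP estimate (posterior mode), and the posterior predictive probability of heads.

```python
# Minimal sketch (assumes NumPy) of the Beta-Bernoulli analysis above:
# ML estimate, MAP estimate under a Beta(a, b) prior, and the posterior
# predictive probability p(x = 1 | D) = (m + a) / (m + l + a + b).
# The coin-toss data and prior parameters a, b are made-up illustrations.
import numpy as np

D = np.array([1, 1, 1])          # three tosses, all heads (the overfitting example)
N = len(D)
m = int(D.sum())                 # number of heads
l = N - m                        # number of tails

mu_ml = m / N                    # ML estimate: 1.0 -> predicts heads forever

a, b = 2.0, 2.0                  # Beta prior hyperparameters
mu_map = (m + a - 1) / (N + a + b - 2)   # mode of the posterior Beta(m + a, l + b)
p_heads_next = (m + a) / (N + a + b)     # posterior predictive mean E[mu | D]

print(mu_ml, mu_map, p_heads_next)       # 1.0, 0.8, ~0.714
```

Note how the prior pulls the prediction away from the overconfident ML answer of 1.0.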
Multinomial Variables
1-of-K coding scheme: x is a K-dimensional binary vector with exactly one element x_k = 1, and
  p(x | \mu) = \prod_k \mu_k^{x_k}, with \mu_k >= 0 and \sum_k \mu_k = 1.

ML Parameter Estimation
Given D = {x_1, ..., x_N}, the likelihood is p(D | \mu) = \prod_k \mu_k^{m_k}, where m_k = \sum_n x_{nk}.
To ensure \sum_k \mu_k = 1, introduce a Lagrange multiplier \lambda and solve
  max_\mu \sum_k m_k \ln \mu_k + \lambda (\sum_k \mu_k - 1),
which gives \mu_k^{ML} = m_k / N.

The Multinomial Distribution
In contrast to the binomial distribution, the multinomial distribution is
  Mult(m_1, ..., m_K | \mu, N) = \frac{N!}{m_1! m_2! \cdots m_K!} \prod_k \mu_k^{m_k}.

The Dirichlet Distribution
The Dirichlet distribution is the conjugate prior for the multinomial distribution:
  Dir(\mu | \alpha) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_K)} \prod_k \mu_k^{\alpha_k - 1}, with \alpha_0 = \sum_k \alpha_k.

Bayesian Multinomial
Compute the posterior distribution p(\mu | D) \propto p(D | \mu) Dir(\mu | \alpha). It turns out that the posterior is still a Dirichlet distribution:
  p(\mu | D) = Dir(\mu | \alpha + m), where m = (m_1, ..., m_K)^T.

The Gaussian Distribution
Single variable: N(x | \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\{-\frac{(x - \mu)^2}{2\sigma^2}\}.
Multi-variable (D-dimensional x): N(x | \mu, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\{-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\}.

First-Order Moment of the Multivariate Gaussian
E[x] = \mu.

Second-Order Moment of the Multivariate Gaussian
E[x x^T] = \mu \mu^T + \Sigma, and hence cov[x] = \Sigma.

Bayes' Theorem for Gaussian Variables
Bayesian linear regression setting:
  x marginal: p(x) = N(x | \mu, \Lambda^{-1});
  y conditioned on x: p(y | x) = N(y | Ax + b, L^{-1}).
We can infer:
  y marginal: p(y) = N(y | A\mu + b, L^{-1} + A \Lambda^{-1} A^T);
  x conditioned on y: p(x | y) = N(x | \Sigma\{A^T L (y - b) + \Lambda \mu\}, \Sigma),
where \Sigma = (\Lambda + A^T L A)^{-1}.
Fact: given the two Gaussian distributions p(x) and p(y | x), the other two, p(y) and p(x | y), are Gaussian as well.

Maximum Likelihood for the Gaussian (1)
Given i.i.d. data X = {x_1, ..., x_N}, the log likelihood function is given by
  \ln p(X | \mu, \Sigma) = -\frac{ND}{2} \ln(2\pi) - \frac{N}{2} \ln|\Sigma| - \frac{1}{2} \sum_n (x_n - \mu)^T \Sigma^{-1} (x_n - \mu).
Set the derivative of the log likelihood with respect to \mu to zero, and solve to obtain
  \mu_{ML} = \frac{1}{N} \sum_n x_n.
Similarly, by setting the derivative with respect to \Sigma to zero, we obtain
  \Sigma_{ML} = \frac{1}{N} \sum_n (x_n - \mu_{ML})(x_n - \mu_{ML})^T.

Maximum Likelihood for the Gaussian (2)
Under the true distribution, E[\Sigma_{ML}] = \frac{N - 1}{N} \Sigma  [biased estimation].
Hence adjust to \tilde{\Sigma} = \frac{1}{N - 1} \sum_n (x_n - \mu_{ML})(x_n - \mu_{ML})^T  [unbiased estimation].

Sequential ML Estimation
Incrementally update \mu_{ML} given the Nth data point x_N:
  \mu_{ML}^{(N)} = \mu_{ML}^{(N-1)} + \frac{1}{N} (x_N - \mu_{ML}^{(N-1)}),
a fast correction of the old estimate, with correction weight 1/N.

Bayesian Inference for the Gaussian (1)
Assume \sigma^2 is known. Given i.i.d. data X = {x_1, ..., x_N}, the likelihood function for \mu is given by
  p(X | \mu) = \prod_n N(x_n | \mu, \sigma^2).
This has a Gaussian shape as a function of \mu, but it is not a distribution over \mu.

Bayesian Inference for the Gaussian (2)
Combined with a Gaussian prior over \mu, p(\mu) = N(\mu | \mu_0, \sigma_0^2), this gives the posterior
  p(\mu | X) \propto p(X | \mu) p(\mu).
Completing the square over \mu, we see that p(\mu | X) = N(\mu | \mu_N, \sigma_N^2) ...

Bayesian Inference for the Gaussian (3)
... where
  \mu_N = \frac{\sigma^2}{N\sigma_0^2 + \sigma^2} \mu_0 + \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2} \mu_{ML},
  \frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}.
Note: as N grows, the posterior mean moves from the prior mean \mu_0 toward \mu_{ML}, and the posterior variance shrinks.

Bayesian Inference for the Gaussian (4)
Example: the posterior p(\mu | X) for N = 0, 1, 2 and 10. (Figure.)

Bayesian Inference for the Gaussian (5)
Sequential estimation: the posterior obtained after observing N - 1 data points becomes a kind of prior when we observe the Nth data point.

Bayesian Inference for the Gaussian (6)
Now assume \mu is known. The likelihood function for the precision \lambda = 1/\sigma^2 is given by
  p(X | \lambda) = \prod_n N(x_n | \mu, \lambda^{-1}) \propto \lambda^{N/2} \exp\{-\frac{\lambda}{2} \sum_n (x_n - \mu)^2\}.
This has a Gamma shape as a function of \lambda.

Bayesian Inference for the Gaussian (7)
The Gamma distribution:
  Gam(\lambda | a, b) = \frac{1}{\Gamma(a)} b^a \lambda^{a - 1} \exp(-b\lambda).

Bayesian Inference for the Gaussian (8)
Now we combine the Gamma prior Gam(\lambda | a_0, b_0) with the likelihood function for \lambda, which amounts to the posterior
  p(\lambda | X) = Gam(\lambda | a_N, b_N), with a_N = a_0 + N/2 and b_N = b_0 + \frac{1}{2} \sum_n (x_n - \mu)^2.

Bayesian Inference for the Gaussian (9)
If both \mu and \lambda are unknown, the joint likelihood function is given by
  p(X | \mu, \lambda) = \prod_n N(x_n | \mu, \lambda^{-1}).
We need a prior with the same functional dependence on \mu and \lambda (more difficult).

Bayesian Inference for the Gaussian (10)
Multivariate conjugate priors:
1. \mu unknown, \Lambda known: p(\mu) is Gaussian.
2. \Lambda unknown, \mu known: p(\Lambda) is Wishart.
3. \mu and \Lambda unknown: p(\mu, \Lambda) is Gaussian-Wishart.

Mixtures of Gaussians (1)
Any probability distribution can be approximated by a mixture of multiple (possibly infinitely many) Gaussians.
(Figures: a single Gaussian; a mixture of two Gaussians.)

Mixtures of Gaussians (2)
Combine simple models into a complex model:
  p(x) = \sum_{k=1}^{K} \pi_k N(x | \mu_k, \Sigma_k)   (e.g., K = 3),
where each N(x | \mu_k, \Sigma_k) is a component and the mixing coefficients \pi_k are nonnegative and sum to 1.

Mixtures of Gaussians (3)
(Figure.)

Mixtures of Gaussians (4)
Determining the parameters \pi, \mu, and \Sigma using ML estimation: the log likelihood
  \ln p(X | \pi, \mu, \Sigma) = \sum_n \ln \sum_k \pi_k N(x_n | \mu_k, \Sigma_k)
involves the log of a sum, so there are no closed-form solutions.
Feasible solution: use standard, iterative, numeric optimization methods or the expectation maximization (EM) algorithm (to be studied in the clustering lectures); a minimal sketch follows.
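To make the "log of a sum" issue concrete, here is a minimal sketch, assuming NumPy, of the kind of iterative (EM-style) updates the slide points to, for a one-dimensional, two-component mixture; the synthetic data and initial parameter values are made-up, and this is not the lecture's own derivation.

```python
# Minimal sketch (assumes NumPy) of why the Gaussian-mixture log likelihood
# has no closed-form maximizer, and of iterative EM-style updates.
# One-dimensional data, K = 2 components; data and initial values are made up.
import numpy as np

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2.0, 1.0, 150), rng.normal(3.0, 0.5, 100)])

def gauss(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

pi = np.array([0.5, 0.5])        # mixing coefficients (initial guess)
mu = np.array([-1.0, 1.0])       # component means (initial guess)
var = np.array([1.0, 1.0])       # component variances (initial guess)

for _ in range(50):
    # E step: responsibilities gamma_{nk} = pi_k N(x_n | mu_k, var_k) / sum_j (...)
    dens = pi * gauss(x[:, None], mu, var)          # shape (N, 2)
    gamma = dens / dens.sum(axis=1, keepdims=True)
    # M step: closed-form updates *given* the responsibilities
    Nk = gamma.sum(axis=0)
    mu = (gamma * x[:, None]).sum(axis=0) / Nk
    var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    pi = Nk / len(x)

# The log likelihood is a log of a sum over components -- the reason
# no closed-form ML solution exists in the first place.
log_lik = np.log((pi * gauss(x[:, None], mu, var)).sum(axis=1)).sum()
print(pi, mu, var, log_lik)
```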
Central Limit Theorem
Closely related to the law of large numbers. The distribution of the sum, or of the arithmetic mean, of N i.i.d. random variables becomes increasingly Gaussian as N grows large enough.
Example: the mean of N uniform [0, 1] random variables.

Central Limit Theorem
Given i.i.d. random variables x_1, ..., x_N, each of which has mean \mu and variance \sigma^2, then for sufficiently large N we approximately have
  \frac{1}{N} \sum_{n=1}^{N} x_n \sim N(\mu, \sigma^2 / N).

Summary of Probability Distributions
Assume a proper probability distribution for the discrete or continuous random variable(s).
ML & MAP parameter estimation (for relatively simple distributions, closed-form solutions are obtained).
A Gaussian distribution is always a safe assumption, due to the Central Limit Theorem.

Summary of Probability Distributions
ML estimation: \theta_{ML} = argmax_\theta p(D | \theta).
MAP estimation: \theta_{MAP} = argmax_\theta p(D | \theta) p(\theta), where p(\theta) is the prior.

Courtesy to Christopher M. Bishop: some of the slides about probability are based on his slides for his book "Pattern Recognition and Machine Learning".
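As a closing numeric illustration of the Central Limit Theorem discussed above, here is a minimal simulation sketch, assuming NumPy; the values of N and the number of repetitions are arbitrary choices. It checks that the variance of the mean of N Uniform[0, 1] variables matches the \sigma^2 / N = 1/(12N) predicted for the limiting Gaussian.

```python
# Minimal sketch (assumes NumPy): means of N i.i.d. Uniform[0, 1] variables
# concentrate around mu = 0.5 with variance sigma^2 / N = 1 / (12 N),
# i.e., they become increasingly Gaussian. N and the repetition count are arbitrary.
import numpy as np

rng = np.random.default_rng(2)
for N in (1, 2, 10, 100):
    means = rng.uniform(0.0, 1.0, size=(100000, N)).mean(axis=1)
    # empirical variance of the sample mean vs. the CLT prediction 1/(12 N)
    print(N, means.var(), 1.0 / (12 * N))
```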