Recap for Final
Sreyas Mohan
CDS at NYU
8 May 2019

Learning Theory Framework

1. Input space $\mathcal{X}$, output space $\mathcal{Y}$, and action space $\mathcal{A}$
2. Prediction function $f : \mathcal{X} \to \mathcal{A}$
3. Loss function $\ell : \mathcal{A} \times \mathcal{Y} \to \mathbb{R}$
4. Risk $R(f) = \mathbb{E}[\ell(f(x), y)]$
5. Bayes prediction function $f^* = \operatorname{argmin}_f R(f)$
6. Bayes risk $R(f^*)$
7. Empirical risk $\hat{R}_n(f)$, the sample approximation to $R(f)$: $\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^n \ell(f(x_i), y_i)$
8. Empirical risk minimization: $\hat{f} = \operatorname{argmin}_f \hat{R}_n(f)$
9. Constrained empirical risk minimization

Learning Theory Framework

$f^* = \operatorname{argmin}_f \mathbb{E}[\ell(f(X), Y)]$

$f_{\mathcal{F}} = \operatorname{argmin}_{f \in \mathcal{F}} \mathbb{E}[\ell(f(X), Y)]$

$\hat{f}_n = \operatorname{argmin}_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \ell(f(x_i), y_i)$

Approximation error (of $\mathcal{F}$) $= R(f_{\mathcal{F}}) - R(f^*)$
Estimation error (of $\hat{f}_n$ in $\mathcal{F}$) $= R(\hat{f}_n) - R(f_{\mathcal{F}})$

Approximation Error of Regression Stumps

Suppose $x \sim U[-1, 1]$ and $y \mid x \sim \mathcal{N}(x, \sigma^2)$, and we are minimizing square loss. Consider a hypothesis class $\mathcal{H}$ made of regression stumps. What is the $f \in \mathcal{H}$ that minimizes the approximation error of $\mathcal{H}$?

Solution

Note that $f^*(x) = x$ and $R(f^*) = \sigma^2$. Also,

$R(f) = \mathbb{E}[(f(x) - y)^2] = \mathbb{E}[(f(x) - \mathbb{E}[y \mid x])^2] + \mathbb{E}[(y - \mathbb{E}[y \mid x])^2]$,

so we have to minimize $\mathbb{E}[(f(x) - \mathbb{E}[y \mid x])^2] = \mathbb{E}[(f(x) - x)^2]$. The stump we are looking for is

$f(x) = \begin{cases} -0.5 & -1 \le x \le 0 \\ +0.5 & 0 < x \le 1 \end{cases}$

Lasso and Ridge Regression

- Linear regression closed-form expression: $w = (X^T X)^{-1} X^T y$
- Ridge regression solution: for the objective $\|Xw - y\|_2^2 + \lambda \|w\|_2^2$, $w = (X^T X + \lambda I)^{-1} X^T y$
- Lasso regression solution methods: coordinate descent
- Correlated features for lasso and ridge

Optimization Methods

1. (True/False) Gradient descent and SGD are only defined for convex loss functions.
2. (True/False) The negative subgradient direction is a descent direction.
3. (True/False) The negative stochastic gradient direction is a descent direction.

Optimization Methods

1. False.
2. False, but a sufficiently small step in that direction takes you closer to the minimizer.
3. False.

SVM, RBF Kernel and Neural Network

Suppose we have fit a kernelized SVM to some training data $(x_1, y_1), \ldots, (x_n, y_n) \in \mathbb{R} \times \{-1, 1\}$, and we end up with a score function of the form

$f(x) = \sum_{i=1}^n \alpha_i k(x, x_i)$,

where $k(x, x') = \varphi(x - x')$ for some function $\varphi : \mathbb{R} \to \mathbb{R}$. Write $f : \mathbb{R} \to \mathbb{R}$ as a multilayer perceptron by specifying how to set $m$ and $a_i, w_i, b_i$ for $i = 1, \ldots, m$, and the activation function $\sigma : \mathbb{R} \to \mathbb{R}$, in the following expression:

$f(x) = \sum_{i=1}^m a_i \sigma(w_i x + b_i)$

Solution

Simply take $m = n$. Then $a_i = \alpha_i$, $w_i = 1$, and $b_i = -x_i$. Finally, take $\sigma = \varphi$.

Question continued

Suppose you knew which data points were support vectors. How would you use this information to reduce the size of the network?

Solution

Remove the hidden units corresponding to non-support vectors, i.e. those with $\alpha_i = 0$.

Maximum Likelihood

Suppose we have samples $x_1 \le x_2 \le \cdots \le x_n$ drawn from $U[a, b]$. Find the maximum likelihood estimates of $a$ and $b$.

ML Solution

The likelihood is

$L(a, b) = \prod_{i=1}^n \frac{1}{b - a} \mathbf{1}_{[a, b]}(x_i)$.

Let $x_{(1)}, \ldots, x_{(n)}$ be the order statistics. The likelihood is greater than zero if and only if $a \le x_{(1)}$ and $b \ge x_{(n)}$. On that set, the likelihood is a monotonically decreasing function of $b - a$, and the smallest value of $b - a$ is attained at $a = x_{(1)}$ and $b = x_{(n)}$. Therefore $\hat{a} = x_{(1)}$ and $\hat{b} = x_{(n)}$ are the MLEs.
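As a quick numerical sanity check of this result (not from the original slides), the sketch below evaluates the $U[a, b]$ log-likelihood at the claimed MLE and at a few nearby intervals; the true parameters $a = 2$, $b = 5$, the sample size, and the seed are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
a_true, b_true = 2.0, 5.0            # illustrative true parameters
x = rng.uniform(a_true, b_true, size=200)

def log_likelihood(a, b, x):
    """Log-likelihood of U[a, b]; -inf if any point falls outside [a, b]."""
    if x.min() < a or x.max() > b:
        return -np.inf
    return -len(x) * np.log(b - a)

a_mle, b_mle = x.min(), x.max()      # the claimed MLE: order statistics x_(1), x_(n)
print("MLE:", a_mle, b_mle, "log-lik:", log_likelihood(a_mle, b_mle, x))

# Widening the interval in either direction lowers the likelihood,
# and shrinking it past an observed point drives the likelihood to zero.
for a, b in [(a_mle - 0.1, b_mle), (a_mle, b_mle + 0.1), (a_mle + 0.1, b_mle)]:
    print((round(a, 3), round(b, 3)), "log-lik:", log_likelihood(a, b, x))
```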
Bayesian Bernoulli Model

Suppose we have a coin with unknown probability of heads $\theta \in (0, 1)$. We flip the coin $n$ times and get a sequence of coin flips with $n_h$ heads and $n_t$ tails. Recall the following: a Beta$(\alpha, \beta)$ distribution, for shape parameters $\alpha, \beta > 0$, is a distribution supported on the interval $(0, 1)$ with PDF given by

$f(x; \alpha, \beta) \propto x^{\alpha - 1} (1 - x)^{\beta - 1}$.

The mean of a Beta$(\alpha, \beta)$ distribution is $\frac{\alpha}{\alpha + \beta}$. The mode is $\frac{\alpha - 1}{\alpha + \beta - 2}$, assuming $\alpha, \beta \ge 1$ and $\alpha + \beta > 2$. If $\alpha = \beta = 1$, then every value in $(0, 1)$ is a mode.

Question continued

Which ONE of the following prior distributions on $\theta$ corresponds to a strong belief that the coin is approximately fair (i.e. has an equal probability of heads and tails)?

1. Beta(50, 50)
2. Beta(0.1, 0.1)
3. Beta(1, 100)

Question continued

1. Give an expression for the likelihood function $L_D(\theta)$ for this sequence of flips.
2. Suppose we have a Beta$(\alpha, \beta)$ prior on $\theta$, for some $\alpha, \beta > 0$. Derive the posterior distribution on $\theta$ and, if it is a Beta distribution, give its parameters.
3. If your posterior distribution on $\theta$ is Beta(3, 6), what is your MAP estimate of $\theta$?

Solution

$L_D(\theta) = \theta^{n_h} (1 - \theta)^{n_t}$

$p(\theta \mid D) \propto p(\theta) L_D(\theta) \propto \theta^{\alpha - 1} (1 - \theta)^{\beta - 1} \theta^{n_h} (1 - \theta)^{n_t} \propto \theta^{n_h + \alpha - 1} (1 - \theta)^{n_t + \beta - 1}$

Thus the posterior distribution is Beta$(\alpha + n_h, \beta + n_t)$. From the information box above, the mode of a Beta$(\alpha, \beta)$ distribution is $\frac{\alpha - 1}{\alpha + \beta - 2}$ for $\alpha, \beta > 1$, so the MAP estimate under a Beta(3, 6) posterior is $\frac{3 - 1}{3 + 6 - 2} = \frac{2}{7}$.

Bootstrap

1. What is the probability that a given datapoint is not picked while creating a bootstrap sample?
2. Suppose the dataset is fairly large. In expectation, what fraction of our bootstrap sample will be unique?

Bootstrap Solutions

1. $(1 - \frac{1}{n})^n$
2. As $n \to \infty$, $(1 - \frac{1}{n})^n \to \frac{1}{e}$, so roughly a $1 - \frac{1}{e}$ fraction of the sample is unique.

Random Forest and Boosting

Indicate whether each of the following statements (about random forests and gradient boosting) is true or false.

1. True or False: If your gradient boosting model is overfitting, taking additional steps is likely to help.
2. True or False: In gradient boosting, if you reduce your step size, you should expect to need fewer rounds of boosting (i.e. fewer steps) to achieve the same training set loss.
3. True or False: Fitting a random forest model is extremely easy to parallelize.
4. True or False: Fitting a gradient boosting model is extremely easy to parallelize, for any base regression algorithm.
5. True or False: Suppose we apply gradient boosting with absolute loss to a regression problem. If we use linear ridge regression as our base regression algorithm, the final prediction function from gradient boosting will always be an affine function of the input.

Solution

False, False, True, False, True.
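Before the next question, a minimal sketch (not from the slides) of the two ensembles built from depth-1 regression trees may help: a random forest averages independently grown trees with weight $1/T$, while gradient boosting with square loss adds trees fit to the current residuals, each scaled by a step size $\nu$. The dataset, $T = 50$, and $\nu = 0.1$ are illustrative assumptions, and scikit-learn's DecisionTreeRegressor stands in for the base regression algorithm.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=200)    # illustrative 1-d regression data

T, nu = 50, 0.1

# Random forest (simplified; no feature subsampling since d = 1):
# average T trees, each fit on a bootstrap sample.
rf_trees = []
for _ in range(T):
    idx = rng.integers(0, len(X), size=len(X))           # bootstrap sample
    rf_trees.append(DecisionTreeRegressor(max_depth=1).fit(X[idx], y[idx]))
rf_pred = np.mean([t.predict(X) for t in rf_trees], axis=0)   # each tree weighted 1/T

# Gradient boosting with square loss: each new tree is fit to the residuals.
gb_pred = np.zeros(len(X))
for _ in range(T):
    tree = DecisionTreeRegressor(max_depth=1).fit(X, y - gb_pred)
    gb_pred += nu * tree.predict(X)                       # each tree weighted nu

print("RF train MSE:", np.mean((rf_pred - y) ** 2))
print("GB train MSE:", np.mean((gb_pred - y) ** 2))
```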
Hypothesis space of GBM and RF

Let $\mathcal{H}_B$ represent the hypothesis class of (small) regression trees, used as the base class. Let

$\mathcal{H}_R = \{ g \mid g = \sum_{i=1}^T \frac{1}{T} f_i,\ f_i \in \mathcal{H}_B \}$

represent the hypothesis space of prediction functions of a random forest with $T$ trees, where each tree is picked from $\mathcal{H}_B$. Let

$\mathcal{H}_G = \{ g \mid g = \sum_{i=1}^T \nu_i f_i,\ f_i \in \mathcal{H}_B,\ \nu_i \in \mathbb{R} \}$

represent the hypothesis space of prediction functions of gradient boosting with $T$ trees.

True or False:

1. If $f_i \in \mathcal{H}_R$ then $\alpha f_i \in \mathcal{H}_R$ for all $\alpha \in \mathbb{R}$.
2. If $f_i \in \mathcal{H}_G$ then $\alpha f_i \in \mathcal{H}_G$ for all $\alpha \in \mathbb{R}$.
3. If $f_i \in \mathcal{H}_G$ then $f_i \in \mathcal{H}_R$.
4. If $f_i \in \mathcal{H}_R$ then $f_i \in \mathcal{H}_G$.

Solutions

True, True, True, True.

Neural Networks

1. True or False: Consider a hypothesis space $\mathcal{H}$ of prediction functions $f : \mathbb{R}^d \to \mathbb{R}$ given by a multilayer perceptron (MLP) with 3 hidden layers, each consisting of $m$ nodes, for which the activation function is $\sigma(x) = cx$ for some fixed $c \in \mathbb{R}$. Then this hypothesis space is strictly larger than the set of all affine functions mapping $\mathbb{R}^d$ to $\mathbb{R}$.

2. True or False: Let $g : [0, 1]^d \to \mathbb{R}$ be any continuous function on the compact set $[0, 1]^d$. Then for any $\epsilon > 0$, there exist $m \in \{1, 2, 3, \ldots\}$, $b = (b_1, \ldots, b_m) \in \mathbb{R}^m$, $a = (a_1, \ldots, a_m) \in \mathbb{R}^m$, and a matrix $W \in \mathbb{R}^{m \times d}$ with rows $w_1^T, \ldots, w_m^T$, for which the function $f : [0, 1]^d \to \mathbb{R}$ given by

$f(x) = \sum_{i=1}^m a_i \max(0, w_i^T x + b_i)$

satisfies $|f(x) - g(x)| < \epsilon$ for all $x \in [0, 1]^d$.

Solutions

False, True.

Neural Networks

We have a neural network with one input neuron in the first layer, a hidden layer with $n$ neurons, and an output layer with 1 neuron. The hidden layer has ReLU activation functions. You are allowed to use bias terms. Explain whether the following functions can be represented by this architecture and, if yes, give a value of $n$ and a set of weights for which this is true.

1. $f(x) = |x|$

Backprop

Suppose $f(x; \theta)$ and $g(x; \theta)$ are differentiable with respect to $\theta$. For any example $(x, y)$, the log-likelihood function is $J = L(\theta; x, y)$. Give an expression for the scalar value $\frac{\partial J}{\partial \theta_i}$ in terms of the "local" partial derivatives at each node. That is, you may assume you know the partial derivative of the output of any node with respect to each of its scalar inputs. For example, you may write your expression in terms of $\frac{\partial J}{\partial \alpha}$, $\frac{\partial J}{\partial \beta}$, $\frac{\partial \alpha}{\partial s_\alpha}$, $\frac{\partial s_\alpha}{\partial \theta_i^{(1)}}$, etc. Note that, as discussed in lecture, $\theta^{(1)}$ and $\theta^{(2)}$ represent identical copies of $\theta$, and similarly for $x^{(1)}$ and $x^{(2)}$.

Computation graph: $\theta \in \mathbb{R}^m$ and $x \in \mathbb{R}^d$ are copied into $(\theta^{(1)}, x^{(1)})$ and $(\theta^{(2)}, x^{(2)})$; then $s_\alpha = f(x^{(1)}, \theta^{(1)})$, $\alpha = \psi(s_\alpha)$, $s_\beta = g(x^{(2)}, \theta^{(2)})$, $\beta = \psi(s_\beta)$, and finally $J = L(\alpha, \beta; y)$ with $y \in (0, 1)$.

Backprop solution

$\frac{\partial J}{\partial \theta_i} = \frac{\partial J}{\partial \alpha} \frac{\partial \alpha}{\partial s_\alpha} \frac{\partial s_\alpha}{\partial \theta_i^{(1)}} + \frac{\partial J}{\partial \beta} \frac{\partial \beta}{\partial s_\beta} \frac{\partial s_\beta}{\partial \theta_i^{(2)}}$

KMeans

KMeans Solution
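The KMeans question and solution slides themselves are not reproduced in this text version, so as a stand-in here is a minimal NumPy sketch of the standard k-means (Lloyd's) algorithm: alternate assigning points to their nearest centroid and moving each centroid to the mean of its assigned points. The three-blob dataset and $k = 3$ are illustrative choices, not taken from the slides.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain Lloyd's algorithm: alternate cluster assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # initialize at random points
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Illustrative data: three well-separated Gaussian blobs in 2-d.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in ([0, 0], [3, 0], [0, 3])])
centers, labels = kmeans(X, k=3)
print(np.round(centers, 2))
```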
Mixture Models

Suppose we have a latent variable $z \in \{1, 2, 3\}$ and an observed variable $x \in (0, \infty)$ generated as follows:

$z \sim \text{Categorical}(\pi_1, \pi_2, \pi_3)$
$x \mid z \sim \text{Gamma}(2, \beta_z)$,

where $(\beta_1, \beta_2, \beta_3) \in (0, \infty)^3$, and Gamma$(2, \beta)$ is supported on $(0, \infty)$ and has density $p(x) = \beta^2 x e^{-\beta x}$. Suppose we know that $\beta_1 = 1$, $\beta_2 = 2$, $\beta_3 = 4$. Give an explicit expression for $p(z = 1 \mid x = 1)$ in terms of the unknown parameters $\pi_1, \pi_2, \pi_3$.

Solutions

$p(z = 1 \mid x = 1) \propto p(x = 1 \mid z = 1)\, p(z = 1) = \pi_1 e^{-1}$
$p(z = 2 \mid x = 1) \propto p(x = 1 \mid z = 2)\, p(z = 2) = 4 \pi_2 e^{-2}$
$p(z = 3 \mid x = 1) \propto p(x = 1 \mid z = 3)\, p(z = 3) = 16 \pi_3 e^{-4}$

Normalizing,

$p(z = 1 \mid x = 1) = \dfrac{\pi_1 e^{-1}}{\pi_1 e^{-1} + 4 \pi_2 e^{-2} + 16 \pi_3 e^{-4}}$
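As a final sanity check (not part of the original slides), the closed-form posterior above can be compared against a direct numerical computation with the Gamma$(2, \beta)$ density; the mixture weights $\pi = (0.5, 0.3, 0.2)$ are arbitrary illustrative values, since they are unknown in the problem.

```python
import numpy as np

def gamma2_pdf(x, beta):
    """Density of Gamma(shape=2, rate=beta): beta^2 * x * exp(-beta * x)."""
    return beta**2 * x * np.exp(-beta * x)

pi = np.array([0.5, 0.3, 0.2])        # illustrative mixture weights (unknown in the problem)
betas = np.array([1.0, 2.0, 4.0])
x = 1.0

joint = pi * gamma2_pdf(x, betas)     # p(z = k) * p(x = 1 | z = k) for k = 1, 2, 3
posterior = joint / joint.sum()

closed_form = pi[0] * np.e**-1 / (pi[0] * np.e**-1 + 4 * pi[1] * np.e**-2 + 16 * pi[2] * np.e**-4)
print(posterior[0], closed_form)      # the two values agree
```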