
ML Final Review

Recap for Final
Sreyas Mohan
CDS at NYU
8 May 2019
Learning Theory Framework
1. Input space X, output space Y, and action space A
2. Prediction function f : X → A
3. Loss function ℓ : A × Y → R
4. Risk R(f) = E[ℓ(f(x), y)]
5. Bayes prediction function f* = argmin_f R(f)
6. Bayes risk R(f*)
7. Empirical risk R̂_n(f) = (1/n) Σ_{i=1}^n ℓ(f(x_i), y_i), the sample approximation of R(f)
8. Empirical risk minimization: f̂ = argmin_f R̂_n(f)
9. Constrained empirical risk minimization: minimize R̂_n(f) over f in a hypothesis class F
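To make the empirical-risk definitions concrete, here is a minimal sketch (my own, not from the slides) that estimates R̂_n under square loss and performs constrained ERM over a toy finite hypothesis class of constant predictors:

```python
import numpy as np

def empirical_risk(f, X, y, loss=lambda a, t: (a - t) ** 2):
    """R_hat_n(f) = (1/n) * sum_i loss(f(x_i), y_i)."""
    return float(np.mean([loss(f(x), t) for x, t in zip(X, y)]))

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=100)
y = X + rng.normal(scale=0.1, size=100)

# Constrained ERM over a toy hypothesis class F of constant predictors.
F = [lambda x, c=c: c for c in np.linspace(-1, 1, 21)]
f_hat = min(F, key=lambda f: empirical_risk(f, X, y))   # argmin over f in F
print(f_hat(0.0), empirical_risk(f_hat, X, y))
```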
Learning Theory Framework
f* = argmin_f E[ℓ(f(X), Y)]

f_F = argmin_{f ∈ F} E[ℓ(f(X), Y)]

f̂_n = argmin_{f ∈ F} (1/n) Σ_{i=1}^n ℓ(f(x_i), y_i)

Approximation error (of F) = R(f_F) − R(f*)
Estimation error (of f̂_n in F) = R(f̂_n) − R(f_F)
Approximation Error of Regression Stumps
Suppose x ∼ U[−1, 1] and y | x ∼ N(x, σ²). We are minimizing square loss.
Consider a hypothesis class H of regression stumps (functions that are constant on each side of a single split point). Which f ∈ H minimizes the risk, and hence attains the approximation error of H?
Solution
Note that f ∗ (x) = x and R(f ∗ ) = σ 2 .
Also R(f ) = E ((f (x) − y )2 ) = E [(f (x) − E [y |x])2 ] + E [(y − E [y |x])2 ]
We have to minimize E [(f (x) − E [y |x])2 ] = E [(f (x) − x)2 ]
For a stump with split point s, the best constant on each side is the conditional mean of x on that side; by symmetry the optimal split is s = 0, and the conditional means of x on [−1, 0] and (0, 1] are −0.5 and +0.5. The function we are looking for is therefore

f(x) = −0.5 for −1 ≤ x ≤ 0, and f(x) = +0.5 for 0 < x ≤ 1
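A quick Monte Carlo sanity check (my own sketch, not part of the slides) that this stump beats other split points and constants; its excess risk over f* should come out near 1/12:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200_000)

def stump_excess_risk(s, c_left, c_right):
    """Monte Carlo estimate of E[(f(x) - x)^2] for a stump with split point s."""
    pred = np.where(x <= s, c_left, c_right)
    return np.mean((pred - x) ** 2)

print(stump_excess_risk(0.0, -0.5, 0.5))    # ~1/12 = 0.0833..., the approximation error of H
print(stump_excess_risk(0.3, -0.35, 0.65))  # ~0.106, any other stump does worse
```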
Lasso and Ridge Regression
Linear regression closed-form solution: w = (X^T X)^{-1} X^T y
Ridge regression solution
Lasso regression solution methods: coordinate descent
Correlated features for lasso and ridge
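As a refresher, a minimal sketch (mine, not from the slides) of the standard closed-form solutions; ridge adds λI inside the inverse, while lasso has no closed form and is typically solved by coordinate descent:

```python
import numpy as np

def linear_regression(X, y):
    # w = (X^T X)^{-1} X^T y, assuming X^T X is invertible
    return np.linalg.solve(X.T @ X, X.T @ y)

def ridge_regression(X, y, lam):
    # w = (X^T X + lam * I)^{-1} X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)
print(linear_regression(X, y))
print(ridge_regression(X, y, lam=1.0))
```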
Optimization Methods
(True/False) Gradient Descent or SGD is only defined for convex loss
functions.
(True/False) Negative Subgradient direction is a descent direction
(True/False) Negative Stochastic Gradient direction is a descent
direction
Optimization Methods
False: gradient descent and SGD are defined for any (sub)differentiable objective; convexity is only needed for guarantees of convergence to a global minimum.
False, but it takes you closer to the minimizer (for a convex objective with a suitable step size).
False: the negative stochastic gradient is a descent direction only in expectation, not for every sample.
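A tiny numerical illustration (my own, under the usual finite-sum setup) of the third point: at a given iterate, a single-sample negative gradient can point away from the full-batch minimizer and increase the objective:

```python
import numpy as np

# Finite-sum objective F(w) = (1/n) * sum_i (w - x_i)^2 / 2, minimized at w = mean(x).
x = np.array([0.0, 10.0])
w = 1.0                      # current iterate; the full-batch minimizer is 5.0

full_grad = np.mean(w - x)   # = -4.0, so the batch step moves w toward 5.0
sgd_grad = w - x[0]          # gradient of the first sample's term = +1.0
print(full_grad, sgd_grad)   # the negative stochastic gradient step moves w away from 5.0
```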
SVM, RBF Kernel and Neural Network
Suppose we have fit a kernelized SVM to some training data (x_1, y_1), ..., (x_n, y_n) ∈ R × {−1, 1}, and we end up with a score function of the form

f(x) = Σ_{i=1}^n α_i k(x, x_i),

where k(x, x′) = ϕ(x − x′) for some function ϕ : R → R. Write f : R → R as a multilayer perceptron by specifying how to set m, the parameters a_i, w_i, b_i for i = 1, ..., m, and the activation function σ : R → R in the following expression:

f(x) = Σ_{i=1}^m a_i σ(w_i x + b_i)
Solution
Simply take m = n. Then ai = αi , wi = 1, and bi = −xi . Finally take
σ = ϕ.
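A quick check (my own sketch; the Gaussian-shaped ϕ is an arbitrary choice) that the kernel expansion and the MLP parameterization from the solution define the same function:

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.normal(size=5)          # training inputs x_1, ..., x_n
alpha = rng.normal(size=5)            # kernel-expansion coefficients alpha_i
phi = lambda t: np.exp(-t ** 2)       # k(x, x') = phi(x - x'), an RBF-type choice

def f_kernel(x):
    return np.sum(alpha * phi(x - x_train))

# MLP form from the solution: m = n, a_i = alpha_i, w_i = 1, b_i = -x_i, sigma = phi
a, w, b, sigma = alpha, np.ones(5), -x_train, phi
def f_mlp(x):
    return np.sum(a * sigma(w * x + b))

print(np.isclose(f_kernel(0.7), f_mlp(0.7)))   # True
```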
Question continued
Suppose you knew which data points were the support vectors. How would you use this information to reduce the size of the network?
Solution
Remove the hidden nodes corresponding to non-support vectors, i.e. those with α_i = 0.
Maximum Likelihood
Suppose we have samples x_1 ≤ x_2 ≤ · · · ≤ x_n drawn from U[a, b]. Find the maximum likelihood estimates of a and b.
ML Solution
The likelihood is

L(a, b) = Π_{i=1}^n (1/(b − a)) 1_{[a,b]}(x_i)

Let x_(1), ..., x_(n) be the order statistics.
The likelihood is greater than zero if and only if a ≤ x_(1) and b ≥ x_(n).
On that set, the likelihood equals (b − a)^{-n}, a monotonically decreasing function of (b − a).
The smallest feasible (b − a) is attained at a = x_(1) and b = x_(n).
Therefore a = x_(1) and b = x_(n) give us the MLE.
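A one-line sanity check (my own sketch, with a = 2 and b = 5 chosen arbitrarily): the MLE is simply the sample minimum and maximum:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(2.0, 5.0, size=1000)   # samples from U[a, b] with a = 2, b = 5
a_hat, b_hat = x.min(), x.max()        # MLE: a_hat = x_(1), b_hat = x_(n)
print(a_hat, b_hat)                    # close to (2, 5), biased slightly inward
```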
Bayesian
Bayesian Bernoulli Model
Suppose we have a coin with unknown probability of heads θ ∈ (0, 1). We
flip the coin n times and get a sequence of coin flips with nh heads and nt
tails.
Recall the following: A Beta(α, β) distribution, for shape parameters
α, β > 0, is a distribution supported on the interval (0, 1) with PDF given
by
f (x; α, β) ∝ x α−1 (1 − x)β−1 .
The mean of a Beta(α, β) distribution is α/(α + β). The mode is (α − 1)/(α + β − 2), assuming α, β ≥ 1 and α + β > 2. If α = β = 1, then every value in (0, 1) is a mode.
Question Continued
Which ONE of the following prior distributions on θ corresponds to a
strong belief that the coin is approximately fair (i.e. has an equal
probability of heads and tails)?
1. Beta(50, 50)
2. Beta(0.1, 0.1)
3. Beta(1, 100)
Question Continued
1. Give an expression for the likelihood function L_D(θ) for this sequence of flips.
2. Suppose we have a Beta(α, β) prior on θ, for some α, β > 0. Derive the posterior distribution on θ and, if it is a Beta distribution, give its parameters.
3. If your posterior distribution on θ is Beta(3, 6), what is your MAP estimate of θ?
Solution
(For the prior question: Beta(50, 50), since it concentrates tightly around θ = 0.5 and so encodes a strong belief that the coin is approximately fair.)

L_D(θ) = θ^{n_h} (1 − θ)^{n_t}

p(θ | D) ∝ p(θ) L_D(θ)
         ∝ θ^{α−1} (1 − θ)^{β−1} θ^{n_h} (1 − θ)^{n_t}
         ∝ θ^{n_h + α − 1} (1 − θ)^{n_t + β − 1}

Thus the posterior distribution is Beta(α + n_h, β + n_t).
Based on the information box above, the mode of a Beta distribution is (α − 1)/(α + β − 2) for α, β > 1, so the MAP estimate for a Beta(3, 6) posterior is (3 − 1)/(3 + 6 − 2) = 2/7.
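A small numerical check (my own; the counts n_h = 2, n_t = 5 with a Beta(1, 1) prior are chosen so the posterior is exactly Beta(3, 6)) of the conjugate update and the 2/7 MAP estimate:

```python
import numpy as np

alpha, beta, n_h, n_t = 1.0, 1.0, 2.0, 5.0     # prior Beta(1, 1); 2 heads, 5 tails
a_post, b_post = alpha + n_h, beta + n_t       # posterior Beta(3, 6)

theta = np.linspace(1e-4, 1 - 1e-4, 100_000)
log_post = (a_post - 1) * np.log(theta) + (b_post - 1) * np.log(1 - theta)
print(theta[np.argmax(log_post)])              # ~0.2857 = 2/7 (grid argmax of the density)
print((a_post - 1) / (a_post + b_post - 2))    # closed-form mode: 2/7
```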
Bootstrap
What is the probability that a given data point is not picked when creating a bootstrap sample (n draws with replacement from n points)?
Suppose the dataset is fairly large. In expectation, what fraction of our bootstrap sample will be unique?
Bootstrap Solutions
1. (1 − 1/n)^n
2. As n → ∞, (1 − 1/n)^n → 1/e. So about 1 − 1/e ≈ 63% of the bootstrap sample will be unique.
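A quick simulation (my own sketch) of both answers:

```python
import numpy as np

n = 10_000
rng = np.random.default_rng(0)
sample = rng.integers(0, n, size=n)                # bootstrap sample: n draws with replacement

print((1 - 1 / n) ** n, np.exp(-1))                # P(a given point is never picked) -> 1/e
print(len(np.unique(sample)) / n, 1 - np.exp(-1))  # fraction of unique points -> 1 - 1/e
```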
Random Forest and Boosting
Indicate whether each of the statements (about random forests and
gradient boosting) is true or false.
1. True or False: If your gradient boosting model is overfitting, taking additional steps is likely to help.
2. True or False: In gradient boosting, if you reduce your step size, you should expect to need fewer rounds of boosting (i.e. fewer steps) to achieve the same training set loss.
3. True or False: Fitting a random forest model is extremely easy to parallelize.
4. True or False: Fitting a gradient boosting model is extremely easy to parallelize, for any base regression algorithm.
5. True or False: Suppose we apply gradient boosting with absolute loss to a regression problem. If we use linear ridge regression as our base regression algorithm, the final prediction function from gradient boosting will always be an affine function of the input.
Solution
False, False, True, False, True
Hypothesis space of GBM and RF
Let H_B represent a base hypothesis class of (small) regression trees. Let H_R = {g | g = Σ_{i=1}^T (1/T) f_i, f_i ∈ H_B} represent the hypothesis space of prediction functions in a random forest with T trees, where each tree is picked from H_B. Let H_G = {g | g = Σ_{i=1}^T ν_i f_i, f_i ∈ H_B, ν_i ∈ R} represent the hypothesis space of prediction functions in gradient boosting with T trees.
True or False:
1. If f ∈ H_R then αf ∈ H_R for all α ∈ R
2. If f ∈ H_G then αf ∈ H_G for all α ∈ R
3. If f ∈ H_G then f ∈ H_R
4. If f ∈ H_R then f ∈ H_G
Solutions
True, True, True, True
Neural Networks
1. True or False: Consider a hypothesis space H of prediction functions f : R^d → R given by a multilayer perceptron (MLP) with 3 hidden layers, each consisting of m nodes, for which the activation function is σ(x) = cx for some fixed c ∈ R. Then this hypothesis space is strictly larger than the set of all affine functions mapping R^d to R.
2. True or False: Let g : [0, 1]^d → R be any continuous function on the compact set [0, 1]^d. Then for any ε > 0, there exist m ∈ {1, 2, 3, ...}, a = (a_1, ..., a_m) ∈ R^m, b = (b_1, ..., b_m) ∈ R^m, and a matrix W ∈ R^{m×d} with rows w_1^T, ..., w_m^T for which the function f : [0, 1]^d → R given by

f(x) = Σ_{i=1}^m a_i max(0, w_i^T x + b_i)

satisfies |f(x) − g(x)| < ε for all x ∈ [0, 1]^d.
Solutions
False, True
Neural Networks
We have a neural network with one input neuron in the first layer, a hidden layer with n neurons, and an output layer with 1 neuron. The hidden layer has ReLU activation functions. You are allowed to use biases. Explain whether the following functions can be represented in this architecture and, if yes, give a value of n and a set of weights for which this is true.
1. f(x) = |x|
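For this one, |x| = ReLU(x) + ReLU(−x), so n = 2 suffices with hidden weights (1, −1), zero biases, and output weights (1, 1). A minimal check (my own sketch, not from the slides):

```python
import numpy as np

relu = lambda z: np.maximum(0.0, z)

w_hidden = np.array([1.0, -1.0])   # two hidden units compute ReLU(x) and ReLU(-x)
b_hidden = np.zeros(2)
w_out = np.array([1.0, 1.0])       # output neuron sums the two hidden activations

f = lambda x: w_out @ relu(w_hidden * x + b_hidden)
xs = np.linspace(-3, 3, 7)
print(all(np.isclose(f(x), abs(x)) for x in xs))   # True: the network computes |x|
```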
Backprop
Suppose f(x; θ) and g(x; θ) are differentiable with respect to θ. For any example (x, y), the log-likelihood function is J = L(θ; x, y). Give an expression for the scalar value ∂J/∂θ_i in terms of the "local" partial derivatives at each node. That is, you may assume you know the partial derivative of the output of any node with respect to each of its scalar inputs. For example, you may write your expression in terms of ∂J/∂α, ∂J/∂β, ∂α/∂s_α, ∂s_α/∂θ_i^(1), etc. Note that, as discussed in lecture, θ^(1) and θ^(2) represent identical copies of θ, and similarly for x^(1) and x^(2).
Computation graph: θ ∈ R^m, x ∈ R^d, y ∈ (0, 1); θ^(1), θ^(2) are copies of θ and x^(1), x^(2) are copies of x; s_α = f(x^(1); θ^(1)), α = ψ(s_α); s_β = g(x^(2); θ^(2)), β = ψ(s_β); J = L(α, β; y).
Backprop solution
∂J/∂θ_i = (∂J/∂α) (∂α/∂s_α) (∂s_α/∂θ_i^(1)) + (∂J/∂β) (∂β/∂s_β) (∂s_β/∂θ_i^(2))
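A concrete instance of this two-branch chain rule (the particular f, g, ψ, and L below are entirely my own choices, with scalar θ and x), checked against a finite difference:

```python
import numpy as np

# Hypothetical choices (not from the slides): f(x, th) = th * x, g(x, th) = th + x,
# psi = tanh, and L(alpha, beta; y) = (alpha * beta - y)^2.
def forward(theta, x, y):
    s_a = theta * x          # s_alpha = f(x, theta)
    s_b = theta + x          # s_beta  = g(x, theta)
    a, b = np.tanh(s_a), np.tanh(s_b)
    return (a * b - y) ** 2, (a, b)

def backward(theta, x, y):
    _, (a, b) = forward(theta, x, y)
    dJ_da, dJ_db = 2 * (a * b - y) * b, 2 * (a * b - y) * a   # local derivatives of L
    da_dsa, db_dsb = 1 - a ** 2, 1 - b ** 2                   # derivative of tanh
    dsa_dth, dsb_dth = x, 1.0                                 # derivatives of f and g w.r.t. theta
    # Sum the contributions flowing through the two copies theta^(1) and theta^(2).
    return dJ_da * da_dsa * dsa_dth + dJ_db * db_dsb * dsb_dth

theta, x, y, eps = 0.3, 1.5, 0.2, 1e-6
fd = (forward(theta + eps, x, y)[0] - forward(theta - eps, x, y)[0]) / (2 * eps)
print(backward(theta, x, y), fd)   # the two values agree
```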
KMeans
KMeans Solution
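As a refresher on this topic, a minimal sketch of the standard Lloyd's algorithm for K-means (my own, not taken from the slides):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Standard Lloyd's algorithm: alternate assignment and centroid-update steps."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # initialize from data points
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned points.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 0.3, size=(50, 2)) for loc in ([0, 0], [3, 3], [0, 3])])
centers, labels = kmeans(X, k=3)
print(centers)
```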
Mixture Models
Suppose we have a latent variable z ∈ {1, 2, 3} and an observed variable
x ∈ (0, ∞) generated as follows:
z ∼ Categorical(π1 , π2 , π3 )
x | z ∼ Gamma(2, βz ),
where (β1, β2, β3) ∈ (0, ∞)^3, and Gamma(2, β) is supported on (0, ∞) and has density p(x) = β^2 x e^{−βx}. Suppose we know that β1 = 1, β2 = 2, β3 = 4. Give an explicit expression for p(z = 1 | x = 1) in terms of the unknown parameters π1, π2, π3.
Solutions
p(z = 1 | x = 1) ∝ p(x = 1 | z = 1) p(z = 1) = π1 e^{−1}
p(z = 2 | x = 1) ∝ p(x = 1 | z = 2) p(z = 2) = 4 π2 e^{−2}
p(z = 3 | x = 1) ∝ p(x = 1 | z = 3) p(z = 3) = 16 π3 e^{−4}

p(z = 1 | x = 1) = π1 e^{−1} / (π1 e^{−1} + 4 π2 e^{−2} + 16 π3 e^{−4})
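A quick numeric check of the posterior formula (my own sketch; the mixing weights below are arbitrary placeholders for the unknown π's):

```python
import numpy as np

pi = np.array([0.5, 0.3, 0.2])                    # arbitrary mixing weights pi_1..pi_3
beta = np.array([1.0, 2.0, 4.0])
x = 1.0

likelihood = beta ** 2 * x * np.exp(-beta * x)    # Gamma(2, beta_z) density at x = 1
posterior = pi * likelihood / np.sum(pi * likelihood)
print(posterior[0])                               # p(z = 1 | x = 1)
```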