FINAL EXAM
ECEN 446 - Neural Networks
Information Theory, Inference, and Learning Algorithms
Joseph J. Boutros, 11 July 2021
Three Problems in Inference and Neural Networks (100 points)
Textbooks, lecture notes, scientific calculators are allowed. Laptops and tablets are
only authorized for consulting class notes, for visiting ECEN 446 web page, and for
running Jupyter-Notebook, Python 3, Pari/gp or WolframAlpha/Mathematica software.
Cellphones and chat applications are forbidden; they must stay away from your desk.
Total time duration is 120 minutes.
Part I - Information Theory
(20 points)
We recall that, for a discrete random variable X ∈ \mathcal{X} = \{x_1, x_2, \ldots, x_M\}, the entropy
of X is defined as

H(X) = -\sum_{x \in \mathcal{X}} P(X = x) \log(P(X = x)) = -\sum_{x \in \mathcal{X}} p_X(x) \log(p_X(x)) = -\sum_{i=1}^{M} p_i \log(p_i),    (1)
where pX (x) is the probability mass function of X and pi = P(X = xi ) = pX (xi ),
i = 1 . . . M. Here, the integer M could be finite or infinite (to be replaced by ∞), e.g.
if X is Poisson. For continuous random variables, we recall the similar definition of
differential entropy
h(X) = -\int_{-\infty}^{+\infty} p_X(x) \log(p_X(x)) \, dx,    (2)
where pX (x) is the probability density function of X, X ∈ R.
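For a quick numerical check of definition (1), the discrete entropy can be computed directly from a probability vector. The snippet below is a minimal sketch; the pmf and the natural-log base are illustrative assumptions only.

import numpy as np

# Minimal sketch of definition (1): entropy of a discrete pmf, in nats
# (replace np.log by np.log2 for bits). The pmf below is only an example.
def entropy(pmf):
    p = np.asarray(pmf, dtype=float)
    p = p[p > 0]                      # convention: 0 * log(0) = 0
    return -np.sum(p * np.log(p))

print(entropy([0.5, 0.5]))            # fair coin: log(2) ≈ 0.693 nats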
(a) Let X1 ∼ Bernoulli(p1 ) and X2 ∼ Bernoulli(p2 ), X1 and X2 independent. Define
S = X1 + X2 , where the sum is modulo 2. Find the expression of the entropy of S as a
function of p1 and p2 .
Hint: Determine P(S = 0) and P(S = 1).
(b) Compute H(S) for p1 = 1/2 and any p2 . Compute H(S) for p1 = p2 = 1/4. Does
the entropy increase after modulo 2 addition?
Hint: Compare H(S) to H(X1 ).
(c) Now S = X1 + X2 where + is the standard addition in R. The two random variables
are not independent, their joint distribution is given by the following table
           x1 = 0   x1 = 1   x1 = 2
x2 = 0      0/18     3/18     4/18
x2 = 1      4/18     2/18     2/18
x2 = 2      2/18     1/18     0/18
Table 1: Joint probability distribution for two random variables X1 and X2 .
Find the value of H(S).
Hint: Determine P(S = s), for all valid values of s.
(d) A feedforward dense neural network is used to classify MNIST images. As usual, the
output layer has ten units (logits) followed by a softmax function. Consider the softmax
output Y = (ŷ1 , ŷ2, . . . , ŷ10 ). Find a finite real number Λ that bounds the entropy H(Y )
from above, i.e. H(Y ) ≤ Λ.
Hint: Assume a uniform distribution for Y .
(e) Bonus question. A standard Gaussian random variable, N (0, 1), is applied at the
input of a sigmoid function. This could be the Gaussian-like distribution produced by
a neuron in an artificial neural network before applying a non-linearity. Determine the
differential entropy before and after applying the sigmoid activation.
The probability density function of the Gaussian distribution (before the sigmoid) and
the new squished distribution (after the sigmoid) are shown in Figure 1 for illustration.
[Plot: density function versus X, two curves labeled "Before Sigmoid" and "After Sigmoid".]
Figure 1: A standard normal distribution applied at a sigmoid input.
Part II - Inference Algorithms and Methods
(20 points)
The K-means algorithm, also known as the Lloyd-Max algorithm in Information Theory,
aims at assigning K centroids to a set of data. The centroids are points (or vectors) in
Rℓ , for a given dimension ℓ. The set of data, represented as a cloud of points, has N
points in R^ℓ. Figure 2 shows a dataset \{x_n\}_{n=1}^{N} of N = 2000 points and K = 10 centroids
\{c_k\}_{k=1}^{K}. The centroids are also called means because each centroid is the mean of its
neighboring points. In Figure 2, the whole dataset is grouped into a single cluster. In
this case, the K centroids serve for vector quantization, i.e. they approximate the shape of
the entire cloud such that the mean squared error (MSE) is minimized. The MSE is
MSE = \frac{\sum_{n=1}^{N} \| x_n - c(x_n) \|^2}{N},    (3)
where c(xn ) is the centroid of the Voronoi cell (local neighborhood or local cluster) to
which xn belongs. After a random initialization of the K centroids, the K-means algorithm proceeds in two steps:
Update the Voronoi Cells.
c(x_n) = c_{\hat{k}(n)} = \arg\min_{c_k} \| x_n - c_k \| = \text{closest centroid to } x_n, \quad n = 1, \ldots, N.    (4)
See (20.3) in Chap. 20 of Sir David MacKay's book. Now, for each point x_n, we know
that it belongs to the cell \hat{k}(n) with centroid c(x_n) = c_{\hat{k}(n)}; c(x_n) is the closest centroid to
x_n as given by (4). Let R^{(k)} be the number of points in cell k; we have \sum_{k=1}^{K} R^{(k)} = N.
Update the Centroids.
c_k = \frac{\sum_{n : c(x_n) = c_k} x_n}{R^{(k)}} = \frac{\text{sum of all points in cell } k}{\text{number of points in cell } k}, \quad k = 1, \ldots, K.    (5)
See (20.5) in Chap. 20 of Sir David MacKay’s book. The algorithm iterates over the two
steps until the MSE stops decreasing or until reaching a maximum number of iterations.
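As an illustration of the two steps (4) and (5), a minimal NumPy sketch is given below; the function and variable names are illustrative (they are not taken from the exam's notebook), and the initialization simply picks K points of the dataset.

import numpy as np

# Minimal sketch of the two K-means steps (4) and (5); names are illustrative.
def kmeans(X, K, n_iter=40, seed=0):
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=K, replace=False)].astype(float)   # initial centroids
    for _ in range(n_iter):
        # Step 1 (eq. 4): assign each point to its closest centroid (Voronoi cells)
        d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
        k_hat = np.argmin(d, axis=1)
        # Step 2 (eq. 5): move each centroid to the mean of its cell
        for k in range(K):
            if np.any(k_hat == k):
                C[k] = X[k_hat == k].mean(axis=0)
    mse = np.mean(np.linalg.norm(X - C[k_hat], axis=1) ** 2)          # eq. (3)
    return C, k_hat, mse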
In some cases, as shown in Figure 3, the dataset is already organized in clusters. The
K-means algorithm is used to find the best positions of the K centroids. Then, given a
new point, you can use its distance to the centroids to infer the cluster (i.e. the Voronoi
cell) this new point belongs to. In this case, the centroids serve for classification.
(a) The MSE versus the iteration number is shown in Figure 4 for the dataset of N = 2000
blue points of Figure 2. How many iterations are needed for the K-means algorithm to
reach convergence? Hint: Just follow the plot of Figure 4.
(b) The initialization of the K centroids can be done with two different methods: 1) select
K random points in the space R^ℓ; 2) select K random points among the N points of the
dataset. Which method is better? Explain.
(c) Assume that the dataset is organized in three clusters, before running the K-means
algorithm, as in Figure 3. Running the K-means with K = 3 centroids should lead to
one centroid per cluster as shown in Figure 3. What happens if we run the K-means
algorithm with K = 4 centroids or more?
(d) An undercomplete auto-encoder (AE) is trained on the MNIST dataset. The latent
dimension is 2. The user plots the 10 clusters with 10 different colors in the latent space
z = (z[0], z[1]). The label of each image of the training dataset is known in advance,
hence it is possible to choose the correct color for each point. Recall that the AE’s decoder reconstructs the input image from z and does not do any classification. Given a
new image with an unknown label, suggest a method to classify the image.
Hint: Propose a combination of the AE and the K-means.
(e) Bonus question. A user observes a sequence of N real samples x1 , x2 , . . . , xN of
a Gaussian random variable with mean µ and variance σ 2 . The user does not know µ
and σ^2, and must infer them from the observed sequence \{x_i\}_{i=1}^{N}. The likelihood of this
sequence is
p(x_1, x_2, \ldots, x_N \mid \mu, \sigma^2) = \frac{1}{\left(\sqrt{2\pi\sigma^2}\right)^{N}} \cdot \exp\!\left( -\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{2\sigma^2} \right).    (6)
What are the best estimates of µ and σ 2 such that the above likelihood is maximized?
Hint: Find the partial derivatives of log(p(x_1, \ldots, x_N | \mu, \sigma^2)) with respect to µ and σ^2.
Note: We saw at our final ECEN 446 lecture how a neural network can estimate the
parameters of a probability distribution. Here, in question (e) above, we are looking for
the mathematical expression of the best estimates without employing neural networks.
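As a purely numerical sanity check (not the requested closed-form derivation), the log-likelihood of (6) can be maximized with SciPy and compared with your analytical estimates; the sample data and starting point below are illustrative assumptions.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=1000)         # illustrative observed sequence

def neg_log_likelihood(theta, x):
    mu, log_var = theta                               # optimize log(σ^2) so the variance stays positive
    var = np.exp(log_var)
    return 0.5 * len(x) * np.log(2 * np.pi * var) + np.sum((x - mu) ** 2) / (2 * var)

res = minimize(neg_log_likelihood, x0=[0.0, 0.0], args=(x,))
mu_hat, var_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, var_hat)                                # compare with your closed-form answer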
Figure 2: Quantization of a set of N = 2000 points with K = 10 centroids.
Figure 3: Centroids found by K-means used for classification.
[Plot: MSE versus the K-means iteration; annotations read "MSE=363.02 at Iteration 0" and "MSE=18.51 at Iterations 29 to 40".]
Figure 4: MSE versus the K-means iteration for the cloud of 2000 points.
Part III - Artificial Neural Networks
(60 points)
In this problem, we consider a bi-dimensional dataset of random points belonging to
two classes. The first class of points is inside the circle centered at 0 with radius 1,
whereas the second class is inside a ring located between radius 1.5 and radius 2.5. All
points are perturbed with a Gaussian noise of zero mean and standard deviation 0.1. A
Bernoulli(p=0.6) process allows us to select 40% of points from the first class (red points)
and 60% of points from the second class (blue points). 4000 points of the training dataset
are plotted in Figure 5. We also generated a validation dataset of size 800 points. The
two datasets can be loaded into TensorFlow from the files:
training_circular_bidim_dataset.csv and test_circular_bidim_dataset.csv
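For reference, one plausible way to generate a dataset with the geometry described above is sketched below; the exact generator behind the .csv files is not given in the exam, so the function name and sampling details are assumptions.

import numpy as np

# Hypothetical generator for a circular two-class dataset (a sketch only):
# class 0 inside the unit disk, class 1 in the ring of radii 1.5 to 2.5,
# class 1 selected with probability 0.6, Gaussian perturbation of std 0.1.
def make_circular_dataset(n_points, seed=0):
    rng = np.random.default_rng(seed)
    labels = (rng.random(n_points) < 0.6).astype(int)
    theta = rng.uniform(0.0, 2.0 * np.pi, n_points)
    r = np.where(labels == 1,
                 rng.uniform(1.5, 2.5, n_points),
                 rng.uniform(0.0, 1.0, n_points))
    points = np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)
    points += rng.normal(0.0, 0.1, points.shape)
    return points, labels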
[Plot: the two classes of points in the (X1, X2) plane.]
Figure 5: A circular bi-dimensional dataset with 2 classes (2 categories/2 clusters).
The binary classification is performed via a 2-6-3-1 dense neural network depicted in
Figure 6. The input has 2 dimensions, the model has 2 hidden layers with tanh activation
functions, and the output has one unit only with a sigmoid. The output sigmoid is well
suited to our binary classification problem.
(a) Suppose the model input is x = (x_1, x_2)^t = (0.5, 0.5)^t. The 6 × 2 weight matrix of
the first hidden layer is set to

W_1 = \begin{pmatrix} 0.50 & -0.05 \\ -0.25 & 0.25 \\ 0.04 & 0.75 \\ 0.04 & -0.06 \\ 0.64 & 0.04 \\ -0.64 & 0.08 \end{pmatrix}.    (7)
The 3 × 6 weight matrix of the second hidden layer is set to

W_2 = \begin{pmatrix} 0.50 & 1.00 & -0.05 & 0.40 & -0.06 & 1.00 \\ 1.00 & -0.25 & 0.25 & 0.02 & 0.01 & -0.40 \\ 0.04 & 0.75 & 1.00 & -0.50 & 0.01 & 0.64 \end{pmatrix}.    (8)
Figure 6: A 2-6-3-1 model for the binary classification of our dataset.
Finally, the weights of the output layer are set to W3 = (2.0, 0.5, −1.0). We denote by
h1 = (h11 , h12 , . . . , h16 )t the output of the first hidden layer after the tanh activation. We
also denote by h2 = (h21 , h22 , h23 )t the output of the second hidden layer after the tanh
activation. The model output is denoted by the letter o. For simplicity, it is assumed
that the model units have no bias in question (a).
Using the linear-algebra equations of forward propagation, without a bias, we calculate h1 = φ(W1 x), h2 = φ(W2 h1 ), and the value of the output o = Ψ(W3 h2 ), where
φ(x) and Ψ(x) are the tanh and the sigmoid functions respectively. We get h1 =
(0.22127, 0.0, 0.37566, −0.00999, 0.32748, −0.27290)t, h2 = (−0.20188, 0.40317, 0.21473)t,
and o = 0.39725.
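For reference, the forward pass above can be reproduced with a few lines of NumPy; this is only a sketch of the computation described in the text, using the weights of (7) and (8) and no biases.

import numpy as np

# Forward pass of the 2-6-3-1 model with the weights of (7) and (8), no biases.
W1 = np.array([[ 0.50, -0.05], [-0.25,  0.25], [ 0.04,  0.75],
               [ 0.04, -0.06], [ 0.64,  0.04], [-0.64,  0.08]])
W2 = np.array([[0.50,  1.00, -0.05,  0.40, -0.06,  1.00],
               [1.00, -0.25,  0.25,  0.02,  0.01, -0.40],
               [0.04,  0.75,  1.00, -0.50,  0.01,  0.64]])
W3 = np.array([2.0, 0.5, -1.0])
x = np.array([0.5, 0.5])

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
h1 = np.tanh(W1 @ x)      # (0.22127, 0.0, 0.37566, -0.00999, 0.32748, -0.27290)
h2 = np.tanh(W2 @ h1)     # (-0.20188, 0.40317, 0.21473)
o = sigmoid(W3 @ h2)      # 0.39725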
(a) Based on the values computed by the above forward propagation, compute the gradient ∂o/∂w11 of the output o with respect to the weight parameter w11 between the first
input and the first unit on the first hidden layer. Use the rules of backpropagation.
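If you want to cross-check the hand-derived value numerically after the derivation, tf.GradientTape can differentiate the same forward pass; this is only a verification sketch, not the requested backpropagation derivation.

import tensorflow as tf

# Numerical cross-check of ∂o/∂w11 with automatic differentiation (verification only).
W1 = tf.Variable([[ 0.50, -0.05], [-0.25,  0.25], [ 0.04,  0.75],
                  [ 0.04, -0.06], [ 0.64,  0.04], [-0.64,  0.08]])
W2 = tf.constant([[0.50,  1.00, -0.05,  0.40, -0.06,  1.00],
                  [1.00, -0.25,  0.25,  0.02,  0.01, -0.40],
                  [0.04,  0.75,  1.00, -0.50,  0.01,  0.64]])
W3 = tf.constant([[2.0, 0.5, -1.0]])
x = tf.constant([[0.5], [0.5]])

with tf.GradientTape() as tape:
    h1 = tf.tanh(W1 @ x)
    h2 = tf.tanh(W2 @ h1)
    o = tf.sigmoid(W3 @ h2)
print(tape.gradient(o, W1)[0, 0].numpy())   # compare with your chain-rule result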
(b) If Xavier initialization is used for the 12 parameters in W1 with a Gaussian distribution, what should be the distribution variance?
(c) Build the 2-6-3-1 neural network of Figure 6 in the Jupyter notebook template received with the final exam PDF file. Set the activation functions to tanh for the hidden
layers. The output activation should be a sigmoid. Write the model construction via
TensorFlow/Keras. Save your Jupyter notebook before sending it back to the instructor
by email. After model.summary(), the neural network should look like this:
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense (Dense)                (None, 6)                 18
_________________________________________________________________
dense_1 (Dense)              (None, 3)                 21
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 4
=================================================================
Total params: 43
Trainable params: 43
Non-trainable params: 0
Set the epochs to 100, the batch size to 32, the learning rate to 0.0002, and the optimizer
to Adam. Run model.compile with the correct arguments and then launch the learning
by model.fit(train_points, train_labels, epochs=epochs). The training accuracy should be
above 99%. The validation accuracy is most likely 100%.
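One possible construction consistent with the summary and hyperparameters above is sketched below; it is not the only valid answer, and the binary cross-entropy loss is an assumption that matches the sigmoid output.

import tensorflow as tf

# A 2-6-3-1 Sequential model: tanh hidden layers, sigmoid output (a sketch).
model = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),
    tf.keras.layers.Dense(6, activation="tanh"),
    tf.keras.layers.Dense(3, activation="tanh"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0002),
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
# model.fit(train_points, train_labels, epochs=100, batch_size=32)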
(d) Replace the tanh activation of the hidden layers by ReLU. Try also sigmoids instead
of tanh. What are the accuracy results after changing the activation functions?
Good Luck.
Joseph J. Boutros.