FINAL EXAM ECEN 446 - Neural Networks
Information Theory, Inference, and Learning Algorithms
Joseph J. Boutros, 11 July 2021
Three Problems in Inference and Neural Networks (100 points)

Textbooks, lecture notes, and scientific calculators are allowed. Laptops and tablets are authorized only for consulting class notes, visiting the ECEN 446 web page, and running Jupyter Notebook, Python 3, Pari/gp, or WolframAlpha/Mathematica software. Cellphones and chat applications are forbidden; they must stay away from your desk. The total duration is 120 minutes.

Part I - Information Theory (20 points)

We recall that, for a discrete random variable X ∈ 𝒳 = {x_1, x_2, ..., x_M}, the entropy of X is defined as

    H(X) = -\sum_{x \in \mathcal{X}} P(X=x) \log P(X=x) = -\sum_{x \in \mathcal{X}} p_X(x) \log p_X(x) = -\sum_{i=1}^{M} p_i \log p_i,    (1)

where p_X(x) is the probability mass function of X and p_i = P(X = x_i) = p_X(x_i), i = 1...M. Here, the integer M may be finite or infinite (to be replaced by ∞), e.g. if X is Poisson. For continuous random variables, we recall the similar definition of differential entropy

    h(X) = -\int_{-\infty}^{\infty} p_X(x) \log p_X(x) \, dx,    (2)

where p_X(x) is the probability density function of X, X ∈ R.

(a) Let X_1 ∼ Bernoulli(p_1) and X_2 ∼ Bernoulli(p_2), with X_1 and X_2 independent. Define S = X_1 + X_2, where the sum is modulo 2. Find the expression of the entropy of S as a function of p_1 and p_2. Hint: Determine P(S = 0) and P(S = 1).

(b) Compute H(S) for p_1 = 1/2 and any p_2. Compute H(S) for p_1 = p_2 = 1/4. Does the entropy increase after modulo-2 addition? Hint: Compare H(S) to H(X_1).

(c) Now S = X_1 + X_2, where + is the standard addition in R. The two random variables are not independent; their joint distribution is given by Table 1.

               x_2 = 0    x_2 = 1    x_2 = 2
    x_1 = 0     0/18       4/18       2/18
    x_1 = 1     3/18       2/18       1/18
    x_1 = 2     4/18       2/18       0/18

    Table 1: Joint probability distribution of the two random variables X_1 and X_2.

Find the value of H(S). Hint: Determine P(S = s) for all valid values of s.

(d) A feedforward dense neural network is used to classify MNIST images. As usual, the output layer has ten units (logits) followed by a softmax function. Consider the softmax output Y = (ŷ_1, ŷ_2, ..., ŷ_{10}). Find a finite real number Λ that bounds the entropy H(Y) from above, i.e. H(Y) ≤ Λ. Hint: Assume a uniform distribution for Y.

(e) Bonus question. A standard Gaussian random variable, N(0, 1), is applied at the input of a sigmoid function. This could be the Gaussian-like distribution produced by a neuron in an artificial neural network before applying a non-linearity. Determine the differential entropy before and after applying the sigmoid activation. The probability density function of the Gaussian distribution (before the sigmoid) and of the new squashed distribution (after the sigmoid) are shown in Figure 1 for illustration.

Figure 1: A standard normal distribution applied at a sigmoid input (density before and after the sigmoid).
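For a quick numerical check of definition (1), the short Python sketch below computes the entropy of an arbitrary finite probability mass function in bits (base-2 logarithm); the helper name and the example values are illustrative only, not taken from the questions.

import numpy as np

def entropy(pmf):
    """H(X) = -sum_i p_i * log2(p_i) in bits; terms with p_i = 0 contribute 0."""
    p = np.asarray(pmf, dtype=float)
    p = p[p > 0]                      # drop zero-probability outcomes
    return float(-np.sum(p * np.log2(p)))

print(entropy([0.5, 0.5]))            # 1.0 bit for a fair Bernoulli variable
print(entropy([0.25, 0.75]))          # about 0.811 bits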
Part II - Inference Algorithms and Methods (20 points)

The K-means algorithm, also known as the Lloyd-Max algorithm in information theory, aims at assigning K centroids to a set of data. The centroids are points (or vectors) in R^ℓ, for a given dimension ℓ. The set of data, represented as a cloud of points, has N points in R^ℓ. Figure 2 shows a dataset {x_n}_{n=1}^{N} of N = 2000 points and K = 10 centroids {c_k}_{k=1}^{K}. The centroids are also called means because each centroid is the mean of its neighboring points.

In Figure 2, the whole dataset is grouped into a single cluster. In this case, the K centroids serve for vector quantization, i.e. they approximate the shape of the entire cloud such that the mean squared error (MSE) is minimized. The MSE is

    MSE = \frac{1}{N} \sum_{n=1}^{N} \| x_n - c(x_n) \|^2,    (3)

where c(x_n) is the centroid of the Voronoi cell (local neighborhood or local cluster) to which x_n belongs. After a random initialization of the K centroids, the K-means algorithm proceeds in two steps; a short code sketch of these two steps is given at the end of this part.

Update the Voronoi cells.

    c(x_n) = c_{\hat{k}(n)} = \arg\min_{c_k} \| x_n - c_k \| = \text{closest centroid to } x_n, \quad n = 1 \ldots N.    (4)

See (20.3) in Chap. 20 of Sir David MacKay's book. Now, for each point x_n, we know that it belongs to the cell \hat{k}(n) with centroid c(x_n) = c_{\hat{k}(n)}, the closest centroid to x_n as given by (4). Let R^{(k)} be the number of points in cell k; we have \sum_{k=1}^{K} R^{(k)} = N.

Update the centroids.

    c_k = \frac{1}{R^{(k)}} \sum_{n \,:\, c(x_n) = c_k} x_n = \frac{\text{sum of all points in cell } k}{\text{number of points in cell } k}, \quad k = 1 \ldots K.    (5)

See (20.5) in Chap. 20 of Sir David MacKay's book. The algorithm iterates over the two steps until the MSE stops decreasing or until a maximum number of iterations is reached.

In some cases, as shown in Figure 3, the dataset is already organized in clusters. The K-means algorithm is used to find the best positions of the K centroids. Then, given a new point, its distances to the centroids can be used to infer the cluster (i.e. the Voronoi cell) this new point belongs to. In this case, the centroids serve for classification.

(a) The MSE versus the iteration number is shown in Figure 4 for the dataset of N = 2000 blue points of Figure 2. How many iterations are needed for the K-means algorithm to reach convergence? Hint: Just follow the plot of Figure 4.

(b) The initialization of the K centroids can be done in two different ways: 1) select K random points in the space R^ℓ; 2) select K random points among the N points of the dataset. Which method is better? Explain.

(c) Assume that the dataset is organized in three clusters before running the K-means algorithm, as in Figure 3. Running K-means with K = 3 centroids should lead to one centroid per cluster, as shown in Figure 3. What happens if we run the K-means algorithm with K = 4 centroids or more?

(d) An undercomplete auto-encoder (AE) is trained on the MNIST dataset. The latent dimension is 2. The user plots the 10 clusters with 10 different colors in the latent space z = (z[0], z[1]). The label of each image of the training dataset is known in advance, hence it is possible to choose the correct color for each point. Recall that the AE's decoder reconstructs the input image from z and does not perform any classification. Given a new image with an unknown label, suggest a method to classify the image. Hint: Propose a combination of the AE and the K-means.

(e) Bonus question. A user observes a sequence of N real samples x_1, x_2, ..., x_N of a Gaussian random variable with mean µ and variance σ². The user does not know µ and σ²; he should infer them from the observed sequence {x_i}_{i=1}^{N}. The likelihood of this sequence is

    p(x_1, x_2, \ldots, x_N \mid \mu, \sigma^2) = \frac{1}{(\sqrt{2\pi\sigma^2})^{N}} \cdot \exp\!\left( -\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{2\sigma^2} \right).    (6)

What are the best estimates of µ and σ² such that the above likelihood is maximized? Hint: Find the partial derivatives of log p(x_1, ..., x_N | µ, σ²) with respect to µ and σ². Note: We saw at our final ECEN 446 lecture how a neural network can estimate the parameters of a probability distribution. Here, in question (e), we are looking for the mathematical expression of the best estimates without employing neural networks.
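The two update steps (4) and (5) can be written in a few lines of NumPy. The sketch below assumes the centroids are initialized from K points of the dataset; the function and variable names are illustrative only.

import numpy as np

def kmeans(x, K, iterations=40, seed=0):
    """Plain K-means on an (N, l) array x, iterating the two steps (4) and (5)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    N = x.shape[0]
    c = x[rng.choice(N, size=K, replace=False)].copy()   # initialization: K points of the dataset
    for _ in range(iterations):
        # Step 1, eq. (4): assign each point to its closest centroid (Voronoi cells).
        d = np.linalg.norm(x[:, None, :] - c[None, :, :], axis=2)   # (N, K) distance table
        k_hat = d.argmin(axis=1)
        # MSE of the current assignment, eq. (3).
        mse = np.mean(np.sum((x - c[k_hat]) ** 2, axis=1))
        # Step 2, eq. (5): move each centroid to the mean of the points in its cell.
        for k in range(K):
            if np.any(k_hat == k):                        # leave a centroid untouched if its cell is empty
                c[k] = x[k_hat == k].mean(axis=0)
    return c, k_hat, mse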
Figure 2: Quantization of a set of N = 2000 points with K = 10 centroids.

Figure 3: Centroids found by K-means used for classification.

Figure 4: MSE versus the K-means iteration for the cloud of 2000 points (MSE = 363.02 at iteration 0; MSE = 18.51 from iteration 29 to 40).

Part III - Artificial Neural Networks (60 points)

In this problem, we consider a bi-dimensional dataset of random points belonging to two classes. The first class of points lies inside the circle centered at 0 with radius 1, whereas the second class lies inside a ring located between radius 1.5 and radius 2.5. All points are perturbed with a Gaussian noise of zero mean and standard deviation 0.1. A Bernoulli(p = 0.6) process allows us to select 40% of the points from the first class (red points) and 60% of the points from the second class (blue points). The 4000 points of the training dataset are plotted in Figure 5. We also generated a validation dataset of 800 points. The two datasets can be loaded into TensorFlow from the files training_circular_bidim_dataset.csv and test_circular_bidim_dataset.csv.

Figure 5: A circular bi-dimensional dataset with 2 classes (2 categories / 2 clusters); axes X1 and X2.

The binary classification is performed via a 2-6-3-1 dense neural network, depicted in Figure 6. The input has 2 dimensions, the model has 2 hidden layers with tanh activation functions, and the output has a single unit with a sigmoid. The output sigmoid is well suited to our binary classification problem.

Figure 6: A 2-6-3-1 model to make the binary classification for our dataset.

(a) Suppose the model input is x = (x_1, x_2)^t = (0.5, 0.5)^t. The 6 × 2 weight matrix of the first hidden layer is set to

    W_1 = \begin{pmatrix} 0.50 & -0.05 \\ -0.25 & 0.25 \\ 0.04 & 0.75 \\ 0.04 & -0.06 \\ 0.64 & 0.04 \\ -0.64 & 0.08 \end{pmatrix}.    (7)

The 3 × 6 weight matrix of the second hidden layer is set to

    W_2 = \begin{pmatrix} 0.50 & 1.00 & -0.05 & 0.40 & -0.06 & 1.00 \\ 1.00 & -0.25 & 0.25 & 0.02 & 0.01 & -0.40 \\ 0.04 & 0.75 & 1.00 & -0.50 & 0.01 & 0.64 \end{pmatrix}.    (8)

Finally, the weights of the output layer are set to W_3 = (2.0, 0.5, -1.0). We denote by h_1 = (h_{11}, h_{12}, ..., h_{16})^t the output of the first hidden layer after the tanh activation, and by h_2 = (h_{21}, h_{22}, h_{23})^t the output of the second hidden layer after the tanh activation. The model output is denoted by the letter o. For simplicity, it is assumed that the model units have no bias in question (a). Using the linear-algebra equations of forward propagation, without a bias, we calculate h_1 = φ(W_1 x), h_2 = φ(W_2 h_1), and the value of the output o = Ψ(W_3 h_2), where φ(x) and Ψ(x) are the tanh and the sigmoid functions respectively. We get

    h_1 = (0.22127, 0.0, 0.37566, -0.00999, 0.32748, -0.27290)^t,
    h_2 = (-0.20188, 0.40317, 0.21473)^t,
    o = 0.39725.

Based on the values computed by the above forward propagation, compute the gradient do/dw_{11} of the output o with respect to the weight parameter w_{11} between the first input and the first unit of the first hidden layer. Use the rules of backpropagation.
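The forward-pass values quoted in question (a) can be reproduced with the short NumPy sketch below (for reference; the variable names are illustrative).

import numpy as np

# Input and weights quoted in question (a); no biases are used.
x  = np.array([0.5, 0.5])
W1 = np.array([[ 0.50, -0.05],
               [-0.25,  0.25],
               [ 0.04,  0.75],
               [ 0.04, -0.06],
               [ 0.64,  0.04],
               [-0.64,  0.08]])
W2 = np.array([[ 0.50,  1.00, -0.05,  0.40, -0.06,  1.00],
               [ 1.00, -0.25,  0.25,  0.02,  0.01, -0.40],
               [ 0.04,  0.75,  1.00, -0.50,  0.01,  0.64]])
W3 = np.array([2.0, 0.5, -1.0])

h1 = np.tanh(W1 @ x)                       # first hidden layer, shape (6,)
h2 = np.tanh(W2 @ h1)                      # second hidden layer, shape (3,)
o  = 1.0 / (1.0 + np.exp(-(W3 @ h2)))      # sigmoid output

print(h1)   # approx. [ 0.22127  0.       0.37566 -0.01000  0.32748 -0.27290]
print(h2)   # approx. [-0.20188  0.40317  0.21473]
print(o)    # approx. 0.39725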
(b) If Xavier initialization with a Gaussian distribution is used for the 12 parameters in W_1, what should the variance of the distribution be?

(c) Build the 2-6-3-1 neural network of Figure 6 in the Jupyter notebook template received with the final exam PDF file. Set the activation functions of the hidden layers to tanh. The output activation should be a sigmoid. Write the model construction via TensorFlow/Keras. Save your Jupyter notebook before sending it back to the instructor by email. After model.summary(), the neural network should look like this:

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense (Dense)                (None, 6)                 18
_________________________________________________________________
dense_1 (Dense)              (None, 3)                 21
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 4
=================================================================
Total params: 43
Trainable params: 43
Non-trainable params: 0

Set the epochs to 100, the batch size to 32, the learning rate to 0.0002, and the optimizer to Adam. Run model.compile with the correct arguments and then launch the learning by model.fit(train_points, train_labels, epochs=epochs). The training accuracy should be above 99%; the validation accuracy is most likely 100%.

(d) Replace the tanh activation of the hidden layers by ReLU. Also try sigmoids instead of tanh. What are the accuracy results after changing the activation functions?

. . . Good Luck. Joseph J. Boutros . . .
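For reference, a minimal TensorFlow/Keras sketch of the model construction asked for in question (c). The loss function is not specified above, so binary cross-entropy (the usual choice for a single sigmoid output) is assumed; loading the dataset from the notebook template is omitted, and train_points / train_labels are the arrays expected by the fit call quoted in question (c).

import tensorflow as tf

# 2-6-3-1 dense model: tanh hidden layers, sigmoid output, 18 + 21 + 4 = 43 parameters.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),
    tf.keras.layers.Dense(6, activation='tanh'),
    tf.keras.layers.Dense(3, activation='tanh'),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0002),
              loss='binary_crossentropy',       # assumed; the loss is not stated in the exam
              metrics=['accuracy'])
model.summary()                                 # should match the summary shown in question (c)

# model.fit(train_points, train_labels, epochs=100, batch_size=32)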