Q1: Give a detailed example to show the equivalence between weight-matrix-based approaches (e.g., the information-theoretic approach) and a neural network having a single neuron.

Solution: We first consider the similarities between a weight matrix and a single-layer perceptron (SLP): neither can handle non-linearity, and both assume independence of each position (of each input, in the case of the neural network).

To study the similarity between a weight matrix and a neural network, let us take a sample problem and solve it with both approaches. Consider drawing a marble, either yellow (Y) or red (R), twice from a bag containing one yellow and one red marble. We repeat this for 6 trials and form a model from the resulting length-2 sequences.

For the weight-matrix-based approach, let the observed set of sequences be

    YY  YR  RR  RR  RY  YR

Solving by the frequentist approach, we construct a weight matrix as follows.

Step 1: Tabulate the number of occurrences of each outcome at each position.

    Outcome      Position 1   Position 2   Row sum
    Y                 3            2           5
    R                 3            4           7
    Column sum        6            6          12

Step 2: Divide each entry (including the row sums) by the column sum.

    Outcome      Position 1   Position 2   Row sum
    Y                0.50         0.33        0.83
    R                0.50         0.66        1.16

Step 3: Divide each entry by the normalized row sum.

    Outcome      Position 1   Position 2   Row sum
    Y                0.602        0.398       0.83
    R                0.431        0.568       1.16

Step 4: Take the natural logarithm of each entry to get the weight matrix.

    Outcome      Position 1   Position 2
    Y               -0.51        -0.92
    R               -0.84        -0.24
    Consensus         Y            R

We separate sequences into two classes using -1 as the threshold on the summed score. For the sequence RY, the score is -0.84 + (-0.92) = -1.76, below the threshold (negative class). For the sequence YR, the score is -0.51 + (-0.24) = -0.75, above the threshold (positive class).

Solving the same problem with the neural-network approach, we use an SLP whose single neuron has the hard-limit activation function.

[Figure: an SLP with inputs p1 and p2, weights w1,1 = 1 and w1,2 = -1, bias b = -0.5, and one hard-limit neuron. Inputs are coded Y = 1, R = 0.]

For RY: p1 = 0, p2 = 1, so the net input is (1)(0) + (-1)(1) + (-0.5) = -1.5 and the output is 0.
For YR: p1 = 1, p2 = 0, so the net input is (1)(1) + (-1)(0) + (-0.5) = 0.5 and the output is 1.

The SLP output 0 corresponds to the below-threshold (negative) weight-matrix class and output 1 to the above-threshold (positive) class. Hence the two approaches give the same separation of the sequences, which demonstrates their equivalence.

Q2: What is the relationship between the concepts of mutual information and weight sharing in a neural network? Explain with an example.

Solution: Let us start by defining the terms.

Mutual information: the mutual information of X and Y, written I(X, Y), is the amount of information gained about X when Y is learned, and vice versa. I(X, Y) = 0 if and only if X and Y are independent.

Weight sharing: overfitting is a drawback of a neural network, where the network memorizes the mapping between data and target. Overfitting occurs when the number of parameters is large and subsets of parameters map exclusively to individual training examples. Hence an "optimal NN is the one that has minimal parameters". Weight sharing is a parameter-reduction technique in which sets of edges share the same weight.

In any network, the large number of parameters is mainly a result of the large input layer. Using mutual information we can identify pairs of inputs that carry the same information; their connections can then be made to share the same weight, reducing the number of parameters. Hence knowledge of mutual information helps parameter reduction by means of weight sharing.

Example: consider inputs to a neural network that follow a pattern ABCDEFAB..., where the same character occurs at input positions x and x + 6. These position pairs have high mutual information, so to reduce parameters we can let their connections share one weight (or, equivalently, present both occurrences through a single input).

Example 2: in a very large network with many neurons and weights, weight sharing is enforced during training by a simple modification of backpropagation: the weight updates are summed over all edges that share the same weight (see the sketch below). Mutual information can be used to decide which weights should share the same value.
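As a rough illustration of this summed-update rule, the following minimal C sketch trains a toy linear model in which two inputs known to carry the same signal (here x2 simply duplicates x1, so their mutual information is maximal) are connected through a single shared weight; the gradient contributions of both edges are summed before the weight is updated. The toy data, variable names and learning-rate/epoch values are illustrative choices, not material from the lecture or the assignment.

    /* Toy sketch of weight sharing: inputs x1 and x2 carry the same signal,
     * so their two edges use one shared weight w; the gradient contributions
     * of both edges are summed before w is updated. */
    #include <stdio.h>

    int main(void) {
        double x1[4] = {0.0, 1.0, 2.0, 3.0};
        double x2[4] = {0.0, 1.0, 2.0, 3.0};   /* duplicates x1: maximal mutual information */
        double t[4]  = {0.0, 2.0, 4.0, 6.0};   /* target t = x1 + x2 */
        double w = 0.1, b = 0.0, lr = 0.01;    /* one shared weight instead of two */

        for (int epoch = 0; epoch < 5000; epoch++) {
            double gw = 0.0, gb = 0.0;
            for (int i = 0; i < 4; i++) {
                double y = w * x1[i] + w * x2[i] + b;   /* both edges use w */
                double e = y - t[i];
                gw += e * (x1[i] + x2[i]);              /* summed update for the shared weight */
                gb += e;
            }
            w -= lr * gw / 4.0;
            b -= lr * gb / 4.0;
        }
        printf("shared weight w = %f, bias b = %f\n", w, b);
        return 0;
    }

After training, the shared weight settles near 1 and the bias near 0, i.e. the model recovers the target t = x1 + x2 with half as many input weights as an unshared model would need.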
Q3: The number of parameters in a neural network should ideally be a small fraction of the number of training examples. The use of orthogonal coding for amino acids increases the number of weights by an order of magnitude. Is this a recommended input transformation or not, and why?

Solution: To prevent the network from memorizing the mapping from data to target, the number of parameters in a neural network should ideally be a small fraction of the number of training examples.

Orthogonal coding replaces each letter of the alphabet (20 letters for amino acids) by a binary string with a 1 in exactly one position and zeros elsewhere (a 20-bit string for amino acids). This increases the number of weights by an order of magnitude. For example, for DNA sequences A can be coded as 1000, C as 0100, and so on. Uncertainty can then be handled more gracefully: if we see 1100 we know the base is either A or C. However, this requires 4 bits per base instead of the log2(4) = 2 bits of a compact binary code, and for amino acids 20 bits instead of log2(20) ≈ 4.3 bits.

In any network, the large number of parameters is mainly a result of the large input layer. So although orthogonal coding handles uncertainty better, it might not be a recommended transformation, because it increases the number of weights, which can push the network towards memorizing the training mappings.

Q4: Give an example of a sequence analysis problem where it makes sense to use a non-contiguous sliding window as input to a neural network.

Solution: RNA splicing. In many genes, the DNA sequence coding for proteins ("exons") is interrupted by stretches of non-coding DNA ("introns"). In the cell nucleus, the DNA that includes all the exons and introns of the gene is first transcribed into a complementary RNA copy called nuclear RNA (nRNA). In a second step, the introns are removed from the nRNA by a process called RNA splicing, one of the most important stages of RNA processing; the edited sequence is the messenger RNA (mRNA).

Since the start and end signals of an intron are known (GT and AG, respectively) and lie far apart in the sequence, a non-contiguous sliding window that covers the two splice-site regions while skipping the intervening stretch is a sensible input for a network that separates introns from exons.

Q5: Slide 12 of lecture 9 shows a line that divides the input space into two regions of the drug effectiveness classification problem. Show how the line was derived.

Solution:

[Figure: a single neuron with inputs p1 and p2, weights w1,1 = 1 and w1,2 = 1, and bias b = -1.5.]

From the diagram, the net input to the neuron is

    Output = (p1 * w1,1) + (p2 * w1,2) + b

With the weight and input vectors W = [w1,1, w1,2] and P = [p1, p2], this becomes

    Output = W . P + b        (1)

For the drug effectiveness classification problem, training converges with W = [1, 1] and b = -1.5. The decision boundary is where the net input is zero, so substituting these values into equation (1) and setting it to zero gives

    p1 + p2 - 1.5 = 0,  i.e.  p1 + p2 = 1.5

Substituting p1 = 0 gives p2 = 1.5; substituting p2 = 0 gives p1 = 1.5. Since two points are necessary and sufficient to plot a line, the line of separation passes through (1.5, 0) and (0, 1.5).

[Figure: the decision boundary p1 + p2 = 1.5 plotted in the (p1, p2) plane, with the weight vector W drawn perpendicular to it.]
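As a quick numerical check of this derivation (not part of the original solution), the short C sketch below evaluates the hard-limit neuron with W = [1, 1] and b = -1.5 at a few points on either side of, and on, the line p1 + p2 = 1.5. The sample points and the convention hardlim(n) = 1 for n >= 0 are assumptions of the sketch.

    /* Check that the hard-limit neuron with W = [1, 1], b = -1.5 separates
     * the (p1, p2) plane along the line p1 + p2 = 1.5. */
    #include <stdio.h>

    static int hardlim(double n) { return n >= 0.0 ? 1 : 0; }

    int main(void) {
        double w1 = 1.0, w2 = 1.0, b = -1.5;
        /* points below, above and on the line p1 + p2 = 1.5 */
        double pts[5][2] = { {0.0, 0.0}, {1.0, 0.0}, {1.0, 1.0}, {2.0, 0.0}, {0.75, 0.75} };
        for (int i = 0; i < 5; i++) {
            double n = w1 * pts[i][0] + w2 * pts[i][1] + b;
            printf("p = (%.2f, %.2f)  net = %+.2f  output = %d\n",
                   pts[i][0], pts[i][1], n, hardlim(n));
        }
        return 0;
    }

Points with p1 + p2 < 1.5 produce output 0 and points with p1 + p2 >= 1.5 produce output 1, confirming that the line derived above is the decision boundary.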
Q6: Consider an MLP that solves the XOR problem. It consists of a single 2-neuron hidden layer and a single output neuron, and uses the hard-limit function as the activation function for each of the three neurons. Derive and draw the two lines that divide the input space correctly for predicting the result of the 4 possible inputs.

Solution:

[Figure: an MLP with inputs p1 and p2; hidden neuron H1 with weights 2 and 2 and bias -1; hidden neuron H2 with weights -1 and -1 and bias 1.5; output neuron with weights 1 and 1 and bias -1.5.]

From the diagram, the net inputs are

    H1     = (p1 * w1,1) + (p2 * w1,2) + b1
    H2     = (p1 * w2,1) + (p2 * w2,2) + b2
    Output = (h1 * wo,1) + (h2 * wo,2) + bo

where the rows of the hidden-layer weight matrix are W1 = [w1,1, w1,2] and W2 = [w2,1, w2,2], and the input vector is P = [p1, p2].

For the XOR problem, the network converges with the weights and biases shown in the diagram above, so the expressions become

    H1     = (2 * p1) + (2 * p2) - 1
    H2     = (-1 * p1) + (-1 * p2) + 1.5
    Output = (h1 + h2) - 1.5

The decision boundaries are where each net input is zero (W . P + b = 0):

    2p1 + 2p2 - 1 = 0    i.e.  p1 + p2 = 0.5, through (0.5, 0) and (0, 0.5)
    -p1 - p2 + 1.5 = 0   i.e.  p1 + p2 = 1.5, through (1.5, 0) and (0, 1.5)
    h1 + h2 - 1.5 = 0    i.e.  h1 + h2 = 1.5, through (1.5, 0) and (0, 1.5)

Hence we draw the two graphs:

[Figure: the two lines p1 + p2 = 0.5 and p1 + p2 = 1.5 in the (p1, p2) input plane.]

[Figure: the line h1 + h2 = 1.5 in the (h1, h2) plane of hidden-layer outputs.]

Only the XOR-true inputs (0, 1) and (1, 0) fall in the band between the two input-space lines; they are the only inputs for which both hidden neurons output 1, and hence the only ones for which (h1, h2) lies on or above the line h1 + h2 = 1.5 and the output neuron fires.

Q7: Consider the network shown in question 6, but assume the log-sigmoid activation function for the neurons in the hidden layer and the linear function for the output neuron. Assume an initial configuration where all the weights are set to random values between -1 and 1. Iterate using backpropagation till the network converges to give the right answer for each of the 4 possible inputs of the XOR problem.

Solution: We have written a program in C for this problem. Backpropagation is implemented as a forward pass followed by a reverse pass (steepest-descent algorithm). The initial weights taken are [0.2, 0.4, 0.6, -0.4, 1.0] and the biases are [-1, 0.5, -0.5]. A copy of the program is included for your perusal, and the first few iterations of its output are attached.

Q8: In principle, a neural network should be able to approximate any function. Demonstrate this by designing and optimizing the parameters of a minimal neural network that approximates f(x) = x^2. Plot the output of your converged network for values of x in the interval -3 to 3 and superimpose this on the curve for the actual function. [Suggestion: use activation functions similar to those in problem 7.]

Solution: A program in C has been written to implement this network with backpropagation. We use the log-sigmoid activation in the hidden layer and a linear activation at the output. A copy of the program is included for your perusal, and the initial iterations of its output are attached.
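The program submitted with Q7 and Q8 is not reproduced here. As an independent, minimal sketch of the kind of network Q8 asks for (one input, a log-sigmoid hidden layer, a linear output neuron, trained by steepest descent on squared error), something along the following lines could be used; the hidden-layer size H, the learning rate, the iteration count and the 0.1 grid spacing are arbitrary choices rather than values taken from the original program.

    /* Minimal illustrative sketch (not the submitted program): a 1-H-1 network
     * with log-sigmoid hidden units and a linear output, trained by steepest
     * descent to approximate f(x) = x*x on [-3, 3]. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    #define H    8      /* number of hidden neurons (assumption) */
    #define NPTS 61     /* training points spaced 0.1 apart over [-3, 3] */

    static double logsig(double n) { return 1.0 / (1.0 + exp(-n)); }
    static double randw(void)      { return 2.0 * rand() / RAND_MAX - 1.0; }  /* in [-1, 1] */

    int main(void) {
        double w1[H], b1[H], w2[H];               /* input->hidden and hidden->output weights */
        double b2 = randw(), lr = 0.01;
        for (int j = 0; j < H; j++) { w1[j] = randw(); b1[j] = randw(); w2[j] = randw(); }

        for (long it = 0; it < 500000; it++) {
            double x = -3.0 + 0.1 * (it % NPTS);  /* cycle through the training points */
            double t = x * x;

            /* forward pass */
            double h[H], y = b2;
            for (int j = 0; j < H; j++) { h[j] = logsig(w1[j] * x + b1[j]); y += w2[j] * h[j]; }

            /* reverse pass: steepest descent on E = 0.5 * (y - t)^2 */
            double e = y - t;
            b2 -= lr * e;
            for (int j = 0; j < H; j++) {
                double dh = e * w2[j] * h[j] * (1.0 - h[j]);  /* error at the hidden net input */
                w2[j] -= lr * e * h[j];
                w1[j] -= lr * dh * x;
                b1[j] -= lr * dh;
            }
        }

        /* print (x, network output, x*x) triples for plotting the two curves */
        for (int i = 0; i < NPTS; i++) {
            double x = -3.0 + 0.1 * i, y = b2;
            for (int j = 0; j < H; j++) y += w2[j] * logsig(w1[j] * x + b1[j]);
            printf("%6.2f  %10.4f  %10.4f\n", x, y, x * x);
        }
        return 0;
    }

Compiling with cc (linking the math library with -lm) and plotting the printed (x, network output, x*x) triples superimposes the network's approximation on the true parabola over [-3, 3], as the question requests.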