Weight Matrix and Neural Network Equivalence

Q1: Give a detailed example to show the equivalence between a weight-matrix-based
approach, e.g., the information-theoretic approach, and a neural network having a
single neuron.
Solution:
We first consider the similarities between a weight matrix and a single-layer perceptron (SLP):

Both cannot handle non-linearity.

Both assume independence of each position (inputs in case of NN).
To study the similarity between a weight matrix and a neural network, let us take a sample
problem and solve it with both approaches.
Consider drawing a marble from a bag containing two marbles, one yellow (Y) and one red (R),
replacing it, and drawing again. We repeat this two-draw experiment six times and form a model
based on the results.
For the weight matrix based approach, consider the following trials.
Let the set of sequences be
YY
YR
RR
RR
RY
YR
Solving by the frequentist approach, we construct a weight matrix as follows
Step 1: Tabulate the number of occurrences of each outcome at each position ‘l’
Outcome      Position 1   Position 2   Row sum
Y                 3            2          5
R                 3            4          7
Column sum        6            6         12
Step 2: Divide each entry (including the row sums) by the column sum
Outcome      Position 1   Position 2   Row sum
Y                0.50         0.33       0.83
R                0.50         0.66       1.16
Step 3: Divide each entry by its row sum (computed in Step 2).
Outcome      Position 1   Position 2   Row sum
Y               0.602        0.398       0.83
R               0.431        0.568       1.16
Step 4: Take the natural logarithm to get the weight matrix
Outcome      Position 1   Position 2
Y               -0.51        -0.92
R               -0.84        -0.24
Consensus         Y            R
We separate sequences into two classes, using -1 as the threshold on the weight-matrix score.
For the sequence RY, the score is (-0.84) + (-0.92) = -1.76, which is below the threshold, so RY
falls in the negative class.
For the sequence YR, the score is (-0.51) + (-0.24) = -0.75, which is above the threshold, so YR
falls in the positive class.
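For illustration, this scoring step can be reproduced with a few lines of C. The sketch below is
ours, not part of the original solution; it simply stores the Step 4 weight matrix and compares
each sequence score against the -1 threshold.

#include <stdio.h>

/* Weight matrix from Step 4: rows Y, R; columns position 1, position 2. */
static const double W[2][2] = {
    { -0.51, -0.92 },   /* Y */
    { -0.84, -0.24 }    /* R */
};

/* Score a two-letter Y/R sequence by summing the per-position weights. */
static double score(const char *seq)
{
    double s = 0.0;
    for (int pos = 0; pos < 2; pos++) {
        int row = (seq[pos] == 'Y') ? 0 : 1;
        s += W[row][pos];
    }
    return s;
}

int main(void)
{
    const char *seqs[] = { "RY", "YR" };
    for (int i = 0; i < 2; i++) {
        double s = score(seqs[i]);
        printf("%s: score = %.2f -> %s class\n", seqs[i], s,
               s > -1.0 ? "positive" : "negative");
    }
    return 0;   /* prints RY: -1.76 (negative), YR: -0.75 (positive) */
}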
Solving the same problem with the neural network approach, we use the following SLP, whose
single neuron has the hard-limit function as its activation.
(Figure: a single-layer perceptron with inputs p1 and p2, weights w1,1 = 1 and w1,2 = -1,
bias b = -0.5, and a hard-limit neuron. Inputs are encoded as Y = 1, R = 0.)
For RY: p1 = 0, p2 = 1; net = (1)(0) + (-1)(1) + (-0.5) = -1.5; hard-limit output = 0.
For YR: p1 = 1, p2 = 0; net = (1)(1) + (-1)(0) + (-0.5) = 0.5; hard-limit output = 1.
Hence we see the equivalence between the two approaches: the neuron's 0/1 output matches the
negative/positive classification given by the weight matrix.
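The single-neuron side of the comparison can be sketched the same way. The following C fragment
is ours and assumes the Y = 1, R = 0 input encoding described above; it reproduces the two net
inputs and hard-limit outputs.

#include <stdio.h>

/* Hard-limit activation: 1 if the net input is >= 0, else 0. */
static int hardlim(double n) { return n >= 0.0 ? 1 : 0; }

int main(void)
{
    const double w1 = 1.0, w2 = -1.0, b = -0.5;   /* weights and bias of the SLP */
    const char *seqs[] = { "RY", "YR" };

    for (int i = 0; i < 2; i++) {
        /* Encode Y as 1 and R as 0 at each position. */
        double p1 = (seqs[i][0] == 'Y') ? 1.0 : 0.0;
        double p2 = (seqs[i][1] == 'Y') ? 1.0 : 0.0;
        double net = w1 * p1 + w2 * p2 + b;
        printf("%s: net = %.1f, output = %d\n", seqs[i], net, hardlim(net));
    }
    return 0;   /* RY -> net -1.5, output 0; YR -> net 0.5, output 1 */
}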
Q2: What is the relationship between the concepts of mutual information
and weight sharing in a neural network? Explain with an example.
Let's start by defining the terms:
Mutual Information: The mutual information of X and Y, written I(X,Y), is the amount of
information gained about X when Y is learned, and vice versa. I(X,Y) = 0 if and only if X and Y
are independent.
Weight Sharing: Overfitting is a drawback of an NN, where the network memorizes the mapping
between the data and the targets. Overfitting occurs when the number of parameters is large and
subsets of parameters map exclusively to individual training examples. Hence an "optimal NN is
the one that has minimal parameters". Weight sharing is a parameter-reduction technique
wherein sets of edges share the same weight.
In any network, the large number of parameters is mainly a result of the large input layer. Using
mutual information we can quantify the dependence between two inputs of an NN; strongly
dependent inputs can be made to share the same weight, thus reducing the number of parameters.
Hence knowledge of mutual information can help in parameter reduction by means of weight sharing.
Example: Suppose the inputs to an NN follow a pattern ABCDEFAB…, where the same character
occurs at input positions x and x+6. The mutual information between these positions is then
high, so to reduce parameters the two positions can share the same weight (effectively feeding
both occurrences of A through one input).
Example 2: In a very large network with many neurons and weights, weight sharing is enforced
during training by a simple modification of backpropagation: the weight updates for all edges
that share a weight are summed and applied to the single shared value. Mutual information can
be used to decide which weights should share the same value; a sketch of the modified update
follows.
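A minimal C sketch of this modification, for a single parameter shared by two input positions.
The deltas and sizes below are illustrative assumptions, not values from a real network.

#include <stdio.h>

#define N_TIED 2   /* two input positions (e.g. x and x+6) share one weight */

int main(void)
{
    double shared_w = 0.3;                  /* the single shared parameter */
    double inputs[N_TIED] = { 1.0, 1.0 };   /* e.g. the two occurrences of 'A' */
    double lr = 0.1;

    /* Suppose backpropagation has produced, for each tied edge, the usual
       per-edge gradient dE/dw_i = delta_i * input_i. The deltas here are
       made up purely for illustration. */
    double delta[N_TIED] = { 0.25, -0.10 };

    /* Weight sharing: sum the per-edge gradients into one update. */
    double grad = 0.0;
    for (int i = 0; i < N_TIED; i++)
        grad += delta[i] * inputs[i];

    shared_w -= lr * grad;                  /* one update for the shared value */
    printf("summed gradient = %.2f, updated shared weight = %.3f\n",
           grad, shared_w);
    return 0;
}

The only change from ordinary backpropagation is that the per-edge gradients are accumulated
into one sum before the shared weight is updated.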
Q3: The number of parameters in a neural network should ideally be a small
fraction of the number of training examples. The use of orthogonal coding
for amino acids increases the number of weights by an order of magnitude.
Is this a recommended input transformation or not, and why?
In order to prevent the NN from memorizing the mapping from target and data, the number of
parameters in a neural network should ideally be a small fraction of the number of training
examples.
Orthogonal Coding: orthogonal coding replaces each letter in the alphabet (20 letters for
amino acids) by a binary 20-bit string with zeros in all positions except one. This increases the
number of weights by an order of magnitude.
For example, in the case of DNA sequences, A codes for 1000 and C codes for 0100, etc.
Uncertainty can then be handled in a better way: for example, if we receive 1100 we know the
letter is either A or C. However, this requires 4 bits per letter compared to the log2(4) = 2
bits of a compact binary code.
Similarly, for amino acids it requires 20 bits per letter compared to log2(20) ≈ 4.3 bits.
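For concreteness, here is a small C sketch (ours) of the orthogonal coding of a DNA sequence;
the amino-acid case is identical except that the alphabet has 20 letters and each code is
20 bits long.

#include <stdio.h>
#include <string.h>

#define ALPHA 4   /* DNA alphabet size; would be 20 for amino acids */

/* Write the orthogonal (one-hot) code of a nucleotide into out[ALPHA]. */
static void encode(char base, int out[ALPHA])
{
    const char *alphabet = "ACGT";
    const char *p = strchr(alphabet, base);
    memset(out, 0, ALPHA * sizeof(int));
    if (p)
        out[p - alphabet] = 1;   /* single 1 in the position of the letter */
}

int main(void)
{
    const char *seq = "ACGT";
    int code[ALPHA];

    for (size_t i = 0; i < strlen(seq); i++) {
        encode(seq[i], code);
        printf("%c -> %d%d%d%d\n", seq[i], code[0], code[1], code[2], code[3]);
    }
    /* A window of L letters therefore feeds ALPHA*L inputs to the network,
       versus roughly log2(ALPHA)*L inputs with a compact binary code. */
    return 0;
}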
In any network, the large number of parameters is mainly a result of the large input layer.
Although orthogonal coding handles uncertainty better, it might not be a recommended input
transformation, because it increases the number of weights, which can allow the network to
memorize the mappings rather than generalize.
Q4: Give an example of a sequence analysis problem where it makes
sense to use a noncontiguous sliding window as input to a neural
network.
RNA processing: the process by which the non-coding sequences (introns) of a gene's transcript
are removed and the coding sequences (exons) are joined, producing messenger RNA (mRNA).
One of the most important stages in RNA processing is RNA splicing. In many genes, the
DNA sequence coding for proteins, or "exons", may be interrupted by stretches of non-coding
DNA, called "introns". In the cell nucleus, the DNA that includes all the exons and introns of
the gene is first transcribed into a complementary RNA copy called "nuclear RNA," or nRNA.
In a second step, introns are removed from nRNA by a process called RNA splicing. The edited
sequence is called "messenger RNA," or mRNA.
Since the start and end signals of an intron are known (GT at the donor end and AG at the
acceptor end), we can use a non-contiguous sliding window, one part covering the donor region
and the other the acceptor region, as input to a neural network that separates introns from
exons; a sketch of how such an input is assembled follows.
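A small C sketch of how such a non-contiguous input might be assembled. The toy sequence,
window half-width, and site positions below are illustrative assumptions, not values from a
real splice-site model.

#include <stdio.h>
#include <string.h>

#define HALF 3   /* letters kept on each side of a candidate site */

/* Copy a window of 2*HALF letters centred between positions pos and pos+1. */
static void take_window(const char *seq, int pos, char *out)
{
    strncpy(out, seq + pos - HALF + 1, 2 * HALF);
    out[2 * HALF] = '\0';
}

int main(void)
{
    /* Toy pre-mRNA: exon, then an intron starting with GT and ending with AG. */
    const char *seq = "CAGGTAAGTTTTTTTTTTCTGACAGGTC";
    int donor = 3;      /* index of the 'G' of the GT donor site (assumed known) */
    int acceptor = 23;  /* index of the 'A' of the AG acceptor site (assumed known) */

    char win1[2 * HALF + 1], win2[2 * HALF + 1], input[4 * HALF + 1];

    take_window(seq, donor, win1);       /* context around GT */
    take_window(seq, acceptor, win2);    /* context around AG */

    /* The non-contiguous window: donor context + acceptor context, with the
       intron interior left out of the network input entirely. */
    snprintf(input, sizeof input, "%s%s", win1, win2);
    printf("donor window:    %s\n", win1);
    printf("acceptor window: %s\n", win2);
    printf("network input:   %s\n", input);
    return 0;
}

The intron interior, which can be arbitrarily long, never enters the input vector, so the
network size stays fixed while the two informative regions are seen together.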
Q5: Slide 12 of lecture 9 shows a line that divides the input space into two
regions of the drug effectiveness classification problem. Show how the line
was derived.
(Figure: a single neuron with inputs p1 and p2, weights w1,1 = 1 and w1,2 = 1, and bias b = -1.5.)
From the diagram above, we can see that
Output = (p1 * w1,1) + (p2 * w1,2) + b
We also know that W and P are vectors such that W = [w1,1, w1,2] and P = [p1, p2].
Hence the above expression becomes
Output = (W * P) + b
For the drug effectiveness classification problem, the network converges when
W = [1, 1] and b = -1.5
Substituting these values in the equation above and setting the output to 0 (the decision
boundary), we get
p1 + p2 + (-1.5) = 0
Hence p1 + p2 = 1.5
Substituting p1 = 0 gives p2 = 1.5, and substituting p2 = 0 gives p1 = 1.5. Since two points
determine a line, the line of separation passes through the points (1.5, 0) and (0, 1.5).
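A tiny C check (ours) that the neuron with W = [1, 1] and b = -1.5 puts points on opposite
sides of this line into different classes:

#include <stdio.h>

static int hardlim(double n) { return n >= 0.0 ? 1 : 0; }

int main(void)
{
    const double w1 = 1.0, w2 = 1.0, b = -1.5;
    /* Two points on the line, one below it and one above it. */
    double pts[4][2] = { {1.5, 0.0}, {0.0, 1.5}, {0.5, 0.5}, {2.0, 1.0} };

    for (int i = 0; i < 4; i++) {
        double net = w1 * pts[i][0] + w2 * pts[i][1] + b;
        printf("(%.1f, %.1f): net = %.1f, class = %d\n",
               pts[i][0], pts[i][1], net, hardlim(net));
    }
    return 0;   /* points on the line give net = 0; (0.5,0.5) -> 0, (2,1) -> 1 */
}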
(Figure: the line p1 + p2 = 1.5 in the p1-p2 plane, crossing each axis at 1.5, with the weight
vector W drawn.)
Q6: Consider an MLP that solves the XOR problem. It consists of a single 2-neuron hidden
layer and a single output neuron, and uses the hard-limit function as the activation function
for each of the three neurons.
Derive and draw the two lines that divide the input space correctly for predicting the result of
the 4 possible inputs.
From the network diagram, we can see that the net inputs are
H1 = (p1 * w1,1) + (p2 * w1,2) + b1
H2 = (p1 * w2,1) + (p2 * w2,2) + b2
Output = (h1 * wo,1) + (h2 * wo,2) + bo
where W1 = [w1,1, w1,2] and W2 = [w2,1, w2,2] are the weight vectors of the two hidden neurons
and P = [p1, p2].
For the XOR problem, the network converges for the following values of the weights and biases,
so the expressions above become
H1 = (2 * p1) + (2 * p2) + (-1)
H2 = (-1 * p1) + (-1 * p2) + (1.5)
Output = (h1 + h2) + (-1.5)
Setting each net input to zero (WP + b = 0) gives the three decision lines:
2p1 + 2p2 - 1 = 0, i.e. p1 + p2 = 0.5, crossing the axes at p1 = 0.5 and p2 = 0.5;
-p1 - p2 + 1.5 = 0, i.e. p1 + p2 = 1.5, crossing the axes at p1 = 1.5 and p2 = 1.5;
h1 + h2 - 1.5 = 0, i.e. h1 + h2 = 1.5, crossing the axes at h1 = 1.5 and h2 = 1.5.
Hence we draw the 2 graphs:
(Figure 1: the p1-p2 input plane showing the two hidden-layer decision lines p1 + p2 = 0.5 and
p1 + p2 = 1.5.)
(Figure 2: the h1-h2 hidden-layer plane showing the output decision line h1 + h2 = 1.5.)
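With these weights, the classification of all four inputs can be verified directly; the short
C sketch below (ours) evaluates the hidden neurons and the output neuron for each input.

#include <stdio.h>

static int hardlim(double n) { return n >= 0.0 ? 1 : 0; }

int main(void)
{
    for (int p1 = 0; p1 <= 1; p1++) {
        for (int p2 = 0; p2 <= 1; p2++) {
            /* Hidden layer with the weights and biases derived above. */
            int h1 = hardlim(2.0 * p1 + 2.0 * p2 - 1.0);
            int h2 = hardlim(-1.0 * p1 - 1.0 * p2 + 1.5);
            /* Output neuron. */
            int out = hardlim(1.0 * h1 + 1.0 * h2 - 1.5);
            printf("p1=%d p2=%d -> h1=%d h2=%d -> output=%d\n",
                   p1, p2, h1, h2, out);
        }
    }
    return 0;   /* output is 1 exactly when p1 XOR p2 is 1 */
}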
Q7: Consider the network shown in question 6, but assume the log-sigmoid
activation function for the neurons in the hidden layer and the linear
function for the output neuron. Assume an initial configuration where all the
weights are set to random values between -1 and 1. Iterate using
backpropagation till the network converges to give the right answer for
each of the 4 possible inputs of the XOR problem.
We have written a program in C for this problem. Backpropagation has been implemented using a
forward pass and a reverse pass (steepest-descent weight updates). The initial weights taken
are [0.2, 0.4, 0.6, -0.4, 1.0] and the biases are [-1, 0.5, -0.5].
We are including a copy of the program for your perusal.
The initial few iterations for the output have been attached.
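A minimal sketch of such a program (illustrative only, not the attached listing): a 2-2-1
network with log-sigmoid hidden neurons, a linear output neuron, random initial weights in
[-1, 1], and online steepest-descent updates. The learning rate, epoch count, and seed are
assumed values; convergence depends on the random initial weights, so a stuck run may need a
different seed.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

static double logsig(double n) { return 1.0 / (1.0 + exp(-n)); }
static double randw(void) { return 2.0 * rand() / RAND_MAX - 1.0; }  /* in [-1, 1] */

int main(void)
{
    const double X[4][2] = { {0,0}, {0,1}, {1,0}, {1,1} };
    const double T[4]    = {  0,     1,     1,     0 };   /* XOR targets */
    double lr = 0.5;

    /* 2-2-1 network: hidden weights w[j][i], hidden biases bh[j],
       output weights v[j], output bias bo. */
    double w[2][2], bh[2], v[2], bo;
    srand(1);
    for (int j = 0; j < 2; j++) {
        bh[j] = randw(); v[j] = randw();
        for (int i = 0; i < 2; i++) w[j][i] = randw();
    }
    bo = randw();

    for (int epoch = 0; epoch < 20000; epoch++) {
        for (int k = 0; k < 4; k++) {
            /* Forward pass. */
            double h[2], y = bo;
            for (int j = 0; j < 2; j++) {
                h[j] = logsig(w[j][0] * X[k][0] + w[j][1] * X[k][1] + bh[j]);
                y += v[j] * h[j];
            }
            /* Reverse pass (steepest descent on the squared error). */
            double e = T[k] - y;               /* linear output: output delta = e */
            for (int j = 0; j < 2; j++) {
                double dh = e * v[j] * h[j] * (1.0 - h[j]);   /* hidden delta */
                v[j]  += lr * e * h[j];
                bh[j] += lr * dh;
                for (int i = 0; i < 2; i++) w[j][i] += lr * dh * X[k][i];
            }
            bo += lr * e;
        }
    }

    for (int k = 0; k < 4; k++) {
        double h0 = logsig(w[0][0]*X[k][0] + w[0][1]*X[k][1] + bh[0]);
        double h1 = logsig(w[1][0]*X[k][0] + w[1][1]*X[k][1] + bh[1]);
        printf("%.0f XOR %.0f = %.3f\n", X[k][0], X[k][1], bo + v[0]*h0 + v[1]*h1);
    }
    return 0;
}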
Q8: In principle, a neural network should be able to approximate any
function. Demonstrate this by designing and optimizing the parameters of a
minimal neural network that approximates f(x) = x2. Plot the output of your
converged network for values of x in the interval -3 to 3 and superimpose
this on the curve for the actual function [Suggestion: Use activation
functions similar to those in problem 7].
A program in C has been written to implement the network and train it with backpropagation. We
have used the log-sigmoid activation in the hidden layer and a linear output neuron.
We are including a copy of the program for your perusal.
The initial iterations for the output have been attached.
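A minimal sketch of such a network (illustrative only, not the attached listing): one input, a
small log-sigmoid hidden layer, and a linear output neuron, trained by backpropagation on
points sampled from f(x) = x^2 in [-3, 3]. The hidden-layer size, learning rate, and epoch
count are assumptions and may need tuning. The final loop prints (x, network output, x^2)
triples that can be plotted and superimposed on the true curve.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define H 4          /* hidden neurons (a small, assumed size) */
#define NPTS 61      /* training points sampled in [-3, 3] */

static double logsig(double n) { return 1.0 / (1.0 + exp(-n)); }
static double randw(void) { return 2.0 * rand() / RAND_MAX - 1.0; }

int main(void)
{
    double x[NPTS], t[NPTS];
    for (int k = 0; k < NPTS; k++) {        /* training set: (x, x^2) */
        x[k] = -3.0 + 6.0 * k / (NPTS - 1);
        t[k] = x[k] * x[k];
    }

    double w[H], bh[H], v[H], bo = 0.0, lr = 0.01;
    srand(1);
    for (int j = 0; j < H; j++) { w[j] = randw(); bh[j] = randw(); v[j] = randw(); }

    for (int epoch = 0; epoch < 50000; epoch++) {
        for (int k = 0; k < NPTS; k++) {
            double h[H], y = bo;
            for (int j = 0; j < H; j++) {          /* forward pass */
                h[j] = logsig(w[j] * x[k] + bh[j]);
                y += v[j] * h[j];
            }
            double e = t[k] - y;                   /* backward pass */
            for (int j = 0; j < H; j++) {
                double dh = e * v[j] * h[j] * (1.0 - h[j]);
                v[j]  += lr * e * h[j];
                w[j]  += lr * dh * x[k];
                bh[j] += lr * dh;
            }
            bo += lr * e;
        }
    }

    for (double xx = -3.0; xx <= 3.001; xx += 0.25) {   /* values to plot */
        double y = bo;
        for (int j = 0; j < H; j++) y += v[j] * logsig(w[j] * xx + bh[j]);
        printf("%5.2f  %8.4f  %8.4f\n", xx, y, xx * xx);
    }
    return 0;
}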