Computer Vision Course Project

02/15/16
Course Project Proposal
Prof. Davi Geiger
Jong Oh
Computer Vision (V22.0480-001), Spring 2000
Objective
The objective of the project is to implement and experiment with a handwritten-digit recognizer based on a neural network, and to compare recognition performance when different input representations and network architectures are used. Students will try three types of input representation in turn:
1. Raw input image
2. Curvature map representation: each pixel's gray-level value is replaced by the local
curvature at that pixel
3. Coarser-scale curvature map: the curvature map is constructed at a coarser scale.
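The proposal does not fix a particular curvature operator. As a hedged sketch, one common choice is the curvature of the iso-intensity contour through each pixel, computed from image derivatives; this particular formula is an assumption, not necessarily the course's prescribed method:

```python
import numpy as np

def curvature_map(img, eps=1e-8):
    """Replace each pixel by the curvature of the iso-intensity contour
    through it (one common definition; the course may specify another)."""
    Iy, Ix = np.gradient(img.astype(np.float64))  # axis 0 = rows (y), axis 1 = cols (x)
    Ixy = np.gradient(Ix, axis=0)
    Ixx = np.gradient(Ix, axis=1)
    Iyy = np.gradient(Iy, axis=0)
    num = Ixx * Iy**2 - 2.0 * Ix * Iy * Ixy + Iyy * Ix**2
    den = (Ix**2 + Iy**2) ** 1.5 + eps  # eps avoids division by zero in flat regions
    return num / den
```

A coarser-scale version (representation 3) could be obtained by smoothing or downsampling the image before applying the same operator.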
For each type of representation, students will construct a neural network and a gradient-descent
learning module. The network will be trained on a dataset of digit images (numerals 0 through 9).
After training, recognition performance is first measured on the training data to gauge how well
the network has learned. To test the network's generalization capability, a second test is
performed on a held-out testing dataset that the network has not seen before.
Data format and availability
The training and testing data are sets of computer files containing images of decimal digits
(http://cs.nyu.edu/courses/spring00/V22.0480-001/index.htm). Each image has a resolution of
28 x 28 pixels, each pixel being one byte of gray-level information. Each digit image has been
size-normalized into a 20 x 20 frame, preserving the original aspect ratio, and has been centered,
i.e. the image's centroid has been translated to the central pixel of the image. There is therefore
an extra 4-pixel-wide margin of background on each side. The data is ready to be used without
further preprocessing. It is courtesy of Dr. Yann LeCun of AT&T Laboratories, an expert in the
field. Interested students may take a look at his homepage (http://www.research.att.com/~yann). The
project dataset will be a subset of the data available there.
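The exact on-disk layout of the course files is not specified above. Assuming each file stores images as consecutive raw 784-byte blocks (28 x 28, one byte per pixel), a minimal loader might look like this:

```python
import numpy as np

def load_digit_images(path, count):
    """Read `count` 28x28 gray-level images stored as consecutive raw bytes.
    NOTE: the raw-block layout is an assumption; adjust if the course files
    use a header or a different ordering."""
    with open(path, "rb") as f:
        raw = np.frombuffer(f.read(count * 28 * 28), dtype=np.uint8)
    # scale pixel values from 0-255 to [0, 1] for use as network inputs
    return raw.reshape(count, 28, 28).astype(np.float32) / 255.0
```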
Network Architecture: Input, g function, and Output variables
Students may start with a simple fully connected single hidden layer network and may elaborate
more on the network structure later to build a priori information into the system.
The Hidden Layer and The Input Layer: The input is a 28 x 28 image, one byte per pixel. There
are a total of 28 x 28 input units (pixels), each with a value in the range 0-255. The hidden layer
will consist of 320 units (64 units x 5 = 320), each represented by a floating-point value
between 0 and 1. Each set of 64 units represents one image feature extracted at a coarse
resolution (8 x 8 pixels), and we allow up to 5 different features to be extracted. Because of this
representation, one could consider each unit (pixel) of the hidden layer to be connected to a
local region of the input layer. For example, each pixel in the hidden layer could be connected
to a 5 x 5 window of the input layer. In that case units 1, 65, 129, 193, and 257 would all be
connected to the same 5 x 5 window on the input layer.
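As a hedged sketch of this local connectivity, the code below computes 5 feature maps of 8 x 8 hidden units, each unit seeing a 5 x 5 input window. Two details are assumptions not fixed by the text: the windows are tiled with stride 3 (8 positions at stride 3 span pixels 0-25 of the 28), and each feature map shares one 5 x 5 weight set across positions (convolution-style); untied per-unit weights would work the same way with a larger weight array.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hidden_maps(img, weights, biases, stride=3):
    """Compute 5 feature maps of 8x8 hidden units from a 28x28 input.
    weights: (5, 5, 5) -- one 5x5 kernel per feature map (sharing assumed).
    biases:  (5,) -- one bias per feature map."""
    out = np.zeros((5, 8, 8))
    for f in range(5):
        for r in range(8):
            for c in range(8):
                win = img[r*stride:r*stride+5, c*stride:c*stride+5]
                out[f, r, c] = sigmoid(np.sum(win * weights[f]) + biases[f])
    return out  # flattening this gives the 320-unit hidden vector
```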
Again each unit on the hidden layer takes a continuous value between 0 and 1. For that we will
be using the function
g\left( \sum_{j=1}^{28 \times 28} w_{ij}\,\xi_j \right) = \frac{1}{1 + e^{-\sum_j w_{ij}\xi_j}}

where the $w_{ij}$ are the weights to be learned and $\xi$ is the input. Here we let $j = 1, \ldots, 28 \times 28$, assuming
that the network is fully connected. If each hidden unit only sees an 8 x 8 window of the input,
then we would have $j = 1, \ldots, 64 = 8 \times 8$. This function is known as the sigmoid
function.
function. Note that
g'(x) = g(x)\,\bigl(1 - g(x)\bigr) = \frac{e^{-x}}{(1 + e^{-x})^2}
and you will need this formula when writing the learning rule.
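Before relying on this identity inside the learning rule, it is worth verifying it numerically; a quick finite-difference check might look like this:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    # analytic derivative g'(x) = g(x)(1 - g(x)), used in the learning rule
    s = sigmoid(x)
    return s * (1.0 - s)

# central-difference check of the identity at several points
xs = np.linspace(-4, 4, 9)
h = 1e-5
numeric = (sigmoid(xs + h) - sigmoid(xs - h)) / (2 * h)
assert np.allclose(numeric, sigmoid_prime(xs), atol=1e-8)
```

The same trick (comparing analytic gradients against finite differences) is useful later for debugging the full backpropagation code.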
The Output Layer and The Hidden Layer: The output layer contains 10 units, one for each
numeral 0-9. These units are also floating-point numbers varying from 0 to 1 according to the
same function g(x), except that now the weights belong to the second layer and the "input" is
the output of the hidden layer. There are 10 x 320 weights in the second layer, since it is fully
connected (every hidden unit to every output unit). The value of an output unit reflects the
strength of the corresponding numeral: the closer the output is to 1, the more likely that numeral
corresponds to the input.
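The second layer described above amounts to a 10 x 320 weight matrix followed by the sigmoid; a minimal sketch (bias terms added here as an assumption):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def output_layer(hidden, W2, b2):
    """hidden: (320,) hidden activations; W2: (10, 320); b2: (10,).
    Returns the 10 output strengths, one per numeral 0-9."""
    return sigmoid(W2 @ hidden + b2)

def classify(hidden, W2, b2):
    # the numeral whose output unit is closest to 1 wins
    return int(np.argmax(output_layer(hidden, W2, b2)))
```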
Learning with Gradient Descent
Make sure you create a debugged learning machine: create a small network with two input units,
three hidden units, and two output units, and test it on some simple functions. Then construct
the full network. There is no point in training on numerals if you are not sure your network code
is debugged.
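A minimal sketch of such a debugging setup: the suggested 2-3-2 network trained by gradient descent on squared error. The choice of target functions (output unit 0 learns AND, unit 1 learns OR), the learning rate, and the random initialization are all arbitrary assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
# 2-input, 3-hidden, 2-output debug network, as suggested above
W1 = rng.normal(0, 0.5, (3, 2)); b1 = np.zeros(3)
W2 = rng.normal(0, 0.5, (2, 3)); b2 = np.zeros(2)

# simple test functions: output unit 0 learns AND, unit 1 learns OR
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0, 0], [0, 1], [0, 1], [1, 1]], dtype=float)

eta = 2.0  # learning rate (assumption)
for _ in range(5000):
    for x, t in zip(X, T):
        # forward pass
        h = sigmoid(W1 @ x + b1)
        y = sigmoid(W2 @ h + b2)
        # backward pass: gradient of squared error, using g' = g(1 - g)
        dy = (y - t) * y * (1 - y)
        dh = (W2.T @ dy) * h * (1 - h)
        W2 -= eta * np.outer(dy, h); b2 -= eta * dy
        W1 -= eta * np.outer(dh, x); b1 -= eta * dh

# after training, thresholded outputs should match the targets
preds = sigmoid(W2 @ sigmoid(W1 @ X.T + b1[:, None]) + b2[:, None]).T
```

Once this small network reproduces its targets, the same forward/backward code can be scaled up to the 784-320-10 architecture with confidence.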