02/15/16
Course Project Proposal
Prof. Davi Geiger, Jong Oh
Computer Vision (V22.0480-001), Spring 2000

Objective

The objective of the project is to implement and experiment with a handwritten-digit recognizer based on a neural network, and to compare recognition performance when different input representations and network architectures are used. Students will try three types of input representation in turn:

1. Raw input image
2. Curvature map representation: each pixel's gray-level value is replaced by the local curvature at that pixel
3. Coarser-scale curvature map: the curvature map is constructed at a coarser scale

For each type of representation, students will construct a neural network and a gradient-descent learning module. The network will be trained on a set of training digit images (numerals 0 through 9). After training, recognition performance is first measured on the training data, to gauge the degree of learning the network has acquired. To test the network's generalization capability, a second test is performed on a testing dataset that the network has not seen beforehand.

Data format and availability

The training and testing data are sets of computer files containing images of decimal digits (http://cs.nyu.edu/courses/spring00/V22.0480-001/index.htm). Each image has a resolution of 28 x 28 pixels, each pixel being one byte of gray-level information. Each digit image has been size-normalized into a 20 x 20 frame, preserving the original aspect ratio, and has also been centered, i.e. the image's centroid has been translated to the central pixel of the image; this leaves an extra 4-pixel border of background on each side. The data is ready to be used without further preprocessing. The data is courtesy of Dr. Yann LeCun of AT&T Laboratories, an expert in the field. Interested students may take a look at his homepage (http://www.research.att.com/~yann). The project dataset will be a subset of the data available there.
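Representations 2 and 3 above can be sketched in code. The handout does not specify which curvature definition to use; a common choice is the isophote (level-set) curvature computed from image derivatives, and the coarser scale can be obtained by block averaging. The sketch below (in Python/NumPy) makes both of those assumptions; the function names and the `eps` regularizer are illustrative.

```python
import numpy as np

def curvature_map(img, eps=1e-8):
    """Replace each pixel's gray level by the local curvature at that pixel.

    Uses the isophote (level-set) curvature built from central-difference
    image derivatives -- one common definition; the course may intend a
    different one.  `eps` avoids division by zero on flat regions.
    """
    Iy, Ix = np.gradient(img)          # first derivatives (rows, cols)
    Iyy, Iyx = np.gradient(Iy)         # second derivatives
    Ixy, Ixx = np.gradient(Ix)
    num = Ixx * Iy**2 - 2.0 * Ix * Iy * Ixy + Iyy * Ix**2
    den = (Ix**2 + Iy**2) ** 1.5 + eps
    return num / den

def coarser_scale(img, factor=2):
    """Build a coarser-scale image by averaging factor x factor blocks."""
    h, w = img.shape
    return img[:h - h % factor, :w - w % factor].reshape(
        h // factor, factor, w // factor, factor).mean(axis=(1, 3))
```

A coarser-scale curvature map would then be `curvature_map(coarser_scale(img))`, i.e. the curvature computed after smoothing to the coarser resolution.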
Network Architecture: Input, g function, and Output variables

Students may start with a simple fully connected network with a single hidden layer, and may later elaborate on the network structure to build a priori information into the system.

The Hidden Layer and The Input Layer: The input is an 8-bit gray-level image. There are a total of 28 x 28 input units (pixels), each with a value in the range 0-255. The hidden layer will consist of 320 units (64 units x 5 = 320), each represented by a floating-point value between 0 and 1. Each set of 64 units represents one image feature extracted at a coarse resolution (8 x 8 pixels), and we allow up to 5 different features to be extracted. Because of this representation, one could consider each unit (pixel) of the hidden layer to be connected to a local region of the input layer. For example, each pixel of the hidden layer could be connected to a 5 x 5 window of the input layer; in this case units 1, 65, 129, 193, and 257 would all be connected to the same 5 x 5 window of the input layer. Again, each unit of the hidden layer takes a continuous value between 0 and 1, computed by the function

    g\Big( \sum_{j=1}^{28 \times 28} w_{ij} \, \xi_j \Big), \qquad g(x) = \frac{1}{1 + e^{-x}},

where the w_{ij} are the weights to be learned and \xi is the input. Here we let j = 1, ..., 28 x 28, assuming that the network is fully connected. If for each hidden unit only an 8 x 8 window of the input is considered, then we would have j = 1, ..., 64 = 8 x 8. This function g is known as the sigmoid function. Note that

    g'(x) = g(x) \, (1 - g(x)) = \frac{e^{-x}}{(1 + e^{-x})^2},

and you will need this formula when writing the learning rule.

The Output Layer and The Hidden Layer: The output layer contains 10 units, one unit for each numeral 0-9. The units are also floating-point numbers varying from 0 to 1 according to the same function g(x), except that now the weights belong to the second layer and the "input" is the output of the hidden layer.
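The g function, its derivative, and the fully connected forward pass described above can be sketched as follows (Python/NumPy; the weight initialization scale and the random input are illustrative choices, not specified by the handout):

```python
import numpy as np

def g(x):
    """Sigmoid function: g(x) = 1 / (1 + e^(-x)), values in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def g_prime(x):
    """Derivative g'(x) = g(x) * (1 - g(x)), needed for the learning rule."""
    s = g(x)
    return s * (1.0 - s)

def forward(xi, W1, W2):
    """Forward pass: input xi (784,) -> hidden (320,) -> output (10,)."""
    h = g(W1 @ xi)   # hidden-layer activations, each between 0 and 1
    y = g(W2 @ h)    # output layer: one unit per numeral 0-9
    return h, y

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.05, size=(320, 28 * 28))  # first-layer weights w_ij
W2 = rng.normal(scale=0.05, size=(10, 320))       # 10 x 320 second-layer weights
h, y = forward(rng.random(28 * 28), W1, W2)
```

For the locally connected variant, each row of `W1` would have nonzero entries only over one window of the input instead of all 784 pixels.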
There are 10 x 320 weights on the second layer, as the second layer is fully connected (between hidden units and output units). The value of an output unit reflects the strength of the corresponding numeral: the closer the output is to 1, the more likely it is that the numeral corresponds to the input.

Learning with Gradient Descent

Make sure you create a debugged learning machine: create a small network with two input units, three hidden units, and two output units, and test it on some simple functions. Then construct the full network. There is no point in trying to learn numerals if you are not sure your network code is debugged.
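The debugging step above can be sketched as follows: a 2-3-2 network trained by batch gradient descent on a squared-error loss, using the g'(x) = g(x)(1 - g(x)) formula. The target functions (XOR and AND of the two inputs), the bias terms, the learning rate, and the iteration count are illustrative choices, not specified by the handout.

```python
import numpy as np

def g(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy problem: 2 inputs, 3 hidden units, 2 outputs.
# Illustrative targets: output 0 learns XOR, output 1 learns AND.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
T = np.array([[0., 0.], [1., 0.], [1., 0.], [0., 1.]])

rng = np.random.default_rng(1)
W1 = rng.normal(scale=0.5, size=(3, 2)); b1 = np.zeros(3)
W2 = rng.normal(scale=0.5, size=(2, 3)); b2 = np.zeros(2)
lr = 0.5  # learning rate

def loss():
    H = g(X @ W1.T + b1)
    Y = g(H @ W2.T + b2)
    return 0.5 * np.sum((Y - T) ** 2)

initial = loss()
for _ in range(5000):
    H = g(X @ W1.T + b1)             # hidden activations
    Y = g(H @ W2.T + b2)             # network outputs
    dY = (Y - T) * Y * (1.0 - Y)     # output deltas, via g' = g(1 - g)
    dH = (dY @ W2) * H * (1.0 - H)   # back-propagated hidden deltas
    W2 -= lr * dY.T @ H; b2 -= lr * dY.sum(axis=0)
    W1 -= lr * dH.T @ X; b1 -= lr * dH.sum(axis=0)
final = loss()
```

If the loss does not drop on a toy problem like this, the learning code is buggy; only after it does should the same machinery be scaled up to the 784-320-10 digit network.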