STAT 503X Assignment 2 Recognizing Letters (10pts) Due: in class Feb 28

The objective is to identify each of a large number of black-and-white rectangular pixel displays as one of the 26 capital letters in the English alphabet.
The character images were based on 20 different fonts and each letter within
these 20 fonts was randomly distorted to produce a file of 20,000 unique stimuli.
Each stimulus was converted into 16 primitive numerical attributes (statistical
moments and edge counts) which were then scaled to fit into a range of integer
values from 0 through 15. We typically train on the first 16000 items and then
use the resulting model to predict the letter category for the remaining 4000.
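The usual split described above can be sketched in a few lines of Python. This is only an illustration: the function name `split_train_test` and the stand-in `cases` list are hypothetical, and real rows would hold a letter label plus the 16 attributes.

```python
def split_train_test(rows, n_train=16000):
    """Split rows in file order: the first n_train for training, the rest for testing."""
    if not 0 <= n_train <= len(rows):
        raise ValueError("n_train must be between 0 and len(rows)")
    return rows[:n_train], rows[n_train:]


# Stand-in for the 20,000 cases (hypothetical placeholder values, not the real data).
cases = list(range(20000))
train, test = split_train_test(cases)
print(len(train), len(test))  # 16000 4000
```

Splitting by file order matches the convention stated in the text; a random split is another reasonable choice when the file order might carry structure.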
The number of instances (cases) is 20000, and there are 17 attributes (variables):
 1. lettr   capital letter (26 values from 1 to 26)
 2. x-box   horizontal position of box      (integer)
 3. y-box   vertical position of box        (integer)
 4. width   width of box                    (integer)
 5. high    height of box                   (integer)
 6. onpix   total # on pixels               (integer)
 7. x-bar   mean x of on pixels in box      (integer)
 8. y-bar   mean y of on pixels in box      (integer)
 9. x2bar   mean x variance                 (integer)
10. y2bar   mean y variance                 (integer)
11. xybar   mean x y correlation            (integer)
12. x2ybr   mean of x * x * y               (integer)
13. xy2br   mean of x * y * y               (integer)
14. x-ege   mean edge count left to right   (integer)
15. xegvy   correlation of x-ege with y     (integer)
16. y-ege   mean edge count bottom to top   (integer)
17. yegvx   correlation of y-ege with x     (integer)
The numbers in each class are as follows:
789 A   766 B   736 C   805 D   768 E   775 F   773 G
734 H   755 I   747 J   739 K   761 L   792 M   783 N
753 O   803 P   783 Q   758 R   748 S   796 T   813 U
764 V   752 W   787 X   786 Y   734 Z
This data has been broken into 3 groups, A-I, J-R, S-Z, each about 6500
cases long. Each working group will be given one of these data sets to classify.
The XGobi-ready files on Vincent are (in /home/dicook/data/):
letter-atoi.dat, .col, .row
letter-jtor.dat, .col, .row
letter-stoz.dat, .col, .row
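As a quick check on the sizes of the three groups, the class counts listed earlier can be summed by letter range. The sketch below copies those counts by hand; the variable names are only illustrative.

```python
# Class counts copied from the list of class sizes given earlier in this handout.
counts = {
    "A": 789, "B": 766, "C": 736, "D": 805, "E": 768, "F": 775, "G": 773,
    "H": 734, "I": 755, "J": 747, "K": 739, "L": 761, "M": 792, "N": 783,
    "O": 753, "P": 803, "Q": 783, "R": 758, "S": 748, "T": 796, "U": 813,
    "V": 764, "W": 752, "X": 787, "Y": 786, "Z": 734,
}

# The three letter ranges handed out to the working groups.
groups = {"A-I": "ABCDEFGHI", "J-R": "JKLMNOPQR", "S-Z": "STUVWXYZ"}

group_sizes = {
    name: sum(counts[letter] for letter in letters)
    for name, letters in groups.items()
}
total = sum(counts.values())

print(group_sizes)  # {'A-I': 6901, 'J-R': 6919, 'S-Z': 6180}
print(total)        # 20000
```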
This data is from the UCI machine learning database:
ftp://ftp.ics.uci.edu/pub/machine-learning-databases/letter-recognition/
-- Creator: David J. Slate
-- Odesta Corporation; 1890 Maple Ave; Suite 115; Evanston, IL 60201
-- Donor: David J. Slate (dave@math.nwu.edu) (708) 491-3867
-- Date: January, 1991
----------
Some hints on working with this data. The first step is to break it into training
and test data sets. Use graphics to get a rough breakdown of which variables are important for different letters, then use various numerical classifiers to find good
classification schemes.
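To make the workflow in these hints concrete, here is a minimal sketch of one simple numerical classifier, a nearest-centroid rule, scored on held-out test cases. The assignment does not prescribe this method; it is just one easy baseline. The toy data is made up for illustration (two classes, two integer attributes in 0 to 15), and the helper names `fit_centroids`, `predict`, `error_rate`, and `make_case` are hypothetical.

```python
import random
from collections import defaultdict


def fit_centroids(X, y):
    """Return the mean attribute vector of each class in the training data."""
    sums = {}
    n = defaultdict(int)
    for row, label in zip(X, y):
        if label not in sums:
            sums[label] = [0.0] * len(row)
        for j, value in enumerate(row):
            sums[label][j] += value
        n[label] += 1
    return {label: [s / n[label] for s in sums[label]] for label in sums}


def predict(centroids, row):
    """Assign row to the class whose centroid is nearest (squared Euclidean distance)."""
    def dist2(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(centroids, key=lambda label: dist2(centroids[label]))


def error_rate(centroids, X, y):
    """Fraction of cases whose predicted class differs from the true class."""
    wrong = sum(predict(centroids, row) != label for row, label in zip(X, y))
    return wrong / len(y)


# Made-up toy data, NOT the letter data: two well-separated classes.
random.seed(0)


def make_case(label):
    center = {"A": (3, 3), "B": (12, 12)}[label]
    row = [min(15, max(0, c + random.randint(-2, 2))) for c in center]
    return row, label


data = [make_case(random.choice("AB")) for _ in range(200)]
X = [row for row, _ in data]
y = [label for _, label in data]

# Hold out the last 20% as test data, mirroring the 16000/4000 split.
X_train, y_train = X[:160], y[:160]
X_test, y_test = X[160:], y[160:]

centroids = fit_centroids(X_train, y_train)
test_error = error_rate(centroids, X_test, y_test)
print("test error rate:", test_error)
```

On the real letter data the classes overlap far more than in this toy example, so comparing the test error of several classifiers is the natural next step.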
The grade for the assignment will depend on the neatness of the report, how
comprehensive the analysis is (are all possible questions answered), the clarity of
the presentation of results, and the precision of the final conclusions.