Winona State University

1. Letter Image Recognition Data
2. Source Information
-- Creator: David J. Slate
-- Odesta Corporation; 1890 Maple Ave; Suite 115; Evanston, IL 60201
-- Donor: David J. Slate (dave@math.nwu.edu) (708) 491-3867
-- Date: January, 1991
3. Past Usage:
-- P. W. Frey and D. J. Slate (Machine Learning Vol 6 #2 March 91):
"Letter Recognition Using Holland-style Adaptive Classifiers".
The research for this article investigated the ability of several
variations of Holland-style adaptive classifier systems to learn to
correctly guess the letter categories associated with vectors of 16
integer attributes extracted from raster scan images of the
letters.
The best accuracy obtained was a little over 80%. It would be
interesting to see how well other methods do with the same data.
4. Relevant Information:
The objective is to identify each of a large number of black-and-white
rectangular pixel displays as one of the 26 capital letters in the
English alphabet. The character images were based on 20 different
fonts and each letter within these 20 fonts was randomly distorted to
produce a file of 20,000 unique stimuli. Each stimulus was converted
into 16 primitive numerical attributes (statistical moments and edge
counts) which were then scaled to fit into a range of integer values
from 0 through 15. We typically train on the first 16000 items and
then use the resulting model to predict the letter category for the
remaining 4000. See the article cited above for more details.
5. Number of Instances: 20000
6. Number of Attributes: 17 (Letter category and 16 numeric features)
7. Attribute Information:
   1.  letter  capital letter (26 values from A to Z)
   2.  x-box   horizontal position of box (integer)
   3.  y-box   vertical position of box (integer)
   4.  width   width of box (integer)
   5.  high    height of box (integer)
   6.  onpix   total # on pixels (integer)
   7.  x-bar   mean x of on pixels in box (integer)
   8.  y-bar   mean y of on pixels in box (integer)
   9.  x2bar   mean x variance (integer)
   10. y2bar   mean y variance (integer)
   11. xybar   mean x y correlation (integer)
   12. x2ybr   mean of x * x * y (integer)
   13. xy2br   mean of x * y * y (integer)
   14. x-ege   mean edge count left to right (integer)
   15. xegvy   correlation of x-ege with y (integer)
   16. y-edge  mean edge count bottom to top (integer)
   17. yedgvx  correlation of y-edge with x (integer)
8. Missing Attribute Values: None
9. Class Distribution:
789 A   766 B   736 C   805 D   768 E   775 F   773 G
734 H   755 I   747 J   739 K   761 L   792 M   783 N
753 O   803 P   783 Q   758 R   748 S   796 T   813 U
764 V   752 W   787 X   786 Y   734 Z
Problem: Use the different classification methods we have examined in this class to predict the
letter from the 16 measured attributes. See if you can beat the roughly 80% accuracy (a 20%
error rate) achieved by the researchers who previously examined these data. Be sure to use a
train/test set approach to checking accuracy as mentioned above. Use the first 16,000 cases as
the training set and the remaining 4,000 cases as the test set. Summarize your findings.
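The course work here is done in R or JMP, but the train/test workflow can be sketched in any language. The snippet below is a minimal, hypothetical illustration in Python: it uses synthetic stand-in data (three fake "letters" with 16 integer features in 0-15, not the real letter-recognition file) and a simple nearest-centroid classifier, just to show the split-train-score pattern.

```python
import random

random.seed(0)

# Synthetic stand-in for the letter data: 3 "letters", 16 integer features
# in 0..15, each clustered around a per-class center (hypothetical data).
centers = {"A": 3, "B": 8, "C": 13}
data = []
for _ in range(1000):
    letter = random.choice("ABC")
    feats = [min(15, max(0, centers[letter] + random.randint(-2, 2)))
             for _ in range(16)]
    data.append((letter, feats))

# Train/test split in the spirit of the assignment:
# first 80% of cases train, the remainder test.
split = int(0.8 * len(data))
train, test = data[:split], data[split:]

# Nearest-centroid classifier: average feature vector per class.
centroids = {}
for letter in "ABC":
    rows = [f for (l, f) in train if l == letter]
    centroids[letter] = [sum(col) / len(rows) for col in zip(*rows)]

def predict(feats):
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(feats, centroids[c]))
    return min(centroids, key=sq_dist)

correct = sum(predict(f) == l for (l, f) in test)
accuracy = correct / len(test)
print(f"test accuracy: {accuracy:.3f}")
```

The key point is that accuracy is measured only on cases the model never saw during training, exactly as the problem asks with the 16,000/4,000 split.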
2. U.S. Colleges and Universities Data (College Data (reduced).JMP and Colleges,
Colleges.Public, and Colleges.Private in R)
a) Problem: Use different clustering methods to perform a cluster analysis using all the
numeric variables in this data set. Choose the one you think produces the most reasonable
clusters. Also choose a number of clusters k that you think is reasonable, and then use
discrimination methods to determine what the clusters have in common. Are the
clusters homogeneous in predictable ways: private vs. public, DI vs. DII & DIII,
geographic, prestige/reputation, etc.?
b) Problem: Repeat (a) for the private colleges and universities.
c) Problem: Repeat (a) for the public colleges and universities.
Some of the colleges and universities have missing values for some of the variables. One way to
eliminate observations with missing data in R is to use the function na.omit():
> Colleges.na = na.omit(Colleges)
> Public.na = na.omit(Colleges.Public)
> Private.na = na.omit(Colleges.Private)
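The complete-cases-then-cluster pipeline can also be sketched outside R. The snippet below is a hypothetical Python illustration, not the real college data: it drops incomplete rows (the na.omit step), standardizes the columns so variables on different scales get equal weight, and runs a minimal k-means with k = 2.

```python
import random

random.seed(1)

# Hypothetical college-like rows (enrollment, tuition); None marks a
# missing value. These are made-up numbers, not the actual data set.
rows = [
    (5000, 8000), (5200, None), (4800, 7500),    # "public"-like
    (2000, 30000), (None, 31000), (2100, 29000)  # "private"-like
]

# na.omit analogue: keep only complete cases.
complete = [r for r in rows if None not in r]

# Standardize each column (mean 0, sd 1) before clustering.
def scale(col):
    m = sum(col) / len(col)
    s = (sum((x - m) ** 2 for x in col) / len(col)) ** 0.5
    return [(x - m) / s for x in col]

scaled = list(zip(*[scale(c) for c in zip(*complete)]))

# Minimal k-means (Lloyd's algorithm), seeded with the first k points.
def kmeans(points, k, iters=20):
    cents = list(points[:k])
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, cents[i])))
            groups[j].append(p)
        cents = [tuple(sum(xs) / len(g) for xs in zip(*g)) if g else cents[i]
                 for i, g in enumerate(groups)]
    return [min(range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, cents[i])))
            for p in points]

labels = kmeans(scaled, 2)
print(labels)
```

On these toy rows the two clusters recover the public-like vs. private-like split, which is exactly the kind of "are the clusters homogeneous in predictable ways" check the problem asks for.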
3. Digit Recognition Data
(ZipTrain.JMP and ZipTest.JMP and zip.train, and zip.test in R from the library
ElemStatLearn which you will need to install)
Problem: Develop a model to predict the digit based upon the information in the training data
set. The data consists of grayscale intensities in a 16 X 16 grid. To view the data you can use
the following command structure:
> image(zip2image(zip.train,line))
where line is the row number corresponding to one of the 7291 observations in the training
data set. Use various classification methods to predict the digits in the test data set. A
misclassification rate of 2.5% on the test data is considered outstanding; can you do it?
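What zip2image does, in essence, is reinterpret one row of 256 grayscale values as a 16 x 16 grid. The Python sketch below mimics that reshaping on a hypothetical hand-built row (a crude vertical stroke), rendering it as ASCII rather than with R's image():

```python
# A hypothetical stand-in for one row of zip.train: 256 grayscale values.
row = [0.0] * 256
for r in range(3, 13):       # draw a crude vertical stroke (a "1")
    row[r * 16 + 8] = 1.0

# The zip2image idea: reshape the 256-vector into a 16 x 16 grid,
# row r occupying entries r*16 .. r*16+15.
grid = [row[r * 16:(r + 1) * 16] for r in range(16)]

# Crude text rendering in place of image(): '#' for dark pixels.
for scanline in grid:
    print("".join("#" if v > 0.5 else "." for v in scanline))
```

Once each observation is viewed this way, any classifier that accepts a 256-dimensional feature vector can be applied directly; the 16 x 16 structure only matters for visualization.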
4. Human Tumor Microarray Data
(NCI.JMP and NCI Transpose.JMP and nci & ncitranspose in R from the library
ElemStatLearn which you will need to install)
a) Treating tumors as observations and the 6830 gene expressions as variables, cluster the tumor
types. Do the different tumor types tend to cluster together? Try different clustering
strategies, methods within strategy, and metrics where appropriate. Show the results of the
“best” clustering you found. Use ncitranspose in R and NCI Transpose.JMP.
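The hierarchical strategy behind this problem can be sketched compactly. The Python snippet below runs single-linkage agglomerative clustering on hypothetical expression profiles (six made-up "tumors" over four "genes", not the NCI data), merging the closest pair of clusters until three remain:

```python
# Hypothetical expression profiles: 6 "tumors" x 4 "genes" (made-up values).
profiles = {
    "MELAN1": (5.0, 5.1, 0.2, 0.1),
    "MELAN2": (4.8, 5.3, 0.0, 0.3),
    "RENAL1": (0.1, 0.2, 4.9, 5.2),
    "RENAL2": (0.3, 0.0, 5.1, 4.8),
    "CNS1":   (2.5, 2.4, 2.6, 2.5),
    "CNS2":   (2.4, 2.6, 2.5, 2.4),
}

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Agglomerative clustering: start with singletons, repeatedly merge the
# closest pair. Single linkage: cluster distance = minimum pairwise distance.
clusters = [[name] for name in profiles]
while len(clusters) > 3:
    best = None
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            d = min(dist(profiles[a], profiles[b])
                    for a in clusters[i] for b in clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
    _, i, j = best
    clusters[i] += clusters.pop(j)

print(sorted(sorted(c) for c in clusters))
```

Swapping the linkage rule (complete, average) or the metric (Euclidean, correlation-based) changes which merges happen first, which is precisely the "strategies, methods, and metrics" comparison the problem asks for.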
b) Use K-means (kmeans) and/or K-medoids (pam) clustering to cluster the tumor types. What
number of clusters seems “optimal”? Use ncitranspose.
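One common way to judge an "optimal" k is the elbow in the total within-cluster sum of squares (WSS): WSS always decreases as k grows, so look for the k after which the improvement flattens. A hypothetical 1-D Python sketch (not the NCI data):

```python
# Two obvious groups in 1-D (made-up data): WSS should drop sharply at k = 2.
points = [1.0, 1.2, 0.8, 1.1, 9.0, 9.2, 8.8, 9.1]

def kmeans_1d(xs, k, iters=25):
    cents = xs[:k]                       # seed with the first k points
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            j = min(range(k), key=lambda i: (x - cents[i]) ** 2)
            groups[j].append(x)
        cents = [sum(g) / len(g) if g else cents[i]
                 for i, g in enumerate(groups)]
    return groups

wss_by_k = {}
for k in (1, 2, 3):
    groups = kmeans_1d(points, k)
    wss_by_k[k] = sum((x - sum(g) / len(g)) ** 2
                      for g in groups if g for x in g)
    print(k, round(wss_by_k[k], 2))
```

Here the drop from k = 1 to k = 2 is enormous while k = 3 barely helps, so the elbow picks k = 2. With pam, the average silhouette width serves the same purpose.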
c) Perform a two-way clustering in JMP using the NCI.JMP data. Is there evidence of clustering of
the different genes, and if so what disease(s) are these gene clusters associated with? Again
try different methods and metrics choosing one you like “best”.
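JMP's two-way clustering amounts to clustering the rows (genes) and the columns (samples) of the expression matrix independently and reordering both axes so similar rows and similar columns sit together. A hypothetical Python sketch on a made-up 4 x 4 matrix (a greedy nearest-neighbor ordering stands in for a dendrogram ordering):

```python
# Hypothetical 4 x 4 expression matrix (genes x samples), not the NCI data.
matrix = [
    [5.0, 0.1, 5.1, 0.2],   # gene g1: high in samples s1, s3
    [0.2, 4.9, 0.1, 5.0],   # gene g2: high in samples s2, s4
    [5.2, 0.0, 4.8, 0.1],   # gene g3: like g1
    [0.1, 5.1, 0.3, 4.9],   # gene g4: like g2
]

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def order_by_similarity(vectors):
    # Greedy ordering: start at index 0, repeatedly append the nearest
    # unused vector -- a crude stand-in for a dendrogram leaf ordering.
    order, left = [0], set(range(1, len(vectors)))
    while left:
        nxt = min(left, key=lambda j: dist(vectors[order[-1]], vectors[j]))
        order.append(nxt)
        left.remove(nxt)
    return order

row_order = order_by_similarity(matrix)
cols = list(zip(*matrix))                # transpose to cluster the samples
col_order = order_by_similarity(cols)

# Reorder both axes: blocks of co-expressed genes/samples become visible.
reordered = [[matrix[r][c] for c in col_order] for r in row_order]
print(row_order, col_order)
```

After reordering, g1/g3 and s1/s3 end up adjacent (likewise g2/g4 and s2/s4), so the co-expressed block structure is visible in the heat map, which is what the gene-by-disease association question is probing.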