Final Project

Data Preparation
Initially, the data set contained 7071 columns: one for each gene, plus one holding a serial number
for each instance. The information about each patient was recorded in rows. There were 70 rows: one
for each of the 69 patients and one with the name of each gene.
Your first step is to normalize the data. Domain experts had previously established that values
within the range of 20 to 16000 are reliable for this experiment. You will probably need to write a
small program (e.g., in Java) that reads the data set and changes any gene value less than 20 to 20
and any gene value greater than 16000 to 16000. The next step is to select the genes that seem to be
correlated with the outcome.
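
The thresholding step described above could be sketched in Java roughly as follows. This is only a minimal illustration: the class name ExpressionClipper, the method clip, and the in-memory double[][] layout (one row per patient, one column per gene) are assumptions made for the example, and reading the actual data file is left out.

```java
import java.util.Arrays;

/** Sketch: clamp gene-expression values to the reliable range [20, 16000]. */
class ExpressionClipper {
    static final double FLOOR = 20.0;      // lower bound given in the text
    static final double CEILING = 16000.0; // upper bound given in the text

    /** Returns a copy of the matrix with every value clamped to [FLOOR, CEILING]. */
    static double[][] clip(double[][] values) {
        double[][] out = new double[values.length][];
        for (int i = 0; i < values.length; i++) {
            out[i] = new double[values[i].length];
            for (int j = 0; j < values[i].length; j++) {
                out[i][j] = Math.max(FLOOR, Math.min(CEILING, values[i][j]));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical values for one patient, not taken from the real data set.
        double[][] sample = { { -5.0, 250.0, 20000.0 } };
        System.out.println(Arrays.toString(clip(sample)[0])); // prints [20.0, 250.0, 16000.0]
    }
}
```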
Since many learning algorithms look for non-linear combinations of features, a data set with few
records and many genes might give us false positives. Gene reduction can therefore improve
classification accuracy. To do that, you will probably need to write a small Java program to remove
genes with a fold difference of less than 2. Fold difference is defined as (max-min)/2, where max and
min are the maximum and minimum values of the gene expression across all the instances, respectively.
A fold difference of less than 2 means that the gene value does not change much across the samples
and, as such, cannot influence the class significantly.
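
One possible shape for the fold-difference filter is sketched below. It uses the definition given above, (max-min)/2, and assumes the values have already been limited to the 20 to 16000 range. The class name FoldDifferenceFilter, its method names, and the double[][] layout (rows are patients, columns are genes) are hypothetical choices for this example.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: keep only genes whose fold difference, (max - min) / 2, is at least a threshold. */
class FoldDifferenceFilter {

    /** Fold difference of one gene column, using the definition from the text. */
    static double foldDifference(double[][] data, int gene) {
        double max = Double.NEGATIVE_INFINITY;
        double min = Double.POSITIVE_INFINITY;
        for (double[] patient : data) {
            max = Math.max(max, patient[gene]);
            min = Math.min(min, patient[gene]);
        }
        return (max - min) / 2.0;
    }

    /** Returns the column indices of the genes that survive the filter. */
    static List<Integer> genesToKeep(double[][] data, double threshold) {
        List<Integer> keep = new ArrayList<>();
        int geneCount = data.length == 0 ? 0 : data[0].length;
        for (int g = 0; g < geneCount; g++) {
            if (foldDifference(data, g) >= threshold) {
                keep.add(g);
            }
        }
        return keep;
    }

    public static void main(String[] args) {
        // Hypothetical data: 3 patients, 2 genes.
        double[][] data = { { 20.0, 100.0 }, { 22.0, 500.0 }, { 21.0, 16000.0 } };
        System.out.println(genesToKeep(data, 2.0)); // prints [1]
    }
}
```

In the example, gene 0 has a fold difference of (22-20)/2 = 1 and is removed, while gene 1 varies widely and is kept.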
Next, you need to calculate the T-value of each class for every gene. The T-value is a linear method
that helps with gene reduction, which, as observed above, helps classification accuracy. In this
case, we take the absolute values of the T-values and only use the highest. The T-value is calculated
as follows:
$$T = \frac{\mathrm{Avg}_1 - \mathrm{Avg}_2}{\sqrt{\mathrm{Stdev}_1^2 / N_1 + \mathrm{Stdev}_2^2 / N_2}}$$
Avg1 is the average of the gene's values over the samples of one class, and Avg2 is the average over
the samples of the other 4 classes. Stdev1 is the standard deviation for that one class and Stdev2 is
the standard deviation for the other classes. Similarly, N1 is the number of samples that belong to
the class whose T-value we are interested in, and N2 is the number of samples that do not belong to
that class.
You can use Microsoft Excel and the CSV file to calculate the T-values of all classes.
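
If you would rather compute the T-values in code than in a spreadsheet, a minimal Java sketch of the one-class-versus-the-rest calculation might look like the following. It assumes the sample standard deviation (dividing by N-1), which the text does not specify. The class name TValueCalculator, the integer class labels, and the example values are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: T-value of one gene for one class versus all other classes. */
class TValueCalculator {

    static double mean(List<Double> xs) {
        double sum = 0.0;
        for (double x : xs) sum += x;
        return sum / xs.size();
    }

    /** Sample variance (divides by n - 1); this choice is an assumption. */
    static double variance(List<Double> xs) {
        double m = mean(xs);
        double ss = 0.0;
        for (double x : xs) ss += (x - m) * (x - m);
        return ss / (xs.size() - 1);
    }

    /**
     * gene[i] is the expression value of the gene for sample i,
     * labels[i] is the class of sample i. Each group needs at least 2 samples.
     */
    static double tValue(double[] gene, int[] labels, int targetClass) {
        List<Double> inside = new ArrayList<>();  // samples of the target class
        List<Double> outside = new ArrayList<>(); // samples of every other class
        for (int i = 0; i < gene.length; i++) {
            (labels[i] == targetClass ? inside : outside).add(gene[i]);
        }
        double stdError = Math.sqrt(variance(inside) / inside.size()
                + variance(outside) / outside.size());
        return (mean(inside) - mean(outside)) / stdError;
    }

    public static void main(String[] args) {
        // Hypothetical gene values and class labels for six samples.
        double[] gene = { 1.0, 2.0, 3.0, 4.0, 6.0, 8.0 };
        int[] labels = { 0, 0, 0, 1, 1, 1 };
        // The text ranks genes by the absolute value of the T-value.
        System.out.println(Math.abs(tValue(gene, labels, 0)));
    }
}
```

Calling tValue once per class for each gene gives the per-class T-values, whose absolute values can then be compared.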
Then you can follow the instructions on the website.