
Krishnakumar Sridharan
Machine learning progress – 12/13/2011 to 12/29/2011
Questions:
In this update I describe the trials I performed to answer the following questions:
A. How does ML performance vary when different sequence-similarity thresholds are considered
for redundancy reduction of training data sets?
B. How can we prove that the performance observed so far is significant? Could comparing very
similar datasets prove that our ROC values are better than random guessing?
C. Is the methodology species-specific? That is, will the ML classifier need to be trained species by
species, group by group, or is the classifier more generic?
Description of trials and their results:
A. Trial of different sequence-similarity thresholds:
1. Previous implementations of redundancy reduction were based on 100% pair-wise
sequence similarity. That is, given two sequences that are 100% similar, one was kept for
the training data set and the other was discarded.
2. Following YM’s suggestion, I tried different similarity thresholds to observe any changes in
performance.
3. Training data: GENESEQER and EST2TSS run on the full Arabidopsis Chr 1 (~12,000 positive and
~5,300 negative instances). The [-50, +50] TSS-flanking region was considered.
Redundancy reduction was performed with the CD-HIT-EST software (Li et al., Bioinformatics 2006); a scripting sketch follows this list.
4. Testing data: GENESEQER and EST2TSS run on the full Arabidopsis Chr 2 (~7,400 positive and
~3,222 negative instances). The [-50, +50] TSS-flanking region was considered.
5. A Random Forest with 100 trees was trained.
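A minimal sketch of how the redundancy-reduction step might be scripted, assuming CD-HIT-EST is on the PATH; the FASTA file names are placeholders, and the word-size mapping follows the CD-HIT user guide:

```python
import subprocess

# CD-HIT-EST pairs each identity threshold (-c) with a recommended
# word size (-n); the values below follow the CD-HIT user guide.
WORD_SIZE = {0.80: 5, 0.85: 6, 0.90: 8, 0.95: 9, 1.00: 10}

def reduce_redundancy(fasta_in, fasta_out, threshold):
    """Cluster sequences at the given identity threshold; CD-HIT-EST
    writes one representative sequence per cluster to fasta_out."""
    subprocess.run(
        ["cd-hit-est",
         "-i", fasta_in,                     # TSS-flanking sequences
         "-o", fasta_out,                    # non-redundant output
         "-c", str(threshold),               # pair-wise identity cutoff
         "-n", str(WORD_SIZE[threshold])],   # word size matched to cutoff
        check=True)

# One run per threshold in Table 1 (file names are hypothetical).
for t in (0.80, 0.85, 0.90, 0.95, 1.00):
    reduce_redundancy("chr1_tss_positive.fa",
                      f"chr1_tss_positive_nr{round(t * 100)}.fa", t)
```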
Training data:

Pair-wise similarity threshold | Sequences in TSS-Positive | Sequences in TSS-Negative
 80%                           |  6,672                    | 5,179
 85%                           |  7,289                    | 5,201
 90%                           |  8,213                    | 5,220
 95%                           |  9,916                    | 5,251
100%                           | 11,811                    | 5,305

Table 1: Number of data instances per trial
*80% is the lowest sequence-similarity value allowed by the CD-HIT-EST software
Results:

Pair-wise similarity threshold | TP Rate | FP Rate | Precision | Recall | F-Measure | AUC
 80%                           | 0.90    | 0.50    | 0.80      | 0.89   | 0.84      | 0.79
 85%                           | 0.92    | 0.57    | 0.79      | 0.92   | 0.85      | 0.79
 90%                           | 0.94    | 0.63    | 0.77      | 0.94   | 0.85      | 0.80
 95%                           | 0.97    | 0.72    | 0.76      | 0.97   | 0.85      | 0.80
100%                           | 0.98    | 0.79    | 0.74      | 0.98   | 0.85      | 0.79

Table 2: Performance with different sequence-similarity thresholds
Conclusions:
a. Although performance at the different sequence-similarity thresholds is very similar, as the
similarity threshold for elimination decreases, the precision of predicting instances of the “TSS”
class increases while recall decreases.
b. It is important to note that in both the training and testing data sets, the numbers of “TSS” and
“non-TSS” instances are unequal. (When these similarity thresholds were tried on a balanced
testing data set, similar results were observed.) If lower similarity thresholds were possible, I
would expect recall to decrease and precision to increase further for the “TSS” class.
c. The recall-to-precision trade-off is at a satisfactory 94% recall and 77% precision, with
good F-measure and AUC, at a sequence-similarity threshold of 90%.
B. Trial of ML methodology to classify very similar datasets:
1. To better substantiate the performance values we observe in our classifications, it is
essential to show that they were obtained not by chance but through observable trends.
2. One way to do this is to perform a classification in which the two classes of data have been
derived by identical methods from identical sources. If the classifier performs markedly better
than chance on classes derived this way, then the methodology is flawed or biased.
3. Details of the trial:
- GENESEQER and EST2TSS are run on Arabidopsis chromosome 1 to obtain a TSS-positive dataset (>5 5’-EST ends with good GENESEQER quality scores) of ~12,000
sequences
- This large data set is randomized and then partitioned into 2 equal subsets
- Redundancy reduction at a 90% sequence-similarity threshold is applied, giving ~4,600 non-redundant sequences in each subset, to which our workflow is applied
- Random Forest with 100 trees is used with 5-fold cross-validation
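A minimal sketch of this negative-control experiment, substituting scikit-learn's RandomForestClassifier for WEKA's 100-tree Random Forest and using random numbers as a stand-in for the real feature matrix:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# X would be the feature matrix of the ~9,200 non-redundant
# TSS-positive sequences (2 x ~4,600); random numbers stand in here.
rng = np.random.default_rng(0)
X = rng.random((9200, 64))

# Randomize the order, then call one half "TSS-Subset1" (label 0)
# and the other half "TSS-Subset2" (label 1).
y = np.zeros(len(X), dtype=int)
y[rng.permutation(len(X))[len(X) // 2:]] = 1

# 5-fold cross-validated AUC; because both "classes" come from the
# same source, it should hover near 0.5 (random guessing).
clf = RandomForestClassifier(n_estimators=100, random_state=0)
print(cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean())
```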
Results:

Class label | TP Rate | FP Rate | Precision | Recall | F-Measure | AUC
TSS-Subset1 | 0.51    | 0.55    | 0.485     | 0.51   | 0.50      | 0.48
TSS-Subset2 | 0.45    | 0.49    | 0.483     | 0.45   | 0.47      | 0.48

Table 3: Performance on very similar datasets
Conclusions:
a. Given two nearly identical datasets, the performance of our workflow is no better than random
guessing of class labels.
b. This indicates that the performance values we observe are meaningful, and that our
workflow is not biased in any observable way.
C. Inter-species TSS classification:
1. In order to build a final ML classifier, we need to find the best source of training data for our
Random Forest algorithm. Until now, I had trained and tested within a single species to
obtain my performance values.
2. This trial had two dimensions:
- A preliminary trial in which I train the classifier on a single chromosome of one species
and test it on a single chromosome of another species
- A large-scale trial in which I train the ML classifier on one species and test it on
another full species
C.I. Single Chromosome Trial:
Training data set: EST2TSS run on Arabidopsis chromosome 1 – 12,075 TSS-positive and 5,300
TSS-negative instances
Testing data sets: a) EST2TSS run on Arabidopsis chromosome 2 – 2,000 instances each of TSS-positive
and TSS-negative; b) EST2TSS run on Rice chromosome 1 – 2,000 instances each of TSS-positive and
TSS-negative; c) 5-fold cross-validation
The [-50, +50] TSS-flanking region is considered for feature extraction and a Random Forest
algorithm with 100 trees is run; a sketch of one possible feature encoding appears below.
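This update does not specify which features are extracted from the [-50, +50] window; purely for illustration, the sketch below assumes trinucleotide (3-mer) frequencies as the encoding and scikit-learn's RandomForestClassifier in place of WEKA:

```python
from itertools import product
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Illustrative stand-in only: the workflow's actual feature set is
# not detailed in this update.
TRIMERS = ["".join(p) for p in product("ACGT", repeat=3)]
INDEX = {t: i for i, t in enumerate(TRIMERS)}

def kmer_features(seq):
    """Return a 64-dimensional trinucleotide frequency vector for a
    101-nt [-50, +50] TSS-flanking window."""
    counts = np.zeros(len(TRIMERS))
    seq = seq.upper()
    for i in range(len(seq) - 2):
        j = INDEX.get(seq[i:i + 3])
        if j is not None:            # skip triplets containing N, etc.
            counts[j] += 1
    total = counts.sum()
    return counts / total if total else counts

def train_rf(pos_seqs, neg_seqs):
    """Train the 100-tree Random Forest on labeled flanking windows."""
    X = np.array([kmer_features(s) for s in pos_seqs + neg_seqs])
    y = np.array([1] * len(pos_seqs) + [0] * len(neg_seqs))
    return RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
```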
Results:

Testing method          | TP Rate | FP Rate | Precision | Recall | F-Measure | AUC
Arabidopsis Chr 2       | 0.98    | 0.79    | 0.56      | 0.98   | 0.71      | 0.80
Rice Chr 1              | 0.91    | 0.90    | 0.50      | 0.91   | 0.65      | 0.49
5-fold cross-validation | 0.98    | 0.81    | 0.73      | 0.98   | 0.84      | 0.78

Table 4: Performance with training on a single chromosome
Preliminary Conclusions:
a. In this small-scale trial, where the classifier is tested on a single chromosome of another
species, a species-specific design seems to be in order.
b. This trial and its results could be confounded by Rice being a Monocot while
Arabidopsis is a Dicot.
c. Also, the trial was performed on only one chromosome, so it could be limited in scope
and robustness, and possibly an over-simplification of a real-world ML classifier.
C.II. Whole Species Trial:
The reasons for performing this larger and more time-consuming trial were as follows:
1. The previous single-chromosome trial could be biased due to the small size of the data sets
2. A more comprehensive test would incorporate the groups the species belong to
Data used for the trial:
a) EST2TSS was run on GeneSeqer output with the following parameters:
1) TSS-positive data was generated where EST2TSS found >80% coverage and similarity in
the GeneSeqer output, along with at least 3 5’-EST ends supporting a TSS
2) TSS-negative data was extracted from genomic stretches of at least 1,500 bp with
no 5’-EST ends aligning anywhere within the region (see the sketch after this list)
b) The [-50, +50] region flanking the TSS was used to extract features
c) The CD-HIT-EST software was used with a 90% sequence-similarity threshold for redundancy
reduction
d) A Random Forest of 100 trees was used to build the classifier
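A minimal sketch of the TSS-negative criterion in a.2, assuming a sorted list of aligned 5’-EST end coordinates per chromosome; the function name and inputs are hypothetical:

```python
def est_free_regions(est_ends, chrom_length, min_len=1500):
    """Yield (start, end) genomic intervals of at least min_len bp
    that contain no aligned 5'-EST ends, per criterion a.2.
    est_ends: sorted 0-based 5'-EST end coordinates on one chromosome."""
    prev = 0
    for pos in list(est_ends) + [chrom_length]:
        if pos - prev >= min_len:
            yield (prev, pos)
        prev = pos + 1

# Example: 5'-EST ends at 2,000 and 10,000 on a 20,000-bp sequence
# leave three EST-free stretches to sample TSS-negative data from.
print(list(est_free_regions([2000, 10000], 20000)))
# [(0, 2000), (2001, 10000), (10001, 20000)]
```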
Group label | Training dataset species and instances                       | Testing dataset species and instances
Monocots    | Sorghum bicolor (TSS-Positive: 12,602; TSS-Negative: 18,311) | Brachypodium distachyon (TSS-Positive: 5,489; TSS-Negative: 12,776)
Dicots**    | Vitis vinifera (TSS-Positive: 33,656; TSS-Negative: 40,200)  | Solanum lycopersicum (TSS-Positive: 19,933; TSS-Negative: 23,377)

Table 5: Data used in the Monocot and Dicot** species trials, to account for any ambiguities due to differing groups of plants.
**The whole-genome dataset trial for the Dicot species had been running for a long time (>55 hrs) in
WEKA at the time of this update; its results will be added as soon as it completes.
Meanwhile, I present results from the RF_100 classifier with the following data: a) training
data: ~10,000 TSS-positive and TSS-negative instances from the Vitis vinifera genome; b) testing data:
~6,000 TSS-positive and TSS-negative instances from the Solanum lycopersicum genome.
Results:

Group label | TP Rate | FP Rate | Precision | Recall | F-Measure | AUC
Monocots    | 0.70    | 0.69    | 0.65      | 0.70   | 0.58      | 0.71
Dicots*     | 0.55    | 0.54    | 0.58      | 0.55   | 0.40      | 0.58

Table 6: Weighted-average performance of trials in different groups of flowering plants
*The Dicot trial results presented here are derived from limited, randomly selected data points for
training and testing from the 2 species. The larger trial is pending completion, but similar results
are expected.
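Table 6 reports weighted averages over the two class labels, as in WEKA's "Weighted Avg." summary row; a sketch of the equivalent computation in scikit-learn, with toy labels standing in for real predictions:

```python
from sklearn.metrics import precision_recall_fscore_support

# WEKA's "Weighted Avg." row averages per-class metrics weighted by
# class size; scikit-learn's average="weighted" does the same thing.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]    # toy labels: 1 = TSS, 0 = non-TSS
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted")
print(prec, rec, f1)
```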
Conclusions:
a. From both the single-chromosome and the whole-genome inter-species trials of the ML classifier
performed so far, it is observed that the classifier performs better with species-specific
training.
b. Based on these results, the ML classifier would need to be trained species by species to achieve
good performance.
Summary of findings so far:
A. I plan to incorporate the CD-HIT-EST software into my workflow and use an 85-90% pair-wise
sequence-similarity threshold for redundancy reduction in training datasets
B. The performance reported in all trials is meaningful, as the classification of nearly identical
datasets shows poor performance on par with random guessing of class labels
C. The single-chromosome and whole-species trials support a species-specific training protocol for
the ML classifiers
Future work for the coming week:
a. Implement the classifier on a species/chromosome and compare its predictions with annotated TSS data
b. Compare strong promoters (>10-15 5’-EST end support) with other negative data