Machine Learning progress:
1. ML trial with other TSS-positive data:
The object of this trial was to check whether the performance of the machine-learning algorithms depends on the datasets selected. That is, given another TSS-positive dataset, would the observed performance of predicting the “TSS” class change radically, which would indicate a bias in the machine-learning methodology?
a. The TSS-positive dataset was extracted from TAIR (instead of from EST2TSS)
b. TSS-negative datasets were extracted from EST2TSS (regions with absolutely no ESTs aligning to the genome) and from random sequences with conserved 3-mer order
c. 1000 instances per class, 21*100 = 2100 features, 10-fold cross-validation (a minimal evaluation sketch is given after the conclusions below)
d. Results:
Bayesian Network
Negative dataset used   True Positive Rate   False Positive Rate   Precision   Recall   F-measure   ROC area
EST2TSS                 0.696                0.349                 0.666       0.696    0.681       0.75
Random                  0.67                 0.334                 0.667       0.67     0.669       0.727
J48 Tree
Negative dataset used   True Positive Rate   False Positive Rate   Precision   Recall   F-measure   ROC area
EST2TSS                 0.628                0.355                 0.639       0.628    0.633       0.637
Random                  0.613                0.406                 0.602       0.613    0.607       0.608
e. Conclusions:
The performance in these trials is very close to that of the trials in which the EST2TSS-positive set is compared against the EST2TSS-negative set (difference < 0.04 for each performance measure). The machine-learning methods therefore do not seem to be biased towards the classification of any particular dataset pair (that is, TAIR vs. EST2TSS-negative, or EST2TSS-positive vs. EST2TSS-negative).
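The classifier names and reported measures above (TP rate, FP rate, precision, recall, F-measure, ROC area) suggest a Weka-style cross-validation run; the sketch below only illustrates the same kind of 10-fold evaluation using scikit-learn. The feature matrix X (1000 instances per class, 2100 features) and the 0/1 label vector y are assumed to be loaded already, and GaussianNB is merely a stand-in for the Bayesian network model, not the classifier actually used.

```python
# Minimal sketch: 10-fold cross-validated evaluation of a TSS vs. non-TSS
# classifier, reporting the same measures as in the tables above.
# Assumptions: X is a NumPy feature matrix, y a 0/1 label vector (1 = "TSS");
# GaussianNB is only a stand-in for the Bayesian network model.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             roc_auc_score, confusion_matrix)

def evaluate_10fold(X, y, clf=None, seed=1):
    clf = clf if clf is not None else GaussianNB()
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    tpr, fpr, prec, rec, f1, auc = [], [], [], [], [], []
    for train_idx, test_idx in skf.split(X, y):
        clf.fit(X[train_idx], y[train_idx])
        pred = clf.predict(X[test_idx])
        prob = clf.predict_proba(X[test_idx])[:, 1]
        tn, fp, fn, tp = confusion_matrix(y[test_idx], pred).ravel()
        tpr.append(tp / (tp + fn))   # true positive rate for the "TSS" class
        fpr.append(fp / (fp + tn))   # false positive rate
        prec.append(precision_score(y[test_idx], pred))
        rec.append(recall_score(y[test_idx], pred))
        f1.append(f1_score(y[test_idx], pred))
        auc.append(roc_auc_score(y[test_idx], prob))
    for name, vals in [("TP rate", tpr), ("FP rate", fpr), ("Precision", prec),
                       ("Recall", rec), ("F-measure", f1), ("ROC area", auc)]:
        print(f"{name:10s} {np.mean(vals):.3f}")
```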
2. Trial of differing flanking-region sizes: I tried feature values corresponding to different sizes of the region flanking the TSS, that is, [-20,+20], [-35,+35] and [-50,+50]. The performance increase (as quantified by ROC area, recall and precision) was observed to be very marginal: a 0.005-0.007 increase per 30 bp increment (a window-extraction sketch follows below).
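The only thing that changes between these trials is the width of the sequence window from which the feature values are computed. Below is a minimal sketch of cutting a [-N,+N] window around each TSS coordinate; the genome dictionary and TSS list are hypothetical placeholders, and the step that turns each window into the 21*100 feature values is not shown.

```python
# Minimal sketch: extract a [-flank, +flank] window around each TSS coordinate.
# Assumptions: `genome` maps chromosome name -> sequence string, and `tss_sites`
# is a list of (chromosome, 0-based position) pairs; both are placeholders.
def flanking_windows(genome, tss_sites, flank=35):
    """Yield the sequence from -flank to +flank around each TSS (inclusive)."""
    for chrom, pos in tss_sites:
        seq = genome[chrom]
        start, end = pos - flank, pos + flank + 1
        if start < 0 or end > len(seq):
            continue  # skip sites too close to a sequence edge
        yield seq[start:end]

# flank=20, 35 or 50 corresponds to the [-20,+20], [-35,+35], [-50,+50] trials
```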
3. Trial of Random Forest:
As Yasser had suggested, I tried the Random Forest approach with differing numbers of trees. I incremented the number of trees trial by trial until I reached a plateau in the performance values for predicting instances of class “TSS”.
a. TSS-positive data: sites on the genome with more than five 5’-transcript ends aligned to them (using EST2TSS)
b. TSS-negative data: sites with no ESTs aligned to them at all (using EST2TSS)
c. 4000 data instances (2000 per class), 21*100 features (corresponding to the [-35,+35] region flanking the TSS), 5-fold cross-validation (a sketch of the tree-count sweep follows the conclusions below)
d. Results:
Number of trees in Random Forest   Recall   Precision   ROC area
10                                 54%      65%         0.684
20                                 61%      67%         0.719
50                                 68%      70%         0.763
100                                70%      70%         0.777
200                                73%      72%         0.787
300                                73.5%    70.4%       0.791
400                                74%      71%         0.794
e. Conclusions:
(i) As the number of trees is increased, the overall performance also increases.
(ii) With the number of trees increasing, the gap in performance between predicting instances of class “non-TSS” and predicting instances of class “TSS” decreases; both classes are predicted with roughly equal performance at about 100 trees in the Random Forest.
(iii) When the number of trees is greater than 200, the “TSS” class instances are predicted with better performance than the “non-TSS” class instances.
(iv) The Random Forest trial was stopped at 400 trees (memory issues), but this method looks promising and shows better performance than the Bayesian Network model (which was previously the best).
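For reference, the sketch below shows the shape of the tree-count sweep described above: 5-fold cross-validation repeated with an increasing number of trees, recording recall, precision and ROC area. It uses scikit-learn's RandomForestClassifier only as a stand-in for the Random Forest implementation actually used, and assumes the feature matrix X (4000 instances) and label vector y are already loaded.

```python
# Minimal sketch: sweep the number of trees in a Random Forest and record
# 5-fold cross-validated recall, precision and ROC area for the "TSS" class.
# Assumptions: X is a NumPy feature matrix, y a 0/1 label vector (1 = "TSS");
# RandomForestClassifier stands in for the implementation actually used.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

def tree_count_sweep(X, y, tree_counts=(10, 20, 50, 100, 200, 300, 400)):
    for n_trees in tree_counts:
        rf = RandomForestClassifier(n_estimators=n_trees, random_state=1, n_jobs=-1)
        scores = cross_validate(rf, X, y, cv=5,
                                scoring=("recall", "precision", "roc_auc"))
        print(f"{n_trees:4d} trees: "
              f"recall={scores['test_recall'].mean():.3f}  "
              f"precision={scores['test_precision'].mean():.3f}  "
              f"ROC area={scores['test_roc_auc'].mean():.3f}")
```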