Krishnakumar Sridharan
Machine learning progress – 12/13/2011 to 12/29/2011

Questions:
In this update I describe the trials I performed to answer the following questions:
A. How does ML performance vary when different sequence-similarity thresholds are used for redundancy reduction of the training data sets?
B. How can we show that the performance observed so far is significant? Could classifying two very similar datasets demonstrate that our ROC values reflect more than random guessing?
C. Is the methodology species-specific? That is, will the ML classifier need to be trained species by species or group by group, or is the classifier more generic?

Description of trials and their results:

A. Trial of different sequence-similarity thresholds:
1. Previous implementations of redundancy reduction were based on 100% pairwise sequence similarity. That is, given two sequences that were 100% similar, one was kept for the training data set and the other was discarded.
2. Following YM's suggestion, I tried different similarity thresholds to observe any changes in performance.
3. Training data: GeneSeqer and EST2TSS run on full Arabidopsis Chr 1 (~12,000 positive and ~5,300 negative). The [-50, +50] TSS-flanking region was considered. Redundancy reduction was performed with the CD-HIT-EST software (Li et al., Bioinformatics 2006).
4. Testing data: GeneSeqer and EST2TSS run on full Arabidopsis Chr 2 (~7,400 positive and ~3,222 negative). The [-50, +50] TSS-flanking region was considered.
5. A Random Forest with 100 trees is then trained and evaluated (a sketch of this workflow follows the conclusions below).

Training data:

Similarity threshold   Sequences in TSS-Positive   Sequences in TSS-Negative
80%                    6,672                       5,179
85%                    7,289                       5,201
90%                    8,213                       5,220
95%                    9,916                       5,251
100%                   11,811                      5,305

Table 1: Number of data instances per trial
*80% is the lowest sequence-similarity value allowed by the CD-HIT-EST software.

Results:

Similarity threshold   TP Rate   FP Rate   Precision   Recall   F-Measure   AUC
80%                    0.90      0.50      0.80        0.89     0.84        0.79
85%                    0.92      0.57      0.79        0.92     0.85        0.79
90%                    0.94      0.63      0.77        0.94     0.85        0.80
95%                    0.97      0.72      0.76        0.97     0.85        0.80
100%                   0.98      0.79      0.74        0.98     0.85        0.79

Table 2: Performance with different sequence-similarity thresholds

Conclusions:
a. Although performance at the different sequence-similarity thresholds is very similar, as the similarity threshold for elimination decreases, the precision of predicting instances of the "TSS" class increases while recall decreases.
b. Note that in both the training and test data sets the numbers of "TSS" and "non-TSS" instances are unequal. (When these similarity thresholds were tried on a balanced testing data set, similar results were observed.) If lower similarity thresholds were possible, I would expect recall to decrease and precision to increase further for the "TSS" class.
c. The recall-to-precision trade-off reaches a satisfactory value of 94% recall and 77% precision, with good F-measure and AUC, at a sequence-similarity threshold of 90%.
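For reference, below is a minimal sketch of this trial in Python. The actual runs used WEKA; here scikit-learn's RandomForestClassifier stands in, FASTA parsing is omitted, and the one_hot encoding is only a placeholder for the real feature extraction. The cd-hit-est call uses its standard -i/-o/-c/-n options; all file names are hypothetical.

```python
import subprocess
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support, roc_auc_score

def reduce_redundancy(in_fasta, out_fasta, threshold):
    """Keep one representative per CD-HIT-EST cluster at the given
    pairwise-identity threshold (0.80-1.00).  The word size (-n) must
    be lowered for thresholds below ~0.90, per the CD-HIT manual."""
    word_size = "8" if threshold >= 0.90 else "5"
    subprocess.run(["cd-hit-est", "-i", in_fasta, "-o", out_fasta,
                    "-c", str(threshold), "-n", word_size], check=True)

def one_hot(seq):
    """Placeholder encoding of a [-50, +50] TSS-flanking sequence."""
    return [1.0 if base == b else 0.0 for base in seq for b in "ACGT"]

def evaluate(train_pos, train_neg, test_pos, test_neg):
    """Train a 100-tree Random Forest and report Table 2 style
    metrics for the TSS (positive) class."""
    X_train = np.array([one_hot(s) for s in train_pos + train_neg])
    y_train = np.array([1] * len(train_pos) + [0] * len(train_neg))
    X_test = np.array([one_hot(s) for s in test_pos + test_neg])
    y_test = np.array([1] * len(test_pos) + [0] * len(test_neg))
    rf = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, y_pred, average="binary")
    auc = roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])
    return {"Precision": precision, "Recall": recall,
            "F-Measure": f1, "AUC": auc}
```

Each row of Tables 1 and 2 would then correspond to one reduce_redundancy call on the positive and negative training FASTA files, followed by one evaluate call.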
B. Trial of the ML methodology on very similar datasets:
1. To better substantiate the performance values we observe in our classifications, it is essential to show that these values arise from observable trends rather than by chance.
2. One way to do this is to perform a classification in which the two classes of data have been derived by identical methods from identical sources. If performance on either class derived in this way is markedly better than chance, then the methodology is flawed or biased.
3. Details of the trial:
- GeneSeqer and EST2TSS are run on Arabidopsis chromosome 1 to get a TSS-positive dataset (>5 5'-EST ends with good GeneSeqer quality scores) with ~12,000 sequences.
- This large data set is randomized and then partitioned equally into two subsets.
- Redundancy reduction at a 90% sequence-similarity threshold is applied, leaving ~4,600 non-redundant sequences in each subset, to which our workflow is applied.
- A Random Forest with 100 trees is used with 5-fold cross-validation (a sketch of this control follows the conclusions below).

Results:

Class label    TP Rate   FP Rate   Precision   Recall   F-Measure   AUC
TSS-Subset1    0.51      0.55      0.485       0.51     0.50        0.48
TSS-Subset2    0.45      0.49      0.483       0.45     0.47        0.48

Table 3: Performance on very similar datasets

Conclusions:
a. Given two datasets derived identically from the same source, the performance of our workflow is no better than random guessing of class labels.
b. This indicates that the performance values we observe elsewhere are meaningful, and that our workflow is not biased in any observable way.
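This control can be expressed compactly. The sketch below assumes the redundancy-reduced positive sequences are already loaded as strings, reuses the placeholder one_hot encoding from the previous sketch, and again substitutes scikit-learn for WEKA; an unbiased workflow should yield a cross-validated AUC near 0.5.

```python
import random
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def one_hot(seq):
    """Same placeholder encoding as in the previous sketch."""
    return [1.0 if base == b else 0.0 for base in seq for b in "ACGT"]

def random_split_control(tss_positive_seqs, seed=0):
    """Shuffle one pool of TSS-positive sequences, split it into two
    arbitrary pseudo-classes, and ask a 100-tree Random Forest to
    separate them under 5-fold cross-validation.  Since both halves
    come from the same distribution, AUC should sit near 0.5."""
    seqs = list(tss_positive_seqs)
    random.Random(seed).shuffle(seqs)
    half = len(seqs) // 2
    X = np.array([one_hot(s) for s in seqs])
    y = np.array([1] * half + [0] * (len(seqs) - half))  # arbitrary labels
    rf = RandomForestClassifier(n_estimators=100)
    return cross_val_score(rf, X, y, cv=5, scoring="roc_auc").mean()
```

The ~0.48 AUC in Table 3 is consistent with this expectation.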
C. Inter-species TSS classifications:
1. In order to build a final ML classifier, we need to determine the best source of training data for our Random Forest algorithm. Until now, I had trained and tested within a single species to obtain my performance values.
2. This trial had two dimensions:
- A preliminary trial in which I train the classifier on a single chromosome of one species and test it on a single chromosome of another species
- A large-scale trial in which I train the ML classifier on one full species and test it on another

C.I. Single-chromosome trial:
Training data set: EST2TSS run on Arabidopsis chromosome 1 – 12,075 TSS-positive and 5,300 TSS-negative instances.
Testing data sets:
a) EST2TSS run on Arabidopsis chromosome 2 – 2,000 instances each of TSS-positive and TSS-negative
b) EST2TSS run on Rice chromosome 1 – 2,000 instances each of TSS-positive and TSS-negative
c) 5-fold cross-validation
The [-50, +50] TSS-flanking region is used for feature extraction, and a Random Forest with 100 trees is run.

Results:

Testing method            TP Rate   FP Rate   Precision   Recall   F-Measure   AUC
Arabidopsis Chr 2         0.98      0.79      0.56        0.98     0.71        0.80
Rice Chr 1                0.91      0.90      0.50        0.91     0.65        0.49
5-fold cross-validation   0.98      0.81      0.73        0.98     0.84        0.78

Table 4: Performance with training on a single chromosome

Preliminary conclusions:
a. In this small-scale trial, where the classifier is tested on a single chromosome of another species, a species-specific design seems to be in order.
b. This trial and its results could be confounded by Rice being a monocot while Arabidopsis is a dicot.
c. Also, the trial was performed on only one chromosome, so it could be limited in scope and robustness, and may oversimplify a real-world ML classifier.

C.II. Whole-species trial:
The reasons for performing this larger and more time-consuming trial were as follows:
1. The previous single-chromosome trial could be biased due to the small size of the data sets.
2. A more comprehensive test should incorporate the groups that the species belong to.

Data used for the trial:
a) EST2TSS was run on GeneSeqer output with the following parameters:
   1) TSS-positive data were generated when EST2TSS found >80% coverage and similarity in the GeneSeqer output, along with at least 3 5'-EST ends supporting a TSS.
   2) TSS-negative data were extracted from genomic stretches of at least 1,500 bp with no 5'-EST ends aligning anywhere within the region.
b) The [-50, +50] region flanking the TSS was used to extract features.
c) The CD-HIT-EST software was used with a 90% sequence-similarity threshold for redundancy reduction.
d) A Random Forest of 100 trees was used to build the classifier.

Group      Training dataset (TSS-Positive / TSS-Negative)   Testing dataset (TSS-Positive / TSS-Negative)
Monocots   Sorghum bicolor (12,602 / 18,311)                Brachypodium distachyon (5,489 / 12,776)
Dicots**   Vitis vinifera (33,656 / 40,200)                 Solanum lycopersicum (19,933 / 23,377)

Table 5: Data used in the Monocot and Dicot** species trials, chosen to account for any ambiguities due to differing groups of plants.
**The whole-genome dataset trial for the Dicot species has been running for a long time (>55 hrs) in WEKA at the time of this update; its results will be added as soon as it completes. Meanwhile, I present results from the RF-100 classifier trained and tested on the following data: a) training data: ~10,000 TSS-positive and TSS-negative instances from the Vitis vinifera genome; b) testing data: ~6,000 TSS-positive and TSS-negative instances from the Solanum lycopersicum genome.

Results:

Group      TP Rate   FP Rate   Precision   Recall   F-Measure   AUC
Monocots   0.70      0.69      0.65        0.70     0.58        0.71
Dicots*    0.55      0.54      0.58        0.55     0.40        0.58

Table 6: Weighted-average performance of trials in different groups of flowering plants
*The Dicot results shown here are derived from a limited, randomly selected subset of training and testing data points from the two species. The larger trial is pending completion, but similar results are expected.

Conclusions:
a. Both the single-chromosome and the whole-genome inter-species trials performed so far indicate that the ML classifier performs better with species-specific training.
b. Based on these results, the ML classifier will need to be trained species by species to achieve good performance.

Summary of findings so far:
A. I plan to incorporate the CD-HIT-EST software into my workflow and to use an 85-90% pairwise sequence-similarity threshold for redundancy reduction in training datasets.
B. The performance reported in all trials is meaningful, as the classification of two identically derived datasets shows poor performance on par with random guessing of class labels.
C. The single-chromosome and whole-species trials support a species-specific training protocol for the ML classifiers.

Future work for the coming week:
a. Implement the classifier on a species/chromosome and compare its predictions with annotated TSS data (a possible starting point is sketched below).
b. Compare strong promoters (>10-15 5'-EST end support) with other negative data.
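As a possible starting point for future-work item a, here is a small hypothetical sketch of matching predicted TSS positions against annotated ones. The ±50 bp tolerance (mirroring the [-50, +50] flanking window used above), the helper name, and the example coordinates are illustrative choices, not part of the current workflow.

```python
import bisect

def match_predictions(predicted, annotated, tolerance=50):
    """Count predicted TSS positions falling within `tolerance` bp of
    any annotated TSS.  Positions are assumed to be pre-grouped by
    chromosome and strand.  Returns (matched, n_predicted, n_annotated)."""
    annotated = sorted(annotated)
    matched = 0
    for pos in predicted:
        i = bisect.bisect_left(annotated, pos)       # nearest annotated TSSs
        neighbors = annotated[max(0, i - 1):i + 1]
        if any(abs(pos - a) <= tolerance for a in neighbors):
            matched += 1
    return matched, len(predicted), len(annotated)

# Example: two of three predictions fall within 50 bp of an annotated TSS
print(match_predictions([105, 480, 9000], [100, 500, 2000]))  # -> (2, 3, 3)
```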