Machine Learning progress:

1. ML trial with other TSS-positive data: The objective of this trial was to check whether the performance of the machine-learning algorithms depends on the datasets selected; that is, whether using another TSS-positive dataset would radically change the observed performance in predicting the “TSS” class, which would indicate a bias in the machine-learning methodology.
   a. The TSS-positive dataset was extracted from TAIR (instead of from EST2TSS).
   b. The TSS-negative datasets were extracted from EST2TSS (regions with no ESTs aligning to the genome) and from random sequences with conserved 3-mer order.
   c. 1000 instances per class, 21*100 = 2100 features, 10-fold cross-validation.
   d. Results:

      Bayesian Network
      Negative dataset used   True Positive Rate   False Positive Rate   Precision   Recall   F-measure   ROC area
      EST2TSS                 0.696                0.349                 0.666       0.696    0.681       0.75
      Random                  0.67                 0.334                 0.667       0.67     0.669       0.727

      J48 Tree
      Negative dataset used   True Positive Rate   False Positive Rate   Precision   Recall   F-measure   ROC area
      EST2TSS                 0.628                0.355                 0.639       0.628    0.633       0.637
      Random                  0.613                0.406                 0.602       0.613    0.607       0.608

   e. Conclusions: The performance in these trials is very close to that of the earlier trials in which EST2TSS-positive data were compared against EST2TSS-negative data (the difference is less than 0.04 for each performance measure). The machine-learning methods therefore do not seem to be biased towards the classification of any particular dataset pair (TAIR vs. EST2TSS-negative, or EST2TSS-positive vs. EST2TSS-negative). A sketch of the cross-validated evaluation setup appears after this list.

2. Trial of differing flanking region sizes: I tried feature values corresponding to different sizes of flanking region, namely [-20,+20], [-35,+35] and [-50,+50] around the TSS (see the windowing sketch after this list). The performance increase (as quantified by ROC area, Recall and Precision) was very marginal: a 0.005-0.007 increase per 30 bp increment.

3. Trial of Random Forest: As Yasser had suggested, I tried the Random Forest approach with differing numbers of trees, incrementing the number of trees trial by trial until the performance values for predicting instances of the “TSS” class reached a plateau (see the tree-sweep sketch after this list).
   a. TSS-positive data: sites on the genome with more than 5 5'-transcript ends aligned to them (using EST2TSS).
   b. TSS-negative data: sites with no ESTs aligned to them at all (using EST2TSS).
   c. 4000 data instances (2000 per class), 21*100 features (corresponding to the [-35,+35] region flanking the TSS), 5-fold cross-validation.
   d. Results:

      Number of trees in Random Forest   Recall   Precision   ROC area
      10                                 54%      65%         0.684
      20                                 61%      67%         0.719
      50                                 68%      70%         0.763
      100                                70%      70%         0.777
      200                                73%      72%         0.787
      300                                73.5%    70.4%       0.791
      400                                74%      71%         0.794

   e. Conclusions:
      (i) As the number of trees is increased, the overall performance also increases.
      (ii) As the number of trees increases, the gap in performance between predicting instances of the “non-TSS” class and instances of the “TSS” class decreases; both classes are predicted with roughly equal performance at about 100 trees.
      (iii) When the number of trees is greater than 200, the “TSS” class instances are predicted with better performance than the “non-TSS” class instances.
      (iv) The Random Forest trial was stopped at 400 trees because of memory issues, but this method looks promising and shows better performance than the Bayesian Network model (which was previously the best).
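Sketch of the cross-validated evaluation (item 1). This is a minimal illustration only, assuming scikit-learn; the trials above appear to use a Bayesian Network and a J48 tree (Weka-style classifiers), for which GaussianNB and DecisionTreeClassifier are rough stand-ins, and X_tair / X_est2tss_neg are hypothetical placeholder feature matrices, not the real data.

# Sketch: 10-fold cross-validated evaluation of two classifiers (item 1).
# GaussianNB and DecisionTreeClassifier stand in for the Bayesian Network and
# J48; the feature matrices below are random placeholders (1000 x 2100 each).
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_validate

rng = np.random.default_rng(0)
X_tair = rng.random((1000, 2100))         # 1000 TSS-positive instances (TAIR)
X_est2tss_neg = rng.random((1000, 2100))  # 1000 TSS-negative instances (EST2TSS)

X = np.vstack([X_tair, X_est2tss_neg])
y = np.array([1] * 1000 + [0] * 1000)     # 1 = "TSS", 0 = "non-TSS"

scoring = {"recall": "recall", "precision": "precision", "roc_auc": "roc_auc"}
for name, clf in [("Bayes-style", GaussianNB()),
                  ("J48-style tree", DecisionTreeClassifier())]:
    scores = cross_validate(clf, X, y, cv=10, scoring=scoring)
    print(name,
          "recall=%.3f" % scores["test_recall"].mean(),
          "precision=%.3f" % scores["test_precision"].mean(),
          "ROC=%.3f" % scores["test_roc_auc"].mean())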
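Windowing sketch (item 2). The actual feature set (21 values per position) is not specified in this report, so the encoding below is purely hypothetical: a simple one-hot base encoding is used just to show how the [-20,+20], [-35,+35] and [-50,+50] windows translate into feature vectors of different sizes.

# Hypothetical sketch: per-position features for a window flanking the TSS.
# One-hot base encoding is a stand-in for the real (unspecified) 21 features.
import numpy as np

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def window_features(sequence, tss_index, flank):
    """Encode the [-flank, +flank] region around tss_index as one-hot features."""
    window = sequence[tss_index - flank:tss_index + flank + 1]
    features = np.zeros((len(window), 4))
    for i, base in enumerate(window):
        if base in BASES:                  # unknown bases (e.g. N) stay all-zero
            features[i, BASES[base]] = 1.0
    return features.ravel()                # flatten to one feature vector

seq = "ACGT" * 100                         # placeholder genomic sequence
for flank in (20, 35, 50):
    vec = window_features(seq, tss_index=200, flank=flank)
    print(f"[-{flank},+{flank}] window -> {vec.size} features")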
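Tree-sweep sketch (item 3). A minimal sketch of sweeping the number of trees until the “TSS”-class performance plateaus, assuming scikit-learn's RandomForestClassifier; the numbers in the table above come from a different implementation, and X, y and the plateau threshold here are hypothetical placeholders.

# Sketch: increase the Random Forest tree count until recall for the "TSS"
# class stops improving (item 3). X and y are random 4000 x 2100 placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

rng = np.random.default_rng(0)
X = rng.random((4000, 2100))
y = np.array([1] * 2000 + [0] * 2000)      # 1 = "TSS", 0 = "non-TSS"

scoring = {"recall": "recall", "precision": "precision", "roc_auc": "roc_auc"}
previous_recall = 0.0
for n_trees in (10, 20, 50, 100, 200, 300, 400):
    clf = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    scores = cross_validate(clf, X, y, cv=5, scoring=scoring)
    recall = scores["test_recall"].mean()
    print(f"{n_trees:4d} trees: recall={recall:.3f} "
          f"precision={scores['test_precision'].mean():.3f} "
          f"ROC={scores['test_roc_auc'].mean():.3f}")
    if recall - previous_recall < 0.005:   # plateau criterion (arbitrary threshold)
        break
    previous_recall = recall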