Summary for week ending August 12th
Krishnakumar Sridharan

Project A – TSS Prediction using Machine-learning

1. Summary of project status before Summer:
   A. EST2TSS created by KS. It uses GeneSeqer to match EST sequences to a genomic segment and outputs, as possible Transcription Start Sites (TSSs), the genomic positions that have more than a certain number of 5’-EST matches and satisfy certain conditions.
   B. XK, KS and HL jointly created a genomic features calculation package, DNAFeatures (presented to group on 03/04/2011), for use in Transcription Start Site and Transposon Insertion Site predictions.
   C. Features calculated currently include: k-mer frequencies, nucleotide compositions, structural/physical properties, TRANSFAC motif detection and nucleosome prediction.
   D. In the presentation, KS demonstrated usage of the package with:
      1) TSS-Positive dataset: [-200,200] genomic segment with a TSS from EST2TSS output at position 0
      2) TSS-Negative dataset: [ [-400,-200] + [200,400] ] genomic segment, after checking for, and removing, any TSSs from EST2TSS output in that interval
   E. Results
      1) Negative dataset showed certain trends when feature scores were visualized; these looked promising.
      2) VB advised proceeding by trying other TSS-Negative datasets (introns, random sequences etc.) and then finding the significant “features”, that is, the ones that make a statistical difference.
      3) Preliminary literature review by KS and XK suggested using the Kolmogorov-Smirnov test, correlations and t-tests/chi-square tests for this purpose.

2. Literature review:
   A. Supervised learning methods have been used for both binary and multi-class classification problems. Support Vector Machines (SVMs) are non-probabilistic binary linear classifiers. Guyon et al 2002 used an SVM learning method to study causative genes for cancer by analyzing microarray data.
   B. Guyon et al used a method called SVM – Recursive Feature Elimination (SVM-RFE), in which the “feature” space is analyzed by a linear classifier that assigns a weight to each feature; the lowest-weighted (“bad”) features are eliminated recursively, so that the best features remain.
   C. KS modified, and is implementing, a variation of this SVM-RFE algorithm using the R package e1071. The data that KS used is a sample output from previous runs of the DNAFeatures project. Currently, this code runs for a long time (> 12-15 hrs) and then fails with an out-of-memory error. I am working to eliminate these problems.
   D. Other ways to rank features in feature selection algorithms include: Fisher’s criterion, Pearson correlation coefficients, R-square from linear regression, and a variable’s direct impact on the score.
   E. Some notes on feature selection:
      1) There are two types of feature selection algorithms: feature ranking, which tries to find a “satisfactory” set of features and is reasonably fast, and subset selection, which tries to find an “optimal” set of features for the classification and is very time-consuming.
      2) For classification problems in statistics, the models usually implement a stepwise regression to remove “bad” features, and this regression is then stopped using cross-validation.
      3) There are 3 types of subset selection algorithms:
         Wrappers – generic model; evaluate each subset by model fit and are computationally expensive.
         Embedded – find features that are good in a specific model.
         Filters – similar to wrappers, but score each subset with a simpler filter metric instead of model fit. Filter metrics can be correlations or mutual information.
      4) The ultimate aim is to find features that are “highly correlated with the classification but highly un-correlated with other features”.
      5) The optimality/evaluation criterion of machine-learning based feature selection methods can be AIC, BIC or Information Loss (as implemented in BioBayes).

3. Future goals for the coming week:
   A. Trying different negative datasets, using visualization by DNAFeatures as a way to pick the best negative dataset (Timeline = Early next week). There are 4 options that I will be comparing:
      1) Intergenic regions without flanking exons, from HL’s Plant Genome Table. I will trim the unknown N residues from the ends of sequences, choose only those sequences longer than 400-500 bp, put the file in DNAFeatures-readable format and visualize the scores to look for a trend.
      2) Intron sequences
      3) Sequences of random nucleotides
      4) [ [-400,-200] + [200,400] ] genomic segments
   B. Trying to choose and implement SVM-RFE and other ways to rank feature subsets, so as to find a meaningful set (Timeline = End of next week).
   C. I will attempt to get the negative datasets from one of the above approaches, run a feature selection/ranking algorithm on them, have some results by next week and summarize my findings in the next update.

Project B – Promoter architecture across species

I. Current status:
   A. Poster presented at the ISCB Student Council meeting in Vienna, Austria by TR. Project workflow and abstract are in previous updates.
   B. Datasets – EST/CAGE/other forms of TSS data from 12 species have been collected from various online resources and readied to run scripts on. Some personal requests for data are pending (PPDB access is in hand).
   C. Over in-person/Skype meetings during the summer semester, KS and TR compiled the data into meaningful formats to run scripts on, and also formalized the idea and workflow of the overall project (see previous updates).
   D. Scripts – TR’s R scripts and KS’s Perl scripts will be integrated into a command-line based package, TSSPred (Timeline = mid-September). Currently using an online MobileMe repository; will shift to svn soon.

II. Future work in the coming week:
   A. KS will start compiling the package together from all scripts and write linker scripts in Perl/R to handle data transfer between modules in the workflow.
   B. We will also be formalizing a large-table format for all input data to the TSSPred package (Timeline = End of next week).
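
Note on Project A, feature ranking: Fisher’s criterion, listed in the literature review as one way to rank features, can be sketched in a few lines. The sketch below is only an illustration, not the SVM-RFE code being developed in R. It assumes the common two-class form of the criterion, (difference of class means)^2 / (sum of class variances), and that feature scores are available as one row per sequence and one column per feature. The feature names and numbers are made up for the example and are not DNAFeatures output.

```python
from statistics import mean, pvariance


def fisher_score(pos_values, neg_values):
    """Fisher's criterion for one feature: squared mean difference over summed class variances."""
    spread = pvariance(pos_values) + pvariance(neg_values)
    if spread == 0:
        # Both classes are constant: the feature separates them perfectly or not at all.
        return float("inf") if mean(pos_values) != mean(neg_values) else 0.0
    return (mean(pos_values) - mean(neg_values)) ** 2 / spread


def rank_features(pos_rows, neg_rows, feature_names):
    """Return (name, score) pairs sorted from most to least discriminative."""
    scores = []
    for j, name in enumerate(feature_names):
        pos_col = [row[j] for row in pos_rows]
        neg_col = [row[j] for row in neg_rows]
        scores.append((name, fisher_score(pos_col, neg_col)))
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical feature names and toy values, for illustration only.
FEATURES = ["gc_content", "tata_motif_score", "cpg_ratio"]
TSS_POSITIVE = [[0.60, 0.9, 0.50], [0.62, 0.8, 0.40], [0.58, 0.7, 0.60]]
TSS_NEGATIVE = [[0.40, 0.8, 0.50], [0.42, 0.9, 0.45], [0.38, 0.7, 0.55]]

if __name__ == "__main__":
    for name, score in rank_features(TSS_POSITIVE, TSS_NEGATIVE, FEATURES):
        print(f"{name}\t{score:.3f}")
```

A ranking of this kind scores each feature independently, so it is fast but does not account for correlation between features; that is the gap that subset selection methods such as SVM-RFE are meant to address.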