Summary for August 25th – September 1st, 2011
Krishnakumar Sridharan

Project A – TSS Prediction using Machine Learning

1. Framework for getting more meaningful results from DNAFeatures:
A. While testing different negative datasets with DNAFeatures (see previous update), I was unable to get results for the nucleosome-prediction features for all datasets. I met with XK about this; he pointed out a few errors in my approach and suggested how to improve the nucleosome-prediction part.
B. To get meaningful results from the NuPoP code used for the nucleosome-based features, larger sequences (-5 kb, +5 kb) need to be supplied at the "boundaries", to take into account the "boundary effects" described by the authors of the NuPoP package. I was not taking these into consideration in my 400 bp sequences, so the nucleosome-prediction results I obtained for some datasets were not meaningful.
C. The new approach that I plan to use for the nucleosome-prediction features is as follows:
- Calculate all features except the nucleosome-prediction features from the DNAFeatures package, using many single standard-length (400 bp in the examples) FASTA sequences, and collect the feature values.
- For the nucleosome-based features, extend each standard-length sequence by about 5 kbp on both sides, with either "filler" (random and/or unknown nucleotides, "N") or "actual" (upstream and downstream) sequence, and then run NuPoP on them for the limited region [5001, 5400].
This approach takes care of the boundary effects while giving the nucleosome-based feature values for only the selected 400 bp (or any standard-length) sequence.

2. Discussions on machine learning: I also discussed with XK the machine-learning approach that he was following, and got some good directions from him. Some interesting pointers were:
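Returning to the flank-padding step in 1.C above: a minimal sketch of the idea (in Python rather than the Perl scripts actually used; the helper names are my own, and only the 5 kb filler flanks and the [5001, 5400] window come from the notes):

```python
# Sketch: pad a standard-length sequence with 5 kb "N" filler flanks so that
# NuPoP's boundary effects fall outside the region of interest.
# Illustrative helpers only; the real workflow uses Perl + the NuPoP package.

FLANK = 5000     # filler added on each side
STD_LEN = 400    # standard sequence length used in the examples

def pad_with_filler(seq, flank=FLANK, filler="N"):
    """Extend a standard-length sequence with unknown-nucleotide filler."""
    return filler * flank + seq + filler * flank

def nupop_window(flank=FLANK, std_len=STD_LEN):
    """1-based coordinates of the original sequence inside the padded one,
    i.e. the limited region that NuPoP scores should be read from."""
    return (flank + 1, flank + std_len)

seq = "ACGT" * 100                    # a toy 400 bp sequence
padded = pad_with_filler(seq)
start, end = nupop_window()           # (5001, 5400)
assert padded[start - 1:end] == seq   # the window recovers the original 400 bp
```

Only the scores inside that window would be kept as feature values; everything in the flanks exists purely to absorb the boundary effects.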
A. The actual difference comes not from the machine-learning algorithm used but, to a large extent, from the datasets and the "features" used for prediction.
B. BioBayesNet is a good place to start these analyses, since it has been used before in the group (YZ and XK) and is user-friendly. WEKA is the next go-to tool, since it can run many different types of analyses.
C. The Support Vector Machine (SVM) approach that I tried to implement is a popular, well-used machine-learning algorithm, but it is considered computationally expensive (it frequently runs out of memory; see previous updates). Since SVM is often recommended and used in the literature, I will definitely come back to it once I have completed a run with one of the other machine-learning methods.

3. Research discussion with Chris Eisley (Dr. Dorman's student):
I. After a conversation at my poster last Saturday, I met Chris Eisley (CE) from Dr. Karin Dorman's lab over lunch, and we discussed opportunities to collaborate in research. We explained our research methodologies, approaches and progress to each other so as to identify areas to work on together.
II. CE works with the IMM-based models and code by Mike Sparks from our group; he is currently extending some of the models in this work to classify coding vs. non-coding sequences.
III. CE works on estimating the effects of various hidden states in the Markov model that correspond to genomic features such as G/C content. He mentioned developing a probability-based model that performs a binary classification, predicting whether a given sequence is coding or non-coding.
IV. I explained how I use genomic features as predictors to determine whether or not a sequence has a Transcription Start Site and, if it does, where the TSS is located.
V. These discussions helped me think of a statistically sound methodology for the following parts of my project:
A. Formatting and verifying the integrity of the positive and negative training data, and of the testing data too. One idea I came up with for testing data for the eventual machine-learning algorithm is to take a genomic DNA/chromosome sequence and fragment it; together, these fragments can form an unbiased test set.
B. Cross-validation of the eventual TSS-predicting algorithm – I could use either a random or a leave-one-out cross-validation approach, separating 4/5 of the genomic DNA fragments into training data and 1/5 into testing data.
C. CE suggested a machine-learning approach of his choice, Random Forest, and we discussed how it would compare to SVMs and other machine-learning methods.

In conclusion, based on our discussions, some possible opportunities to collaborate with CE are as follows:
1) Adding to the genomic features: We discussed implementing a model loosely based on the one CE is working on, which would tell whether a sequence has TSSs based on k-mers in the 200-bp upstream sequence (promoter elements). This model would output a probability score that is higher when the given sequence lies upstream of a TSS; it is a chain-based model that has been used for predicting coding sequences. Our concerns are that sequences with no TSS in them would have no defined "upstream sequence", and that the right k-mer size must be chosen. CE needed a little more time to look at the finer details of implementing such an algorithm for a promoter sequence, and I agreed to provide him with some positive and negative sequences once he is ready (he will notify me by email). If we can add this feature, it might increase the prediction efficiency of my approach.
2) Estimating more hidden states: CE mentioned that the prediction efficiency of the Markov chain he is working on increases if there is an estimate of the G/C content.
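Since point 2) hinges on a G/C-content estimate, here is a minimal sketch of the kind of windowed G/C feature that could feed such a hidden state (Python for illustration; the function names and window/step sizes are my own assumptions):

```python
# Sketch: windowed G/C-content estimates of the sort that could inform a
# hidden state in a Markov model. Names and defaults are illustrative only.

def gc_content(seq):
    """Fraction of G/C bases in a sequence (case-insensitive)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def windowed_gc(seq, window=100, step=100):
    """G/C content in successive windows along the sequence."""
    return [gc_content(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, step)]

print(gc_content("GGCCAATT"))   # 0.5
```

A per-window profile like this, computed by DNAFeatures, could be handed over directly as an additional estimated state.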
I suggested that the DNAFeatures we have might help him estimate more "hidden" states or genomic features, so that he can test whether that increases his prediction accuracy.
3) Machine learning: Since CE has applied the Random Forest approach in previous work, I could learn how to apply that method from him. Likewise, the machine-learning part of my work could benefit his side of things.

4. Future work in the coming weeks: Based on these discussions and the analyses I have done previously, my plans for the coming weeks are as follows:
A. Convert the data into the C4.5 machine-learning data format (used in BioBayesNet and WEKA) using Perl scripts of my own (timeline: mid-next week).
B. In parallel, implement the newer approach to calculating the nucleosome-based features, using Perl scripts to format the data and the NuPoP code within the DNAFeatures package to predict the actual feature values (timeline: end of next week).
C. Once the C4.5-format data are ready, run them through machine-learning tools (starting with BioBayesNet) and evaluate the results (timeline: within the next 1.5 weeks).

Project B – Transcription Initiation and Promoter Architecture across Species

1. Objective of this week's work: To predict the transcription initiation and promoter architecture data for one species end-to-end, on a proof-of-concept basis, to see the challenges associated with this task and to observe the hierarchy of script usage.
2. Tasks done: Ran protist (Plasmodium falciparum) data through GeneSeqer and EST2TSS, following through to the TSS-prediction step. Also surveyed the different datasets available for a given species and formulated an approach to capture information from the various formats of available data.
3. Plasmodium falciparum: the malaria parasite, sequenced very recently (data released May 2011); the sequence data are available at PlasmoDB (http://plasmodb.org/plasmo/).
4. Approach used:
A. P. falciparum data were extracted from PlasmoDB in the form of GFF, EST and genomic sequence files.
B. A Perl script was written to parse the GFF file into coordinates corresponding to Transcription Start Sites.
C. GeneSeqer was run with the EST sequences against each of the 14 chromosomes in the genomic sequence file (running time: 30–40 minutes per chromosome). The GeneSeqer output was given as input to EST2TSS, which then predicts possible Transcription Start Sites along with their orientation.
D. EST2TSS can report both individual TSSs and TSS matches clustered, within a user-specified window, into a prospective TSR (a window size of 40–100 bp for bundling/clustering TSSs is usually good; the best predictions for example chromosome 14 were seen with a 40 bp window).
E. Both the GFF and EST files are admittedly "preliminary" according to their authors, so the data might not point to an exact TSS; but using the strict match criteria in EST2TSS, we can compress the various nearby TSSs into plausible TSRs.
F. This exercise helps us shape an approach for two data types: GFF and EST/genomic sequence files.

5. Plans for each format of data:
A. GFF – use Perl or R scripts to process the data and extract possible TSS positions.
B. EST/genomic sequence – use GeneSeqer + EST2TSS, and tweak the EST2TSS input parameters to give a better-supported annotation than the one in the GFF file.
C. CAGE – R scripts by TR.
D. SAGE and RNA-Seq – interesting, but attempt only if the package-development process is done or close to done.
Contingency plan for incomplete/absent GFF files for some species: For species with incomplete GFF data, optimize the EST2TSS parameters on the existing data and extrapolate the remaining annotations from EST2TSS runs with the optimized parameters. For absent GFF files, use EST2TSS with a set of more lenient/non-restrictive parameters to get some genome annotations.
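A minimal sketch of the GFF parsing in step 4.B above (Python here for illustration; the actual script is in Perl, and which feature types and attributes to use for the PlasmoDB files is an assumption on my part):

```python
# Sketch: pull putative TSS coordinates out of a GFF file by taking the
# 5' end of each selected feature. Column layout follows the GFF spec;
# restricting to gene/mRNA features is an assumption for illustration.

def tss_from_gff_lines(lines, types=("gene", "mRNA")):
    """Yield (seqid, tss_position, strand) for each selected feature."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and GFF comments
        cols = line.rstrip("\n").split("\t")
        seqid, _source, ftype, start, end, _score, strand = cols[:7]
        if ftype in types:
            # On the + strand the 5' end is 'start'; on '-' it is 'end'.
            yield seqid, int(start) if strand == "+" else int(end), strand

gff = ["chr14\tplasmodb\tgene\t1000\t2500\t.\t+\t.\tID=PF14_0001",
       "chr14\tplasmodb\tgene\t4000\t5200\t.\t-\t.\tID=PF14_0002"]
print(list(tss_from_gff_lines(gff)))
# [('chr14', 1000, '+'), ('chr14', 5200, '-')]
```

The same coordinate-plus-strand tuples are what the EST2TSS output ultimately needs to be compared against.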
6. Research discussion at today's meeting:
A. At our weekly meeting, we updated each other on the progress made since we last met and discussed the directions to follow from here.
B. TR has been working on the CAGE datasets for humans in the FANTOM dataset. He uses the BioMart package within R to handle these datasets and obtains an output containing the gene name, gene start position, gene orientation and gene sequence. I showed him the outputs from EST2TSS.
C. We decided on the first of many data-format-based "checkpoints" in our workflow. I will consolidate the outputs that I get from EST2TSS and from the GFF files into the following formats:
(1) .ClusterFormat – this file has gene names as columns and, as rows, all start positions and the strands on which they occur. This is the input format for the TSS-clustering part of our workflow.
(2) .mod.gsq – a modified GeneSeqer output format containing the columns gene name, gene start, gene orientation, gene description and gene sequence. Its purpose is to connect the EST2TSS outputs to the existing annotations, with EST2TSS used as a quality-control tool to select only "strongly supported" TSSs for the next step.
D. These formats will form a common basis for comparing the different data types (EST, GFF, CAGE) from different species, and will also act as a checkpoint before we proceed to our next step, clustering TSSs into TSRs.
E. TR will probably travel to Ames in the week of September 24th; by then, we aim to have a substantial amount of our single-kingdom analyses, some preliminary results and the set of scripts that we plan to put together in one package.
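The .mod.gsq checkpoint format in 6.C could be emitted by a small helper along these lines (Python for illustration; the column set comes from 6.C, but the tab-separated layout and field names are my assumptions):

```python
# Sketch: serialize records in the ".mod.gsq" checkpoint layout agreed in 6.C
# (gene name, gene start, gene orientation, gene description, gene sequence).
# Tab separation is an assumption; only the column set comes from the notes.

FIELDS = ("name", "start", "strand", "description", "sequence")

def mod_gsq_line(record):
    """Serialize one gene record (a dict) as a tab-separated .mod.gsq row."""
    return "\t".join(str(record[f]) for f in FIELDS)

rec = {"name": "PF14_0001", "start": 1000, "strand": "+",
       "description": "hypothetical protein", "sequence": "ACGT"}
print(mod_gsq_line(rec))
```

Fixing the field order in one place like this is the point of the checkpoint: EST, GFF and CAGE pipelines can each produce these rows independently and still be comparable downstream.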
7. Future work in the coming weeks:
(i) Use Perl scripts to format the data into the two aforementioned formats (timeline: next Wed./Thurs.).
(ii) Once the format is set, run these scripts on the next protist species, Toxoplasma gondii, and get species- and chromosome-specific results (timeline: next Wed./Thurs.).
(iii) Provide these files as input to the x-means clustering algorithm and evaluate the results (timeline: end of next week).
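As a baseline for step (iii), the fixed-window bundling already used in EST2TSS (section 4.D) can be sketched before moving to x-means; a minimal version (Python for illustration, assuming per-chromosome TSS positions; this is the simple greedy bundling, not the x-means algorithm itself):

```python
# Sketch: greedily bundle nearby TSS positions into prospective TSRs using a
# fixed window (40-100 bp worked well per section 4.D). A simple baseline to
# compare x-means clusters against; not the x-means algorithm itself.

def cluster_tss(positions, window=40):
    """Group sorted TSS positions; a new cluster starts when the gap from
    the current cluster's first member exceeds the window size."""
    clusters = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][0] <= window:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return clusters

print(cluster_tss([100, 110, 135, 300, 320], window=40))
# [[100, 110, 135], [300, 320]]
```

Each inner list is a candidate TSR; the spread of an x-means cluster could then be checked against these fixed-window bundles as a sanity check.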