Summary for August 14-25, 2011
Krishnakumar Sridharan

Project A – TSS Prediction using Machine Learning

1. Objective:
To quantitatively analyze and compare different types of non-TSS “negative” datasets, and to choose the best one for further use in feature selection and machine learning.

2. Materials and Methods:
A. Data for Arabidopsis alone were downloaded from TAIR and the Plant Genome Table (HL). Perl scripts were used to format the data and cut the sequences to the right length.
B. The DNAFeatures package (XK, KS and HL, unpublished) was used to score sequence sets for “features” such as k-mer frequencies, nucleotide composition, structural/physical properties, TRANSFAC motif detection and nucleosome prediction.

3. Datasets for Analyses:
I downloaded, cleaned, formatted and generated 4 different types of “negative” dataset and evaluated them against one TSS-containing “positive” dataset. All sequences were 400 bp in length. The datasets were derived as follows:
A. POSITIVE DATASET – The genomic sequence from [-200,+200], where position 0 is the TSS. This data can be generated from either EST2TSS output or the TAIR10 GFF file; I used the TAIR10 GFF file-based method for this particular analysis.
B. NEGATIVE DATASET 1 – 100us_and_300ds – The genomic region [-300,-200] + [+200,+500] (400 bp in total) was calculated and checked for the presence of any TSSs (from the TAIR10 GFF file) before the sequence was extracted with the BLAST+ package. The data used for this particular analysis is from Chromosome 1 alone.
C. NEGATIVE DATASET 2 – INTRONIC – Intronic sequences were downloaded from TAIR, and perl scripts were run on them to format the data and cut 400 bp sequences. The perl script (data_cleaner_intronic.pl) ensured that
1) intronic sequences shorter than 150 bp are discarded,
2) sequences 151-399 bp long are extended by adding ‘N’ characters at the end, and
3) sequences > 400 bp long are cut to 400 bp, counted from the left end of the sequence.
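The three length rules applied by data_cleaner_intronic.pl can be sketched as follows. This is a minimal Python illustration, not the original perl script; the function name `clean_intron` and the constants are hypothetical. The report does not say what happens to sequences of exactly 150 bp or exactly 400 bp, so the sketch assumes that 150 bp sequences are padded and 400 bp sequences are kept unchanged.

```python
from typing import Optional

TARGET_LEN = 400  # all datasets in this report use 400 bp sequences
MIN_LEN = 150     # introns shorter than this are discarded


def clean_intron(seq: str) -> Optional[str]:
    """Apply the three length rules; return None for a discarded sequence."""
    if len(seq) < MIN_LEN:
        # Rule 1: too short to be useful, drop it.
        return None
    if len(seq) < TARGET_LEN:
        # Rule 2: pad the right-hand end with 'N' up to 400 bp.
        # (Assumption: a sequence of exactly 150 bp is padded, not dropped.)
        return seq + "N" * (TARGET_LEN - len(seq))
    # Rule 3: keep the first 400 bp, counted from the left end.
    # (Assumption: a sequence of exactly 400 bp passes through unchanged.)
    return seq[:TARGET_LEN]
```

Returning None for discarded sequences lets the caller filter them out while counting how many introns were dropped.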
A total of ~49,860 sequences were thus obtained, and 100 of these were used for our analysis; these sequences are from different chromosomes.
D. NEGATIVE DATASET 3 – INTERGENIC – Intergenic sequences (without flanking exons) were obtained from the Plant Genome Table (HL) and formatted by perl scripts using the same criteria as for the intronic sequences. A total of ~23,100 sequences were obtained, out of which 1000 sequences of 400 bp were used for this analysis.
E. NEGATIVE DATASET 4 – RANDOM NUCLEOTIDE SEQUENCES – Random sequences were generated using a perl script, and 1000 sequences of 400 bp were used for this analysis.

4. Results:
The results are displayed as MS Excel graphs for a limited set of features, for each of the different datasets, as follows:
A. Nucleotide Composition Features’ Scores
B. Mono-nucleotide Structural Features’ Scores
C. Di-nucleotide Structural Features’ Scores
D. Tri-nucleotide Structural Features’ Scores

A. Nucleotide Composition Features’ Scores
[Figure: five graphs of feature score versus nucleotide position, one each for POSITIVE DATA, NEGATIVE_DATA_1, NEGATIVE_DATA_2, NEGATIVE_DATA_3 and NEGATIVE_DATA_4. Series: %G+C, %A+T, GC_Skew, AT_Skew, Keto_Skew, Purine_Skew, AT/GC_Ratio.]

B. Mono-nucleotide Structural Features’ Scores
[Figure: five graphs of feature score versus nucleotide position, one each for POSITIVE DATA, NEGATIVE_DATA_1, NEGATIVE_DATA_2, NEGATIVE_DATA_3 and NEGATIVE_DATA_4. Series: Hbondw1, Hbondw2, Hbondw3, Hbondw3', Hbondw2', Hbondw1'.]

C. Di-nucleotide Structural Features’ Scores
[Figure: five graphs of feature score versus nucleotide position, one each for POSITIVE DATA, NEGATIVE_DATA_1, NEGATIVE_DATA_2, NEGATIVE_DATA_3 and NEGATIVE_DATA_4. Series: di-Aphylicity, di-BDNAtwistOhler, di-BDNAtwistOlson, di-DNAbendstiff, di-DNAdenature, di-duplexstab.disruptenergy, di-duplexstab.freeenergy, di-pinduceddeform, di-propellertwist, di-proteinDNAtwist, di-stackingenergy, di-ZDNAstabenergy.]

D. Tri-nucleotide Structural Features’ Scores
[Figure: five graphs of feature score versus nucleotide position, one each for POSITIVE DATA, NEGATIVE_DATA_1, NEGATIVE_DATA_2, NEGATIVE_DATA_3 and NEGATIVE_DATA_4. Series: tri-gccontent, tri-bendability, tri-nuclpref.]

5. Conclusions:
This analysis and the subsequent visualizations served two purposes: looking for the best negative dataset and looking for the most distinguishing features. The conclusions from this work were as follows:
A. Best Negative Dataset:
1) As evaluated by the package, the “negative” datasets that give the clearest differentiation from the “positive” dataset (which is needed for the feature-selection algorithm) are Negative Datasets 3 and 4, i.e. the Intergenic and Random sequences.
2) Negative Dataset 1 shows some trends at 100 bp that appear to be an artifact of the BLAST+ extraction (as suggested by Hong), since they were also present in a previous visualization at the position where blastdbcmd cuts and the perl script joins the sequences.
3) The Intron Sequence (Negative Dataset 2) visualization suggests that cutting 10 bp from the 5’ end of each sequence would be a good idea.
4) Together, these observations strongly suggest that Intergenic and Random sequences (and maybe trimmed Intronic sequences) will be best for the next step.
B. Possible Best Features:
1) This analysis also serves to give an estimate of the features that will be returned by the feature-selection algorithm. This will help me cross-check the results of my feature-selection algorithm and give me some input on the relative significance of different features from the DNAFeatures package.
2) Based on the above visualizations, some good features seem to be: Tri-nucleotide preference, Di-nucleotide DNA Denaturing, Di-nucleotide Bend Stiff (flexibility), Mono-nucleotide Hydrogen Bonding (w1, w1', w2, w2', w3, w3'), and the Nucleotide Compositional measures (%GC, %AT, AT skew, GC skew, Keto skew, Purine skew and AT/GC Ratio).

6. Future work in the coming weeks:
A. Feature Selection algorithm: The SVM-RFE algorithm that I wrote has been constantly giving me errors about “memory allocation” and other things. I think this problem arises because I have been using poorly formatted data from the example files of DNAFeatures. In the coming weeks, I want to implement either this or another feature-ranking method to select significant features. I will abandon my old trials and start afresh for this step.
B. A statistical test can be implemented on the output of DNAFeatures to see whether there are significant differences between the outputs for the “positive” and “negative” datasets.
C. Formatting data to machine-learning compatible format: The data needs to be formatted into a form that machine-learning methods can read.
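One possible target for this formatting is the sparse text format read by LIBSVM-style tools, in which each line is a label followed by 1-based `index:value` pairs. The report does not say which format will be used, so the choice of format, the function name `to_libsvm_line`, the +1/-1 label convention and the example scores below are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def to_libsvm_line(label: int, features: Sequence[float]) -> str:
    """Format one sequence's feature vector as 'label idx:val idx:val ...'."""
    # Feature indices start at 1; zero-valued features are omitted (sparse format).
    pairs = [f"{i}:{v:g}" for i, v in enumerate(features, start=1) if v != 0]
    return " ".join([f"{label:+d}"] + pairs)


# Hypothetical scores: +1 marks a TSS ("positive") sequence, -1 a "negative" one.
rows: List[Tuple[int, List[float]]] = [
    (+1, [0.42, 0.0, -0.13]),
    (-1, [0.37, 0.05, 0.0]),
]
lines = [to_libsvm_line(label, feats) for label, feats in rows]
```

Each entry of `lines` can then be written to a text file, one sequence per line.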
This step will come after construction of the feature-selection algorithm.

Project B – Promoter architecture across species

Current Status and Future Work:
A. TR and KS had a meeting on 08/22/2011 in which we discussed formatting some data for the Cluster of Orthologous Groups analyses part of the project. We exchanged some data that TR had collected and formatted. This is a good idea, but not immediately required.
B. What was immediately needed was a trial run from start to end for at least one species. As the Protist data is immediately available in good detail (see PlasmoDB), we settled on running our analyses on this.
C. GeneSeqer was run on the 14-chromosome Plasmodium falciparum genomic sequence data (~6-7 hrs on my desktop). I will be applying EST2TSS to the results to see the predicted TSSs and will provide the data files to Taylor to run his scripts on. The objective is to complete an end-to-end run for one dataset using all the scripts that will be included in the package, TSRCalc.
D. I asked TR for an input/output format description for his scripts so that I can begin tool integration. As TR is out of town, I am awaiting news on this from him.
E. TR emailed and talked to other people who can provide more data for our analyses (PPDB – Arabidopsis CAGE tag data from Japan, and Tommy Chu – C. cinerea data at Hong Kong University). Tommy agreed to provide the data, but PPDB is yet to get back to both of us.
F. Work for the coming weeks includes: completing the EST2TSS run on the P. falciparum data, providing results to TR to input into his script, laying the framework for the scripts’ integration into the TSRCalc package, and meeting Taylor regarding this and future steps.