Genomic “Features” Table:

Feature list: List of promoter motifs that can be matched to the sequence at hand with some sequence similarity quality measures (use plant promoter motifs, maybe core promoter elements; PlantProm DB). Also, weighted score of pairs between TATA, Inr, CCAAT and GC-box.
Source of feature idea: CoreBoost paper (Zhao et al.)
Implementation idea/outlook: Scores for each element (Resource: PlantProm DB)

Feature list: Weighted maximal scores for Transcription Factor Binding Sites (TFBS) and also density of TFBS.
Source of feature idea: CoreBoost
Implementation idea/outlook: Weight matrices from TRANSFAC

Feature list: Mechanical properties. Weighted energy/flexibility scores around positions -25 and +1; average energy/flexibility scores; correlation with the empirical average energy/flexibility profile (higher flexibility downstream of TSS).
Source of feature idea: CoreBoost
Implementation idea/outlook: (Resource: DNAlive)

Feature list: Markovian scores. Likelihood ratios from 3rd-order Markov models, that is, calculation of log-likelihood ratios between promoter, upstream and downstream (all pairwise: P vs U, U vs D, etc.).
Source of feature idea: CoreBoost

Feature list: k-mer frequencies. Frequency of 1-mers or 2-mers related to A, T, G and C:
1. Preference for G and C at positions 4 and 5, where 1, 2, 3 is ATG (conserved feature across species)
2. Scarcity of G at -2 and C at -3
3. A and T peaks at promoter region at -30
4. A peaks at TSS
5. A rare at -2
6. CG skew max at TSS
7. CpG islands at promoters
Source of feature idea: CoreBoost and “features of At genes” paper (Nickolai N. Alexandrov)

*NOTE: More than 40% of genes have more than one frequent start site; 50-60% have a single start site.

Details about the features:

The features used in the CoreBoost tool (used for human polymerase II promoters) were organized into 3 categories.

Motif features (include core promoter elements and TFBS): Weight matrices were used to score the core promoter elements (TATA, Inr, CCAAT and GC-box) in the genomic region of interest. Pairwise scores were also computed for these 4 elements: the sum of the scores of each element, weighted by the empirical distribution of distances for that pair.
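The weight-matrix and pairwise scoring described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the CoreBoost implementation: the count matrix `TOY_COUNTS` is made up (it is not a real TATA matrix from TRANSFAC or PlantProm DB), the function names are hypothetical, and "weighted by the empirical distribution of distances" is read here as multiplying the summed element scores by the empirical probability of the observed distance.

```python
import math

BASES = "ACGT"

# Hypothetical toy count matrix for a 4 bp motif (NOT a real TATA matrix).
TOY_COUNTS = {
    "A": [1, 8, 1, 8],
    "C": [1, 0, 1, 0],
    "G": [1, 0, 1, 0],
    "T": [7, 2, 7, 2],
}


def log_odds_matrix(counts, background=0.25, pseudocount=1.0):
    """Turn per-position base counts into log2-odds scores vs. a uniform background."""
    width = len(counts["A"])
    matrix = []
    for i in range(width):
        total = sum(counts[b][i] + pseudocount for b in BASES)
        column = {
            b: math.log2(((counts[b][i] + pseudocount) / total) / background)
            for b in BASES
        }
        matrix.append(column)
    return matrix


def best_hit(seq, matrix):
    """Return (score, start) of the best-scoring window; seq must contain only A, C, G, T."""
    width = len(matrix)
    best = (float("-inf"), -1)
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        score = sum(matrix[i][base] for i, base in enumerate(window))
        if score > best[0]:
            best = (score, start)
    return best


def pair_score(hit_a, hit_b, distance_weights):
    """Sum of two element scores, weighted by the empirical probability of their distance.

    distance_weights maps (start_b - start_a) to a probability; unseen distances get 0.
    """
    (score_a, pos_a), (score_b, pos_b) = hit_a, hit_b
    weight = distance_weights.get(pos_b - pos_a, 0.0)
    return weight * (score_a + score_b)
```

For example, `best_hit("GGGTATACCC", log_odds_matrix(TOY_COUNTS))` locates the window starting at index 3. In practice the matrices and the distance distribution would be estimated from the training promoters rather than written by hand.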
Other promoter elements, such as downstream promoter elements and the TFIIB recognition element, are searched for in the form of regular expressions. A TSS weight matrix was created from a 10 bp window [-5, +5] in the training dataset and was used to score the region from position 246 to 255 of a test segment. 365 vertebrate weight matrices from TRANSFAC were used to scan DNA sequences with the tool featuretab (part of the CREAD suite of tools). The score of a TF depends on the empirical distribution of the positional preference of that TF, which was calculated by the MATCH tool. The [-250, +50] region was split into 6 bins, from which multinomial distributions were calculated and used to compute the average log-likelihood scores of the TFBS locations for each sequence; this is the density feature.

Mechanical properties of promoter DNA: The authors observed 2 peaks, at positions -25 and +1. For a 10 bp window at each peak, a weighted score was calculated based on the weights from the empirical score distribution. The difference between the weighted score of a peak and the average score of its surroundings, as well as the average score of the promoter region, were also used as features in this analysis.
Energy/flexibility profiles: The correlation between a vector of smoothed energy/flexibility scores of a test sequence and the average energy/flexibility profiles up to 250 bp, 500 bp and 1300 bp around the TSSs (smoothing window sizes of 5, 150 and 500).

Markovian modeling of promoter sequences: Homogeneous 3rd-order Markov models are estimated from the upstream, promoter and downstream sequences. The log-likelihood ratios between promoter and upstream and between promoter and downstream were used as features. The frequencies of 1-mers or 2-mers related to C or G were also calculated.

Certain properties of the At (Arabidopsis thaliana) genome can also be used as features, such as the A and T composition and frequency at certain genomic positions and the nucleotide abundance at certain positions.
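The Markovian log-likelihood ratio feature described above can be sketched as follows. This is a minimal illustration, assuming add-one smoothing and sequences that contain only A, C, G and T; the function names (`train_markov`, `log_likelihood`, `llr`) are hypothetical, and the likelihood is conditioned on the first 3 bases of each sequence rather than modeling them.

```python
import math
from collections import defaultdict
from itertools import product

ALPHABET = "ACGT"
ORDER = 3  # 3rd-order: each base depends on the 3 preceding bases


def train_markov(seqs, order=ORDER, pseudocount=1.0):
    """Estimate log transition probabilities log P(base | previous `order` bases)."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in seqs:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1
    model = {}
    for context in map("".join, product(ALPHABET, repeat=order)):
        total = sum(counts[context][b] + pseudocount for b in ALPHABET)
        model[context] = {
            b: math.log((counts[context][b] + pseudocount) / total)
            for b in ALPHABET
        }
    return model


def log_likelihood(seq, model, order=ORDER):
    """Log-likelihood of seq under the model, conditioned on its first `order` bases."""
    return sum(model[seq[i - order:i]][seq[i]] for i in range(order, len(seq)))


def llr(seq, model_a, model_b):
    """Log-likelihood ratio of seq under model_a versus model_b."""
    return log_likelihood(seq, model_a) - log_likelihood(seq, model_b)
```

With one model trained per region (promoter P, upstream U, downstream D), the features would be `llr(seq, P, U)` and `llr(seq, P, D)` for each test sequence; a positive value means the sequence looks more promoter-like than the other region under these models.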