Genomic “Features” Table:

Feature: Plant promoter motifs, core promoter elements (Resource: PlantProm DB)
Source of feature idea: CoreBoost paper (Zhao et al)
Implementation idea/outlook: A list of promoter motifs (from PlantProm DB) that can be matched to the sequence at hand with some sequence similarity quality measure (maybe use a score for each element). Also, a weighted score for pairs between TATA, Inr, CCAAT and GC-box.

Feature: Transcription Factor Binding Sites
Source of feature idea: CoreBoost
Implementation idea/outlook: Weighted maximal scores for weight matrices from TRANSFAC, and also the density of TFBS.

Feature: Mechanical properties (Resource: DNAlive)
Source of feature idea: CoreBoost
Implementation idea/outlook: Weighted energy/flexibility scores around positions -25 and +1. Average energy/flexibility scores. Correlation with the empirical average energy/flexibility profile (higher flexibility downstream of the TSS).

Feature: Markovian scores
Source of feature idea: CoreBoost
Implementation idea/outlook: Likelihood ratios from 3rd-order Markov models, that is, calculation of log-likelihood ratios between promoter, upstream and downstream (all pairwise: P vs U, U vs D, etc.).

Feature: k-mer frequencies
Source of feature idea: CoreBoost and the “features of At genes” paper (Nickolai N. Alexandrov)
Implementation idea/outlook: Frequencies of 1-mers or 2-mers related to A, T, G and C:
1. Preference for G and C at positions 4 and 5, where positions 1, 2, 3 are ATG (a feature conserved across species)
2. Scarcity of G at -2 and C at -3
3. A and T peaks in the promoter region at -30
4. A peaks at the TSS
5. A is rare at -2
6. CG skew is maximal at the TSS
7. CpG islands at promoters
*NOTE: More than 40% of genes have more than one frequently used start site; 50-60% have a single start site.
Details about the features:
The features used in the CoreBoost tool (used for human RNA polymerase II promoters) were
organized into 3 categories:
Motif features (include core promoter elements and TFBS):
Weight matrices were used to score the core promoter elements (TATA, Inr, CCAAT and GC-box) in
the genomic region of interest.
Pairwise scores were also computed for these 4 elements: the score of a pair is the sum of the scores
of its two elements, weighted by the empirical distribution of distances for that pair.
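The weight matrix and pairwise scoring described above can be sketched as follows. This is a minimal illustration, not CoreBoost's implementation: the function names, the pseudocount, the uniform background and the `distance_weight` lookup table are all assumptions made here, and the real matrices and empirical distance distributions would have to come from the training data.

```python
import math

def log_odds_pwm(counts, background=0.25, pseudocount=1.0):
    """Turn per-position base counts into a log-odds weight matrix.

    `counts` is a list of dicts (one per motif position) mapping base -> count.
    A uniform background is an assumption of this sketch.
    """
    pwm = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        pwm.append({
            base: math.log2(((column.get(base, 0) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm

def best_hit(seq, pwm):
    """Return (best score, start offset) over all windows of seq.

    seq must contain only A, C, G and T.
    """
    width = len(pwm)
    best = (float("-inf"), -1)
    for i in range(len(seq) - width + 1):
        score = sum(pwm[j][seq[i + j]] for j in range(width))
        if score > best[0]:
            best = (score, i)
    return best

def pair_score(hit_a, hit_b, distance_weight):
    """Sum of two element scores, weighted by how typical their spacing is.

    `distance_weight` maps a spacing (in bp) to an empirical weight; spacings
    never seen in training get weight 0.0 (an assumption of this sketch).
    """
    (score_a, pos_a), (score_b, pos_b) = hit_a, hit_b
    return distance_weight.get(pos_b - pos_a, 0.0) * (score_a + score_b)
```

In use, `best_hit` would be run once per element (TATA, Inr, CCAAT, GC-box) and `pair_score` once per pair of elements, giving one feature per element and one per pair.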
Other promoter elements, such as downstream promoter elements and the TFIIB recognition element, were
searched for in the form of regular expressions.
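A regular-expression search of this kind might look like the sketch below, which expands an IUPAC consensus string into a regex. The helper names are hypothetical, and the consensus used in the usage note is only an example; the exact patterns used by CoreBoost are not given in these notes and should be taken from the literature.

```python
import re

# Standard IUPAC nucleotide codes expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def consensus_to_pattern(consensus):
    """Expand an IUPAC consensus string into a regex pattern string."""
    return "".join(IUPAC[code] for code in consensus.upper())

def find_element(seq, consensus):
    """Return start offsets of all matches, including overlapping ones."""
    # A lookahead lets finditer report overlapping occurrences.
    pattern = re.compile("(?=" + consensus_to_pattern(consensus) + ")")
    return [match.start() for match in pattern.finditer(seq.upper())]
```

For example, `find_element("AAGGACGCCTT", "SSRCGCC")` reports a match starting at offset 2; the resulting presence/absence or match count could then serve as a feature.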
A TSS weight matrix was created from a 10 bp window [-5, +5] in the training dataset, and this was used to
score the region from position 246 to 255 of a test segment.
365 vertebrate weight matrices from TRANSFAC were used to scan DNA sequences with the tool featuretab
(part of the CREAD suite of tools). The score of a TF depends on the empirical distribution of the positional
preference of that TF, which was calculated by the MATCH tool. The [-250, +50] region was split into 6
bins, from which multinomial distributions were estimated and used to calculate the average log-likelihood
score of the TFBS locations for each sequence; this is the density feature.
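One way to read the density feature is sketched below: estimate a 6-bin multinomial from the hit positions of a TF in training promoters, then average the log-probabilities of the bins in which hits fall for a test sequence. Equal-width 50 bp bins, the pseudocount, the score of 0.0 for sequences without hits and all function names are assumptions of this sketch, not details taken from the paper.

```python
import math

REGION_START, REGION_END, N_BINS = -250, 50, 6
BIN_WIDTH = (REGION_END - REGION_START) // N_BINS  # 50 bp, assuming equal bins

def bin_index(position):
    """Map a TSS-relative position in [-250, +50) to a bin number 0..5."""
    if not REGION_START <= position < REGION_END:
        raise ValueError("position outside the [-250, +50) region")
    return (position - REGION_START) // BIN_WIDTH

def estimate_bin_probs(training_positions, pseudocount=1.0):
    """Multinomial bin probabilities from TFBS positions seen in training."""
    counts = [pseudocount] * N_BINS
    for position in training_positions:
        counts[bin_index(position)] += 1
    total = sum(counts)
    return [count / total for count in counts]

def density_score(hit_positions, bin_probs):
    """Average log-likelihood of the hit locations in one sequence."""
    if not hit_positions:
        return 0.0  # assumption: no hits contributes a neutral score
    log_probs = [math.log(bin_probs[bin_index(p)]) for p in hit_positions]
    return sum(log_probs) / len(log_probs)
```

A sequence whose hits sit in the bins preferred during training scores closer to zero; hits in rarely used bins pull the average down.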
Mechanical properties of promoter DNA:
The authors observed 2 peaks, at positions -25 and +1. For a 10 bp window at each peak, a weighted score
was calculated using weights from the empirical score
distribution.
The difference between the weighted score of a peak and the average score of its surroundings,
as well as the average score of the promoter region, were also used as features in this analysis.
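The three peak-related features can be sketched as below. The per-position `profile` of energy/flexibility scores, the `weights` vector (assumed here to sum to 1 so the weighted score is comparable to an average), the `flank` size and the function names are all hypothetical; the notes do not specify how wide the "surroundings" are.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

def weighted_window_score(profile, start, weights):
    """Weighted sum of the profile over a window beginning at `start`."""
    window = profile[start:start + len(weights)]
    if len(window) != len(weights):
        raise ValueError("window runs past the end of the profile")
    return sum(w * s for w, s in zip(weights, window))

def peak_features(profile, start, weights, flank=20):
    """Peak score, its contrast with the flanks, and the region average.

    The window must not span the whole profile, otherwise there are
    no flanking positions to average.
    """
    end = start + len(weights)
    surroundings = profile[max(0, start - flank):start] + profile[end:end + flank]
    peak = weighted_window_score(profile, start, weights)
    return {
        "peak": peak,
        "contrast": peak - mean(surroundings),
        "region_mean": mean(profile),
    }
```

The function would be called twice per sequence, once for the window around -25 and once for the window around +1.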
Energy/flexibility profiles: the correlation between a vector of smoothed energy/flexibility scores of a
test sequence and the average energy/flexibility profiles up to 250 bp, 500 bp and 1300 bp around the TSSs
(smoothing window sizes are 5, 150 and 500).
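The correlation feature can be illustrated with a moving-average smoother and a Pearson correlation. The centered, edge-truncated window and the function names are assumptions of this sketch; the notes do not say which smoothing scheme or correlation measure was used.

```python
import math

def smooth(values, window):
    """Centered moving average; the window is truncated at the edges.

    For an even `window` the effective width is window + 1, because the
    window extends window // 2 positions on each side.
    """
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half):i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("vectors must have equal length of at least 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def profile_correlation(test_scores, average_profile, window=5):
    """Correlate a smoothed test profile with the empirical average profile."""
    return pearson(smooth(test_scores, window), average_profile)
```

Calling `profile_correlation` once per region size, each with its own smoothing window, would yield three correlation features per sequence.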
Markovian modeling of promoter sequences:
Homogeneous 3rd-order Markov models were estimated from the upstream, promoter and downstream
sequences. The log-likelihood ratios between promoter and upstream and between promoter and
downstream were used as features.
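A minimal sketch of the Markovian feature follows: a homogeneous 3rd-order Markov model is trained per region class, and a test sequence is scored by the difference of its log-likelihoods under two models. Add-one smoothing, ignoring the first three bases of each sequence and the function names are assumptions made here for illustration.

```python
import math
from collections import defaultdict

ORDER = 3
ALPHABET = "ACGT"

def train_markov(sequences, order=ORDER, pseudocount=1.0):
    """Return a function prob(context, base) = P(base | previous `order` bases)."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1

    def prob(context, base):
        # Unseen contexts fall back to a uniform distribution via the pseudocount.
        seen = counts.get(context, {})
        total = sum(seen.values()) + pseudocount * len(ALPHABET)
        return (seen.get(base, 0.0) + pseudocount) / total

    return prob

def log_likelihood(seq, model, order=ORDER):
    """Log-likelihood of seq, skipping the first `order` bases (no initial state model)."""
    return sum(math.log(model(seq[i - order:i], seq[i])) for i in range(order, len(seq)))

def log_likelihood_ratio(seq, model_a, model_b, order=ORDER):
    """Positive values mean seq looks more like model_a's training data."""
    return log_likelihood(seq, model_a, order) - log_likelihood(seq, model_b, order)
```

With three models (promoter, upstream, downstream), `log_likelihood_ratio` gives the promoter vs upstream and promoter vs downstream features directly; the upstream vs downstream ratio mentioned in the table can be computed the same way.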
The frequencies of 1-mers or 2-mers related to C or G were also calculated. Certain characteristics
of the At (Arabidopsis thaliana) genome can also be used as features, such as the A and T composition and frequency
at certain genomic positions and the nucleotide abundance at certain positions.
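Both kinds of composition feature can be sketched in a few lines: overall k-mer frequencies for a sequence, and the frequency of a given base at a TSS-relative position across a set of aligned sequences. The coordinate convention (the TSS is position +1 and there is no position 0) and the function names are assumptions of this sketch.

```python
from collections import Counter

def kmer_frequencies(seq, k):
    """Relative frequency of each overlapping k-mer in seq."""
    total = len(seq) - k + 1
    if total <= 0:
        return {}
    counts = Counter(seq[i:i + k] for i in range(total))
    return {kmer: n / total for kmer, n in counts.items()}

def offset_to_index(tss_index, offset):
    """Convert a TSS-relative offset (+1 = TSS, no 0) to a string index."""
    if offset == 0:
        raise ValueError("offset 0 does not exist in this convention")
    return tss_index + offset - 1 if offset > 0 else tss_index + offset

def base_frequency_at(sequences, tss_index, offset, base):
    """Fraction of aligned sequences carrying `base` at the given offset."""
    index = offset_to_index(tss_index, offset)
    hits = sum(1 for seq in sequences if seq[index] == base)
    return hits / len(sequences)
```

For example, `base_frequency_at(promoters, tss_index, -2, "A")` would estimate how rare A is at position -2 in a hypothetical list `promoters` of sequences aligned on their TSS.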