Biological relevance of unusual motifs:

Chapter 1 Sequence statistics
4-letter nucleotide alphabet N = {A,C,G,T}; sequence s = s1s2…sn
Sequencing projects: identify main structures & predict biological function
Multinomial sequence model
Nucleotides are independent (no correlations among neighbouring nucleotides) and identically distributed (i.i.d.): stationarity (same symbol frequencies in all regions)
p = {pA,pC,pG,pT}, pA + pC + pG + pT = 1
$P(s) = \prod_{i=1}^{n} p(s(i))$
Markov sequence model
Probability start state π
State transition matrix T
Models local correlations
Order? → model selection (hypothesis testing)
$P(s) = \pi(s(1)) \prod_{i=2}^{n} p(s(i-1), s(i))$, with $p(a,b) = T(a,b)$
Multinomial is special case
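The two sequence models can be compared on a toy example. The sketch below is illustrative only: the function names and the uniform parameters are assumptions, not part of the notes. It computes log-probabilities to avoid underflow on long sequences.

```python
import math

def multinomial_logprob(s, p):
    """log P(s) = sum_i log p(s(i)) under the i.i.d. multinomial model."""
    return sum(math.log(p[c]) for c in s)

def markov_logprob(s, pi, T):
    """log P(s) = log pi(s(1)) + sum_{i>=2} log T(s(i-1), s(i))."""
    logp = math.log(pi[s[0]])
    for prev, cur in zip(s, s[1:]):
        logp += math.log(T[prev][cur])
    return logp

# Hypothetical parameters: uniform base frequencies.
uniform = {c: 0.25 for c in "ACGT"}
# A Markov chain whose rows all equal p reduces to the multinomial model.
T_uniform = {a: dict(uniform) for a in "ACGT"}
```

With every row of T equal to p, both functions return the same value, which illustrates that the multinomial model is a special case of the Markov model.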
Base composition (amount, frequency) (CG and AT close but different)
sliding window plot
# essential genes   from chromosomal mean recombination rate
GC content (% or SW)
• Detect foreign genetic material (horizontal gene transfer)
• change point analysis
• AT denatures (=splits) at lower temperatures
• Thermophilic Archaebacteria: high GC
• Evolution: Archaea > Eubacteria > Eukaryotes
Finding unusual/most frequent DNA words (dimers, trimers, k-mers)
• Frequent words
• Repetitive elements
• Sequences with biological functions (e.g. gene regulatory features)
• Rare motifs
• Binding sites for transcription factors
• Undesirable structural properties (CTAG: kinking)
• Internal immune system (restriction sites)
Word's observed/expected ratio: the expected count under the multinomial model takes into account the base frequencies pA, pC, pG, pT
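A minimal sketch of the observed/expected ratio, assuming the base frequencies are estimated from the sequence itself and that overlapping occurrences are counted (both are assumptions of this example, as is the function name):

```python
from collections import Counter

def obs_exp_ratio(seq, k):
    """Observed/expected count for every k-mer that occurs in seq.

    Expected count under the multinomial model:
    (number of windows) * product of the base frequencies of the word.
    """
    n = len(seq)
    base_freq = {b: c / n for b, c in Counter(seq).items()}
    windows = n - k + 1
    counts = Counter(seq[i:i + k] for i in range(windows))
    ratios = {}
    for word, observed in counts.items():
        p_word = 1.0
        for b in word:
            p_word *= base_freq[b]
        ratios[word] = observed / (p_word * windows)
    return ratios
```

Ratios well above 1 flag over-represented words, ratios well below 1 flag rare words; words that never occur are not listed and would need to be enumerated separately.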
Biological relevance of unusual motifs:
• Mutational
• Selective
Pattern matching versus pattern discovery
Genomics & Proteomics
Proteomics more complicated than genomics. Genome rather constant entity. Proteome differs from cell to
cell and constantly changing through biochemical interactions with genome and environment. Different
protein expression in different parts of same organism’s body, in different stages of life cycle and in
different environmental conditions.
Protein Structure = protein function
Primary: AA sequence, Secondary: Alpha-helix/Beta-sheet, Super-secondary
Transcription proofreading mechanisms fewer and less effective than DNA replication.
Transcription & DNA replication in 5' → 3' direction (old polymer read 3' → 5'; new, complementary
fragments generated 5' → 3')
Proteins with no genes coding them: P1 + P2 → P3
There exist genes that are always on. For regulation the produced enzymes are cut by other enzymes
(inefficient)
Chapter 2 Gene Finding
Number of genes hard to estimate because of:
• Pseudogenes
• DNA segments that seemed parts of separate genes which are part of the same gene
64 possible triplets (for 20 AAs)
Only one codon for methionine, so ATG (initiation codon) specifies start of reading frame.
Stop (termination) codons: TGA, TAA, TAG (no associated amino acid)
Mutations (due to biotic / abiotic factors):
- Change one nucleotide to another (negligible effect on protein function): 1/0 different amino acid
- Indels (e.g. frame-shift mutation: wrong reading frame)
Gene-finding methods:
- Ab initio: based on statistical properties; basic ones suffice for prokaryotes, but insufficient for
eukaryotic nucleic genomes (find candidate-genes)
- Markov sequence model
- Homology-based: comparing sequences (with known / unknown genes / genomes)
Because prokaryotic genes do not contain introns, gene finding can be simplified into search for ORFs.
ORF-finder (does not identify short genes)
Given a DNA sequence s and a positive integer k
For each possible reading frame (3 per strand, also for the reverse complement of s → 6)
decompose the sequence into triplets
find all stretches of triplets starting with START-codon and ending with STOP-codon
Output all ORFS longer than or equal to the prefixed threshold k
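The ORF-finder steps above can be sketched as follows. This is one possible reading of the pseudocode, with assumptions: an ORF starts at the first ATG after the previous stop in that frame, its length is counted in codons including start and stop, and coordinates on the "-" strand refer to the reverse-complemented string.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(s):
    return s.translate(COMPLEMENT)[::-1]

def find_orfs(s, k):
    """Return (strand, frame, start, end, n_codons) for all ORFs of >= k codons."""
    orfs = []
    for strand, seq in (("+", s), ("-", reverse_complement(s))):
        for frame in range(3):
            start = None
            # Walk through the non-overlapping triplets of this reading frame.
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOPS:
                    n_codons = (i + 3 - start) // 3
                    if n_codons >= k:
                        orfs.append((strand, frame, start, i + 3, n_codons))
                    start = None
    return orfs
```

For example, `find_orfs("CCATGAAATTTTAGGG", 3)` reports the single ORF ATG AAA TTT TAG in frame 2 of the forward strand.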
Finding k
A pattern in DNA can arise by pure chance, so an ORF needs a minimum length to be called a gene.
How unlikely does the ORF have to be (under background model) for us to accept it?
Tail of ORF-length distribution in original DNA longer than in randomized DNA
H0 = randomized DNA seq.: keep ORFs in original DNA longer than (most) ORFs in randomized DNA
(α,z) = (0.01,2.33), (0.05,1.65)
α-significance represents false positive rate of one single test
type I-error (FP = False Positive) of H0
type II-error (FN = False Negative) of H0
Computing a p-value for ORFs
• Pstop = P(TAA) + P(TAG) + P(TGA)
• P(k non-stop codons) = (1 - Pstop)^k
• For a significance-level α we need k* codons with: (1 - Pstop)^k* = α
• a priori probability non stop-codon = 61/64
• 95%-significance: (61/64)^k ≈ 0.05 → floor(k) ≈ 62 → 64 codons (add start & stop)
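The threshold follows from solving (1 - Pstop)^k = α for k. A short sketch (the function name is an assumption; the default Pstop = 3/64 is the a priori value used above):

```python
import math

def min_orf_codons(alpha, p_stop=3 / 64):
    """Solve (1 - p_stop)**k = alpha for k (number of non-stop codons)."""
    return math.log(alpha) / math.log(1 - p_stop)

k = min_orf_codons(0.05)   # about 62.4 non-stop codons
orf_length = math.floor(k) + 2   # add start and stop codon: 64
```

In practice Pstop would be estimated from the codon frequencies of the genome at hand rather than taken as 3/64.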
Randomization tests
• Generate a string with the same statistical properties of the original data
• Permutation
• Bootstrapping: sampling with replacement
• Per nucleotide? per triplet? per … ?
• p-value: rank of observed test statistic in null distribution: significant if its percentile < α
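A randomization test of this kind can be sketched as below, assuming permutation per nucleotide and a user-supplied test statistic (for instance the longest ORF length). The function name, the +1 correction and the fixed seed are choices of this example.

```python
import random

def permutation_pvalue(seq, statistic, n_perm=1000, seed=0):
    """Empirical p-value: share of shuffled sequences whose statistic
    is at least as large as the observed one."""
    rng = random.Random(seed)
    observed = statistic(seq)
    letters = list(seq)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(letters)   # keeps the base composition, destroys the order
        if statistic("".join(letters)) >= observed:
            hits += 1
    # +1 in numerator and denominator avoids reporting a p-value of exactly 0.
    return (hits + 1) / (n_perm + 1)
```

Shuffling per nucleotide preserves only base composition; shuffling per triplet would also preserve codon frequencies, which changes the null hypothesis being tested.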
Problems with multiple testing
• If we conduct 100 tests 5 false positives are expected
• Finding 5 significant genes out of 100 with α = 0.05 does not mean anything biologically
Chapter 3 Sequence alignment
Genome annotation:
- Find candidate genes
- Assign protein function
Uses of sequence alignment:
- Prediction of function of similar genes (no experimental evidence)
- Database searching for homologs (func. known or not; requires approx. methods)
- Gene finding (too short for ab-initio / stronger evidence than ORF-finder)
- Sequence divergence (variation within populations and between species)
- Evolution of the sequences from a common ancestor.
- Mismatches in the alignment correspond to mutations,
- and gaps correspond to insertions or deletions.
- Sequence assembly / can reconstruct missing parts since % same
Protein domains most important part of molecule, so evolve slowly (few mutations allowed), so easier to
align.
Blast (pairwise heuristic local alignment)
Inputs: word length l, threshold theta
1. Find all l-length words of database sequences that align with words from the query with alignment score > theta (hotspots = hits)
2. Extend each hit to find if it is contained within larger segment pair with score > theta
Global alignment deprecated (closely-related sequences also identified by local alignment)
Local alignment has the advantage that related regions which appear in a different order in the two proteins
(which is known as domain shuffling) can be identified as being related
Multiple sequence alignment is NP-Hard.
Smith-Waterman: O(mn)
Statistical analysis of alignments
• Generate randomized sequences based on sequence 2
• Determine the optimal alignments of sequence 1 with these randomized sequences
• Compute a histogram and rank the observed score in this histogram
• The relative position defines the p-value
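The procedure above can be sketched with a score-only Smith-Waterman (O(mn) time) and shuffled copies of the second sequence. The scoring values (match +1, mismatch -1, linear gap -2) and the function names are assumptions chosen for illustration.

```python
import random

def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score, keeping only two rows of the DP table."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best

def alignment_pvalue(s1, s2, n_random=200, seed=0):
    """Rank the observed score among scores against shuffled versions of s2."""
    rng = random.Random(seed)
    observed = smith_waterman_score(s1, s2)
    letters = list(s2)
    hits = 0
    for _ in range(n_random):
        rng.shuffle(letters)
        if smith_waterman_score(s1, "".join(letters)) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_random + 1)
```

The smallest p-value this can report is 1/(n_random + 1), so the number of randomizations limits the attainable resolution.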
Chapter 4 Hidden Markov Models
Odorant Receptors (ORs) sense certain molecules outside the cell and signal inside the cell. ORs contain 7 transmembrane domains. Hidden states: in, out. Lambda-phage states: AT-rich, GC-rich.
HMM Applications DNA-segmentation
VITERBI & FORWARD Dynamic Programming algorithm
Given a sequence s of length n and an HMM with params (T,E):
1. Create table V of size |H|x(n+1);
2. Initialize i=0; V(0,0)=1; V(k,0)=0 for k>0;
3. For i=1:n, compute each entry using the recursive relation:
VITERBI: V(j,i) = E(j,s(i))*max_k {V(k,i-1)*T(k,j)}, pointer(i,j) = arg max_k {V(k,i-1)*T(k,j)}
FORWARD: F(j,i) = E(j,s(i))*∑_k {F(k,i-1)*T(k,j)}
4. VITERBI OUTPUT: P(s,h*) = max_k {V(k,n)}
FORWARD OUTPUT: P(s) = ∑_k {F(k,n)}, where $P(s) = \sum_{h_j \in H^n} P(s, h_j) = \sum_{h_j \in H^n} P(s \mid h_j) P(h_j)$
5. VITERBI trace-back: i=n:1, using: h*(i-1) = pointer(i, h*(i)), which yields $h^* = \arg\max_{h \in H^n} P(s, h)$
6. VITERBI OUTPUT: h*, where the trace-back starts from h*(n) = arg max_k {V(k,n)}
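The two recursions can be sketched as follows. The two-state model (AT-rich vs GC-rich), its transition and emission values, and the silent "start" state standing in for state 0 are hypothetical choices for illustration. Plain probabilities are used for readability; long sequences would need log-space or scaling to avoid underflow.

```python
def viterbi(s, states, T, E, start="start"):
    """Return (P(s, h*), h*) for the most probable hidden path h*."""
    V = [{start: 1.0}]          # V[i][j]: best path probability ending in j after i symbols
    pointer = [{}]
    for i, symbol in enumerate(s, start=1):
        V.append({})
        pointer.append({})
        for j in states:
            best_k, best_p = max(
                ((k, V[i - 1][k] * T[k].get(j, 0.0)) for k in V[i - 1]),
                key=lambda kp: kp[1])
            V[i][j] = E[j][symbol] * best_p
            pointer[i][j] = best_k
    last = max(states, key=lambda k: V[len(s)][k])
    path = [last]
    for i in range(len(s), 1, -1):      # trace-back from position n to 2
        path.append(pointer[i][path[-1]])
    path.reverse()
    return V[len(s)][last], path

def forward(s, states, T, E, start="start"):
    """Return P(s), summing over all hidden paths."""
    F = {start: 1.0}
    for symbol in s:
        F = {j: E[j][symbol] * sum(F[k] * T[k].get(j, 0.0) for k in F)
             for j in states}
    return sum(F.values())

# Hypothetical two-state model.
states = ["AT", "GC"]
T = {"start": {"AT": 0.5, "GC": 0.5},
     "AT": {"AT": 0.9, "GC": 0.1},
     "GC": {"AT": 0.1, "GC": 0.9}}
E = {"AT": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
     "GC": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4}}
prob, path = viterbi("AATTGGCC", states, T, E)
```

The only difference between the two functions is max versus sum in the recursion, which mirrors the VITERBI/FORWARD lines above.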
The EM (Expectation Maximization) algorithm
Given a sequence s and an HMM with unknown (T,E):
1. Initialize h, E and T;
2. Given s and h estimate E and T just by counting the symbols;
3. Given s, E and T estimate h e.g. with Viterbi-algorithm;
4. Repeat steps 2 and 3 until some criterion is met.
HMMs:
• better in detecting genes than sequence alignment
• can detect introns and exons
• Downside: computationally much more demanding!
Profile hidden Markov models (pHMM)
Characterize sets of homologous genes and proteins based on common patterns in their sequence.
• pHMMs allow summarizing the salient features of a protein alignment in one single model
• pHMMs can also be used to produce multiple alignments
Alternative to:
• multiple alignments of all elements in the family
• Position Specific Scoring Matrices (PSSM) (cannot handle variable lengths or gaps)
Viterbi Exon-Intron Locator (VEIL)
• Gene finder with a modular structure:
• Uses a HMM which is made up of sub-HMMs each to describe a different bit of the sequence:
upstream noncoding DNA, exon, intron, …
• Assumes test data starts and ends with noncoding DNA and contains exactly one gene.
• Uses biological knowledge to “hardwire” part of HMM, eg. start + stop codons, splice sites.
Chapter 5 Variation within and between species (Jukes-Cantor & Kimura)
α: Substitution probability per site per second
μ: Mutation rate
ρ: Substitution rate: rate at which a species fixes new mutations
N: Population size
Mutations originate in single individuals, accumulate on germ line (distance ~ relation)
Mitoch. DNA inherited only via maternal line → not reshuffled → suited for comparing evolution
Polymorphism due to SNP & STR
For neutral mutations ρ = K/(2T) = 2Nμ*1/(2N) = μ (diploid genome → *2)
Substitutions are independent (?), random & symmetric (prob(A→T) = prob(T→A)) (no!)
Sequence evolution is Markov process: s(t) depends only on s(t-1)
Back-mutations mutate a nucleotide back to an earlier value (K ≥ d)
$M_{JC}(i,i) = 1-\alpha$, $M_{JC}(i,j) = \alpha/3$ for $i \neq j$, $M(t) = M_{JC}^t$
$\lambda_1 = 1$, $v_1 = \tfrac12 (1, 1, 1, 1)^T$
$\lambda_{2..4} = 1 - 4\alpha/3$, $v_2 = \tfrac12 (-1, -1, 1, 1)^T$, $v_3 = \tfrac12 (1, -1, -1, 1)^T$, $v_4 = \tfrac12 (1, -1, 1, -1)^T$
$M_{JC}^t = \sum_i \lambda_i^t v_i v_i^T$
$M_{JC}^t(i,i) = r(t) = \tfrac14 + \tfrac34 (1 - 4\alpha/3)^t$
$M_{JC}^t(i,j) = s(t) = \tfrac14 - \tfrac14 (1 - 4\alpha/3)^t$ for $i \neq j$
$d = 3 s(t) = \tfrac34 - \tfrac34 (1 - 4\alpha/3)^t$
d: Observed distance (proportion of differences)
For small α: $\alpha t \approx -\tfrac34 \ln(1 - \tfrac43 d)$
$K = \alpha t \approx -\tfrac34 \ln(1 - \tfrac43 d)$: Actual distance (proportion)
For small d, ln(1+x) ≈ x: K ≈ d, actual distance ≈ observed distance
For saturation, d ↑ ¾: K → ∞, d ≈ random sequence-distance → K indeterminate
If $K = f(d)$ then $\delta K = \frac{\partial K}{\partial d} \delta d$ and $(\delta K)^2 = \left(\frac{\partial K}{\partial d}\right)^2 (\delta d)^2$
So $Var(K) = \left(\frac{\partial K}{\partial d}\right)^2 Var(d)$
Generation of a sequence of length n with substitution rate d is a binomial process:
$Prob(k) = \binom{n}{k} d^k (1-d)^{n-k}$
and therefore with variance: $Var(d) = d(1-d)/n$
Because of the Jukes-Cantor formula: $\frac{\partial K}{\partial d} = \frac{1}{1 - \tfrac43 d}$
$Var(K) = \frac{d(1-d)}{n (1 - \tfrac43 d)^2}$
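The Jukes-Cantor correction and its variance translate directly into code. A minimal sketch (function name and error handling are assumptions of this example):

```python
import math

def jukes_cantor(d, n):
    """Return (K, Var(K)) for observed proportion of differences d
    between two aligned sequences of length n."""
    if not 0 <= d < 0.75:
        raise ValueError("d must be in [0, 0.75); at 0.75 the sequences are saturated")
    K = -0.75 * math.log(1 - 4 * d / 3)
    var_K = d * (1 - d) / (n * (1 - 4 * d / 3) ** 2)
    return K, var_K

K, var_K = jukes_cantor(0.3, 100)   # K is about 0.383, larger than d = 0.3
```

The variance grows rapidly as d approaches ¾, which is the numerical face of the saturation problem noted above.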
α: Transition probability (G↔A (purines) and T↔C (pyrimidines)) per site per second
β: Transversion probability (G↔T, G↔C, A↔T, and A↔C) per site per second
P(t): Fraction of transitions per site after t generations
Q(t): Fraction of transversions per site after t generations
K ≈ - ½ ln(1-2P-Q) – ¼ ln(1 – 2Q)
d = P + Q
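The Kimura distance formula above as a small function (the name `kimura_2p` is an assumption of this example):

```python
import math

def kimura_2p(P, Q):
    """K from the observed fractions of transitions (P) and transversions (Q)."""
    return -0.5 * math.log(1 - 2 * P - Q) - 0.25 * math.log(1 - 2 * Q)

K = kimura_2p(0.10, 0.05)   # about 0.170, versus observed d = P + Q = 0.15
```

As with Jukes-Cantor, the corrected distance exceeds the observed proportion of differences because multiple hits at the same site are accounted for.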
Validation: generate K* mutations, Count d, estimate K(d) with JC, Plot K(d) – K*
Adaptations: transition/transversion types, Amino-acid substitutions matrices
Chapter 6 Natural selection at the molecular basis (selection pressure)
HIV recognizes helper T-cells of the human immune system. Infected T-cells have viral proteins sticking out
that can be recognized by the immune system (evolutionary arms race).
Fast evolution (can be visualized with phylogenetic tree) because of:
• Short reproduction span: 1.5 days to reproduce
• RNA → High error rate
How to measure whether mutations are neutral, deleterious, or advantageous?
• Experimentally difficult: short-lived organisms & large populations (typically virus)
• Alternative: count number of mutations that can change the protein and those that don’t
Synonymous mutation: new codon translates for same amino-acid
Non-synonymous mutation: new codon translates for different amino-acid
KA: #non-synonymous substitutions per non-synonymous site
KS: #synonymous substitutions per synonymous site
f0: fraction of non-synonymous mutations that are neutral
v: mutation rate
α: fraction of non-synonymous mutations that are advantageous
A priori many more non-synonymous mutations possible than synonymous
Reasoning:
• Advantageous mutations are very rare (most non-neutral mutations detrimental)
• Deleterious mutations will ‘not’ spread through a population
• Therefore, most mutations are neutral
Strong negative selection → Few non-synonymous substitutions
KA / KS tells us about the strength and form of the natural selection
# non-synonymous mutations after time t: KA = v f0 t
# synonymous mutations after time t: KS = v t (all synonymous mutations neutral)
KA / KS = f0
KA / KS < 1: strong negative selection
KA / KS > 1: evidence for advantageous non-synonymous mutations
With advantageous mutations, after time t (averaged over the gene!):
KA = v(f0 + α)t
KA / KS = f0 + α
KA / KS < 1: negative selection dominates; KA / KS > 1: positive selection dominates
Nei-Gojobori Algorithm
Linear in the size of the sequences
Assumptions:
- Assume that rate of transitions and transversions is the same
- There is no bias towards codon usage (i.e. no information on the ensuing protein)
s1,s2 aligned, homologous sequences without gaps (excluding stop-codon)
r: (length of s1) = (length of s2) in codons
ck: kth codon pair e.g. (TTA, TTT)
fi(c): fraction of changes at ith position of codon c resulting in synonymous change (i=1,2,3)
sc(ck) / sd(ck): #synonymous sites / differences in kth codon [pair]; sc(ck) = ∑i fi(ck)
ac(ck) / ad(ck): #non-synonymous sites / differences in kth codon [pair]; ac(ck) = 3 - sc(ck) = 3 - ∑i fi(ck)
Sc / Sd: #synonymous sites / differences between s1 and s2
Ac / Ad: #non-synonymous sites / differences between s1 and s2
Sd + Ad: #differences between the two sequences
ds / da: synonymous / non-synonymous distance
STEP 1: Count A and S sites
Compute sc1(ck) and sc2(ck) k = 1, …, r
Compute sd(ck) and ad(ck)
k = 1, …, r
Sc[1,2] = ∑k=1:r sc[1,2](ck)
Ac[1,2] = 3r - Sc[1,2]
Ŝc = ½(Sc1+ Sc2)
Âc = ½(Ac1+ Ac2)
Codon TTA codes for Leucine
6 synonyms for Leucine: TTA CTA CTG CTC CTT TTG
f1 : ATA(-),GTA(-),CTA(+) from 3 changes, so: 1/3
f2 : TAA(-),TGA(-),TCA(-) from 3 changes, so: 0/3
f3 : TTG(+),TTC(-),TTT(-) from 3 changes, so: 1/3
So:
sc(ck) = ∑ fi = 2/3
ac(ck) = 3 - sc(ck) = 7/3
STEP 2: Count A and S differences
Sd = ∑k=1:r sd(ck)
Ad = ∑k=1:r ad(ck)
n differences → n! pathways (n=0,1,2,3)
n=1: sd((GTT, GTA)) = 1 ad((GTT, GTA)) = 0
n=2: pathway 1: TTT (Phe) ↔ GTT (Val) ↔ GTA (Val): 1 synonymous, 1 non-synonymous
pathway 2: TTT (Phe) ↔ TTA (Leu) ↔ GTA (Val): 0 synonymous, 2 non-synonymous
sd((TTT, GTA)) = (1+0)/2 = 0.5
ad((TTT, GTA)) = (1+2)/2 = 1.5
STEP 3: Compute KA and KS
ds = Sd/Ŝc
da = Ad/Âc
Ks = -¾ ln(1-4/3 ds)
Ka = -¾ ln(1-4/3 da)
//Jukes-Cantor
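The three steps can be sketched compactly. This is a simplified illustration under stated assumptions: it uses the standard genetic code, counts changes to or through stop codons as non-synonymous (as in the TTA example above), and does not guard against saturation (ds or da ≥ ¾) or zero site counts. All function names are hypothetical.

```python
import math
from fractions import Fraction
from itertools import permutations, product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def synonymous_sites(codon):
    """sc(c) = sum over the three positions of f_i(c)."""
    total = Fraction(0)
    for pos in range(3):
        mutants = [codon[:pos] + b + codon[pos + 1:] for b in BASES if b != codon[pos]]
        total += Fraction(sum(CODE[m] == CODE[codon] for m in mutants), 3)
    return total

def codon_differences(c1, c2):
    """(sd, ad) averaged over all n! orders of making the n changes."""
    diff = [i for i in range(3) if c1[i] != c2[i]]
    if not diff:
        return Fraction(0), Fraction(0)
    paths = list(permutations(diff))
    syn = non = 0
    for order in paths:
        cur = c1
        for pos in order:
            nxt = cur[:pos] + c2[pos] + cur[pos + 1:]
            if CODE[nxt] == CODE[cur]:
                syn += 1
            else:
                non += 1
            cur = nxt
    return Fraction(syn, len(paths)), Fraction(non, len(paths))

def jukes_cantor(d):
    return -0.75 * math.log(1 - 4 * d / 3)

def ka_ks(s1, s2):
    """Return (Ka, Ks) for two aligned, gap-free coding sequences."""
    codons1 = [s1[i:i + 3] for i in range(0, len(s1), 3)]
    codons2 = [s2[i:i + 3] for i in range(0, len(s2), 3)]
    r = len(codons1)
    S_sites = sum(synonymous_sites(c) for c in codons1 + codons2) / 2
    A_sites = 3 * r - S_sites
    Sd = Ad = Fraction(0)
    for c1, c2 in zip(codons1, codons2):
        s, a = codon_differences(c1, c2)
        Sd += s
        Ad += a
    return jukes_cantor(float(Ad / A_sites)), jukes_cantor(float(Sd / S_sites))
```

`synonymous_sites("TTA")` reproduces the value 2/3 from the Leucine example, and `codon_differences("TTT", "GTA")` gives (1/2, 3/2) as in the two-pathway example.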
Application
1. ORF finding
2. Nei-Gojobori to find high KA/KS ratios with sliding window plot. Smaller scale → more informative than KA/KS analysis over gene (averages over positive and negative selection)
Chapter 7 Phylogenetic Trees
Show evolutionary interrelationships among entities believed to have common ancestor (species).
• Node: taxonomic unit. Internal node: most recent common ancestor of the descendants; referred to as Hypothetical Taxonomic Units (HTUs) as they cannot be directly observed.
• Edge lengths: time estimates.
Phylogenetic analysis helps to answer (SARS Corona Virus):
• Kind of virus that caused original infection? Tree
• Source (host) of the infection? palm civet
• When, where did virus cross species border? When: d(pc,sample(t)) linear in t; Where: Tree, MDS [variation highest]
• Key mutations that enabled this switch?
• Trajectory of relationships/spread of virus? Tree
In case of e.g. horizontal gene transfer/recombination a phylogenetic network is more appropriate
Additive distances: distance over path from i → j is: d(i,j)
Three-point formula
Lx + Ly = dAB, Lx + Lz = dAC, Ly + Lz = dBC
Lx = (dAB+dAC-dBC)/2
Ly = (dAB+dBC-dAC)/2
Lz = (dAC+dBC-dAB)/2
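The three-point formula as a function, with a numeric check that the branch lengths reproduce the pairwise distances (the function name and example distances are illustrative assumptions):

```python
def three_point(d_ab, d_ac, d_bc):
    """Branch lengths (Lx, Ly, Lz) from leaves A, B, C to their common internal node."""
    lx = (d_ab + d_ac - d_bc) / 2
    ly = (d_ab + d_bc - d_ac) / 2
    lz = (d_ac + d_bc - d_ab) / 2
    return lx, ly, lz

lx, ly, lz = three_point(5, 6, 7)   # (2.0, 3.0, 4.0)
```

Here Lx + Ly = 5, Lx + Lz = 6 and Ly + Lz = 7, matching the input distances.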
Four-point formula
Four-point condition (when (1,2) and (i,j) are neighbors):
d(1,2) + d(i,j) < d(i,1) + d(2,j)
Minimize d(i,j) AND total distance in tree (∑all branch lengths)
M(i,j) = (n-2)d(i,j) – Ri – Rj with Ri = ∑j d(ti ,tj)
If i and j are neighbours: M(i,j) < M(i,k) for all k not equal to j
NJ (neighbor join) algorithm
Input:
nxn distance matrix D (JC-corrected) and an outgroup
Output:
rooted phylogenetic tree T
1. Compute new table M using D – select smallest value of M to select two taxa to join
2. Join the two taxa ti and tj to a new vertex V - use 3-point formula to calculate the updated distance matrix D’ where ti and tj are replaced by V.
3. Compute branch lengths from tk to V using 3-point formula, T(V,1) = ti and T(V,2) = tj and TD(ti) = L(ti,V) and TD(tj) = L(tj,V).
4. The distance matrix D’ now contains n – 1 taxa. If there are more than 2 taxa left go to step 1. If two taxa are left join them by a branch of length d(ti,tj).
5. Define the root node as the branch connecting the outgroup to the rest of the tree. (Alternatively,
determine the so-called “mid-point” or info on relative rates of divergence)
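One join step of the algorithm (steps 1 to 3) can be sketched as below. The branch-length expression is the three-point formula applied to ti, tj and the average of the remaining taxa; the function name and the convention of placing the new vertex V first in D' are assumptions of this example.

```python
def nj_join_step(D):
    """Pick the pair minimising M, return ((i, j), (L(ti,V), L(tj,V)), D')."""
    n = len(D)
    R = [sum(row) for row in D]
    # Step 1: M(i,j) = (n-2) d(i,j) - Ri - Rj, take the smallest entry.
    i, j = min(((a, b) for a in range(n) for b in range(a + 1, n)),
               key=lambda ab: (n - 2) * D[ab[0]][ab[1]] - R[ab[0]] - R[ab[1]])
    # Step 3: branch lengths from ti and tj to the new vertex V.
    li = D[i][j] / 2 + (R[i] - R[j]) / (2 * (n - 2))
    lj = D[i][j] - li
    # Step 2: distances from V to every remaining taxon (three-point formula).
    others = [k for k in range(n) if k not in (i, j)]
    new_row = [(D[i][k] + D[j][k] - D[i][j]) / 2 for k in others]
    D_new = [[0.0] + new_row]
    for idx, k in enumerate(others):
        D_new.append([new_row[idx]] + [D[k][m] for m in others])
    return (i, j), (li, lj), D_new

D = [[0, 5, 9, 9, 8],
     [5, 0, 10, 10, 9],
     [9, 10, 0, 8, 7],
     [9, 10, 8, 0, 3],
     [8, 9, 7, 3, 0]]
pair, lengths, D_new = nj_join_step(D)
```

Repeating the step on `D_new` until two taxa remain yields the unrooted tree; rooting with the outgroup is left out of this sketch.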
UPGMA (Unweighted Pair Group Method with Arithmetic mean)
Ultrametricity condition valid for real tree, but in practice generates erroneous trees (noise).
Can use D instead of M.
Chapter 8 Whole genome comparisons
Looks at the differences between the entire set of genes between two organisms
• Intracellular symbionts have become entirely dependent on the host to provide them with nutrients, oxygen, specific proteins they previously had to synthesize themselves
• In the process they have lost many genes necessary to produce such products themselves
• As a result, intracellular obligate symbionts have the smallest genomes, both in total size and in number of genes, which makes them a perfect case study for whole genome comparisons
• High conservation of the order of the genes and virtually no horizontal gene transfer since:
o Few genes → change likely to be lethal
o Cloistered lifestyle of endo-symbiotic organisms shields them from viruses and other bacteria that may induce gene rearrangement
Gene/genome duplication is basis for new functions as extra genes are free to evolve
Gene loss (redundancy, changed environment)
Genes (genome content): comparison of genome as comparison of individual genes
1. Find which genes are present in both: use ORF-finder with threshold of 100 codons
2. Fill out a matrix with alignment scores between each possible pair of sequences
3. Use Needleman-Wunsch or BLAST to compute similarity scores (normalize by length)
4. Identifying gene families
Only consider genes > 50% similar; ‘closely’ related and probably have similar ‘function’
o Clustering method for finding ‘similar’ genes: (hierarchical) clustering (NJ- or UPGMA algorithm)
  • Draw-back: all clustering methods have some degree of arbitrariness
  • Cluster both genomes simultaneously, then count #genes in each cluster (=gene family)
  • Chlamydia: large # of small gene families, small # of large families: CT CP function
o Phylogenetic tree (orthologs appear as siblings / host & symbiont subtree in common)
  • Ortholog genes are separated by a speciation event, so phylogenetic tree is useful metaphor
  • Phylogenetic tree is better representation, but less amenable to automated analysis
Identifying orthologous and paralogous genes
• Result of evolution of homologs (dupl/del in one but not in other): m:n relationship
• Best Reciprocal similarity Hits (BRHs)
o possible: ORFs without BRH (lost from other species)
o possible: ORF with ortholog in other species and a paralog in the same species
o example: paralog more similar than ortholog → duplication after species split
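Best reciprocal hits can be read off a similarity matrix. A minimal sketch, assuming `score[i][j]` holds the (length-normalized) similarity between ORF i of genome 1 and ORF j of genome 2; the function name and the example scores are made up for illustration.

```python
def best_reciprocal_hits(score):
    """Return pairs (i, j) where j is the best hit of i and i is the best hit of j."""
    n_rows, n_cols = len(score), len(score[0])
    best_for_row = [max(range(n_cols), key=lambda j: score[i][j]) for i in range(n_rows)]
    best_for_col = [max(range(n_rows), key=lambda i: score[i][j]) for j in range(n_cols)]
    return [(i, j) for i, j in enumerate(best_for_row) if best_for_col[j] == i]

score = [[0.9, 0.2, 0.1],
         [0.8, 0.3, 0.7],
         [0.1, 0.6, 0.2]]
brh = best_reciprocal_hits(score)   # [(0, 0), (2, 1)]
```

ORF 1 of genome 1 has no BRH here: its best hit (ORF 0 of genome 2) prefers ORF 0, which is the "ORF without BRH" case listed above.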
Genes that cooperate tend to move close together
• Because of inversions, duplications, transpositions (translocations), deletions, chromosomal rearrangements an alignment of entire genomes will not work
• → multiple single-gene analysis (Genome = beanbag of genes + junk-DNA)
Similarity between pairs of genes informs about:
• The differences in their genomes tell us something about the function of their retained genes
• Evolution:
o Blocks of conserved gene order (orthologs: estimate #substitutions → #time)
o Changes in size of gene families
o Nucleotide substitutions between orthologous genes
Major mechanisms of reshuffling of synteny are inversions and transpositions
Noise on synteny is caused by insertions, duplications, and deletions
Chromosomes (gene position)
Visualising Synteny
Dot-plot: x: position on genome_1, y: position on genome_2, dot for homologous gene
Synteny allows for:
- Phylogenetic Footprinting (identification of homologous intergenic regions)
Use syntenic coding regions as anchors to find (short) intergenic (=non-coding) regions (not selected
for → fast evolution) that are highly conserved (may be RNA-coding or regulatory) → use to
compute mutation rate
- Annotation of non-coding sequences
- Comparing lost and gained genes
Sorting by reversals
Minimum number of inversions to transform one genome into the other
A metric for the syntenic distance
Number of genomic rearrangements that separate the two species.
METRIC = smallest number of operations (=inversion or transposition) that transform one genome into the
other.
Given a permutation of N numbers find the shortest series of reversals that can sort them back into their original order (can solve overlapping reversals)
1. designate one sequence as the standard s and the other as t
2. i=1, increase(i) until s(i) ≠ t(i) or i=length(t)
3. j=i; increase(j) until t(j) = s(i), reverse(t(i:j))
4. i=i+1; if i=length(t), stop, else goto 2
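The greedy procedure above can be sketched as follows (0-based indices; the function name is an assumption). Note that this greedy scheme always sorts the permutation but only gives an upper bound on the minimum number of reversals.

```python
def greedy_reversal_sort(s, t):
    """Sort t into the order of s by reversals; return the reversals as (i, j) pairs."""
    t = list(t)
    reversals = []
    for i in range(len(s)):
        if t[i] != s[i]:
            j = t.index(s[i], i)               # where the wanted element currently sits
            t[i:j + 1] = reversed(t[i:j + 1])  # reversal brings it to position i
            reversals.append((i, j))
    return reversals

steps = greedy_reversal_sort([1, 2, 3, 4, 5], [1, 4, 3, 2, 5])   # [(1, 3)]
```

Signed permutations (gene orientation) are not handled in this sketch.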
Chapter 9 Clustering gene expression profiles: time series of expression levels
fermentation yeast: diauxic shift (sugar supply exhausted) respiration
C6H12O6 + H2O  CO2 + C2H6
2C2H6 + 7O2  4CO2 + 6H2O
C2H6 originally perhaps as near-toxic protection new pathways formed & old shut off
Microarrays
• Gene as on-off switch; RNA and proteins as messengers between genes
• Purpose: snap-shot of the expression level of every gene in the cell
• Measure concentrations of mRNA and reverse-compute DNA belonging to this mRNA.
• As the RNA is spliced (introns removed), the backward computed DNA is not entirely equal to the real DNA: it is called cDNA: complementary DNA.
• cDNA hints to an expressed gene. cDNA is stored as an EST: Expressed Sequence Tag.
• EST sequencing can identify genes that are ‘missed’ with ab initio gene-finding methods
Microarray technologies (visualize hybridization of fluorescent molecules inserted on DNA):
- cDNA arrays:
o No prior knowledge of gene sequence needed (capture all transcripts expressed in cell (reverse transcriptase); sequence DNA in interesting spot later)
o Different fluorescent dyes (Cy3,5=g,r) → relative changes in expression because #transcripts differs → #cDNA differs over sequences (normal/tumor)
o Cy3,5 differ in size & rate of decay → dye swap → average ratios
- Oligonucleotide arrays (one/two-dye)
o Know sequences a priori (synthesize oligonucleotide e.g. based on EST)
o Few probes per gene; pick oligonucleotides specific to individual genes
o Paralogs: longer probes more likely to complement similar sequences
o Mismatch probes as correction for non-specific hybridization
o One-dye: average expression levels of “independent” (overlapping) probes for single gene (more 3’ → higher fluorescence)
Reference design (compare to t=0)
Explicitly test for effects of condition / mutation (cancerous/non-cancerous, effect of drugs on cellular
function) on gene expression (significance: replication)
Reconstruct the gene regulatory networks
Functional annotation
Fold change: relative change in activity f=valuenew/valueold; fold-change = f > 1 ? f : –1/f
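The fold-change rule as a one-line function (the name is an assumption; expression values are taken to be strictly positive):

```python
def fold_change(value_old, value_new):
    """f = new/old; report f if f > 1, otherwise -1/f (symmetric scale for down-regulation)."""
    f = value_new / value_old
    return f if f > 1 else -1 / f

up = fold_change(1.0, 4.0)     # 4.0: four-fold up
down = fold_change(4.0, 1.0)   # -4.0: four-fold down
```

With this convention an unchanged gene (f = 1) maps to -1, and no value falls strictly between -1 and 1.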
Clustering:
o Distance measure:  (don’t take into account similarity in magnitude)
o C = ∑ij in same cluster dij - ∑ij in different cluster dij
o Hierarchical clustering: can try different k without re-computing distances
Neighbouring genes often expressed similarly
Data Visualisation
• In a tree using Hierarchic clustering
• In a plane using MDS
Genes with similar expression profiles have similar functions.
Pre-processing
• Select only genes with ‘enough’ fold-change: abs(fold-change) > threshold
• Delete missing values
Chapter 10 Identification of regulatory sequences







Internal clock synchronizes functions (metabolism, activity/awareness level, body temp.)
Rather than moving, plants react to external stress by changing their internal condition
o Herbivore? → Chemical repellent! (e.g. nicotine)
o Falling temperature? → Anti-freeze proteins!
Plants that can ‘anticipate’ changes have a competitive advantage: photosynthesis
Plants have cell-autonomous circadian clock
Removing day-night stimulus (keep in constant light/dark): Mammals keep circadian clock running for
months
Eubacteria: rigid motifs at -10: TATAAT, at -35: TTGACA
Eukaryota: different RNA polymerase → different motifs;
o TATA-box (= TATAA[A/T]) at ~ -40
o Other docking sites at +/- -1000 up to - 250,000
Finding TFBS motifs is complex:
1. TFBS are very short and will therefore appear by chance alone
2. There is a high variability (ATAATC, ATAATT, ATACTC, …)
3. We don’t know the TFBS motif nor the TFBS location nor the length
Algorithmically finding motif by optimizing scoring function is computationally expensive
Heuristics: e.g. Gibbs sampling (randomized and greedy)
What is a significant result: compare the sequence with the background model: the chance based on the
current set that the motif occurs by pure chance
Where to look for a TFBS?
• Areas on the gene with high conservation
• Co-regulated genes (have same TF): look for shared motifs +/-1000 upstream
Focus on ungapped fixed sequence motif (assume no variation) with fixed length
Identifying motifs
A motif is interesting if unlikely under the background distribution: column in PSSM is more unbalanced.
Input:
PSSM = scoring from multiple alignment
Scoring function for imbalance: Kullback–Leibler divergence (KL divergence):
$S_{KL} = \sum_{\text{position } i} \sum_{\text{letter } k} p_i[k] \log \frac{p_i[k]}{q_i[k]}$
pi[k] is probability of observing symbol k at position i
qi[k] is multinomial background model for symbol k at i
To avoid zero entries and resulting divergences (log 0), a statistical trick is to add pseudocounts: add 1 at
each entry
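A small sketch of building a PSSM with pseudocounts and scoring it with the KL divergence. It assumes an ungapped alignment of equal-length sequences and a position-independent background; the function names are hypothetical.

```python
import math
from collections import Counter

def pssm_from_alignment(seqs, alphabet="ACGT", pseudocount=1):
    """Column-wise symbol probabilities, adding `pseudocount` to every entry."""
    total = len(seqs) + pseudocount * len(alphabet)
    pssm = []
    for i in range(len(seqs[0])):
        counts = Counter(s[i] for s in seqs)
        pssm.append({k: (counts[k] + pseudocount) / total for k in alphabet})
    return pssm

def kl_score(pssm, background):
    """S_KL = sum over positions i and letters k of p_i[k] * log(p_i[k] / q[k])."""
    return sum(p * math.log(p / background[k])
               for column in pssm for k, p in column.items())

background = {k: 0.25 for k in "ACGT"}
conserved = kl_score(pssm_from_alignment(["A", "A", "A", "A"]), background)
balanced = kl_score(pssm_from_alignment(["A", "C", "G", "T"]), background)
```

The fully conserved column scores above zero while the balanced column scores zero, which is the sense in which an "unbalanced" column is interesting. Thanks to the pseudocounts no probability is zero, so the logarithm is always defined.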
Finding high-scoring motifs given PSSM
Input:
Sequence s of length n > L
PSSM of length L
Output:
starting position (j with highest value)
most probable motif (argmax of L)
Slide the PSSM along the sequence and compute the (log) likelihood of the window starting at j:
$L(j) = \prod_{i=j}^{j+L-1} p_{i-j+1}[s[i]]$ (the L in the upper limit is the PSSM length)
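The sliding-window scan can be sketched as below, using log-likelihoods and 0-based positions. The PSSM for "TATA" with probability 0.7 on the consensus letter is a made-up example, as is the function name.

```python
import math

def best_motif_position(s, pssm):
    """Return (j, motif, log-likelihood) for the best-scoring window of s."""
    L = len(pssm)
    best_j, best_ll = None, -math.inf
    for j in range(len(s) - L + 1):
        ll = sum(math.log(pssm[i][s[j + i]]) for i in range(L))
        if ll > best_ll:
            best_j, best_ll = j, ll
    return best_j, s[best_j:best_j + L], best_ll

# Hypothetical PSSM favouring TATA.
pssm = [{b: 0.7 if b == c else 0.1 for b in "ACGT"} for c in "TATA"]
j, motif, ll = best_motif_position("GGCTATAGC", pssm)   # j = 3, motif = "TATA"
```

To judge significance the score would be compared with the background model, for example as a log-ratio per window.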

Gibbs sampling to avoid local optima
EM-algorithm for finding high-scoring motifs
0. Start with random location j and random PSSM
Iteration:
1. With fixed j optimize PSSM
2. With fixed PSSM optimize j
Until the result has converged
Finding the motif-length
Scree-plot: log-likelihood score relative to the background model vs motif length L
Biological validation
• Compare motif with standard TFBS databases like cisRED
• Perform biological experiments to test the hypothesis: Harmer et al. attached a fluorescent molecule complex to the TFBS and could thus with a scintillation counter
• Because not all genes are directly regulated by the first few TFs in the circadian regulatory cascade, the presence or absence of the EE helps to reveal the exact sequence of events that occur during circadian control.
• With the EE we can now look for other locations on the DNA with same or similar motifs.
Case study: the circadian rhythm
• Cluster the expression profiles and consider the clusters with appropriate periodicity: they are candidates for containing the EE (evening element)
• Look in this cluster for shared motifs upstream up to -1000
• Consider all words of length 9 whose frequency in the evening cluster is very different from their frequency in the rest of the data.
o Examine all words of fixed length 9 in both sequences seq1 and seq2 (considering also the reverse complement).
o Motifs found are scored and sorted in descending order by margin (the difference between their frequency in cluster 2 and that in cluster 1-3). The top 10 of 9-mers are computed and shown.
• Remove repeats (of 1- or 2-mers) from obtained set of motifs (no biological significance)
• Most significant EE element is AAAATATCT.
• We know from the study of Harmer et al. that it corresponds to the evening element (word of 9 bases found upstream of genes turned on in the evening). Its margin is 0.00014. We notice that 2 of the other 3 top motifs are simply variants of the evening element (AAATATCTT and AAAAATATC).
• To assess the significance of the value found for the margin of the evening element we perform 100 random splits of the data and measure the margin of the highest-scoring element. In 100 trials we never observe a margin larger than 0.000147462.
• We can look in detail at the frequency of the evening element among all the clock regulated genes:
Circadian time:   0    4    8    12   16   20
Number of genes:  78   45   124  67   30   93
EE count:         5    6    49   27   8    8
The arrays EEcount and Ngenes show that not all the genes of the second cluster have the evening element, nor is this motif limited only to these genes.
Allele: Gene variant: one of multiple possibilities for a nucleotide (due to polymorphism)
Bifurcating tree: Internal nodes have degree 3, external nodes degree 1, root degree 2
Base: Chemical component of a nucleotide (A, C, G, T)
Chromosome: Stretch of DNA
cis-regulatory DNA: Promoter: element (required by RNA polymerase) regulating transcriptional activity that is located on the same DNA molecule as the transcribed gene (trans-regulatory: molecules separate from the gene-containing DNA molecule), -locking
Cladistics: Hierarchical classification of species (e.g. based on morph. data)
Codon: Nucleotide-triplet = 3-nucleotide unit (used by every organism)
Consensus sequence: Most probable sequence
Degradase: Protein: cuts apart molecules no longer needed
Directionality: Nucleic acids are synthesized in 5' to 3' direction, as the polymerase used to assemble new strands attaches each new nucleotide to the 3' hydroxyl group via a phosphodiester bond. Single DNA & RNA strand sequences are written in 5' to 3' direction. Relative positions of structures along a strand of nucleic acid (incl. genes, transcription factors, polymerases) are noted as being upstream (towards the 5' end) or downstream (towards the 3' end).
DNA: Deoxyribonucleic acid: long polymer of nucleotides encoding the sequence of the amino acid residues in proteins using the genetic code: contains genetic instructions specifying biological development of all cellular forms of life (and most viruses)
Epitope: Part of macromolecule recognized by immune system (antibodies)
Exons: Part of gene that is transcribed and eventually specifies mRNA
Mutation, fixed: Remaining mutation when all lineages carrying alternative mutations have died off. Fixed mutations may never reach 100% frequency in the population, as further mutations at the same site may arise (all sharing a common ancestor which had the fixed mutation) ~ observed as differences between individuals
Genetic code: n:1 lookup table from codon to amino acid [~thereby DNA sequences to proteins] (identical in nearly all organisms: standard GC)
Genetic drift: Change in rel. freq. with which an allele occurs in a population that results from the fact that alleles in offspring are a random sample of those in parents (most fixed mutations are neutral).
Genetic pathway: Network of those genes that are connected by causal relations in their expressions
Genome: Complete genetic sequence on one set of chromosomes (one of the two sets that a diploid individual carries in every somatic cell)
Genomics: Study of organism's genome and the use of the genes
Hybridization: Chemical binding
Homeobox: DNA sequence found within genes that are involved in the regulation of development (morphogenesis) of animals, fungi and plants
Homologs: Different versions of the same gene (genes that have a common ancestor); induced from sequence similarity
Horiz. gene transfer: Translocation between two organisms
Intergenic regions: Non-protein coding regions of the genome
Intracellular symbionts / obligate endo-symbiont: Symbionts which have moved permanently into the cells of the host
Introns: Transcribed, but not translated sequences in eukaryotic genes (spliced out before travels to ribosome)
Ligase: Protein: joins molecules together
Linkage: Single linkage: dxy = min i,j ||x[i] – y[j]||. Average linkage: dxy = mean i,j ||x[i] – y[j]||. Centroid distance: dAB = ||mA – mB||
Mutation: A nucleotide at a certain location is replaced by another nucleotide
Nucleotides: Molecules distinct from each other in one base
Nucleotide mutation: Base change s.t. mutant and wild-type forms coexist in a population
Nucleotide subst.: Base change between two populations (nucleotide mutation only becomes nucleotide substitution when most recent common ancestor of entire population carried that mutation)
Oligonucleotide: Short DNA sequence
ORF: Open reading frame: portion of organism's genome which contains a sequence of bases that could potentially encode a protein: ATG [...]* (TGA)|(TAA)|(TAG) – end
Orthologous genes: Genes found in separate species deriving from same parental sequence (homologous genes in different organisms)
Paralogous genes: Homologous genes in 1 organism/genome deriving from gene duplication and subsequent specialization
Parsimony: Less = better: concept used in arriving at a hypothesis/course of action (Latin: parcere)
Phylogenetics: Study of evolutionary relatedness among various groups of organisms (e.g. species, populations)
Polymorphism: Multiple possibilities for a nucleotide: allele
Polyploid individual: Duplicated genomes
Prions: Example of auto-replicating proteins
Promoter: Region on the DNA just before (=upstream) the gene that indicates where the transcription starts
Proteomics: Large-scale study of proteins (particularly structures and functions)
Pseudogenes: Vestiges of genes that once worked but were wrecked by mutations
PSSM (=profile): Position Specific Scoring Matrix; multinomial model of sequence depending on position on sequence P[position,symbol], symbol = {C,T,G,A} or 20 AA
Reading frame: Non-overlapping decomposition of DNA into codons (3 possibilities per strand * 2 strands)
Retrovirus: Enveloped virus possessing RNA genome. Replicates via DNA intermediate. Relies on reverse transcriptase to reverse-transcribe its genome from RNA into DNA, which can then be integrated into the host's genome with an integrase enzyme.
Saturation: On average one substitution per site
Sequence space: Space of all sequences (up to a certain length)
SNP (“snip”): Single Nucleotide Polymorphism (point mutation); DNA fingerprinting
STR: Short Tandem Repeats (microsatellites)
Symbionts: Organisms that live together in a beneficial relation
Synteny: syn- = together, tenia = ribbon, band; the relative ordering of genes on the same chromosomes
Synteny, blocks of: Long DNA stretches where rel. ordering of orth. genes is conserved
Taxa: Units under comparison
Transcription: Process through which DNA seq. is enzymatically copied by RNA polymerase producing complementary RNA (Thymine → Uracil)
Transcription factor: Protein that regulates transcription. TFs regulate binding of RNA polymerase and initiation of transcription. A TF binds upstream or downstream to either enhance or repress transcription of a gene by assisting or blocking RNA polymerase binding
Transcription Factor Binding Site: The location on the DNA molecule where a TF can physically attach; has specific sequence of nucleotides for the TF to attach (motif) e.g. RNA polymerase BS; can be multiple per gene
Ultrametricity: Distance from the root to all leaves of the tree is equal