Chapter 1 Sequence statistics
4-letter nucleotide alphabet N = {A,C,G,T}
Sequencing projects: s = s1s2…sn
Identify main structures & predict biological function

Multinomial sequence model
• Nucleotides are independent (no correlations among neighbouring nucleotides) and identically distributed (i.i.d.)
• Stationarity (same symbol frequencies in various regions)
• p = {pA, pC, pG, pT}, pA + pC + pG + pT = 1
$P(s) = \prod_{i=1}^{n} p(s(i))$

Markov sequence model
• Probability of start state π
• State transition matrix T
• Models local correlations
• Order? → model selection (hypothesis testing)
$P(s) = \pi(s_1) \prod_{i=2}^{n} p(s(i-1), s(i))$
Multinomial is a special case

Base composition (amount, frequency) (CG and AT close but different)
sliding window plot # essential genes from chromosomal mean recombination rate

GC content (% or SW)
• Detect foreign genetic material (horizontal gene transfer)
• Change point analysis
• AT denatures (= splits) at lower temperatures
• Thermophilic Archaebacteria: high CG
• Evolution: Archaea > Eubacteria > Eukaryotes

Finding unusual/most frequent DNA words (dimers, trimers, k-mers)
• Frequent words
  o Repetitive elements
  o Sequences with biological functions (e.g. gene regulatory features)
• Rare motifs
  o Binding sites for transcription factors
  o Undesirable structural properties (CTAG: kinking)
  o Internal immune system (restriction sites)
Word’s observed/expected ratio: the expected count under the multinomial model takes the relative proportions pA, pC, pG, pT into account
Biological relevance of unusual motifs:
• Mutational
• Selective
Pattern matching versus pattern discovery

Genomics & Proteomics
Proteomics is more complicated than genomics. The genome is a rather constant entity. The proteome differs from cell to cell and changes constantly through biochemical interactions with the genome and the environment. Protein expression differs between parts of the same organism’s body, between stages of the life cycle and between environmental conditions.
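Returning to the word’s observed/expected ratio from Chapter 1: the sketch below counts every k-mer in a sequence and divides by the count expected under the multinomial model, where the expected count is the number of window positions times the product of the base frequencies. The function name `observed_expected_ratio` and the choice to estimate pA, pC, pG, pT from the sequence itself are assumptions made for illustration, not something the notes prescribe.

```python
from collections import Counter
from math import prod


def observed_expected_ratio(seq, k):
    """Return {word: observed / expected} for every k-mer occurring in seq.

    Expected counts follow the multinomial (i.i.d.) model, with base
    frequencies estimated from seq itself (an assumption of this sketch).
    """
    n = len(seq)
    base_freq = {b: c / n for b, c in Counter(seq).items()}  # pA, pC, pG, pT
    windows = n - k + 1                                      # number of k-mer positions
    observed = Counter(seq[i:i + k] for i in range(windows))
    ratios = {}
    for word, count in observed.items():
        expected = windows * prod(base_freq[b] for b in word)
        ratios[word] = count / expected
    return ratios


ratios = observed_expected_ratio("ACAC", 2)
```

A ratio well above 1 flags an over-represented word, a ratio well below 1 an under-represented one. Whether a deviation is significant still needs a null distribution, for instance from randomized sequences.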
Protein Structure = protein function
Primary: AA sequence, Secondary: Alpha-helix/Beta-sheet, Super-secondary
Transcription proofreading mechanisms are fewer and less effective than those of DNA replication.
Transcription & DNA replication in 5' → 3' direction (old polymer read 3' → 5'; new, complementary fragments generated 5' → 3')
Proteins with no genes coding them: P1 + P2 → P3
There exist genes that are always on. For regulation the produced enzymes are cut by other enzymes (inefficient)

Chapter 2 Gene Finding
Number of genes hard to estimate because of:
• Pseudogenes
• DNA segments that seemed parts of separate genes but are part of the same gene
64 possible triplets (for 20 AAs)
Only one codon for methionine, so ATG (initiation codon) specifies start of reading frame.
Stop (termination) codons: TGA, TAA, TAG (no associated amino acid)
Mutations (due to biotic / abiotic factors):
- Change one nucleotide to another (often negligible effect on protein function): 1 or 0 different amino acids
- Indels (e.g. frame-shift mutation: wrong reading frame)
Gene-finding methods:
- Ab initio: based on statistical properties; basic ones suffice for prokaryotes, but insufficient for eukaryotic nuclear genomes (find candidate genes)
- Markov sequence model
- Homology-based: comparing sequences (with known / unknown genes / genomes)
Because prokaryotic genes do not contain introns, gene finding can be simplified into a search for ORFs.

ORF-finder (does not identify short genes)
Given a DNA sequence s and a positive integer k
For each possible reading frame (3 per strand, also for the reverse complement of s → 6 in total):
  decompose the sequence into triplets
  find all stretches of triplets starting with a START-codon and ending with a STOP-codon
Output all ORFs longer than or equal to the prefixed threshold k

Finding k
A pattern in DNA can arise from pure chance, so an ORF needs a minimum length to be called a gene. How unlikely does the ORF have to be (under the background model) for us to accept it?
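The ORF-finder procedure above can be sketched as follows. This is a minimal illustration under stated assumptions: the names `find_orfs` and `reverse_complement` are hypothetical, ORF length is counted in codons including the start and stop codon, and only the first ATG after the previous stop opens an ORF.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(s):
    """Complement every base, then reverse the strand."""
    return s.translate(COMPLEMENT)[::-1]


def find_orfs(s, k):
    """Return (strand, frame, orf) for every ORF of at least k codons.

    All 6 reading frames are scanned: 3 on s and 3 on its reverse complement.
    Assumption of this sketch: the length counts the start and stop codon.
    """
    orfs = []
    for strand, seq in (("+", s), ("-", reverse_complement(s))):
        for frame in range(3):
            # Decompose the sequence into non-overlapping triplets.
            codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
            start = None
            for idx, codon in enumerate(codons):
                if codon == "ATG" and start is None:
                    start = idx                      # open an ORF at the first ATG
                elif codon in STOP_CODONS and start is not None:
                    if idx - start + 1 >= k:         # keep ORFs of >= k codons
                        orfs.append((strand, frame, "".join(codons[start:idx + 1])))
                    start = None                     # close the ORF at the stop codon
    return orfs


print(find_orfs("CCATGAAATTTTAGCC", 4))  # → [('+', 2, 'ATGAAATTTTAG')]
```

Raising k removes short chance ORFs at the cost of missing short genes, which is why the choice of threshold matters.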
Tail of ORF-length distribution in original DNA is longer than in randomized DNA
H0 = randomized DNA seq.: keep ORFs in original DNA that are longer than (most) ORFs in randomized DNA
(α, z) = (0.01, 2.33), (0.05, 1.65)
α-significance represents the false positive rate of one single test
type I-error (FP = False Positive) of H0
type II-error (FN = False Negative) of H0

Computing a p-value for ORFs
• $P_{stop} = P(TAA) + P(TAG) + P(TGA)$
• $P(k \text{ non-stop codons}) = (1 - P_{stop})^k$
• For a significance level α we need k* codons with: $(1 - P_{stop})^{k^*} = \alpha$
• A priori probability of a non-stop codon = 61/64
• 95%-significance: $(61/64)^k \approx 0.05$ → floor(k) ≈ 62 → 64 codons (add start & stop)

Randomization tests
• Generate a string with the same statistical properties as the original data
• Permutation
• Bootstrapping: sampling with replacement
• Per nucleotide? per triplet? per … ?
• p-value: rank of observed test statistic in null distribution: significant if its percentile < α

Problems with multiple testing
• If we conduct 100 tests at α = 0.05, 5 false positives are expected
• Finding 5 significant genes out of 100 with α = 0.05 does not mean anything biologically

Chapter 3 Sequence alignment
Genome annotation:
- Find candidate genes
- Assign protein function
Uses of sequence alignment:
- Prediction of function of similar genes (no experimental evidence)
- Database searching for homologs (func. known or not; requires approx. methods)
- Gene finding (too short for ab-initio / stronger evidence than ORF-finder)
- Sequence divergence (variation within populations and between species)
- Evolution of the sequences from a common ancestor: mismatches in the alignment correspond to mutations, and gaps correspond to insertions or deletions
- Sequence assembly / can reconstruct missing parts since % same
Protein domains are the most important part of the molecule, so they evolve slowly (few mutations allowed) and are easier to align.

Blast (pairwise heuristic local alignment)
Inputs: length l, threshold theta
1.
Find all l-length words of database sequences that align with words from the query with alignment score > theta (hotspots = hits)
2. Extend each hit to find if it is contained within a larger segment pair with score > theta
Global alignment deprecated (closely related sequences are also identified by local alignment)
Local alignment has the advantage that related regions which appear in a different order in the two proteins (known as domain shuffling) can be identified as being related
Multiple sequence alignment is NP-hard. Smith-Waterman: O(mn)

Statistical analysis of alignments
Generate randomized sequences based on sequence 2
Determine the optimal alignments of sequence 1 with these randomized sequences
Compute a histogram and rank the observed score in this histogram
The relative position defines the p-value

Chapter 4 Hidden Markov Models
Odorant Receptors (ORs) sense certain molecules outside the cell and signal inside the cell. ORs contain 7 transmembrane domains. Hidden states: in, out.
Lambda-phage states: AT rich, CG rich.
HMM Applications: DNA-segmentation

VITERBI & FORWARD Dynamic Programming algorithm
Given a sequence s of length n and an HMM with params (T,E):
1. Create table V of size |H|x(n+1);
2. Initialize i=0; V(0,0)=1; V(k,0)=0 for k>0;
3. For i=1:n, compute each entry using the recursive relation:
   VITERBI: $V(j,i) = E(j,s(i)) \cdot \max_k \{V(k,i-1)\, T(k,j)\}$
            $pointer(i,j) = \arg\max_k \{V(k,i-1)\, T(k,j)\}$
   FORWARD: $F(j,i) = E(j,s(i)) \cdot \sum_k \{F(k,i-1)\, T(k,j)\}$
4. VITERBI OUTPUT: $P(s,h^*) = \max_k \{V(k,n)\}$
   FORWARD OUTPUT: $P(s) = \sum_k \{F(k,n)\}$, where $P(s) = \sum_{h^j \in H^n} P(s, h^j) = \sum_{h^j \in H^n} P(s \mid h^j)\, P(h^j)$
5. VITERBI trace-back: start from $h^*_n = \arg\max_k \{V(k,n)\}$, then for i=n:1 use $h^*_{i-1} = pointer(i, h^*_i)$
6. OUTPUT: $h^* = \arg\max_{h \in H^n} P(s,h)$

The EM (Expectation Maximization) algorithm
Given a sequence s and an HMM with unknown (T,E):
1. Initialize h, E and T;
2. Given s and h, estimate E and T just by counting the symbols;
3. Given s, E and T, estimate h, e.g. with the Viterbi algorithm;
4.
Repeat steps 2 and 3 until some criterion is met.
HMMs: better at detecting genes than sequence alignment; can detect introns and exons
Downside: computationally much more demanding!

Profile hidden Markov models (pHMM)
Characterize sets of homologous genes and proteins based on common patterns in their sequence.
pHMMs summarize the salient features of a protein alignment in one single model
pHMMs can also be used to produce multiple alignments
Alternative to:
- multiple alignments of all elements in the family
- Position Specific Scoring Matrices (PSSM) (cannot handle variable lengths or gaps)

Viterbi Exon-Intron Locator (VEIL)
• Gene finder with a modular structure
• Uses an HMM which is made up of sub-HMMs, each describing a different bit of the sequence: upstream noncoding DNA, exon, intron, …
• Assumes test data starts and ends with noncoding DNA and contains exactly one gene
• Uses biological knowledge to “hardwire” part of the HMM, e.g. start + stop codons, splice sites

Chapter 5 Variation within and between species (Jukes-Cantor & Kimura)
α: Substitution probability per site per second
μ: Mutation rate
ρ: Substitution rate: rate at which a species fixes new mutations
N: Population size
Mutations originate in single individuals, accumulate on germ line (distance relation)
Mitoch. DNA: inherited only via maternal line → not reshuffled → suited for comparing evolution
Polymorphism due to SNP & STR
For neutral mutations ρ = K/(2T) = 2Nμ * 1/(2N) = μ (diploid genome: *2)
Substitutions are independent (?), random & symmetric (prob(A→T) = prob(T→A)) (no!)
Sequence evolution is a Markov process: s(t) depends only on s(t-1)
Back-mutations mutate a nucleotide back to an earlier value (K ≥ d)

Jukes-Cantor
$M_{JC}(i,i) = 1-\alpha$, $M_{JC}(i,j) = \alpha/3$ for $i \neq j$, $M(t) = M_{JC}^t$
$\lambda_1 = 1$, $v_1 = \tfrac{1}{2}(1, 1, 1, 1)^T$
$\lambda_{2..4} = 1 - 4\alpha/3$, $v_2 = \tfrac{1}{2}(-1, -1, 1, 1)^T$, $v_3 = \tfrac{1}{2}(1, -1, -1, 1)^T$, $v_4 = \tfrac{1}{2}(1, -1, 1, -1)^T$
$M_{JC}^t = \sum_i \lambda_i^t v_i v_i^T$
$M_{JC}^t(i,i) = r(t)$, $M_{JC}^t(i,j) = s(t) = \tfrac{1}{4} - \tfrac{1}{4}(1 - 4\alpha/3)^t$
$d = 3s(t) = 1 - r(t) = \tfrac{3}{4} - \tfrac{3}{4}(1 - 4\alpha/3)^t$
For small α: $t \approx -\frac{3}{4\alpha} \ln\left(1 - \tfrac{4}{3} d\right)$
d: Observed distance (proportion of differences)
$K = \alpha t \approx -\tfrac{3}{4} \ln\left(1 - \tfrac{4}{3} d\right)$: Actual distance (proportion)
For small d (ln(1+x) ≈ x): K ≈ d, actual distance ≈ observed distance
For saturation (d ↑ ¾): K → ∞; d → random sequence-distance → K indeterminate

Variance of K
If $K = f(d)$ then $\delta K = \frac{\partial K}{\partial d}\, \delta d$ and $(\delta K)^2 = \left(\frac{\partial K}{\partial d}\right)^2 (\delta d)^2$
So $Var(K) = \left(\frac{\partial K}{\partial d}\right)^2 Var(d)$
Generation of a sequence of length n with substitution rate d is a binomial process:
$Prob(k) = \binom{n}{k} d^k (1-d)^{n-k}$
and therefore with variance: $Var(d) = d(1-d)/n$
Because of the Jukes-Cantor formula: $\frac{\partial K}{\partial d} = \frac{1}{1 - \tfrac{4}{3} d}$
$Var(K) \approx \frac{d(1-d)}{n\left(1 - \tfrac{4}{3} d\right)^2}$

Kimura
α: Transition probability (G↔A (purines) and T↔C (pyrimidines)) per site per second
β: Transversion probability (G↔T, G↔C, A↔T, and A↔C) per site per second
P(t): Fraction of transitions per site after t generations
Q(t): Fraction of transversions per site after t generations
$K \approx -\tfrac{1}{2} \ln(1 - 2P - Q) - \tfrac{1}{4} \ln(1 - 2Q)$, d = P + Q

Validation: generate K* mutations, count d, estimate K(d) with JC, plot K(d) - K*
Adaptations: transition/transversion types, amino-acid substitution matrices

Chapter 6 Natural selection at the molecular basis (selection pressure)
HIV recognizes helper T-cells of the human immune system. Infected T-cells have viral proteins sticking out that can be recognized by the immune system (evolutionary arms race).
Fast evolution (can be visualized with phylogenetic tree) because of:
- Short reproduction span: 1.5 days to reproduce
- RNA: high error rate
How to measure whether mutations are neutral, deleterious, or advantageous?
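The Jukes-Cantor correction from Chapter 5, which Step 3 of the Nei-Gojobori algorithm in this chapter also relies on, can be sketched as follows. The function name `jukes_cantor` and the choice to raise an error at saturation are assumptions of this sketch. The formulas are K = -¾ ln(1 - 4/3 d) and Var(K) ≈ d(1-d) / (n (1 - 4/3 d)²).

```python
from math import log


def jukes_cantor(s1, s2):
    """Return (d, K, Var(K)) for two aligned, gap-free sequences of equal length.

    d is the observed proportion of differing sites; K is the
    Jukes-Cantor corrected (actual) distance.
    """
    if len(s1) != len(s2):
        raise ValueError("sequences must be aligned and of equal length")
    n = len(s1)
    d = sum(a != b for a, b in zip(s1, s2)) / n
    if d >= 0.75:
        # Saturation: the sequences look random, so K is indeterminate.
        raise ValueError("d >= 3/4: K is indeterminate")
    k = -0.75 * log(1 - 4 * d / 3)
    var_k = d * (1 - d) / (n * (1 - 4 * d / 3) ** 2)
    return d, k, var_k


d, k, var_k = jukes_cantor("ACGTACGTAC", "ACGTACGTAA")
```

For small d the corrected distance K stays close to d. As d approaches ¾ both K and its variance grow without bound, which is the saturation effect noted in Chapter 5.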
Experimentally difficult: short-lived organisms & large populations (typically virus)
Alternative: count the number of mutations that can change the protein and those that don’t

Synonymous mutation: new codon translates to the same amino acid
Non-synonymous mutation: new codon translates to a different amino acid
KA: # non-synonymous substitutions per non-synonymous site
KS: # synonymous substitutions per synonymous site
f0: fraction of non-synonymous mutations that are neutral
v: mutation rate
α: fraction of non-synonymous mutations that are advantageous

A priori many more non-synonymous mutations are possible than synonymous ones
Reasoning:
- Advantageous mutations are very rare (most non-neutral mutations detrimental)
- Deleterious mutations will ‘not’ spread through a population
- Therefore, most mutations are neutral
Strong negative selection → few non-synonymous substitutions
KA / KS tells us about the strength and form of the natural selection

Without advantageous mutations:
KA = v f0 t (# non-synonymous mutations after time t)
KS = v t (# synonymous mutations after time t)
KA / KS = f0
f0 < 1: strong negative selection
KA / KS > 1: evidence for advantageous non-synonymous mutations

After time t (averaged over the gene!), with all synonymous mutations neutral:
KA = v(f0 + α)t
KA / KS = f0 + α
KA / KS < 1: negative selection dominates
KA / KS > 1: positive selection dominates

Nei-Gojobori Algorithm
Linear in the size of the sequences
Assumptions:
- The rate of transitions and transversions is the same
- There is no bias towards codon usage (i.e. no information on the ensuing protein)
s1, s2: aligned, homologous sequences without gaps (excluding stop-codon)
r: (length of s1) = (length of s2) in codons
ck: kth codon pair, e.g.
(TTA, TTT)
fi(c): fraction of changes at ith position of codon c resulting in a synonymous change (i=1,2,3)
sc/[d](ck): # synonymous sites/[differences] in kth codon [pair]; sc(ck) = ∑i fi(ck)
ac/[d](ck): # non-synonymous sites/[differences] in kth codon [pair]; ac(ck) = 3 - sc(ck) = 3 - ∑i fi(ck)
Sc/d: # synonymous sites/differences between s1 and s2
Ac/d: # non-synonymous sites/differences between s1 and s2
Sd + Ad: # differences between the two sequences
ds/a: synonymous/non-synonymous distance

STEP 1: Count A and S sites
Compute sc1(ck) and sc2(ck), k = 1, …, r
Compute sd(ck) and ad(ck), k = 1, …, r
Sc[1,2] = ∑k=1:r sc[1,2](ck)
Ac[1,2] = 3r - Sc[1,2]
Ŝc = ½(Sc1 + Sc2)
Âc = ½(Ac1 + Ac2)
Example: codon TTA codes for Leucine; 6 synonyms for Leucine: TTA CTA CTG CTC CTT TTG
f1: ATA(-), GTA(-), CTA(+) from 3 changes, so: 1/3
f2: TAA(-), TGA(-), TCA(-) from 3 changes, so: 0/3
f3: TTG(+), TTC(-), TTT(-) from 3 changes, so: 1/3
So: sc(ck) = ∑ fi = 2/3, ac(ck) = 3 - sc(ck) = 7/3

STEP 2: Count A and S differences
Sd = ∑k=1:r sd(ck)
Ad = ∑k=1:r ad(ck)
n differences → n! pathways (n=0,1,2,3)
n=1: sd((GTT, GTA)) = 1, ad((GTT, GTA)) = 0
n=2: pathway 1: TTT (Phe) ↔ GTT (Val) ↔ GTA (Val) (1 synonymous, 1 non-synonymous)
     pathway 2: TTT (Phe) ↔ TTA (Leu) ↔ GTA (Val) (0 synonymous, 2 non-synonymous)
     sd((TTT, GTA)) = (1+0)/2 = 0.5, ad((TTT, GTA)) = (1+2)/2 = 1.5

STEP 3: Compute KA and KS
ds = Sd/Ŝc, da = Ad/Âc
Ks = -¾ ln(1 - 4/3 ds), Ka = -¾ ln(1 - 4/3 da) //Jukes-Cantor

Application
1. ORF finding
2. Nei-Gojobori to find high KA/KS ratios with a sliding window plot. A smaller scale is more informative than a KA/KS analysis over the whole gene (which averages over positive and negative selection)

Chapter 7 Phylogenetic Trees
Show evolutionary interrelations among entities believed to have a common ancestor (species).
Node: taxonomic unit. Internal node: most recent common ancestor of the descendants; referred to as Hypothetical Taxonomic Units (HTUs) as they cannot be directly observed.
Edge lengths: time estimates.
Phylogenetic analysis helps to answer (SARS Corona Virus):
Kind of virus that caused original infection?
palm civet
Source (host) of the infection? Tree
When, where did virus cross species border? When: d(pc,sample(t)) linear in t. Where: Tree, MDS [variation highest]
Key mutations that enabled this switch?
Trajectory of relationships/spread of virus? Tree
In case of e.g. horizontal gene transfer/recombination a phylogenetic network is more appropriate

Additive distances: distance over path from i → j is: d(i,j)

Three-point formula
Lx + Ly = dAB    Lx = (dAB + dAC - dBC)/2
Lx + Lz = dAC    Ly = (dAB + dBC - dAC)/2
Ly + Lz = dBC    Lz = (dAC + dBC - dAB)/2

Four-point formula
Four-point condition (when (1,2) and (i,j) are neighbors): d(1,2) + d(i,j) < d(i,1) + d(2,j)
Minimize d(i,j) AND total distance in tree (∑ all branch lengths)
M(i,j) = (n-2)d(i,j) - Ri - Rj with Ri = ∑j d(ti,tj)
If i and j are neighbours: M(i,j) < M(i,k) for all k not equal to j

NJ (neighbor join) algorithm
Input: nxn distance matrix D (JC-corrected) and an outgroup
Output: rooted phylogenetic tree T
1. Compute new table M using D; select the smallest value of M to select two taxa to join
2. Join the two taxa ti and tj to a new vertex V; use the 3-point formula to calculate the updated distance matrix D’ where ti and tj are replaced by V.
3. Compute branch lengths from tk to V using the 3-point formula, T(V,1) = ti and T(V,2) = tj and TD(ti) = L(ti,V) and TD(tj) = L(tj,V).
4. The distance matrix D’ now contains n - 1 taxa. If there are more than 2 taxa left go to step 1. If two taxa are left join them by a branch of length d(ti,tj).
5. Define the root node as the branch connecting the outgroup to the rest of the tree. (Alternatively, determine the so-called “mid-point” or use info on relative rates of divergence)

UPGMA (Unweighted Pair Group Method with Arithmetic mean)
Ultrametricity condition is valid for the real tree, but in practice generates erroneous trees (noise). Can use D instead of M.
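One iteration of the NJ algorithm (steps 1 to 3) can be sketched as follows. The function name `nj_step` and the list-of-lists matrix layout are assumptions of this sketch. The branch lengths use the standard neighbor-joining estimate d(i,j)/2 + (Ri - Rj)/(2(n-2)), which averages the three-point formula over all remaining taxa. The distances from the new vertex V use the three-point formula directly.

```python
def nj_step(D, names):
    """Perform one neighbor-joining step on distance matrix D (needs n > 2 taxa).

    Returns (new_names, new_D, (name_i, branch_i), (name_j, branch_j)),
    with the joined pair replaced by a new vertex in first position.
    """
    n = len(D)
    R = [sum(row) for row in D]                       # R_i = sum_j d(t_i, t_j)
    # M(i,j) = (n-2) d(i,j) - R_i - R_j; the smallest entry selects the pair.
    _, i, j = min(((n - 2) * D[i][j] - R[i] - R[j], i, j)
                  for i in range(n) for j in range(i + 1, n))
    # Branch lengths from t_i and t_j to the new vertex V.
    li = D[i][j] / 2 + (R[i] - R[j]) / (2 * (n - 2))
    lj = D[i][j] - li
    keep = [k for k in range(n) if k not in (i, j)]
    # Three-point formula: distance from V to every remaining taxon k.
    dV = [(D[i][k] + D[j][k] - D[i][j]) / 2 for k in keep]
    new_D = [[0.0] + dV]
    for a, k in enumerate(keep):
        new_D.append([dV[a]] + [D[k][m] for m in keep])
    new_names = [f"({names[i]},{names[j]})"] + [names[k] for k in keep]
    return new_names, new_D, (names[i], li), (names[j], lj)


D = [[0, 5, 9, 9, 8],
     [5, 0, 10, 10, 9],
     [9, 10, 0, 8, 7],
     [9, 10, 8, 0, 3],
     [8, 9, 7, 3, 0]]
names, D1, left, right = nj_step(D, ["a", "b", "c", "d", "e"])
print(names, left, right)  # → ['(a,b)', 'c', 'd', 'e'] ('a', 2.0) ('b', 3.0)
```

Calling `nj_step` repeatedly until two taxa remain, and then joining those by a branch of length d(ti,tj), gives the unrooted tree. Rooting with the outgroup is a separate final step.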
Chapter 8 Whole genome comparisons
Looks at the differences between the entire sets of genes of two organisms
Intracellular symbionts have become entirely dependent on the host to provide them with nutrients, oxygen, and specific proteins they previously had to synthesize themselves
In the process they have lost many genes necessary to produce such products themselves
As a result, intracellular obligate symbionts have the smallest genomes, both in total size and in number of genes, which makes them a perfect case study for whole genome comparisons
High conservation of the order of the genes and virtually no horizontal gene transfer since:
o Few genes → change likely to be lethal
o Cloistered lifestyle of endo-symbiotic organisms shields them from viruses and other bacteria that may induce gene rearrangement
- Gene / genome duplication is the basis for new functions as extra genes are free to evolve
- Gene loss (redundancy, changed environment)

Genes (genome content): comparison of genomes as comparison of individual genes
1. Find which genes are present in both: use ORF-finder with threshold of 100 codons
2. Fill out a matrix with alignment scores between each possible pair of sequences
3. Use Needleman-Wunsch or BLAST to compute similarity scores (normalize by length)
4.
Identifying gene families
Only consider genes > 50% similar; ‘closely’ related and probably have similar ‘function’
o Clustering method for finding ‘similar’ genes: (hierarchical) clustering (NJ- or UPGMA algorithm)
  Drawback: all clustering methods have some degree of arbitrariness
  Cluster both genomes simultaneously, then count #genes in each cluster (= gene family)
  Chlamydia: large # of small gene families, small # of large families: CT CP function
o Phylogenetic tree (orthologs appear as siblings / host & symbiont subtree in common)
  Ortholog genes are separated by a speciation event, so the phylogenetic tree is a useful metaphor
  Phylogenetic tree is a better representation, but less amenable to automated analysis

Identifying orthologous and paralogous genes
Result of evolution of homologs (dupl/del in one but not in other): m:n relationship
Best Reciprocal similarity Hits (BRHs)
o possible: ORFs without BRH (lost from other species)
o possible: ORF with ortholog in other species and a paralog in the same species
o example: paralog more similar than ortholog → duplication after species split
Genes that cooperate tend to move close together
Because of inversions, duplications, transpositions (translocations), deletions and chromosomal rearrangements, an alignment of entire genomes will not work → multiple single-gene analysis (Genome = beanbag of genes + junk-DNA)
Similarity between pairs of genes informs about:
- The differences in their genomes tell us something about the function of their retained genes
- Evolution:
  o Blocks of conserved gene order (orthologs: estimate #substitutions → #time)
  o Changes in size of gene families
  o Nucleotide substitutions between orthologous genes
- Major mechanisms of reshuffling of synteny are inversions and transpositions
- Noise on synteny is caused by insertions, duplications, and deletions

Chromosomes (gene position)
Visualising Synteny
Dot-plot: x: position on genome_1, y: position on genome_2, dot for homologous gene
Synteny allows for:
-
Phylogenetic Footprinting (identification of homologous intergenic regions)
  Use syntenic coding regions as anchors to find (short) intergenic (= non-coding) regions (not selected for → fast evolution) that are highly conserved (may be RNA-coding or regulatory) → use to compute mutation rate
- Annotation of non-coding sequences
- Comparing lost and gained genes

Sorting by reversals
Minimum number of inversions to transform one genome into the other
A metric for the syntenic distance: number of genomic rearrangements that separate the two species.
METRIC = smallest number of operations (= inversion or transposition) that transform one genome into the other.
Given a permutation of N numbers, find the shortest series of reversals that can sort them back into their original order (can solve overlapping reversals)
1. Designate one sequence as the standard s and the other as t
2. Starting from the current i (initially i=1), increase(i) until s(i) ≠ t(i) or i=length(t)
3. j=i; increase(j) until t(j) = s(i); reverse(t(i:j))
4. i=i+1; if i=length(t), stop, else goto 2

Chapter 9 Clustering gene expression profiles: time series of expression levels
Yeast: fermentation → diauxic shift (sugar supply exhausted) → respiration
Fermentation: C6H12O6 → 2 CO2 + 2 C2H5OH (ethanol)
Respiration: C2H5OH + 3 O2 → 2 CO2 + 3 H2O
Ethanol originally perhaps as near-toxic protection → new pathways formed & old shut off

Microarrays
Gene as on-off switch; RNA and proteins as messengers between genes
Purpose: snap-shot of the expression level of every gene in the cell
Measure concentrations of mRNA and reverse-compute the DNA belonging to this mRNA. Because introns are spliced out of the RNA and only exons remain, the backward-computed DNA is not entirely equal to the real DNA: it is called cDNA (complementary DNA). cDNA hints at an expressed gene. cDNA is stored as an EST: Expressed Sequence Tag.
EST sequencing can identify genes that are ‘missed’ with ab initio gene-finding methods
Microarray technologies (visualize hybridization of fluorescent molecules inserted on DNA):
- cDNA arrays:
  o No prior knowledge of gene sequence needed (capture all transcripts expressed in cell (reverse transcriptase); sequence DNA in interesting spot later)
  o Different fluorescent dyes (Cy3, Cy5 = green, red) → relative changes in expression, because #transcripts differs → #cDNA differs over sequences (normal/tumor)
  o Cy3 and Cy5 differ in size & rate of decay → dye swap → average ratios
- Oligonucleotide arrays (one/two-dye)
  o Know sequences a priori (synthesize oligonucleotide e.g. based on EST)
  o Few probes per gene; pick oligonucleotides specific to individual genes
    Paralogs: longer probes more likely to complement similar sequences
    Mismatch probes as correction for non-specific hybridization
  o One-dye: average expression levels of “independent” (overlapping) probes for a single gene (more 3’ → higher fluorescence)
- Reference design (compare to t=0)
Uses:
- Explicitly test for effects of condition / mutation (cancerous/non-cancerous, effect of drugs on cellular function) on gene expression (significance: replication)
- Reconstruct the gene regulatory networks
- Functional annotation
Fold change: relative change in activity f = value_new/value_old; fold-change = f > 1 ? f : -1/f
Clustering:
o Distance measure: (don’t take into account similarity in magnitude)
o C = ∑ij in same cluster dij - ∑ij in different cluster dij
o Hierarchical clustering: can try different k without re-computing distances
Neighbouring genes are often expressed similarly

Data Visualisation
In a tree using hierarchical clustering
In a plane using MDS
Genes with similar expression profiles have similar functions.
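The fold-change definition above, combined with the reference design (compare every time point to t=0), can be sketched as follows. The function names `fold_change` and `profile_fold_changes` are hypothetical. The sketch assumes strictly positive expression values and follows the notes’ rule literally, so an unchanged value (f = 1) maps to -1.

```python
def fold_change(value_new, value_old):
    """Signed fold change: f if f > 1, otherwise -1/f (f = value_new / value_old)."""
    f = value_new / value_old
    return f if f > 1 else -1 / f


def profile_fold_changes(profile):
    """Fold change of every later time point relative to the first one (t=0)."""
    reference = profile[0]
    return [fold_change(value, reference) for value in profile[1:]]


print(fold_change(8, 2))                # → 4.0
print(fold_change(2, 8))                # → -4.0
print(profile_fold_changes([2, 4, 1]))  # → [2.0, -2.0]
```

The signed form makes a 4-fold increase and a 4-fold decrease symmetric (+4 and -4), so a single threshold on the absolute value treats up- and down-regulation alike.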
Pre-processing
Select only genes with ‘enough’ fold-change: abs(fold-change) > threshold
Delete missing values

Chapter 10 Identification of regulatory sequences
Internal clock synchronizes functions (metabolism, activity/awareness level, body temp.)
Rather than moving, plants react to external stress by changing their internal condition
o Herbivore? → Chemical repellent! (e.g. nicotine)
o Falling temperature? → Anti-freeze proteins!
Plants that can ‘anticipate’ changes have a competitive advantage: photosynthesis
Plants have a cell-autonomous circadian clock
Removing day-night stimulus (keep in constant light/dark): mammals keep circadian clock running for months
Eubacteria: rigid motifs at -10: TATAAT, at -35: TTGACA
Eukaryota: different RNA polymerase → different motifs;
o TATA-box (= TATAA[A/T]) at ~ -40
o Other docking sites at +/- -1000 up to -250,000
Finding TFBS motifs is complex:
1. TFBS are very short and will therefore appear by chance alone
2. There is a high variability (ATAATC, ATAATT, ATACTC, …)
3. We don’t know the TFBS motif nor the TFBS location nor the length
Algorithmically finding a motif by optimizing a scoring function is computationally expensive
Heuristics: e.g. Gibbs sampling (randomized and greedy)
What is a significant result: compare the sequence with the background model: the chance, based on the current set, that the motif occurs by pure chance
Where to look for a TFBS?
- Areas on the gene with high conservation
- Co-regulated genes (have same TF): look for shared motifs +/-1000 upstream
Focus on ungapped fixed sequence motif (assume no variation) with fixed length

Identifying motifs
A motif is interesting if unlikely under the background distribution: column in PSSM is more unbalanced.
Input: PSSM = scoring from multiple alignment
Scoring function for imbalance: Kullback–Leibler divergence (KL divergence):
$S_{KL} = \sum_{\text{position } i} \sum_{\text{letter } k} p_i[k] \log \frac{p_i[k]}{q_i[k]}$
pi[k] is the probability of observing symbol k at position i
qi[k] is the multinomial background model for symbol k at i
To avoid zero entries and resulting divergences (log 0), a statistical trick is to add pseudocounts: add 1 at each entry

Finding high-scoring motifs given PSSM
Input: Sequence s of length n > L; PSSM of length L
Output: starting position (j with highest value); most probable motif (argmax of L(j))
Slide the PSSM along the sequence and compute the (log) likelihood, with L in the upper limit being the PSSM length:
$L(j) = \prod_{i=j}^{j+L-1} p_{i-j+1}[s[i]]$
Gibbs sampling to avoid local optima

EM-algorithm for finding high-scoring motifs
0. Start with random location j and random PSSM
Iteration:
1. With fixed j optimize PSSM
2. With fixed PSSM optimize j
Until the result has converged

Finding the motif-length
Scree-plot: log-likelihood score relative to the background model vs motif length L

Biological validation
Compare motif with standard TFBS databases like cisRED
Perform biological experiments to test the hypothesis: Harmer et al. attached a fluorescent molecule complex to the TFBS and could thus with a scintillation counter
Because not all genes are directly regulated by the first few TFs in the circadian regulatory cascade, the presence or absence of the EE helps to reveal the exact sequence of events that occur during circadian control. With the EE we can now look for other locations on the DNA with the same or similar motifs.

Case study: the circadian rhythm
Cluster the expression profiles and consider the clusters with appropriate periodicity: they are candidates for containing the EE (evening element)
Look in this cluster for shared motifs upstream up to -1000
Consider all words of length 9 whose frequency in the evening cluster is very different from its frequency in the rest of the data.
o Examine all words of fixed length 9 in both sequences seq1 and seq2 (considering also the reverse complement).
o Motifs found are scored and sorted in descending order by margin (the difference between their frequency in cluster 2 and that in cluster 1-3). The top 10 of 9-mers are computed and shown.
Remove repeats (of 1- or 2-mers) from the obtained set of motifs (no biological significance)
The most significant element is AAAATATCT. We know from the study of Harmer et al. that it corresponds to the evening element (word of 9 bases found upstream of genes turned on in the evening). Its margin is 0.00014. We notice that 2 of the other 3 top motifs are simply variants of the evening element (AAATATCTT and AAAAATATC).
To assess the significance of the value found for the margin of the evening element we perform 100 random splits of the data and measure the margin of the highest-scoring element. In 100 trials we never observe a margin larger than 0.000147462.
We can look in detail at the frequency of the evening element among all the clock-regulated genes:

Circadian time:   0    4    8    12   16   20
Number of genes:  78   45   124  67   30   93
EE count:         5    6    49   27   8    8

The arrays EEcount and Ngenes show that not all the genes of the second cluster have the evening element, nor is this motif limited only to these genes.

Allele: Gene variant: one of multiple possibilities for a nucleotide (due to polymorphism)
Bifurcating tree: Internal nodes have degree 3, external nodes degree 1, root degree 2
Base: Chemical element
Chromosome: Stretch of DNA
cis-regulatory DNA: Promoter: element (required by RNA polymerase) regulating transcriptional activity that is located on the same DNA molecule as the transcribed gene (trans-regulatory: molecules separate from the gene-containing DNA molecule), -locking
Cladistics: Hierarchical classification of species (e.g. based on morph. data)
Codon: Nucleotide-triplet = 3-nucleotide unit (used by every organism)
Consensus sequence: Most probable sequence
Degradase: Protein: cuts apart molecules no longer needed
Directionality: Nucleic acids are synthesized in 5' to 3' direction, as the polymerase used to assemble new strands attaches a new nucleotide to the 3' hydroxyl group via a phosphodiester bond. Single DNA & RNA strand sequences are written in 5' to 3' direction. Relative positions of structures along a strand of nucleic acid (incl. genes, transcription factors, polymerases) are noted as being upstream (towards the 5' end) or downstream (towards the 3' end).
DNA: Deoxyribonucleic acid: long polymer of nucleotides encoding the sequence of the amino acid residues in proteins using the genetic code: contains genetic instructions specifying biological development of all cellular forms of life (and most viruses)
Epitope: Part of macromolecule recognized by immune system (antibodies)
Exons: Part of gene that is transcribed and eventually specifies mRNA
Mutation, fixed: Remaining mutation when all lineages carrying alternative mutations have died off. Fixed mutations may never reach 100% frequency in the population, as further mutations at the same site may arise (all sharing a common ancestor which had the fixed mutation) ~ observed as differences between individuals
Genetic code: n:1 lookup table from codon to amino acid [~thereby DNA sequences to proteins] (identical in nearly all organisms: standard GC)
Genetic drift: Change in rel. freq. with which an allele occurs in a population that results from the fact that alleles in offspring are a random sample of those in parents (most fixed mutations are neutral).
Genetic pathway: Network of those genes that are connected by causal relations in their expressions
Genome: Complete genetic sequence on one set of chromosomes (one of the two sets that a diploid individual carries in every somatic cell)
Genomics: Study of organism's genome and the use of the genes
Hybridization: Chemical binding
Homeobox: DNA sequence found within genes that are involved in the regulation of development (morphogenesis) of animals, fungi and plants
Homologs: Different versions of the same gene (genes that have a common ancestor); induced from sequence similarity
Horiz. gene transfer: Translocation between two organisms
Intergenic regions: Non-protein coding regions of the genome
Intracellular symbionts / obligate endo-symbiont: Symbionts which have moved permanently into the cells of the host
Introns: Transcribed, but not translated sequences in eukaryotic genes (spliced out before the mRNA travels to the ribosome)
Ligase: Protein: joins molecules together
Linkage: Single linkage dxy = min i,j ||x[i] - y[j]||. Average linkage dxy = mean i,j ||x[i] - y[j]||. Centroid distance dAB = ||mA - mB||
Mutation: A nucleotide at a certain location is replaced by another nucleotide
Nucleotides: Molecules distinct from each other in one base
Nucleotide mutation: Base change s.t. mutant and wild-type forms coexist in a population
Nucleotide subst.: Base change between two populations (nucleotide mutation only becomes nucleotide substitution when most recent common ancestor of entire population carried that mutation)
Oligonucleotide: Short DNA sequence
ORF: Open reading frame: portion of organism's genome which contains a sequence of bases that could potentially encode a protein: ATG [...]* (TGA)|(TAA)|(TAG) – end
Orthologous genes: Genes found in separate species deriving from same parental sequence (homologous genes in different organisms)
Paralogous genes: Homologous genes in 1 organism/genome deriving from gene duplication and subsequent specialization
Parsimony: ‘Less is better’ concept in arriving at a hypothesis / course of action (from Latin parcere)
Phylogenetics: Study of evolutionary relatedness among various groups of organisms (e.g. species, populations)
Polymorphism: Multiple possibilities for a nucleotide: allele
Polyploid individual: Duplicated genomes
Prions: Example of auto-replicating proteins
Promoter: Region on the DNA just before (= upstream) the gene that indicates where the transcription starts
Proteomics: Large-scale study of proteins (particularly structures and functions)
Pseudogenes: Vestiges of genes that once worked but were wrecked by mutations
PSSM (= profile): Position Specific Scoring Matrix; multinomial model of sequence depending on position on sequence: P[position,symbol], symbol = {C,T,G,A} or 20 AA
Reading frame: Non-overlap. DNA decomp. into codons (3 possib./strand * 2 strands)
Retrovirus: Enveloped virus possessing an RNA genome. Replicates via a DNA intermediate. Relies on reverse transcriptase to reverse-transcribe its genome from RNA into DNA, which can then be integrated into the host's genome with an integrase enzyme.
Saturation: On average one substitution per site
Sequence space: Space of all sequences (up to a certain length)
SNP (“snip”): Single Nucleotide Polymorphism; point mutation: DNA fingerprinting
STR: Short Tandem Repeats (microsatellites)
Symbionts: Organisms that live together in a beneficial relation
Synteny: syn- = together, tenia = ribbon, band; the relative ordering of genes on the same chromosomes
Synteny, blocks of: Long DNA stretches where rel. ordering of orth. genes is conserved
Taxa: Units under comparison
Transcription: Process through which a DNA seq. is enzymatically copied by RNA polymerase producing complementary RNA (Thymine → Uracil)
Transcription factor: Protein that regulates transcription. TFs regulate binding of RNA polymerase and initiation of transcription. A TF binds upstream or downstream to either enhance or repress transcription of a gene by assisting or blocking RNA polymerase binding
Transcription Factor Binding Site: The location on the DNA molecule where a TF can physically attach; has a specific sequence of nucleotides for the TF to attach (motif), e.g. RNA polymerase BS; can be multiple per gene
Ultrametricity: Distance from the root to all leaves of the tree is equal