What is Comparative Genomics?
Insights gained through comparison of genomes from different species

How did it all start?
• We needed some genomes to start comparing
  – Many Bacteria sequenced first
  – Model organisms: yeast, worm, fruit fly, thale cress
  – Finally, Human
• Comparative genomics did not just happen
  – Enough data had to be accumulated
  – New computational methods had to be developed to meet the challenges of processing large amounts of data
  – “Informatics” techniques from applied math, computer science and statistics were adapted for biological sequences

Comparing sequenced genomes
• Comparison of genomic sequences from different species can help identify the following:
  – Gene structure
  – Gene function
  – Interaction between gene products
  – Non-coding RNAs
  – Regulatory sequences

Evolution and sequence conservation
• Genome comparisons are based on a simple premise: conservation = functional importance
• If there are no constraints on a DNA sequence, random mutations will occur
• Over large evolutionary times (millions of years), these random mutations make two related sequences different
• Sequences from different genomes will be conserved if:
  – They code for proteins
  – They are important for regulation (protein binding)

No-hypothesis-driven approach
• Hypothesis-driven approaches
  – Develop goals based on an available hypothesis
  – Design initial experiments (and backups if those fail)
  – When they yield results, go to NIH, NSF, DOE, ONR for funding
• No-hypothesis-driven approaches
  – Start with a general knowledge of the biological system
  – Collect a large amount of data (usually with high-throughput methods) and try extracting and/or amplifying signal from noisy data
  – Sometimes it works for reasons that are obvious
  – Sometimes it works for reasons that are NOT obvious
  – Sometimes it doesn’t work because the data are too noisy
  – Funding agencies are not likely to fund this kind of research

Finding DNA regulatory motifs (protein binding sites)
• Experimental approaches
  – Promoter trapping
  – DNA footprinting
  – In vitro binding site selection (SELEX)
• Computational approaches
  – Searching databases of known sites
  – Finding over-represented motifs in a group of sequences (Gibbs sampling, Expectation Maximization)
    – In promoters of homologous genes
    – In promoters of functionally linked genes
    – In promoters of interacting proteins
  – Ab initio methods
    – Positional conservation of (pseudo)palindromic DNA motifs

Finding motifs in promoters of homologous genes
• Perform an all-versus-all BLAST search of the proteomes
• Pool together promoters of related genes
• Find conserved motifs (Gibbs sampling, Expectation Maximization)
• Only DNA motifs in related genes can be identified

Finding DNA motifs by positional conservation of palindromes
• The approach targets sites for dimeric proteins and is particularly suited for helix-turn-helix (HTH) proteins of Bacteria and Archaea
• HTH proteins bind as dimers, usually with variable sequence spacing
• Binding sites are palindromic with a poorly conserved middle:
    GGATTnnnAATCC
    GGATTnnAATCC
    GGATTnnAAGCC
• Starting from a complete set of promoter sequences, we find imperfect palindromes of variable length
• Remove sequences with biased composition (A/T or G/C content > 80%)
• Search all-versus-all and identify similar motifs

Many potential binding sites are found ...
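The first steps of the palindrome search outlined above can be sketched in Python. This is a minimal illustration, not the method actually used for the slides: the function names, the half-site length of 5, the maximum spacer of 4 and the tolerance of one mismatch are all assumptions chosen so that the three example sites (GGATTnnnAATCC, GGATTnnAATCC, GGATTnnAAGCC) are matched. Only the 80% composition cutoff comes from the slide. The all-versus-all comparison of the resulting motifs is not covered.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def is_biased(seq, threshold=0.8):
    """True if the A/T or the G/C fraction of seq exceeds the threshold."""
    at_fraction = sum(seq.count(base) for base in "AT") / len(seq)
    return at_fraction > threshold or (1.0 - at_fraction) > threshold


def find_palindromes(promoter, half_len=5, max_spacer=4, max_mismatches=1):
    """Scan one promoter for imperfect palindromes with a variable spacer.

    A hit is a left half-site followed, after 0..max_spacer unconstrained
    bases, by a right half-site that matches the reverse complement of the
    left half-site with at most max_mismatches mismatches.
    All parameter defaults are illustrative assumptions.
    """
    hits = []
    for start in range(len(promoter) - 2 * half_len + 1):
        left = promoter[start:start + half_len]
        target = reverse_complement(left)
        for spacer in range(max_spacer + 1):
            right_start = start + half_len + spacer
            right = promoter[right_start:right_start + half_len]
            if len(right) < half_len:
                break  # ran off the end of the promoter
            mismatches = sum(a != b for a, b in zip(target, right))
            if mismatches <= max_mismatches and not is_biased(left + right):
                hits.append((start, left, spacer, right, mismatches))
    return hits
```

Each hit is a tuple of (start position, left half-site, spacer length, right half-site, number of mismatches). Running the function over every promoter in a genome and then comparing the hits against each other would correspond to the remaining "search all-versus-all" step.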
[Figure labels: Sulfate metabolism; Transposons; GTP-binding ATPase; RNA Pol K; Ribosomal proteins; Short hypothetical proteins]
• The role of the motifs found is difficult to predict

Finding DNA motifs: the summary
• In promoters of homologous genes
  – Easy to perform and to interpret the results
  – Works only for proteins with sequence homology
• In promoters of interacting proteins
  – General approach, works even in the absence of sequence homology
  – Needs better coverage of interactions; high-throughput studies of species other than yeast will enable comparative analysis
• Ab initio methods
  – General approach, requires no prior knowledge
  – Complementary approaches (experimental or computational) are needed to link the sites found to their DNA-binding proteins

Evolution and sequence conservation
• Genome comparisons are based on a simple premise: conservation = functional importance
• Comparative genomics is needed to identify conservation

Comparative genomics helps genome annotations
• In prokaryotes, finding genes is relatively easy based on open reading frames (ORFs)
• In eukaryotes, we have to look for ORFs, exons, introns, splice sites, polyA sites
• Bad news: predicted exons sometimes do not exist
• More bad news: pseudogenes
• The bad news keeps coming: alternative splicing
• Good news: in different species, the genes normally have similar exon-intron structure

Case 1: Cellular concentration of metabolite is too low to occupy the riboswitch binding site.
[Figure labels: 1, 2, 3, 4; RNA polymerase. Courtesy of R. Breaker, Yale U.]
Transcription and intramolecular RNA folding continue.
Translation is initiated.
[Figure labels: 1, 2, 3, 4; UUUUU; AUG; RNA polymerase]
Typically the new mRNA codes for a biosynthetic or transport protein that raises the intracellular level of the metabolite.
Gene regulation (next case) is accomplished by variations in the interactions of the regions highlighted in orange.

Case 2: Cellular concentration of metabolite (X) is high.
RNA polymerase produces the long untranslated leader region. Intramolecular folding can lead to an alternate conformation.
The alternate riboswitch conformation is stable when metabolite is bound.
Transcription continues. Now, RNA folding leads to formation of an intrinsic terminator.
The transcript is never completed and the metabolite biosynthetic or transport protein is not produced.
[Figure labels: X; nascent RNA; 1, 2, 3, 4; UUUUU; RNA polymerase; DNA template. Courtesy of R. Breaker, Yale U.]

What does this ncRNA bind?

Can we predict functions without strict measure of significance (no sequence or structural similarity)?
This is done by a machine-trained (objective), jury-like system using inference

Comparative genomics predicts protein interactions (Rosetta Stone)
• In yeast, topoisomerase II has two domains that correspond to gyrase subunits A and B
• Sequence comparisons show that these two domains are individual proteins in E. coli
• The implication is that these two proteins interact, and that their fusion was favored during evolution

Predicting protein function by genome context

What does gene colinearity mean?
[Figure labels: Krr1/Rrp20; Rio1/Rio2; Spo11; Tif11]
Not much, unless supported by phylogeny and function

The case of Fibrillarin/Nop56 colinearity
Fibrillarin and Nop56 DO interact

Functional clues for hypothetical proteins based on genomic context analysis

High-throughput approaches
• Had to be developed quickly to match the speed of genome sequencing
• As a general rule, most experimental approaches can be adapted for high-throughput use
  – Protein interactions (two-hybrid, TAP)
  – Protein localization
  – Gene regulation (microarrays)
  – Structure determination (more recent, still gaining speed)

What is a high-throughput experiment?
• Usually done at the level of the whole organism (whole genome) under different conditions
• HT experiments are aided by:
  – Equipment miniaturization
  – Robotics
  – Other automated procedures
• In almost all instances, heavy data analysis and processing are required

General properties of HT experiments
• Collect large amounts of data under many different conditions
  – Err on the side of collecting too much data; disk storage is cheap
• Process raw data (computers)
• Analyze data (computers)
• Integrate data from various sources (computers)
• Identify patterns and cluster the results based on similarity (computers)

Integrating heterogeneous data to predict protein interactions
Analysis of different data types is usually based on Bayesian inference
Example: protein interactions
• Proteins are more likely to interact if they are co-expressed
• Proteins are more likely to interact if they are co-localized in the cell
• Proteins are more likely to interact if they are co-localized in the genome
• Proteins are more likely to interact if they are parts of the same cellular process

Predicting large protein complexes from individual parts

Beware of erroneous annotations
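The Bayesian integration of evidence described in the "Integrating heterogeneous data" slide can be sketched as a naive Bayes combination of likelihood ratios. This is one common way to do it, not necessarily the one behind the slides. It assumes the evidence types are conditionally independent, and every number below (the prior and the four likelihood ratios) is a hypothetical placeholder, not a measured value. Evidence that is not observed is simply ignored, which treats it as uninformative.

```python
import math

# Hypothetical likelihood ratios:
# P(evidence | proteins interact) / P(evidence | proteins do not interact).
# Real values would be estimated from a gold-standard set of known interactions.
EVIDENCE_LR = {
    "co_expressed": 4.0,
    "co_localized_in_cell": 3.0,
    "co_localized_in_genome": 2.0,
    "same_cellular_process": 5.0,
}


def posterior_probability(prior, likelihood_ratios):
    """Bayes' rule in odds form: posterior odds = prior odds * product of LRs."""
    log_odds = math.log(prior / (1.0 - prior))
    log_odds += sum(math.log(lr) for lr in likelihood_ratios)
    odds = math.exp(log_odds)
    return odds / (1.0 + odds)


def score_pair(observed_evidence, prior=0.001):
    """Posterior probability that a protein pair interacts.

    observed_evidence: names of evidence types seen for this pair.
    prior: assumed fraction of all protein pairs that interact (hypothetical).
    """
    ratios = [EVIDENCE_LR[name] for name in observed_evidence]
    return posterior_probability(prior, ratios)
```

With these placeholder numbers, a pair supported by all four evidence types scores far above the prior, while a pair with no evidence stays at the prior. The independence assumption tends to overstate confidence when evidence types are correlated, for example when co-expressed genes also sit next to each other in the genome.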