Genome organization & its genetic implications

Lander, ES (2011) Initial impact of the sequencing of the human genome. Nature 470:187
Feuillet, C, JE Leach, J Rogers, PS Schnable, K Eversole (2011) Crop genome sequencing: lessons and rationales. Trends Plant Sci 16:77

DNA sequencing technologies

                               Read length     Speed         Cost / human genome
First gen (Sanger)             800 bases       0.1 Gb/day    $70,000,000
Next gen (454/Illumina/APG)    30-300 bases    1-5 Gb/day    $75,000-$250,000

Metzker, M (2010) Sequencing technologies – the next generation. Nature Rev Genet 11:31

What are the challenges for the correct assembly of genome sequence information?
• Genome size
  Eukaryotic genomes ~ 10^9 – 10^10 bp
• Genome composition
  Eukaryotic genomes ~ 50% repetitive DNA

Genome size – the C-value paradox
[Figure: genome size in base pairs]
The amount of DNA in the haploid cell of an organism is not related to its evolutionary complexity or number of genes

Genome composition
• Complexity = length in nucleotides of the longest nonrepeating sequence that can be formed by splicing together all unique sequences in a sample
• Eukaryotic genomes contain different classes of DNA based on sequence complexity:
  highly repetitive
  middle repetitive
  unique

Genome composition – DNA reassociation kinetics
[Figure: reassociation curves and complexity; x-axis: (moles of nucleotide / liter) x sec]
[Figure: reassociation kinetics for a complex eukaryotic genome, with highly repetitive, middle repetitive and single copy sequence components; x-axis: (moles of nucleotide / liter) x sec]

From genome composition to genome organization
How are unique, middle repetitive and highly repetitive sequences organized in the genome?

Genome organization
[Figure: arrangement of genes and repeats in E. coli, S. cerevisiae, H. sapiens and Z. mays; labels: gene desert, gene island; legend: Gene, Repeat]

Genetic complexity
• Eukaryotic genomes contain ~ 20,000 – 30,000 genes
• 30% of protein coding genes are members of gene families
  duplication & divergence of sequence & gene function

Gene complexity
• What does a gene look like from a sequence or transcript perspective?
  no “typical gene”
• Introns and exons
  introns can be numerous and long, i.e. some genes are more intron than exon!
  alternative splicing variants are common
• Not all genes encode proteins
  non-coding structural RNAs (e.g. rRNA, tRNA, snRNA, snoRNA)
  non-coding regulatory RNAs (e.g. miRNA, lncRNA)

Implications of gene and genetic complexity
• Forward genetics: Have mutant – want gene
• Via map-based cloning:
  Map your mutation
  Look at the genome sequence in the map interval to identify candidate genes
• Candidate gene identification may not be trivial, even with good genome annotation!
  Especially an issue for plant genome sequences – only Arabidopsis and rice are considered “finished” quality
• Note: further genetic tests are required, even if the perfect candidate is identified.

Gene identification – open reading frames
5' atgcccaagctgaatagcgtagaggggttttcatcatga
frame 1   atg ccc aag ctg aat agc gta gag ggg ttt tca tca tga
           M   P   K   L   N   S   V   E   G   F   S   S   *
frame 2   tgc cca agc tga ata gcg tag agg ggt ttt cat cat tgg
           C   P   S   *   I   A   *   R   G   F   H   H
How to tell real ORFs from random chance ORFs?

Gene identification – short ORFs can be translated!
• e.g. the Drosophila tarsal-less gene
Galindo et al. PLoS Biol 5(5): e106 doi:10.1371/journal.pbio.0050106

Gene identification – database searching
e.g. http://blast.ncbi.nlm.nih.gov/Blast.cgi

Gene identification – shared synteny
Preserved localization of genes on chromosomes of different species
• e.g. mouse chromosome 11 and parts of 5 different human chromosomes
  Perfect correspondence in order, orientation and spacing of 23 putative genes, and 245 conserved sequence blocks in noncoding regions
• Caution! Even regions of high synteny may not show perfect gene-for-gene correspondence
  from Gibson & Muse (2002) A Primer of Genome Science, Sinauer Inc.
• e.g. maize – sorghum (G) rice (H)
  Schnable et al. Science 326:1112

Gene identification – promoter elements
• TATA-box elements
  5'-TATAAA-3' or variant
  plant and animal promoters
• CpG islands
  Regions of higher than expected CpG dinucleotide content, unmethylated in active promoters
  ~ 40% of mammalian promoters
  ~ 70% of human promoters
  but NOT in plant promoter regions
• Y patch (pyrimidine-rich patch)
  plant, not mammalian, promoters

Gene identification – introns & exons
• Long gene space
  more intron than exon
• Extreme example - human clotting factor VIII gene

Gene identification – alternative splicing variants
Pistoni et al. RNA Biol 7:441

Gene identification – trans-splicing
Gingeras, Nature 461:206

Gene identification – non-coding RNAs
• non-coding structural RNAs
  rRNA & tRNA – transcription & translation
  snoRNA – small nucleolar RNAs guide chemical modification of rRNAs & tRNAs
  snRNA – small nuclear RNAs guide splicing reactions
• non-coding regulatory RNAs
  miRNA & siRNA – microRNAs & small interfering RNAs; RNAi pathway
  lncRNA – long noncoding RNAs

Origins of long non-coding RNAs
Overlapping transcriptional architecture
• e.g. the human phosphatidylserine decarboxylase (PISD) gene
Kapranov, Nature Rev Genet 8:413

Functions of lncRNAs
Wilusz et al. Genes Dev. 23:1494–1504

Genome - Transcriptome - Proteome
• Genome
  Full complement of an organism’s hereditary information
• Transcriptome
  Full set of RNA molecules, coding and non-coding, transcribed from the genome
• Proteome
  Full set of proteins expressed from a genome
• Not a 1:1:1 correspondence

Implications of gene and genetic complexity
• What is the take-home message for forward genetics?
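
Worked example for the open reading frame slide: the short Python sketch below translates the slide's 39-base sequence in each forward reading frame and lists ORFs that run from an atg to the first in-frame stop codon. The function names (translate, find_orfs) and the min_codons cutoff are illustrative assumptions, not part of the lecture; only the forward strand is scanned, and the standard genetic code is assumed.

```python
# Standard genetic code, built from the conventional t-c-a-g codon ordering.
BASES = "tcag"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
STOP_CODONS = {"taa", "tag", "tga"}


def translate(seq, frame=0):
    """Translate complete codons of seq starting at offset `frame` (0, 1 or 2)."""
    seq = seq.lower()
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    # Codons containing anything other than a/c/g/t are shown as 'X'.
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def find_orfs(seq, min_codons=10):
    """Return (start, end, protein) for forward-strand ORFs.

    An ORF here runs from the first atg in a frame to the next in-frame stop.
    min_codons is a hypothetical length cutoff (stop codon not counted).
    """
    seq = seq.lower()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "atg":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, translate(seq[start:i])))
                start = None
    return orfs


slide_seq = "atgcccaagctgaatagcgtagaggggttttcatcatga"
print(translate(slide_seq, 0))             # MPKLNSVEGFSS*
print(translate(slide_seq, 1))             # CPS*IA*RGFHH
print(find_orfs(slide_seq, min_codons=5))  # [(0, 39, 'MPKLNSVEGFSS')]
```

A length cutoff is one simple way to discard chance ORFs, since random sequence hits a stop codon fairly often. As the tarsal-less example shows, though, short ORFs can be translated, so any cutoff risks missing real genes.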
Implications of gene and genetic complexity
• Reverse genetics: Have gene – want phenotype
  Predict phenotypes based on gene function in other organisms
  Knock out or knock down your gene of interest & look for corresponding changes in phenotype

Gene families
• Gene duplication followed by:
  Duplication of gene function
  Divergence of gene function
  Loss of gene function leading to a pseudogene
• e.g. human globin gene family
• e.g. human beta-globin gene cluster
  chromosome 11
  Five functional genes and two pseudogenes

Gene families – paralogs & orthologs
• Homologs
  Protein or DNA sequences having shared ancestry
• Orthologs
  Homologs created by a speciation event
  May or may not retain the same function!
• Paralogs
  Homologs created by a gene duplication event
  May or may not retain the same function!
• It is not always easy or possible to distinguish orthologs from paralogs when comparing genes or proteins between species
[Figure: globin gene paralogs]
[Figure: orthologs and paralogs] Storz et al. IUBMB Life 63:313

Implications of gene and genetic complexity
• What are the implications of gene families for forward genetics (i.e. looking for candidate genes that condition a mutant phenotype)?
• What are the implications of gene families for reverse genetics (i.e. altering gene function and looking for a phenotype)?
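
Worked example for the ortholog and paralog definitions: the sketch below labels every pair of genes in a small gene tree by the event at their most recent common ancestor (speciation gives orthologs, duplication gives paralogs). The tree, the gene names and the function names are hypothetical and highly simplified for illustration; they assume only that the alpha/beta globin duplication predates the human-mouse split.

```python
# Hypothetical, highly simplified globin gene tree.
# Internal node: (event, left_subtree, right_subtree); leaf: "species_gene".
GENE_TREE = (
    "duplication",
    ("speciation", "human_alpha", "mouse_alpha"),
    ("speciation", "human_beta", "mouse_beta"),
)


def leaves(node):
    """Return all gene names below a node."""
    if isinstance(node, str):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)


def classify_pairs(node, out=None):
    """Label each gene pair by the event at its most recent common ancestor."""
    if out is None:
        out = {}
    if isinstance(node, str):
        return out
    event, left, right = node
    label = "orthologs" if event == "speciation" else "paralogs"
    # A pair with one gene on each side has this node as its common ancestor.
    for a in leaves(left):
        for b in leaves(right):
            out[frozenset((a, b))] = label
    classify_pairs(left, out)
    classify_pairs(right, out)
    return out


pairs = classify_pairs(GENE_TREE)
for pair, label in sorted((sorted(p), l) for p, l in pairs.items()):
    print(pair, label)
```

Note that human_alpha and mouse_beta come out as paralogs even though they sit in different species, because the duplication separating them is older than the speciation. With real data the event labels are themselves inferred, which is one reason the distinction is not always easy or possible.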
Genome organization – repeated sequences
~ 50% of the genome
• Segmental duplications and copy number variation
• Tandemly repeated genes
  rRNA, tRNA and histone gene products needed in large amounts
• Duplicated gene families
• Transposons
• Tandem simple sequence repeats
  centromeric & telomeric repeats
  minisatellites
  microsatellites

Repeated sequences – segmental duplications & copy number variants
• Segmental duplications
  > 1 kb block of duplicated sequence with > 90% sequence identity
  recombine to mediate further copy number variants
  Koszul & Fischer, C.R. Biologies 332:254
• Copy number variant (CNV)
  Deviation from diploid copy number at a locus
• Copy number polymorphism (CNP)
  CNV present in >1% of a population
• Recent association with human developmental syndromes
  Girirajan et al. Annu Rev Genet 45:203

Transposon-derived repeated sequences
• ~ 45% of human & 85% of maize genome
• Many are truncated & inactive
• Considered to be important in the evolution of genome organization & function
Gogvadze & Buzdin Cell Mol Life Sci 66:3727

Repeated sequences – short tandem repeats
• Centromeric
  Long array (~100,000 bp) of short tandem repeats
  ~ 5 bp Drosophila, ~150 bp maize, ~170 bp human
  not conserved across species
  in some cases not even conserved in all chromosomes of the same species
  Association with a centromere-specific histone H3
• Telomeric
  Length varies between species, ~ 300 base pairs - 150 kilobase pairs
  Conserved, G-rich repeat sequence
  vertebrates TTAGGG; most plants TTTAGGG
• Minisatellites (Variable number tandem repeats, VNTRs)
  10-100 bp repeat units
  500-30,000 bp arrays
  The original DNA fingerprinting marker via Southern blotting
  Now supplanted by microsatellites
• Microsatellites (Simple sequence repeats, SSRs)
  Di, tri or tetra-nucleotide repeats; 1-10 repeat units per locus
  Repeat numbers expand or contract over a short evolutionary, or even generational time-frame
  Amplified by PCR
  Primers based on unique flanking sequence
  Products fractionated by capillary or acrylamide gel electrophoresis
  Co-dominant mapping & fingerprinting markers
  Both alleles can be detected in a heterozygous individual

  variety A [CACACACA]    variety B [CACA]
            [GTGTGTGT]              [GTGT]
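
Worked example for the microsatellite slide: the sketch below builds the variety A (four CA units) and variety B (two CA units) alleles between two unique flanking sequences, counts the repeat units, and lists the PCR product sizes expected for each diploid genotype. The flanking sequences, amplicon lengths and function names are made up for illustration; only the repeat tracts come from the slide.

```python
import re

# Hypothetical unique flanking sequences where the PCR primers would anneal.
LEFT_FLANK = "ggatcgttagcgtt"
RIGHT_FLANK = "gaattggtcgactg"


def make_allele(n_repeats, unit="ca"):
    """Amplicon sequence: left flank + n_repeats copies of unit + right flank."""
    return LEFT_FLANK + unit * n_repeats + RIGHT_FLANK


def count_repeats(amplicon, unit="ca"):
    """Length, in repeat units, of the longest uninterrupted run of unit."""
    runs = re.findall(f"(?:{re.escape(unit)})+", amplicon)
    return max((len(run) // len(unit) for run in runs), default=0)


def band_sizes(genotype):
    """Distinct PCR product lengths (bp) expected for a diploid genotype."""
    return sorted({len(allele) for allele in genotype})


variety_a = make_allele(4)  # [CACACACA]
variety_b = make_allele(2)  # [CACA]

print(count_repeats(variety_a), count_repeats(variety_b))  # 4 2
print(band_sizes((variety_a, variety_a)))  # [36]      homozygous A
print(band_sizes((variety_b, variety_b)))  # [32]      homozygous B
print(band_sizes((variety_a, variety_b)))  # [32, 36]  heterozygote
```

The heterozygote gives two products of different sizes, one per allele, which is what makes the marker co-dominant: both alleles can be read directly from the electrophoresis pattern.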