The Human Genome: impact in the biomedical domain
Sonia ABDELHAK, PhD
Molecular Investigation of Genetic Orphan Disorders
Institut Pasteur de Tunis

Human Genome Project
• Historical context
• Goals of the HGP
• Strategy
• Results
• Impact on the biomedical domain
• Discussion

February 2001; «Finished» sequence, April 2003 (April 1953 to April 2003)

Brief history of the HGP
• 1984 to 1986: first proposed at US DOE meetings
• 1988: endorsed by the US National Research Council (funded by NIH and US DOE; $3 billion set aside)
• 1990: Human Genome Project started (NHGRI)
• Later: UK, France, Japan, Germany and China joined
• 1998: Celera announces a 3-year plan to complete the project years early
• February 2001: first draft published in Science and Nature
• 2003: human genome sequence declared finished (the finishing paper appeared in Nature in 2004)

Challenges
• Genome attributes
– Size
– Polymorphism
– Repeats: smaller repeats are technically difficult to sequence, and some sequences are repeated all over the genome. How can these be placed?
• Available technology
– About 600 bp per "read": sequencing works by extension from a primer followed by gel electrophoresis, so read length is limited by the resolution of the gel.
– Error: ~1 error per 600 bases. Sequencing a region multiple times decreases the error, because the same error is unlikely to occur in multiple reads. 10x coverage gives an error rate of ~1/10,000.
– Relies on cloning: some regions (heterochromatin) are difficult to clone; some sequences rearrange or are deleted when cloned.

Goals of the HGP
• Create a genetic and physical map of the 24 human chromosomes (22 autosomes, X and Y)
• Identify the entire set of genes and map them all to their chromosomes
• Determine the nucleotide sequence of the estimated 3 billion base pairs
• Analyze genetic variation among humans
• Map and sequence the genomes of model organisms

Model organisms
• Bacteria (E. coli, Haemophilus influenzae, several others)
• Yeast (Saccharomyces cerevisiae)
• Plant (Arabidopsis thaliana)
• Roundworm (Caenorhabditis elegans)
• Fruit fly (Drosophila melanogaster)
• Mouse (Mus musculus)

Goals of the HGP (II)
• Develop new laboratory and computing technologies to make all this possible
• Disseminate genome information
• Consider ethical, legal, and social issues associated with this research

Time-line of large-scale genomic analysis

Identification of microsatellite polymorphisms by sequence analysis
IL-12p35AC (between primers F and R), 78.57%:
tggtggcagaaatcattgtctgaaaagtaattgttttacttttattcttttcgtgtgtgtgtgtgt
gtgtgtgtgtgtgtgtgtgtgtgtgtgtgtgcatgtgccagatttcttgtttgaaaggcaat
gagcttcatccaagtatcaa
IL-12p40AC (between primers F and R), 69.23%:
atttcaggtgtgagccactgtgcctggccagaactttttcaatgaatattcaagataattgtata
cacattttatatatatatatatatatacacacacacacacacacacatatgtatacacaca
ttatatatataatccatgttatatacatctctacattatatatatccactatatatattttacttataca
tatagattttatttttatgaactaggatcaaattgta
[Figure: gel, lanes 1 to 5; band labels 174, 170, 166]

EST Division: Expressed Sequence Tags
dbEST: http://www.ncbi.nlm.nih.gov/dbEST/
• Nucleus: 80-100,000 genes
• RNA: 80-100,000 gene products
• Make cDNA library: 80-100,000 unique cDNA clones in library
• Isolate unique clones
• Sequence once from each end
• Result: ESTs (sequence1 and sequence2 of clone xyz)
[Figure: clone xyz with its two end reads, sequence1 and sequence2; sequence labels TAGTCA and CGTACT]
>IMAGE:275615 5' mRNA sequence
GACAGCATTCGGGCCGAGATGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTTTCTG
TGGAGGTATCCAGCGTACTCCAAAGATTCAGGTTTACTCACGTCATCCAGCAGAGAATGGAAAGTCA
TTCCTGAATTGCTATGTGTCTGGGTTTCATCCATCCGACATTGAAGTTGACTTACTGAAGAATGGAG
GAATTGAAAAAGTGGAGCATTCAGACTTGTCTTTCAGCAAGGACTGGTCTTTCTATCTCTTGTACTA
TGAATTCACCCCCACTGAAAAAGATGAGTATGCCTGCCGTGTTGAACCATGTNGACTTTGTCACAGN
AAGTTNAGTTTAAGTGGGNATCGAGACATGTAAGGCAGGCATCATGGGAGGTTTTGAAGNATGCCGC
TTGGATTGGGATGAATTCCAAATTTCTGGTTTGCTTGNTTTTTTAATATTGGATATGCTTTTG
>IMAGE:275615 3', mRNA sequence
NNTCAAGTTTTATGATTTATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACCATGCCTT
TTATCAAATGTATAAGANGTAAATATGAATCTTATATGACAAAATGTTTCATTCATTATAACAAATT
AATAATCCTGTCAATNATATTTCTAAATTTTCCCCCAAATTCTAAGCAGAGTATGTAAATTGGAAGT
CTTATGCACGCTTAACTATCTTAACAAGCTTTGAGTGCAAGAGATTGANGAGTTCAAATCTGACCAA
GTTGATGTTGGATAAGAGAATTCTCTGCTCCCCACCTCTANGTTGCCAGCCCTC

Sequencing chemistry: Dye Terminator (6)
[Figure: a primer annealed to template DNA (... TCGATA); sequencing reaction with Taq polymerase; electrophoresis on a slab gel or capillary, from loading to detection; automated analysis giving the read A G C T A T]

Two competing strategies for sequencing the human genome
• Hierarchical shotgun sequencing [public Human Genome Project]
• Whole-genome shotgun sequencing [Celera project]
BAC: bacterial artificial chromosome clone.
Contig: a joined, overlapping collection of sequences or clones.

Whole-genome shotgun sequencing
Used by the private company Celera to sequence the whole human genome.
• Whole genome randomly sheared three times
– Plasmid library constructed with ~2 kb inserts
– Plasmid library with ~10 kb inserts
– BAC library with ~200 kb inserts
• Computer program assembles sequences into chromosomes
• No physical map construction
• Only one BAC library
• Reduces problems of repeat sequences

Steps of sequence analysis
• Verification of sequence quality
• Elimination of contaminating sequences: Blastn against databases of vectors, bacteria, yeasts, ...
• Assembly: Phred, Phrap, Consed
• Identification of potentially coding sequences: comparison with databases, exon-prediction software
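The assembly step and the definition of a contig given above can be illustrated with a toy example. The following is only a minimal sketch of greedy overlap merging, assuming error-free reads that all come from the same strand; it is not the algorithm used by Phrap or by the Celera assembler, and the function names `overlap` and `greedy_assemble` are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` bases exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap.

    Toy model only: assumes error-free, same-strand reads and ignores
    the repeat problem that makes real genome assembly hard.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        # Discard any sequence fully contained in another one.
        contigs = [r for r in contigs
                   if not any(r != s and r in s for s in contigs)]
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [r for idx, r in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]


reads = ["ATGCGTAC", "GTACGTTA", "CGTTAGC"]
print(greedy_assemble(reads))
```

With these three overlapping reads the sketch returns a single contig. A sequence that is repeated elsewhere in the genome would create ambiguous overlaps, which is why repeats are listed among the challenges.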
Sequence databases
• GenBank (NCBI, NIH): submissions and updates; queried through Entrez
• EMBL (EBI): submissions and updates; queried through SRS
• DDBJ (CIB, NIG): submissions and updates; queried through getentry

HTG Division: High Throughput Genome Records
• Phase 1: Acc = AC008701, gi = 6601005 (HTG)
• Phase 2: Acc = AC008701, gi = 6671909 (HTG)
• Phase 3: Acc = AC008701, gi = 7328720 (PRI)
• Record length: 40,000 to > 350,000 bp
• Figures shown: 2.88 Gbp and 2,851,330,913

Gene prediction
• Easy for prokaryotes (single-celled)
– one gene, one protein
• More difficult for eukaryotes (often multicellular)
– one gene, many proteins
• Very difficult for human
– short exons separated by long non-coding introns

Gene recognition
• Coding and non-coding regions have different sequence profiles
– the coding region is “protected” from mutation and is less random
• Gene recognition by sequence alignment
• Gene prediction by a Hidden Markov Model trained on a set of known genes
• Many genes are homologs
– similar in vastly different organisms

Two predictions disagree
John B. Hogenesch, et al. Cell, Vol. 106, 413–415, August 24, 2001:
“…predicted transcripts collectively contain partial matches to nearly all known genes, but the novel genes predicted by both groups are largely non-overlapping”

The Human Genome
Human genome content
• Total length: 3000 Mb
• ~40,000 genes (coding sequences)
• Gene sequences: < 5%
– Exons: ~1.5% (coding)
– Introns: ~3.5% (non-coding)
• Intergenic regions (junk): > 95%
• Repeats: > 50%

Global properties
• Pericentromeric and subtelomeric regions of chromosomes are filled with large, recent transposable elements
• Marked decline in the overall activity of transposable elements (transposons)
• Male mutation rate is about twice the female rate
– most mutation occurs in males
• Recombination rates are much higher in distal regions of chromosomes and on shorter chromosome arms
– more than one crossover per chromosome arm in each meiosis

Interspersed repeats: fixed transposable elements copied to non-homologous regions.
[Fig 17: classes of transposable elements; total 45%. LINE, long interspersed element. SINE, short interspersed element.]
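Returning to the gene prediction slide: the "one gene, one protein" case of prokaryotes can be approximated by a simple scan for open reading frames (ORFs). The sketch below is a hedged illustration, not the method of any gene-prediction program named in this presentation. The function name `find_orfs` and the `min_codons` threshold are assumptions, and only the forward strand is scanned.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end) positions of ATG...stop ORFs on the forward strand.

    `end` is exclusive and includes the stop codon, so seq[start:end]
    is the full ORF. `min_codons` counts the stop codon as well.
    Illustrative only: no reverse strand, no introns, no codon statistics.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)


dna = "CCATGAAATTTGGGTAACC"
for start, end in find_orfs(dna):
    print(start, end, dna[start:end])
```

A scan like this breaks down on human genes because the coding sequence is split into short exons separated by long introns, so a contiguous ORF rarely spans a whole gene. That is one reason the slide lists alignment and trained Hidden Markov Models as the approaches for eukaryotes.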
Genes are sometimes protected from repeats
[Fig 21: two regions of about 1 Mb on chromosomes 2 and 22. Red bars, interspersed repeats; blue bars, exons of known genes.]
Note the deficit of repeats in the HoxD cluster, which contains a collection of genes with complex, interrelated regulation.

Important features of the human proteome
• 30,000–40,000 protein-coding genes
• Proteome (full set of proteins) more complex than those of invertebrates
– pre-existing components arranged into richer architectures
• Hundreds of genes seem to come from horizontal transfer from bacteria (questionable)
• Dozens of genes seem to come from transposable elements

Noncoding RNA genes
• Transfer RNAs (tRNAs): adaptors that translate the triplet code of RNA into the amino acid sequence of proteins
• Ribosomal RNAs (rRNAs): components of the ribosome
• Small nucleolar RNAs (snoRNAs): RNA processing and base modification in the nucleolus
• Small nuclear RNAs (snRNAs): components of spliceosomes

Human races have similar genes
• Genome sequencing centers have sequenced significant portions of at least three races
• The range of polymorphisms within a race can be much greater than the range of differences between any two individuals of different races
• Very few genes are race specific

[Figure: Genome sizes (megabases). First panel: E. coli, yeast, worm, fly, Fugu, human. Second panel: fly, Fugu, human, wheat, amoeba.]
[Fig 35a: size distributions of exons in human, worm and fly. Humans have shorter exons.]
[Fig 35c: size distributions of introns in human, worm and fly. Humans have longer introns.]

• Complexity of the proteome increases from yeast to humans
– More genes
– Shuffling, increase, or decrease of functional modules
– Alternative RNA splicing: humans exhibit significantly more
– Chemical modification of proteins is higher in humans

Combinatorial strategies
• At the DNA level: T-cell receptor genes are encoded by a multiplicity of gene segments (Fig. 10.21)
• At the RNA level: splicing of exons in different orders

Yeast
• 70 human genes are known to rescue mutations in yeast
• Much of what we know about the cell cycle and cancer comes from studies of yeast
• Advantages:
– fewer genes (6000)
– few introns
• 31% of yeast genes give the same products as their human homologues

Drosophila
• Much of what we know about how mutations affect gene function comes from Drosophila studies
• We share 50% of their genes
• 61% of genes mutated in 289 human diseases are found in fruit flies
• 68% of genes associated with cancers are found in fruit flies
• Knockout mutants
• Homeobox genes

C. elegans
• 959 somatic cells in the adult hermaphrodite
• 131 cells are programmed for apoptosis during development
• Apoptosis is involved in several human neurological disorders
– Alzheimer's
– Huntington's
– Parkinson's

Mouse
• Known as “mini” humans
• Very similar physiological systems
• We share 90% of their genes

Questions remain about the human genome
• It is difficult to estimate the number of genes precisely at this time
– Small genes are hard to identify
– Some genes are rarely expressed and do not have normal codon usage patterns, so they are hard to detect

Impact of the human genome on the biomedical domain

Applications to medicine and biology
• Disease genes: human genomic sequence in public databases allows rapid identification of disease genes in silico
• Drug targets: the pharmaceutical industry has depended upon a limited set of drug targets to develop new therapies; new targets can now be found in silico
• Basic biology: basic physiology, cell biology, …

Modes of inheritance
• X-linked inheritance
• Autosomal dominant inheritance
• Autosomal recessive inheritance
[Figure: pedigrees with genotype labels (MM, Mm, mm; A1A1, A1A2, A2A2)]

Point mutations
• Creation of a stop codon: CAG (Gln) to TAG (stop)

Positional cloning of genes
[Figure: two routes connecting disease, chromosomal localisation, gene, and function/protein; in positional cloning the gene is reached through its chromosomal localisation before its function is known]
• Search for families
– determination of the phenotype
– collection of DNA
– cytogenetic abnormality
• Genetic mapping
– chromosomal localisation
– fine localisation
• Physical mapping and isolation of specific clones
• Isolation of the gene(s)
• Search for mutations
– normal: ... CCT GAG GAG ... (... Pro Glu Glu ...)
– mutated: ... CCT GTG GAG ... (... Pro Val Glu ...)
• Functional study
1 to 10 years!

EYA1 gene structure, Branchio-Oto-Renal syndrome
[Figure: (a) genomic map, (b) and (c) exon structure of EYA1, exons 1 to 16 / I to XV]

From in vivo to in vitro to in silico
[Figure: the positional cloning steps listed above, repeated]

Penetrance problem under dominant inheritance
[Figure: pedigree of family EBDD-I, generations I to V, with marker haplotypes; M and m mark the alleles at the disease locus]
• Individual 1: genotype G1, affected
• Individual 2: genotype G1, healthy. Why?
Disease with incomplete penetrance and variable expressivity. Possible explanations:
• Environment?
• Alternative splicing
• Nonsense-mediated mRNA decay
• Post-transcriptional regulation mechanisms
• Modifier genes (G2, G3)

Complex / common disorders: multifactorial
• Environmental factors
• Genetic factors

Complex diseases: genes and environment
[Figure: environmental effect versus genetic component]

The potential benefits of identifying genes and variations involved in disease
• Predisposition
• Improved understanding of disease etiology and mechanism
• Early disease risk assessment
• Discovery of new drug targets
• Disease prevention
• Population or ethnic group variability
• Targeted screening
• Prevention, diagnosis, therapy
• Predictive medicine

Pharmacogenomics: the promise of personalized medicine
[Cartoon (“O GOD!”). Credit: Joe Sutliff, Science, 2001]

Acknowledgement: this presentation has been prepared on the basis of Internet resources.
• International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).
• Venter, J. C. et al. The sequence of the human genome. Science 291, 1304–1351 (2001).
• International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature 431, 931–945 (2004).

Thank you