Bioinformatics: Definitions, Challenges and Impact on Health Care Systems
Joyce Mitchell, Ph.D.
University of Utah
Sept 29, 2005
NLM’s Woods Hole Informatics Course

Outline for Talk
1. What is Bioinformatics?
2. Health Informatics compared to Bioinformatics
3. Problems considered in Bioinformatics
   • Genomics, proteomics, transcriptomics, etc.
4. Genomics data and patient care
5. Impact of Bioinformatics on Health Information Systems

Central Dogma of Molecular Biology
DNA (replication) → transcription → RNA → translation → Protein → Phenotype
This happens in cells.

1. What is Bioinformatics?
Definitions first

NIH Working Definition
Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.
http://www.bisti.nih.gov/CompuBioDef.pdf

Another Definition
An interdisciplinary area at the intersection of biological, computer, and information sciences necessary to manage, process, and understand large amounts of data, for instance from the sequencing of the human genome, or from large databases containing information about plants and animals for use in discovering and developing new drugs.
www.isye.gatech.edu/~tg/publications/ecology/eolss/node2.html

Another Definition
NCBI (National Center for Biotechnology Information)
Bioinformatics is the field of science in which biology, computer science, and information technology merge into a single discipline. The ultimate goal of the field is to enable the discovery of new biological insights and to create a global perspective from which unifying principles in biology can be discerned. There are sub-disciplines in bioinformatics.
http://www.ncbi.nlm.nih.gov/About/primer/bioinformatics.html

2.
Health Informatics Compared to Bioinformatics
Same methods, different application domains

Different Areas of Strengths
Bioinformatics has much more data available on the Internet than Health Informatics
   • Much more progress on database integration across multiple data sources
Health Informatics has much more need for aggregation of national statistics
   • Much more progress on terminologies for integration of data

Bioinformatics & Health Informatics
Bioinformatics is the study of the flow of information in biological sciences.
Health Informatics is the study of the flow of information in patient care.
These two fields are on a collision course as genomics data becomes used in patient care.
Russ Altman, MD, Ph.D., Stanford Univ.

3. Problems Considered in Bioinformatics
OMES and OMICS

Omes and Omics
Genomics
   • Primarily sequences (DNA and RNA)
   • Databanks and search algorithms
Proteomics
   • Sequences (Protein)
   • Mass spectrometry, X-ray crystallography
   • Databanks, knowledge bases, terminologies
Functional Genomics (transcriptomics)
   • Microarray data
   • Databanks, analysis tools, traversal techniques
Systems Biology (metabolomics)
   • Metabolites and interacting systems (interactomics)
   • Graphs, visualization, modeling, networks of entities

Central Dogma of Molecular Biology
DNA (Genomics) → RNA (Transcriptomics) → Protein (Proteomics) → Phenotype (Functional Genetics)

Genome and Genomics
Genome – entire complement of DNA in a species
   • Both nuclear and mitochondrial/chloroplast
   • Variants among individuals
Genomics – study of the sequence, structure and function of the genome. Study of whole sets of genes rather than single genes.
Comparative genomics – study of the differences among species. Usually covers evolutionary studies of differences & conservation over time.

A Genome Database (e.g., GenBank)
Consists of long strings of DNA bases – ATCG…..
Consists of “annotations” of this database to attach meaning to the sequence data.
Example entry from GenBank:
   • http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?val=NM_000410&dopt=gb
     Hemochromatosis gene HFE

Human Genome Project
Human Genome Project - International research effort
Determine sequence of human genome and other model organisms
Began 1990, completed 2003
Next steps for ~20,000 genes
   • Function and regulation of all genes
   • Significance of variations between people
   • Cures, therapies, “genomic healthcare”

“The Human Genome Project has catalyzed striking paradigm changes in biology - biology is an information science.”
Leroy Hood, MD, PhD
Institute for Systems Biology
Seattle, Washington

Genomes In Public Databases
                               12/4/01   10/3/02   8/28/03   9/16/05
Published complete genomes:         72       104       156       297
Ongoing prokaryotic genomes:       255       316       386       737
Ongoing eukaryotic genomes:        158       218       246       526
Total:                                                          1560
http://www.genomesonline.org/

Genomics activities
Sequence the genes and chromosomes – done by breaking the DNA into parts
Map the location of various gene entities to establish their order
Compare the sequences with other known sequences to determine similarity
   • Across species, conserved sequence “motifs”
   • BLAST and its many forms
Predict secondary structure of proteins
Create large databases – GenBank, EMBL, DDBJ
Develop algorithms and similarity measures

Central Dogma of Molecular Biology
DNA (Genomics) → RNA (Transcriptomics) → Protein (Proteomics) → Phenotype: Tissues, Organs, Organisms (Functional Genetics)

Proteome and Proteomics
Proteome – the entire set of proteins (and other gene products) made by the genome.
Proteomics – study of the interactions among proteins in the proteome, including networks of interacting proteins and metabolic considerations. Also includes differences in developmental stages, tissues and organs.
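
The DNA → RNA → protein flow shown in the Central Dogma slides can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the 12-base toy sequence are hypothetical and not taken from the talk, and the lookup table is the standard genetic code (NCBI translation table 1) written in its usual compact TCAG ordering. Changing a single C to G in the toy sequence turns a histidine codon (CAT/CAU) into an aspartic acid codon (GAT/GAU).

```python
from itertools import product

# Standard genetic code in TCAG order; '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna: str) -> str:
    """Transcription: copy the coding strand into mRNA (T becomes U)."""
    return dna.upper().replace("T", "U")


def translate(rna: str) -> str:
    """Translation: read codons three bases at a time until a stop codon."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        codon = rna[i:i + 3].replace("U", "T")  # table is keyed on DNA letters
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical toy gene: Met-His-Glu followed by a stop codon.
normal_dna = "ATGCATGAGTAA"
# Same toy gene with one base changed (C -> G at the fourth base).
mutant_dna = "ATGGATGAGTAA"

print(translate(transcribe(normal_dna)))  # MHE
print(translate(transcribe(mutant_dna)))  # MDE
```

A single-base change at the DNA level is enough to change one amino acid in the product, which is why sequence databases and their annotations matter for interpreting variants.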
Protein Functions
Catalysis
Transport
Nutrition and storage
Contraction and mobility
Structural elements
   • Cytoskeleton
   • Basement membranes
Defense mechanisms
Regulation
   • Genetic
   • Hormonal
Buffering capacity

Protein Databases
SwissProt
PIR
UniProt http://www.pir.uniprot.org/
GENE http://www.ncbi.nlm.nih.gov/gene
InterPro http://www.ebi.ac.uk/interpro/
Correspond to (and derived from) Genome databases
All connected by Reference Sequences (NCBI)

Gene/Protein Database entries
HFE record in Entrez GENE (NCBI)
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?&db=gene&cmd=retrieve&dopt=Graphics&list_uids=3077

Structure & Function Determination
X-ray crystallography
Nuclear magnetic resonance spectroscopy and tandem MS/MS
Computational modeling
Sequence alignment from others
Homology modeling

Structure Databases
Contain experimentally determined and predicted structures of biological molecules
Most structures determined by X-ray crystallography, NMR
Example – MMDB molecular modeling db
http://www.ncbi.nlm.nih.gov/Structure/MMDB/mmdb.shtml
HFE Entry
   • http://www.ncbi.nlm.nih.gov/Structure/mmdb/mmdbsrv.
cgi?form=6&db=t&Dopt=s&uid=9816

Protein Interaction Databases
Record observations of protein-protein interactions in cells
Attempts to detail interactions observed in thousands of small-scale experiments described in published articles
Examples:
   • BIND: Biomolecular Interaction Network Database
   • DIP: Database of Interacting Proteins
   • MIPS: Munich Information Center for Protein Sequences
   • PRONET: Protein interaction on the Web
   • Many others, both academic and commercial

Central Dogma of Molecular Biology
DNA (Genomics) → RNA (Transcriptomics) → Protein (Proteomics) → Phenotype (Functional Genetics)

Proteome vs Transcriptome
Functional genomics (transcriptomics) looks at the timing and regulation of the gene products (both RNA and proteins)
This is different from looking at what gene products can be produced – it looks at the circumstances under which production occurs.
Involves experimental conditions.

Functional Genomics – Microarrays
Transcriptome and transcriptomics
High throughput technique designed to measure the increase in RNA (or sometimes proteins, tissues, etc) in a cell in response to an experiment.
Also called “gene expression” analysis
Microarrays called “gene chips” (although now there are protein and tissue chips)

How Do Microarrays Work?
Conceptual description:
   • Set of targets (cDNA, proteins, tissues, etc) are immobilized in predetermined positions on a substrate
   • Solution containing tagged molecules capable of binding to the targets is placed over the targets
   • Binding occurs between targets and tagged molecules.
   • Fluorescent tags allow you to visualize which targets have been bound (and tell you something about the molecules that were present in your solution).

Animation of Microarrays
http://www.bio.davidson.edu/courses/genomics/chip/chip.html

How Do Microarrays Work?
Conceptual description:
   • Set of targets (cDNA, proteins, tissues, etc) are immobilized in predetermined positions on a substrate
   • Solution containing tagged molecules capable of binding to the targets is placed over the targets
   • Binding occurs between targets and tagged molecules.
   • Fluorescent tags allow you to visualize which targets have been bound (and tell you something about the molecules that were present in your solution).

How Spotted Arrays Work
Result:
   • Spots where cDNA from the reference sample hybridized look green
   • Spots where cDNA from the experimental sample hybridized look red
   • Spots where cDNA from both samples hybridized look yellow (green+red=yellow)
   • Spots with little/no cDNA hybridized look black

Uses of Expression Profiling
Pharmaceutical research:
   • ID drug targets by comparing expression profile of drug-treated cells with those of cells containing mutations in genes encoding known drug targets
Disease Dx and Tx:
   • Distinguish morphologically similar cancers
      • DLBCL (Poulsen et al. (2005) Microarray-based classification of diffuse large B-cell lymphomas. European Journal of Haematology 74(6):453-65.)
   • Therapy potential
      • Rabson AB, Weissmann D. From microarray to bedside: targeting NF-kappaB for therapy of lymphomas. Clin Cancer Res. 2005 Jan 1;11(1):2-6.

Future Applications
Diagnostic tool to screen for infective agents
   • Chip imprinted with set of pathogenic genomes used to identify bacterial, viral, or parasite genomic material in patient’s body fluids
Diagnostic chip to check for mutations involved in drug-gene interactions.

Experimental Design (2)
A fundamental challenge of microarray experiments: underdetermined systems
Kohane IS, Kho AT, Butte AJ. Microarrays for an Integrative Genomics. (The MIT Press; Cambridge, MA; 2003), p. 11.
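
The colour readout listed under How Spotted Arrays Work (green for reference only, red for experimental only, yellow for both, black for neither) can be sketched as a simple rule on the two channel intensities. This is a minimal sketch under stated assumptions: the function name, the background cutoff of 100 intensity units and the two-fold threshold are hypothetical choices for illustration, not values from the talk or from any particular scanner software.

```python
import math


def spot_colour(reference: float, experimental: float,
                background: float = 100.0, fold: float = 2.0) -> str:
    """Classify one spot from its two channel intensities.

    reference    -- intensity of the reference-sample channel (shown as green)
    experimental -- intensity of the experimental-sample channel (shown as red)
    background   -- assumed cutoff below which a channel counts as 'no signal'
    fold         -- assumed fold-change needed to call a spot red or green
    """
    # Little or no cDNA hybridized from either sample.
    if reference < background and experimental < background:
        return "black"

    # log2 ratio of experimental to reference; +1 avoids division by zero.
    log_ratio = math.log2((experimental + 1.0) / (reference + 1.0))
    threshold = math.log2(fold)

    if log_ratio >= threshold:
        return "red"      # mostly experimental sample bound
    if log_ratio <= -threshold:
        return "green"    # mostly reference sample bound
    return "yellow"       # both samples bound at similar levels


# Hypothetical intensities for four spots: (reference, experimental).
spots = [(50, 60), (200, 1600), (1600, 200), (1000, 1100)]
for ref, exp in spots:
    print(ref, exp, spot_colour(ref, exp))
```

Because an experiment like this yields thousands of such ratios from only a handful of samples, the analysis is underdetermined in the sense the Experimental Design slide describes.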
MGED
Microarray gene expression data

MIAME
“Standards for minimum data to be exchanged”
minimum information that should be reported about a microarray experiment to enable its unambiguous interpretation and reproduction
http://www.mged.org/Workgroups/MIAME/miame.html

MAGE
“Standards for format of messages to exchange the data”
a standard transmission format for microarray experiment data
http://www.mged.org/Workgroups/MAGE/mage.html

Public Microarray Data Repositories
Major public repositories:
GEO (NCBI)
   • http://www.ncbi.nlm.nih.gov/geo/
ArrayExpress (EBI)
   • http://www.ebi.ac.uk/arrayexpress/

Standards and Repositories
Brazma, A, et al. Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nature Genetics. 2001 Dec;29(4):373.
http://www.nature.com/cgitaf/DynaPage.taf?file=/ng/journal/v29/n4/full/ng1201365.html
Ball, CA, et al. Submission of Microarray Data to Public Repositories. PLoS Biology. 2004 September; 2 (9): e317
http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pubmed&pubmedid=15340489

Controlled Vocabularies
Genomics, proteomics, and especially microarray techniques have created a large need for controlled vocabularies to assist the analyses across multiple entities & species.
Taxonomy – systematic classification of objects according to relationships.
Ontologies –
   • An organizational framework for concepts

Controlled Vocabularies in Bioinformatics
The Gene Ontology http://www.geneontology.org/
   • Knowledge capture (the ontology itself)
   • Annotation of gene products (for comparisons)
The MGED Ontology (arising from MIAME)
   • http://mged.sourceforge.net/
   • Annotation of microarray experiments for public repositories
Clinical Bioinformatics Ontology:
   • Annotation of gene tests in electronic medical records
   • http://www.cerner.com/cbo
MIAPE from Proteomics Standards Initiative (PSI)
   • http://psidev.sourceforge.net/

4.
Genomics Data and Patient Care
From genotype to phenotype

Bioinformatics and Patient Care
Understanding a person’s genome ushers in the era of “Personalized Medicine”
Obviously you should keep track of health-related genetic data in the EMR.
The 9-11 disaster showed you need to know the genomic variant information as well.
   • Cash et al. Forensic bioinformatics in the wake of the World Trade Center Disaster. PSB 2003:638-653.

Human Disease Gene Specifics
Genes linked to human diseases (9-2004)
+ 425 in 2 yrs
1700/20,000 = 9% of loci
[Bar chart: number of loci by year, 2002 to 2004]

Genetic Medicine is not new
Karl Landsteiner started genetic medicine over 100 years ago (1903)
Blood transfusions worked off the ABO blood group system.
Landsteiner got the Nobel Prize in 1930 for his work.
http://nobelprize.org/medicine/laureates/1930/landsteiner-bio.html

Genomic Medicine is New
What to do with all of this genetic information and every person being unique?
And the information about genetic conditions is available on the Internet.

Genomics Data and Patient Care
Where do you find the data for genes causing human diseases?
What do you do with genetic data in electronic medical records?

Where do you find the data for genes causing human diseases?
Study on availability of genetic data on health implications of the HGP.
   • Mitchell, McCray, Bodenreider. Methods Inf Med 2003; 42:557-63.

Questions
What genes cause the condition?
What is the normal function of the gene?
What mutations have been linked to diseases?
How does the mutation alter gene function?
What laboratories are performing DNA tests?
Are there gene therapies or clinical trials?
What names are used to refer to the genes and the diseases?
What other conditions are linked to these same genes?
You can find the answers online … but it is not easy; answers in many places
Can’t navigate by gene names - must use hot links and numeric identifiers
The number and function of alternate forms of the protein are inconsistently reported
Synonymy (many names, same meaning) and polysemy (same name, different meanings) cause confusion
Upper and lower case are used for species distinctions

Major Challenges of Navigation
Complexity of data
Dynamic nature of the data
Diverse foci and number of data/knowledge base systems
Data and knowledge representation lack standards
Can navigate if you know what you are looking for.

Genetics Home Reference
Consumer health resource to help the public navigate from phenotype to genotype.
Focus on health implications of the Human Genome Project.
http://ghr.nlm.nih.gov
Mitchell, Fun, McCray, JAMIA, 2004 Nov 11(6):439-447

Hands-on with GHR
Scavenger hunt with hemochromatosis and the genes that influence it.
Explore the Genetics Home Reference by answering the following questions.
Start at http://ghr.nlm.nih.gov .

GHR Scavenger Hunt
How common is hemochromatosis?
How many genes have been proven to be involved in hemochromatosis when the genes are mutated? What are the symbols for these genes?
Can you find the link to MedlinePlus with health information on hemochromatosis?

GHR Scavenger Hunt
What are the names of the patient support associations for hemochromatosis?
One synonym for this condition is “bronze diabetes”. Can you find a reason for this?
What kind of damage is done to the liver of people with hemochromatosis?

GHR Scavenger Hunt
For the genes involved in hemochromatosis, how many of them are available as a DNA test? Give one place where you would choose to send a tissue sample for DNA testing.
What sites are listed under “Research Resources” for the TFR2 gene?
   • How many alternately spliced proteins for TFR2?
   • In what tissues is this gene expressed?
GHR Scavenger Hunt
How do people inherit hemochromatosis?
Do the genes involved in hemochromatosis cause other health conditions when they are mutated?
Can you find a protein sequence for one of the genes?
What clinical trials are available for hemochromatosis patients close to where you live?

5. Impact of Bioinformatics on Health Information Systems
Electronic Medical Record
Public Health Systems

Genetics is Impacting Medicine Today!
1700 genes & health conditions
> 1100 gene tests for diagnosis
Relate to diagnosis, therapy, drug dosage, occupational hazards, reproductive plans, health risks, ….

Well-known Examples
Pharmacogenetics:
   • CYP450 alleles: exaggerated, diminished or ultrarapid drug responses. E.g., warfarin: 93% of patients are OK on standard doses; 7% of patients have severe hemorrhage. CYP2C9*2 and CYP2C9*3 most severe of 6 known mutations.
Environmental susceptibility
   • Sickle Cell trait carrier and malaria parasite
Nutrition
   • PKU and avoidance of phenylalanine

Another Example: Iressa (gefitinib)
Non-small cell lung CA ~ 140,000 pt/yr
Iressa (AstraZeneca) causes remission in 1 of 10 patients if taken daily for life.
Iressa efficacy correlates with EGFR mutation in the tumor.
Now have gene testing for EGFR so can target appropriate people.
http://www.sciencemag.org/cgi/content/full/305/5688/1222a
BUT – AstraZeneca can’t make money on only 14,000 per year.
http://www.ncbi.nlm.nih.gov/entrez/dispomim.cgi?id=131550

Collie Dog Example
Collies are more sensitive to the antiparasitic drug ivermectin, to loperamide (Imodium) and to other drugs
75% of collies in US have a mutation in the mdr1 gene causing multiple drug sensitivity (50 drugs).
Can cause death or neurological damage.
Now have testing available.
http://www.wral.com/money/3565592/detail.html

Implications for Health Care System
More gene tests will be ordered. [reports of 300% increase in gene tests in 2003.]
   • Arch Pathol Lab Med – 2004, 128(12):1330-1333
The FDA will regulate panels of tests.
   • http://www.fda.gov/bbs/topics/news/2004/new01149.html
Non-discrimination laws for insurance and employment would open a floodgate.
Preventive healthcare will play a larger part.
Environmental risk factors dictate OSHA-type approach to worker empowerment and education about safe behavior

Example: Hemochromatosis
2 copies of mutated HFE gene - too much iron absorbed from diet, which accumulates.
Causes arthritis, liver disease, diabetes, skin discoloration.
   • (1 million people in US)
HFE gene regulates the storage, transport and absorption of iron
Labs doing gene tests use different techniques: full sequence vs limited analysis

A Portion of the HFE DNA Sequence
ATGGGCCCGCGAGCCAGGCCGGCGCTTCTCCTCCTGATGCTTTTGCAGA
CCGCGGTCCTGCAGGGGCGCTTGCTGCGTTCACACTCTCTGCACTA
CCTCTTCATGGGTGCCTCAGAGCAGGACCTTGGTCTTTCCTTGTT
TGAAGCTTTGGGCTACGTGGATGACCAGCTGTTCGTGTTCTATGATCA
TGAGAGTCGCCGTGTGGAGCCCCGAACTCCATGGGTTTCCAGTAGAA
TTTCAAGCCAGATGTGGCTGCAGCTGAGTCAGAGTCTGAAAGGGT
GGGATCACATGTTCACTGTTGACTTCTGGACTATTATGGAAAATCACAA
CCACAGCAAGGAGTCCCACACCCTGCAGGTCATCCTGGGCTGTGAA
ATGCAAGAAGACAACAGTACCGAGGGCTACTGGAAGTACGGGTAT
GATGGGCAGGACCACCTTGAATTCTGCCCTGACACACTGGATTGGAG
AGCAGCAGAACCCAGGGCCTGGCCCACCAAGCTGGAGTGGGAAAG
GCACAAGATTCGGGCCAGGCAGAACAGGGCCTACCTGGAGAGGGAC
TG

A Portion of the HFE DNA Sequence
GCACAAGATTCGGG
GGACAAGATTCGGG
His: CAU and CAC
Asp: GAU and GAC
A Mutation in position 225 – changes C to G. Changes a part of the protein.
(histidine to aspartic acid at position 63)

Amino Acid Sequence for HFE
MGPRARPALLLLMLLQTAVLQGRLLRSHSLHYLFMGASEQ
DLGLSLFEALGYVDDQLFVFYD H D ESRRVEP
RTPWVSSRISSQMWLQLSQSLKGWDHMFTVDFWTIME
NHNHSKESHTLQVILGCEMQEDNSTEGYWKYGY
DGQDHLEFCPDTLDWRAAEPRAWPTKLEWERHKIRAR
QNRAYLERDCPAQLQQLLELGRGVLDQQVPPLV
KVTHHVTSSVTTLRCRALNYYPQNITMKWLKDKQPMDA
KEFEPKDVLPNGDGTYQGWITLAVPPGEEQRY
TCQVEHPGLDQPLIVIWEPSPSGTLVIGVISGIAVFVVILFI
GILFIILRKRQGSRGAMGHYVLAERE
His63Asp in ONE chromosome
Cys282Tyr in ONE chromosome (not shown)

Report Back from Full Sequence Lab
Reference sequences for transcript variant 1 for the HFE gene.
NM_000410 ; NP_000401
Consensus CDS (CCDS) CCDS4578.1
Mutant phenotype changes:
   • His63Asp; Cys282Tyr (2 mutations)
Polymorphisms noted:
   • AA position 59 VAL53MET [157GA (freq 5%)]

Special health concerns HFE
For person with dx:
For family members:

Dilemmas
The reference sequence ties you to external data sources that change
The protein has eleven transcript variants
Mutant phenotype is noted as an amino acid change
Polymorphisms are noted as nucleotide change
These results have implications for other family members in addition to the patient

What Should You Store in the EMR?
Do you put the DNA sequence for the gene into the EMR? Where do you put it?
Do you just store meta-data about the DNA sequence?
   HFE test abn or (his63asp; cys282tyr)
What about the normal variants?
If you don’t store the sequence, what do you do when the reference sequence changes?
How do you trigger alerts and reminders? And for what?
People with hemochromatosis need special screening and check-ups.

Genetic data in electronic medical records?
Implications for component systems:
   • Laboratory
   • Pharmacy
   • Computerized order entry
   • Documentation and notes
Knowledge management
   • Alerts and reminders
   • Finding patients matching profiles
   • Practice guidelines and clinical trials

Genome Data and Other Information Systems
Genomic information will be pervasive in all healthcare information systems.
Also in public health systems
   • Newborn screening
   • Tissue and organ banks
   • DOD requires DNA samples
   • Bioterrorism and homeland security
   • Identification of World Trade Center victims
Privacy and security issues will remain with us always but are manageable.

Summary
Informatics is the enabler of personalized, genomic medicine.
Personalized medicine requires a combination of medical informatics and applied bioinformatics (and a lot more).

Informatics will be a very dynamic discipline for eons to come!
Your week at Woods Hole is the first step to an exciting future.

The End
Joyce Mitchell, PhD
University of Utah