BME355 GENOMIC SEQUENCE ANALYSIS
COURSE GUIDE
STUDY GUIDE (5CU)

Course Development Team
Head of Programme : Assoc Prof Ooi Chui Ping
Course Developer(s) : Dr. Lakshmi V. Madabusi, Dr. LIN Feng
Production : Educational Technology & Production Team

© 2020 Singapore University of Social Sciences. All rights reserved.

No part of this material may be reproduced in any form or by any means without permission in writing from the Educational Technology & Production, Singapore University of Social Sciences.

Educational Technology & Production
Singapore University of Social Sciences
461 Clementi Road
Singapore 599491

Release V1.3

CONTENTS

COURSE GUIDE
1. Welcome
2. Course Description and Aims
3. Learning Outcomes
4. Learning Material
5. Assessment Overview
6. Course Schedule
7. Learning Mode

STUDY UNIT 1 INTRODUCTION TO MOLECULAR GENETICS
Learning Outcomes
Overview
Chapter 1 Introduction to Molecular Genetics
Summary
References

STUDY UNIT 2 MOLECULAR GENETICS AND DATABASES
Learning Outcomes
Overview
Chapter 2 Molecular Genetics and Databases
Summary

STUDY UNIT 3 PAIRWISE SEQUENCE ALIGNMENT
Learning Outcomes
Overview
Chapter 3 Pairwise Sequence Alignment
Summary

STUDY UNIT 4 DATABASE SEARCHING WITH BLAST AND FASTA
Learning Outcomes
Overview
Chapter 4 Database Searching with BLAST and FASTA
4.1 Basic Concepts, BLAST Searches and Interpretation of Results
Summary

STUDY UNIT 5 ADVANCED BLAST SEARCHING
Learning Outcomes
Overview
Chapter 5 Advanced BLAST Searching
Summary

STUDY UNIT 6 MULTIPLE SEQUENCE ALIGNMENT
Learning Outcomes
Overview
Chapter 6 Multiple Sequence Alignment (MSA)
Summary

BME355 COURSE GUIDE

1. Welcome

(Access video via iStudyGuide)

Welcome to the course BME355 Genomic Sequence Analysis, a 5 credit unit (CU) course.

This Study Guide will be your personal learning resource to take you through the course learning journey. The guide is divided into two main sections – the Course Guide and Study Units.

The Course Guide describes the structure for the entire course and provides you with an overview of the Study Units. It serves as a roadmap of the different learning components within the course. This Course Guide contains important information regarding the course learning outcomes, learning materials and resources, assessment breakdown and additional course information.

2. Course Description and Aims

BME355 Genomic Sequence Analysis and BME356 Functional Genomics form the bioinformatics specialization in your BSBE degree program. The two courses are delivered consecutively in the semester and students are encouraged to complete both within the same semester.

The publication of the draft human genome sequence in 2001, the development of high-throughput genomic sequencing and the reduced costs of computation have acted as catalysts for applying genomic sequence information in medicine and other fields.
The overall aim of the course is to understand and evaluate how genomic sequences are utilized in research and medicine. Topics include an introduction to molecular genetics and genomic databases, and computational methods and algorithms for analysing and disseminating genomic information.

Course Structure

This course is a 5-credit unit course presented over 6 weeks.

There are six Study Units in this course. The following provides an overview of each Study Unit.

Study Unit 1 – Introduction to Molecular Genetics

This unit introduces the background and coverage of computational biology, and bioinformatics in general. It defines the disciplines of bioinformatics, genomics and functional genomics. Three perspectives (the cell, the organism and the tree of life) are used to summarize the subject of bioinformatics, covering the Central Dogma of molecular biology, cellular processes, the genetic code, model organisms, the coding vs. non-coding paradox and the tree of life. A consistent example of a gene and its corresponding protein product, Retinol-Binding Protein, is introduced to illustrate the use of various computational tools throughout the course.

Study Unit 2 – Molecular Genetics and Databases

This unit introduces the main concepts of cellular biology and molecular biology, as well as biological databases for computational biology. It discusses how these databases are organized to store data and what strategies are used to extract information from them. Three publicly accessible databases store large amounts of nucleotide and protein sequence data: GenBank at the National Center for Biotechnology Information (NCBI), the DNA Database of Japan (DDBJ), and the EMBL database at the European Bioinformatics Institute (EBI). Five ways to access DNA and protein sequences are studied and demonstrated by examples.

Study Unit 3 – Pairwise Sequence Alignment

This unit evaluates the methods for analysing the relatedness of genes and proteins, with a focus on pairwise sequence alignment algorithms.
We adopt an evolutionary perspective in our description of how amino acids (or nucleotides) in two sequences can be aligned and compared. We then describe the algorithms and programs for global (Needleman-Wunsch) and local (Smith-Waterman) pairwise alignment.

Study Unit 4 – Database Searching with BLAST and FASTA

This unit introduces the Basic Local Alignment Search Tool (BLAST), the main NCBI tool for comparing a query sequence to other sequences in various databases, as well as FASTA. BLAST and FASTA are rapid, heuristic versions of pairwise alignment algorithms. The steps of the BLAST and FASTA search processes are described. Strategies for BLAST database searching are discussed with examples.

Study Unit 5 – Advanced BLAST Searching

BLAST searches can be very versatile. This unit further explores advanced BLAST searching techniques. We begin with an overview of the specialized BLAST resources and websites. We then focus on finding distantly related proteins with Position-Specific Iterated BLAST (PSI-BLAST) and significant pattern matches with Pattern-Hit Initiated BLAST (PHI-BLAST). Finally, the use of BLAST for gene discovery is illustrated.

Study Unit 6 – Multiple Sequence Alignment

This unit considers how relationships among multiple biological sequences are established. By introducing sequences into a multiple alignment, we can define members of a gene or protein family. If we know a feature of one of the proteins and identify the homologous proteins, we can predict that they may have a similar function. Basic concepts and practical strategies of multiple sequence alignment are studied. Databases of multiple sequence alignments are introduced. Two main multiple sequence alignment programs are closely examined.

3. Learning Outcomes

Knowledge & Understanding (Theory Component)

By the end of this course, you should be able to:
- Demonstrate competence in the basic concepts of molecular genetics and computational biology.
- Discuss the genomic sequence organization and select specific genomic sequence data using GenBank, Ensembl, etc.
- Examine the various scoring matrices used for protein/DNA alignment and evaluate global vs. local sequence alignment tools used in studying evolution.
- Assemble the target sequences in genomic databases using the homology score matrices and the heuristic search tools.
- Evaluate various types of multiple sequence alignment algorithms and global genomic analysis tools to formulate a solution for a research problem using these tools.

Key Skills (Practical Component)

By the end of this course, you should be able to:
- Solve problems using multiple genomic tools (online) introduced in this course to answer a complex research question.

4. Learning Material

The following is a list of the required learning materials to complete this course.

Required Textbook(s)
Jonathan Pevsner, Bioinformatics and Functional Genomics (2009). John Wiley & Sons Inc.

5. Assessment Overview

The overall assessment weighting for this course is as follows:

Assessment      Description           Weight Allocation
Assignment 1    Online Quiz (OCAS)    15%
Assignment 2    Online Quiz (OCAS)    15%
Examination     ECA                   70%
TOTAL                                 100%

The following section provides important information regarding Assessments.

Continuous Assessment: Assignments 1 and 2 are online quizzes weighted at 15% each, 30% in total. Assignments 1 and 2 combined constitute 100% of OCAS.

Examination: The final examination is an End-of-Course Assessment (ECA) and constitutes 100% of the ECA component.

Passing Mark: To pass the course you need to achieve a score of at least 40% in each component. Your overall rank score is the weighted average of all components. For detailed information on the Course grading policy, please refer to The Student Handbook (‘Award of Grades’ section under Assessment and Examination Regulations). The Student Handbook is available from the Student Portal.
Non-graded Learning Activities: Activities for the purpose of self-learning are present in each study unit. These learning activities are meant to enable you to assess your understanding and achievement of the learning outcomes. The activities can take the form of quizzes, review questions, application-based questions or similar. You are expected to complete the suggested activities independently and/or in groups, as required for each activity.

6. Course Schedule

To help monitor your study progress, you should pay special attention to your Course Schedule. It contains study unit related activities including Assignments, Self-assessments, and Examinations. Please refer to the Course Timetable in the Student Portal for the updated Course Schedule.

Note: You should always make it a point to check the Student Portal for any announcements and latest updates.

7. Learning Mode

The learning process for this course is structured along the following lines of learning:
(a) Self-study guided by the study guide units. Independent study will require at least 3 hours per week.
(b) Working on assignments, either individually or in groups.
(c) Face-to-face/online sessions (3 hours each session, 6 sessions in total).

iStudyGuide

You may be viewing the iStudyGuide version, which is the mobile version of the Study Guide. The iStudyGuide is developed to enhance your learning experience with interactive learning activities and engaging multimedia. Depending on the reader you are using to view the iStudyGuide, you will be able to personalize your learning with digital bookmarks, note-taking and highlighting of sections of the guide.

Interaction with Instructor and Fellow Students

Although flexible learning – learning at your own pace, space and time – is a hallmark at SUSS, you are encouraged to engage your instructor and fellow students in online discussion forums.
Sharing of ideas through meaningful debates will help broaden your learning and crystallize your thinking.

Academic Integrity

As a student of SUSS, you are expected to adhere to the academic standards stipulated in The Student Handbook, which contains important information regarding academic policies, academic integrity and course administration. It is necessary that you read and understand the information stipulated in the Student Handbook prior to embarking on the course.

STUDY UNIT 1
INTRODUCTION TO MOLECULAR GENETICS

BME355 STUDY UNIT 1

Learning Outcomes

By the end of this unit, you should be able to:
1. Develop a working knowledge of the terms involved in molecular biology and computational biology
2. Differentiate between regulation at the organism level vs. cellular and gene level
3. Illustrate uses of genetic sequence information in studies of evolution
4. Demonstrate knowledge of the cell components, cell cycle, and basics of molecular genetics and molecular biology
5. Contrast genomic tools of the pre- vs. post-Human Genome Project (HGP) era
6. Use key molecular biology tools (PCR, sequencing, others) to propose solutions to a molecular biology research problem
7. Compile a list of the main databases and internet resources used for genomic sequence analysis

You can refer to Chapter 1 of the textbook. All the figures and tables in this unit are from the recommended textbook by Jonathan Pevsner.

Overview

This unit introduces the background and coverage of computational biology, and bioinformatics in general. It defines the disciplines of bioinformatics, genomics and functional genomics. Three perspectives (the cell, the organism and the tree of life) are used to summarize the subject of bioinformatics, covering the Central Dogma of molecular biology, cellular processes, the genetic code, model organisms, the coding vs. non-coding paradox and the tree of life.
A consistent example of a gene and its corresponding protein product, Retinol-Binding Protein, is introduced to illustrate the use of various computational tools throughout the course.

Chapter 1 Introduction to Molecular Genetics

1.1 Definition

Bioinformatics focuses on the use of computer databases and computer algorithms to analyse proteins, genes, and the complete collection of deoxyribonucleic acid (DNA) that comprises an organism (the genome). A major challenge in biology is to make sense of the enormous quantities of sequence data and structural data that are generated by genome-sequencing projects, proteomics, and other large-scale molecular biology efforts. The tools of bioinformatics include computer programs that help to reveal fundamental mechanisms underlying biological problems related to the structure and function of macromolecules, biochemical pathways, disease processes, and evolution.

According to a National Institutes of Health (NIH) definition, bioinformatics is:

    The research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioural or health data, including those to acquire, store, organise, analyse, or visualise such data.

Such a definition is also used for the closely related discipline of computational biology, although more accurately, computational biology is the development and application of data-analytical and theoretical methods, mathematical modelling and computational simulation techniques to the study of biological, behavioural, and social systems. In many textbooks and references, the two terms “bioinformatics” and “computational biology” are used interchangeably.

Genes, Genomes and Genomics
(Access video via iStudyGuide)

1.2 Introduction and Molecular Genetics

We can summarise the entire field of bioinformatics with three perspectives. The first perspective on bioinformatics is the cell (depicted in the picture below).
The central dogma of molecular biology is that DNA is transcribed into RNA, and RNA is translated into protein. The focus of molecular biology has been on individual genes, messenger RNA (mRNA) transcripts, and proteins. A focus of the field of bioinformatics is the complete collection of DNA (the genome), RNA (the transcriptome), and protein sequences (the proteome) that have been amassed (Henikoff, 2002). These millions of molecular sequences present both great opportunities and great challenges. A bioinformatics approach to molecular sequence data involves the application of computer algorithms and computer databases to molecular and cellular biology. Such an approach is sometimes referred to as functional genomics. This typifies the essential nature of bioinformatics: biological questions can be approached from levels ranging from single genes and proteins to cellular pathways and networks or even whole genomic responses (Ideker et al., 2001). Our goals are to understand how to study both individual genes and proteins and collections of thousands of genes/proteins.

Figure 1.1 The first perspective of the field of bioinformatics is the cell

Bioinformatics has emerged as a discipline as biology has been transformed by the availability of molecular sequence data. Databases such as the European Molecular Biology Laboratory (EMBL), GenBank, and the DNA Database of Japan (DDBJ) serve as repositories for billions of nucleotides of DNA sequence data (see Chapter 2). Corresponding databases of expressed genes (RNA) and protein have been established. A main focus of the field of bioinformatics is to study molecular sequence data to gain insight into a broad range of biological problems.

From the cell we can focus on individual organisms, which represent the second perspective of the field of bioinformatics. Each multicellular organism changes during different stages of its development as its body specialises into various organs with specific functions.
For example, while we may sometimes think of genes as static entities that specify features such as eye colour or height, they are in fact dynamically regulated across time and body region and in response to physiological conditions. Gene expression varies in disease states or in response to a variety of signals, both intrinsic and environmental. Many bioinformatics tools are available to study the broad biological questions relevant to the individual:
- There are many databases of expressed genes and proteins derived from different tissues and conditions.
- One of the most powerful applications of functional genomics is the use of DNA microarrays to measure the expression of thousands of genes in biological samples.

Two notable tools that have changed how genomic information can be integrated into medicine are the Polymerase Chain Reaction (PCR) and Next Generation Sequencing (NGS). Before we learn more about NGS, the following multimedia will introduce you to basic sequencing techniques and the basic requirements for PCR.

Find Out More
Polymerase Chain Reaction: https://www.youtube.com/watch?v=eEcy9k_KsDI
DNA sequencing: https://www.youtube.com/watch?v=bEFLBf5WEtc

Figure 1.2 The second perspective of bioinformatics is the organism (panels: time of development; body region; physiology, pharmacology, pathology)

Broadening our view from the level of the cell to the organism, we can consider the individual’s genome (collection of genes), including the genes that are expressed as RNA transcripts and the protein products. Thus, for an individual organism, bioinformatics tools can be applied to describe changes through developmental time, changes across body regions, and changes in a variety of physiological or pathological states.

At the largest scale is the tree of life (figure below).
There are many millions of species alive today, and they can be grouped into the three major branches of bacteria, archaea (single-celled microbes that tend to live in extreme environments), and eukaryotes. Molecular sequence databases currently hold DNA sequence from over 100,000 different organisms. The complete genome sequences of several hundred organisms will soon become available. One of the main lessons we are learning is the fundamental unity of life at the molecular level. We are also coming to appreciate the power of comparative genomics, in which genomes are compared.

Figure 1.3 The third perspective of the field of bioinformatics is represented by the tree of life

The scope of bioinformatics includes all of life on Earth, including the three major branches of bacteria, archaea, and eukaryotes. Viruses, which exist on the borderline of the definition of life, are not depicted here. For all species, the collection and analysis of molecular sequence data allow us to describe the complete collection of DNA that comprises each organism (the genome). We can further learn the variations that occur between species and among members of a species, and we can deduce the evolutionary history of life on Earth. (After Pace, 1997) Used with permission.

A Consistent Example: Retinol-Binding Protein

Throughout the textbook we will focus on the example of a gene and its corresponding protein product: retinol-binding protein (RBP4), a small, abundant secreted protein that binds retinol (vitamin A) in blood (Newcomer and Ong, 2000). Retinol, obtained from carrots in the form of vitamin A, is very hydrophobic. RBP4 helps transport this ligand to the eye, where it is used for vision. We will study RBP4 in detail because it has a number of interesting features:
- There are many proteins that are homologous to RBP4 in a variety of species, including human, mouse, and fish (orthologs).
We will use these as examples of how to align proteins, perform database searches, and study phylogeny.
- There are other human proteins that are closely related to RBP4 (paralogs). Altogether the family that includes RBP4 is called the lipocalins, a diverse group of small ligand-binding proteins that tend to be secreted into extracellular spaces (Akerstrom et al., 2000; Flower et al., 2000). Other lipocalins have fascinating functions; examples include apolipoprotein D (which binds cholesterol), a pregnancy-associated lipocalin, aphrodisin (an “aphrodisiac” in hamsters), and an odorant-binding protein in mucus.
- There are even bacterial lipocalins, which could have a role in antibiotic resistance (Bishop, 2000). We will explore how bacterial lipocalins could be ancient genes that entered eukaryotic genomes by a process called lateral gene transfer.
- The gene expression levels of some lipocalins are dramatically regulated.
- Because the lipocalins are small, abundant, and soluble proteins, their biochemical properties have been characterised in detail. The three-dimensional protein structure has been solved for several of them by X-ray crystallography.
- Some lipocalins have been implicated in human disease.

Another molecule we will introduce is the pol (polymerase) gene of human immunodeficiency virus 1 (HIV-1). HIV presents one of the greatest public health challenges in the world today. As of the end of 2002, over 42 million people were infected and over 16 million had died. The HIV-1 genome contains just nine genes, including pol (Frankel and Young, 1998). We will examine pol throughout the book because the properties of this gene, its protein products, and the HIV-1 genome are distinct from the lipocalins.
- The pol gene encodes a multi-domain protein: a single polypeptide with several structurally and functionally distinct domains.
The pol gene encodes a protein of 1003 amino acids with reverse transcriptase activity (that is, an RNA-dependent DNA polymerase). It is also an aspartyl protease, and it has integrase activity. These multiple activities are typical of multi-domain proteins. The modular nature of the pol protein affects our ability to perform database searches and multiple sequence alignments.
- The pol gene incorporates substitutions extremely rapidly. A typical individual infected by HIV may have over a million variants of pol. The study of the evolution of pol complements our study of the lipocalins.
- Because pol is a viral protein, studying it gives us the opportunity to learn how to access bioinformatics resources relevant to studying viruses. Database searches with pol will help emphasise how to restrict searches to particular domains of the tree of life.

Model organisms: Much of the knowledge we have acquired in biology over the decades has come from experimentation in a small set of model organisms. Their shorter lifespans and the ease of manipulating their genetic material have allowed experiments that would be unethical in human subjects. By using these biological models we are able to glean information on how the human cell itself is regulated.

Model organisms in Genomics
(Access video via iStudyGuide)

Web Exercises

Often, students of bioinformatics have a particular research area of interest such as a gene, a physiological process, a disease, or a genome. It is hoped that by studying RBP4 and other specific proteins and genes throughout this book, students can simultaneously apply the principles of bioinformatics to their own research questions.

It has been helpful to complement lectures with computer labs. All the websites described in the textbook are freely available on the World Wide Web, and many of the software packages are free for academic use.
Another feature of the course is that each student is required to discover a novel gene by the last day of the course. The student must begin with any protein sequence of interest and perform database searches to identify genomic DNA that encodes a protein no one has described before. This problem is described in Chapter 5. The student then chooses the name of the gene and its corresponding protein, and describes the organism and the evidence that the gene has not been described before. Next, the student creates a multiple sequence alignment of the new protein (or gene) and a phylogenetic tree showing its relation to other known sequences.

A benefit of this exercise is that it requires a student to actively use the principles of bioinformatics. Most students choose a gene (or protein) relevant to their own research area, while others find new lipocalins.

1.3 Key Bioinformatics Websites

The field of bioinformatics relies heavily on the Internet as a place to access sequence data, to access software that is useful for analysing molecular data, and to integrate different kinds of resources and information relevant to biology. We will describe a variety of websites. Initially, we will focus on the three main publicly accessible databases that serve as repositories for DNA and protein data (Table 1.1). In Chapter 2, we begin with the National Center for Biotechnology Information (NCBI), which hosts GenBank. The NCBI website offers a variety of other bioinformatics-related tools. We will gradually introduce the European Bioinformatics Institute (EBI) web server, which hosts a complementary DNA database (EMBL, the European Molecular Biology Laboratory database). We will also introduce the DNA Database of Japan (DDBJ). The research teams at GenBank, EMBL, and DDBJ share sequence data on a daily basis. A general theme of the discipline of bioinformatics is that many databases are closely interconnected.
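Sequence records held at NCBI can also be requested programmatically rather than through the website. The following is a minimal sketch, assuming NCBI's public E-utilities `efetch` endpoint; it only builds the request URL and makes no network call. The helper name `efetch_fasta_url` is hypothetical, and the accession `NP_006735` is assumed here to be the human RBP4 protein record, so please verify it on the NCBI website before relying on it.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities "efetch" service.
EUTILS_EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(accession, db="protein"):
    """Build (but do not send) an efetch request for a FASTA record.

    `accession` is a sequence accession; `db` names the Entrez database
    to query (for example "protein" or "nuccore").
    """
    params = {
        "db": db,
        "id": accession,
        "rettype": "fasta",  # ask for FASTA-formatted output
        "retmode": "text",   # plain text rather than XML
    }
    return EUTILS_EFETCH + "?" + urlencode(params)


if __name__ == "__main__":
    # NP_006735 is an assumed accession for human RBP4 (illustrative only).
    print(efetch_fasta_url("NP_006735"))
```

Opening the printed URL in a browser, or passing it to a download tool, should return the record as FASTA text if the accession is valid.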
Table 1.1 Three Primary Bioinformatics Web Servers That Serve as Centralised Repositories for DNA and Protein Sequence Data

Throughout the course we will introduce many websites that are relevant to bioinformatics. Table 1.2 lists several additional servers that offer databases as well as many programs for the analysis of biological sequences.

Table 1.2 Additional Bioinformatics Web Servers

Table 1.3 lists several additional sites that offer links to bioinformatics resources. We present them now for those who wish to explore the types of bioinformatics resources that are currently available.

Table 1.3 Bioinformatics Sites with Useful Links

Overviews of the field of bioinformatics have been written by Mark Gerstein and colleagues (Luscombe et al., 2001) and Claverie et al. (2001). Kaminski (2000) also introduces bioinformatics, with practical suggestions of websites to visit. Russ Altman (1998) discusses the relevance of bioinformatics to medicine, while David Searls (2000) introduces bioinformatics tools for the study of genomes.

Read the following material and discuss the answers with your lecturer during class:
http://en.wikipedia.org/wiki/Translation_(biology)
a. Which enzyme attaches amino acids to their corresponding tRNA?
b. Can you name the triplet codons that act as stop codons? What does the ribosome do when it reaches a stop codon?
c. Where does translation occur in the cell?
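To experiment with the idea behind question (b), the following minimal Python sketch translates an mRNA string using the standard genetic code and stops at the first stop codon. The function name `translate` and the example sequence are illustrative assumptions and do not come from the textbook. The sketch also ignores biological details such as release factors and alternative genetic codes.

```python
from itertools import product

# Standard genetic code, listed with the bases in the order U, C, A, G
# ("*" marks a stop codon).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
STOP_CODONS = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")


def translate(mrna):
    """Translate from the first AUG up to (not including) a stop codon.

    Accepts RNA or DNA letters (T is read as U). Returns the protein in
    one-letter code, or "" if no start codon is found. Characters other
    than A, C, G, U/T raise a KeyError.
    """
    mrna = mrna.upper().replace("T", "U")
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    # Step through complete codons only; a trailing partial codon is ignored.
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break  # translation terminates at a stop codon
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(STOP_CODONS)
    print(translate("GGAUGAAAUGGUAAUU"))
```

Running the script prints the stop codons found in the table, followed by the peptide encoded between the first AUG and the first in-frame stop codon of the example sequence.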
Find Out More

Self-learn sessions (to be completed before the first lecture):
DNA and RNA: History and Structure
https://www.youtube.com/watch?v=qoERVSWKmGk&index=404&list=UUEik-U3T6u6JA0XiHLbNbOw
Central Dogma, Replication, Transcription and Translation
https://www.youtube.com/watch?v=W4mYwsr9gGE&list=UUEikU3T6u6JA0XiHLbNbOw&index=403
Cell cycle: Mitosis and Meiosis
https://www.youtube.com/watch?v=2aVnN4RePyI&list=UUEikU3T6u6JA0XiHLbNbOw
Polymerase Chain Reaction:
https://www.youtube.com/watch?v=eEcy9k_KsDI
DNA sequencing:
https://www.youtube.com/watch?v=bEFLBf5WEtc

Summary

The following key points are discussed in this unit:
- Cellular components and how they are regulated
- Introduction to PCR and DNA sequencing
- The role of model organisms and their use in genomics research
- A brief summary of the key bioinformatics websites and datasets

References

Akerstrom, B., Flower, D. R., and Salier, J. P. Lipocalins: Unity in diversity. Biochim. Biophys. Acta 1482, 1–8 (2000).
Altman, R. B. Bioinformatics in support of molecular medicine. Proc. AMIA Symp., 53–61 (1998).
Bishop, R. E. The bacterial lipocalins. Biochim. Biophys. Acta 1482, 73–83 (2000).
Boguski, M. S. Bioinformatics. Curr. Opin. Genet. Dev. 4, 383–388 (1994).
Claverie, J. M., Abergel, C., Audic, S., and Ogata, H. Recent advances in computational genomics. Pharmacogenomics 2, 361–372 (2001).
Flower, D. R., North, A. C., and Sansom, C. E. The lipocalin protein family: Structural and sequence overview. Biochim. Biophys. Acta 1482, 9–24 (2000).
Frankel, A. D., and Young, J. A. HIV-1: Fifteen proteins and an RNA. Annu. Rev. Biochem. 67, 1–25 (1998).
Goodman, N. Biological data becomes computer literate: New advances in bioinformatics. Curr. Opin. Biotechnol. 13, 68–71 (2002).
Henikoff, S. Beyond the central dogma. Bioinformatics 18, 223–225 (2002).
Ideker, T., Galitski, T., and Hood, L. A new approach to decoding life: Systems biology. Annu. Rev. Genomics Hum. Genet. 2, 343–372 (2001).
Kaminski, N. Bioinformatics. A user’s perspective. Am. J. Respir. Cell Mol. Biol. 23, 705–711 (2000).
Luscombe, N. M., Greenbaum, D., and Gerstein, M. What is bioinformatics? A proposed definition and overview of the field. Methods Inf. Med. 40, 346–358 (2001).
Newcomer, M. E., and Ong, D. E. Plasma retinol binding protein: Structure and function of the prototypic lipocalin. Biochim. Biophys. Acta 1482, 57–64 (2000).
Pace, N. R. A molecular view of microbial diversity and the biosphere. Science 276, 734–740 (1997).
Searls, D. B. Bioinformatics tools for whole genomes. Annu. Rev. Genomics Hum. Genet. 1, 251–279 (2000).

STUDY UNIT 2
MOLECULAR GENETICS AND DATABASES

BME355 STUDY UNIT 2

Learning Outcomes

Upon completion of this unit, you will be able to:
1. Indicate knowledge of the Central Dogma and its applicability post Human Genome Project
2. Differentiate between gene regulation in prokaryotic and eukaryotic cells
3. Differentiate between coding and non-coding regulatory elements in the cell
4. Discuss in vitro and in vivo analyses and the requirements for such experiments in cell biology
5. Use genetic database tools to locate DNA/protein sequences
6. Illustrate ability to work with GenBank sequences
7. Differentiate between various types of mutations and database resources for studying them
8. Illustrate the role of mutations in disease diagnosis and prognosis
9. Practise literature searches in Medline using different search strategies
10. Analyse data from disease-related databases to glean information on the mutations involved, taxonomy, structure and population-related information
11. Illustrate knowledge of the most current Next Generation Sequencing analysis tools and instrumentation

You can refer to Chapter 2 of the textbook.

Overview

This unit introduces the main concepts of cellular biology and molecular biology, as well as biological databases for computational biology.
It discusses how these databases are organised to store data and what strategies are used to extract information from them. Three publicly accessible databases store large amounts of nucleotide and protein sequence data: GenBank at the National Center for Biotechnology Information (NCBI), the DNA Database of Japan (DDBJ), and the European Bioinformatics Institute (EBI). Five ways to access DNA and protein sequences are studied and demonstrated with examples.

Chapter 2 Molecular Genetics and Databases

2.1 Cellular Biology and Molecular Genetics

All living organisms are characterised by the capacity to reproduce and evolve. The genome of an organism is defined as the collection of DNA within that organism, including the set of genes that encode proteins.

Prokaryotic & Eukaryotic Cells

Cells are the structural and functional units of all living organisms. Some organisms, such as bacteria, are unicellular: they consist of a single cell. Other organisms, such as humans, are multicellular. There are two general categories of cells: prokaryotes and eukaryotes.

The simplest cells are prokaryotic cells, which lack a nuclear membrane. Bacteria are the most studied prokaryotic organisms. Prokaryotes are unicellular organisms that do not develop or differentiate into multicellular forms. Prokaryotic cells have three architectural regions: appendages called flagella and pili (proteins attached to the cell surface); a cell envelope consisting of a capsule, a cell wall and a plasma membrane; and a cytoplasmic region that contains the cell genome, ribosomes and various sorts of inclusions.

Eukaryotes include fungi, animals, and plants as well as some unicellular organisms. The major and extremely significant difference between prokaryotes and eukaryotes is that eukaryotic cells contain membrane-bounded compartments in which specific metabolic activities take place.
Most important among these is the presence of a nucleus, a membrane-delineated compartment that houses the eukaryotic cell’s deoxyribonucleic acid (DNA). Eukaryotic organisms also have other specialised structures, called organelles, which are small structures within cells that perform dedicated functions.

Figure 2.1 Organisation of the cell and its organelles
(Picture courtesy https://en.wikipedia.org/wiki/File:Endomembrane_system_diagram_en.svg)

Eukaryotic and prokaryotic cells differ in the organisation of the nucleus. In a eukaryotic cell, a well-defined envelope encloses the DNA and chromosomes to form the nucleus; the DNA of a prokaryotic cell is not demarcated in this way. In eukaryotes, key processes such as DNA replication and transcription are carried out in the nucleus, while others, such as translation, take place in the cytoplasm. Specialised organelles such as the Golgi apparatus and mitochondria are distinct compartments that serve other unique functions in the cell.

For most unicellular organisms, reproduction is a simple matter of cell duplication, also known as replication. But for multicellular organisms, cell replication and reproduction are two separate processes. Multicellular organisms replace damaged or worn-out cells through a replication process called mitosis: the division of a eukaryotic cell nucleus to produce two identical daughter nuclei. Every time a cell divides, it must ensure that its DNA is shared between the two daughter cells. Mitosis is the process of “divvying up” the genome between the daughter cells.
Find Out More

DNA and RNA: History and Structure
https://www.youtube.com/watch?v=qoERVSWKmGk&index=404&list=UUEik-U3T6u6JA0XiHLbNbOw

Central Dogma, Replication, Transcription and Translation
https://www.youtube.com/watch?v=W4mYwsr9gGE&list=UUEikU3T6u6JA0XiHLbNbOw&index=403

Cell cycle: Mitosis and Meiosis
https://www.youtube.com/watch?v=2aVnN4RePyI&list=UUEikU3T6u6JA0XiHLbNbOw

Meiosis is a specialised type of cell division that occurs during the formation of gametes. Although meiosis may seem much more complicated than mitosis, it is really just two cell divisions in sequence, each of which closely resembles mitosis.

To reproduce, eukaryotes must first create special cells called gametes (eggs and sperm) that then fuse to form the beginning of a new organism. Gametes are just one of the many unique cell types that multicellular organisms require in order to function as a complete organism. Gametes are created through the process of meiosis, which halves the chromosome number of the organism. The sperm and egg join to make a single cell, which restores the chromosome number. This joined cell then divides and differentiates into different cell types that eventually form an entire functioning organism.

Figure 2.2 Meiosis and Mitosis pathways in the cell
(Picture courtesy http://www.accessexcellence.org/AB/GG/comparison.html)

Mitosis results in daughter cells with two sets of chromosomes (diploid). These daughter cells are exact copies of their parent. Meiosis, on the other hand, takes place in specialised reproductive cells and results in haploid cells with just one copy of each chromosome. Recombination during the formation of gamete cells enables a shuffling of the genetic information in the daughter cells. These gametes (ova or sperm) give rise to the diploid embryo after fertilisation.
All organisms suffer a certain number of small mutations, or random changes in a DNA sequence, during the process of DNA replication. These are called spontaneous mutations and occur at a rate characteristic of that organism. Genetic recombination, by contrast, refers to a large-scale rearrangement of a DNA molecule. This process involves pairing between complementary strands of two parental duplex (double-stranded) DNA molecules, and results from a physical exchange of chromosome material.

The position at which a gene is located on a chromosome is called a locus. In a given individual, one might find two different versions of the gene at a particular locus. These alternate gene forms are called alleles. Recombination results in a new arrangement of maternal and paternal alleles on the same chromosome. Although the same genes appear in the same order, the alleles are different. This process explains why offspring from the same parents can look so different.

All the different cell types in our body are derived from a single fertilised egg cell through differentiation. Differentiation is the process by which an unspecialised cell becomes specialised into one of the many cell types that make up the body, such as a heart, liver or muscle cell. During differentiation, certain genes are turned on, or activated, while other genes are switched off, or inactivated. This process is intricately regulated. As a result, a differentiated cell develops specific structures and performs certain functions.

The Central Dogma

The most fundamental property of all living things is their ability to reproduce. All cells arise from pre-existing cells; that is, their genetic material must be replicated and passed from parent cell to progeny. Likewise, all multicellular organisms inherit the genetic information specifying their structure and function from their parents.
Every organism, including humans, has a genome that contains all the biological information needed to build and maintain a living example of that organism. The biological information contained in a genome is encoded in its DNA and divided into discrete units called genes. Genes code for proteins that attach to the genome at the appropriate positions and switch on a series of reactions called gene expression.

Figure 2.3 The central dogma and flow of genetic information in the cell

Genetic information in the cell flows from DNA to RNA and then to proteins. Transcription is the process by which the genetic information is decoded into functional mRNA. Translation, which occurs in the cytoplasm, uses this message to make proteins. Proteins are the workhorses of the cell and carry out many of its structural, functional and enzymatic roles.

The Central Dogma, a fundamental principle of molecular biology, states that genetic information flows from DNA to RNA to protein. The genetic information that resides in DNA is passed from generation to generation. In the process of making a protein, the encoded information must be faithfully transmitted first to RNA and then to protein.

The process of duplicating a cell’s genome, or DNA replication, is required every time a cell divides. Replication, like all cellular activities, requires specialised proteins to carry out the job.

We see evidence of genetic variation within the same species, such as differences in hair and eye colour, skin pigment, height and blood type. These expressed, or phenotypic, traits are due to genotypic variation in a person's DNA sequence. When two individuals display different phenotypes of the same trait, they are said to have two different “alleles” of the same gene. This means that the gene's sequence is slightly different in the two individuals and the gene is said to be polymorphic.
These polymorphic sites influence gene expression, and also serve as markers for genomic research efforts.

Most genetic variation arises during the phases of the cell cycle when DNA is duplicated. Mutations in the new DNA strand can manifest as base substitutions, where a single base is replaced with another; deletions, where one or more bases are left out; or insertions, where one or more bases are added.

Mutations in Genomics (Access video via iStudyGuide)

While mutations can cause improper cell development, they also give a species the opportunity to adapt to new environments and to resist new pathogens. Mutations are what lie behind the popular saying “survival of the fittest”, which is associated with the theory of evolution by natural selection proposed by Charles Darwin in 1859.

Inside each of our cells lies a nucleus, a membrane-bounded region that provides a sanctuary for genetic information. The nucleus contains long strands of DNA that encode this genetic information. A DNA chain is made up of four chemical bases: adenine (A) and guanine (G), which are called purines, and cytosine (C) and thymine (T), referred to as pyrimidines. Each base has a slightly different composition, or combination of oxygen, carbon, nitrogen and hydrogen. In a DNA chain, every base is attached to a sugar molecule (deoxyribose) and a phosphate molecule, forming a nucleotide, the building block of a nucleic acid. Individual nucleotides are linked through the phosphate group, and it is the precise order, or sequence, of nucleotides that determines the product made from that gene.

The DNA that constitutes a gene is a double-stranded molecule consisting of two chains running in opposite directions. The chemical nature of the bases in double-stranded DNA creates a slight twisting force that gives DNA its characteristic gently coiled structure, known as the double helix.
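The three kinds of mutation described earlier in this section (base substitution, deletion and insertion) can be pictured as simple edits to a string of bases. The following is a minimal Python sketch for illustration only; the function names and the short example sequence are assumptions made here and are not taken from the course material.

```python
# Illustrative helpers (hypothetical names): each returns a mutated copy of seq.

def substitute(seq: str, pos: int, base: str) -> str:
    """Base substitution: replace the single base at index pos with another."""
    return seq[:pos] + base + seq[pos + 1:]


def delete(seq: str, pos: int, length: int = 1) -> str:
    """Deletion: leave out `length` bases starting at index pos."""
    return seq[:pos] + seq[pos + length:]


def insert(seq: str, pos: int, bases: str) -> str:
    """Insertion: add one or more bases in front of index pos."""
    return seq[:pos] + bases + seq[pos:]


original = "ATGCCCTGC"  # made-up example sequence

print(substitute(original, 3, "G"))  # ATGGCCTGC
print(delete(original, 3))           # ATGCCTGC
print(insert(original, 3, "TT"))     # ATGTTCCCTGC
```

Note that a substitution keeps the sequence length unchanged, whereas deletions and insertions shorten or lengthen it.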
The two strands are connected to each other by chemical pairing of each base on one strand to a specific partner on the other strand. Adenine (A) pairs with thymine (T), while guanine (G) pairs with cytosine (C). Thus, A-T and G-C base pairs are said to be complementary. This complementary base pairing is what makes DNA a suitable molecule for carrying our genetic information: one strand of DNA can act as a template to direct the synthesis of a complementary strand. In this way, the information in a DNA sequence is readily copied and passed on to the next generation of cells.

In the first step of replication, a special protein, called a helicase, unwinds a portion of the parental DNA double helix. Next, a molecule of DNA polymerase binds to one strand of the DNA. DNA polymerase moves along this template strand in the 3' to 5' direction, using the single-stranded DNA as a template to assemble a new complementary strand in the 5' to 3' direction. This continuously synthesised strand is called the leading strand. Because DNA synthesis can only occur in the 5' to 3' direction, a second DNA polymerase molecule binds to the other template strand as the double helix opens. This molecule synthesises discontinuous segments of polynucleotides, called Okazaki fragments. Another enzyme, called DNA ligase, is responsible for stitching these fragments together into what is called the lagging strand.

Figure 2.4 DNA replication of the leading and lagging strand
(Picture courtesy http://www.ultranet.com/~jkimball/BiologyPages/D/DNAReplication.html)

DNA replication duplicates both strands of the DNA. During this process the leading strand is synthesised continuously, while the lagging strand is synthesised in parts called Okazaki fragments. The Okazaki fragments are stitched together by the enzyme DNA ligase. The same DNA polymerase is used for synthesis of both strands.
Other proteins such as helicases and topoisomerases are needed to unwind and re-wind the DNA before and after replication.

There are many replication origin sites on a eukaryotic chromosome, so replication can begin at some origins earlier than at others. As replication nears completion, “bubbles” of newly replicated DNA meet and fuse, forming two new molecules.

Just like DNA, ribonucleic acid (RNA) is a chain, or polymer, of nucleotides with the same 5' to 3' direction of its strands. However, the sugar component of RNA, ribose, differs chemically from the deoxyribose of DNA: it carries an oxygen atom (as a hydroxyl group) at the 2' position that deoxyribose lacks. Other fundamental structural differences exist. For example, uracil takes the place of the thymine found in DNA, and RNA is, for the most part, a single-stranded molecule.

DNA directs the synthesis of a variety of RNA molecules, each with a unique role in cellular function. For example, all genes that code for proteins are first transcribed in the nucleus into an RNA strand called messenger RNA (mRNA). The mRNA carries the information encoded in DNA out of the nucleus to the protein assembly machinery, called the ribosome, in the cytoplasm. The ribosome complex uses mRNA as a template to synthesise the exact protein coded by the gene.

DNA transcription refers to the synthesis of RNA from a DNA template. This process is very similar to DNA replication, although different proteins are responsible for it. The most important enzyme is RNA polymerase, which catalyses the synthesis of RNA from a DNA template. For transcription to be initiated, RNA polymerase must be able to recognise the beginning sequence of a gene so that it knows where to start synthesising an mRNA.
It is directed to this initiation site by the ability of one of its subunits to recognise a specific DNA sequence found at the beginning of a gene, called the promoter sequence. The promoter sequence is a unidirectional sequence found on one strand of the DNA that tells the RNA polymerase both where to start synthesis and in which direction synthesis should continue. The RNA polymerase then unwinds the double helix at that point and begins synthesis of an RNA strand complementary to one of the strands of DNA. This strand is called the antisense or template strand, while the other strand is referred to as the sense or coding strand. Synthesis then proceeds in a single direction.

Protein-coding sequences make up only about one to two percent of the total DNA in our genome. A eukaryotic gene does not code for a protein in one continuous stretch of DNA: in the human genome, the coding portions of a gene, called exons, are interrupted by intervening sequences, called introns. Both exons and introns are transcribed into mRNA, but before it is transported to the ribosome, the primary mRNA transcript is edited. This editing process removes the introns, joins the exons together, and adds unique features to each end of the transcript to make a mature mRNA.

Figure 2.5 Schema of splicing of hnRNA transcript and translation of proteins
(Picture courtesy https://upload.wikimedia.org/wikipedia/commons/thumb/8/81/0326_Splicing.jpg/220px-0326_Splicing.jpg)

The transcription machinery in the nucleus gives rise to an RNA transcript called the heterogeneous nuclear RNA (hnRNA). This is then spliced to remove the portions that do not code for the protein (called introns). Only the exon portions (containing the information needed to form the final protein) are joined to form the messenger RNA (mRNA). The messenger RNA serves as a template for the translational machinery to generate proteins.

While DNA is the carrier of genetic information in a cell, proteins do the bulk of the work.
Proteins are long chains containing as many as 20 different kinds of amino acids. The genetic code carried by DNA is what specifies the order and number of amino acids and, therefore, the shape and function of the protein.

A given amino acid can have more than one codon. These redundant codons usually differ at the third position. For example, the amino acid serine is encoded by UCU, UCC, UCA or UCG. This redundancy is the key to accommodating mutations that occur naturally as DNA is replicated and new cells are produced. By allowing some of the random changes in DNA to have no effect on the ultimate protein sequence, a sort of genetic safety net is created. Some codons do not code for an amino acid at all, but instruct the ribosome when to stop adding new amino acids.

Figure 2.6 Standard Genetic Code

Information in the mRNA is read in triplets (3 bases), each of which represents an amino acid. The codons are degenerate but not ambiguous: while an amino acid can be coded by more than one triplet codon, each codon is specific for one amino acid only. Exceptions to the standard genetic code are seen in the mitochondria and chloroplasts within the cell.

The Genetic Code (Access video via iStudyGuide)

Proteins are polymers of amino acids. Three consecutive nucleotides code for one particular amino acid; this unit is called a triplet codon. Note that the triplet codons are read in a non-overlapping manner. Also, there is no thymine base in RNA; in its place is a base known as uracil (U). Proteins fold into complex 3D structures and can function in the metabolism and structure of the cell.

DNA sequence     : 5’- ATG CCC TGC TTG GCC … -3’
RNA sequence     : 5’- AUG CCC UGC UUG GCC … -3’
Protein sequence :      M   P   C   L   A  …

Figure 2.7 Schema of Protein structure: Primary to quaternary structures

A protein sequence is a linear polymer of amino acids and constitutes the primary structure of a protein.
The sequence, along with the cytoplasmic conditions, dictates folding of the protein into secondary structures, such as alpha-helices and beta-sheets. Further folds of the molecule give rise to more intricate structures constituting the tertiary structure. Multiple peptides can be linked together to form the quaternary structure. For example, hemoglobin is made up of 2 alpha and 2 beta subunits.

The cellular machinery responsible for synthesising proteins is the ribosome. The ribosome consists of structural RNA and about 80 different proteins. In its inactive state, it exists as two subunits: a large subunit and a small subunit. When the small subunit encounters an mRNA, the process of translating the mRNA into a protein begins. The large subunit has two sites at which amino acids can bind and thus be close enough to each other to form a bond. The “A site” accepts a new transfer RNA, or tRNA, which bears an amino acid and is the adaptor molecule acting as a translator between mRNA and protein. The “P site” binds the tRNA that becomes attached to the growing chain.

The tRNA is a specific RNA molecule. Each tRNA has an acceptor site that binds a particular amino acid, and an anticodon, a sequence of three unpaired nucleotides that can bind to the complementary triplet of nucleotides (the codon) on the mRNA. Each tRNA also has a specific charger protein, called an aminoacyl tRNA synthetase. This protein can only bind to that particular tRNA and attach the correct amino acid to the acceptor site.

Each codon specifies a particular amino acid. In this way, the ribosomal complex builds a protein one amino acid at a time, with the order of amino acids determined precisely by the order of the codons in the mRNA.

A protein will often undergo further modification, called post-translational modification.
For example, it might be cleaved by a protein-cutting enzyme, called a protease, at a specific place or have a few of its amino acids altered. The modified sequence of amino acids affects the structure and function of the protein.

Each cell contains thousands of different proteins: structural components that give cells their shape and help them move; enzymes that make new molecules and catalyse nearly all chemical processes in cells; hormones that transmit signals throughout the body; antibodies that recognise foreign molecules; and transport molecules that carry oxygen.

Since the Human Genome Project (HGP) published its draft sequence in 2001, we have discovered that the majority of the genome does not code for protein-expressing genes. Only about 2% of the genomic real estate is represented by such sequences; non-coding DNA makes up the bulk of the genome. We are now trying to solve the paradox of how non-protein-forming entities regulate the cell and how they may be important in defining a species.

Coding vs. Non-coding paradox (Access video via iStudyGuide)

Impact of Molecular Genetics

Molecular genetics is the study of the agents that pass information from generation to generation: the blueprint that directs all cellular activities and specifies the developmental plan of an organism. These molecules, our genes, are long polymers of DNA. Just four chemical building blocks (guanine (G), adenine (A), thymine (T) and cytosine (C)) are placed in a unique order to code for all the genes in all living organisms.

Genes determine hereditary traits, such as the colour of our hair or our eyes. They do this by providing instructions for how every activity in every cell of our body should be carried out. Many diseases are caused by mutations. When the information coded by a gene changes, the resulting protein may not function properly or may not even be made at all. In either case, the cells containing that genetic change may no longer perform as expected.
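The point above, that a change in a gene may or may not change the resulting protein, can be illustrated with the example sequence from the genetic code discussion (ATG CCC TGC TTG GCC). The following is a minimal Python sketch rather than a production tool. It assumes the standard genetic code written in the compact T, C, A, G ordering; the function names transcribe and translate and the three mutated sequences are illustrative assumptions, not part of the course material.

```python
# Standard genetic code in T, C, A, G order; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def transcribe(coding_dna: str) -> str:
    """Write the mRNA for a coding (sense) strand: same sequence, U instead of T."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Read non-overlapping triplets from the first base until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3].replace("U", "T")  # table is keyed by DNA letters
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


normal = "ATGCCCTGCTTGGCC"     # example sequence from the text
silent = "ATGCCTTGCTTGGCC"     # CCC -> CCT: third-position change, still proline
missense = "ATGCCCTGGTTGGCC"   # TGC -> TGG: cysteine becomes tryptophan
nonsense = "ATGCCCTGATTGGCC"   # TGC -> TGA: stop codon, protein is cut short

for name, dna in [("normal", normal), ("silent", silent),
                  ("missense", missense), ("nonsense", nonsense)]:
    print(name, transcribe(dna), translate(transcribe(dna)))
```

The unchanged sequence translates to MPCLA, matching the worked example in the text. The silent change leaves the protein untouched, which is the "genetic safety net" provided by redundant codons, while the other two changes alter or truncate the protein.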
Most sequencing and analysis technologies were developed from studies of nonhuman genomes, notably those of the bacterium Escherichia coli, the yeast Saccharomyces cerevisiae, the fruit fly Drosophila melanogaster, the roundworm Caenorhabditis elegans, and the laboratory mouse Mus musculus. These simpler systems provide excellent models for developing and testing the procedures needed for studying the much more complex human genome.

A large amount of genetic information has already been derived from these organisms, providing valuable data for the analysis of normal human gene regulation, genetic diseases, and evolutionary processes. For example, researchers have already identified single genes associated with a number of diseases, such as cystic fibrosis. As research progresses, investigators will also uncover the mechanisms of diseases caused by several genes or by single genes interacting with environmental factors. Genetic susceptibilities have been implicated in many major disabling and fatal diseases including heart disease, stroke, diabetes, and several kinds of cancer. The identification of these genes and their proteins will pave the way to more effective therapies and preventive measures. Investigators determining the underlying biology of genome organisation and gene regulation will also begin to understand how humans develop, why this process sometimes goes awry, and what changes take place as people age.

Genomic Databases (Access video via iStudyGuide)

There are three major public DNA databases:
- GenBank at the National Center for Biotechnology Information (NCBI) of the National Institutes of Health (NIH) in Bethesda, USA
- DNA Database of Japan (DDBJ)
- European Bioinformatics Institute (EBI)

The underlying raw DNA sequences are identical across the three. In addition, there are other categories of bioinformatics datasets that contain DNA and/or protein sequence data.
2.2 GenBank: Database of Most Known Nucleotide and Protein Sequences

Building of GenBank

GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences. It is built primarily from the submission of sequence data by authors and from the bulk submission of expressed sequence tag (EST), genome survey sequence (GSS) and other high-throughput data by sequencing centres.

Over 100,000 species are represented in GenBank, with over 1000 new species added per month.

all species    viruses    bacteria    archaea    eukaryota
128,941        6,137      31,262      2,100      87,147

The most sequenced organisms in GenBank are listed as follows:
- Homo sapiens (6.9 million entries)
- Mus musculus (5.0 million)
- Zea mays (896,000)
- Rattus norvegicus (819,000)
- Gallus gallus (567,000)
- Arabidopsis thaliana (519,000)
- Danio rerio (492,000)
- Drosophila melanogaster (350,000)
- Oryza sativa (221,000)

A new release of GenBank is made every two months. The growth of the database in the past ten years is presented in the following table:

Year    Base Pairs        Sequences
1994    217,102,462       215,273
1995    384,939,485       555,694
1996    651,972,984       1,021,211
1997    1,160,300,687     1,765,847
1998    2,008,761,784     2,837,897
1999    3,841,163,011     4,864,570
2000    11,101,066,288    10,106,023
2001    15,849,921,438    14,976,310
2002    28,507,990,166    22,318,883
2003    36,553,368,485    30,968,418

Convenient and quick submission of sequence data to GenBank can be done through a WWW form called BankIt, or by using the stand-alone submission software Sequin. The number of bases grows at an exponential rate. At the time of writing, there are over 38,989,342,565 bases.

Access to GenBank is available via several methods. Each GenBank record, consisting of both a sequence and its annotations, is assigned a stable and unique identifier, the accession number, which remains constant over the lifetime of the record even when there is a change to the sequence or annotation.
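The statement that the number of bases grows at an exponential rate can be checked against the growth table earlier in this section. The short Python sketch below estimates an average doubling time from the first and last rows, assuming steady exponential growth between them; the function name doubling_time_years is an illustrative choice, and the figures are copied from the table.

```python
import math

# Base pairs in GenBank by year, copied from the growth table in this section.
BASE_PAIRS = {
    1994: 217_102_462,
    1995: 384_939_485,
    1996: 651_972_984,
    1997: 1_160_300_687,
    1998: 2_008_761_784,
    1999: 3_841_163_011,
    2000: 11_101_066_288,
    2001: 15_849_921_438,
    2002: 28_507_990_166,
    2003: 36_553_368_485,
}


def doubling_time_years(start_year: int, end_year: int) -> float:
    """Average doubling time, assuming N(t) = N0 * 2 ** (t / T) between the two years."""
    ratio = BASE_PAIRS[end_year] / BASE_PAIRS[start_year]
    return (end_year - start_year) * math.log(2) / math.log(ratio)


print(f"Fold increase 1994-2003: {BASE_PAIRS[2003] / BASE_PAIRS[1994]:.0f}x")
print(f"Average doubling time:   {doubling_time_years(1994, 2003):.2f} years")
```

Over this period the database grew by a factor of well over one hundred, which corresponds to a doubling time of a little more than one year.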
The DNA sequence within a GenBank record is also assigned a unique identifier, called a ‘GI’, that appears on the VERSION line of GenBank flat file records following the accession number. A third identifier of the form ‘Accession.version’, also displayed on the VERSION line of flat file records, consolidates the information present in the GI and accession numbers. An entry appearing in the database for the first time has an ‘Accession.version’ identifier equivalent to the ACCESSION number of the GenBank record followed by ‘.1’ to indicate the first version of the sequence for the record, e.g.,

ACCESSION   AF000001
VERSION     AF000001.1  GI: 987654321

When a change is made to a sequence given in a GenBank record, a new GI number is issued to the sequence and the version extension of the ‘Accession.version’ identifier is incremented. The accession number for the record as a whole remains unchanged and the older sequence remains available under the old ‘Accession.version’ identifier and GI.

GenBank Records and Divisions

Each GenBank entry includes a concise description of the sequence, the scientific name and taxonomy of the source organism, bibliographic references and a table of features listing areas of biological significance, such as coding regions and their protein translations, transcription units, repeat regions and sites of mutation or modification.

The files in the GenBank distribution have traditionally been divided into ‘divisions’ that roughly correspond to taxonomic groups such as bacteria (BCT), viruses (VRL), primates (PRI) and rodents (ROD). In recent years, divisions have been added to support specific sequencing strategies. These include divisions for EST, GSS, high-throughput genomic (HTG) and high-throughput cDNA (HTC) sequences, making a total of 17 divisions. For convenience in file transfer, the larger divisions, such as EST and PRI, are partitioned into multiple files when posting the bimonthly GenBank releases.
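The relationship between the accession number and the ‘Accession.version’ identifier described in this section can be sketched in a few lines of Python. This is only an illustration of the naming scheme, not an NCBI tool: the class and function names are hypothetical, and AF000001.2 is a made-up later version of the example record used above.

```python
from typing import NamedTuple


class SeqVersion(NamedTuple):
    accession: str  # stable for the lifetime of the record
    version: int    # incremented each time the sequence changes


def parse_accession_version(identifier: str) -> SeqVersion:
    """Split an 'Accession.version' string such as 'AF000001.1' into its parts."""
    accession, dot, version = identifier.strip().rpartition(".")
    if not dot or not accession or not version.isdigit():
        raise ValueError(f"not an Accession.version identifier: {identifier!r}")
    return SeqVersion(accession, int(version))


def same_record(a: str, b: str) -> bool:
    """Two identifiers belong to the same record if their accessions match."""
    return parse_accession_version(a).accession == parse_accession_version(b).accession


first = parse_accession_version("AF000001.1")
revised = parse_accession_version("AF000001.2")  # hypothetical revised sequence

print(first)                                   # SeqVersion(accession='AF000001', version=1)
print(same_record("AF000001.1", "AF000001.2"))  # True
print(revised.version > first.version)          # True
```

Keeping the accession and the version as separate fields mirrors the design described in the text: the accession identifies the record as a whole, while the version pins down one specific sequence.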
One point to note before we discuss the EST database is the distinction between naturally occurring molecules and those created in the laboratory (in vitro). This is crucial as it bears on the ethical use of some of the data and the concept of ownership of sequence data.

In vitro vs. in vivo (Access video via iStudyGuide)

EST Database

Expressed Sequence Tags (ESTs) are small pieces of DNA sequence (200 to 500 nucleotides) generated by sequencing an expressed gene. They are sequence bits of DNA that represent genes expressed in certain cells, tissues, or organs from different organisms, and scientists use these tags to fish a gene out of a portion of chromosomal DNA by matching base pairs.

Due to their utility, the speed with which they may be generated, and the low cost associated with this technology, many individual scientists as well as large genome sequencing centres have been generating hundreds of thousands of ESTs for public use. Once an EST was generated, scientists would submit their tags to GenBank.

ESTs continue to be the major source of new sequence records and gene sequences. Over the past year the number of ESTs has increased by over 45% to a total of 18.1 million sequences representing over 580 different organisms. The top five organisms represented in the EST division are H. sapiens (5.4 million records), M. musculus (3.8 million records), R. norvegicus (540,000 records), Triticum aestivum (500,000 records) and Ciona intestinalis (490,000 records).

With the rapid submission of so many ESTs, it became difficult to identify a sequence that had already been deposited in the database. It was becoming increasingly apparent that if ESTs were to be easily accessed and useful as gene discovery tools, they needed to be organised in a searchable database that also provided access to other genome data. Therefore, in 1992, a new database was designed to serve as a collection point for ESTs.
Once an EST that was submitted to GenBank had been screened and annotated, it was then deposited in this new database, called dbEST. Using dbEST, a scientist can access not only data on human ESTs, but information on ESTs from over 300 other organisms as well. Whenever possible, NCBI scientists annotate the EST record with any known information. For example, if an EST matches a DNA sequence that codes for a known gene with a known function, that gene’s name and function are placed on the EST record. Annotating EST records allows public scientists to use dbEST as an avenue for gene discovery. By employing a database search tool, such as BLAST, any interested party can conduct sequence similarity searches against dbEST.

Non-Redundant Set of Gene-Oriented Clusters

Because a gene can be expressed as mRNA many times, ESTs ultimately derived from this mRNA may be redundant. That is, there may be many identical, or similar, copies of the same EST. Such redundancy and overlap mean that when someone searches dbEST for a particular EST, he or she may retrieve a long list of tags, many of which may represent the same gene. Searching through all these identical ESTs can be very time consuming. To resolve the redundancy and overlap problem, NCBI investigators developed the UniGene database.

As part of its daily processing of GenBank EST data, the NCBI identifies through BLAST searches all homologies for new EST sequences and incorporates that information into the companion database, dbEST. The data in dbEST are further processed to produce the UniGene database. UniGene is an experimental system for automatically partitioning GenBank sequences into a non-redundant set of gene-oriented clusters. Each UniGene cluster contains sequences that represent a unique gene, as well as related information such as the tissue types in which the gene has been expressed and map location. Consequently, the collection may be of use to the community as a resource for gene discovery.
UniGene has also been used by experimentalists to select reagents for gene mapping projects and large-scale expression analysis. However, it should be noted that the procedures for automated sequence clustering are still under development and the results may change from time to time as improvements are made.

Currently, sequences from animals including human, rat, mouse, cow, zebrafish, clawed frog, fruitfly and mosquito have been processed. The plant organisms processed are wheat, rice, barley, maize and cress. These species were chosen because they have the greatest amounts of EST data available and represent a variety of species. Additional organisms may be added in the future.

Sequence-Tagged Sites Database

The STS division of GenBank, dbSTS, contains over 240 000 sequences including anonymous STSs based on genomic sequence as well as gene-based STSs derived from the 3′ ends of genes and ESTs. These STS records usually include primer sequences, annotations and PCR conditions.

GSS Database

The GSS division of GenBank, dbGSS, is similar to the EST division, with the exception that most of the sequences are genomic in origin, rather than cDNA (mRNA). It should be noted that two classes (exon trapped products and gene trapped products) may be derived via a cDNA intermediate. Care should be taken when analysing sequences from either of these classes, as a splicing event could have occurred and the sequence represented in the record may be interrupted when compared to genomic sequence.

The GSS division contains (but is not limited to) the following types of data:
• random “single pass read” genome survey sequences
• cosmid/BAC/YAC end sequences
• exon trapped genomic sequences
• Alu PCR sequences
• transposon-tagged sequences

The GSS division of GenBank has grown over the past year by 73% to a total of 6.4 million records with over 2.0 billion nucleotides.
GSS records are predominantly single reads from bacterial artificial chromosomes (‘BAC-ends’) used in a variety of genome sequencing projects. The most highly represented species in the GSS division are Z. mays (1.3 million records), M. musculus (952 000 records), H. sapiens (893 000 records) and Brassica oleracea (595 000 records).

Although dbGSS sequences are incorporated into the GSS Division of GenBank, annotation in dbGSS is more comprehensive and includes detailed information about the contributors, experimental conditions, and genetic map locations.

Single Nucleotide Polymorphism Database

Because single nucleotide polymorphisms (SNPs) occur frequently throughout the genome and tend to be relatively stable genetically, they serve as excellent biological markers. Biological markers are segments of DNA with an identifiable physical location that can be easily tracked and used for constructing a chromosome map that shows the position of known genes, or other markers, relative to each other. These maps allow researchers to study and pinpoint traits resulting from the interaction of more than one gene.

To facilitate the identification and cataloguing of SNPs, the SNP database, dbSNP, has been created. It is intended to stimulate many areas of biological research, including the identification of the genetic components of disease. dbSNP links directly to a number of software tools designed to aid in SNP analysis. Records in dbSNP are cross-annotated within other internal information resources such as PubMed, genome project sequences and the dbSTS database.

2.3 NCBI Databases and Tools

The NCBI (http://www.ncbi.nlm.nih.gov) creates public databases, conducts research in computational biology, develops software tools for analysing genome data, and disseminates biomedical information.
Figure 2.8 Screen shot of the NCBI database header at www.ncbi.nlm.nih.gov

NCBI serves as a central repository for several genomic resources including the MEDLINE database; Entrez, the retrieval tool; BLAST, the sequence search and alignment tool; OMIM, the repository for diseases; and several others. Today, it serves as an indispensable resource for genomic research.

PubMed is the National Library of Medicine’s search service that provides access to
• 11 million citations in MEDLINE,
• links to participating online journals, and
• the PubMed tutorial (via “Education” on the side bar).

Entrez is a search and retrieval system that integrates
• the scientific literature,
• DNA and protein sequence databases,
• 3D protein structure data,
• population study data sets, and
• assemblies of complete genomes.

Figure 2.9 Interrelatedness between the various genomic databases

Genomic databases are inter-related units with several unique databases linked to each other for ease of flow of information and data analysis. Thus, one can easily navigate from one database to another to download multiple types of data.

BLAST (Basic Local Alignment Search Tool) is NCBI’s sequence similarity search tool. It
• supports analysis of DNA and protein databases, and
• handles 80,000 searches per day.

OMIM (Online Mendelian Inheritance in Man) is
• a catalogue of human genes and genetic disorders,
• edited by Dr. Victor McKusick and others at JHU.

Books is a searchable resource of on-line books.

TaxBrowser is a browser for the major divisions of living organisms (archaea, bacteria, eukaryota, viruses). The site features
• taxonomy information such as genetic codes, and
• molecular data on extinct organisms.

The Structure site maintains the Molecular Modelling Database (MMDB), a database of macromolecular three-dimensional structures, as well as tools for their visualisation and comparative analysis.
It includes
• biopolymer structures obtained from the Protein Data Bank (PDB),
• Cn3D (a 3D-structure viewer), and
• the vector alignment search tool (VAST).

How to find information about a particular gene or protein

There are five ways to access protein and DNA sequences:
• LocusLink with RefSeq
• UniGene
• Entrez
• EMBL
• ExPASy Sequence Retrieval System (this is separate from NCBI)

Figure 2.10 Screen shot of NCBI site with information on locating a particular gene or protein

Unique links on the right side of the NCBI home page allow for easy retrieval of information directly from several resources including gene, protein data and others.

LocusLink with RefSeq

LocusLink is a great starting point: it collects key information on each gene/protein from major databases. It now covers 8 organisms. RefSeq provides a curated, optimal accession number for each DNA or protein.

An accession number is a label that is used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence. The following are some examples (all for retinol-binding protein, RBP4):

Accession     Type of record
X02775        GenBank genomic DNA sequence
NT_030059     Genomic contig
rs7079946     dbSNP (single nucleotide polymorphism)
N91759.1      An expressed sequence tag (1 of 170)
NM_006744     RefSeq DNA sequence (from a transcript)
NP_007635     RefSeq protein
AAC02945      GenBank protein
Q28369        SwissProt protein
1KT7          Protein Data Bank structure record

Figure 2.11 Screen capture of LocusLink site at NCBI

UniGene

UniGene collects expressed sequence tags (ESTs) into clusters, in an attempt to form one gene per cluster. Use UniGene to study where your gene is expressed in the body, when it is expressed, and see its abundance.

Entrez

Entrez is divided into sites for nucleotide, protein, structure, genomes, OMIM, and more.
Figure 2.12 Screen capture of the Entrez web site at NCBI

Entrez is linked to multiple databases including those for nucleotides, books, domains, literature searches, taxonomy and others.

The Nucleotide database contains sequence data from GenBank, EMBL, and DDBJ, the members of the tripartite, international collaboration of sequence databases. EMBL is the European Molecular Biology Laboratory at Hinxton Hall, UK; DDBJ is the DNA Database of Japan in Mishima, Japan. Sequence data is also incorporated from the Genome Sequence Data Base (GSDB), Santa Fe, NM. Patent sequences are incorporated through arrangements with the U.S. Patent and Trademark Office (USPTO), and via the collaborating international databases from other international patent offices.

The Protein database contains sequence data from the translated coding regions from DNA sequences in GenBank, EMBL and DDBJ as well as protein sequences submitted to Protein Information Resource (PIR), SWISSPROT, Protein Research Foundation (PRF), and Protein Data Bank (PDB) (sequences from solved structures).

The Genome database provides views for a variety of genomes, complete chromosomes, contig sequence maps, and integrated genetic and physical maps.

One of the challenges facing genomics today is the ease of sequencing without a concomitant development of data analysis tools. With the advent of new sequencing techniques such as next-generation sequencing (NGS), the amount of data available for several genomes is exploding.

Next Generation Sequencing (Access video via iStudyGuide)

The Structure database or Molecular Modelling DataBase (MMDB) contains experimental data from crystallographic and NMR structure determinations. The data for MMDB are obtained from the Protein Data Bank (PDB). The NCBI has cross-linked structural data to bibliographic information, to the sequence databases, and to the NCBI taxonomy.
Use the NCBI 3D structure viewer, Cn3D, for easy interactive visualisation of molecular structures from Entrez.

The PopSet database contains aligned sequences submitted as a set resulting from a population, a phylogenetic, or mutation study. These alignments describe such events as evolution and population variation. The PopSet database contains both nucleotide and protein sequence data.

The OMIM (Online Mendelian Inheritance in Man) database is a catalogue of human genes and genetic disorders.

The Taxonomy database contains the names of all organisms that are represented in the NCBI genetic database by at least one nucleotide or protein sequence.

The Bookshelf has a collection of biomedical books that are linked in Entrez and can also be separately searched at Bookshelf.

The ProbeSet database is an Entrez view of NCBI’s GEO (Gene Expression Omnibus). GEO is a gene expression and hybridisation array repository.

3D Domains contains protein domains from the NCBI Conserved Domain Database.

A unified, non-redundant view of sequence tagged sites (STSs), UniSTS integrates marker and mapping data from a variety of public resources. Data sources include dbSTS, RHdb, GDB, various human maps (Genethon genetic map, Marshfield genetic map, Whitehead RH map, Whitehead YAC map, Stanford RH map, NHGRI chr 7 physical map, WashU chrX physical map), and various mouse maps (Whitehead RH map, Whitehead YAC map, Jackson laboratory’s MGD map).

Database Interlinking

What makes Entrez more powerful than many services is that most of its records are linked to other records, both within a given database (such as Nucleotide) and between databases. Links within a database are called “neighbours” (e.g., Nucleotide neighbours). Protein and Nucleotide neighbours are determined by performing similarity searches using the BLAST algorithm to compare the entry amino acid or DNA sequence to all other amino acid or DNA sequences in the database.
Links between databases are also possible. Nucleotide sequence records in the Nucleotide database are linked to the PubMed citation of the article in which the sequences were published. Protein sequence records are linked to the nucleotide sequence from which the protein was translated.

You can use limits (such as RefSeq) to focus your Entrez search. Limits allow restriction of a search to a defined subset of the database. Limits can be set to restrict a search to a particular database field (e.g., the author field). Limits can be set to search everything but a particular type of data (e.g., exclude patent records). Alternatively, limits can be set to only search a particular type of data (e.g., Genomic RNA/DNA) or to only search data from a particular source database (e.g., EMBL). Date limits and sequence length limits are also possible. The contents of each Entrez database differ and therefore the limits available for each database differ.

EMBL

The European Bioinformatics Institute (EBI) provides access to sequences via the EMBL nucleotide database. The searches are comparable to those of the NCBI GenBank database using Entrez. EBI also sponsors ENSEMBL for bioinformatic analysis of the human genome. Try ENSEMBL at www.ensembl.org for a premier human genome web browser.

ExPASy

One of the most useful resources available to obtain protein sequences and associated data is provided by ExPASy (Expert Protein Analysis System). Try ExPASy’s sequence retrieval system at http://www.expasy.ch.

How to do a literature search using PubMed

PubMed is the NCBI gateway to MEDLINE. MEDLINE contains bibliographic citations and author abstracts from over 4,000 journals published in the United States and in 70 other countries. It has 12 million records dating back to 1966.

MeSH is the acronym for “Medical Subject Headings.” MeSH is the list of the vocabulary terms used for subject analysis of biomedical literature at NLM.
MeSH vocabulary is used for indexing journal articles for MEDLINE. The MeSH controlled vocabulary imposes uniformity and consistency on the indexing of biomedical literature.

PubMed search strategies:
• Try the tutorial (“Education” on the left sidebar)
• Use Boolean queries, e.g., lipocalin AND disease
• Try using “limits”
• Try “LinkOut” to find external resources
• Obtain articles on-line via Welch Medical Library (and download pdf files): http://www.welch.jhu.edu/

Figure 2.13 Literature search strategies in PubMed

MEDLINE can be searched using single search terms. Alternatively, to reduce the number of results returned and increase specificity, a combination of search terms can be strung together and used in a Boolean search. Specificity can also be increased by excluding certain search terms.

Figure 2.14a Screen capture of the NCBI website with information on PubMed

Figure 2.14b Screen capture of PubMed depicting search for RBP protein

WelchWeb (from the Welch Medical Library) is available at http://welch.jhmi.edu/welchone/. WelchWeb supports literature (and other) searches through its e-mail gateway, PubMed gateway, library catalogue, remote access to Welch services and literature requests, and can then be used to browse journals or databases.

How to find information about a particular disease

There are two main types of disease databases: general and locus-specific.
• General: OMIM, GeneCards (Weizmann) at http://www.genecards.org/, and NCBI Genes & Disease at http://www.ncbi.nlm.nih.gov/disease/.
• Locus-specific: Human Gene Mutation Database (HGMD) at http://www.hgmd.cf.ac.uk/docs/oth_mut.html.
For example, try OMIM for RBP.

Genomics in Medicine (Access video via iStudyGuide)

Figure 2.15a Screen capture of the location of the OMIM db in the NCBI web site

Figure 2.15b Screen capture of search results from OMIM db for RBP protein

Figure 2.15c Screen capture showing entry for RBP protein in the OMIM db

1. Which of the following is a RefSeq accession number corresponding to an mRNA?
(a) J01536
(b) NM_15392
(c) NP_52280
(d) AAB134506

2. Is it possible for a single gene to have more than one UniGene cluster?
(a) Yes
(b) No

3. If you want literature information, what is the best website to visit?
(a) OMIM
(b) Entrez
(c) PubMed
(d) PROSITE

Answers: 1. (b) 2. (a) 3. (c)

Visit the following link, https://www.youtube.com/watch?v=hRw0TtKgR7Y, and learn more about non-coding DNA (regulatory elements). Answers will be discussed in class.
a. What are non-coding regulatory elements? Name any two.
b. What is the role of the non-coding region sequences in the cell?
c. How much of the DNA sequence in the genome codes for protein-coding genes?

Visit the following websites and carry out the searches below in class.
a) Visit the NCBI website (http://www.ncbi.nlm.nih.gov/) to learn more about MEDLINE and search the OMIM database for specific mutations in any disease of interest to you. List at least one or two mutations and provide a screen capture of the details from OMIM for these.
b) Visit the SNP database (http://www.ncbi.nlm.nih.gov/snp) and use the help function to learn how to search for SNPs in diabetes. Tabulate the following: exact base location, base pair changed, chromosome position, organism and clinical details if available.

Carry out the following activity with your lecturer in class. Complete the PCR figure and provide sequences of the primers, and compute the number of copies after 5 rounds of amplification.
Summary

The following key points are discussed in this unit:
• Cellular processes: mitosis, meiosis, transcription, replication, splicing and translation
• Differences between prokaryotic and eukaryotic cells
• The standard genetic code
• Coding vs. non-coding cellular regulators
• NCBI databases: EST, SNP, Entrez and others
• Carrying out PubMed searches and OMIM database searches

STUDY UNIT 3 PAIRWISE SEQUENCE ALIGNMENT

Learning Outcomes

Upon completion of this unit, you will be able to:
1. Differentiate between sequence homology, similarity and identity
2. Apply knowledge of pairwise alignment and homology to problems of evolution
3. Differentiate between the role of protein sequence vs. structure analysis in understanding evolution
4. Illustrate use of Dayhoff models in pairwise alignment
5. Apply the PAM and BLOSUM matrices to problems of protein sequence homology
6. Compare and contrast the various global and local alignment algorithms and their applications in pairwise alignments
7. Apply the Needleman-Wunsch algorithm for global alignment
8. Apply the Smith-Waterman algorithm for local alignments
9. Employ FASTA and BLAST algorithms to nucleic acid and protein sequence alignments

You can refer to Chapter 3 of the textbook.

Overview

This unit evaluates the methods for analysing the relatedness of genes and proteins, with focus on the pairwise sequence alignment algorithms. We adopt an evolutionary perspective in our description of how amino acids (or nucleotides) in two sequences can be aligned and compared. We then describe the algorithms and programs for global (Needleman-Wunsch) and local (Smith-Waterman) pairwise alignment.

Chapter 3 Pairwise Sequence Alignment

3.1 Concepts and General Alignment Processes

One of the most basic questions about a gene or protein is whether it is related to any other gene or protein.
Relatedness of two proteins at the sequence level suggests that they are homologous. Relatedness also suggests that they may have common functions. By analysing many DNA and protein sequences, it is possible to identify domains or motifs that are shared among a group of molecules. The analyses of the relatedness of proteins and genes are accomplished by aligning sequences. Hence, pairwise sequence alignment is the most fundamental operation of computational biology, and its main applications are as follows:
• It is used to decide if two proteins (or genes) are related structurally or functionally;
• It is used to identify domains or motifs that are shared between proteins;
• It is the basis of BLAST searching (next unit); and
• It is used in the analysis of genomes.

Protein Alignment: Often More Informative than DNA Alignment

Given the choice of aligning a DNA sequence or the sequence of the protein it encodes, it is usually more informative to compare protein sequences. There are several reasons:
• Protein is more informative (20 vs 4 characters); many amino acids share related biophysical properties;
• Codons are degenerate: changes in the third position often do not alter the amino acid that is specified;
• Protein sequences offer a longer “look-back” time; and
• DNA sequences can be translated into protein, and then used in pairwise alignments.

DNA can be translated into six potential proteins (three reading frames on each strand), for example:

Forward strand, frame 1:   5’ CAT CAA
Forward strand, frame 2:   5’ ATC AAC
Forward strand, frame 3:   5’ TCA ACT
DNA sequence:              5’ CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3’
Reverse strand, frame 1:   5’ GTG GGT
Reverse strand, frame 2:   5’ TGG GTA
Reverse strand, frame 3:   5’ GGG TAG

But sometimes, DNA alignments are more appropriate:
• to confirm the identity of a cDNA
• to study noncoding regions of DNA
• to study DNA polymorphisms

For example: Neanderthal vs modern human DNA.

Homology, Similarity and Identity

Conservation: Changes at a specific position of an amino acid (or, less commonly, DNA) sequence that preserve the physico-chemical properties of the original residue.
Identity is the extent to which two (nucleotide or amino acid) sequences are invariant.

Similarity is the extent to which nucleotide or protein sequences are related. It is based upon identity plus conservation.

Homology is the similarity attributed to descent from a common ancestor.

There are two types of homology, as illustrated below:

Figure 3.1 Illustration of the various types of sequence homologies in genes

Gene duplication events can be advantageous to an organism. As a functional copy of the gene exists, the other copy is free to mutate and evolve new functionalities. In the early globin ancestor, several gene duplication events allowed formation of two types of globin chains which diverged further during speciation events.

Orthologs: Homologous sequences in different species that arose from a common ancestral gene during speciation; they may or may not be responsible for a similar function. For example, the following tree shows RBP orthologs:

Figure 3.2 Example of a tree showing RBP orthologs

Paralogs: Homologous sequences within a single species that arose by gene duplication. For example, the following tree shows paralogous human lipocalins:

Figure 3.3 Tree showing paralogs in lipocalins

Pairwise Alignment, Homology and Evolution of Life

Pairwise alignment is the process of lining up two sequences to achieve maximal levels of identity (and conservation, in the case of amino acid sequences) for the purpose of assessing the degree of similarity and the possibility of homology. For example, the following is the pairwise alignment of retinol-binding protein and β-lactoglobulin:

Positions at which a letter is paired with a null are called gaps. Gap scores are typically negative. Since a single mutational event may cause the insertion or deletion of more than one residue, the presence of a gap is ascribed more significance than the length of the gap.
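The idea that the presence of a gap matters more than its length is commonly expressed as an affine gap penalty: a large cost for opening a gap plus a small cost for each position it spans. The sketch below illustrates one common convention. The function name `affine_gap_score` and the penalty values (11 to open, 1 per position) are illustrative assumptions, not values taken from this guide.

```python
def affine_gap_score(length, gap_open=11, gap_extend=1):
    """Score contributed by a single gap covering `length` positions.

    Assumed convention: score = -(gap_open + gap_extend * length).
    Opening a gap is expensive; lengthening it is cheap.
    """
    if length < 0:
        raise ValueError("gap length cannot be negative")
    if length == 0:
        return 0  # no gap, no penalty
    return -(gap_open + gap_extend * length)


# One gap of three positions versus three separate single-position gaps
one_long_gap = affine_gap_score(3)
three_short_gaps = 3 * affine_gap_score(1)
print(one_long_gap, three_short_gaps)
```

Under these assumed values, a single three-residue gap is penalised far less than three separate one-residue gaps, which matches the reasoning that one mutational event can insert or delete several residues at once.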
In BLAST, it is rarely necessary to change gap values from the default.

Pairwise sequence alignment allows us to look back billions of years (BYA = billion years ago).

Figure 3.4 Overview of the history of life on Earth

See Chapter 13 of the textbook for details. Gene/protein sequences are analysed in the context of evolution:
• Which organisms have orthologous genes?
• When did these organisms evolve?
• How related are human and bacterial globins?

Source: Recommended textbook (page 56)

General approach to pairwise alignment:
• Choose two sequences
• Select an algorithm that generates a score
• Allow gaps (insertions, deletions)
• Score reflects degree of similarity
• Alignments can be global or local
• Estimate probability that the alignment occurred by chance

Calculation of an alignment score:

Protein sequence vs. structure: As seen in the previous sections, studies of proteins enrich our knowledge of evolution much more than DNA sequence analysis does. However, one of the questions typically raised in studying proteins is whether the sequence or the structure of a protein is more informative. We look at the role of protein structure in the following video and also learn why protein sequences are studied. Many of the algorithms and programs used to study proteins in Units 3 to 5 relate to this issue.

Protein Structure (Access video via iStudyGuide)

3.2 Dayhoff Model and Substitution Matrices

Margaret Dayhoff and colleagues catalogued thousands of proteins and compared the sequences of closely related proteins in many families. They considered the question of which specific amino acid substitutions are observed to occur when two homologous protein sequences are aligned. An Accepted Point Mutation (or PAM) is defined as a replacement of one amino acid in a protein by another residue that has been accepted by natural selection.
An amino acid change that is accepted by natural selection occurs when
• a gene undergoes a DNA mutation such that it encodes a different amino acid, and
• the entire species adopts that change as the predominant form of the protein.

Dayhoff’s 34 protein superfamilies:

Protein           PAMs per 100 million years
Ig kappa chain    37
Kappa casein      33
Lactalbumin       27
Hemoglobin α      12
Myoglobin         8.9
Insulin           4.4
Histone H4        0.10
Ubiquitin         0.00

Dayhoff’s numbers of “accepted point mutations”: What amino acid substitutions occur in proteins?

           A     R     N     D     C     Q     E     G
          Ala   Arg   Asn   Asp   Cys   Gln   Glu   Gly
R Arg      30
N Asn     109    17
D Asp     154     0   532
C Cys      33    10     0     0
Q Gln      93   120    50    76     0
E Glu     266     0    94   831     0   422
G Gly     579    10   156   162    10    30   112
H His      21   103   226    43    10   243    23    10

Figure 3.5 The figure depicts a subset of the modified Dayhoff matrix

The values are the numbers of accepted point mutations, multiplied by 10, observed in 1572 cases of amino acid substitution in closely related protein sequences. Amino acids are presented alphabetically according to the three-letter code. Some substitutions, such as V and I or S and T, are common. Substitutions involving C or W are rarely accepted.

Multiple sequence alignment of glyceraldehyde 3-phosphate dehydrogenases

The relative mutability of amino acids:

Asn   134     His   66
Ser   120     Arg   65
Asp   106     Lys   56
Glu   102     Pro   56
Ala   100     Gly   49
Thr    97     Tyr   41
Ile    96     Phe   41
Met    94     Leu   40
Gln    93     Cys   20
Val    74     Trp   18

Normalised frequencies of amino acids:

Gly   8.9%    Arg   4.1%
Ala   8.7%    Asn   4.0%
Leu   8.5%    Phe   4.0%
Lys   8.1%    Gln   3.8%
Ser   7.0%    Ile   3.7%
Val   6.5%    His   3.4%
Thr   5.8%    Cys   3.3%
Pro   5.1%    Tyr   3.0%
Glu   5.0%    Met   1.5%
Asp   4.7%    Trp   1.0%

(In the original figure, amino acids encoded by 6 codons are shown in blue and those encoded by 1 codon in red.)

Dayhoff’s PAM1 mutation probability matrix (original amino acid):

Figure 3.6 The PAM1 mutation probability matrix

The original amino acid j is arranged in columns (across the top), while the replacement amino acid i is arranged in rows.
Dayhoff’s PAM1 mutation probability matrix: each element of the matrix shows the probability that an original amino acid (top) will be replaced by another amino acid (side).

Substitution Matrix

A substitution matrix contains values proportional to the probability that amino acid i mutates into amino acid j for all pairs of amino acids. Substitution matrices are constructed by assembling a large and diverse sample of verified pairwise alignments (or multiple sequence alignments) of amino acids. Substitution matrices should reflect the true probabilities of mutations occurring through a period of evolution. The two major types of substitution matrices are PAM and BLOSUM.

PAM matrices are based on global alignments of closely related proteins. PAM1 is the matrix calculated from comparisons of sequences with no more than 1% divergence. Other PAM matrices are extrapolated from PAM1. All the PAM data come from closely related proteins (>85% amino acid identity).

PAM Matrices

Dayhoff’s PAM0 mutation probability matrix (the rules for extremely slowly evolving proteins; top: original amino acid, side: replacement amino acid):

Figure 3.7 A PAM 2000 matrix has similar values that tend to converge on the same limits

In a PAM 2000 matrix, the proteins being compared are at an extreme of unrelatedness. In contrast, at PAM0, no mutations are tolerated, and the residues of the proteins are perfectly conserved.

Dayhoff’s PAM 2000 mutation probability matrix (the rules for very distantly related proteins; top: original amino acid, side: replacement amino acid):

Figure 3.8 Portion of the matrix for an infinite PAM value.
This results from multiplying a PAM1 matrix by itself an infinite number of times.

PAM250 mutation probability matrix (top: original amino acid, side: replacement amino acid):

Figure 3.9 The PAM250 mutation probability matrix

At this evolutionary distance, only one in five amino acid residues remains unchanged from an original amino acid sequence (columns) to a replacement amino acid (rows).

PAM250 log odds scoring matrix:

Figure 3.10 Log-odds matrix for PAM250

High PAM values (e.g., PAM250) are useful for aligning very divergent sequences. A variety of algorithms for pairwise alignment, multiple sequence alignment, and database searching (e.g., BLAST) allow you to select from an assortment of PAM matrices such as PAM250, PAM70, and PAM30.

Why do we go from a mutation probability matrix to a log odds matrix?

We want a scoring matrix so that when we do a pairwise alignment (or a BLAST search) we know what score to assign to two aligned amino acid residues. Logarithms are easier to use for a scoring system. They allow us to sum the scores of aligned residues (rather than having to multiply them).

How do we go from a mutation probability matrix to a log odds matrix?

The cells in a log odds matrix consist of an “odds ratio”:

odds ratio = (probability that an alignment is authentic) / (probability that the alignment occurred at random)

The score S for an alignment of residues a, b is given by:

$S(a, b) = 10 \log_{10} (M_{ab} / p_b)$

where $M_{ab}$ is the entry of the mutation probability matrix for residues a and b, and $p_b$ is the normalised frequency of residue b. As an example, for tryptophan aligned with tryptophan,

$S(\mathrm{Trp}, \mathrm{Trp}) = 10 \log_{10} (0.55 / 0.010) = 17.4$

What do the numbers mean in a log odds matrix?

A score of +17 for tryptophan means that this alignment is 50 times more likely than a chance alignment of two Trp residues. To see this, let $S(a, b) = 17$ and let the odds ratio $M_{ab} / p_b = x$. Then

$17 = 10 \log_{10} x$
$1.7 = \log_{10} x$
$x = 10^{1.7} \approx 50$

A score of +2 indicates that the amino acid replacement occurs 1.6 times as frequently as expected by chance. A score of 0 is neutral.
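The conversion between odds ratios and log-odds scores can be checked numerically. The following is a minimal sketch using the formula and the tryptophan figures quoted above; the function names `log_odds_score` and `odds_ratio` are hypothetical helpers introduced only for this illustration.

```python
import math


def log_odds_score(m_ab, p_b):
    """S(a, b) = 10 * log10(M_ab / p_b), as in the formula above."""
    return 10 * math.log10(m_ab / p_b)


def odds_ratio(score):
    """Invert the formula: how many times more likely than chance."""
    return 10 ** (score / 10)


# Tryptophan example: M = 0.55, normalised frequency p = 0.010
print(round(log_odds_score(0.55, 0.010), 1))  # 17.4

# Interpreting individual scores
print(round(odds_ratio(17)))     # 50
print(round(odds_ratio(2), 1))   # 1.6
print(odds_ratio(0))             # 1.0
```

Because the scores are logarithms, the score of a whole alignment can be obtained by adding the per-position scores, which corresponds to multiplying the underlying odds ratios.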
A score of -10 indicates that the correspondence of two amino acids in an alignment that accurately represents homology (evolutionary descent) is one tenth as frequent as the chance alignment of these amino acids.

Comparing two proteins with a PAM1 matrix gives completely different results than PAM250! Consider two distantly related proteins. A PAM40 matrix is not forgiving of mismatches, and penalises them severely. Using this matrix you can find almost no match:

    hsrbp,  136  CRLLNLDGTC
    btlact,   3  CLLLALALTC
                 * ** *  **

A PAM250 matrix, in contrast, is very tolerant of mismatches.

BLOSUM Matrices

BLOSUM (BLOcks SUbstitution Matrix) matrices are based on local alignments. All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins. The BLOCKS database contains thousands of groups of multiple sequence alignments.

BLOSUM62 is a matrix calculated from comparisons of sequences sharing no more than 62% identity. BLOSUM62 is the default matrix in BLAST 2.0. Though it is tailored for comparisons of moderately distant proteins, it performs well in detecting closer relationships. A search for distant relatives may be more sensitive with a different matrix.
       A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
    A  4
    R -1  5
    N -2  0  6
    D -2 -2  1  6
    C  0 -3 -3 -3  9
    Q -1  1  0  0 -3  5
    E -1  0  0  2 -4  2  5
    G  0 -2  0 -1 -3 -2 -2  6
    H -2  0  1 -1 -3  0  0 -2  8
    I -1 -3 -3 -3 -1 -3 -3 -4 -3  4
    L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
    K -1  2  0 -1 -1  1  1 -2 -1 -3 -2  5
    M -1 -2 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
    F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
    P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
    S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
    T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
    W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
    Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
    V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
       A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V

Figure 3.11 BLOSUM 62 scoring matrix

This matrix merges all proteins in an alignment that have 62% amino acid identity or greater into one sequence. BLOSUM62 performs better than alternative BLOSUM matrices or a variety of PAM matrices at detecting distant relationships between proteins. It is thus the default scoring matrix for most database search programs such as BLAST.

3.3 Global and Local Alignment Algorithms

There are two kinds of sequence alignment: global and local. We will first consider the global alignment algorithm of Needleman and Wunsch (1970). We will then explore the local alignment algorithm of Smith and Waterman (1981). Finally, we will consider BLAST, a heuristic version of Smith-Waterman. We will cover BLAST in detail.

Global Alignment with the Algorithm of Needleman and Wunsch

Two sequences can be compared in a matrix along x- and y-axes. If they are identical, a path along a diagonal can be drawn. Find the optimal sub-paths, and add them up to achieve the best score. This involves:
• adding gaps when needed
• allowing for conservative substitutions
• choosing a scoring system (simple or complicated)

Three steps to global alignment with the Needleman-Wunsch algorithm:
1. set up a matrix
2. score the matrix
3. identify the optimal alignment(s)

Four possible outcomes in aligning two sequences:
• identity (stay along a diagonal)
• mismatch (stay along a diagonal)
• gap in one sequence (move vertically!)
• gap in the other sequence (move horizontally!)

Figure 3.12a&b Outcomes of aligning two sequences

Figure 3.12b Pairwise alignment of two amino acid sequences using a dynamic programming algorithm of Needleman and Wunsch (1970) for global alignment

Two sequences can be assigned a diagonal path through the matrix (top left panel); a mismatch in one still results in a diagonal path (top right panel); a deletion in sequence 2 (or insertion in 1) results in insertion of a gap position and a resulting vertical path in the optimal alignment (bottom left panel); a gap in the first sequence is represented by a horizontal path through the matrix (bottom right panel).

The Needleman-Wunsch algorithm is guaranteed to find optimal alignment(s), although the algorithm does not search all possible alignments. It is an example of a dynamic programming algorithm: an optimal path (alignment) is identified by incrementally extending optimal sub-paths. Thus, a series of decisions is made at each step of the alignment to find the pair of residues with the best score.

Scoring matrices for computing alignment scores are often based on observed substitution rates, derived from the substitution frequencies seen in multiple alignments of sequences.
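The dynamic programming idea described above (set up a matrix, score it, then trace back the optimal path) can be sketched in a few lines of Python. This is a minimal illustration rather than a production implementation: it assumes a simple scoring scheme of +5 for a match, -3 for a mismatch and -4 per gap position, and the function name `needleman_wunsch` and the example DNA sequences are illustrative choices.

```python
def needleman_wunsch(s1, s2, match=5, mismatch=-3, gap=-4):
    """Global alignment with a linear gap penalty.

    Returns (score, aligned_s1, aligned_s2).
    """
    m, n = len(s1), len(s2)
    # M[i][j] = best score for aligning s1[:i] with s2[:j]
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        M[i][0] = i * gap          # s1 prefix aligned against gaps only
    for j in range(1, n + 1):
        M[0][j] = j * gap          # s2 prefix aligned against gaps only
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if s1[i - 1] == s2[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + s,   # diagonal: align two residues
                          M[i - 1][j] + gap,     # gap in s2
                          M[i][j - 1] + gap)     # gap in s1

    # Traceback from the bottom-right corner to the top-left corner
    a1, a2 = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if s1[i - 1] == s2[j - 1] else mismatch
        if i > 0 and j > 0 and M[i][j] == M[i - 1][j - 1] + s:
            a1.append(s1[i - 1]); a2.append(s2[j - 1]); i -= 1; j -= 1
        elif i > 0 and M[i][j] == M[i - 1][j] + gap:
            a1.append(s1[i - 1]); a2.append("-"); i -= 1
        else:
            a1.append("-"); a2.append(s2[j - 1]); j -= 1
    return M[m][n], "".join(reversed(a1)), "".join(reversed(a2))

score, top, bottom = needleman_wunsch("ACGGACT", "ATCGGATCT")
print(score)
print(top)
print(bottom)
```

Once the scoring scheme is fixed, the optimal score is always the same; only the choice among equally good paths during traceback can vary between implementations.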
The score of an alignment is the sum of the scores for pairs of aligned characters plus the scores for gaps. For example, take the substitution matrix s(x, y) = +5 if x = y and -3 if x ≠ y, with a linear gap penalty g = -4. Two candidate alignments of the same pair of sequences then score as follows:

    A - C - G G A C T
    |   |   |     | |
    A T C G G A T C T

    Score = s(A,A) + g + s(C,C) + g + s(G,G) + s(G,A) + s(A,T) + s(C,C) + s(T,T)
          = 5 - 4 + 5 - 4 + 5 - 3 - 3 + 5 + 5 = +11

    A - C G G - A C T
    |   | | |     | |
    A T C G G A T C T

    Score = s(A,A) + g + s(C,C) + s(G,G) + s(G,G) + g + s(A,T) + s(C,C) + s(T,T)
          = 5 - 4 + 5 + 5 + 5 - 4 - 3 + 5 + 5 = +19

Neither candidate is the best possible: moving the second gap one position to the right pairs A with A, giving seven matches and two gaps for a score of 7(5) + 2(-4) = +27.

Dynamic programming (DP) algorithms can be specified by recurrence relations. The recurrence relation for global alignment with a linear gap penalty fills the DP matrix for all 1 ≤ i ≤ m and 1 ≤ j ≤ n:

\[ M(0,0) = 0, \qquad M(i,0) = i \cdot g, \qquad M(0,j) = j \cdot g \]

\[ M(i,j) = \max \begin{cases} M(i-1,\,j-1) + s(S_1[i], S_2[j]) \\ M(i-1,\,j) + g \\ M(i,\,j-1) + g \end{cases} \]

The DP guarantees an optimal alignment between two sequences. We have to provide a scoring system for the comparison of symbol pairs (nucleotides for DNA sequences and amino acids for protein sequences), and a scheme for insertion/deletion (gap) penalties, but once those parameters have been set, the resulting alignment score should always be the same.

To find the alignment itself, we must find the path of choices that led to this score. The procedure for doing this is known as traceback:
• Start from the bottom-right corner and trace back to the top-left.
• Each arrow introduces one character at the end of each aligned sequence.
• A horizontal move puts a gap in the left sequence.
• A vertical move puts a gap in the top sequence.
• A diagonal move uses one character from each sequence.

Local Alignment with the Algorithm of Smith and Waterman

Global alignment (Needleman-Wunsch) extends from one end of each sequence to the other. Local alignment finds optimally matching regions within two sequences (“subsequences”). Local alignment is almost always used for database searches such as BLAST.
Local alignment is useful for finding domains (or limited regions of homology) within sequences. Smith and Waterman (1981) solved the problem of performing optimal local sequence alignment. Other methods (BLAST, FASTA) are faster but less thorough.

In a local alignment, the alignment tends to stop at the ends of regions of identity or strong similarity. A much higher priority is given to finding these local regions than to extending the alignment to include more neighbouring amino acid pairs. This type of alignment favours finding conserved amino acid motifs in related protein sequences.

Given:
• a pair of sequences S1 and S2,
• a method for scoring the similarity of a pair of characters,
• a penalty function for gaps.

Task: Find subsequences of S1 and S2 whose similarity score is maximum over all pairs of subsequences of S1 and S2.

This requires only a slight modification of the basic DP algorithm. Given a sequence S1 of length m, a sequence S2 of length n, s(x, y) and g, construct an (m+1) × (n+1) matrix M, where M(i, j) is the score of the best alignment of a suffix of S1[1..i] and a suffix of S2[1..j].

Initialise the first row and first column of the matrix with 0:

\[ M(0,0) = 0; \qquad M(i,0) = 0; \qquad M(0,j) = 0 \]

Fill in the rest of the matrix top to bottom, left to right, and store the corresponding pointers to parent cells:

\[ M(i,j) = \max \begin{cases} 0 \\ M(i-1,\,j-1) + s(S_1[i], S_2[j]) \\ M(i-1,\,j) + g \\ M(i,\,j-1) + g \end{cases} \]

Traceback: find the maximum value of M(i, j), which can be anywhere in the matrix, and follow the traceback pointers from that cell until you hit a cell with value 0.

FASTA & BLAST: Rapid, Heuristic Versions of the Smith-Waterman Algorithm

Sequence databases are large and growing fast:
• Swiss-Prot: 5 × 10^7 amino acids (19 July 2003)
• TrEMBL: 3 × 10^8 amino acids (18 July 2003)
• GenBank: 3 × 10^10 base pairs (7 February 2003)

Even fast workstations are too slow for complete dynamic programming alignment.
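The local alignment recurrence above differs from the global one only in the zero floor and in where traceback starts and stops, and it still fills every one of the (m+1) × (n+1) cells, which is the source of the cost problem for large databases. A minimal Python sketch follows; the scoring values (+5, -3, -4), the function name `smith_waterman` and the two example sequences are assumptions made for illustration.

```python
def smith_waterman(s1, s2, match=5, mismatch=-3, gap=-4):
    """Local alignment with a linear gap penalty.

    Returns (best_score, aligned_part_of_s1, aligned_part_of_s2).
    """
    m, n = len(s1), len(s2)
    M = [[0] * (n + 1) for _ in range(m + 1)]   # first row/column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if s1[i - 1] == s2[j - 1] else mismatch
            M[i][j] = max(0,                       # start a fresh local alignment
                          M[i - 1][j - 1] + s,
                          M[i - 1][j] + gap,
                          M[i][j - 1] + gap)
            if M[i][j] > best:
                best, best_pos = M[i][j], (i, j)

    # Traceback from the highest-scoring cell until a cell with value 0
    i, j = best_pos
    a1, a2 = [], []
    while i > 0 and j > 0 and M[i][j] > 0:
        s = match if s1[i - 1] == s2[j - 1] else mismatch
        if M[i][j] == M[i - 1][j - 1] + s:
            a1.append(s1[i - 1]); a2.append(s2[j - 1]); i -= 1; j -= 1
        elif M[i][j] == M[i - 1][j] + gap:
            a1.append(s1[i - 1]); a2.append("-"); i -= 1
        else:
            a1.append("-"); a2.append(s2[j - 1]); j -= 1
    return best, "".join(reversed(a1)), "".join(reversed(a2))

# Two sequences that share only an internal region
local_score, part1, part2 = smith_waterman("TTTTACGTACGTCCCC", "GGGGACGTACGTAAAA")
print(local_score, part1, part2)
```

Each call fills len(s1) × len(s2) cells, so applying it to every entry of a database costs time proportional to the query length times the total database length.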
Assume a rate of 10^7 matrix cells/sec (that’s pretty fast). For an amino acid sequence of length 400, this leads to a runtime of:

\[ \frac{400 \times 5 \times 10^{7}}{10^{7}}\ \text{sec} = 2{,}000\ \text{sec} \approx 33\ \text{min for Swiss-Prot} \]

\[ \frac{400 \times 3 \times 10^{8}}{10^{7}}\ \text{sec} = 12{,}000\ \text{sec} \approx 3.3\ \text{hours for TrEMBL} \]

For a DNA sequence of length 1000, this leads to a runtime of:

\[ \frac{1000 \times 3 \times 10^{10}}{10^{7}}\ \text{sec} = 3{,}000{,}000\ \text{sec} \approx 35\ \text{days for GenBank} \]

The Smith-Waterman algorithm is very rigorous and it is guaranteed to find an optimal alignment. But it is slow. It requires computer space and time proportional to the product of the lengths of the two sequences being aligned (or the product of the lengths of a query and an entire database). Gotoh (1982) and Myers and Miller (1988) improved the algorithms so that both global and local alignment require less time and space. FASTA and BLAST provide rapid alternatives to the Smith-Waterman algorithm.

How FASTA works

1. A “lookup table” is created. It consists of short stretches of amino acids (e.g., k=3 for a protein search). The length of a stretch is called a k-tuple. The FASTA algorithm finds the ten highest scoring segments that align to the query.
2. These ten aligned regions are re-scored with a PAM or BLOSUM matrix.
3. High-scoring segments are joined.
4. The Needleman-Wunsch or Smith-Waterman algorithm is then performed.

How BLAST works

Pairwise alignment: BLAST 2 SEQUENCES

Figure 3.13 a&b Screen Capture of NCBI web site showing BLAST 2 Sequences algorithm and results

Go to http://www.ncbi.nlm.nih.gov/BLAST. Choose BLAST 2 SEQUENCES. In the program:
1. choose blastp or blastn
2. paste in your accession numbers (or use FASTA format)
3. select optional parameters:
   • 3 BLOSUM and 3 PAM matrices
   • gap creation and extension penalties
   • filtering
   • word size

Figure 3.13 b Results from a BLAST 2 sequence query

Regions aligning with each other are shown as boxes along the diagonal.
The query sequence is depicted at the top; residues identical in the query and subject are shown with their amino acid letters, and allowed substitutions are shown with a + sign.

Tests for Significance of Pairwise Alignments

Figure 3.14 Tabulation of sensitivity and specificity for sequences

Sensitivity and specificity are dictated by the ability to identify true positive data while minimising false positives. All models/algorithms need to be tested to ensure a balance between getting results and reducing artefacts. Source: with permission from Sonego P. et al., Brief Bioinform (2008) 9(3): 198-209.

Randomisation test: scramble a sequence
• First compare two proteins and obtain a score.
• Next scramble the bottom sequence 100 times, and obtain 100 “randomised” scores (+/- S.D.).
• Composition and length are maintained.
• If the comparison is “real” we expect the authentic score to be several standard deviations above the mean of the “randomised” scores.

For example, a randomisation test shows that RBP is significantly related to β-lactoglobulin. (But this test assumes a normal distribution of scores!)

You can perform this randomisation test in GCG using the gap or bestfit pairwise alignment programs. Type

    > gap -ran=100

or

    > bestfit -ran=100

\[ Z = \frac{S_{\text{real}} - \bar{X}_{\text{randomised}}}{\text{standard deviation}} \]

The PRSS program performs a scramble test for you at http://fasta.bioch.virginia.edu/fasta/prss.htm (But these scores are not normally distributed!)

Figure 3.15 Series of alignments for scoring sequences from the database

Note that these alignments are produced post hoc and do not actually represent the search process.

BLAST and FASTA share a common strategy:
• fast screening to eliminate unrelated sequences
• complete alignment of top scoring sequences

BLAST and FASTA differ in:
• statistical model
• heuristics and tuning

1. Which of the following amino acids is least mutable according to the PAM scoring matrix?
(a) Alanine
(b) Glutamine
(c) Methionine
(d) Cysteine

2. You have two distantly related proteins. Which BLOSUM or PAM matrix is best to use to compare them?
(a) BLOSUM45 or PAM250
(b) BLOSUM45 or PAM1
(c) BLOSUM80 or PAM250
(d) BLOSUM80 or PAM1

3. True or False: Two proteins that share 30% amino acid identity are 30% homologous.

Answers: 1. (d) 2. (a) 3. False

Carry out the following exercise and post your answers on the Learning Management System. Do a Google search for the CDD website and search for a specific set of motifs and conserved sequences in the human tbx-18 protein.

You should now read the following:
Gene networks: http://en.wikipedia.org/wiki/Gene_regulatory_network
Epigenetics: http://en.wikipedia.org/wiki/Epigenetics

Find Out More
Ion-torrent: http://www.youtube.com/watch?v=WYBzbxIfuKs
Illumina SBS: http://technology.illumina.com/technology/next-generationsequencing/sequencing-technology.html

Is the DNA or protein sequence/structure more useful in understanding evolution? Why? Answers will be discussed in class.

References

Gotoh, O. An improved algorithm for matching biological sequences. J. Mol. Biol. 162, 705-708 (1982).
Myers, E.W., and Miller, W. Optimal alignments in linear space. Comput. Appl. Biosci. 4, 11-17 (1988).
Needleman, S.B., and Wunsch, C.D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 443-453 (1970).
Smith, T.F., Waterman, M.S., and Fitch, W.M. Comparative bio-sequence metrics. J. Mol. Evol. 18, 38-46 (1981).

Summary

The following key points are discussed in this unit:
• Concept of Pairwise Alignment
• Protein vs. DNA sequence alignments
• Homology, Similarity and Identity
• Protein Sequence vs. Structure information in studying evolution
• Dayhoff model and the concept of Accepted Point Mutations (PAM)
• Global vs. Local Alignment tools
• Comparison of BLAST and FASTA

STUDY UNIT 4
DATABASE SEARCHING WITH BLAST AND FASTA

Learning Outcomes

Upon completion of this unit, you will be able to:
1. Discuss the advantages and limitations of the BLAST algorithm
2. Differentiate between Normal Gaussian Distribution and EVD
3. Apply the concepts of basic BLAST to searches for genes/protein sequences and evaluate significance of the search results
4. Differentiate between BLAST and FASTA algorithms for pairwise alignments

You can refer to Chapter 4 of the textbook.

Overview

This unit introduces the Basic Local Alignment Search Tool (BLAST), which is the main NCBI tool for comparing a query sequence to other sequences in various databases, as well as FASTA. BLAST and FASTA are rapid, heuristic versions of the pairwise alignment algorithm. Steps of the BLAST and FASTA search processes are described. Strategies applied for BLAST database searching are discussed with examples.

Chapter 4 Database Searching with BLAST and FASTA

4.1 Basic Concepts, BLAST Searches and Interpretation of Results

Selectivity: Describes the ability of a search tool to discard false positives, i.e., higher selectivity means that the method identifies fewer matches between unrelated sequences.

Sensitivity: Describes the ability of a search tool to avoid false negatives, i.e., higher sensitivity means that the method identifies more matches between distantly related sequences.

Significance: A significant result is one that has not simply occurred by chance. Significance levels show how likely a result is to be due to chance, expressed as a probability. In sequence analysis, the significance of an alignment score may be calculated as the chance that such a score would be found between random sequences.

Most database search methods involve a trade-off: Sensitivity vs Selectivity. For example: Suppose a database contains 1,000 globin sequences.
Suppose a search of this database for globins reported 900 results, 700 of which were really globin sequences and 200 of which were not. This result would be said to have 300 false negatives (misses) and 200 false positives. Lowering a tolerance threshold will most likely decrease the number of false negatives but increase the number of false positives, i.e., higher sensitivity but lower selectivity.

Another trade-off in database searching: Sensitivity vs Speed

BLAST (Basic Local Alignment Search Tool) allows rapid sequence comparison of a query sequence against a database. The BLAST algorithm is fast, accurate, and web-accessible.

Why do we need BLAST searching?

BLAST searching is fundamental to understanding the relatedness of any query sequence of interest to other known proteins or DNA sequences. Applications of BLAST include identifying orthologs and paralogs, discovering new genes or proteins, discovering variants of genes or proteins, investigating expressed sequence tags (ESTs), and exploring protein structure and function.

BLAST Search Steps

Four steps to a BLAST search:
1. Choose the sequence (query)
2. Select the BLAST program
3. Choose the database to search
4. Choose optional parameters

Specifying Sequence of Interest

The sequence can be input in FASTA format or as an accession number.

Figure 4.1 Screen capture of NCBI home page showing RBP protein in FASTA format

Generating the FASTA format is a prerequisite for searching data in multiple databases. This can be carried out for a single sequence or for multiple sequences queried in batch mode.
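Since queries are usually supplied in FASTA format, it helps to see how little structure the format has: each record is a header line beginning with ">" followed by one or more lines of sequence. The sketch below is a minimal reader written for illustration only; the function name `parse_fasta` and the two example records are hypothetical, and real projects would normally rely on an established library instead.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # ignore blank lines
        if line.startswith(">"):
            if header is not None:        # close the previous record
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)           # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

# Hypothetical two-record input, as might be pasted for a batch query
example = """>query1 hypothetical example peptide
FSGTWYA
MAKKDPEG
>query2 another hypothetical record
CRLLNLDGTC
"""

records = parse_fasta(example)
for header, sequence in records:
    print(header, len(sequence))
```

A batch query is then just several such records in one file or text box.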
Selecting BLAST Program

DNA can be translated into six potential proteins, for example,

Figure 4.2 Schema of the six potential protein start sites from any ds DNA

When searching for proteins encoded by a DNA sequence, all six potential reading frames (three on each strand) need to be interrogated, and the proteins generated from all of these frames need to be aligned against a query sequence.

• blastn (nucleotide BLAST): nucleotide query against a nucleotide database
• blastp (protein BLAST): protein query against a protein database
• tblastn (translated BLAST): protein query against a translated nucleotide database
• blastx (translated BLAST): translated nucleotide query against a protein database
• tblastx (translated BLAST): translated nucleotide query against a translated nucleotide database

Selecting a Database

• nr = non-redundant (most general database)
• dbest = database of expressed sequence tags
• dbsts = database of sequence tag sites
• gss = genomic survey sequences
• htgs = high throughput genomic sequence

Selecting Optional Search and Formatting Parameters

You can choose the organism to search, turn filtering on/off, change the substitution matrix, change the expect (E) value, change the word size, change the output format, and so on. For example, choosing filter, selecting alignment view, descriptions and alignments:

Figure 4.3 Output from a BLAST search using Retinol binding protein

The Local Alignment Strategy for BLAST Search

The Smith-Waterman algorithm is guaranteed to find optimal alignments, but it is computationally expensive (requires O(n^2) time). BLAST and FASTA are heuristic approximations to local alignment. Each requires only O(n^2/k) time; they examine only part of the search space.

How BLAST works

The central idea of the BLAST algorithm is to confine attention to segment pairs that contain a word pair of length w with a score of at least T.
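The central idea stated above (word pairs of length w scoring at least T) can be made concrete with a small Python sketch. It cuts a query into overlapping words and lists every candidate word whose score against a query word reaches the threshold, using the BLOSUM62 values from Figure 3.11. The threshold T = 11, the function names, and the brute-force enumeration of all 8,000 three-letter words are assumptions made for clarity; BLAST itself builds its word list far more efficiently.

```python
from itertools import product

AA = "ARNDCQEGHILKMFPSTWYV"
# Lower triangle of BLOSUM62, one row per residue in the order of AA (Figure 3.11)
_ROWS = """
4
-1 5
-2 0 6
-2 -2 1 6
0 -3 -3 -3 9
-1 1 0 0 -3 5
-1 0 0 2 -4 2 5
0 -2 0 -1 -3 -2 -2 6
-2 0 1 -1 -3 0 0 -2 8
-1 -3 -3 -3 -1 -3 -3 -4 -3 4
-1 -2 -3 -4 -1 -2 -3 -4 -3 2 4
-1 2 0 -1 -1 1 1 -2 -1 -3 -2 5
-1 -2 -2 -3 -1 0 -2 -3 -2 1 2 -1 5
-2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6
-1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7
1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4
0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5
-3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11
-2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7
0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4
"""
BLOSUM62 = {}
for i, row in enumerate(_ROWS.strip().splitlines()):
    for j, value in enumerate(row.split()):
        # The matrix is symmetric, so fill both orientations
        BLOSUM62[AA[i], AA[j]] = BLOSUM62[AA[j], AA[i]] = int(value)

def words(query, w=3):
    """All overlapping words of length w in the query."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]

def word_score(a, b):
    """Sum of BLOSUM62 scores for two equal-length words."""
    return sum(BLOSUM62[x, y] for x, y in zip(a, b))

def neighbourhood(word, threshold=11):
    """Every word scoring at least `threshold` against `word` (brute force)."""
    hits = {}
    for candidate in map("".join, product(AA, repeat=len(word))):
        s = word_score(word, candidate)
        if s >= threshold:
            hits[candidate] = s
    return hits

query_words = words("FSGTWYA")
print(query_words)
print(sorted(neighbourhood("FSG").items(), key=lambda kv: -kv[1])[:5])
```

Raising the threshold shrinks each neighbourhood, which speeds up the search at the cost of sensitivity.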
The original BLAST algorithm works in 3 phases:

Phase 1: Compile a list of word pairs (w=3) above threshold T.

Example: for a human RBP query …FSGTWYA…, a list of words (w=3) is:

    FSG SGT GTW TWY WYA
    YSG TGT ATW SWY WFA
    FTG SVT GSW TWF WYS

The first row holds the words taken directly from the query; the rows below hold similar words.

Figure 4.3a 3 steps used by the BLAST algorithm to align sequences

The BLAST algorithm parses the query sequence into 3-letter words; these are compared to the words in the database and the scores for the various hits are recorded. If multiple (2 or more) hits with scores higher than the threshold are obtained, the alignment is continued in both directions until the next set of identical hits is encountered.

Phase 2: Scan the database for entries that match the compiled list. This is fast and relatively easy.

Phase 3: When you manage to find a hit (i.e., a match between a “word” and a database entry), extend the hit in either direction.
• Keep track of the score (use a scoring matrix).
• Stop when the score drops below some cutoff.

For example,

Figure 4.3b 3 steps used by the BLAST algorithm to align sequences

In the original (1990) implementation of BLAST, hits were extended in either direction. In a 1997 refinement of BLAST, two independent hits are required. The hits must occur in close proximity to each other. With this modification, only one seventh as many extensions occur, greatly reducing the time required for a search.

You can modify the threshold parameter. The default value for blastp is 11. To change it, enter “-f 16” or “-f 5” in the advanced options.

How to interpret BLAST: E values and p values

It is important to assess the statistical significance of search results. For global alignments, the statistics are poorly understood. For local alignments (including BLAST search results), the scores follow an extreme value distribution (EVD) rather than a normal distribution.
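The contrast between the extreme value distribution and the normal distribution can be made concrete by evaluating both densities at a few points. The sketch below assumes the standard Gumbel form of the EVD, with a location parameter `u` and a decay (scale) parameter `lam`; the function names are illustrative.

```python
import math

def evd_pdf(x, u=0.0, lam=1.0):
    """Extreme value (Gumbel) density with characteristic value u and decay constant lam."""
    z = lam * (x - u)
    return lam * math.exp(-z - math.exp(-z))

def normal_pdf(x, mean=0.0, sd=1.0):
    """Normal (Gaussian) density."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

# Compare the two densities on both sides of the peak
for x in (-2, 0, 2, 4):
    print(f"x={x:+d}  EVD={evd_pdf(x):.4f}  normal={normal_pdf(x):.4f}")
```

The EVD is skewed: it falls away quickly on the left and slowly on the right, so high scores are more probable under the EVD than a normal approximation would suggest.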
The probability density function of the extreme value distribution (characteristic value u = 0 and decay constant λ = 1) is illustrated as follows:

Figure 4.4 EVD vs. Normal Gaussian distribution plot

Evolution tends to be sporadic and not continuous. Thus, a few hot spots of high mutation rates dictate the divergence of species and the generation of newer species. Using sequence data to pick up such events requires use of the EVD pattern of distribution, rather than the normal symmetrical distribution followed by Gaussian plots. Hence, BLAST uses the EVD pattern to identify homology.

Gaussian vs. Extreme Value Distribution (Access video via iStudyGuide)

The expect value E is the number of alignments with scores greater than or equal to score S that are expected to occur by chance in a database search. An E value is related to a probability value p. The key equation describing an E value is:

\[ E = K m n \, e^{-\lambda S} \]

This equation is derived from a description of the extreme value distribution, where
S = the score
E = the expect value = the number of HSPs (high-scoring segment pairs) expected to occur with a score of at least S
m, n = the lengths of the two sequences
λ, K = Karlin-Altschul statistical parameters

Some properties of the equation E = K m n e^{-λS}:
• The value of E decreases exponentially with increasing S (higher S values correspond to better alignments). Very high scores correspond to very low E values.
• The expected score for aligning a pair of random residues must be negative. Otherwise, long random alignments would acquire great scores.
• Parameter K describes the search space (database).
• For E = 1, one match with a similar score is expected to occur by chance. For a very much larger or smaller database, you would expect E to vary accordingly.
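The properties listed above follow directly from the equation and are easy to check numerically. In the sketch below, the values K = 0.041 and λ = 0.267 and the lengths m and n are assumptions chosen only for illustration; the actual parameters depend on the scoring matrix and gap penalties in use.

```python
import math

def expect_value(score, m, n, K, lam):
    """E = K * m * n * exp(-lam * S)."""
    return K * m * n * math.exp(-lam * score)

K, LAM = 0.041, 0.267        # illustrative Karlin-Altschul parameters (assumed)
m, n = 250, 1_000_000        # illustrative query and database lengths (assumed)

# E falls exponentially as the raw score S rises
for s in (40, 60, 80, 100):
    print(f"S={s:3d}  E={expect_value(s, m, n, K, LAM):.3g}")

e80 = expect_value(80, m, n, K, LAM)
e100 = expect_value(100, m, n, K, LAM)
e80_big_db = expect_value(80, m, 2 * n, K, LAM)
print(e100 / e80, e80_big_db / e80)
```

Doubling the database length doubles E for the same score, which is why the same alignment can look less significant in a larger database.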
There are two kinds of scores: raw scores (calculated from a substitution matrix) and bit scores (normalised scores).

Bit scores are comparable between different searches because they are normalised to account for the use of different scoring matrices and different database sizes:

\[ S' = \text{bit score} = \frac{\lambda S - \ln K}{\ln 2} \]

The E value corresponding to a given bit score is

\[ E = m n \, 2^{-S'} \]

Bit scores allow you to compare results between different database searches, even using different scoring matrices.

A p value is a different way of representing the significance of an alignment:

\[ p = 1 - e^{-E} \]

Very small E values are very similar to p values. E values of about 1 to 10 are far easier to interpret than the corresponding p values.

    E         p
    10        0.99995460
    5         0.99326205
    2         0.86466472
    1         0.63212056
    0.1       0.09516258 (about 0.1)
    0.05      0.04877058 (about 0.05)
    0.001     0.00099950 (about 0.001)
    0.0001    0.0001000

Figure 4.5a Summary of results from a typical BLAST search

Sometimes, a real match has an E value > 1, for example,

Figure 4.5 b Alignment results from a BLAST search

Assessing whether proteins are homologous, as illustrated in the following:

Figure 4.6 Deducing Protein homology from alignment scores

Deducing homology from sequence alignment scores alone can be misleading. Many homologous proteins have 20% or fewer identical amino acids in their primary sequence. Complementing sequence data with a comparison of the structural (or possible structural) data is more conclusive. However, there are fewer protein structures than sequence data available.

RBP4 and PAEP: Low bit score, E value 0.49, 24% identity (“twilight zone”). But they are indeed homologous.
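The relationships between raw score, bit score, E value and p value given above can be verified with a few lines of Python. The parameter values K = 0.041 and λ = 0.267 and the lengths used here are assumptions for illustration only; the function names are likewise illustrative.

```python
import math

def bit_score(raw_score, K, lam):
    """S' = (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)

def e_value_from_bits(bits, m, n):
    """E = m * n * 2^(-S')."""
    return m * n * 2.0 ** (-bits)

def p_value(e):
    """p = 1 - exp(-E), computed with expm1 for accuracy at small E."""
    return -math.expm1(-e)

K, LAM = 0.041, 0.267          # illustrative parameters (assumed)
m, n, raw = 250, 1_000_000, 80

bits = bit_score(raw, K, LAM)
print(f"raw {raw} -> {bits:.1f} bits, E = {e_value_from_bits(bits, m, n):.3g}")

# Reproduce part of the E-to-p table, plus the RBP4/PAEP value E = 0.49
for e in (10, 1, 0.49, 0.1, 0.001):
    print(f"E={e:<6} p={p_value(e):.8f}")
```

For E = 0.49 the corresponding p value is about 0.39, which shows why a true homolog in the twilight zone cannot be recognised from the statistics alone.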
Try a BLAST search with PAEP as a query, and find many other lipocalins.

BLAST searching with HIV-1 pol, a multi-domain protein:

4.2 FASTA Procedure

FASTA is a family of heuristic algorithms developed by William Pearson of the University of Virginia. FASTA lies between BLAST and Smith-Waterman in both accuracy and speed. An optimised FASTA option makes use of Smith-Waterman for part of the alignment process. The FASTA family includes DNA-to-DNA, protein-to-protein, and translation searches. Refer to Pearson, W. R. and Lipman, D. J., Improved tools for biological sequence comparison, Proceedings of the National Academy of Sciences, 1988.

Procedure:
– Find the best regions on diagonals.
– Re-score the 10 best regions using a PAM or BLOSUM substitution matrix. The best of these new scores is a first measure of the similarity between two sequences, called INIT1.
– INIT1 is computed for every database sequence with respect to the query sequence. INIT1 is used to rank all database sequences.
– For the highest-ranking sequences, an optimised score opt is computed by running a DP algorithm restricted to a band around the initial alignment.
– Calculate the significance of scores.

The FASTA program sets a size k for k-tuple sub-words. The program then looks for diagonals in the comparison matrix between query and search sequence along which many k-tuples match. This can be done very quickly based on a preprocessed list of k-tuples contained in the query sequence. The set of k-tuples can be identified with an array whose length corresponds to the number of possible tuples of size k. This array is linked to the indices where the particular k-tuples occur in the query sequence.

Note that a matching k-tuple at index i in the query and at index j in the database sequence can be attributed to a diagonal by subtracting one index from the other. Therefore, when inspecting a new sequence for similarity, one walks along this sequence inspecting each k-tuple.
For each of them, one looks up the indices where it occurs in the query, computes the index difference to identify the diagonal, and increases a counter for this diagonal. After inspecting the search sequence in this way, a diagonal with a high count is likely to contain a well-matching region. In terms of execution time, this procedure is only linear in the length of the database sequence and can easily be iterated for a whole database. Of course, this rough outline needs to be adapted to focus on regions on diagonals where the match density is high and to link nearby, good diagonals into alignments.

Step 1

Determine k-tuples common to both sequences:
• k = 1 or 2 for proteins
• k = 4, 5, or 6 for DNA

The value of k is a parameter called ktup in the program.

In addition, the offset of a common k-tuple is determined. The offset is a value between -n+1 and m-1 that determines the relative displacement of sequence S1 relative to sequence S2 (n and m are the lengths of the sequences). Specifically, if the common k-tuple starts at positions S1[i] and S2[j], then offset = i - j. This is called the “diagonal method”, because an offset can be viewed as a diagonal in the dot plot matrix.

Scan S2 and look up each k-tuple of S2 in the table. For all common occurrences, the entry of the corresponding offset in the offset vector is incremented. Offsets correspond to diagonals in the “dot plot”. The hashing technique used by FASTA is an efficient way to count the number of “hot spots” in each diagonal of the dot plot.
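The lookup table and offset vector of Step 1 can be sketched as follows. This is a minimal illustration of the diagonal method, not the FASTA program's own code: the function names and the two short example sequences are hypothetical, and positions are counted from 0 rather than 1 (which does not change the offsets, since offset = i - j).

```python
from collections import defaultdict

def ktuple_table(query, k):
    """Map each k-tuple of the query to the list of positions where it starts."""
    table = defaultdict(list)
    for i in range(len(query) - k + 1):
        table[query[i:i + k]].append(i)
    return table

def diagonal_counts(query, subject, k=2):
    """Count hot spots (exact k-tuple matches) per diagonal, offset = i - j."""
    table = ktuple_table(query, k)
    counts = defaultdict(int)
    for j in range(len(subject) - k + 1):
        # Look up the subject k-tuple; every query position i gives one hot spot
        for i in table.get(subject[j:j + k], ()):
            counts[i - j] += 1
    return dict(counts)

# Hypothetical sequences sharing the stretch AAQI
query = "HARFYAAQIVL"
subject = "VDMAAQIA"
offsets = diagonal_counts(query, subject, k=2)
print(offsets)   # diagonals with their hot-spot counts
```

The subject is scanned once and each k-tuple costs one table lookup, so the work grows with the sequence lengths plus the number of hot spots found.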
Complexity Analysis: Given two sequences of length n and m,
• Time complexity to calculate the offset vector: O(m + n + x), where x is the number of hot-spots, i.e., exact matches of length k
• Increasing ktup decreases x
• Higher ktup value → FASTA is faster
• Higher ktup value → FASTA is less sensitive
• Trade-off: sensitivity vs speed
• ktup = 1 or 2 for protein sequences
• ktup = 4, 5 or 6 for DNA sequences

Join two or more k-tuples in the same diagonal if they are not very far apart. The combined k-tuples form a region. A region is a gapless local alignment. Regions are given a score depending on the matches and mismatches:

    Score = (sum of hot-spots) - (distance between hot-spots)

Store the 10 best regions.

Step 2

Re-score the 10 best regions using a substitution matrix. The best of these new scores is used as a first measure of similarity between S1 and S2, called init1.

Step 3

From now on, only database sequences with init1 > cutoff are considered. FASTA then tries to join nearby regions of the 10 best scoring regions using gaps. The best joined score is called initn.

Step 4

Additionally, an opt score is computed by computing a local banded alignment around the init1 region (the band has 16 diagonals for ktup=2, 32 diagonals for ktup=1).

Step 5

Assessing the statistical significance of a score:

Z-score: It is a way of measuring the significance of a score considering the mean of the random score distribution. The difference between the similarity score for a single alignment and the mean of the random score distribution is normalised by the standard deviation of that random score distribution. Higher Z-scores are better because the further the real score is from this mean (in standard deviation units), the more significant it is.

E value: It is the expected number of alignments with a score at least as good as the one found between a query sequence and a database sequence that would occur in as many comparisons between random sequences as were done to find the matching sequence.
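The Z-score described above can be estimated by shuffling one sequence many times, which preserves its composition and length, and comparing the real score with the distribution of shuffled scores. The sketch below is an illustration under stated assumptions: it uses a simple local alignment score (+5 match, -3 mismatch, -4 gap) instead of a substitution matrix, 100 shuffles, a fixed random seed, and an arbitrary 40-residue peptide with one substitution as the related sequence.

```python
import random
import statistics

def local_score(s1, s2, match=5, mismatch=-3, gap=-4):
    """Smith-Waterman score only, keeping just two rows of the DP matrix."""
    prev = [0] * (len(s2) + 1)
    best = 0
    for a in s1:
        cur = [0]
        for j, b in enumerate(s2, start=1):
            s = match if a == b else mismatch
            cur.append(max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        best = max(best, max(cur))
        prev = cur
    return best

def shuffle_z_score(s1, s2, trials=100, seed=0):
    """Z = (real score - mean of shuffled scores) / standard deviation."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    real = local_score(s1, s2)
    letters = list(s2)
    shuffled_scores = []
    for _ in range(trials):
        rng.shuffle(letters)           # same composition and length as s2
        shuffled_scores.append(local_score(s1, "".join(letters)))
    mean = statistics.mean(shuffled_scores)
    sd = statistics.stdev(shuffled_scores)
    return (real - mean) / sd

# Arbitrary illustrative peptide and a copy with one substitution
peptide = "MKWVWALLLLAALGSGRAERDCRVSSFRVKENFDKARFSG"
relative = peptide[:12] + "W" + peptide[13:]
z = shuffle_z_score(peptide, relative)
print(f"Z-score: {z:.1f}")
```

Because alignment scores follow an extreme value distribution rather than a normal one, a Z-score of this kind is a rough guide and should not be converted to a probability using normal tables.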
When the Z-score goes up, the E-value goes down.

1. You have a short DNA sequence. Basically, how many proteins can it potentially encode?
(a) 1
(b) 2
(c) 3
(d) 6

2. You can limit a BLAST search using any Entrez term. For example, you can limit the results to those containing a researcher’s name.
(a) True
(b) False

3. As the E value of a BLAST search becomes smaller:
(a) The value of K also becomes smaller
(b) The score tends to be larger
(c) The probability p tends to be larger
(d) The Extreme Value Distribution becomes less skewed

Answers: 1. (d) 2. (a) 3. (b)

Visit the link http://blast.ncbi.nlm.nih.gov/Blast.cgi and do a nucleotide search and a protein search using the terms human and tbx-18. List the top 5 results. Search using only tbx-18 and list the top 5 results. Answers will be discussed in class.

Summary

The following key points are discussed in this unit:
• Basic Concepts used in the BLAST algorithm
• Selectivity, Sensitivity and Specificity
• BLAST Search steps
• FASTA principles
• Interpretation of BLAST results: E-values and p-values

STUDY UNIT 5
ADVANCED BLAST SEARCHING

Learning Outcomes

Upon completion of this unit, you will be able to:
1. Compare and contrast the various BLAST programs and their output and explain the need for these programs
2. Evaluate the role of PSI-BLAST and PHI-BLAST to study homologous proteins with limited sequence similarity
3. Explain how protein sequence patterns are utilised in homology studies
4. Practise carrying out searches using Advanced BLAST tools
5. Illustrate your knowledge of BLAST, Advanced BLAST and other databases to solve a research problem

You can refer to Chapter 5 of the textbook.

Overview

BLAST searches can be very versatile. This unit further explores the advanced BLAST searching techniques. We begin with an overview of the specialised BLAST resources and websites.
We then focus on finding distantly related proteins with Position-Specific Iterated BLAST (PSI-BLAST) and significant pattern matches with Pattern-Hit Initiated BLAST (PHI-BLAST). Finally, the use of BLAST for gene discovery is illustrated.

Chapter 5 Advanced BLAST Searching

5.1 Types of BLAST Programs and Search Parameters

We have used two BLAST resources, both from the NCBI website: BLAST 2 Sequences and the standard five BLAST programs. There are other specialised BLAST programs.

First, there are many entire databases that consist of molecular sequence data from a specific organism. Often, the data include unfinished sequences that have not yet been deposited in GenBank. For example, the Ensembl BLAST server allows the user to search the Ensembl database, including the most finished sequence. The output of a BLAST search of the Ensembl database using RBP4 as a query is presented in a graphical format by chromosome, showing the best match to the long arm of chromosome 10 near the centromere. Weaker matches to paralogs on other chromosomes are also evident, as in the following:

Figure 5.1 Location of RBP protein on the long arm of chromosome 10
The red box depicts the location of the RBP protein on chromosome 10. Weaker matches are also depicted in blue and green.

The BLAST search from The Institute for Genomic Research (TIGR) allows a choice of databases from various organisms as well as optional parameters such as a choice from dozens of substitution matrices. The TIGR BLAST output resembles that of NCBI BLAST, but with fewer organisms. Also, there is typically only one entry per species because redundant or partial sequences from assorted databases are unified into one accession number.

Specialised BLAST-related algorithms

Developed at Washington University, WU BLAST 2.0 is related to the traditional NCBI BLAST algorithms, as both derive from the original NCBI BLAST algorithm, which did not permit gapped alignments.
WU BLAST 2.0 may provide faster speed and increased sensitivity, and it includes a variety of options such as a full Smith-Waterman alignment on some pairwise alignments of database matches.

Figure 5.2 Screen capture of EMBL site showing WU-Blast 2

BLAST-like tools for genomic DNA searches

The analysis of genomic DNA presents special challenges:
• There are exons (protein-coding sequence) and introns (intervening sequences).
• There may be sequencing errors or polymorphisms.
• The comparison may be between related species (e.g., human and mouse).

Recently developed tools include:
• MegaBLAST at NCBI.
• BLAT (BLAST-like alignment tool). BLAT parses an entire genomic DNA database into words (11mers), then searches them against a query. Thus it is a mirror image of the BLAST strategy. See http://genome.ucsc.edu
• SSAHA at Ensembl, which uses a strategy similar to that of BLAT. See http://www.ensembl.org

Position-Specific Iterated BLAST (PSI-BLAST)

Many homologous proteins share only limited sequence identity. PSI-BLAST is a specialised kind of BLAST search that is often more sensitive than a regular BLAST search. The purpose of PSI-BLAST is to look deeper into the database for matches to your query protein sequence by employing a scoring matrix that is customised to your query.

PSI-BLAST is performed in five steps:
1. Select a query and search it against a protein database;
2. PSI-BLAST constructs a multiple sequence alignment, then creates a “profile” or specialised position-specific scoring matrix (PSSM);
3. The PSSM is used as a query against the database;
4. PSI-BLAST estimates statistical significance (E values);
5. Repeat steps [3] and [4] iteratively, typically 5 times. At each new search, a new profile is used as the query.

Figure 5.3 Steps and results of a PSI-BLAST search
PSI-BLAST creates a profile of the conserved residues between the query and the search results. The following iterations further strengthen this profile.
Barring the introduction of spurious false positives, typically 3-5 iterations are needed to establish homology relationships even between sequences with low sequence identity.

Performance assessment: Evaluate PSI-BLAST results using a database in which protein structures have been solved and all proteins in a group share < 40% amino acid identity.

PSI-BLAST is useful for detecting weak but biologically meaningful relationships between proteins. The main source of false positives is the spurious amplification of sequences not related to the query. For instance, a query with a coiled-coil motif may detect thousands of other proteins with this motif that are not homologous. Once even a single spurious protein is included in a PSI-BLAST search above threshold, it will not go away.

The problem of corruption: Corruption is defined as the presence of at least one false positive alignment with an E value < $10^{-4}$ after five iterations. Three approaches to stopping corruption are:
• Apply filtering of biased-composition regions.
• Adjust the E value threshold from 0.001 (default) to a lower value such as E = 0.0001.
• Visually inspect the output from the iterations and remove suspicious hits by unchecking the box.

Pattern-Hit Initiated BLAST (PHI-BLAST)

Given a protein sequence S and a regular expression pattern P occurring in S, PHI-BLAST helps answer the question: What other protein sequences both contain an occurrence of P and are homologous to S in the vicinity of the pattern occurrences? PHI-BLAST may be preferable to just searching for pattern occurrences because it filters out those cases where the pattern occurrence is probably random and not indicative of homology.

PHI-BLAST is launched from the same page as PSI-BLAST. It combines matching of regular expressions with local alignments surrounding the match. For example, to align three lipocalins (RBP and two bacterial lipocalins), pick a small, conserved region and see which amino acid residues are used.
Create a pattern using the appropriate syntax:

GXW[YF][EA][IVLM]

The syntax for patterns in PHI-BLAST follows the conventions of PROSITE. When using the stand-alone program, it is permissible to have multiple patterns. When using the web page, only one pattern is allowed per query.
• [ ] means any one of the characters enclosed in the brackets, e.g., [LFYT] means one occurrence of L or F or Y or T;
• - means nothing (it is only a separator between pattern positions);
• x(5) means 5 positions in which any residue is allowed;
• x(2,4) means 2 to 4 positions where any residue is allowed.

Figure 5.4 Steps and results of a typical PHI-BLAST search
PHI-BLAST starts with a fixed pattern (profile/motif) and uses it to query the database for proteins containing this motif.

Using BLAST for Gene Discovery

You can use BLAST to find a “novel” gene, as summarised in the figure below:

Figure 5.5 Diagrammatic view of schema to search for novel genes using BLAST

5.2 A Case Study

Return to the NCBI Map Viewer to search the human genome for sequences similar to that of the red opsin.

Figure 5.6 Genomic overview of Red Opsin analogs in other organisms

Pick the following options from the various menus:
• Database: Protein (Search the database of protein sequences.)
• Program: blastp (Use the version of BLAST that compares protein sequences, unlike blastn, which compares nucleotide sequences.)
• Expect: 10 (The higher the number, the less stringent the matching, and the more hits you will get.)

Parameter settings enable users to optimise their BLAST search for each query sequence. The filter option enables the program to mask regions of a query sequence in order to exclude regions of low compositional complexity such as repetitive elements.
Both the blastn and blastp search tools offer fully gapped alignments, while blastx and tblastn have "in-frame" gapped alignments and the tblastx search tool provides only un-gapped alignments.

Scoring (substitution) matrices are used both to identify sequences in a database and to predict the biological significance of the match. PAM matrices are most sensitive for alignments of sequences with evolutionarily related homologs. The greater the number in the matrix name, the greater the expected evolutionary (mutational) distance. BLOSUM matrices are most sensitive for local alignment of related sequences and are therefore ideal when trying to identify an unknown nucleotide sequence (via a translated search such as blastx, since BLOSUM matrices score amino acids).

Retrieved information includes:
• a schematic distribution of alignments of the query sequence to those in the databases,
• a series of one-line descriptions of the database sequences which have significantly aligned to the query sequence,
• the actual sequence alignments, and
• a list of statistics specific to the BLAST search method.

Look down the page to the graphical display, a box containing lots of coloured lines. Each line represents a hit from the BLAST search.

Figure 5.7: MEGABLAST results of a query against the Human Genome

If you pass your mouse cursor over a red line, the narrow box just above the graphic gives a brief description of the hit. The first hit is your red opsin: the best match should be to the query sequence itself, and you got this sequence from that gene entry. The second hit is the green opsin: the PubMed entry reported that the red and green pigments are the most similar. The third and fourth hits are the blue opsin and the rod-cell pigment rhodopsin. Other hits have lower numbers of matching residues, and are colour coded according to their match score.
If you click on any of the coloured lines, you will skip down to more information about that hit, and you can see how much similarity each one has to the red opsin, your original query sequence. As you go down the list, each succeeding sequence has less in common with red opsin:

Sequences producing significant alignments:                           Score (bits)  E Value
ref|NP_000504.1|  opsin 1 (cone pigments), medium-wave-sensi...         729         0.0
ref|XP_301073.1|  similar to Red-sensitive opsin (Red cone p...         625         e-179
ref|NP_001699.1|  opsin 1 (cone pigments), short-wave-sensit...         298         5e-81
ref|NP_000530.1|  rhodopsin; rhodopsin (retinitis pigmentosa...         289         2e-78
ref|NP_055137.1|  opsin 3 (encephalopsin, panopsin); opsin 3...         146         2e-35
ref|NP_006574.1|  peropsin [Homo sapiens]                               142         3e-34
ref|NP_150598.1|  opsin 4 (melanopsin); melanopsin [Homo sap...         139         2e-33
ref|NP_005949.1|  melatonin receptor 1A; melatonin receptor ...          96         4e-20
ref|NP_001048.1|  tachykinin receptor 2; NK-2 receptor; Tach...          93         2e-19
ref|NP_000900.1|  neuropeptide Y receptor Y1; Neuropeptide Y...          89         3e-18
ref|NP_000903.1|  opioid receptor, kappa 1; Opiate receptor,...          87         1e-17
ref|NP_062874.1|  5-hydroxytryptamine receptor 7 isoform b; ...          87         2e-17
ref|NP_062873.1|  5-hydroxytryptamine receptor 7 isoform d; ...          87         2e-17
ref|NP_000863.1|  5-hydroxytryptamine receptor 7 isoform a; ...          87         2e-17
ref|NP_001049.1|  tachykinin receptor 1 isoform long; NK-1 r...          86         3e-17
ref|NP_001050.1|  tachykinin receptor 3; NK-3 receptor; neur...          86         5e-17
ref|NP_004215.1|  G protein-coupled receptor 50 [Homo sapiens]           85         8e-17
ref|XP_301490.1|  similar to odorant receptor MOR10 [Homo sa...          58         8e-09
…
ref|XP_063312.2|  similar to seven transmembrane helix recep...          58         8e-09
ref|NP_065110.1|  cysteinyl leukotriene receptor 2; cysteiny...          58         1e-08
ref|XP_301842.1|  similar to D(1B) dopamine receptor (D(5) d...          58         1e-08
ref|NP_005217.1|  endothelial differentiation, sphingolipid ...          57         1e-08
ref|XP_301795.1|  similar to D(1B) dopamine receptor (D(5) d...          57         1e-08
ref|NP_000857.1|  5-hydroxytryptamine (serotonin) receptor 1...          57         1e-08
ref|NP_115892.1|  G protein-coupled receptor 145; G protein-...          57         1e-08
ref|NP_003605.1|  galanin receptor 3; galanin receptor, fami...          57         2e-08
ref|NP_000856.1|  5-hydroxytryptamine (serotonin) receptor 1...          57         2e-08

The sequences are listed in order of increasing E (expect) value. The E value is the number of matches with a score at least as good as this one that would be expected to occur by chance in a search of this database. The lower the E value, the more significant the match.

The alignments are listed in order of most to least significant. Line descriptions are useful for identifying biologically interesting database matches and correlating this with the statistical significance of the alignment. Identifiers for the database sequences appear in the first column and are hyperlinked to the associated GenBank sequence record. The Score (bits) is a value attributed to the alignment; it is normalised so that it can be compared across searches that use different scoring matrices. The higher this value, the better the match.

Alignments found with the BLAST algorithms are gapped unless otherwise specified by the user on the main BLAST input page. Both the Score and Expect values are given. Additionally, the percent identity is given; this is the percentage of exact matches between your query sequence and the database sequence. The positive value is more relevant to protein alignments: it is the percentage of exact plus similar (based on properties) amino acid matches. The gap value is the percentage of the alignment that has been gapped in order to produce the alignment.

In this case, each sequence is shown in comparison with red opsin in a pairwise sequence alignment. (Later, you will make multiple sequence alignments from which you can discern relationships among genes.)
For example, the blue opsin is aligned with the red opsin as in the following:

>ref|NP_001699.1| opsin 1 (cone pigments), short-wave-sensitive (colour blindness, tritan); blue cone pigment [Homo sapiens]
Length = 348
Score = 298 bits (762), Expect = 5e-81
Identities = 145/339 (42%), Positives = 220/339 (64%), Gaps = 2/339 (0%)

Query: 23  TQSSIFTYTNSNSTRGPFEGPNYHIAPRWVYHLTSVWMIFVVTASVFTNGLVLAATMKFK 82
           ++ + + N +S GP++GP YHIAP W ++L + +M V N +VL AT+++K
Sbjct: 5   SEEEFYLFKNISSV-GPWDGPQYHIAPVWAFYLQAAFMGTVFLIGFPLNAMVLVATLRYK 63

Query: 83  KLRHPLNWILVNLAVADLAETVIASTISIVNQVSGYFVLGHPMCVLEGYTVSLCGITGLW 142
           KLR PLN+ILVN++ + + V +GYFV G +C LEG+ ++ G+ W
Sbjct: 64  KLRQPLNYILVNVSFGGFLLCIFSVFPVFVASCNGYFVFGRHVCALEGFLGTVAGLVTGW 123

Query: 143 SLAIISWERWLVVCKPFGNVRFDAKLAIVGIAFSWIWSAVWTAPPIFGWSRYWPHGLKTS 202
           SLA +++ER++V+CKPFGN RF +K A+ + +W + PP FGWSR+ P GL+ S
Sbjct: 124 SLAFLAFERYIVICKPFGNFRFSSKHALTVVLATWTIGIGVSIPPFFGWSRFIPEGLQCS 183

Query: 203 CGPDVFSGSSYPGVQSYMIVLMVTCCIIPLAIIMLCYLQVWLAIRAVAKQQKESESTQKA 262
           CGPD ++ + +SY L + C I+PL++I Y Q+ A++AVA QQ+ES +TQKA
Sbjct: 184 CGPDWYTVGTKYRSESYTWFLFIFCFIVPLSLICFSYTQLLRALKAVAAQQQESATTQKA 243

Query: 263 EKEVTRMVVVMIFAYCVCWGPYTFFACFAAANPGYAFHPLMAALPAYFAKSATIYNPVIY 322
           E+EV+RMVVVM+ ++CVC+ PY FA + N + + +P++F+KSA IYNP+IY
Sbjct: 244 EREVSRMVVVMVGSFCVCYVPYAAFAMYMVNNRNHGLDLRLVTIPSFFSKSACIYNPIIY 303

Query: 323 VFMNRQFRNCILQLF-GKKVDDGSELSSASKTEVSSVSS 360
           FMN+QF+ CI+++ GK + D S+ S+ KTEVS+VSS
Sbjct: 304 CFMNKQFQACIMKMVCGKAMTDESDTCSSQKTEVSTVSS 342

To figure out what the scores mean:
• Identities are residues that are identical in the hit and the query (red opsin) when the two are optimally aligned.
• Positives are residues that are very similar to each other. For example, look at the first aligned position: it is threonine in the red opsin and the very similar serine in the blue.
• Gaps are sometimes introduced into a hit to improve its alignment with the query.
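The Identities and Gaps figures explained above can be recomputed from any pair of aligned rows. The following is a minimal sketch, not BLAST's own code; the function name alignment_stats is hypothetical, and the two strings are a short stretch (query residues 32-49) copied from the red/blue opsin alignment above. Positives are not computed, because they depend on the substitution matrix in use.

```python
def alignment_stats(query, sbjct):
    """Count identities and gapped columns in two aligned rows of equal length."""
    if len(query) != len(sbjct):
        raise ValueError("aligned rows must have the same length")
    identities = sum(1 for q, s in zip(query, sbjct) if q == s and q != "-")
    gaps = sum(1 for q, s in zip(query, sbjct) if q == "-" or s == "-")
    return identities, gaps, len(query)


# Short stretch of the red opsin (query) and blue opsin (subject) alignment.
query = "NSNSTRGPFEGPNYHIAP"
sbjct = "NISSV-GPWDGPQYHIAP"

identities, gaps, length = alignment_stats(query, sbjct)
print(f"Identities = {identities}/{length} ({100 * identities / length:.0f}%)")
print(f"Gaps = {gaps}/{length}")
```

Run over the full alignment, the same counting would give the Identities and Gaps values that BLAST prints in the header of each hit.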
Note: blue opsin and rhodopsin are only about 45% identical to the red opsin.

Location of the Genes for the Hits in the Genome

We are interested in where all the genes for these hit proteins are in the human genome. Click the Genome View button just below the introductory information at the top of this result page.

You have come full circle. You are back at the human chromosome diagram, and all the hits of your search, in the colours that signify their BLAST scores, are located on the diagram. About 100 proteins (discovered so far) have 40% or more positives in alignment with red opsin. The opsins are members of the very large family of G protein-coupled receptors, key players in signal transduction.

1. Raw DNA sequences (other than RefSeq) in the EMBL and NCBI databases:
(a) Overlap entirely
(b) Overlap to a substantial degree but have distinct sequences
(c) Have relatively little overlap

2. Which of the following BLAST programs uses a signature of amino acids to find proteins within a family?
(a) PSI-BLAST
(b) PHI-BLAST
(c) MS BLAST
(d) Worm-BLAST

3. Which of the following steps is crucial to validating a sequence you believe to be that of a novel gene?
(a) Performing a PSI-BLAST search
(b) Checking the EST database to see where this gene might be expressed
(c) Checking LocusLink to see if other family members of this gene have been annotated
(d) BLAST searching your novel sequence against the appropriate database to evaluate whether anyone else has described your protein

Answers: 1. (a) 2. (b) 3. (d)

Go to http://blast.ncbi.nlm.nih.gov/Blast.cgi and search for a nucleotide and a protein using the search terms human and tbx-18, and then tbx-18 alone. Record the top 5 hits you obtain with each search parameter. Tabulate the E-values and p-values for the top 5 results.
Answers will be discussed in class.
Summary

The following key points are discussed in this unit:
• Advanced BLAST searching techniques and a summary of their applications
• BLAST tools for genomic searches
• BLAST tools for protein homology searches
• Using BLAST tools for gene discovery
• Gene discovery: a case study

STUDY UNIT 6
MULTIPLE SEQUENCE ALIGNMENT

Learning Outcomes

Upon completion of this unit, you will be able to:
1. Define MSA and explain why MSAs are necessary to glean information on homology
2. Discuss various methods used for MSA, their advantages and disadvantages
3. Evaluate the use of HMMs and probability-based models in sequence analysis
4. Differentiate between the various methods of creating MSAs in comparison with profile HMMs
5. Illustrate knowledge of the various alternative MSA algorithms
6. Discuss how genomic sequence analysis has impacted daily life, medicine, etc., post HGP
7. Critically evaluate new disciplines such as Metagenomics enabled by genome-wide sequence analysis

You can refer to Chapter 6 of the textbook.

Overview

This unit considers how relationships between multiple biological sequences are established. By introducing sequences into a multiple alignment, we can define members of a gene or protein family. If we know a feature of one of the proteins and identify the homologous proteins, we can predict that they may have similar function. Basic concepts and practical strategies of multiple sequence alignment are studied. Databases of multiple sequence alignments are introduced. Two main multiple sequence alignment programs are closely examined.
Chapter 6 Multiple Sequence Alignment (MSA)

6.1 MSA: Introduction and Methods

The goals of this unit are as follows:
• To define what a multiple sequence alignment is and how it is generated;
• To describe profile HMMs;
• To introduce databases of multiple sequence alignments;
• To introduce ways you can make your own multiple sequence alignments; and
• To show how a multiple sequence alignment provides the basis for phylogenetic trees.

Definition of multiple sequence alignment:
• A collection of three or more protein (or nucleic acid) sequences that are partially or completely aligned;
• Homologous residues are aligned in columns across the length of the sequences;
• Residues are homologous in an evolutionary sense; and
• Residues are homologous in a structural sense.

Properties of multiple sequence alignment:
• There is not necessarily one “correct” alignment of a protein family;
• Protein sequences evolve;
• The corresponding three-dimensional structures of proteins also evolve;
• It may be impossible to identify amino acid residues that align properly (structurally) throughout a multiple sequence alignment; and
• For two proteins sharing 30% amino acid identity, about 50% of the individual amino acids are superimposable in the two structures.

Features of multiple sequence alignment:
• Some aligned residues, such as cysteines that form disulfide bridges, may be highly conserved;
• There may be conserved motifs such as a transmembrane domain;
• There may be conserved secondary structure features; and
• There may be regions with consistent patterns of insertions or deletions (indels).

Uses of multiple sequence alignment:
• MSA is more sensitive than pairwise alignment in detecting homologs;
• BLAST output can take the form of an MSA, and can reveal conserved residues or motifs;
• Population data can be analysed in an MSA (PopSet);
• A single query can be searched against a database of MSAs; and
• Regulatory regions of genes may have consensus sequences identifiable by MSA.
Methods for Multiple Sequence Alignment

There are two main ways to make a multiple sequence alignment: progressive alignment and iterative approaches. We illustrate progressive alignment using ClustalW. In the example, two data sets are used: five distantly related lipocalins (human to E. coli), and five closely related RBPs.

Note: When you do this, obtain the sequences of interest in the FASTA format. (You can save them in a Word document.) Visit http://www2.ebi.ac.uk/clustalw/

Figure 6.1 Sequence input for a MSA using the Clustal W algorithm

Feng-Doolittle MSA occurs in 3 stages:
1. Do a set of global pairwise alignments (Needleman and Wunsch)
2. Create a guide tree
3. Progressively align the sequences

Stage 1 of 3: Generate global pairwise alignments

The number of pairwise alignments needed for N sequences is (N-1)(N)/2, e.g., for 5 sequences, (4)(5)/2 = 10.

For example, five distantly related lipocalins as follows:
[Figure: pairwise alignment scores for the five distantly related lipocalins]

Another example, five closely related lipocalins as follows:
[Figure: pairwise alignment scores for the five closely related lipocalins]

Stage 2 of 3: Generate guide tree
• Convert similarity scores to distance scores
• A tree shows the distance between objects
• Use UPGMA
• ClustalW provides a syntax to describe the tree
• A guide tree is not a phylogenetic tree
The guide tree is calculated from the distance matrix.
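Stages 1 and 2 above can be sketched as follows. This is a toy illustration under loud assumptions: the function names pair_count and mismatch_distance, the sequence labels and the short pre-aligned fragments are all hypothetical, and the distance used here (fraction of mismatched positions) is a simple stand-in, not the conversion that Feng-Doolittle or ClustalW actually applies to pairwise alignment scores.

```python
from itertools import combinations


def pair_count(n):
    """Number of pairwise alignments needed for n sequences: (n-1)(n)/2."""
    return (n - 1) * n // 2


def mismatch_distance(a, b):
    """Toy distance: fraction of positions that differ in two equal-length rows."""
    if len(a) != len(b):
        raise ValueError("sequences must already be aligned to the same length")
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches / len(a)


# Hypothetical, already-aligned fragments (illustration only).
seqs = {"seqA": "GTWYA", "seqB": "GLWYA", "seqC": "GRWYE"}

# One distance per unordered pair; together these form the distance matrix
# from which a guide tree (e.g., by UPGMA) would be calculated.
distances = {
    (n1, n2): mismatch_distance(seqs[n1], seqs[n2])
    for n1, n2 in combinations(seqs, 2)
}

print(pair_count(5))
for pair, d in distances.items():
    print(pair, d)
```

The closest pair in such a matrix is the one a progressive method would align first.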
For the first example:
[Figure: guide tree for the five distantly related lipocalins]

For the second example:

(
(
gi|5803139|ref|NP_006735.1|:0.04284,
(
gi|6174963|sp|Q00724|RETB_MOUS:0.00075,
gi|132407|sp|P04916|RETB_RAT:0.00423)
:0.10542)
:0.01900,
gi|89271|pir||A39486:0.01924,
gi|132403|sp|P18902|RETB_BOVIN:0.01902);

Stage 3 of 3: Progressive alignment
• Make an MSA based on the order in the guide tree
• Start with the two most closely related sequences
• Then add the next closest sequence
• Continue until all sequences are added to the MSA
• Rule: “once a gap, always a gap.”

For the first example, progressively align the sequences following the branch order of the tree, as follows:
[Figure: progressive alignment of the five distantly related lipocalins]

For the second example,

Figure 6.2 Results from a MSA using RBP and Lipocalins as query

Why “once a gap, always a gap”?
• There are many possible ways to make an MSA
• Where gaps are added is a critical question
• Gaps are often added to the first two (closest) sequences
• To change the initial gap choices later on would be to give more weight to distantly related sequences
• To maintain the initial gap choices is to trust that those gaps are most believable

Sum-of-Pairs function

How do we assess the quality of an MSA? Usually, the assumption is made that the alignment score is the sum of column scores. Therefore, we need a way to assign a score to each column and add the column scores up to get the alignment score. To score a column, we want a function with k arguments, where k is the number of sequences.

Sum-of-Pairs function: the scores of all pairwise alignments in each column are added together.
e.g., $\text{SP-score}(I, -, I, V) = p(I,-) + p(I,I) + p(I,V) + p(-,I) + p(-,V) + p(I,V)$
• $p(a,b)$ is the pairwise score of symbols a and b, specified by a substitution matrix
• $p(a,-)$ or $p(-,a)$ is a specified gap penalty
• $p(-,-) = 0$
• The SP-score is independent of the order of arguments, e.g., $\text{SP-score}(I, -, I, V) = \text{SP-score}(V, I, I, -)$
• $k(k-1)/2$ pairwise scores need to be added for k arguments
• The SP scoring system is widely used due to its simplicity and effectiveness

Multiple Dynamic Programming

Dynamic programming with two sequences:
• Relatively easy to code;
• Guaranteed to obtain the optimal alignment.

Can this be extended to multiple sequences? Consider the three amino acid sequences VSNS, SNA, AS. Put one sequence per axis (x, y, z): a 3-dimensional array is required.

Complexity of Multiple Dynamic Programming:
Given k sequences of length n each, compute a global multiple alignment with optimal SP-score via DP.
• Space complexity is $O(n^k)$
• Time complexity is $O(k^2 2^k n^k)$
  - Cells in the DP matrix: $O(n^k)$
  - Cells in the recurrence relation: $O(2^k)$
  - Evaluating the SP-score: $O(k^2)$

Conclusion: It’s going to take forever to align, say, 10 length-200 strings via Multiple DP. Since the complexity of the DP approach is exponential in the number of sequences, heuristic methods are usually used.

Star Alignment

Input:
• k sequences S1, …, Sk
• scoring scheme
Procedure:
• Pick one sequence as the “centre”, called Sc;
• For each Si ≠ Sc, determine an optimal pairwise alignment between Si and Sc;
• Aggregate the pairwise alignments.
Output:
• the MSA that results from the aggregate

How good is the solution of the star alignment method? Under the following conditions it is possible to derive a “Bounded Error Approximation”:
• Use distance d(x,y) instead of similarity s(x,y);
• Assume the “Triangle Inequality” holds: $\text{dist}(x,z) \le \text{dist}(x,y) + \text{dist}(y,z)$ for all characters x, y, z of the alphabet, including the space.

Theorem: The star alignment method outputs an MSA whose SP-score (measured as a distance) is within a factor of 2 of the optimal.
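The sum-of-pairs column score defined earlier in this section can be sketched as below. The scoring values (match +1, mismatch -1, gap -2) and the function names are assumptions chosen for illustration; a real implementation would take p(a,b) from a substitution matrix such as BLOSUM62.

```python
from itertools import combinations

GAP = "-"


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Toy p(a, b): fixed match/mismatch scores, a gap penalty, and p(-,-) = 0."""
    if a == GAP and b == GAP:
        return 0
    if a == GAP or b == GAP:
        return gap
    return match if a == b else mismatch


def sp_column(column):
    """Sum-of-pairs score of one column: all k(k-1)/2 pairwise scores added up."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


def sp_alignment(rows):
    """Alignment score as the sum of the column scores."""
    return sum(sp_column(col) for col in zip(*rows))


# The column (I, -, I, V) from the text, and the same symbols in another order.
print(sp_column(("I", "-", "I", "V")))
print(sp_column(("V", "I", "I", "-")))

# A small hypothetical three-row alignment.
print(sp_alignment(["AC-", "ACG", "A-G"]))
```

Because every unordered pair is scored exactly once, reordering the rows of a column does not change its SP-score.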
ClustalW

Most practical multiple alignments are made using the progressive alignment method. The alignment is constructed by adding one sequence at a time to a growing alignment. Progressive alignments are fast enough to allow hundreds or even thousands of sequences to be aligned. We use the program ClustalW to make a multiple sequence alignment and compare the sequences with each other.

The ClustalW algorithm is a fully automated method for global multiple sequence alignment of DNA and protein sequences. It attempts to optimise the weighted sum-of-pairs score with gap penalties. The algorithm assigns weights to the sequences and offers adjustable parameters with reasonable defaults. ClustalW can create multiple alignments and can also create phylogenetic trees. Briefly, the ClustalW program runs in three stages:

Stage 1 Pairwise Alignment
Compute pairwise alignments for each sequence against all other sequences and store the results in a similarity matrix. Convert the values in the sequence similarity matrix to distance measures which reflect the evolutionary distance between each pair of sequences.

Stage 2 Guide Tree
Construct a guide tree, which defines the order in which pairs of sequences are aligned and combined with previous alignments, using the sequence similarity matrix and a neighbour-joining algorithm.

Stage 3 Progressive Alignment
Align progressively following the guide tree. Start by aligning the most closely related pairs of sequences and, at each step, align two sequences or one sequence to an existing sub-alignment.

Figure 6.3 MSA steps and results using ClustalW (Image courtesy: Baxevanis and Ouellette, 2001)

Shortcomings of the Progressive Approach:
• No guarantee that the globally optimal solution will be found.
• “Once a gap, always a gap”: any mistakes (misaligned regions) made early in the alignment process cannot be corrected later as new information from other sequences is added.
These mistakes mainly result from an incorrect branching order in the initial tree. The initial phylogenetic trees are derived from a matrix of distances between separately aligned pairs of sequences and are much less reliable than trees from complete multiple alignments.
• When all the sequences are highly divergent (e.g., less than 25-30% identity between any pair), this progressive approach becomes much less reliable.
• Runtime is still high, although ClustalW has polynomial complexity.

6.2 Profile HMMs and Alternative MSA tools

Hidden Markov models (HMMs) consist of “states” that describe the probability of having a particular amino acid residue arranged in a column of a multiple sequence alignment. HMMs are probabilistic models. Just as a hammer is more refined than a blast, an HMM provides more sensitive alignment than traditional techniques using progressive alignments.

Hidden Markov Models (Access video via iStudyGuide)

An HMM is constructed from an MSA, for example, five lipocalins:

GTWYA (hs RBP)
GLWYA (mus RBP)
GRWYE (apoD)
GTWYE (E Coli)
GEWFS (MUP4)

Prob.    1     2     3     4     5
p(G)    1.0
p(T)          0.4
p(L)          0.2
p(R)          0.2
p(E)          0.2               0.4
p(W)                1.0
p(Y)                      0.8
p(F)                      0.2
p(A)                            0.4
p(S)                            0.2

P(GEWYE) = (1.0)(0.2)(1.0)(0.8)(0.4) = 0.064
log odds score = ln(1.0) + ln(0.2) + ln(1.0) + ln(0.8) + ln(0.4) = -2.75

Figure 6.4a & b Structure of a hidden Markov model (HMM)

HMMER: build a hidden Markov model

Determining effective sequence number … done. [4]
Weighting sequences heuristically ... done.
Constructing model architecture ... done.
Converting counts to probabilities ... done.
Setting model name, etc. ... done. [x]

Constructed a profile HMM (length 230)
Average score: 411.45 bits
Minimum score: 353.73 bits
Maximum score: 460.63 bits
Std. deviation: 52.58 bits

HMMER: calibrate a hidden Markov model

HMM file: lipocalins.hmm
Length distribution mean: 325
Length distribution s.d.: 200
Number of samples: 5000
random seed: 1034351005
histogram(s) saved to: [not saved]
POSIX threads: 2
--------------------------------
HMM    : x
mu     : -123.894508
lambda : 0.179608
max    : -79.334000

Figure 6.4 b

HMMER: search an HMM against GenBank

Match to a bacterial lipocalin:
[Figure: alignment of the HMM to a bacterial lipocalin]

HMMER: search an HMM against GenBank (continued)

Scores for complete sequences (score includes all domains):
Sequence                           Description       Score   E-value   N
--------                           -----------       -----   -------   ---
gi|3041715|sp|P27485|RETB_PIG      Plasma retinol-   614.2   1.6e-179  1
gi|89271|pir||A39486               plasma retinol-   613.9   1.9e-179  1
gi|20888903|ref|XP_129259.1|       (XM_129259) ret   608.8   6.8e-178  1
gi|132407|sp|P04916|RETB_RAT       Plasma retinol-   608.0   1.1e-177  1
gi|20548126|ref|XP_005907.5|       (XM_005907) sim   607.3   1.9e-177  1
gi|20141667|sp|P02753|RETB_HUMAN   Plasma retinol-   605.3   7.2e-177  1
gi|5803139|ref|NP_006735.1|        (NM_006744) ret   600.2   2.6e-175  1

Two Kinds of Multiple Sequence Alignment Resources

Databases of multiple sequence alignments:
• Text-based searches of CDD, Pfam (profile HMMs), PROSITE
• Database searches with a query sequence with BLAST, CDD, Pfam

Examples:
• BLOCKS (HMM)
• CDD (HMM)
• DOMO (Gapped MSA)
• INTERPRO (Integrative resources)
• iProClass (Integrative resources)
• MetaFAM (Integrative resources)
• Pfam (profile HMM library)
• PRINTS
• PRODOM (PSI-BLAST)
• PROSITE
• SMART

Multiple sequence alignment by manual input: PileUp, CLUSTAL W, CLUSTAL X

Examples:
• AMAS
• CINEMA
• ClustalW
• ClustalX
• DIALIGN
• HMMT
• Match-Box
• MultAlin
• MSA
• Musca
• PileUp
• SAGA
• T-COFFEE

MSA databases: manual vs. automated curation

Manual curation:
• Pfam
• PROSITE
• BLOCKS
• PRINTS

Automated curation:
• DOMO
• PRODOM
• MetaFam
Advantage (+): comprehensive. Disadvantage (-): alignment errors.

Strategy for Assessment of Alternative MSA Algorithms

Categories of multiple sequence alignment algorithms:

               Local      Global
Progressive    PIMA       CLUSTAL, PileUp
Iterative      DIALIGN    SAGA

Figure 6.5 Categories of Multiple Sequence Alignment algorithms

1. Create or obtain a database of protein sequences for which the 3D structure is known. Thus we can define “true” homologs using structural criteria.
2. Try making multiple sequence alignments with many different sets of proteins (very related, very distant, few gaps, many gaps, insertions, outliers).
3. Compare the answers.

BaliBase: comparison of multiple sequence alignment algorithms
• As percent identity among proteins drops, performance (accuracy) declines also. This is especially severe for proteins with < 25% identity:
  Proteins < 25% identity: 65% of residues align well
  Proteins < 40% identity: 80% of residues align well
• “Orphan” sequences are highly divergent members of a family. Surprisingly, orphans do not disrupt alignments. Also surprisingly, global alignment algorithms outperform local ones.
• Separate multiple sequence alignments can be combined (e.g., RBPs and lactoglobulins). Iterative algorithms (PRRP, SAGA) outperform progressive alignments (ClustalX).
• When proteins have large N-terminal or C-terminal extensions, local alignment algorithms are superior. PileUp (global) is an exception.

6.3 Ethical, Legal and Social Implications (ELSI) and Metagenomics

In this course, we have looked at the role of genomics and the use of genomic sequences in various studies involving evolution and functional analysis of the cell. We have looked at various techniques which have made genomic sequences an integral part of medicine and diagnosis.
In the following two audio files, we home in on one specific application (Metagenomics), which uses the sequence information from complex environments to generate a novel cure for an infection.

Metagenomics (Access video via iStudyGuide)

Additionally, we look at some of the questions being raised by the use of genomic data and the ethical, legal and social implications of incorporating such data in medicine and disease studies.

Ethical, Legal, Social Implications in Genomics (Access video via iStudyGuide)

1. Which of the following programs does not generate a multiple sequence alignment?
(a) PSI-BLAST
(b) ClustalW
(c) PileUp
(d) PHYLIP

2. Which of the following is not a database consisting primarily of hidden Markov models?
(a) Pfam
(b) PRINTS
(c) SMART
(d) TIGRFAMs

Answers: 1. (d) 2. (b)

Solve an HMM tutorial in class and understand why HMMs are popularly used in many applications such as speech recognition, sonar and genomics.
Answers will be discussed in class.

Carry out a short MSA in class of 25 bases to understand how the various algorithms compute and score deletions, gaps, etc., and create a neighbour-joining tree from your MSA result.
Answers will be discussed in class.

Do a Google search to learn more about Myriad Genetics and the saga of breast cancer patients depending on Myriad’s BRCA tests. Discuss your findings on this subject in class.

Summary

The following key points are discussed in this unit:
• Multiple sequence alignments: their features and uses
• Methods used for multiple sequence alignments (MSA)
• What are profile HMMs? How are they used in studying evolution?
• Strategy and alternative MSA methods
• Summary of MSA algorithms
• Ethical, Legal and Social Implications of using genomic data
• Introduction to Metagenomics