BME355
GENOMIC SEQUENCE ANALYSIS
COURSE GUIDE
STUDY GUIDE (5CU)
Course Development Team
Head of Programme: Assoc Prof Ooi Chui Ping
Course Developer(s): Dr. Lakshmi V. Madabusi, Dr. LIN Feng
Production: Educational Technology & Production Team
© 2020 Singapore University of Social Sciences. All rights reserved.
No part of this material may be reproduced in any form or by any means
without permission in writing from the Educational Technology &
Production, Singapore University of Social Sciences.
Educational Technology & Production
Singapore University of Social Sciences
461 Clementi Road
Singapore 599491
Release V1.3
CONTENTS
COURSE GUIDE
1. Welcome
2. Course Description and Aims
3. Learning Outcomes
4. Learning Material
5. Assessment Overview
6. Course Schedule
7. Learning Mode
STUDY UNIT 1
INTRODUCTION TO MOLECULAR GENETICS
Learning Outcomes
Overview
Chapter 1 Introduction to Molecular Genetics
Summary
References
STUDY UNIT 2
MOLECULAR GENETICS AND DATABASES
Learning Outcomes
Overview
Chapter 2 Molecular Genetics and Databases
Summary
STUDY UNIT 3
PAIRWISE SEQUENCE ALIGNMENT
Learning Outcomes
Overview
Chapter 3 Pairwise Sequence Alignment
Summary
STUDY UNIT 4
DATABASE SEARCHING WITH BLAST AND FASTA
Learning Outcomes
Overview
Chapter 4 Database Searching with BLAST and FASTA
Summary
STUDY UNIT 5
ADVANCED BLAST SEARCHING
Learning Outcomes
Overview
Chapter 5 Advanced BLAST Searching
Summary
STUDY UNIT 6
MULTIPLE SEQUENCE ALIGNMENT
Learning Outcomes
Overview
Chapter 6 Multiple Sequence Alignment (MSA)
Summary
COURSE GUIDE
BME355 COURSE GUIDE
1. Welcome
(Access video via iStudyGuide)
Welcome to the course BME355 Genomic Sequence Analysis, a 5 credit unit (CU) course.
This Study Guide will be your personal learning resource to take you through the
course learning journey. The guide is divided into two main sections – the Course
Guide and Study Units.
The Course Guide describes the structure for the entire course and provides you with
an overview of the Study Units. It serves as a roadmap of the different learning
components within the course. This Course Guide contains important information
regarding the course learning outcomes, learning materials and resources, assessment
breakdown and additional course information.
2. Course Description and Aims
BME355 Genomic Sequence Analysis and BME356 Functional Genomics form the
bioinformatics specialization in your BSBE degree program. The two courses are
delivered consecutively in the semester and students are encouraged to complete both
within the same semester.
The publication of the draft human genome sequence in 2001, the development of
high-throughput genomic sequencing and the reduced costs of computation have acted as
catalysts for the application of genomic sequencing information in medicine and
other fields.
The overall aim of the course is to understand and evaluate how genomic sequences
are utilized in research and medicine. Topics include introduction to molecular
genetics and genomic databases, computational methods and algorithms for
analysing and disseminating genomic information.
Course Structure
This course is a 5-credit unit course presented over 6 weeks.
There are six Study Units in this course. The following provides an overview of each
Study Unit.
Study Unit 1 – Introduction to Molecular Genetics
This unit introduces the background and coverage of computational biology, and
bioinformatics in general. It defines the disciplines of bioinformatics, genomics and
functional genomics. Three perspectives are used to summarise the subject of
bioinformatics: the cell, the organism, and the tree of life. The unit also covers the
Central Dogma of molecular biology, cellular processes, the genetic code, model
organisms, and the coding vs. non-coding paradox. A consistent example of a gene and
its corresponding protein product, retinol-binding protein, is introduced to illustrate
the use of various computational tools throughout the course.
Study Unit 2 – Molecular Genetics and Databases
This unit introduces the main concepts of cellular biology and molecular biology, as
well as biological databases for computational biology. It discusses how these
databases are organized to store data and what strategies are used to extract
information from them. Three publicly accessible databases store large amounts of
nucleotide and protein sequence data: GenBank at the National Center for
Biotechnology Information (NCBI), the DNA Database of Japan (DDBJ), and the EMBL
database at the European Bioinformatics Institute (EBI). Five ways to access DNA and
protein sequences are studied and demonstrated with examples.
Study Unit 3 – Pairwise Sequence Alignment
This unit evaluates the methods for analysing the relatedness of genes and proteins,
with focus on the pairwise sequence alignment algorithms. We adopt an evolutionary
perspective in our description of how amino acids (or nucleotides) in two sequences
can be aligned and compared. We then describe the algorithms and programs for
global (Needleman-Wunsch) and local (Smith-Waterman) pairwise alignment.
Study Unit 4 – Database Searching with BLAST and FASTA
This unit introduces the Basic Local Alignment Search Tool (BLAST) which is the main
NCBI tool for comparing a query sequence to other sequences in various databases, as
well as FASTA. BLAST and FASTA are rapid, heuristic versions of pairwise
alignment algorithms. The steps of the BLAST and FASTA search processes are described.
Strategies applied for BLAST database searching are discussed with examples.
Study Unit 5 – Advanced BLAST Searching
BLAST searches can be very versatile. This unit further explores the advanced BLAST
searching techniques. We begin with an overview of the specialized BLAST resources and
websites. We then focus on finding distantly related proteins with Position-Specific Iterated
BLAST (PSI-BLAST) and significant pattern matches with Pattern-Hit Initiated BLAST
(PHI-BLAST). Finally, using BLAST for gene discovery is illustrated.
Study Unit 6 – Multiple Sequence Alignment
This unit considers establishment of relationship between multiple biological
sequences. By introducing sequences into a multiple alignment, we can define
members of a gene or protein family. If we know a feature of one of the proteins and
identify the homologous proteins, we can predict that they may have similar function.
Basic concepts and practical strategies of multiple sequence alignment are studied.
Databases of multiple sequence alignments are introduced. Two main multiple
sequence alignment programs are closely examined.
3. Learning Outcomes
Knowledge & Understanding (Theory Component)
By the end of this course, you should be able to:
• Demonstrate competence in the basic concepts of molecular genetics and
  computational biology.
• Discuss the genomic sequence organization and select specific genomic
  sequence data using GenBank, Ensembl, etc.
• Examine the various scoring matrices used for protein/DNA alignment and
  evaluate global vs. local sequence alignment tools used in studying evolution.
• Assemble the target sequences in genomic databases using the homology
  score matrices and the heuristic search tools.
• Evaluate various types of multiple sequence alignment algorithms and global
  genomic analysis tools to formulate a solution for a research problem using
  these tools.
Key Skills (Practical Component)
By the end of this course, you should be able to:
• Solve problems using multiple genomic tools (online) introduced in this
  course to answer a complex research question.
4. Learning Material
The following is a list of the required learning materials to complete this course.
Required Textbook(s)
Jonathan Pevsner, Bioinformatics and Functional Genomics (2009). John Wiley & Sons,
Inc.
5. Assessment Overview
The overall assessment weighting for this course is as follows:
Assessment      Description           Weight Allocation
Assignment 1    Online Quiz (OCAS)    15%
Assignment 2    Online Quiz (OCAS)    15%
Examination     ECA                   70%
TOTAL                                 100%
The following section provides important information regarding Assessments.
Continuous Assessment:
Assignments 1 and 2 are online quizzes weighted at 15% each, for a total of 30%. Together
they constitute 100% of the OCAS component.
Examination:
The final examination is an End-of-Course Assessment (ECA) and is 100% of this
component.
Passing Mark:
To pass the course you need to achieve scores of 40% in each component. Your overall
rank score is the weighted average of all components.
For detailed information on the Course grading policy, please refer to The Student
Handbook (‘Award of Grades’ section under Assessment and Examination
Regulations). The Student Handbook is available from the Student Portal.
Non-graded Learning Activities:
Activities for the purpose of self-learning are present in each study unit. These
learning activities are meant to enable you to assess your understanding and
achievement of the learning outcomes. The type of activities can be in the form of Quiz,
Review Questions, Application-Based Questions or similar. You are expected to
complete the suggested activities either independently and/or in groups as required
for each individual activity.
6. Course Schedule
To help monitor your study progress, you should pay special attention to your Course
Schedule. It contains study unit related activities including Assignments,
Self-assessments, and Examinations. Please refer to the Course Timetable in the Student
Portal for the updated Course Schedule.
Note: You should always make it a point to check the Student Portal for any
announcements and latest updates.
7. Learning Mode
The learning process for this course is structured along the following lines of learning:
(a) Self-study guided by the study guide units. Independent study will require at
least 3 hours per week.
(b) Working on assignments, either individually or in groups.
(c) Face to Face/ Online sessions (3 hours each session, 6 sessions in total).
iStudyGuide
You may be viewing the iStudyGuide version, which is the mobile version of the
Study Guide. The iStudyGuide is developed to enhance your learning experience with
interactive learning activities and engaging multimedia. Depending on the reader you
are using to view the iStudyGuide, you will be able to personalise your learning with
digital bookmarks, notes and highlighted sections of the guide.
Interaction with Instructor and Fellow Students
Although flexible learning – learning at your own pace, space and time – is a hallmark
at SUSS, you are encouraged to engage your instructor and fellow students in online
discussion forums. Sharing of ideas through meaningful debates will help broaden
your learning and crystallize your thinking.
Academic Integrity
As a student of SUSS, it is expected that you adhere to the academic standards
stipulated in The Student Handbook, which contains important information
regarding academic policies, academic integrity and course administration. It is
necessary that you read and understand the information stipulated in the Student
Handbook, prior to embarking on the course.
STUDY UNIT 1
INTRODUCTION TO
MOLECULAR GENETICS
BME355 STUDY UNIT 1
Learning Outcomes
By the end of this unit, you should be able to:
1. Develop a working knowledge of the terms involved in molecular
biology and computational biology
2. Differentiate between regulation at the organism level vs. cellular and
gene level
3. Illustrate uses of genetic sequence information in studies of evolution
4. Demonstrate knowledge of the cell components, cell cycle, and basics
of molecular genetics and molecular biology
5. Contrast between genomic tools pre vs. post Human Genome Project
(HGP) era
6. Use key molecular biology tools (PCR, sequencing, others) to propose
solutions to a molecular biology research problem
7. Compile a list of the main databases and internet resources used for
genomic sequence analysis
You can refer to Chapter 1 of the textbook.
All the figures and tables in this unit are from the recommended textbook by
Jonathan Pevsner.
Overview
This unit introduces the background and coverage of computational biology,
and bioinformatics in general. It defines the disciplines of bioinformatics,
genomics and functional genomics. Three perspectives are used to summarise the
subject of bioinformatics: the cell, the organism, and the tree of life. The unit
also covers the Central Dogma of molecular biology, cellular processes, the
genetic code, model organisms, and the coding vs. non-coding paradox. A
consistent example of a gene and its corresponding protein product,
retinol-binding protein, is introduced to illustrate the use of various
computational tools throughout the course.
Chapter 1 Introduction to Molecular Genetics
1.1 Definition
Bioinformatics focuses on the use of computer databases and computer
algorithms to analyse proteins, genes, and the complete collection of
deoxyribonucleic acid (DNA) that comprises an organism (the genome). A
major challenge in biology is to make sense of the enormous quantities of
sequence data and structural data that are generated by genome-sequencing
projects, proteomics, and other large-scale molecular biology efforts. The tools
of bioinformatics include computer programs that help to reveal fundamental
mechanisms underlying biological problems related to the structure and
function of macromolecules, biochemical pathways, disease processes, and
evolution.
According to a National Institutes of Health (NIH) definition, bioinformatics is:
The research, development, or application of computational tools and
approaches for expanding the use of biological, medical, behavioural or
health data, including those to acquire, store, organise, analyse, or visualise
such data.
Such a definition is also used for the closely related discipline of
computational biology, although more accurately, computational biology is the
development and application of data-analytical and theoretical methods,
mathematical modelling and computational simulation techniques to the
study of biological, behavioural, and social systems.
In many textbooks and references, the two terms “bioinformatics” and
“computational biology” are used interchangeably.
Genes, Genomes and Genomics
(Access video via iStudyGuide)
1.2 Introduction and Molecular Genetics
We can summarise the entire field of bioinformatics with three perspectives.
The first perspective on bioinformatics is the cell (depicted in Figure 1.1).
The central dogma of molecular biology is that DNA is transcribed
into RNA, which is then translated into protein. The focus of molecular biology has
been on individual genes, messenger RNA (mRNA) transcripts, and proteins.
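The two steps of the central dogma can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the DNA string is an invented toy sequence rather than a real gene, the function names are hypothetical, and translation simply starts at the first base and stops at the first stop codon.

```python
# Minimal sketch of the central dogma: DNA -> mRNA -> protein.
# The toy sequence and the function names are illustrative assumptions.

# Standard genetic code, laid out in the conventional U, C, A, G codon order.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    first + second + third: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, first in enumerate(BASES)
    for j, second in enumerate(BASES)
    for k, third in enumerate(BASES)
}


def transcribe(coding_dna):
    """Return the mRNA matching a coding-strand DNA sequence (T becomes U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """Translate mRNA codon by codon until a stop codon ('*') or the end."""
    protein = []
    for pos in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[pos:pos + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


toy_dna = "ATGAAATGGGTGTAA"  # invented example: ATG AAA TGG GTG TAA
mrna = transcribe(toy_dna)
print(mrna)             # AUGAAAUGGGUGUAA
print(translate(mrna))  # MKWV
```

Real transcripts need more care (finding the correct reading frame, handling introns and ambiguous bases), which is why dedicated bioinformatics tools are used in practice.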
A focus of the field of bioinformatics is the complete collection of DNA (the
genome), RNA (the transcriptome), and protein sequences (the proteome) that
have been amassed (Henikoff, 2002). These millions of molecular sequences
present both great opportunities and great challenges. A bioinformatics
approach to molecular sequence data involves the application of computer
algorithms and computer databases to molecular and cellular biology. Such
an approach is sometimes referred to as functional genomics.
This typifies the essential nature of bioinformatics: biological questions can be
approached from levels ranging from single genes and proteins to cellular
pathways and networks or even whole genomic responses (Ideker et al., 2001).
Our goals are to understand how to study both individual genes and proteins
and collections of thousands of genes/proteins.
Figure 1.1 The first perspective of the field of bioinformatics is the cell
Bioinformatics has emerged as a discipline as biology has become
transformed by the emergence of molecular sequence data. Databases such as
the European Molecular Biology Laboratory (EMBL), GenBank, and the DNA
Database of Japan (DDBJ) serve as repositories for billions of nucleotides of DNA
sequence data (see Chapter 2). Corresponding databases of expressed genes (RNA)
and protein have been established. A main focus of the field of bioinformatics is to
study molecular sequence data to gain insight into a broad range of biological
problems.
From the cell we can focus on individual organisms which represent the
second perspective of the field of bioinformatics. Each multicellular organism
changes during different stages of its development as its body specialises into
various organs with specific functions. For example, while we may sometimes
think of genes as static entities that specify features such as eye colour or
height, they are in fact dynamically regulated across time and region and in
response to physiological conditions. Gene expression varies in disease states
or in response to a variety of signals, both intrinsic and environmental. Many
bioinformatics tools are available to study the broad biological questions
relevant to the individual: There are many databases of expressed genes and
proteins derived from different tissues and conditions. One of the most
powerful applications of functional genomics is the use of DNA microarrays
to measure the expression of thousands of genes in biological samples.
Two tools of note that have changed the way genomic information could be
integrated into medicine are Polymerase Chain Reaction (PCR) and Next
Generation Sequencing (NGS). Before we embark on learning more on NGS,
the following multimedia will introduce you to basic sequencing techniques
and basic requirements for PCR.
Find Out More
Polymerase Chain Reaction:
https://www.youtube.com/watch?v=eEcy9k_KsDI
DNA sequencing: https://www.youtube.com/watch?v=bEFLBf5WEtc
Figure 1.2 The second perspective of bioinformatics is the organism (figure labels:
time of development; body region; physiology, pharmacology, pathology)
Broadening our view from the level of the cell to the organism, we can
consider the individual’s genome (collection of genes), including the genes
that are expressed as RNA transcripts and the protein products. Thus, for an
individual organism bioinformatics tools can be applied to describe changes
through developmental time, changes across body regions, and changes in a
variety of physiological or pathological states.
At the largest scale is the tree of life (figure below). There are many millions
of species alive today, and they can be grouped into the three major branches
of bacteria, archaea (single-celled microbes that tend to live in extreme
environments), and eukaryotes. Molecular sequence databases currently hold
DNA sequence from over 100,000 different organisms. The complete genome
sequences of several hundred organisms will soon become available. One of
the main lessons we are learning is the fundamental unity of life at the
molecular level. We are also coming to appreciate the power of comparative
genomics, in which genomes are compared.
Figure 1.3 The third perspective of the field of bioinformatics
is represented by the tree of life
The scope of bioinformatics includes all of life on Earth, including the three
major branches of bacteria, archaea, and eukaryotes. Viruses, which exist on
the borderline of the definition of life, are not depicted here. For all species,
the collection and analysis of molecular sequence data allow us to describe
the complete collection of DNA that comprises each organism (the genome).
We can further learn the variations that occur between species and among
members of a species, and we can deduce the evolutionary history of life on
Earth. (After Pace, 1997) Used with permission.
A Consistent Example: Retinol-Binding Protein
Throughout the textbook we will focus on the example of a gene and its
corresponding protein product: retinol-binding protein (RBP4), a small,
abundant secreted protein that binds retinol (vitamin A) in blood (Newcomer
and Ong, 2000). Retinol, obtained from carrots in the form of vitamin A, is
very hydrophobic. RBP4 helps transport this ligand to the eye where it is used
for vision. We will study RBP4 in detail because it has a number of interesting
features:
• There are many proteins that are homologous to RBP4 in a variety of
  species, including human, mouse, and fish (orthologs). We will use
  these as examples of how to align proteins, perform database
  searches, and study phylogeny.
• There are other human proteins that are closely related to RBP4
  (paralogs). Altogether the family that includes RBP4 is called the
  lipocalins, a diverse group of small ligand-binding proteins that tend
  to be secreted into extracellular spaces (Akerstrom et al., 2000; Flower
  et al., 2000). Other lipocalins with fascinating functions include
  apolipoprotein D (which binds cholesterol), a pregnancy-associated
  lipocalin, aphrodisin (an “aphrodisiac” in hamsters), and an
  odorant-binding protein in mucus.
• There are even bacterial lipocalins, which could have a role in
  antibiotic resistance (Bishop, 2000). We will explore how bacterial
  lipocalins could be ancient genes that entered eukaryotic genomes by
  a process called lateral gene transfer.
• The gene expression levels of some lipocalins are dramatically
  regulated.
• Because the lipocalins are small, abundant, and soluble proteins, their
  biochemical properties have been characterised in detail. The
  three-dimensional protein structure has been solved for several of them by
  X-ray crystallography.
• Some lipocalins have been implicated in human disease.
Another molecule we will introduce is the pol (polymerase) gene of human
immunodeficiency virus 1 (HIV-1). HIV presents one of the greatest public
health challenges in the world today. Over 42 million people are infected as
of the end of the year 2002 and over 16 million people have died. The HIV-1
genome contains just nine genes, including pol (Frankel and Young, 1998).
We will examine pol throughout the book because the properties of this gene,
its protein products, and the HIV-1 genome are distinct from those of the lipocalins:
• The pol gene encodes a multidomain protein: a single polypeptide with
  several structurally and functionally distinct domains. The encoded
  protein is 1003 amino acids long and has reverse transcriptase
  activity (that is, it is an RNA-dependent DNA polymerase). It is also an
  aspartyl protease, and it has integrase activity. These multiple
  activities are typical of multidomain proteins.
• The modular nature of the pol protein affects our ability to perform
  database searches and multiple sequence alignments.
• The pol gene incorporates substitutions extremely rapidly. A typical
  individual infected by HIV may have over a million variants of pol.
  The study of the evolution of pol complements our study of the
  lipocalins.
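To make the idea of sequence variants concrete, the sketch below counts substitutions between two already-aligned sequences of equal length. The two short sequences are invented for illustration and are not real pol data, and the function name is a hypothetical helper; comparing real variants first requires the alignment methods covered in later units.

```python
# Sketch: count substitutions between two aligned, equal-length sequences.
# The sequences below are invented toy data, not real HIV-1 pol sequences.

def count_substitutions(seq_a, seq_b):
    """Return the number of positions at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for base_a, base_b in zip(seq_a, seq_b) if base_a != base_b)


reference = "ATGGCCATTGTAATG"
variant = "ATGGCTATTGTGATG"

differences = count_substitutions(reference, variant)
print(differences)                   # 2
print(differences / len(reference))  # fraction of sites that differ
```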
Because pol is a viral protein, studying it gives us the opportunity to learn how
to access bioinformatics resources relevant to viruses. Database
searches with pol will also help emphasise how to restrict searches to particular
domains of the tree of life.
Model organisms:
Much of the knowledge we have acquired in biology over the decades has
come from experimentation in a small set of model organisms. Their shorter
lifespans and the ease of manipulating their genetic material have
allowed experiments that would be unethical in human subjects. By using these
biological models we are able to glean information on how the human cell
itself is regulated.
Model organisms in Genomics
(Access video via iStudyGuide)
Web Exercises
Often, students of bioinformatics have a particular research area of interest
such as a gene, a physiological process, a disease, or a genome. It is hoped that
by studying RBP4 and other specific proteins and genes throughout this book,
students can simultaneously apply the principles of bioinformatics to their
own research questions.
It has been helpful to complement lectures with computer labs. All the
websites described in the textbook are freely available on the World Wide
Web, and many of the software packages are free for academic use.
Another feature of the course is that each student is required to discover a
novel gene by the last day of the course. The student must begin with any
protein sequence of interest and perform database searches to identify
genomic DNA that encodes a protein no one has described before. This
problem is described in Chapter 5. The student thus chooses the name of the
gene and its corresponding protein and describes information about the
organism and evidence that the gene has not been described before.
Then, the student creates a multiple sequence alignment of the new protein
(or gene) and creates a phylogenetic tree showing its relation to other known
sequences. A benefit of this exercise is that it requires a student to actively use
the principles of bioinformatics. Most students choose a gene (or protein)
relevant to their own research area, while others find new lipocalins.
1.3 Key Bioinformatics Websites
The field of bioinformatics relies heavily on the Internet as a place to access
sequence data, to access software that is useful to analyse molecular data, and
as a place to integrate different kinds of resources and information relevant to
biology.
We will describe a variety of websites. Initially, we will focus on the three
main publicly accessible databases that serve as repositories for DNA and
protein data (Table 1.1). In Chapter 2, we begin with the National Center for
Biotechnology Information (NCBI), which hosts GenBank. The NCBI website
offers a variety of other bioinformatics-related tools. We will gradually
introduce the European Bioinformatics Institute (EBI) web server, which hosts
a complementary DNA database (EMBL, the European Molecular Biology
Laboratory database). We will also introduce the DNA Database of Japan
(DDBJ). The research teams at GenBank, EMBL, and DDBJ share sequence
data on a daily basis. A general theme of the discipline of bioinformatics is
that many databases are closely interconnected.
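Besides their web pages, these repositories can be queried by programs. As a hedged sketch, the code below builds a request URL for NCBI's E-utilities `efetch` service and parses a FASTA record. The helper names are hypothetical, the accession `NP_006735` is an assumed example for human RBP4 that should be checked against NCBI before use, and the FASTA text is an invented stand-in so that the example runs without network access.

```python
# Sketch: build an NCBI E-utilities efetch URL and parse a FASTA record.
# Helper names, the accession, and the FASTA text are illustrative assumptions.
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, database="protein"):
    """Return a URL that requests one record in plain-text FASTA format."""
    params = {
        "db": database,
        "id": accession,
        "rettype": "fasta",
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urlencode(params)


def parse_fasta(text):
    """Return (header, sequence) for the first record in FASTA-formatted text."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("text does not start with a FASTA header line")
    sequence_lines = []
    for line in lines[1:]:
        if line.startswith(">"):  # stop at the start of the next record
            break
        sequence_lines.append(line)
    return lines[0][1:], "".join(sequence_lines)


# Assumed accession for human RBP4; verify it at NCBI before relying on it.
print(build_efetch_url("NP_006735"))

# Invented record standing in for a downloaded response.
toy_fasta = ">toy_record example protein\nMKWVAAL\nLLLAAW\n"
header, sequence = parse_fasta(toy_fasta)
print(header)
print(sequence)
```

To retrieve a real record, the URL could be opened with `urllib.request.urlopen` and the decoded response passed to `parse_fasta`; NCBI asks automated clients to follow its usage guidelines.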
Table 1.1 Three Primary Bioinformatics Web Servers That Serve as Centralised Repositories
for DNA and Protein Sequence Data
Throughout the course we will introduce many websites that are relevant to
bioinformatics. Table 1.2 lists several additional servers that offer databases
as well as many programs for the analysis of biological sequences.
Table 1.2 Additional Bioinformatics Web Servers
Table 1.3 lists several additional sites that offer links to bioinformatics
resources. We present them now for those who wish to explore the types of
bioinformatics resources that are currently available.
Table 1.3 Bioinformatics Sites with Useful Links
Overviews of the field of bioinformatics have been written by Mark Gerstein
and colleagues (Luscombe et al., 2001) and Claverie et al. (2001). Kaminski
(2000) also introduces bioinformatics, with practical suggestions of websites
to visit. Russ Altman (1998) discusses the relevance of bioinformatics to
medicine, while David Searls (2000) introduces bioinformatics tools for the
study of genomes.
Read the following material and discuss the answers with your lecturer
during class:
http://en.wikipedia.org/wiki/Translation_(biology)
a. Which enzyme attaches amino acids to their corresponding tRNA?
b. Can you name the triplet codons coding for a stop codon? What does the
ribosome do when it reaches a stop codon?
c. Where does translation occur in the cell?
Find Out More
Self-learn sessions (to be completed before the first lecture):
DNA and RNA: History and Structure
https://www.youtube.com/watch?v=qoERVSWKmGk&index=404&list=UUEik-U3T6u6JA0XiHLbNbOw
Central Dogma, Replication, Transcription and Translation
https://www.youtube.com/watch?v=W4mYwsr9gGE&list=UUEik-U3T6u6JA0XiHLbNbOw&index=403
Cell cycle: Mitosis and Meiosis
https://www.youtube.com/watch?v=2aVnN4RePyI&list=UUEik-U3T6u6JA0XiHLbNbOw
Polymerase Chain Reaction:
https://www.youtube.com/watch?v=eEcy9k_KsDI
DNA sequencing:
https://www.youtube.com/watch?v=bEFLBf5WEtc
Summary
The following key points are discussed in this unit:
• Cellular components and how they are regulated
• Introduction to PCR and DNA sequencing
• The role of model organisms and their use in genomics research
• A brief summary of the key bioinformatics websites and datasets
References
Akerstrom, B., Flower, D. R., and Salier, J. P. Lipocalins: Unity in diversity.
Biochim. Biophys. Acta 1482, 1–8 (2000).
Altman, R. B. Bioinformatics in support of molecular medicine. Proc. AMIA
Symp., 53–61 (1998).
Bishop, R. E. The bacterial lipocalins. Biochim. Biophys. Acta 1482, 73–83 (2000).
Boguski, M. S. Bioinformatics. Curr. Opin. Genet. Dev. 4, 383–388 (1994).
Claverie, J. M., Abergel, C., Audic, S., and Ogata, H. Recent advances in
computational genomics. Pharmacogenomics 2, 361–372 (2001).
Flower, D. R., North, A. C., and Sansom, C. E. The lipocalin protein family:
Structural and sequence overview. Biochim. Biophys. Acta 1482, 9–24
(2000).
Frankel, A. D., and Young, J. A. HIV-1: Fifteen proteins and an RNA. Annu.
Rev. Biochem. 67, 1–25 (1998).
Goodman, N. Biological data becomes computer literate: New advances in
bioinformatics. Curr. Opin. Biotechnol. 13, 68–71 (2002).
Henikoff, S. Beyond the central dogma. Bioinformatics 18, 223–225 (2002).
Ideker, T., Galitski, T., and Hood, L. A new approach to decoding life: Systems
biology. Annu. Rev. Genomics Hum. Genet. 2, 343–372 (2001).
Kaminski, N. Bioinformatics. A user’s perspective. Am J. Respir. Cell Mol. Biol.
23, 705–711 (2000).
Luscombe, N. M., Greenbaum, D., and Gerstein, M. What is bioinformatics?
A proposed definition and overview of the field. Methods Inf. Med. 40,
346–358 (2001).
Newcomer, M. E., and Ong, D. E. Plasma retinol binding protein: Structure
and function of the prototypic lipocalin. Biochim. Biophys. Acta 1482, 57–
64 (2000).
Pace, N. R. A molecular view of microbial diversity and the biosphere. Science
276, 734–740 (1997).
Searls, D. B. Bioinformatics tools for whole genomes. Annu. Rev. Genomics
Hum. Genet. 1, 251–279 (2000).
STUDY UNIT 2
MOLECULAR GENETICS AND
DATABASE
BME355 STUDY UNIT 2
Learning Outcomes
Upon completion of this unit, you will be able to:
1. Indicate knowledge of the Central Dogma and its applicability post
Human Genome project
2. Differentiate between gene regulation in Prokaryotic and Eukaryotic
cells
3. Differentiate between coding and non-coding regulatory elements in
the cell
4. Discuss in vitro and in vivo analyses and the requirements for such
experiments in cell biology
5. Use genetic database tools to locate DNA/ Protein sequences
6. Illustrate ability to work with GenBank sequences
7. Differentiate between various types of mutations and database
resources for studying them
8. Illustrate the role of mutations in disease diagnosis and prognosis
9. Practise literature searches using Medline using different search
strategies
10. Analyse data from disease related databases to glean information on
mutations involved, taxonomy, structure and population related
information
11. Illustrate knowledge of the most current Next Generation Sequence
analysis tools and instrumentation
You can refer to Chapter 2 of the textbook
Overview
This unit introduces the main concepts of cellular biology and molecular
biology, as well as biological databases for computational biology. It discusses
how these databases are organized to store data and what strategies are used
to extract information from them. Three publicly accessible databases store
large amounts of nucleotide and protein sequence data: GenBank at the
National Center for Biotechnology Information (NCBI), the DNA Database of
Japan (DDBJ), and the EMBL nucleotide database at the European
Bioinformatics Institute (EBI). Five ways to access DNA and protein sequences
are studied and demonstrated by examples.
Chapter 2 Molecular Genetics and Databases
2.1 Cellular Biology and Molecular Genetics
All living organisms are characterised by the capacity to reproduce and
evolve. The genome of an organism is defined as the collection of DNA within
that organism, including the set of genes that encode proteins.
Prokaryotic & Eukaryotic Cells
Cells are the structural and functional unit of all living organisms. Some
organisms, such as bacteria, are unicellular, and they consist of a single cell.
Other organisms, such as humans, are multicellular. Basically, there are two
general categories of cells: prokaryotes and eukaryotes.
The simplest cells are prokaryotic cells, which lack a nuclear
membrane. Bacteria are the most studied form of prokaryotic organisms.
Prokaryotes are unicellular organisms that do not develop or differentiate into
multicellular forms. Prokaryotic cells have three architectural regions:
appendages called flagella and pili (proteins attached to the cell surface); a cell
envelope consisting of a capsule, a cell wall and a plasma membrane; and a
cytoplasmic region that contains the cell genome and ribosomes and various
sorts of inclusions.
Eukaryotes include fungi, animals, and plants as well as some unicellular
organisms. The major and extremely significant difference between
prokaryotes and eukaryotes is that eukaryotic cells contain membrane-bounded
compartments in which specific metabolic activities take place. Most
important among these is the presence of a nucleus, a membrane-delineated
compartment that houses the eukaryotic cell’s deoxyribonucleic acid (DNA).
Eukaryotic organisms also have other specialised structures, called organelles,
which are small structures within cells that perform dedicated functions.
Figure 2.1 Organisation of the cell and its organelles
(Picture courtesy https://en.wikipedia.org/wiki/File:Endomembrane_system_diagram_en.svg)
Eukaryotic and prokaryotic cells differ in the organisation of the nucleus. A
well-defined envelope encompassing the DNA and chromosomes forms the
nucleus of a eukaryotic cell. A prokaryotic cell has no such demarcated
nucleus. In eukaryotes, key processes such as DNA replication and
transcription are carried out in the nucleus, while others such as translation
take place in the cytoplasm. Specialised organelles such as the Golgi
apparatus and mitochondria are distinct and serve other unique functions in
the cell.
For most unicellular organisms, reproduction is a simple matter of cell
duplication, also known as replication. But for multicellular organisms, cell
replication and reproduction are two separate processes. Multicellular
organisms replace damaged or worn out cells through a replication process
called mitosis - the division of a eukaryotic cell nucleus to produce two
identical daughter nuclei. Every time a cell divides, it must ensure that its
DNA is shared between the two daughter cells. Mitosis is the process of
“divvying up” the genome between the daughter cells.
Find Out More
DNA and RNA: History and Structure
https://www.youtube.com/watch?v=qoERVSWKmGk&index=404&list=UUEik-U3T6u6JA0XiHLbNbOw
Central Dogma, Replication, Transcription and Translation
https://www.youtube.com/watch?v=W4mYwsr9gGE&list=UUEikU3T6u6JA0XiHLbNbOw&index=403
Cell cycle: Mitosis and Meiosis
https://www.youtube.com/watch?v=2aVnN4RePyI&list=UUEikU3T6u6JA0XiHLbNbOw
Meiosis is a specialised type of cell division that occurs during the formation
of gametes. Although meiosis may seem much more complicated than mitosis,
it is really just two cell divisions in sequence. Each of these divisions
maintains strong similarities to mitosis. To reproduce, eukaryotes must first
create special cells called gametes (eggs and sperms) that then fuse to form
the beginning of a new organism. Gametes are but one of the many unique
cell types that multicellular organisms require in order to function as a
complete organism. The gametes are created through the process of meiosis.
Meiosis serves to reduce the chromosome number for that particular
organism by half. The sperm and egg join to make a single cell, which restores
the chromosome number. This joined cell then divides and differentiates into
different cell types that eventually form an entire functioning organism.
Figure 2.2 Meiosis and Mitosis pathways in the cell
(Picture courtesy http://www.accessexcellence.org/AB/GG/comparison.html)
Mitosis results in two daughter cells, each with two sets of chromosomes
(diploid). These daughter cells are exact copies of their parent. Meiosis, on
the other hand, takes place in specialised reproductive cells and results in
haploid cells with just one copy of each chromosome. Recombination during the
formation of gamete cells enables a shuffling of the genetic information in the
daughter cells. These gametes (ova or sperm) give rise to the diploid embryo
after fertilisation.
All organisms suffer a certain number of small mutations, or random changes
in a DNA sequence, during the process of DNA replication. These are called
spontaneous mutations and occur at a rate characteristic for that organism.
Genetic recombination refers more to a large-scale rearrangement of a DNA
molecule. This process involves pairing between complementary strands of
two parental duplex, or double-stranded DNAs, and results from a physical
exchange of chromosome material.
The position at which a gene is located on a chromosome is called a locus. In
a given individual, one might find two different versions of this gene at a
particular locus. These alternate gene forms are called alleles. Recombination
results in a new arrangement of maternal and paternal alleles on the same
chromosome. Although the same genes appear in the same order, the alleles
are different. This process explains why offspring from the same parents can
look so different.
All the different cell types in our body are derived from a single, fertilised
egg cell through differentiation. Differentiation is the process by which an
unspecialised cell becomes specialised into one of the many cells that make
up the body, such as a heart, liver or muscle cell. During differentiation,
certain genes are turned on, or become activated, while other genes are
switched off, or inactivated. This process is intricately regulated. As a result,
a differentiated cell will develop specific structures and perform certain
functions.
The Central Dogma
The most fundamental property of all living things is their ability to reproduce.
All cells arise from pre-existing cells. That is, their genetic material must be
replicated and passed from parent cell to progeny. Likewise, all multicellular
organisms inherit their genetic information specifying structure and function
from their parents.
Every organism, including humans, has a genome that contains all the
biological information needed to build and maintain a living example of that
organism. The biological information contained in a genome is encoded in its
DNA and divided into discrete units called genes. Genes code for proteins,
some of which attach to the genome at the appropriate positions and switch
on a series of reactions called gene expression.
Figure 2.3 The central dogma and flow of genetic information in the cell
Genetic information in the cell flows from DNA to RNA and then to
proteins. Transcription is the process by which the genetic information in
DNA is copied into functional mRNA. Translation, which occurs in the cytoplasm,
uses this message to make proteins. Proteins are the workhorses of the cell
and carry out much of the structural, functional and enzymatic roles in the
cell.
The Central Dogma, a fundamental principle of molecular biology, states that
genetic information flows from DNA to RNA to protein. The genetic code
that resides in DNA is passed from generation to generation.
In the process of making a protein, the encoded information must be faithfully
transmitted first to RNA then to protein.
The process of duplicating a cell’s genome, or DNA replication, is required
every time a cell divides. Replication, like all cellular activities, requires
specialised proteins for carrying out the job.
We see evidence of genetic variation among members of the same species,
such as different hair and eye colour, skin pigment, height and blood type.
These expressed, or phenotypic, traits are due to genotypic variation in a
person's DNA sequence. When two individuals display different phenotypes
of the same trait, they are said to have two different “alleles” for the same
gene. This means that the gene's sequence is slightly different in the two
individuals and the gene is said to be polymorphic. These polymorphic sites
influence gene expression, and also serve as markers for genomic research
efforts.
Most genetic variation occurs during the phases of the cell cycle when DNA
is duplicated. Mutations in the new DNA strand can manifest as base
substitutions, such as when a single base gets replaced with another; deletions,
where one or more bases are left out; or insertions, where one or more bases
are added.
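The three kinds of mutation described above can be sketched in a few lines of Python. This is only an illustration: the function names and the short example sequence are assumptions made for this sketch, and positions are counted from 0.

```python
def substitute(seq: str, pos: int, base: str) -> str:
    """Base substitution: replace the single base at position pos."""
    return seq[:pos] + base + seq[pos + 1:]


def delete(seq: str, pos: int, length: int = 1) -> str:
    """Deletion: leave out `length` bases starting at position pos."""
    return seq[:pos] + seq[pos + length:]


def insert(seq: str, pos: int, bases: str) -> str:
    """Insertion: add one or more bases before position pos."""
    return seq[:pos] + bases + seq[pos:]


original = "ATGCCCTGC"  # hypothetical example sequence
print(substitute(original, 3, "T"))  # ATGTCCTGC
print(delete(original, 3))           # ATGCCTGC
print(insert(original, 3, "GG"))     # ATGGGCCCTGC
```

Note that a substitution keeps the sequence length unchanged, while deletions and insertions change it.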
Mutations in Genomics
(Access video via iStudyGuide)
While mutations can cause improper cell development, they also provide a
species with the opportunity to adapt to new environments, as well as to
protect a species from new pathogens. Mutations are what lie behind the
popular saying of “survival of the fittest”, which is associated with the theory
of evolution by natural selection proposed by Charles Darwin in 1859.
Inside each of our cells lies a nucleus - a membrane bounded region that
provides a sanctuary for genetic information. The nucleus contains long
strands of DNA that encode this genetic information. A DNA chain is made
up of four chemical bases: adenine (A) and guanine (G), which are called
purines, and cytosine (C) and thymine (T), referred to as pyrimidines. Each
base has a slightly different composition, or combination of oxygen, carbon,
nitrogen and hydrogen. In a DNA chain, every base is attached to a sugar
molecule (deoxyribose) and a phosphate molecule, forming a nucleotide, the
building block of a nucleic acid. Individual nucleotides are linked through the
phosphate group
and it is the precise order, or sequence, of nucleotides that determines the
product made from that gene.
The DNA that constitutes a gene is a double-stranded molecule consisting of
two chains running in opposite directions. The chemical nature of the bases
in double-stranded DNA creates a slight twisting force that gives DNA its
characteristic gently coiled structure, known as the double helix. The two
strands are connected to each other by chemical pairing of each base on one
strand to a specific partner on the other strand. The base Adenine (A) pairs
with thymine (T), while guanine (G) pairs with cytosine (C). Thus, A-T and
G-C base pairs are said to be complementary. This complementary base
pairing is what makes DNA a suitable molecule for carrying our genetic
information - one strand of DNA can act as a template to direct the synthesis
of a complementary strand. In this way, the information in a DNA sequence
is readily copied and passed on to the next generation of cells.
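Complementary base pairing can be illustrated with a short Python sketch. The function names and the example strand are assumptions made for this illustration. Because the two strands run in opposite directions, the partner strand read in its own 5' to 3' direction is the reverse complement.

```python
# A pairs with T, and G pairs with C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(strand: str) -> str:
    """Return the base-by-base partner of a DNA strand."""
    return strand.translate(COMPLEMENT)


def reverse_complement(strand: str) -> str:
    """Return the partner strand written in its own 5' to 3' direction."""
    return complement(strand)[::-1]


strand = "ATGCCCTGC"  # hypothetical example, written 5' to 3'
print(complement(strand))          # TACGGGACG
print(reverse_complement(strand))  # GCAGGGCAT
```

Taking the reverse complement twice returns the original strand, which is why either strand can serve as a template for the other.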
In the first step of replication, a special protein, called a helicase, unwinds a
portion of the parental DNA double helix. Next, a molecule of DNA
polymerase binds to one strand of the DNA. DNA polymerase moves along
this template strand in the 3' to 5' direction, adding complementary
nucleotides so that the new strand grows continuously in the 5' to 3'
direction. This newly synthesised strand is called the leading strand.
Because DNA synthesis can only occur in the 5' to 3' direction, a second DNA
polymerase molecule is used to bind to the other template strand as the
double helix opens. This molecule synthesises discontinuous segments of
polynucleotides, called Okazaki fragments. Another enzyme, called DNA
ligase, is responsible for stitching these fragments together into what is called
the lagging strand.
Figure 2.4 DNA replication of the leading and lagging strand
(Picture courtesy
http://www.ultranet.com/~jkimball/BiologyPages/D/DNAReplication.html)
DNA replication enables duplication of the two strands of the DNA. During
this process the leading strand is manufactured in a contiguous fashion, while
the lagging strand is synthesised in parts called Okazaki fragments. The
Okazaki fragments are stitched together using the enzyme DNA ligase. DNA
polymerase used for synthesis of both strands is identical. Other proteins
such as helicases and topoisomerases are needed to unwind and re-wind the
DNA before and after replication.
There are many replication origin sites on a eukaryotic chromosome.
Therefore, replication can begin at some origins earlier than at others. As
replication nears completion, “bubbles” of newly replicated DNA meet and
fuse, forming two new molecules.
Just like DNA, ribonucleic acid (RNA) is a chain, or polymer, of nucleotides
with the same 5' to 3' direction of its strands. However, the ribose sugar
component of RNA is slightly different chemically from that of DNA.
Ribonucleic acid has a 2' hydroxyl group that is not present in deoxyribonucleic
acid. Other fundamental structural differences exist. For example, uracil takes
the place of the thymine nucleotide found in DNA and RNA is, for the most
part, a single-stranded molecule.
DNA directs the synthesis of a variety of RNA molecules, each with a unique
role in cellular function. For example, all genes that code for proteins are first
made into an RNA strand in the nucleus called a messenger RNA (mRNA).
The mRNA carries the information encoded in DNA out of the nucleus to the
protein assembly machinery, called the ribosome, in the cytoplasm. The
ribosome complex uses mRNA as a template to synthesise the exact protein
coded by the gene.
DNA transcription refers to the synthesis of RNA from a DNA template. This
process is very similar to DNA replication. There are different proteins that
are responsible for transcription. The most important is RNA
polymerase, the enzyme that catalyses the synthesis of RNA from a DNA
template. In order for transcription to be initiated, RNA polymerase must be
able to recognise the beginning sequence of a gene so that it knows where to
start synthesising an mRNA. It is directed to this initiation site by the ability
of one of its subunits to recognise a specific DNA sequence found at the
beginning of a gene called the promoter sequence. The promoter sequence is
a unidirectional sequence found on one strand of the DNA that instructs the
RNA polymerase in both where to start synthesis and in which direction
synthesis should continue. The RNA polymerase then unwinds the double
helix at that point and begins synthesis of an RNA strand complementary to
one of the strands of DNA. This strand is called the antisense or template
strand, while the other strand is referred to as the sense or coding strand.
Synthesis can then proceed in a unidirectional manner.
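The relationship between the template (antisense) strand, the coding (sense) strand and the mRNA can be sketched as follows. This is a minimal illustration, assuming a short made-up gene fragment; the function name is hypothetical. The mRNA is complementary to the template strand and therefore has the same sequence as the coding strand, with uracil (U) in place of thymine (T).

```python
# Pairing rules when RNA is built on a DNA template:
# template A -> U, T -> A, G -> C, C -> G
DNA_TO_RNA_PAIRING = str.maketrans("ACGT", "UGCA")


def transcribe_template(template_3_to_5: str) -> str:
    """Build the mRNA (5' to 3') from a template strand written 3' to 5'."""
    return template_3_to_5.translate(DNA_TO_RNA_PAIRING)


coding_strand = "ATGCCCTGC"    # sense strand, 5' to 3' (hypothetical)
template_strand = "TACGGGACG"  # antisense strand, 3' to 5'

mrna = transcribe_template(template_strand)
print(mrna)  # AUGCCCUGC

# The mRNA matches the coding strand apart from U replacing T.
print(mrna == coding_strand.replace("T", "U"))  # True
```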
Protein-coding sequences make up only about one to two percent of the total
DNA in our genome. In the human genome, the coding portions of a gene,
called exons, are interrupted by intervening sequences, called introns. In
other words, a eukaryotic gene does not code for a protein in one continuous
stretch of DNA. Both exons and introns are transcribed into RNA, but before it
is transported to the ribosome, the primary transcript is edited. This editing
process removes the introns, joins the exons together, and adds unique
features to each end of the transcript to make a mature mRNA.
Figure 2.5 Schema of splicing of hnRNA transcript and translation of proteins
(Picture courtesy
https://upload.wikimedia.org/wikipedia/commons/thumb/8/81/0326_Splicing.jpg/220px0326_Splicing.jpg)
The transcription machinery in the nucleus gives rise to an RNA transcript
called the heterogeneous nuclear RNA (hnRNA). This is then spliced to
remove the portions that do not encode the protein (called introns). Only
the exon portion (containing information to form the final protein) is joined
to form the messenger RNA (mRNA). The messenger RNA serves as a template
for the translational machinery to generate proteins.
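Splicing can be pictured as cutting the exon segments out of the primary transcript and joining them in order. The sketch below assumes a made-up transcript with one intron and assumes that the exon coordinates are already known (0-based start, end-exclusive); in a real cell the splice sites are recognised by the splicing machinery, which is not modelled here.

```python
def splice(primary_transcript, exons):
    """Join the exon segments of a primary transcript, dropping the introns.

    exons is a list of (start, end) pairs, 0-based and end-exclusive.
    """
    return "".join(primary_transcript[start:end] for start, end in exons)


# Hypothetical primary transcript: exon 1 + intron + exon 2
exon_1 = "AUGCCC"
intron = "GUAAGUUUUCAG"
exon_2 = "UGCUUGGCC"
hn_rna = exon_1 + intron + exon_2

exon_coordinates = [(0, 6), (18, 27)]
mature_mrna = splice(hn_rna, exon_coordinates)
print(mature_mrna)  # AUGCCCUGCUUGGCC
```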
While DNA is the carrier of genetic information in a cell, proteins do the bulk
of the work. Proteins are long chains containing as many as 20 different kinds
of amino acids. The genetic code carried by DNA is what specifies the order
and number of amino acids and, therefore, the shape and function of the
protein.
A given amino acid can have more than one codon. These redundant codons
usually differ at the third position. For example, the amino acid serine is
encoded by UCU, UCC, UCA, or UCG (and also by AGU and AGC). This redundancy is the key to
accommodating mutations that occur naturally as DNA is replicated and new
cells are produced. By allowing some of the random changes in DNA to have
no effect on the ultimate protein sequence, a sort of genetic safety net is
created. Some codons do not code for an amino acid at all, but instruct the
ribosome when to stop adding new amino acids.
Figure 2.6 Standard Genetic Code
Information in the mRNA is read in triplets (3 bases), each representing an
amino acid. The codons are degenerate but not ambiguous. Thus, while an
amino acid can be coded by more than one triplet codon, each codon is specific
for one amino acid only. Exceptions to the standard genetic code are seen in
the mitochondria and chloroplasts within the cell.
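The idea that the code is degenerate but not ambiguous can be checked with a small lookup table. The sketch below uses only a handful of entries from the standard genetic code (serine, cysteine, tryptophan and one stop codon); the dictionary and function names are assumptions made for this illustration.

```python
# A small excerpt of the standard genetic code (mRNA codons).
CODON_EXCERPT = {
    "UCU": "Ser", "UCC": "Ser", "UCA": "Ser", "UCG": "Ser",
    "AGU": "Ser", "AGC": "Ser",
    "UGU": "Cys", "UGC": "Cys",
    "UGG": "Trp",
    "UGA": "Stop",
}


def codons_for(amino_acid):
    """Degenerate: one amino acid may be listed under several codons."""
    return sorted(c for c, aa in CODON_EXCERPT.items() if aa == amino_acid)


def is_synonymous(codon_a, codon_b):
    """True when a codon change leaves the encoded amino acid unchanged."""
    return CODON_EXCERPT[codon_a] == CODON_EXCERPT[codon_b]


print(codons_for("Ser"))            # six codons, all meaning serine
print(is_synonymous("UCU", "UCG"))  # True: third-position change, still Ser
print(is_synonymous("UGC", "UGA"))  # False: Cys becomes a stop signal
```

Each key in the dictionary maps to exactly one value, which mirrors the statement that a codon is never ambiguous.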
The Genetic Code
(Access video via iStudyGuide)
Proteins are polymers of amino acids. Three consecutive nucleotide bases
code for one particular amino acid – this is called a triplet codon. Note that
the triplet codon is read in a non-overlapping manner. Also there is no
thymine base in RNA – in its place is a base known as uracil (U). Proteins fold
into complex 3D structures and can function in metabolism and structure of
the cell.
DNA sequence     : 5’- ATG CCC TGC TTG GCC … -3’
RNA sequence     : 5’- AUG CCC UGC UUG GCC … -3’
Protein sequence :      M   P   C   L   A  …
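The DNA, RNA and protein sequences shown above can be reproduced with a short Python sketch. The table is the standard genetic code written in the usual compact form (bases ordered U, C, A, G; `*` marks a stop codon); the function names are assumptions made for this illustration, and the DNA sequence is treated as the coding strand.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon in UUU, UUC, UUA, UUG, UCU, ... order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna):
    """Coding-strand DNA to mRNA: same sequence with U in place of T."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """Read non-overlapping triplets until the end or a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGCCCTGCTTGGCC"
mrna = transcribe(dna)
print(mrna)             # AUGCCCUGCUUGGCC
print(translate(mrna))  # MPCLA
```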
Figure 2.7 Schema of Protein structure: Primary to quaternary structures
Protein sequence is a linear polymer of amino acids and constitutes the
primary structure of a protein. The sequence along with the cytoplasmic
conditions dictates folding of the protein into secondary structures, such as
alpha-helices, beta-sheets, etc. Further folds of the molecule give rise to more
intricate structures constituting the tertiary structure. Multiple peptides can
be linked together to form the quaternary structure. For example, hemoglobin
is made up of 2 alpha and 2 beta subunits.
The cellular machinery responsible for synthesising proteins is the ribosome.
The ribosome consists of structural RNA and about 80 different proteins. In
its inactive state, it exists as two subunits: a large subunit and a small subunit.
When the small subunit encounters an mRNA, the process of translating an
mRNA to a protein begins. In the large subunit, there are two sites for amino
acids to bind, and thus be close enough to each other to form a bond. The “A
site” accepts a new transfer RNA, or tRNA, which bears an amino acid and is
the adaptor molecule acting as a translator between mRNA and protein. The
“P site” binds the tRNA that becomes attached to the growing chain.
The tRNA is a specific RNA molecule. Each tRNA has an acceptor site
that binds a particular amino acid, and an anticodon site, a sequence of three
unpaired nucleotides (the anticodon) that can bind to the complementary
codon on the mRNA. Each tRNA also has a specific charger protein, called an
aminoacyl tRNA synthetase. This protein can only bind to that particular
tRNA and attach the correct amino acid to the acceptor site. Each codon
specifies a particular amino acid. In this way, the ribosomal complex builds a
protein one amino acid at a time, with the order of amino acids determined
precisely by the order of the codons in the mRNA.
A protein will often undergo further modification, called post-translational
modification. For example, it might be cleaved by a protein-cutting enzyme,
called a protease, at a specific place or have a few of its amino acids altered.
The modified sequence of amino acids affects the structure and function of
the protein.
Each cell contains thousands of different proteins: structural components that
give cells their shape and help them move; enzymes that make new molecules
and catalyse nearly all chemical processes in cells; hormones that transmit
signals throughout the body; antibodies that recognise foreign molecules; and
transport molecules that carry oxygen.
Since the draft human genome sequence from the Human Genome Project (HGP)
was published in 2001, we have discovered that the majority of the genome
does not code for protein-expressing genes. Only about 2% of the genomic
real estate is represented by such sequences. Non-coding DNA represents the
larger bulk of the genome. We are
now trying to solve this paradox of how non-protein forming entities regulate
the cell and how they may be important to define a species.
Coding vs. Non-coding paradox
(Access video via iStudyGuide)
Impact of Molecular Genetics
Molecular genetics is the study of the agents that pass on, from generation
to generation, the information blueprint that directs all cellular activities and
specifies the developmental plan. These molecules, our genes, are long polymers of
DNA. Just four chemical building blocks (guanine (G), adenine (A), thymine
(T) and cytosine (C)) are placed in a unique order to code for all the genes in
all living organisms.
Genes determine hereditary traits, such as the colour of our hair or our eyes.
They do this by providing instructions for how every activity in every cell of
our body should be carried out. Many diseases are caused by mutations.
When the information coded by a gene changes, the resulting protein may not
function properly or may not even be made at all. In either case, the cells
containing that genetic change may no longer perform as expected.
Most sequencing and analysis technologies were developed from studies of
nonhuman genomes, notably those of the bacterium Escherichia coli, the yeast
Saccharomyces cerevisiae, the fruit fly Drosophila melanogaster, the roundworm
Caenorhabditis elegans, and the laboratory mouse Mus musculus. These simpler
systems provide excellent models for developing and testing the procedures
needed for studying the much more complex human genome.
A large amount of genetic information has already been derived from these
organisms, providing valuable data for the analysis of normal human gene
regulation, genetic diseases, and evolutionary processes. For example,
researchers have already identified single genes associated with a number of
diseases, such as cystic fibrosis. As research progresses, investigators will also
uncover the mechanisms for diseases caused by several genes or by single
genes interacting with environmental factors. Genetic susceptibilities have
been implicated in many major disabling and fatal diseases including heart
disease, stroke, diabetes, and several kinds of cancer. The identification of
these genes and their proteins will pave the way to more effective therapies
and preventive measures. Investigators determining the underlying biology
of genome organisation and gene regulation will also begin to understand
how humans develop, why this process sometimes goes awry, and what
changes take place as people age.
Genomic Databases
(Access video via iStudyGuide)
There are three major public DNA databases:
• GenBank at the National Center for Biotechnology Information
(NCBI) of the National Institutes of Health (NIH) in Bethesda, USA
• DNA Database of Japan (DDBJ)
• EMBL nucleotide database at the European Bioinformatics Institute (EBI)
The three databases exchange data, so the underlying raw DNA sequences are
identical. In addition, there are other
categories of bioinformatics datasets that contain DNA and/or protein
sequence data.
2.2 GenBank: Database of Most Known Nucleotide and
Protein Sequences
Building of GenBank
GenBank is the NIH genetic sequence database, an annotated collection of all
publicly available DNA sequences. It is built primarily from the submission
of sequence data from authors and from the bulk submission of expressed
sequence tag (EST), genome survey sequence (GSS) and other
high-throughput data from sequencing centres.
Over 100,000 species are represented in GenBank, with over 1000 new species
added per month.
Taxonomic group    Number of species
all species        128,941
viruses            6,137
bacteria           31,262
archaea            2,100
eukaryota          87,147
The most sequenced organisms in GenBank are listed as follows:
Homo sapiens (6.9 million entries)
Mus musculus (5.0 million)
Zea mays (896,000)
Rattus norvegicus (819,000)
Gallus gallus (567,000)
Arabidopsis thaliana (519,000)
Danio rerio (492,000)
Drosophila melanogaster (350,000)
Oryza sativa (221,000)
A new release of GenBank is made every two months. The growth of the
database in the past ten years is presented in the following table:
Year    Base Pairs        Sequences
1994    217,102,462       215,273
1995    384,939,485       555,694
1996    651,972,984       1,021,211
1997    1,160,300,687     1,765,847
1998    2,008,761,784     2,837,897
1999    3,841,163,011     4,864,570
2000    11,101,066,288    10,106,023
2001    15,849,921,438    14,976,310
2002    28,507,990,166    22,318,883
2003    36,553,368,485    30,968,418
Convenient and quick submission of sequence data to GenBank can be done
through a WWW form called BankIt, or by using the stand-alone submission
software Sequin. The number of bases grows at an exponential rate. At the
time of this writing, there are over 38,989,342,565 bases.
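The statement that the number of bases grows exponentially can be checked against the growth table above. The sketch below is a rough back-of-the-envelope calculation, not an official NCBI statistic: it assumes a constant growth factor between the first and last years in the table, and the function names are made up for this illustration.

```python
import math

# Base-pair totals copied from the GenBank growth table (1994 to 2003).
BASE_PAIRS = {
    1994: 217_102_462,
    1995: 384_939_485,
    1996: 651_972_984,
    1997: 1_160_300_687,
    1998: 2_008_761_784,
    1999: 3_841_163_011,
    2000: 11_101_066_288,
    2001: 15_849_921_438,
    2002: 28_507_990_166,
    2003: 36_553_368_485,
}


def annual_growth_factor(totals):
    """Average factor by which the total is multiplied each year."""
    years = sorted(totals)
    first, last = years[0], years[-1]
    return (totals[last] / totals[first]) ** (1 / (last - first))


def doubling_time_years(factor):
    """Years needed to double at a constant annual growth factor."""
    return math.log(2) / math.log(factor)


factor = annual_growth_factor(BASE_PAIRS)
print(f"average annual growth factor: {factor:.2f}")
print(f"approximate doubling time: {doubling_time_years(factor):.2f} years")
```

Under this simplifying assumption, the database roughly doubles in a little over a year across the period shown.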
Access to GenBank is available via several methods. Each GenBank record,
consisting of both a sequence and its annotations, is assigned a stable and
unique identifier, the accession number, which remains constant over the
lifetime of the record even when there is a change to the sequence or
annotation. The DNA sequence within a GenBank record is also assigned a
unique identifier, called a ‘GI’, that appears on the VERSION line of GenBank
flat file records following the accession number. A third identifier of the form
‘Accession.version’, also displayed on the VERSION line of flat file records,
consolidates the information present in the GI and accession numbers. An
entry appearing in the database for the first time has an ‘Accession.version’
identifier equivalent to the ACCESSION number of the GenBank record
followed by ‘.1’ to indicate the first version of the sequence for the record, e.g.,
ACCESSION AF000001 VERSION AF000001.1 GI: 987654321. When a change
is made to a sequence given in a GenBank record, a new GI number is issued
to the sequence and the version extension of the ‘Accession.version’ identifier
is incremented. The accession number for the record as a whole remains
unchanged and the older sequence remains available under the old
‘Accession.version’ identifier and GI.
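The ‘Accession.version’ scheme described above can be sketched in Python. This is a minimal illustration of the naming rule only: the function names are assumptions, the sketch does not contact GenBank, and it makes no attempt to model how GI numbers are issued.

```python
def parse_accession_version(identifier):
    """Split an identifier such as 'AF000001.1' into ('AF000001', 1)."""
    accession, dot, version = identifier.rpartition(".")
    if not dot or not accession or not version.isdigit():
        raise ValueError(f"not an Accession.version identifier: {identifier!r}")
    return accession, int(version)


def next_version(identifier):
    """Identifier a record would carry after one change to its sequence."""
    accession, version = parse_accession_version(identifier)
    return f"{accession}.{version + 1}"


print(parse_accession_version("AF000001.1"))  # ('AF000001', 1)
print(next_version("AF000001.1"))             # AF000001.2
```

The accession part stays the same across versions, which matches the rule that the accession number of the record as a whole remains unchanged.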
GenBank Records and Divisions
Each GenBank entry includes a concise description of the sequence, the
scientific name and taxonomy of the source organism, bibliographic
references and a table of features listing areas of biological significance, such
as coding regions and their protein translations, transcription units, repeat
regions and sites of mutation or modification.
The files in the GenBank distribution have traditionally been divided into
‘divisions’ that roughly correspond to taxonomic groups such as bacteria
(BCT), viruses (VRL), primates (PRI) and rodents (ROD).
In recent years, divisions have been added to support specific sequencing
strategies. These include divisions for EST, GSS, high-throughput genomic
(HTG) and high-throughput cDNA (HTC) sequences, making a total of 17
divisions. For convenience in file transfer, the larger divisions, such as the EST
and PRI, are partitioned into multiple files when posting the bimonthly
GenBank releases.
One point to note before we discuss the EST database is the concept of
naturally occurring molecules and those created in the laboratory (in vitro).
This is crucial as it bears on the ethical use of some of the data and the concept
of ownership of sequence data.
In vitro vs. in vivo
(Access video via iStudyGuide)
EST Database
Expressed Sequence Tags (ESTs) are small pieces of DNA sequence (200 to 500
nucleotides) generated by sequencing an expressed gene. They are sequence
bits of DNA that represent genes expressed in certain cells, tissues, or organs
from different organisms and scientists use these tags to fish a gene out of a
portion of chromosomal DNA by matching base pairs.
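The idea of using an EST to fish a gene out of chromosomal DNA by matching base pairs can be sketched as a simple exact-match search on both strands. This is a deliberately simplified illustration: the sequences and function names are made up, and real searches use tools such as BLAST that tolerate mismatches and gaps.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def find_est(genomic, est):
    """Return (position, strand) for every exact match of the EST.

    Positions are 0-based on the given genomic strand; '-' means the EST
    matches the opposite strand.
    """
    hits = []
    for strand, query in (("+", est), ("-", reverse_complement(est))):
        start = genomic.find(query)
        while start != -1:
            hits.append((start, strand))
            start = genomic.find(query, start + 1)
    return hits


genomic = "TTGACATGCCCTGCTTGGCCTAAGG"   # hypothetical chromosomal fragment
print(find_est(genomic, "CCCTGCTTGG"))  # [(8, '+')]
print(find_est(genomic, "CCAAGCAGGG"))  # [(8, '-')]
```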
Due to their utility, speed with which they may be generated, and the low cost
associated with this technology, many individual scientists as well as large
genome sequencing centres have been generating hundreds of thousands of
ESTs for public use. Once an EST was generated, scientists would submit their
tags to GenBank.
ESTs continue to be the major source of new sequence records and gene
sequences. Over the past year the number of ESTs has increased by over 45%
to a total of 18.1 million sequences representing over 580 different organisms.
The top five organisms represented in the EST division are H. sapiens (5.4
million records), M. musculus (3.8 million records), R. norvegicus (540 000
records), Triticum aestivum (500 000 records) and Ciona intestinalis (490 000
records).
With the rapid submission of so many ESTs, it became difficult to identify a
sequence that had already been deposited in the database. It was becoming
increasingly apparent that if ESTs were to be easily accessed and useful as
gene discovery tools, they needed to be organised in a searchable database
that also provided access to other genome data. Therefore, in 1992, a new
database was designed to serve as a collection point for ESTs. Once an EST
that was submitted to GenBank had been screened and annotated, it was then
deposited in this new database, called dbEST.
Using dbEST, a scientist can access not only data on human ESTs, but
information on ESTs from over 300 other organisms as well. Whenever
possible, NCBI scientists annotate the EST record with any known
information. For example, if an EST matches a DNA sequence that codes for
a known gene with a known function, that gene’s name and function are placed
on the EST record. Annotating EST records allows public scientists to use
dbEST as an avenue for gene discovery. By employing a database search tool,
such as BLAST, any interested party can conduct sequence similarity searches
against dbEST.
Non-Redundant Set of Gene-Oriented Clusters
Because a gene can be expressed as mRNA many times, ESTs ultimately
derived from this mRNA may be redundant. That is, there may be many
identical, or similar, copies of the same EST. Such redundancy and overlap
mean that when someone searches dbEST for a particular EST, he or she may
retrieve a long list of tags, many of which may represent the same gene.
Searching through all these identical ESTs can be very time consuming.
To resolve the redundancy and overlap problem, NCBI investigators
developed the UniGene database. As part of its daily processing of GenBank
EST data, the NCBI identifies through BLAST searches all homologies for new
EST sequences and incorporates that information into the companion
database, dbEST. The data in dbEST is further processed to produce the
UniGene database.
UniGene is an experimental system for automatically partitioning GenBank
sequences into a non-redundant set of gene-oriented clusters. Each UniGene
cluster contains sequences that represent a unique gene, as well as related
information such as the tissue types in which the gene has been expressed and
map location.
Consequently, the collection may be of use to the community as a resource for
gene discovery. UniGene has also been used by experimentalists to select
reagents for gene mapping projects and large-scale expression analysis.
However, it should be noted that the procedures for automated sequence
clustering are still under development and the results may change from time
to time as improvements are made.
Currently, sequences from animals including human, rat, mouse, cow,
zebrafish, clawed frog, fruit fly and mosquito have been processed. The
plant species processed are wheat, rice, barley, maize and cress. These
species were chosen because they have the greatest amounts of EST data
available and represent a variety of organisms. Additional organisms may be
added in the future.
Sequence-Tagged Sites Database
The STS division of GenBank, dbSTS, contains over 240 000 sequences
including anonymous STSs based on genomic sequence as well as gene-based
STSs derived from the 3′ ends of genes and ESTs. These STS records usually
include primer sequences, annotations and PCR conditions.
GSS Database
The GSS division of GenBank, dbGSS, is similar to the EST division, with the
exception that most of the sequences are genomic in origin, rather than cDNA
(mRNA). It should be noted that two classes (exon trapped products and gene
trapped products) may be derived via a cDNA intermediate. Care should be
taken when analysing sequences from either of these classes, as a splicing
event could have occurred and the sequence represented in the record may be
interrupted when compared to genomic sequence. The GSS division contains
(but is not limited to) the following types of data:
• random “single pass read” genome survey sequences
• cosmid/BAC/YAC end sequences
• exon trapped genomic sequences
• Alu PCR sequences
• transposon-tagged sequences
The GSS division of GenBank has grown over the past year by 73% to a total
of 6.4 million records with over 2.0 billion nucleotides.
GSS records are predominantly single reads from bacterial artificial
chromosomes (‘BAC-ends’) used in a variety of genome sequencing projects.
The most highly represented species in the GSS division are Z. mays (1.3
million records), M. musculus (952 000 records), H. sapiens (893 000 records)
and Brassica oleracea (595 000 records).
Although dbGSS sequences are incorporated into the GSS Division of
GenBank, annotation in dbGSS is more comprehensive and includes detailed
information about the contributors, experimental conditions, and genetic map
locations.
Single Nucleotide Polymorphism Database
Because SNPs occur frequently throughout the genome and tend to be
relatively stable genetically, they serve as excellent biological markers.
Biological markers are segments of DNA with an identifiable physical location
that can be easily tracked and used for constructing a chromosome map that
shows the position of known genes, or other markers, relative to each other.
These maps allow researchers to study and pinpoint traits resulting from the
interaction of more than one gene.
To facilitate the identification and cataloguing of SNPs, the SNP database,
dbSNP, has been created. It is intended to stimulate many areas of biological
research, including the identification of the genetic components of disease.
dbSNP links directly to a number of software tools designed to aid in SNP
analysis.
Records in dbSNP are cross-annotated within other internal information
resources such as PubMed, genome project sequences and the dbSTS database.
2.3 NCBI databases and Tools
The NCBI (http://www.ncbi.nlm.nih.gov) creates public databases, conducts
research in computational biology, develops software tools for analysing
genome data, and disseminates biomedical information.
Figure 2.8 Screen shot of the NCBI database header at www.ncbi.nlm.nih.gov
NCBI serves as a central repository for several genomic resources, including
the MEDLINE database, Entrez (the retrieval tool), BLAST (the sequence search
and alignment tool), OMIM (the repository for diseases) and several others.
Today, it serves as an indispensable resource for genomic research.
PubMed is the National Library of Medicine’s search service that provides
access to
• 11 million citations in MEDLINE,
• links to participating online journals, and
• a PubMed tutorial (via “Education” on the side bar).
Entrez is a search and retrieval system that integrates
• the scientific literature,
• DNA and protein sequence databases,
• 3D protein structure data,
• population study data sets, and
• assemblies of complete genomes.
Figure 2.9 Interrelatedness between the various genomic databases
Genomic databases are inter-related units with several unique databases
linked to each other for ease of flow of information and data analysis. Thus,
one can easily navigate from one database to another to download multiple
types of data.
BLAST (Basic Local Alignment Search Tool) is NCBI’s sequence similarity
search tool. It
• supports analysis of DNA and protein databases
• handles 80,000 searches per day
OMIM (Online Mendelian Inheritance in Man) is
• a catalogue of human genes and genetic disorders
• edited by Dr. Victor McKusick and others at JHU
Books is a searchable resource of on-line books.
TaxBrowser is a browser for the major divisions of living organisms (archaea,
bacteria, eukaryota, viruses). The site features
• taxonomy information such as genetic codes
• molecular data on extinct organisms
The Structure site maintains the Molecular Modelling Database (MMDB), a
database of macromolecular three-dimensional structures, as well as tools for
their visualisation and comparative analysis. It includes
• biopolymer structures obtained from the Protein Data Bank (PDB)
• Cn3D (a 3D-structure viewer)
• vector alignment search tool (VAST)
How to find information about a particular gene or protein
There are five ways to access protein and DNA sequences:
• LocusLink with RefSeq
• UniGene
• Entrez
• EMBL
• ExPASy Sequence Retrieval System (this is separate from NCBI)
Figure 2.10 Screen shot of NCBI site with information on locating
a particular gene or protein
Unique links on the right side of the NCBI home page allow for easy retrieval
of information directly from several resources including gene, protein data
and others.
LocusLink with RefSeq
LocusLink is a great starting point: it collects key information on each
gene/protein from major databases. It now covers 8 organisms. RefSeq
provides a curated, optimal accession number for each DNA or protein.
An accession number is a label that is used to identify a sequence. It is a string
of letters and/or numbers that corresponds to a molecular sequence. The
following are some examples (all for retinol-binding protein, RBP4):
X02775        GenBank genomic DNA sequence
NT_030059     Genomic contig
rs7079946     dbSNP (single nucleotide polymorphism)
N91759.1      An expressed sequence tag (1 of 170)
NM_006744     RefSeq DNA sequence (from a transcript)
NP_007635     RefSeq protein
AAC02945      GenBank protein
Q28369        SwissProt protein
1KT7          Protein Data Bank structure record
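The prefixes in the examples listed above can be recognised programmatically. The following is a minimal Python sketch, not an official NCBI tool: the function name `classify_accession` and the rule list are hypothetical, and only the prefixes that appear in the examples (NM_, NP_, NT_ and rs) are covered. Anything else is reported as "other" rather than guessed.

```python
import re

# Prefix rules taken only from the examples listed in this section.
# The names RULES and classify_accession are illustrative assumptions.
RULES = [
    (re.compile(r"^NM_\d+(\.\d+)?$"), "RefSeq mRNA"),
    (re.compile(r"^NP_\d+(\.\d+)?$"), "RefSeq protein"),
    (re.compile(r"^NT_\d+(\.\d+)?$"), "genomic contig"),
    (re.compile(r"^rs\d+$", re.IGNORECASE), "dbSNP"),
]


def classify_accession(accession):
    """Return a coarse label for an accession number based on its prefix."""
    accession = accession.strip()
    for pattern, label in RULES:
        if pattern.match(accession):
            return label
    return "other"


for acc in ["X02775", "NT_030059", "rs7079946", "NM_006744", "NP_007635"]:
    print(acc, "->", classify_accession(acc))
```

A prefix check of this kind is only a first filter; the database record itself remains the authority on what a sequence represents.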
Figure 2.11 Screen capture of LocusLink site at NCBI
UniGene
UniGene collects expressed sequence tags (ESTs) into clusters, in an attempt
to form one gene per cluster. Use UniGene to study where your gene is
expressed in the body, when it is expressed, and see its abundance.
Entrez
Entrez is divided into sites for nucleotide, protein, structure, genomes, OMIM,
and more.
Figure 2.12 Screen Capture of the Entrez web site at NCBI
Entrez is linked to multiple databases including those for nucleotides, books,
Domains, literature searches, taxonomy and others.
The Nucleotide database contains sequence data from GenBank, EMBL, and
DDBJ, the members of the tripartite, international collaboration of sequence
databases. EMBL is the European Molecular Biology Laboratory at
Hinxton Hall, UK; DDBJ is the DNA Database of Japan in Mishima,
Japan. Sequence data is also incorporated from the Genome Sequence Data
Base (GSDB), Santa Fe, NM. Patent sequences are incorporated through
arrangements with the U.S. Patent and Trademark Office (USPTO), and via
the collaborating international databases from other international patent
offices.
The Protein database contains sequence data from the translated coding
regions from DNA sequences in GenBank, EMBL and DDBJ as well as protein
sequences submitted to Protein Information Resource (PIR), SWISSPROT,
Protein Research Foundation (PRF), and Protein Data Bank (PDB) (sequences
from solved structures).
The Genome database provides views for a variety of genomes, complete
chromosomes, contig sequence maps, and integrated genetic and physical
maps.
One of the challenges facing genomics today is the ease of sequencing without
a concomitant development of data analysis tools. With the advent of new
sequencing techniques such as next-generation sequencing (NGS), the
amount of data available for several genomes is exploding.
Next Generation Sequencing
(Access video via iStudyGuide)
The Structure database or Molecular Modelling DataBase (MMDB) contains
experimental data from crystallographic and NMR structure determinations.
The data for MMDB are obtained from the Protein Data Bank (PDB). The
NCBI has cross-linked structural data to bibliographic information, to the
sequence databases, and to the NCBI taxonomy. Use the NCBI 3D structure
viewer, Cn3D, for easy interactive visualisation of molecular structures from
Entrez.
The PopSet database contains aligned sequences submitted as a set resulting
from a population, a phylogenetic, or mutation study. These alignments
describe such events as evolution and population variation. The PopSet
database contains both nucleotide and protein sequence data.
The OMIM (Online Mendelian Inheritance in Man) database is a catalogue of
human genes and genetic disorders.
The Taxonomy database contains the names of all organisms that are
represented in the NCBI genetic database by at least one nucleotide or protein
sequence.
The Bookshelf has a collection of biomedical books that are linked in Entrez
and can also be separately searched at Bookshelf.
ProbeSet database is an Entrez view of NCBI’s GEO (Gene Expression
Omnibus). GEO is a gene expression and hybridisation array repository.
3D Domains contains protein domains from the NCBI Conserved Domain
Database.
A unified, non-redundant view of sequence tagged sites (STSs), UniSTS
integrates marker and mapping data from a variety of public resources. Data
sources include dbSTS, RHdb, GDB, various human maps (Genethon genetic
map, Marshfield genetic map, Whitehead RH map, Whitehead YAC map,
Stanford RH map, NHGRI chr 7 physical map, WashU chrX physical map),
various mouse maps (Whitehead RH map, Whitehead YAC map, Jackson
laboratory’s MGD map).
Database Interlinking
What makes Entrez more powerful than many services is that most of its
records are linked to other records, both within a given database (such as
Nucleotide) and between databases. Links within a database are called
“neighbours” (e.g., Nucleotide neighbours).
Protein and Nucleotide neighbours are determined by performing similarity
searches using the BLAST algorithm to compare the entry amino acid or DNA
sequence to all other amino acid or DNA sequences in the database.
Links between databases are also possible. Nucleotide sequence records in the
Nucleotide database are linked to the PubMed citation of the article in which
the sequences were published. Protein sequence records are linked to the
nucleotide sequence from which the protein was translated.
You can use limits (such as RefSeq) to focus your Entrez search. Limits allow
restriction of a search to a defined subset of the database. Limits can be set to
restrict a search to a particular database field (e.g., the author field). Limits
can be set to search everything but a particular type of data (e.g., exclude
patent records). Alternatively, limits can be set to only search a particular type
of data (e.g., Genomic RNA/DNA) or to only search data from a particular
source database (e.g., EMBL). Date limits and sequence length limits are also
possible. The contents of each Entrez database differ and therefore the Limits
available for each database differ.
EMBL
The European Bioinformatics Institute provides access to sequences via the
EMBL nucleotide database. The searches are comparable to those of the NCBI
GenBank database using Entrez. EBI also sponsors ENSEMBL for
bioinformatic analysis of the human genome. Try ENSEMBL at
www.ensembl.org for a premier human genome web browser.
ExPASy
One of the most useful resources available to obtain protein sequences and
associated data is provided by ExPASy (Expert Protein Analysis System). Try
ExPASy’s sequence retrieval system at http://www.expasy.ch.
How to do a literature search using PubMed
PubMed is the NCBI gateway to MEDLINE. MEDLINE contains bibliographic
citations and author abstracts from over 4,000 journals published in the
United States and in 70 other countries. It has 12 million records dating back
to 1966.
MeSH is the acronym for “Medical Subject Headings.” MeSH is the list of the
vocabulary terms used for subject analysis of biomedical literature at NLM.
MeSH vocabulary is used for indexing journal articles for MEDLINE. The
MeSH controlled vocabulary imposes uniformity and consistency to the
indexing of biomedical literature.
PubMed search strategies:
• Try the tutorial (“Education” on the left sidebar)
• Use Boolean queries (e.g., lipocalin AND disease)
• Try using “limits”
• Try “LinkOut” to find external resources
• Obtain articles on-line via Welch Medical Library (and
download pdf files): http://www.welch.jhu.edu/
Figure 2.13 Literature search strategies in PubMed
Medline can be searched using single search terms. Alternatively, to reduce
the number of results returned and increase specificity, a combination of
search terms can be strung together and used in a Boolean search. Specificity
can be introduced by excluding certain search terms.
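A Boolean search of this kind can also be expressed as a URL. The sketch below is illustrative only: it assumes NCBI's E-utilities `esearch` endpoint, which this guide does not cover, and the helper name `pubmed_search_url` is hypothetical. It only builds the query string; it does not contact NCBI.

```python
from urllib.parse import urlencode

# Assumption: the NCBI E-utilities esearch endpoint (not part of this guide).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(*terms, exclude=(), retmax=20):
    """Join search terms with AND, drop unwanted terms with NOT, build a URL."""
    query = " AND ".join(terms)
    for term in exclude:
        query += f" NOT {term}"
    params = {"db": "pubmed", "term": query, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


# The Boolean query used as an example in this section.
print(pubmed_search_url("lipocalin", "disease"))

# Narrowing the search by excluding a term (the excluded term is hypothetical).
print(pubmed_search_url("lipocalin", "disease", exclude=["review"]))
```

Each additional AND term narrows the result set, while each NOT term removes records containing that term.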
Figure 2.14a Screen capture of the NCBI website with information on PubMed
Figure 2.14b Screen capture of PubMed depicting search for RBP protein
WelchWeb (from the Welch Medical Library) is available at
http://welch.jhmi.edu/welchone/. We use WelchWeb to do literature (and
other) searches through Email gateway/ PubMed gateway/ Library catalogue/
Remote access to Welch services/request literature, and then browse journals
or database.
How to find information about a particular disease
There are two main types of disease databases: general and locus-specific.
General disease databases include OMIM, GeneCards (Weizmann) at
http://www.genecards.org/, and NCBI Genes & Disease at
http://www.ncbi.nlm.nih.gov/disease/.
For locus-specific databases, see the Human Gene Mutation Database (HGMD) at
http://www.hgmd.cf.ac.uk/docs/oth_mut.html.
e.g., Try OMIM for RBP
Genomics in Medicine
(Access video via iStudyGuide)
Figure 2.15a Screen capture of the Location of the OMIM db in the NCBI web site
Figure 2.15b Screen capture of search results from OMIM db for RBP protein
Figure 2.15c Screen capture showing entry for RBP protein in the OMIM db
1. Which of the following is a RefSeq accession number corresponding to an
mRNA?
(a) J01536 (b) NM_15392 (c) NP_52280 (d) AAB134506
2. Is it possible for a single gene to have more than one UniGene cluster?
(a) Yes (b) No
3. If you want literature information, what is the best website to visit?
(a) OMIM (b) Entrez (c) PubMed (d) PROSITE
Answers: 1. (b) 2. (a) 3. (c)
Visit the following link, https://www.youtube.com/watch?v=hRw0TtKgR7Y
and learn more about non-coding DNA (regulatory elements). Answers will
be discussed in class.
a. What are non-coding regulatory elements? Name any two.
b. What is the role of the non-coding region sequences in the cell?
c. How much of the DNA sequences in the genome code for protein coding
genes?
Visit the following websites and carry out the below mentioned searches in
class.
a) Visit the NCBI website (http://www.ncbi.nlm.nih.gov/) to learn more
about Medline and search the OMIM database for specific mutations in any
disease of interest to you. List at least one or two mutations and provide a
screen capture of the details from OMIM for these.
b) Visit the SNP database (http://www.ncbi.nlm.nih.gov/snp) and use the
help function to learn how to search for SNPs in diabetes. Tabulate the
following: exact base location, base pair changed, chromosome position,
organism and clinical details if available.
Carry out the following activity with your lecturer in class. Complete the
PCR figure and provide sequences of the primers, and compute number of
copies after 5 rounds of amplification.
Summary
The following key points are discussed in this unit:
• Cellular Processes: Mitosis, Meiosis, Transcription, Replication,
Splicing and Translation
• Differences between Prokaryotic/Eukaryotic Cells
• The standard genetic code
• Coding vs. non-coding cellular regulators
• NCBI databases: EST, SNP, Entrez and others
• Carrying out PubMed searches and OMIM database searches
STUDY UNIT 3
PAIRWISE SEQUENCE
ALIGNMENT
BME355 STUDY UNIT 3
Learning Outcomes
Upon completion of this unit, you will be able to:
1. Differentiate between Sequence homology, similarity and identity
2. Apply knowledge of pairwise alignment and homology to problems of
evolution
3. Differentiate between the role of protein sequence vs. structure
analysis in understanding evolution
4. Illustrate use of Dayhoff models in pairwise alignment
5. Apply the PAM and BLOSUM matrices to problems of protein
sequence homology
6. Compare and contrast the various global and local
alignment algorithms and their applications in pairwise alignments
7. Apply the Needleman-Wunsch algorithm for global alignment
8. Apply the Smith-Waterman algorithm for local alignments
9. Employ FASTA and BLAST algorithms to nucleic acid and protein
sequence alignments
You can refer to Chapter 3 of the textbook
Overview
This unit evaluates the methods for analysing the relatedness of genes and
proteins, with focus on the pairwise sequence alignment algorithms. We
adopt an evolutionary perspective in our description of how amino acids (or
nucleotides) in two sequences can be aligned and compared. We then
describe the algorithms and programs for global (Needleman-Wunsch) and
local (Smith-Waterman) pairwise alignment.
Chapter 3 Pairwise Sequence Alignment
3.1 Concepts and General Alignment Processes
One of the most basic questions about a gene or protein is whether it is related
to any other gene or protein. Relatedness of two proteins at the sequence level
suggests that they are homologous. Relatedness also suggests that they may
have common functions. By analysing many DNA and protein sequences, it
is possible to identify domains or motifs that are shared among a group of
molecules.
The analyses of the relatedness of proteins and genes are accomplished by
aligning sequences. Hence, pairwise sequence alignment is the most
fundamental operation of computational biology, and its main applications
are as follows:
• It is used to decide if two proteins (or genes) are related structurally
or functionally;
• It is used to identify domains or motifs that are shared between
proteins;
• It is the basis of BLAST searching (next unit); and
• It is used in the analysis of genomes.
Protein Alignment: Often more Informative than DNA
Alignment
Given the choice of aligning a DNA sequence or the sequence of the protein it
encodes, it is usually more informative to compare protein sequence. There
are several reasons:
• Protein is more informative (20 vs 4 characters); many amino acids
share related biophysical properties;
• Codons are degenerate: changes in the third position often do not alter
the amino acid that is specified;
• Protein sequences offer a longer “look-back” time; and
• DNA sequences can be translated into protein, and then used in
pairwise alignments.
DNA can be translated in six reading frames (three on each strand), giving
six potential proteins, for example:
5’ CAT CAA
5’ ATC AAC
5’ TCA ACT
5’ CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3’
5’ GTG GGT
5’ TGG GTA
5’ GGG TAG
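The six reading frames shown above can be reproduced with a short script. This is a minimal Python sketch under stated assumptions: it uses the standard genetic code, the helper names (`reverse_complement`, `translate`, `six_frames`) are hypothetical, and frames on the complementary strand are labelled -1 to -3 for convenience.

```python
# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Return the complementary strand, read 5' to 3'."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; '*' marks a stop codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Translate three frames on each strand (+1..+3 and -1..-3)."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


dna = "CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC"
for name, protein in six_frames(dna).items():
    print(name, protein)
```

The first two codons of each frame match the figure: CAT CAA, ATC AAC and TCA ACT on the given strand, and GTG GGT, TGG GTA and GGG TAG on the complementary strand.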
But sometimes, DNA alignments are more appropriate:
• to confirm the identity of a cDNA
• to study noncoding regions of DNA
• to study DNA polymorphisms
For example: Neanderthal vs modern human DNA:
Homology, Similarity and Identity
Conservation: Changes at a specific position of an amino acid (or, less
commonly, DNA) sequence that preserve the physico-chemical properties of
the original residue.
Identity is the extent to which two (nucleotide or amino acid) sequences are
invariant.
Similarity is the extent to which nucleotide or protein sequences are related. It
is based upon identity plus conservation.
Homology is the similarity attributed to descent from a common ancestor.
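The difference between identity and similarity can be made concrete with a small calculation. The following Python sketch is illustrative only: the conservative residue groups are a simplified, hypothetical grouping (real analyses use a substitution matrix), and the sequences are toy examples already aligned to equal length.

```python
# Hypothetical, simplified groups of residues with similar properties.
# Real tools score conservation with a substitution matrix instead.
CONSERVATIVE_GROUPS = [set("ILVM"), set("DE"), set("KR"),
                       set("ST"), set("NQ"), set("FYW")]


def is_conservative(x, y):
    """True if two different residues fall in the same toy group."""
    return any(x in group and y in group for group in CONSERVATIVE_GROUPS)


def identity_and_similarity(a, b):
    """Return (% identity, % similarity) over the columns of an alignment."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    identical = sum(x == y and x != "-" for x, y in zip(a, b))
    conserved = sum(x != y and is_conservative(x, y) for x, y in zip(a, b))
    n = len(a)
    return 100 * identical / n, 100 * (identical + conserved) / n


# Toy pair: 4 of 8 columns identical, 3 more are conservative changes.
print(identity_and_similarity("VLSPADKT", "ILSAAEKS"))  # (50.0, 87.5)
```

Similarity is never lower than identity, because it counts the identical positions plus the conservative substitutions.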
There are two types of homology, as illustrated below:
Figure 3.1 Illustration of the various types of sequence homologies in genes
Gene duplication events can be advantageous to an organism. Because one
functional copy of the gene remains, the other copy is free to mutate and
evolve new functions. In the early globin ancestor, several gene duplication events
allowed formation of two types of globin chains which diverged further
during speciation events.
Orthologs: Homologous sequences in different species that arose from a common
ancestral gene during speciation; they may or may not be responsible for a similar
function.
For example, the following tree shows RBP orthologs:
Figure 3.2 Example of a tree showing RBP orthologs
Paralogs: Homologous sequences within a single species that arose by gene
duplication.
For example, the following tree shows paralogous human lipocalins:
Figure 3.3 Tree showing paralogs in Lipocalins
Pairwise Alignment, Homology and Evolution of Life
Pairwise alignment is the process of lining up two sequences to achieve
maximal levels of identity (and conservation, in the case of amino acid
sequences) for the purpose of assessing the degree of similarity and the
possibility of homology.
For example, the following is the pairwise alignment of retinol-binding
protein and β-lactoglobulin:
Positions at which a letter is paired with a null are called gaps:
• Gap scores are typically negative.
• Since a single mutational event may cause the insertion or deletion of
more than one residue, the presence of a gap is ascribed more
significance than the length of the gap.
• In BLAST, it is rarely necessary to change gap values from the default.
Pairwise sequence alignment allows us to look back billions of years ago (BYA)
Figure 3.4 Overview of the history of life on Earth
See Chapter 13 of the text book for details. Gene/Protein sequences are
analysed in the context of evolution: Which organisms have orthologous
genes? When did these organisms evolve? How related are human and
bacterial globins?
Source: Recommended text book (page 56)
General approach to pairwise alignment:
• Choose two sequences
• Select an algorithm that generates a score
• Allow gaps (insertions, deletions)
• Score reflects degree of similarity
• Alignments can be global or local
• Estimate probability that the alignment occurred by chance
Calculation of an alignment score:
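As a rough illustration, the Python sketch below scores an existing alignment column by column. It is a sketch under stated assumptions, not the scheme of any particular program: the match, mismatch and gap values are hypothetical, and a separate gap-opening and gap-extension penalty is used so that the presence of a gap costs more than its length.

```python
def alignment_score(a, b, match=2, mismatch=-1, gap_open=-5, gap_extend=-1):
    """Score two aligned sequences of equal length ('-' marks a gap).

    All scoring values are hypothetical defaults chosen for illustration.
    The first position of a gap costs gap_open; each further position of
    the same gap costs gap_extend.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    previous_gap = None  # which sequence carried a gap in the last column
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            current_gap = "a" if x == "-" else "b"
            score += gap_extend if previous_gap == current_gap else gap_open
            previous_gap = current_gap
        else:
            score += match if x == y else mismatch
            previous_gap = None
    return score


# One gap of length two: 5 matches (+10), open (-5), extend (-1).
print(alignment_score("GATTACA", "GA--ACA"))  # 4

# Two separate gaps of length one are penalised more heavily.
print(alignment_score("GATTACA", "G-T-ACA"))  # 0
```

With protein sequences, the fixed match and mismatch values would normally be replaced by lookups in a substitution matrix such as PAM or BLOSUM.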
Protein sequence vs. structure:
As seen in the previous sections, studies of proteins enrich our knowledge of
evolution much more than DNA sequence analysis does. However, one of
the questions typically raised in studying proteins is whether the sequence or
the structure of a protein is more informative. We look at the role of protein
structure in the following audio and also understand why protein sequences
are studied. Many of the algorithms and programs used to study proteins in
units 3 to 5 relate to this issue.
Protein Structure
(Access video via iStudyGuide)
3.2 Dayhoff Model and Substitution Matrices
Margaret Dayhoff and colleagues catalogued thousands of proteins and
compared the sequences of closely related proteins in many families. They
considered the question of which specific amino acid substitutions are
observed to occur when two homologous protein sequences are aligned.
An Accepted Point Mutation (or PAM) is defined as a replacement of one amino
acid in a protein by another residue that has been accepted by natural
selection. An amino acid change that is accepted by natural selection occurs
when
• a gene undergoes a DNA mutation such that it encodes a different
amino acid, and
• the entire species adopts that change as the predominant form of the
protein.
Dayhoff’s 34 protein superfamilies:
Protein           PAMs per 100 million years
Ig kappa chain    37
Kappa casein      33
Lactalbumin       27
Hemoglobin a      12
Myoglobin         8.9
Insulin           4.4
Histone H4        0.10
Ubiquitin         0.00
Dayhoff’s numbers of “accepted point mutations”: What amino acid
substitutions occur in proteins?
      A     R     N     D     C     Q     E     G
      Ala   Arg   Asn   Asp   Cys   Gln   Glu   Gly
A
R     30
N     109   17
D     154   0     532
C     33    10    0     0
Q     93    120   50    76    0
E     266   0     94    831   0     422
G     579   10    156   162   10    30    112
H     21    103   226   43    10    243   23    10
Figure 3.5 The figure depicts a subset of the modified Dayhoff matrix
The values are the numbers of accepted point mutations, multiplied by 10,
observed in 1572 cases of amino acid substitutions from closely related
protein sequences. Amino acids are presented alphabetically according to
the three-letter code. Some substitutions, such as V and I or S and T, are
common. Substitutions involving C or W are rare.
Multiple sequence alignment of glyceraldehyde 3-phosphate dehydrogenases
The relative mutability of amino acids:
Asn   134      His   66
Ser   120      Arg   65
Asp   106      Lys   56
Glu   102      Pro   56
Ala   100      Gly   49
Thr   97       Tyr   41
Ile   96       Phe   41
Met   94       Leu   40
Gln   93       Cys   20
Val   74       Trp   18
Normalised frequencies of amino acids:
Gly   8.9%     Arg   4.1%
Ala   8.7%     Asn   4.0%
Leu   8.5%     Phe   4.0%
Lys   8.1%     Gln   3.8%
Ser   7.0%     Ile   3.7%
Val   6.5%     His   3.4%
Thr   5.8%     Cys   3.3%
Pro   5.1%     Tyr   3.0%
Glu   5.0%     Met   1.5%
Asp   4.7%     Trp   1.0%
blue=6 codons; red=1 codon
Dayhoff’s PAM1 mutation probability matrix (original amino acid):
Figure 3.6 The PAM1 mutation probability matrix
The original amino acid j is arranged in columns (across the top), while the
replacement amino acid i is arranged in rows.
Dayhoff’s PAM1 mutation probability matrix (Each element of the matrix
shows the probability that an original amino acid (top) will be replaced by
another amino acid (side)).
Substitution Matrix
A substitution matrix contains values proportional to the probability that
amino acid i mutates into amino acid j for all pairs of amino acids. Substitution
matrices are constructed by assembling a large and diverse sample of verified
pairwise alignments (or multiple sequence alignments) of amino acids.
Substitution matrices should reflect the true probabilities of mutations
occurring through a period of evolution.
The two major types of substitution matrices are PAM and BLOSUM. PAM
matrices are based on global alignments of closely related proteins. The PAM1
is the matrix calculated from comparisons of sequences with no more than 1%
divergence. Other PAM matrices are extrapolated from PAM1. All the PAM
data come from closely related proteins (>85% amino acid identity).
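The extrapolation of higher PAM matrices from PAM1 amounts to multiplying the PAM1 mutation probability matrix by itself. The Python sketch below shows the idea with a toy two-letter alphabet; the probabilities are hypothetical and are not Dayhoff's values, and the helper names `mat_mul` and `mat_pow` are illustrative. As in the text, columns hold the original residue and rows the replacement.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, n):
    """Multiply m by itself n times; n = 0 gives the identity matrix."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)]
              for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Hypothetical "PAM1" for a two-letter alphabet (each column sums to 1):
# column = original residue, row = replacement residue.
TOY_PAM1 = [[0.99, 0.02],
            [0.01, 0.98]]

pam0 = mat_pow(TOY_PAM1, 0)        # identity: nothing has changed
pam250 = mat_pow(TOY_PAM1, 250)    # a distant relationship
pam2000 = mat_pow(TOY_PAM1, 2000)  # columns become nearly identical

for name, m in (("PAM0", pam0), ("PAM250", pam250), ("PAM2000", pam2000)):
    print(name, [[round(v, 3) for v in row] for row in m])
```

At very large PAM values every column approaches the same limiting frequencies, which is why the original residue no longer carries any information about the replacement.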
PAM Matrices
Dayhoff’s PAM0 mutation probability matrix (the rules for extremely slowly
evolving proteins; Top: original amino acid, Side: replacement amino acid):
Figure 3.7 A PAM 2000 matrix has similar values that tend to converge on the same limits
In a PAM 2000 matrix, the proteins being compared are at an extreme of unrelatedness. In contrast, at PAM0, no mutations are tolerated, and the
residues of the proteins are perfectly conserved.
Dayhoff’s PAM 2000 mutation probability matrix: (the rules for very distantly
related proteins; Top: original amino acid, Side: replacement amino acid):
Figure 3.8 Portion of the matrix for an infinite PAM value. This results from multiplying a
PAM1 matrix by itself an infinite number of times
PAM250 mutation probability matrix (Top: original amino acid, Side:
replacement amino acid):
Figure 3.9 The PAM250 mutation probability matrix
At this evolutionary distance, only one in five amino acid residues remains
unchanged from an original amino acid sequence (columns) to a replacement
amino acid (rows).
PAM250 log odds scoring matrix:
Figure 3.10 Log-odds matrix for PAM250
High PAM values (e.g., PAM 250) are useful for aligning very divergent
sequences. A variety of algorithms for pairwise alignment, multiple sequence
alignment, and database searching (e.g., BLAST) allow you to select an
assortment of PAM matrices such as PAM250, PAM70, and PAM30.
Why do we go from a mutation probability matrix to a log odds matrix? We
want a scoring matrix so that when we do a pairwise alignment (or a BLAST
search) we know what score to assign to two aligned amino acid residues.
Logarithms are easier to use for a scoring system. They allow us to sum the
scores of aligned residues (rather than having to multiply them).
How do we go from a mutation probability matrix to a log odds matrix? The
cells in a log odds matrix consist of an “odds ratio”:
the probability that an alignment is authentic /
the probability that the alignment was random
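The score S for an alignment of residues a, b is given by:
\[ S(a, b) = 10 \log_{10}\left( M_{ab} / p_b \right) \]
where M_ab is the entry of the mutation probability matrix for a and b, and p_b is the
frequency with which residue b occurs by chance.
As an example, for tryptophan,
\[ S(a, \text{tryptophan}) = 10 \log_{10}(0.55 / 0.010) = 17.4 \]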
What do the numbers mean in a log odds matrix? A score of +17 for
tryptophan means that this alignment is 50 times more likely than a chance
alignment of two Trp residues.
S(a, b) = 17
Let the odds ratio (Mab/pb) = x. Then
\[ 17 = 10 \log_{10} x \;\Rightarrow\; 1.7 = \log_{10} x \;\Rightarrow\; x = 10^{1.7} \approx 50 \]
A score of +2 indicates that the amino acid replacement occurs 1.6 times as
frequently as expected by chance.
A score of 0 is neutral.
A score of –10 indicates that the correspondence of two amino acids in an
alignment that accurately represents homology (evolutionary descent) is one
tenth as frequent as the chance alignment of these amino acids.
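The conversion between a mutation probability and a log-odds score can be checked with a few lines of Python. This is only a sketch: the function names `log_odds_score` and `odds_ratio` are made up for this illustration, and the inputs are the tryptophan values quoted above.

```python
import math

def log_odds_score(m_ab: float, p_b: float) -> float:
    """S(a, b) = 10 * log10(M_ab / p_b)."""
    return 10 * math.log10(m_ab / p_b)

def odds_ratio(score: float) -> float:
    """Invert the score: M_ab / p_b = 10 ** (S / 10)."""
    return 10 ** (score / 10)

# Tryptophan example from the text: M_ab = 0.55, p_b = 0.010
print(round(log_odds_score(0.55, 0.010), 1))  # 17.4
print(round(odds_ratio(17)))                  # 50  (about 50 times more likely than chance)
print(round(odds_ratio(2), 1))                # 1.6
print(round(odds_ratio(-10), 3))              # 0.1 (one tenth as frequent as chance)
```

Because the scores are logarithms, the score of a whole alignment is the sum of the per-position scores, which corresponds to multiplying the underlying odds ratios.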
Comparing two proteins with a PAM1 matrix gives completely different
results than PAM250! Consider two distantly related proteins. A PAM40
matrix is not forgiving of mismatches, and penalises them severely. Using this
matrix you can find almost no match:
hsrbp,  136 CRLLNLDGTC
btlact,   3 CLLLALALTC
            * ** *  **
A PAM250 matrix is very tolerant of mismatches:
BLOSUM Matrices
BLOSUM (BLOcks sUbstitution Matrix) matrices are based on local
alignments. All BLOSUM matrices are based on observed alignments; they
are not extrapolated from comparisons of closely related proteins. The
BLOCKS database contains thousands of groups of multiple sequence
alignments.
BLOSUM62 is a matrix calculated from comparisons of sequences that share no more
than 62% identity (more similar sequences are clustered and counted as one).
BLOSUM62 is the default matrix in BLAST 2.0. Though
it is tailored for comparisons of moderately distant proteins, it performs well
in detecting closer relationships. A search for distant relatives may be more
sensitive with a different matrix.
A  4
R -1  5
N -2  0  6
D -2 -2  1  6
C  0 -3 -3 -3  9
Q -1  1  0  0 -3  5
E -1  0  0  2 -4  2  5
G  0 -2  0 -1 -3 -2 -2  6
H -2  0  1 -1 -3  0  0 -2  8
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K -1  2  0 -1 -1  1  1 -2 -1 -3 -2  5
M -1 -2 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
Figure 3.11 BLOSUM 62 scoring matrix
This matrix merges all proteins in an alignment that have 62% amino acid
identity or greater into one sequence. BLOSUM62 performs better than
alternative BLOSUM matrices or a variety of PAM matrices at detecting
distant relationships between proteins. It is thus the default scoring matrix
for most database search programs such as BLAST.
3.3 Global and Local Alignment Algorithms
There are two kinds of sequence alignment: global and local. We will first
consider the global alignment algorithm of Needleman and Wunsch (1970).
We will then explore the local alignment algorithm of Smith and Waterman
(1981). Finally, we will consider BLAST, a heuristic version of Smith-Waterman. We will cover BLAST in detail.
Global Alignment with the Algorithm of Needleman and
Wunsch
Two sequences can be compared in a matrix along x- and y-axes. If they are
identical, a path along a diagonal can be drawn.
Find the optimal sub-paths, and add them up to achieve the best score. This
involves
• adding gaps when needed
• allowing for conservative substitutions
• choosing a scoring system (simple or complicated)
Three steps to global alignment with the Needleman-Wunsch algorithm:
1. set up a matrix
2. score the matrix
3. identify the optimal alignment(s)
Four possible outcomes in aligning two sequences:
• identity (stay along a diagonal)
• mismatch (stay along a diagonal)
• gap in one sequence (move vertically!)
• gap in the other sequence (move horizontally!)
Figure 3.12a&b Outcomes of aligning two sequences
Figure 3.12b Pairwise alignment of two amino acid sequences using a dynamic programming
algorithm of Needleman and Wunsch (1970) for global alignment
Two identical sequences follow a diagonal path through the matrix (top left
panel); a mismatch still results in a diagonal path (top right panel); a
deletion in sequence 2 (or insertion in sequence 1) results in the insertion of a gap
position and a vertical path in the optimal alignment (bottom left panel); a
gap in the first sequence is represented by a horizontal path through the
matrix (bottom right panel).
The Needleman-Wunsch algorithm is guaranteed to find optimal
alignment(s), although the algorithm does not search all possible alignments.
It is an example of a dynamic programming algorithm: an optimal path
(alignment) is identified by incrementally extending optimal sub-paths. Thus,
a series of decisions is made at each step of the alignment to find the pair of
residues with the best score.
Scoring matrices for computing alignment scores are often based on observed
substitution rates, derived from the substitution frequencies seen in multiple
alignments of sequences. The score of an alignment is the sum of the scores
for pairs of aligned characters plus the scores for gaps, e.g.,
substitution matrix s(x,y) = +5 if x = y and -3 if x ≠ y, and
linear gap penalty g = -4
A - C - G G A C T
|   |   |     | |
A T C G G A T C T
Score = s(A,A) + g + s(C,C) + g + s(G,G) + s(G,A) + s(A,T) + s(C,C) + s(T,T)
      = 5 - 4 + 5 - 4 + 5 - 3 - 3 + 5 + 5 = +11
A - C G G - A C T
|   | | |     | |
A T C G G A T C T
Score = s(A,A) + g + s(C,C) + s(G,G) + s(G,G) + g + s(A,T) + s(C,C) + s(T,T)
      = 5 - 4 + 5 + 5 + 5 - 4 - 3 + 5 + 5 = +19
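The two sums can be checked mechanically. The sketch below scores an already-aligned pair of rows with a match score of +5, a mismatch score of -3 and a linear gap penalty of -4; the function name `score_alignment` is an assumption made for this example.

```python
def score_alignment(row1: str, row2: str, match: int = 5,
                    mismatch: int = -3, gap: int = -4) -> int:
    """Sum the column scores of two aligned rows ('-' marks a gap)."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(row1, row2):
        if x == "-" or y == "-":
            total += gap        # gap column
        elif x == y:
            total += match      # identity
        else:
            total += mismatch   # substitution
    return total

print(score_alignment("A-C-GGACT", "ATCGGATCT"))  # 11
print(score_alignment("A-CGG-ACT", "ATCGGATCT"))  # 19
```

The second alignment scores higher because it turns one mismatch column and one gap column into two identities while keeping the same number of gaps.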
Dynamic programming algorithms can be specified by recurrence relations.
The recurrence relation for global alignment with a linear gap penalty fills
the DP matrix for all 1 ≤ i ≤ m and 1 ≤ j ≤ n:
\[ M(0,0) = 0, \qquad M(i,0) = i \cdot g, \qquad M(0,j) = j \cdot g \]
\[ M(i,j) = \max \begin{cases} M(i-1,\,j-1) + s(S_1[i],\, S_2[j]) \\ M(i-1,\,j) + g \\ M(i,\,j-1) + g \end{cases} \]
The DP guarantees an optimal alignment between two sequences. We have to
provide a scoring system for the comparison of symbol pairs (nucleotides for
DNA sequences and amino acids for protein sequences), and a scheme for
insertion / deletion (GAP) penalties, but once those parameters have been set,
the resulting alignment should always be the same.
To find the alignment itself, we must find the path of choices that led to this score.
The procedure for doing this is known as traceback.
• Start from the bottom-right corner and trace back to the top-left.
• Each arrow introduces one character at the end of each aligned sequence.
• A horizontal move puts a gap in the left sequence.
• A vertical move puts a gap in the top sequence.
• A diagonal move uses one character from each sequence.
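The matrix fill and the traceback can be sketched together in Python. This is a minimal illustration, not a production implementation: it assumes a simple +5/-3 match/mismatch score and a linear gap penalty of -4, places S1 along the side (rows) and S2 along the top (columns), and returns only one of the possibly several optimal alignments. The function name `needleman_wunsch` is chosen for this example.

```python
def needleman_wunsch(s1: str, s2: str, match: int = 5,
                     mismatch: int = -3, gap: int = -4):
    """Global alignment; returns (score, aligned s1, aligned s2)."""
    m, n = len(s1), len(s2)
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # first column: M(i,0) = i * g
        M[i][0] = i * gap
    for j in range(1, n + 1):          # first row: M(0,j) = j * g
        M[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if s1[i - 1] == s2[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + s,   # diagonal
                          M[i - 1][j] + gap,     # vertical
                          M[i][j - 1] + gap)     # horizontal

    # Traceback from the bottom-right corner to the top-left corner.
    a1, a2 = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if s1[i - 1] == s2[j - 1] else mismatch
            if M[i][j] == M[i - 1][j - 1] + s:   # diagonal: one character from each
                a1.append(s1[i - 1]); a2.append(s2[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and M[i][j] == M[i - 1][j] + gap:  # vertical: gap in the top sequence
            a1.append(s1[i - 1]); a2.append("-")
            i -= 1
        else:                                        # horizontal: gap in the left sequence
            a1.append("-"); a2.append(s2[j - 1])
            j -= 1
    return M[m][n], "".join(reversed(a1)), "".join(reversed(a2))

score, top, bottom = needleman_wunsch("ACGGACT", "ATCGGATCT")
print(score)   # 27
print(top)     # A-CGGA-CT
print(bottom)  # ATCGGATCT
```

For these two sequences the optimal global alignment pairs all seven residues of the shorter sequence with identical residues and pays for two gaps, giving 7 × 5 - 2 × 4 = 27.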
Local Alignment with the Algorithm of Smith and Waterman
Global alignment (Needleman-Wunsch) extends from one end of each
sequence to the other. Local alignment finds optimally matching regions
within two sequences (“subsequences”). Local alignment is almost always
used for database searches such as BLAST. It is useful to find domains (or
limited regions of homology) within sequences. Smith and Waterman (1981)
solved the problem of performing optimal local sequence alignment. Other
methods (BLAST, FASTA) are faster but less thorough.
In the local alignment, the alignment tends to stop at the ends of regions of
identity or strong similarity. A much higher priority is given to finding these
local regions than to extending the alignment to include more neighbouring
amino acid pairs. This type of alignment favours finding conserved amino
acid motifs in related protein sequences.
Given:
• A pair of sequences S1 and S2,
• A method for scoring the similarity of a pair of characters,
• A penalty function for gaps.
Task:
• Find subsequences of S1 and S2 whose similarity score is maximum
over all pairs of subsequences of S1 and S2.
Slight modification of the basic DP algorithm:
• Given a sequence S1 of length m, a sequence S2 of length n, s(x,y) and g,
• Construct an (m+1) × (n+1) matrix M.
• M(i,j) = score of the best alignment of a suffix of S1[1..i] and a suffix
of S2[1..j].
• Initialise the first row and first column of the matrix with 0:
\[ M(0,0) = 0; \qquad M(i,0) = 0; \qquad M(0,j) = 0 \]
• Fill in the rest of the matrix top to bottom, left to right, and store the
corresponding pointers to parent cells:
\[ M(i,j) = \max \begin{cases} 0 \\ M(i-1,\,j-1) + s(S_1[i],\, S_2[j]) \\ M(i-1,\,j) + g \\ M(i,\,j-1) + g \end{cases} \]
Traceback:
• Find the maximum value of M(i,j); it can be anywhere in the matrix.
• Follow the traceback pointers from the maximum cell until you hit a cell with
value 0.
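The local variant differs from the global algorithm only in the zero initialisation, the extra 0 in the maximum, and the traceback start and stop rules. The sketch below assumes the same simple +5/-3 scoring and a gap penalty of -4; instead of storing pointers it re-derives each move during traceback, which is an implementation shortcut chosen for brevity. The function name `smith_waterman` and the example sequences are made up for illustration.

```python
def smith_waterman(s1: str, s2: str, match: int = 5,
                   mismatch: int = -3, gap: int = -4):
    """Local alignment; returns (best score, aligned part of s1, aligned part of s2)."""
    m, n = len(s1), len(s2)
    M = [[0] * (n + 1) for _ in range(m + 1)]   # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if s1[i - 1] == s2[j - 1] else mismatch
            M[i][j] = max(0,
                          M[i - 1][j - 1] + s,
                          M[i - 1][j] + gap,
                          M[i][j - 1] + gap)
            if M[i][j] > best:                  # remember the maximum cell
                best, best_pos = M[i][j], (i, j)

    # Traceback from the maximum cell until a cell with value 0 is reached.
    a1, a2 = [], []
    i, j = best_pos
    while i > 0 and j > 0 and M[i][j] > 0:
        s = match if s1[i - 1] == s2[j - 1] else mismatch
        if M[i][j] == M[i - 1][j - 1] + s:
            a1.append(s1[i - 1]); a2.append(s2[j - 1])
            i -= 1; j -= 1
        elif M[i][j] == M[i - 1][j] + gap:
            a1.append(s1[i - 1]); a2.append("-")
            i -= 1
        else:
            a1.append("-"); a2.append(s2[j - 1])
            j -= 1
    return best, "".join(reversed(a1)), "".join(reversed(a2))

print(smith_waterman("TTTACGTGGG", "CCACGTCC"))  # (20, 'ACGT', 'ACGT')
```

Only the shared region ACGT is reported; the unrelated flanking residues, which a global alignment would be forced to include, are ignored.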
FASTA and BLAST: Rapid, Heuristic Versions of the Smith-Waterman Algorithm
Sequence databases are large and growing fast:
Swiss-Prot: 5 × 10⁷ amino acids (19-July-2003)
TrEMBL: 3 × 10⁸ amino acids (18-July-2003)
GenBank: 3 × 10¹⁰ base pairs (07-FEB-2003)
Even fast workstations are too slow for complete dynamic programming
alignment. Assume 10⁷ matrix cells/sec (that’s pretty fast). For an amino acid
sequence of length 400, this leads to a runtime of:
(400 × 5 × 10⁷ / 10⁷) sec = 2,000 sec = 33 min for Swiss-Prot
(400 × 3 × 10⁸ / 10⁷) sec = 12,000 sec = 3.3 hours for TrEMBL
For a DNA sequence of length 1000, this leads to a runtime of:
(1000 × 3 × 10¹⁰ / 10⁷) sec = 3,000,000 sec = 35 days for GenBank
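These estimates follow from counting matrix cells: a full dynamic programming search fills (query length) × (database size) cells. A short sketch of the arithmetic, using the assumed rate of 10⁷ cells per second and the 2003 database sizes quoted above (the function name is hypothetical):

```python
CELLS_PER_SECOND = 1e7  # assumed throughput from the text

def dp_runtime_seconds(query_length: int, database_residues: float,
                       cells_per_second: float = CELLS_PER_SECOND) -> float:
    """Time to fill a DP matrix of query_length x database_residues cells."""
    return query_length * database_residues / cells_per_second

swissprot = dp_runtime_seconds(400, 5e7)
trembl = dp_runtime_seconds(400, 3e8)
genbank = dp_runtime_seconds(1000, 3e10)

print(f"Swiss-Prot: {swissprot:,.0f} s = {swissprot / 60:.0f} min")
print(f"TrEMBL:     {trembl:,.0f} s = {trembl / 3600:.1f} hours")
print(f"GenBank:    {genbank:,.0f} s = {genbank / 86400:.0f} days")
```

The runtime grows linearly with the database size, which is why heuristics that skip most of the matrix matter more as databases grow.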
The Smith-Waterman algorithm is very rigorous and is guaranteed to find
an optimal alignment, but it is slow. It requires computer space and time
proportional to the product of the lengths of the two sequences being aligned
(or the product of the query length and the size of an entire database). Gotoh (1982) and Myers and
Miller (1988) improved the algorithms so that both global and local alignment
require less time and space. FASTA and BLAST provide rapid alternatives to
the Smith-Waterman algorithm.
How does FASTA work?
1. A “lookup table” is created. It consists of short stretches of amino
acids (e.g., k = 2 for a protein search).
A stretch of length k is called a k-tuple. The FASTA algorithm finds
the ten highest-scoring segments that align to the query.
2. These ten aligned regions are re-scored with a PAM or BLOSUM
matrix.
3. High-scoring segments are joined.
4. The Needleman-Wunsch or Smith-Waterman algorithm is then
performed.
How does BLAST work? Pairwise alignment with BLAST 2 SEQUENCES:
Figure 3.13 a&b Screen Capture of NCBI web site showing BLAST 2 Sequences algorithm and
results
Go to http://www.ncbi.nlm.nih.gov/BLAST. Choose BLAST 2 SEQUENCES. In
the program:
1. choose blastp or blastn
2. paste in your accession numbers (or use FASTA format)
3. select optional parameters:
• 3 BLOSUM and 3 PAM matrices
• gap creation and extension penalties
• filtering
• word size
Figure 3.13 b Results from a BLAST 2 sequence query
Regions aligning with each other are shown as boxes along the diagonal. The
query sequence is depicted at the top. Residues identical in the query and subject
are shown with their amino acid letters; allowed substitutions are shown
with a + sign.
Tests for Significance of Pairwise Alignments
Figure 3.14 Tabulation of sensitivity and specificity for sequences
Sensitivity and specificity are dictated by the ability to identify true positive
data while minimising false positives. All models/ algorithms need to be
tested to ensure a balance between getting results while reducing artefacts.
Source: with permission from Sonego P. et al., Brief Bioinform (2008) 9(3): 198-209.
Randomisation test: scramble a sequence:
First compare two proteins and obtain a score:
Next scramble the bottom sequence 100 times, and obtain 100 “randomised”
scores (+/- S.D.)
Composition and length are maintained
If the comparison is “real” we expect the authentic score to be several
standard deviations above the mean of the “randomised” scores
For example, a randomisation test shows that RBP is significantly related to
b-lactoglobulin. (But this test assumes a normal distribution of scores!)
You can perform this randomisation test in GCG using the gap or bestfit
pairwise alignment programs.
Type > gap -ran=100
or > bestfit -ran=100
\[ Z = \frac{S_{\text{real}} - \bar{X}_{\text{randomised}}}{\text{standard deviation of the randomised scores}} \]
The PRSS program performs a scramble test for you at
http://fasta.bioch.virginia.edu/fasta/prss.htm (But these scores are not
normally distributed!)
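The scramble test itself is easy to sketch. The code below is an illustration under stated assumptions, not the GCG or PRSS implementation: it scores alignments with a simple +5/-3/-4 local alignment score instead of a PAM or BLOSUM matrix, and the names `local_score` and `randomisation_z` are invented for this example.

```python
import random
import statistics

def local_score(s1: str, s2: str, match: int = 5,
                mismatch: int = -3, gap: int = -4) -> int:
    """Best Smith-Waterman score (score only, two rows of memory)."""
    prev = [0] * (len(s2) + 1)
    best = 0
    for x in s1:
        cur = [0]
        for j, y in enumerate(s2, start=1):
            s = match if x == y else mismatch
            cur.append(max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        best = max(best, max(cur))
        prev = cur
    return best

def randomisation_z(s1: str, s2: str, n_shuffles: int = 100, seed: int = 0) -> float:
    """Z = (real score - mean of scrambled scores) / SD of scrambled scores."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    real = local_score(s1, s2)
    letters = list(s2)                 # shuffling keeps composition and length
    scrambled = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        scrambled.append(local_score(s1, "".join(letters)))
    mean = statistics.mean(scrambled)
    sd = statistics.stdev(scrambled)
    if sd == 0:                        # all scrambled scores identical
        return float("inf") if real > mean else 0.0
    return float((real - mean) / sd)

# The two short fragments shown earlier in this unit
print(randomisation_z("CRLLNLDGTC", "CLLLALALTC"))
```

With sequences this short the Z value is not meaningful; the point is only to show the mechanics of scrambling one sequence while preserving its composition and length.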
Figure 3.15 Series of alignments for scoring sequences from the database
Note that these alignments are produced post hoc and do not actually
represent the search process. BLAST and FASTA share a common strategy:
• fast screening to eliminate unrelated sequences
• complete alignment of top scoring sequences
BLAST and FASTA differ in:
• statistical model
• heuristics and tuning
1. Which of the following amino acids is least mutable according to the PAM
scoring matrix?
(a) Alanine (b) Glutamine (c) Methionine (d) Cysteine
2. You have two distantly related proteins. Which BLOSUM or PAM matrix is
best to use to compare them?
(a) BLOSUM45 or PAM250
(b) BLOSUM45 or PAM1
(c) BLOSUM80 or PAM250
(d) BLOSUM80 or PAM1
3. True or False: Two proteins that share 30% amino acid identity are 30%
homologous.
Answers: (d) (a) (False)
Carry out the following exercise and post your answers on Learning
Management System. Do a Google search for the CDD website and search for
specific set of motifs, and conserved sequences in Human tbx-18 protein.
You should now read the following:
Gene networks: http://en.wikipedia.org/wiki/Gene_regulatory_network
Epigenetics: http://en.wikipedia.org/wiki/Epigenetics
Find Out More
Ion-torrent: http://www.youtube.com/watch?v=WYBzbxIfuKs
Illumina SBS: http://technology.illumina.com/technology/next-generationsequencing/sequencing-technology.html
Is the DNA or protein sequence/structure more useful in understanding
evolution? Why?
Answers will be discussed in class.
References
Gotoh, O. An improved algorithm for matching biological sequences. J. Mol.
Biol. 162, 705-708 (1982).
Myers, E.W., and Miller, W. Optimal alignments in linear space. Comput. Appl.
Biosci. 4, 11-17 (1988).
Needleman, S.B., and Wunsch, C.D. A general method applicable to the search
for similarities in the amino acid sequence of two proteins. J. Mol. Biol.
48, 443-453 (1970).
Smith, T.F., Waterman, M.S., and Fitch, W.M., Comparative bio-sequence
metrics. J. Mol. Evol. 18, 38-46 (1981).
Summary
The following key points are discussed in this unit:
• Concept of Pairwise Alignment
• Protein vs. DNA sequence alignments
• Homology, Similarity and Identity
• Protein Sequence vs. Structure information in studying evolution
• Dayhoff model and the concept of Accepted Point Mutations (PAM)
• Global vs. Local Alignment tools
• Comparison of BLAST and FASTA
STUDY UNIT 4
DATABASE SEARCHING WITH
BLAST AND FASTA
BME355 STUDY UNIT 4
Learning Outcomes
Upon completion of this unit, you will be able to:
1. Discuss the advantages and limitations of the BLAST algorithm
2. Differentiate between Normal Gaussian Distribution and EVD
3. Apply the concepts of basic BLAST to searches for genes/protein
sequences and evaluate significance of the search results
4. Differentiate between BLAST and FASTA algorithms for pairwise
alignments
You can refer to Chapter 4 of the textbook
Overview
This unit introduces the Basic Local Alignment Search Tool (BLAST), which is
the main NCBI tool for comparing a query sequence to other sequences in
various databases, as well as FASTA. BLAST and FASTA are rapid, heuristic
versions of pairwise alignment algorithms. The steps of the BLAST and
FASTA search processes are described. Strategies applied for BLAST database
searching are discussed with examples.
Chapter 4 Database Searching with BLAST and
FASTA
4.1 Basic Concepts, BLAST Searches and Interpretation
of Results
Selectivity: Describes the ability of a search tool to discard false positives, i.e.,
higher selectivity means that the method identifies fewer matches between
unrelated sequences.
Sensitivity: Describes the ability of a search tool to avoid false negatives, i.e.,
higher sensitivity means that the method identifies more matches between
distantly related sequences.
Significance: A significant result is one that has not simply occurred by
chance. Significance levels show how likely it is that a result is due to chance,
expressed as a probability. In sequence analysis, the significance of an
alignment score may be calculated as the chance that such a score would be
found between random sequences.
Most database search methods involve a trade-off: Sensitivity vs Selectivity,
for example:
Suppose a database contains 1,000 globin sequences.
Suppose a search of this database for globins reported 900 results, 700 of
which were really globin sequences and 200 of which were not.
This result would be said to have 300 false negatives (misses) and 200 false
positives.
Lowering a tolerance threshold will most likely decrease the number of
false negatives but increase the number of false positives, i.e., higher
sensitivity, but lower selectivity.
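The globin example can be turned into numbers. The sketch below computes sensitivity as the fraction of true globins that were found. Because the example gives no count of true negatives, it uses precision (the fraction of reported hits that are real) as a stand-in for selectivity; that substitution is an assumption of this illustration, not a definition from the text.

```python
true_globins_in_db = 1000   # globin sequences actually present
reported_hits = 900         # results returned by the search
true_positives = 700        # reported hits that really are globins

false_positives = reported_hits - true_positives        # reported but not globins
false_negatives = true_globins_in_db - true_positives   # globins that were missed

sensitivity = true_positives / (true_positives + false_negatives)
precision = true_positives / (true_positives + false_positives)

print(false_positives, false_negatives)       # 200 300
print(f"sensitivity = {sensitivity:.2f}")     # sensitivity = 0.70
print(f"precision   = {precision:.2f}")       # precision   = 0.78
```

A looser threshold would typically move some of the 300 misses into the hit list (raising sensitivity) while also admitting more unrelated sequences (lowering precision).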
Another trade-off in database searching: Sensitivity vs Speed
BLAST (Basic Local Alignment Search Tool) allows rapid sequence
comparison of a query sequence against a database. The BLAST algorithm is
fast, accurate, and web-accessible.
Why do we need BLAST searching? BLAST searching is fundamental to
understanding the relatedness of any favourite query sequence to other
known proteins or DNA sequences.
Applications of BLAST include identifying orthologs and paralogs,
discovering new genes or proteins, discovering variants of genes or proteins,
investigating expressed sequence tags (ESTs), and exploring protein structure
and function.
BLAST Search Steps
Four steps to a BLAST search:
• Choose the sequence (query)
• Select the BLAST program
• Choose the database to search
• Choose optional parameters
Specifying Sequence of Interest
A sequence can be input in FASTA format or as an accession number.
Figure 4.1 Screen capture of NCBI home page showing RBP protein in FASTA format
Generating the FASTA format is a pre-requisite to search data in multiple
databases. This can be carried out for a single entity or multiple sequences to
query in a batch mode.
Selecting BLAST Program
DNA can be translated into six potential proteins, for example,
Figure 4.2 Schema of the six potential protein start sites from any ds DNA
While carrying out search for proteins encoded from a DNA sequence, all six
potential start sites of translation (along both strands) need to be
interrogated and the proteins generated from all of these frames need to be
aligned against a query sequence.
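The six reading frames can be generated with a short script. This is a sketch that assumes the standard genetic code and an input made up only of A, C, G and T; the compact codon table is built from the conventional T, C, A, G ordering, and the function names and example sequence are chosen for illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def reverse_complement(dna: str) -> str:
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna: str) -> str:
    """Translate complete codons only; '*' marks a stop codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def six_frames(dna: str) -> dict:
    """Translate three frames on each strand (+1..+3 and -1..-3)."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

for name, protein in six_frames("ATGGCCATTGTAATGGGCCGCTGA").items():
    print(name, protein)
```

Frame +1 of this example reads MAIVMGR*; the other five frames give unrelated peptides, which is why a translated search has to examine all six.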
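blastn (nucleotide BLAST): nucleotide query against a nucleotide database
blastp (protein BLAST): protein query against a protein database
tblastn (translated BLAST): protein query against a translated nucleotide database
blastx (translated BLAST): translated nucleotide query against a protein database
tblastx (translated BLAST): translated nucleotide query against a translated nucleotide database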
Selecting a Database
nr = non-redundant (most general database)
dbest = database of expressed sequence tags
dbsts = database of sequence tag sites
gss = genomic survey sequences
htgs = high throughput genomic sequence
Selecting Optional Search and Formatting Parameters
You can choose the organism to search, turn filtering on/off, change the
substitution matrix, change the expect (e) value, change the word size, change
the output format, and so on.
For example, choosing filter, selecting alignment view, descriptions and
alignments:
Figure 4.3 Output from a BLAST search using Retinol binding protein
The Local Alignment Strategy for BLAST Search
The Smith-Waterman algorithm is guaranteed to find optimal alignments, but
it is computationally expensive (requires O(n²) time). BLAST and FASTA are
heuristic approximations to local alignment. Each requires only O(n²/k) time;
they examine only part of the search space.
How does BLAST work? The central idea of the BLAST algorithm is to confine
attention to segment pairs that contain a word pair of length w with a score
of at least T. The original BLAST algorithm works in 3 phases:
Phase 1: compile a list of word pairs (w=3) above threshold T
Example: For a human RBP query…FSGTWYA… (the query word is in bold)
A list of words (w=3) is:
FSG SGT GTW TWY WYA
YSG TGT ATW SWY WFA
FTG SVT GSW TWF WYS
Figure 4.3a 3 steps used by the BLAST algorithm to align sequences
The BLAST algorithm parses the query sequence into three-letter words. These are
compared to the words in the database, and the scores for the various hits are
recorded. If multiple (two or more) hits with scores higher than the threshold are
obtained, the alignment is continued in both directions until the next set of
identical hits is encountered.
Phase 2: Scan the database for entries that match the compiled list. This is fast
and relatively easy.
Phase 3: When you find a hit (i.e., a match between a “word” and
a database entry), extend the hit in either direction.
• Keep track of the score (use a scoring matrix)
• Stop when the score drops below some cutoff.
For example,
Figure 4.3b 3 steps used by the BLAST algorithm to align sequences
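The seed-and-extend idea can be sketched in Python. To stay short, this illustration simplifies real BLAST in two labelled ways: it seeds only on exact word matches (no neighbourhood words scored against a threshold T), and it scores the ungapped extension with +5 for identity and -4 for mismatch instead of a substitution matrix. The function names, the `x_drop` cutoff value and the example sequences are all assumptions made for this sketch.

```python
def find_seeds(query: str, subject: str, w: int = 3):
    """Phases 1 and 2 (simplified): index query words, then scan the subject."""
    index = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    return [(i, j)
            for j in range(len(subject) - w + 1)
            for i in index.get(subject[j:j + w], [])]

def extend_hit(query: str, subject: str, i: int, j: int, w: int = 3,
               match: int = 5, mismatch: int = -4, x_drop: int = 8):
    """Phase 3: ungapped extension in both directions.

    Returns (score, query start, subject start, length) of the segment pair.
    """
    def pair(a, b):
        return match if a == b else mismatch

    def one_direction(qi, sj, step):
        gain = best_gain = 0
        length = best_len = 0
        while 0 <= qi < len(query) and 0 <= sj < len(subject):
            gain += pair(query[qi], subject[sj])
            length += 1
            if gain > best_gain:
                best_gain, best_len = gain, length
            elif best_gain - gain > x_drop:   # score dropped too far: stop
                break
            qi += step
            sj += step
        return best_gain, best_len

    seed_score = sum(pair(query[i + k], subject[j + k]) for k in range(w))
    right_gain, right_len = one_direction(i + w, j + w, +1)
    left_gain, left_len = one_direction(i - 1, j - 1, -1)
    return (seed_score + left_gain + right_gain,
            i - left_len, j - left_len, left_len + w + right_len)

query, subject = "LKFSGTWYAMAK", "PQFSGTWFAMGD"
print(find_seeds(query, subject))          # [(2, 2), (3, 3), (4, 4)]
print(extend_hit(query, subject, 2, 2))    # (31, 2, 2, 8)
```

The extension from the first seed yields the segment pair FSGTWYAM / FSGTWFAM: seven identities and one mismatch, for a score of 7 × 5 - 4 = 31. The trailing residues are excluded because including them only lowers the score.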
In the original (1990) implementation of BLAST, hits were extended in either
direction. In a 1997 refinement of BLAST, two independent hits are required.
The hits must occur in close proximity to each other. With this modification,
only one seventh as many extensions occur, greatly speeding the time
required for a search.
You can modify the threshold parameter. The default value for blastp is 11.
To change it, enter “-f 16” or “-f 5” in the advanced options.
How to interpret BLAST: E values and p values
It is important to assess the statistical significance of search results. For global
alignments, the statistics are poorly understood. For local alignments
(including BLAST search results), the scores follow an extreme value
distribution (EVD) rather than a normal distribution. The probability density
function of the extreme value distribution (characteristic value u = 0 and decay
constant λ = 1) is illustrated as follows:
Figure 4.4 EVD vs. Normal Gaussian distribution plot
Evolution tends to be sporadic and not continuous. Thus, a few hot spots of
high mutation rates dictate the divergence of species and generation of newer
species. Using sequence data to pick up such events requires use of the EVD
pattern of distribution, rather than the normal symmetrical distribution
followed by Gaussian plots. Hence, BLAST uses the EVD pattern to identify
homology.
Gaussian vs. Extreme Value Distribution
(Access video via iStudyGuide)
The expect value E is the number of alignments with scores greater than or
equal to score S that are expected to occur by chance in a database search. An
E value is related to a probability value p. The key equation describing an E
value is:
\[ E = K m n \, e^{-\lambda S} \]
This equation is derived from a description of the extreme value distribution,
where
S = the score
E = the expect value = the number of HSPs (high-scoring segment pairs) expected to
occur with a score of at least S
m, n = the lengths of the two sequences
λ, K = Karlin-Altschul statistical parameters
Some properties of the equation $E = K m n \, e^{-\lambda S}$:
• The value of E decreases exponentially with increasing S
(higher S values correspond to better alignments). Very high
scores correspond to very low E values.
• The expected score for aligning a pair of random residues must be
negative! Otherwise, long random alignments would acquire
great scores.
• Parameter K is a scaling factor for the size of the search space (mn).
• For E=1, one match with a similar score is expected to occur by
chance. For a very much larger or smaller database, you would
expect E to vary accordingly.
There are two kinds of scores:
Raw scores (calculated from a substitution matrix) and bit scores (normalised
scores)
Bit scores are comparable between different searches because they are
normalised to account for the use of different scoring matrices and different
database sizes:
\[ S' = \text{bit score} = \frac{\lambda S - \ln K}{\ln 2} \]
The E value corresponding to a given bit score is
\[ E = m n \, 2^{-S'} \]
Because λ and K are absorbed into S', bit scores allow you to compare
results between different database searches, even using different scoring
matrices.
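The two forms of the E value equation are algebraically the same, which a short numerical check makes concrete. In this sketch the values of λ, K, m and n are illustrative assumptions only; they are not taken from the text or from any particular search.

```python
import math

def e_value(raw_score: float, m: int, n: int, lam: float, k: float) -> float:
    """E = K * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * raw_score)

def bit_score(raw_score: float, lam: float, k: float) -> float:
    """S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)

def e_value_from_bits(bits: float, m: int, n: int) -> float:
    """E = m * n * 2 ** (-S')."""
    return m * n * 2 ** (-bits)

LAMBDA, K = 0.267, 0.041     # assumed Karlin-Altschul parameters (illustrative)
m, n = 250, 1_000_000        # assumed query length and database size

for raw in (40, 60, 80):
    bits = bit_score(raw, LAMBDA, K)
    print(raw, round(bits, 1),
          f"{e_value(raw, m, n, LAMBDA, K):.3g}",
          f"{e_value_from_bits(bits, m, n):.3g}")
```

Both E value columns agree for every raw score, and each increase of the raw score by a fixed amount multiplies E by the same factor, which is the exponential decrease described above.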
A p value is a different way of representing the significance of an alignment.
It is the probability of finding, by chance, at least one alignment with a score
greater than or equal to S:
\[ p = 1 - e^{-E} \]
Very small E values are very similar to p values. E values of about 1 to 10 are
far easier to interpret than corresponding p values.
E        p
10       0.99995460
5        0.99326205
2        0.86466472
1        0.63212056
0.1      0.09516258 (about 0.1)
0.05     0.04877058 (about 0.05)
0.001    0.00099950 (about 0.001)
0.0001   0.0001000
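The table can be reproduced directly from the relation p = 1 - e^(-E). A minimal sketch (the function name `p_from_e` is an assumption; `math.expm1` is used only to keep precision for very small E):

```python
import math

def p_from_e(e_value: float) -> float:
    """p = 1 - exp(-E), computed as -expm1(-E) for accuracy at small E."""
    return -math.expm1(-e_value)

for e in (10, 5, 2, 1, 0.1, 0.05, 0.001, 0.0001):
    print(f"{e:<8} {p_from_e(e):.8f}")
```

For E below about 0.1 the two numbers are nearly identical, whereas for E above 1 the p value saturates near 1 and stops being informative.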
Figure 4.5a Summary of results from a typical BLAST search
Sometimes, a real match has an E value > 1, for example,
Figure 4.5 b Alignment results from a BLAST search
Assessing whether proteins are homologous, as illustrated in the following:
Figure 4.6 Deducing Protein homology from alignment scores
Deducing homology from sequence alignment scores alone can be misleading.
Many homologous proteins have 20% or fewer identical amino acids in their
primary sequence. Complementing sequence data with a comparison of the
structural (or possible structural) data is more conclusive. However, there are
fewer protein structures than sequence data available.
RBP4 and PAEP: Low bit score, E value 0.49, 24% identity (“twilight zone”).
But they are indeed homologous. Try a BLAST search with PAEP as a query,
and find many other lipocalins.
BLAST searching with HIV-1 pol, a multi-domain protein:
4.2 FASTA Procedure
FASTA is a family of heuristic algorithms developed by William Pearson of
the University of Virginia. FASTA lies between BLAST and Smith-Waterman
in both accuracy and speed. An optimised FASTA option makes use of Smith-Waterman
for part of the alignment process. The FASTA family includes DNA
to DNA, protein to protein, and translation searches. Refer to Pearson, W. R.
and Lipman, D. J., Improved tools for biological sequence comparison, Proceedings
of the National Academy of Sciences, 1988.
Procedure:
– Find the best regions on diagonals.
– Re-score the 10 best regions using a PAM or BLOSUM substitution
matrix. The best of these new scores is a first measure of the
similarity between two sequences, called INIT1.
– INIT1 is computed for every database sequence with respect to the
query sequence. INIT1 is used to rank all database sequences.
– For the highest-ranking sequences, an optimised score opt is
computed by running a DP algorithm restricted to a band around the
initial alignment.
– Calculate the significance of the scores.
The FASTA program sets a size k for k-tuple sub-words. The program then
looks for diagonals in the comparison matrix between query and search
sequence along which many k-tuples match. This can be done very quickly
based on a preprocessed list of k-tuples contained in the query sequence. The
set of k-tuples can be identified with an array whose length corresponds to
the number of possible tuples of size k. This array is linked to the indices
where the particular k-tuples occur in the query sequence. Note that a
matching k-tuple at index i in the query and at index j in the database
sequence can be attributed to a diagonal by subtracting the one index from
the other. Therefore, when inspecting a new sequence for similarity, one
walks along this sequence inspecting each k-tuple. For each of them, one looks
up the indices where it occurs in the query, computes the index-difference to
identify the diagonal and increases a counter for this diagonal. After
inspecting the search sequence in this way, a diagonal with a high count is
likely to contain a well-matching region. In terms of the execution time, this
procedure is only linear in the length of the database sequence and can easily
be iterated for a whole database. Of course, this rough outline needs to be
adapted to focus on regions on diagonals where the match density is high and
link nearby, good diagonals into alignments.
Step 1
Determine k-tuples common to both sequences
k = 1 or 2 for proteins
k = 4, 5, or 6 for DNA
The value of k is a parameter called ktup in the program.
In addition, the offset of a common k-tuple is determined:
The offset is a value between -n+1 and m-1 that determines the relative
displacement of sequence S1 relative to sequence S2 (m and n are the lengths
of S1 and S2).
Specifically, if the common k-tuple starts at positions S1[i] and S2[j], then offset
= i - j. This is called the “diagonal method”, because an offset can be viewed as a
diagonal in the dot plot matrix.
S2 is then scanned, and each k-tuple in S2 is looked up in the table. For all common
occurrences, the entry of the corresponding offset in the offset vector is
incremented.
Offsets correspond to diagonals in the “dot plot”. The hashing technique used by
FASTA is an efficient way to count the number of “hot spots” in each diagonal of
the dot plot.
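The lookup table and offset vector can be sketched as follows. This is an illustration of the diagonal counting step only, not the FASTA program: the function names and the two example sequences are made up, and indices are 0-based (the difference i - j is the same as with 1-based positions).

```python
from collections import Counter, defaultdict

def ktuple_table(s1: str, k: int) -> dict:
    """Map every k-tuple of S1 to the list of positions where it starts."""
    table = defaultdict(list)
    for i in range(len(s1) - k + 1):
        table[s1[i:i + k]].append(i)
    return table

def offset_counts(s1: str, s2: str, k: int = 2) -> Counter:
    """Count hot spots (shared k-tuples) per diagonal, keyed by offset i - j."""
    table = ktuple_table(s1, k)
    counts = Counter()
    for j in range(len(s2) - k + 1):              # one pass over S2
        for i in table.get(s2[j:j + k], ()):      # every occurrence in S1
            counts[i - j] += 1
    return counts

counts = offset_counts("HARFYAAQIVL", "VDMAAQIA", k=2)
print(counts.most_common(1))   # [(2, 3)]
```

The three shared 2-tuples AA, AQ and QI all fall on the diagonal with offset 2, so that diagonal is the candidate region that would be re-scored in the next step. Building the table takes one pass over S1 and the scan takes one pass over S2 plus one step per hot spot, consistent with the O(m+n+x) cost discussed in the complexity analysis.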
Complexity Analysis:
Given two sequences of length n and m
Time complexity to calculate offset vector: O(m+n+x)
x is the number of hot-spots, i.e., exact matches of length k
Increasing ktup decreases x
Higher ktup value → FASTA is faster
Higher ktup value → FASTA is less sensitive
Trade-off: sensitivity vs speed
ktup = 1 or 2 for protein sequences
ktup = 4, 5 or 6 for DNA sequences
Join two or more k-tuples in the same diagonal if they are not very far apart.
The combined k-tuples form a region. A region is a gapless local alignment.
Regions are given a score depending on the matches and mismatches:
Score = sum of hot-spots − distance between hot-spots
Store the 10 best regions.
Step 2
Re-score 10 best regions using a substitution matrix. The best of these new
scores is used as a first measure of similarity between S1 and S2, called init1.
Step 3
From now on, only database sequences are considered with init1 > cutoff.
FASTA then tries to join nearby regions of the 10 best scoring regions using
gaps. The best joined score is called initn.
Step 4
Additionally, an opt score is computed by computing a local banded
alignment around the init1 region (band has 16 diagonals for ktup=2, 32
diagonals for ktup=1).
Step 5
Assessing the statistical significance of a score:
Z-score:
It is a way of measuring the significance of a score considering the mean of
the random score distribution. Difference between the similarity score for a
single alignment and the mean of the random score distribution is normalised
by the standard deviation of that random score distribution. Higher Z-scores
are better because the further the real score is from this mean (in standard
deviation units) the more significant it is.
E value:
It is the number of alignments with a score at least as good as the one found
between a query sequence and a database sequence that would be expected by
chance in as many comparisons between random sequences as were done to find
the matching sequence. It is an expected count, not a probability, so it can
exceed 1. When the Z-score goes up, the E-value goes down.
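The link between a score and its E value can be made concrete with the standard Karlin-Altschul relation for bit scores, E = m × n × 2^(-S'), and the conversion p = 1 - e^(-E). These formulas are not stated in this section and are added here as background; the function names and the toy lengths are assumptions, and real BLAST uses corrected (effective) lengths rather than raw ones.

```python
import math


def expect_value(bit_score, query_len, db_len):
    """E = m * n * 2**(-S') for bit score S' (raw lengths used as a simplification)."""
    return query_len * db_len * 2.0 ** (-bit_score)


def p_value(e_value):
    """Probability of at least one chance hit this good: p = 1 - exp(-E)."""
    return -math.expm1(-e_value)


# Hypothetical search space: 100-residue query, 1,024-residue database.
for score in (10, 20, 30):
    e = expect_value(score, 100, 1024)
    print(score, e, p_value(e))
```

Each additional bit of score halves E, and for small E the p-value is nearly equal to E.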
1. You have a short DNA sequence. In principle, how many proteins could it
potentially encode?
(a) 1 (b) 2 (c) 3 (d) 6
2. You can limit a BLAST search using any Entrez term. For example, you can
limit the results to those containing a researcher’s name.
(a) True (b) False
3. As the E value of a BLAST search becomes smaller:
(a) The value of K also becomes smaller
(b) The score tends to be larger
(c) The probability p tends to be larger
(d) The Extreme Value Distribution becomes less skewed
Answers: (d) (a) (b)
Visit the link, http://blast.ncbi.nlm.nih.gov/Blast.cgi and do a nucleotide
search and protein search using the term human and tbx-18. List top 5 results.
Search using only tbx-18 and list the top 5 results. Answers will be discussed
in class.
Summary
The following key points are discussed in this unit:
• Basic concepts used in the BLAST algorithm
• Selectivity, Sensitivity and Specificity
• BLAST search steps
• FASTA principles
• Interpretation of BLAST results: E-values and p-values
STUDY UNIT 5
ADVANCED BLAST SEARCHING
BME355 STUDY UNIT 5
Learning Outcomes
Upon completion of this unit, you will be able to:
1. Compare and contrast the various BLAST programs and their output
and explain the need for these programs
2. Evaluate the role of PSI-BLAST and PHI-BLAST to study homologous
proteins with limited sequence similarity
3. Explain how protein sequence patterns are utilised in homology
studies
4. Practise carrying out search using Advanced BLAST tools
5. Illustrate your knowledge of BLAST, Advanced BLAST and other
databases to solve a research problem
You can refer to Chapter 5 of the textbook
Overview
BLAST searches can be very versatile. This unit further explores the advanced
BLAST searching techniques. We begin with an overview of the specialised
BLAST resources and websites. We then focus on finding distantly related
proteins with Position-specific Iterated BLAST (PSI-BLAST) and significant
pattern matches with Pattern-Hit Initiated BLAST (PHI-BLAST). Finally,
using BLAST for gene discovery is illustrated.
Chapter 5 Advanced BLAST Searching
5.1 Types of BLAST Programs and Search Parameters
We have used two BLAST resources, both from the NCBI website: BLAST 2
Sequences and the standard five BLAST programs. There are other specialised
BLAST programs. First, there are many entire databases that consist of
molecular sequence data from a specific organism. Often, the data include
unfinished sequences that have not yet been deposited in GenBank.
For example, the Ensembl BLAST server allows the user to search the Ensembl
database, including the most finished sequence. Output of a BLAST search of
the Ensembl database using RBP4 as a query is presented in a graphical format
by chromosome, showing the best match to the long arm of chromosome 10
near the centromere. Weaker matches to paralogs on other chromosomes are
also evident, as in the following:
Figure 5.1 Location of RBP protein on the long arm of chromosome 10
The red box depicts the location of the RBP protein on chromosome 10.
Weaker matches are also depicted in blue and green.
BLAST search from The Institute for Genomic Research (TIGR) allows the
choice of databases from various organisms as well as optional parameters
such as a choice from dozens of substitution matrices. The TIGR BLAST
output resembles that of NCBI BLAST, but with fewer organisms. Also, there
is typically only one entry per species because redundant or partial sequences
from assorted databases are unified into one accession number.
Specialised BLAST-related algorithms
Developed at Washington University, WU BLAST 2.0 is related to the
traditional NCBI BLAST algorithms, as both were developed from the original
BLAST algorithm, which did not permit gapped alignments. WU BLAST 2.0 may
provide faster speed and increased sensitivity, and it includes a variety of
options such as a full Smith-Waterman alignment on some pairwise alignments
of database matches.
Figure 5.2 Screen capture of EMBL site showing WU-Blast 2
BLAST-like tools for genomic DNA searches
The analysis of genomic DNA presents special challenges:
• There are exons (protein-coding sequence) and introns (intervening
sequences).
• There may be sequencing errors or polymorphisms.
• The comparison may be between related species (e.g., human and mouse).
Recently developed tools include:
• MegaBLAST at NCBI.
• BLAT (BLAST-like alignment tool). BLAT parses an entire genomic DNA
database into words (11mers), then searches them against a query. Thus it
is a mirror image of the BLAST strategy. See http://genome.ucsc.edu
• SSAHA at Ensembl uses a strategy similar to that of BLAT. See
http://www.ensembl.org
Position-Specific Iterated BLAST (PSI-BLAST)
Many homologous proteins share only limited sequence identity. PSI-BLAST
is a specialised kind of BLAST search that is often more sensitive than a
regular BLAST search. The purpose of PSI-BLAST is to look deeper into the
database for matches to your query protein sequence by employing a scoring
matrix that is customised to your query.
PSI-BLAST is performed in five steps:
1. Select a query and search it against a protein database;
2. PSI-BLAST constructs a multiple sequence alignment, then creates a
“profile” or specialised position-specific scoring matrix (PSSM);
3. The PSSM is used as a query against the database;
4. PSI-BLAST estimates statistical significance (E values);
5. Repeat steps [3] and [4] iteratively, typically 5 times. At each new
search, a new profile is used as the query.
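Step 2 can be illustrated with a simplified PSSM builder in Python. This is a sketch only: real PSI-BLAST applies sequence weighting and data-dependent pseudocounts, whereas this version assumes a fixed pseudocount and a uniform background frequency. The names `build_pssm` and `score_with_pssm` and the toy five-residue alignment are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, background, pseudocount=1.0):
    """Return one dict per column: residue -> log2(observed freq / background freq)."""
    alphabet = sorted(background)
    pssm = []
    for column in zip(*alignment):  # iterate over alignment columns
        counts = Counter(column)
        total = len(column) + pseudocount * len(alphabet)
        pssm.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background[aa])
            for aa in alphabet
        })
    return pssm


def score_with_pssm(pssm, segment):
    """Sum the position-specific scores of a segment as long as the profile."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


# Toy gapless alignment and uniform background (both assumptions).
alignment = ["GTWYA", "GLWYA", "GRWYE", "GTWYE", "GEWFS"]
background = {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
pssm = build_pssm(alignment, background)
print(score_with_pssm(pssm, "GTWYA"), score_with_pssm(pssm, "AAAAA"))
```

Unlike a general matrix such as BLOSUM62, the same residue receives a different score at each position, which is what makes the profile "customised to your query".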
Figure 5.3 Steps and results of a PSI-BLAST search
PSI-BLAST creates a profile of the conserved residues between the query and
search results. The following iterations further strengthen this profile.
Barring the introduction of spurious false positives, typically 3-5 iterations
are needed to establish homology relationships even between sequences with
low sequence identity.
Performance assessment: Evaluate PSI-BLAST results using a database in
which protein structures have been solved and all proteins in a group share <
40% amino acid identity.
PSI-BLAST is useful to detect weak but biologically meaningful relationships
between proteins. The main source of false positives is the spurious
amplification of sequences not related to the query. For instance, a query with
a coiled-coil motif may detect thousands of other proteins with this motif that
are not homologous. Once even a single spurious protein is included in a
PSI-BLAST search above threshold, it will not go away.
The problem of corruption: Corruption is defined as the presence of at least
one false positive alignment with an E value < 10^-4 after five iterations.
Three approaches to stopping corruption are:
• Apply filtering of biased composition regions.
• Adjust the E value threshold from 0.001 (default) to a lower value such as
E = 0.0001.
• Visually inspect the output from the iterations. Remove suspicious hits by
unchecking the box.
Pattern-Hit Initiated BLAST (PHI-BLAST)
Given a protein sequence S and a regular expression pattern P occurring in S,
PHI-BLAST helps answer the question: What other protein sequences both
contain an occurrence of P and are homologous to S in the vicinity of the
pattern occurrences? PHI-BLAST may be preferable to just searching for
pattern occurrences because it filters out those cases where the pattern
occurrence is probably random and not indicative of homology.
PHI-BLAST is launched from the same page as PSI-BLAST. It combines
matching of regular expressions with local alignments surrounding the match.
For example, to align three lipocalins (RBP and two bacterial lipocalins),
pick a small, conserved region and see which amino acid residues are used.
Create a pattern using the appropriate syntax: GXW[YF][EA][IVLM]
The syntax for patterns in PHI-BLAST follows the conventions of PROSITE.
When using the stand-alone program, it is permissible to have multiple
patterns. When using the Web page, only one pattern is allowed per query.
[ ] means any one of the characters enclosed in the brackets, e.g., [LFYT] means
one occurrence of L or F or Y or T;
- means nothing (it is only a spacer between pattern elements);
x (or X) alone means one position in which any residue is allowed;
x(5) means 5 positions in which any residue is allowed;
x(2,4) means 2 to 4 positions where any residue is allowed.
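The pattern syntax can be tried out by translating it into a Python regular expression. The sketch below covers only the elements listed above (brackets, the "-" spacer, x, x(n) and x(n,m)) and only the pattern-matching half of PHI-BLAST, not the local alignment around each match. The helper names and the toy protein string are assumptions.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern (subset of the syntax) into a regex string."""
    regex = pattern.replace("-", "").replace(" ", "")  # "-" is only a spacer
    # x(5) -> .{5} and x(2,4) -> .{2,4}
    regex = re.sub(r"[xX]\((\d+(?:,\d+)?)\)", r".{\1}", regex)
    # A lone x or X matches any single residue.
    return re.sub(r"[xX]", ".", regex)


def find_pattern(pattern, protein):
    """Return (start index, matched text) for each non-overlapping occurrence."""
    return [(m.start(), m.group())
            for m in re.finditer(prosite_to_regex(pattern), protein)]


print(prosite_to_regex("GXW[YF][EA][IVLM]"))
print(find_pattern("GXW[YF][EA][IVLM]", "MKGTWYAMAKK"))  # toy sequence
```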
Figure 5.4 Steps and results of a typical PHI-BLAST search
PHI-BLAST starts with the use of a fixed pattern and uses this pattern
(profile/motif) to query the database for proteins containing these motifs.
Using BLAST for Gene Discovery
You can use BLAST to find a “novel” gene, as summarised in the figure below:
Figure 5.5 Diagrammatic view of schema to search for novel genes using BLAST
5.2 A Case Study
Return to the NCBI Map Viewer to search the human genome for sequences
similar to that of the red opsin.
Figure 5.6 Genomic overview of Red Opsin analogs in other organisms
Pick the following options from the various menus:
Database: Protein (Search the database of protein sequences.)
Program: blastp (Use the version of BLAST that compares protein sequences,
unlike blastn, which compares nucleotide sequences.)
Expect: 10 (The higher the number, the less stringent the matching, and the
more hits you will get.)
Parameter settings enable the user to optimise their BLAST search for each
query sequence. The filter option enables the program to mask regions of a
query sequence in order to exclude regions of low compositional complexity
such as repetitive elements.
Both the blastn and blastp search tools offer fully gapped alignments while
blastx and tblastn have "in-frame" gapped alignments and the tblastx search
tool provides only un-gapped alignments.
Scoring (substitution) matrices are used both to identify sequences in a
database and to predict the biological significance of the match.
PAM matrices are most sensitive for alignments of evolutionarily related
homologs. The greater the number in the matrix name, the greater the
expected evolutionary (mutational) distance.
BLOSUM matrices are most sensitive for local alignment of related sequences
and are therefore well suited to identifying an unknown sequence at the
protein level (for example, a nucleotide query translated by blastx).
Retrieved information includes:
• a schematic distribution of alignments of the query sequence to those in
the databases,
• a series of one-line descriptions of the database sequences which have
significantly aligned to the query sequence,
• actual sequence alignments, and
• a list of statistics specific to the BLAST search method.
Look down the page to the graphical display, a box containing lots of coloured
lines. Each line represents a hit from the BLAST search.
Figure 5.7: MEGABLAST results of a query against the Human Genome
If you pass your mouse cursor over a red line, the narrow text box just above
the graphic gives a brief description of the hit.
The first hit is your red opsin - the best match should be to the query sequence
itself, and you got this sequence from that gene entry.
The second hit is the green opsin - the PubMed entry reported the red and
green pigments are the most similar.
The third and fourth hits are the blue opsin and the rod-cell pigment
rhodopsin.
Other hits have lower numbers of matching residues, and are colour coded
according to a score of matches.
If you click on any of the coloured lines, you will skip down to more
information about that hit, and you can see how much similarity each one has
to the red opsin, your original query sequence. As you go down the list, each
succeeding sequence has less in common with red opsin:
Sequences producing significant alignments:                      Score    E
                                                                  (bits)   Value
ref|NP_000504.1|  opsin 1 (cone pigments), medium-wave-sensi...    729    0.0
ref|XP_301073.1|  similar to Red-sensitive opsin (Red cone p...    625    e-179
ref|NP_001699.1|  opsin 1 (cone pigments), short-wave-sensit...    298    5e-81
ref|NP_000530.1|  rhodopsin; rhodopsin (retinitis pigmentosa...    289    2e-78
ref|NP_055137.1|  opsin 3 (encephalopsin, panopsin); opsin 3...    146    2e-35
ref|NP_006574.1|  peropsin [Homo sapiens]                          142    3e-34
ref|NP_150598.1|  opsin 4 (melanopsin); melanopsin [Homo sap...    139    2e-33
ref|NP_005949.1|  melatonin receptor 1A; melatonin receptor ...     96    4e-20
ref|NP_001048.1|  tachykinin receptor 2; NK-2 receptor; Tach...     93    2e-19
ref|NP_000900.1|  neuropeptide Y receptor Y1; Neuropeptide Y...     89    3e-18
ref|NP_000903.1|  opioid receptor, kappa 1; Opiate receptor,...     87    1e-17
ref|NP_062874.1|  5-hydroxytryptamine receptor 7 isoform b; ...     87    2e-17
ref|NP_062873.1|  5-hydroxytryptamine receptor 7 isoform d; ...     87    2e-17
ref|NP_000863.1|  5-hydroxytryptamine receptor 7 isoform a; ...     87    2e-17
ref|NP_001049.1|  tachykinin receptor 1 isoform long; NK-1 r...     86    3e-17
ref|NP_001050.1|  tachykinin receptor 3; NK-3 receptor; neur...     86    5e-17
ref|NP_004215.1|  G protein-coupled receptor 50 [Homo sapiens]      85    8e-17
ref|XP_301490.1|  similar to odorant receptor MOR10 [Homo sa...     58    8e-09
…
ref|XP_063312.2|  similar to seven transmembrane helix recep...     58    8e-09
ref|NP_065110.1|  cysteinyl leukotriene receptor 2; cysteiny...     58    1e-08
ref|XP_301842.1|  similar to D(1B) dopamine receptor (D(5) d...     58    1e-08
ref|NP_005217.1|  endothelial differentiation, sphingolipid ...     57    1e-08
ref|XP_301795.1|  similar to D(1B) dopamine receptor (D(5) d...     57    1e-08
ref|NP_000857.1|  5-hydroxytryptamine (serotonin) receptor 1...     57    1e-08
ref|NP_115892.1|  G protein-coupled receptor 145; G protein-...     57    1e-08
ref|NP_003605.1|  galanin receptor 3; galanin receptor, fami...     57    2e-08
ref|NP_000856.1|  5-hydroxytryptamine (serotonin) receptor 1...     57    2e-08
The sequences are listed in order of increasing E (expect) value. The E value
is the number of matches with an equal or better score that would be expected
by chance in a database of this size. The lower the E value, the more
significant the match. The alignments are listed in order of most to least
significant.
Line descriptions are useful for identifying biologically interesting database
matches and correlating this with the statistical significance of the alignment.
Identifiers for the database sequences appear in the first column and are
hyperlinked to the associated Genbank sequence record.
The Score (bits) is a normalised value attributed to the alignment, so it can
be compared across searches that use different scoring matrices. The higher
this value, the better the match.
Alignments found with the BLAST algorithms are gapped unless the user
specifies otherwise on the main BLAST input page. Both the Score and Expect
values are given.
Additionally, the percent identity is given; this is the percent of exact matches
between your query sequence and the database sequence. The positive value
is more relevant to protein alignments. This is the percent of exact + similar
(based on properties) amino acid matches.
The gap value is the percent of the alignment that has been gapped in order
to produce the alignment.
In this case, each sequence is shown in comparison with red opsin in a
pairwise sequence alignment. (Later, you will make multiple sequence
alignments from which you can discern relationships among genes.)
For example, the blue opsin is aligned with the red opsin as in the following:
>ref|NP_001699.1| opsin 1 (cone pigments), short-wave-sensitive (colour
blindness, tritan); blue cone pigment [Homo sapiens]
Length = 348
Score = 298 bits (762), Expect = 5e-81
Identities = 145/339 (42%), Positives = 220/339 (64%), Gaps = 2/339 (0%)

Query: 23 TQSSIFTYTNSNSTRGPFEGPNYHIAPRWVYHLTSVWMIFVVTASVFTNGLVLAATMKFK 82
++ + + N +S GP++GP YHIAP W ++L + +M V N +VL AT+++K
Sbjct: 5 SEEEFYLFKNISSV-GPWDGPQYHIAPVWAFYLQAAFMGTVFLIGFPLNAMVLVATLRYK 63

Query: 83 KLRHPLNWILVNLAVADLAETVIASTISIVNQVSGYFVLGHPMCVLEGYTVSLCGITGLW 142
KLR PLN+ILVN++ + + V +GYFV G +C LEG+ ++ G+ W
Sbjct: 64 KLRQPLNYILVNVSFGGFLLCIFSVFPVFVASCNGYFVFGRHVCALEGFLGTVAGLVTGW 123

Query: 143 SLAIISWERWLVVCKPFGNVRFDAKLAIVGIAFSWIWSAVWTAPPIFGWSRYWPHGLKTS 202
SLA +++ER++V+CKPFGN RF +K A+ + +W + PP FGWSR+ P GL+ S
Sbjct: 124 SLAFLAFERYIVICKPFGNFRFSSKHALTVVLATWTIGIGVSIPPFFGWSRFIPEGLQCS 183

Query: 203 CGPDVFSGSSYPGVQSYMIVLMVTCCIIPLAIIMLCYLQVWLAIRAVAKQQKESESTQKA 262
CGPD ++ + +SY L + C I+PL++I Y Q+ A++AVA QQ+ES +TQKA
Sbjct: 184 CGPDWYTVGTKYRSESYTWFLFIFCFIVPLSLICFSYTQLLRALKAVAAQQQESATTQKA 243

Query: 263 EKEVTRMVVVMIFAYCVCWGPYTFFACFAAANPGYAFHPLMAALPAYFAKSATIYNPVIY 322
E+EV+RMVVVM+ ++CVC+ PY FA + N + + +P++F+KSA IYNP+IY
Sbjct: 244 EREVSRMVVVMVGSFCVCYVPYAAFAMYMVNNRNHGLDLRLVTIPSFFSKSACIYNPIIY 303

Query: 323 VFMNRQFRNCILQLF-GKKVDDGSELSSASKTEVSSVSS 360
FMN+QF+ CI+++ GK + D S+ S+ KTEVS+VSS
Sbjct: 304 CFMNKQFQACIMKMVCGKAMTDESDTCSSQKTEVSTVSS 342
To figure out what the scores mean:
Identities are residues that are identical in the hit and the query (red opsin),
when the two are optimally aligned.
Positives are residues that are very similar to each other. For example,
look at the first aligned position: it is threonine (T) in red opsin and
the very similar serine (S) in blue opsin.
Gaps are sometimes introduced into a hit to improve its alignment with
the query.
Note: blue opsin and rhodopsin are only about 45% identical to the red
opsin.
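The Identities and Gaps figures can be recomputed from any pair of aligned strings. The sketch below uses the final block of the blue opsin alignment (Query 323-360, Sbjct 304-342) as input; the function name `alignment_stats` is hypothetical. Positives are not computed, because that would require the substitution matrix used by BLAST.

```python
def alignment_stats(query, sbjct, gap="-"):
    """Return (identities, gapped columns, alignment length) for two aligned strings."""
    if len(query) != len(sbjct):
        raise ValueError("aligned strings must have equal length")
    pairs = list(zip(query, sbjct))
    identities = sum(q == s and q != gap for q, s in pairs)
    gaps = sum(q == gap or s == gap for q, s in pairs)
    return identities, gaps, len(pairs)


# Last block of the red opsin (query) vs blue opsin (subject) alignment.
query = "VFMNRQFRNCILQLF-GKKVDDGSELSSASKTEVSSVSS"
sbjct = "CFMNKQFQACIMKMVCGKAMTDESDTCSSQKTEVSTVSS"
identities, gaps, length = alignment_stats(query, sbjct)
print(f"Identities = {identities}/{length} "
      f"({100 * identities / length:.0f}%), Gaps = {gaps}/{length}")
```

Applying the same function to the full alignment, block by block, gives the totals that BLAST reports in the alignment header.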
Location of the Genes for the Hits in the Genome
We are interested in where all the genes for these hit proteins are in the
human genome. Click the Genome View button just below the introductory
information at the top of this result page.
You have come full circle. You are back at the human chromosome
diagram, and all the hits of your search, in the colours that signify their
BLAST scores, are located on the diagram.
About 100 proteins discovered so far have 40% or more positives in
alignment with red opsin. The opsins are members of the very large
family of G protein-coupled receptors, key players in signal
transduction.
1. Raw DNA sequences (other than Ref seq) in the EMBL and NCBI databases:
(a) Overlap entirely
(b) Overlap to a substantial degree but have distinct sequences
(c) Have relatively little overlap
2. Which of the following BLAST programs uses a signature of amino acids
to find proteins within a family?
(a) PSI-BLAST (b) PHI-BLAST (c) MS BLAST (d) Worm-BLAST
3. Which of the following steps is crucial to validating a sequence you believe
to be that of a novel gene?
(a) Performing a PSI-BLAST search
(b) Checking the EST database to see where this gene might be expressed
(c) Checking Locus Link to see if other family members of this gene have
been annotated
(d) BLAST searching your novel sequence into the appropriate database to
evaluate whether anyone else has described your protein
Answers: (a) (b) (d)
Go to http://blast.ncbi.nlm.nih.gov/Blast.cgi and search for a nucleotide and
a protein using the search terms human and tbx-18 and tbx-18 alone. Record
the top 5 hits you obtain with each search parameter. Tabulate the E-values
and p-scores for the top 5 results.
Answers will be discussed in class.
Summary
The following key points are discussed in this unit:
• Advanced BLAST searching techniques and a summary of their applications
• BLAST tools for genomic searches
• BLAST tools for protein homology searches
• Using BLAST tools for gene discovery
• Gene discovery: a case study
STUDY UNIT 6
MULTIPLE SEQUENCE
ALIGNMENT
BME355 STUDY UNIT 6
Learning Outcomes
Upon completion of this unit, you will be able to:
1. Define MSA and why MSA are necessary to glean information on
homology
2. Discuss various methods used for MSA, their advantages and
disadvantages
3. Evaluate use of HMM and probability-based models in sequence
analysis
4. Differentiate between the various methods of creating MSA in
comparison with profile HMMs
5. Illustrate knowledge of the various alternative MSA algorithms
6. Discuss how genomic sequences analysis has impacted daily life,
medicine, etc., post HGP
7. Critically evaluate the new disciplines such as Metagenomics enabled
by genome-wide sequence analysis
You can refer to Chapter 6 of the textbook
Overview
This unit considers how relationships are established among multiple
biological sequences. By introducing sequences into a multiple alignment, we
can define
members of a gene or protein family. If we know a feature of one of the
proteins and identify the homologous proteins, we can predict that they may
have similar function. Basic concepts and practical strategies of multiple
sequence alignment are studied. Databases of multiple sequence alignments
are introduced. Two main multiple sequence alignment programs are closely
examined.
Chapter 6 Multiple Sequence Alignment (MSA)
6.1 MSA: Introduction and Methods
The goals of this unit are as follows:
• To define what a multiple sequence alignment is and how it is generated;
to describe profile HMMs;
• To introduce databases of multiple sequence alignments;
• To introduce ways you can make your own multiple sequence alignments; and
• To show how a multiple sequence alignment provides the basis for
phylogenetic trees.
Definition of multiple sequence alignment:
• A collection of three or more protein (or nucleic acid) sequences that
are partially or completely aligned;
• Homologous residues are aligned in columns across the length of the
sequences;
• Residues are homologous in an evolutionary sense; and
• Residues are homologous in a structural sense.
Properties of multiple sequence alignment:
• Not necessarily one “correct” alignment of a protein family;
• Protein sequences evolve;
• The corresponding three-dimensional structures of proteins also evolve;
• May be impossible to identify amino acid residues that align properly
(structurally) throughout a multiple sequence alignment; and
• For two proteins sharing 30% amino acid identity, about 50% of the
individual amino acids are superimposable in the two structures.
Features of multiple sequence alignment:
• Some aligned residues, such as cysteines that form disulfide bridges,
may be highly conserved;
• There may be conserved motifs such as a transmembrane domain;
• There may be conserved secondary structure features; and
• There may be regions with consistent patterns of insertions or
deletions (indels).
Uses of multiple sequence alignment:
• MSA is more sensitive than pairwise alignment to detect homologs;
• BLAST output can take the form of a MSA, and can reveal conserved
residues or motifs;
• Population data can be analysed in a MSA (PopSet);
• A single query can be searched against a database of MSAs; and
• Regulatory regions of genes may have consensus sequences identifiable
by MSA.
Methods for Multiple Sequence alignment
There are two main ways to make a multiple sequence alignment: progressive
alignment and iterative approaches.
We illustrate the progressive alignment using ClustalW. In the example, two
data sets are used: five distantly related lipocalins (human to E. coli), and five
closely related RBPs. Note: When you do this, obtain the sequences of interest
in the FASTA format. (You can save them in a Word document)
Visit http://www2.ebi.ac.uk/clustalw/
Figure 6.1 Sequence input for a MSA using the Clustal W algorithm
Feng-Doolittle MSA occurs in 3 stages:
1. Do a set of global pairwise alignments (Needleman and Wunsch)
2. Create a guide tree
3. Progressively align the sequences
Stage 1 of 3: Generate global pairwise alignments
Number of pairwise alignments needed for N sequences: N(N-1)/2, e.g., for
5 sequences, (5)(4)/2 = 10.
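The count can be checked by enumerating the pairs directly. The sequence labels below are placeholders, and the helper name is hypothetical.

```python
from itertools import combinations


def pairwise_alignments_needed(n):
    """Number of unordered pairs among n sequences: n(n-1)/2."""
    return n * (n - 1) // 2


names = ["seq1", "seq2", "seq3", "seq4", "seq5"]  # placeholder labels
pairs = list(combinations(names, 2))              # every pair to be aligned in Stage 1
print(len(pairs), pairwise_alignments_needed(len(names)))
```

The quadratic growth matters in practice: 100 sequences already require 4,950 pairwise alignments.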
For example, five distantly related lipocalins as follows:
Another example, five closely related lipocalins as follows:
Stage 2 of 3: Generate guide tree
• Convert similarity scores to distance scores
• A tree shows the distance between objects
• Use UPGMA
• ClustalW provides a syntax to describe the tree
• A guide tree is not a phylogenetic tree
It is calculated from the distance matrix. For the first example,
For the second example,
(
(
gi|5803139|ref|NP_006735.1|:0.04284,
(
gi|6174963|sp|Q00724|RETB_MOUS:0.00075,
gi|132407|sp|P04916|RETB_RAT:0.00423)
:0.10542)
:0.01900,
gi|89271|pir||A39486:0.01924,
gi|132403|sp|P18902|RETB_BOVIN:0.01902);
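Stage 2 names UPGMA as the clustering method and shows the tree in a nested-parenthesis syntax. A minimal UPGMA sketch in Python that emits the same kind of string is given below. It is an illustration only: the function name `upgma`, the three labels and the distance values are assumptions, and the branch lengths it prints will not reproduce the ClustalW output above.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster by UPGMA and return a parenthesised tree string with branch lengths.

    dist maps a pair (a, b) of names to the distance between them.
    """
    clusters = {name: (1, 0.0) for name in names}  # label -> (size, node height)
    d = {frozenset(pair): value for pair, value in dist.items()}
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(sorted(clusters), 2),
                   key=lambda pair: d[frozenset(pair)])
        size_a, height_a = clusters.pop(a)
        size_b, height_b = clusters.pop(b)
        height = d[frozenset((a, b))] / 2  # new node sits halfway between a and b
        merged = f"({a}:{height - height_a:.3f},{b}:{height - height_b:.3f})"
        # Distance to every other cluster is the size-weighted average.
        for c in clusters:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[merged] = (size_a + size_b, height)
    return next(iter(clusters)) + ";"


# Toy distance matrix (hypothetical values).
print(upgma(["A", "B", "C"], {("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6}))
```

The order in which clusters are merged is exactly the order in which Stage 3 adds sequences to the alignment.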
Stage 3 of 3: Progressive alignment
• Make a MSA based on the order in the guide tree
• Start with the two most closely related sequences
• Then add the next closest sequence
• Continue until all sequences are added to the MSA
• Rule: “once a gap, always a gap.”
For the first example, progressively align the sequences following the branch
order of the tree, as follows:
For the second example,
Figure 6.2 Results from a MSA using RBP and Lipocalins as query
Why “once a gap, always a gap”?
• There are many possible ways to make a MSA
• Where gaps are added is a critical question
• Gaps are often added to the first two (closest) sequences
• To change the initial gap choices later on would be to give more
weight to distantly related sequences
• To maintain the initial gap choices is to trust that those gaps are most
believable
Sum-of-Pairs function
How do we assess the quality of an MSA? Usually, the assumption is made that
the alignment score is the sum of column scores. Therefore, we need a way to
assign a score to each column and add them up to get the alignment score. To
score a column, we want a function with k arguments, where k is the number
of sequences.
Sum-of-Pairs function: Scores of pairwise alignments in each column added
together.
e.g.,
SP-score(I, -, I, V) = p(I,-) + p(I,I) + p(I,V) + p(-,I) + p(-,V) + p(I,V)
p(a,b) is the pairwise score of symbols a and b, specified by a
substitution matrix
p(a,-) or p(-,a) is a specified gap penalty
p(-,-) = 0
SP-score is independent of the order of arguments, e.g.,
SP-score(I, -, I, V) = SP-score(V, I, I, -)
k(k-1)/2 pairwise scores need to be added for k arguments
SP scoring system is widely used due to its simplicity and
effectiveness
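The Sum-of-Pairs definition translates directly into code. In this sketch the substitution matrix is replaced by a toy scheme (match +1, mismatch -1, gap penalty -2); these values and the function names are assumptions chosen only to keep the example short.

```python
from itertools import combinations

GAP = "-"


def pair_score(a, b, match=1, mismatch=-1, gap_penalty=-2):
    """p(a, b): toy stand-in for a substitution matrix plus gap penalty."""
    if a == GAP and b == GAP:
        return 0                      # p(-,-) = 0
    if a == GAP or b == GAP:
        return gap_penalty            # p(a,-) or p(-,a)
    return match if a == b else mismatch


def sp_column_score(column):
    """Sum p(a, b) over all k(k-1)/2 pairs of symbols in one column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


def sp_alignment_score(rows):
    """Alignment score = sum of column scores."""
    return sum(sp_column_score(column) for column in zip(*rows))


print(sp_column_score(("I", "-", "I", "V")))   # six pairwise terms
print(sp_column_score(("V", "I", "I", "-")))   # same symbols, different order
print(sp_alignment_score(["VSNS", "-SNA", "--AS"]))
```

The first two calls return the same value, which illustrates that the SP-score does not depend on the order of its arguments.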
Multiple Dynamic Programming
Dynamic programming with two sequences:
• Relatively easy to code;
• Guaranteed to obtain optimal alignment.
Can this be extended to multiple sequences? Consider the three amino acid
sequences VSNS, SNA, AS. Put one sequence per axis (x, y, z) → a
3-dimensional array is required.
Complexity of Multiple Dynamic Programming:
Given k sequences of length n each
Compute a global multiple alignment with optimal SP-score via DP
Space complexity is O(n^k)
Time complexity is O(k × 2^k × n^k)
Cells in DP matrix: O(n^k)
Cells in recurrence relation: O(2^k)
Evaluating SP-score: O(k)
Conclusion: It’s going to take forever to align, say, 10 length-200 strings via
Multiple DP. Since complexity of DP approach is exponential in the number
of sequences, heuristic methods are usually used.
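A quick calculation shows what "forever" means for 10 sequences of length 200. The sketch counts DP cells (n^k) and the predecessor cells examined per cell (about 2^k); the rate of 10^9 cell look-ups per second is an assumption for illustration, and the cost of scoring each column is left out.

```python
def dp_cells(n, k):
    """Number of cells in a k-dimensional DP array for sequences of length n."""
    return n ** k


def predecessor_lookups(n, k):
    """Each cell looks back at about 2**k neighbouring cells."""
    return dp_cells(n, k) * 2 ** k


n, k = 200, 10
lookups = predecessor_lookups(n, k)
seconds = lookups / 1e9                      # assumed 10**9 look-ups per second
years = seconds / (365.25 * 24 * 3600)
print(f"cells: {dp_cells(n, k):.3e}, look-ups: {lookups:.3e}, years: {years:.2e}")
```

Even under this optimistic rate the run time is on the order of billions of years, while the pairwise case (k = 2) needs only 40,000 cells.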
Star Alignment
Input:
– k sequences S1,…,Sk
– scoring scheme
Procedure:
– Pick one sequence as the “centre”, called Sc;
– For each Si ≠ Sc, determine an optimal pairwise alignment between Si
and Sc;
– Aggregate the pairwise alignments
Output:
– MSA resulting from the aggregate
How good is the solution of the star alignment method? Under the following
conditions it is possible to derive a “Bounded Error Approximation”:
– Use distance d(x,y) instead of similarity s(x,y);
– Assume the “Triangle Inequality” holds: dist(x,z) ≤ dist(x,y) + dist(y,z)
for all characters x, y, z of the alphabet including space
Theorem: The star alignment method outputs an MSA with SP-score within
2 × optimal.
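The procedure does not say how the centre is picked. A common choice, assumed here, is the sequence with the smallest total distance to all the others. The sketch below uses plain edit distance (unit costs, which satisfy the triangle inequality) and the three example sequences VSNS, SNA and AS; the function names are hypothetical, and the aggregation step is not shown.

```python
def edit_distance(a, b):
    """Unit-cost edit distance computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # match or substitution
        previous = current
    return previous[-1]


def pick_centre(seqs):
    """Return the sequence minimising the sum of distances to all sequences."""
    return min(seqs, key=lambda c: sum(edit_distance(c, s) for s in seqs))


seqs = ["VSNS", "SNA", "AS"]
for s in seqs:
    print(s, sum(edit_distance(s, t) for t in seqs))
print("centre:", pick_centre(seqs))
```

With these three sequences, VSNS and SNA tie for the smallest total, and `min` returns the first of them.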
ClustalW
Most practical multiple alignments are made using the progressive alignment
method. The alignment is constructed by adding one sequence at a time to a
growing alignment. Progressive alignments are fast enough to allow
hundreds or even thousands of sequences to be aligned.
We use the program ClustalW to make a multiple sequence alignment and
compare the sequences with each other. The ClustalW algorithm is a fully
automated method for global multiple sequence alignment of DNA and
Protein sequences. It attempts to optimise the weighted sums-of-pairs with
gap penalties. The algorithm provides weights to the sequence and adjustable
parameters with reasonable defaults. ClustalW can create multiple
alignments and can also create phylogenetic trees.
Briefly, the ClustalW program runs in three stages:
Stage 1 Pairwise Alignment
Compute pairwise alignments for each sequence against all other sequences
and store the result in a similarity matrix. Convert the values in the sequence
similarity matrix to distance measures which reflect the evolutionary distance
between each pair of sequences.
Stage 2 Guiding Tree
Construct a guide tree which defines the order in which pairs of sequences
are aligned and combined with previous alignments using the sequence
similarity matrix and a neighbour-joining algorithm.
Stage 3 Progressive Alignment
Align progressively following the guide tree. Start by aligning the most
closely related pairs of sequences and, at each step, align two sequences or
one sequence to an existing sub-alignment.
Figure 6.3 MSA steps and results using ClustalW
(Image courtesy: Baxevanis and Ouellette, 2001)
Shortcomings of the Progressive Approach:
• No guarantee that the global optimal solution will be found.
• “Once a gap, always a gap”
• Any mistakes (misaligned regions) made early in the alignment process
cannot be corrected later as new information from other sequences is
added. These mainly result from an incorrect branching order in the
initial tree.
• The initial phylogenetic trees are derived from a matrix of distances
between separately aligned pairs of sequences and are much less reliable
than trees from complete multiple alignments.
• When all the sequences are highly divergent (e.g., less than 25-30%
identity between any pair), this progressive approach becomes much less
reliable.
• Runtime is still high, although ClustalW has polynomial complexity.
6.2 Profile HMMs and Alternative MSA tools
Hidden Markov models (HMMs) are probabilistic models. A profile HMM is built
from “states” that describe the probability of finding a particular amino
acid residue in each column of a multiple sequence alignment. Just as a
hammer is more refined than a blast, an HMM (as used in the HMMER program)
can provide more sensitive alignment than traditional techniques using
progressive alignments.
Hidden Markov Models
(Access video via iStudyGuide)
An HMM is constructed from a MSA, for example, five lipocalins:
GTWYA (hs RBP)
GLWYA (mus RBP)
GRWYE (apoD)
GTWYE (E Coli)
GEWFS (MUP4)
Prob.   1     2     3     4     5
p(G)    1.0
p(T)          0.4
p(L)          0.2
p(R)          0.2
p(E)          0.2               0.4
p(W)                1.0
p(Y)                      0.8
p(F)                      0.2
p(A)                            0.4
p(S)                            0.2
P(GEWYE) = (1.0)(0.2)(1.0)(0.8)(0.4) = 0.064
log odds score = ln(1.0) + ln(0.2) + ln(1.0) + ln(0.8) + ln(0.4) = -2.75
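The column probabilities and the score of GEWYE can be reproduced from the five aligned segments. This sketch models match states only (no insert or delete states, no pseudocounts), which is a strong simplification of a real profile HMM; the function names are hypothetical.

```python
import math
from collections import Counter

alignment = ["GTWYA", "GLWYA", "GRWYE", "GTWYE", "GEWFS"]


def column_probabilities(rows):
    """For each column, the observed frequency of every residue."""
    n = len(rows)
    return [{aa: count / n for aa, count in Counter(column).items()}
            for column in zip(*rows)]


def sequence_probability(probs, seq):
    """Product of per-column probabilities (0 if a residue was never observed)."""
    return math.prod(column.get(aa, 0.0) for column, aa in zip(probs, seq))


def log_score(probs, seq):
    """Sum of natural logs of the per-column probabilities."""
    return sum(math.log(column[aa]) for column, aa in zip(probs, seq))


probs = column_probabilities(alignment)
print(sequence_probability(probs, "GEWYE"))
print(round(log_score(probs, "GEWYE"), 2))
```

Summing logarithms instead of multiplying probabilities avoids numerical underflow for long profiles; a residue never seen in a column would give probability 0, which is why real tools add pseudocounts.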
Figure 6.4a & b Structure of a hidden Markov model (HMM)
HMMER: build a hidden Markov model
Determining effective sequence number    ... done. [4]
Weighting sequences heuristically        ... done.
Constructing model architecture          ... done.
Converting counts to probabilities       ... done.
Setting model name, etc.                 ... done. [x]
Constructed a profile HMM (length 230)
Average score:    411.45 bits
Minimum score:    353.73 bits
Maximum score:    460.63 bits
Std. deviation:    52.58 bits

HMMER: calibrate a hidden Markov model
HMM file:                   lipocalins.hmm
Length distribution mean:   325
Length distribution s.d.:   200
Number of samples:          5000
random seed:                1034351005
histogram(s) saved to:      [not saved]
POSIX threads:              2
--------------------------------
HMM    : x
mu     : -123.894508
lambda : 0.179608
max    : -79.334000
Figure 6.4 b HMMER: search an HMM against GenBank
Match to a bacterial lipocalin:
HMMER: search an HMM against GenBank (continued)
Scores for complete sequences (score includes all domains):
Sequence                           Description       Score   E-value    N
--------                           -----------       -----   -------   ---
gi|3041715|sp|P27485|RETB_PIG      Plasma retinol-   614.2   1.6e-179    1
gi|89271|pir||A39486               plasma retinol-   613.9   1.9e-179    1
gi|20888903|ref|XP_129259.1|       (XM_129259) ret   608.8   6.8e-178    1
gi|132407|sp|P04916|RETB_RAT       Plasma retinol-   608.0   1.1e-177    1
gi|20548126|ref|XP_005907.5|       (XM_005907) sim   607.3   1.9e-177    1
gi|20141667|sp|P02753|RETB_HUMAN   Plasma retinol-   605.3   7.2e-177    1
gi|5803139|ref|NP_006735.1|        (NM_006744) ret   600.2   2.6e-175    1
Two Kinds of Multiple Sequence Alignment Resources
•
Databases of multiple sequence alignments:
Text-based searches of CDD, Pfam (profile HMMs), PROSITE
Database searches with a query sequence with BLAST, CDD, Pfam
Examples: BLOCKS (HMM), CDD (HMM), DOMO (Gapped MSA),
INTERPRO (Integrative resources), iProClass (Integrative resources),
MetaFam (Integrative resources), Pfam (profile HMM library), PRINTS,
PRODOM (PSI-BLAST), PROSITE, SMART
•
Multiple sequence alignment by manual input:
PileUp, CLUSTAL W, CLUSTAL X
Examples: AMAS, CINEMA, ClustalW, ClustalX, DIALIGN, HMMT, Match-Box,
MultAlin, MSA, Musca, PileUp, SAGA, T-COFFEE
MSA databases: manual vs. automated curation
Manual curation:
Pfam, PROSITE, BLOCKS, PRINTS
Automated curation:
DOMO, PRODOM, MetaFam
+ comprehensive
- alignment errors
Strategy for Assessment of Alternative MSA Algorithms
Categories of multiple sequence alignment algorithms:
              Local      Global
Progressive   PIMA       CLUSTAL, PileUp
Iterative     DIALIGN    SAGA
Figure 6.5 Categories of Multiple Sequence Alignment algorithms
1. Create or obtain a database of protein sequences for which the 3D
structure is known. Thus we can define “true” homologs using
structural criteria.
2. Try making multiple sequence alignments with many different sets
of proteins (closely related, very distant, few gaps, many gaps,
insertions, outliers).
3. Compare the resulting alignments with the “true” alignments defined
by structural criteria.
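One common way to carry out the comparison in step 3 is a sum-of-pairs style score: the fraction of residue pairs aligned in the reference alignment that the test alignment also aligns. The Python sketch below is an assumption about how such a comparison might be done, since the text does not specify a scoring method. The toy sequences and the function names `aligned_pairs` and `sum_of_pairs_score` are hypothetical.

```python
from itertools import combinations

def aligned_pairs(alignment):
    """Return the set of residue pairs that share a column.

    Each residue is identified as (sequence index, residue index),
    so the same residue is recognised in differently gapped alignments.
    """
    pairs = set()
    next_residue = [0] * len(alignment)
    for column in zip(*alignment):
        present = []
        for seq_index, symbol in enumerate(column):
            if symbol != "-":
                present.append((seq_index, next_residue[seq_index]))
                next_residue[seq_index] += 1
        pairs.update(combinations(present, 2))
    return pairs

def sum_of_pairs_score(test, reference):
    """Fraction of reference residue pairs reproduced by the test alignment."""
    reference_pairs = aligned_pairs(reference)
    if not reference_pairs:
        return 0.0
    return len(aligned_pairs(test) & reference_pairs) / len(reference_pairs)

# Hypothetical toy data: the same three sequences aligned two ways.
reference_aln = ["ACGT", "A-GT", "AC-T"]  # stands in for the structure-based alignment
test_aln = ["ACGT", "AG-T", "AC-T"]       # stands in for a program's output

print(sum_of_pairs_score(test_aln, reference_aln))
```

A score of 1.0 means the test alignment reproduces every residue pair in the reference. Lower values show how much of the structure-based alignment was missed.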
BAliBASE: comparison of multiple sequence alignment algorithms
• As percent identity among proteins drops, performance (accuracy)
also declines. This is especially severe for proteins with < 25% identity:
Proteins <25% identity: 65% of residues align well
Proteins <40% identity: 80% of residues align well
• “Orphan” sequences are highly divergent members of a family.
Surprisingly, orphans do not disrupt alignments. Also surprisingly,
global alignment algorithms outperform local ones.
• Separate multiple sequence alignments can be combined (e.g., RBPs
and lactoglobulins). Iterative algorithms (PRRP, SAGA) outperform
progressive alignments (ClustalX).
• When proteins have large N-terminal or C-terminal extensions, local
alignment algorithms are superior. PileUp (global) is an exception.
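The percent identity thresholds quoted above can be computed for any aligned pair of sequences. The Python sketch below uses one possible definition, assumed here for illustration: identical positions divided by the number of columns in which neither sequence has a gap. Other definitions use a different denominator and give slightly different values. The function name `percent_identity` is illustrative.

```python
def percent_identity(seq_a, seq_b):
    """Percent identity over columns where neither aligned sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue  # skip gapped columns (an assumption of this definition)
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0

# hs RBP versus MUP4 from the five-lipocalin example in section 6.2
identity = percent_identity("GTWYA", "GEWFS")
print(f"{identity:.1f}% identity")
```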
6.3 Ethical, Legal and Social Implications (ELSI) and Metagenomics
In this course, we have looked at the role of genomics and use of genomic
sequences in various studies involving evolution and functional analysis of
the cell. We have looked at various techniques which have made genomic
sequences an integral part of medicine and diagnosis.
In the following two audio files, we focus on one specific application
(metagenomics), which uses sequence information from complex
environments to develop a novel cure for an infection.
Metagenomics
(Access video via iStudyGuide)
Additionally, we look at some of the questions raised by the use of
genomic data and the ethical, legal and social implications of incorporating
such data into medicine and disease studies.
Ethical, Legal, Social Implications in Genomics
(Access video via iStudyGuide)
1. Which of the following programs does not generate a multiple sequence
alignment?
(a) PSI-BLAST
(b) ClustalW
(c) PileUp
(d) PHYLIP
2. Which of the following is not a database consisting primarily of hidden
Markov models?
(a) Pfam
(b) PRINTS
(c) SMART
(d) TIGRFAMs
Answers: 1. (d); 2. (b)
Work through an HMM tutorial in class and understand why HMMs are widely
used in applications such as speech recognition, sonar and genomics.
Answers will be discussed in class.
Carry out a short MSA of 25 bases in class to understand how the various
algorithms compute and score deletions, gaps, etc., and create a
neighbour-joining tree from your MSA result.
Answers will be discussed in class.
Do a Google search to learn more about Myriad Genetics and the saga of breast
cancer patients who depend on Myriad’s BRCA tests. Discuss your findings on
this subject in class.
Summary
The following key points are discussed in this unit:
• Multiple sequence alignments: their features and uses
• Methods used for multiple sequence alignments (MSA)
• What are profile HMMs? How are they used in studying evolution?
• Strategy and alternative MSA methods
• Summary of MSA algorithms
• Ethical, Legal and Social Implications of using Genomic data
• Introduction to Metagenomics