Lecture 7

Bioinformatics
• Types of databases.
• Principles of organisation and functioning.
• Sequence formats.
• Conversion of one sequence format to another.
• Database search.
• FASTA, BLAST.
Protein and DNA/RNA Databases
• The first biological database was created by Margaret
Dayhoff in the 1960s in response to the development of protein-sequencing methods in the 1950s.
• Proteins in this and other DB were organised into
families and superfamilies based on degree of similarity.
• Tables that reflected the frequency of changes observed
in the sequences of a group of closely related proteins
were then derived (Percent Accepted Mutations, PAM
Matrices).
• These tables were used to align sequences and
reconstruct evolutionary pathways (phylogenetic
trees).
• In the following years numerous protein and other
databases were developed; SwissProt is one example.
• The first DNA DB were developed in 1979-80 by
American (GenBank) and European (EMBL) groups, also
in response to the development of sequencing techniques.
• Hundreds of specialised DB have been constructed
since then. Many DB contain DNA or RNA and protein
information. There are numerous links between DB, and
data and tools are exchanged regularly.
• The Entrez Nucleotides database is a collection of
sequences from several sources, including GenBank,
RefSeq, and PDB. The number of stored bases grows at
an exponential rate. On 20.02.2004 the total number of
bases stored in Entrez was 20,197,497,568.
ENTREZ
• DB of different kinds have merged and become global
hubs of knowledge.
Protein and DNA/RNA Databases
• Each DB has a complicated internal structure.
• The section of GenBank dealing with sequences alone consists of 58
blocks. Each block has at least one, and usually several, links with
other blocks.
• The major components of a DB include: storage of sequences, many
blocks responsible for retrieval of sequences, several blocks
responsible for alignment, blocks responsible for data input and
quality control, blocks responsible for statistical analysis of sequences,
and many other functionalities available in the DB.
• Complex DB like GenBank contain several sets of blocks, which
serve the DNA, protein, genome, taxonomy and other domains of the DB,
with numerous links operating between them.
• The major banks receive hundreds of thousands or even millions of
requests every day. This load must be accommodated in the
structure of the bank.
Collecting sequence and other data
• Primary sequences of DNA, RNA and proteins constitute a
significant portion of information accumulated in DB
• All primary sequences that are going to be in the public domain
must be submitted to a DB; otherwise a publication is not accepted.
• DNA sequences are usually submitted on-line in the following form.
• Each sequence is provided with an ID.
• If two or more identical sequences are submitted by different people,
the DB becomes redundant.
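The redundancy problem can be illustrated with a minimal Python sketch. It assumes entries are simple (ID, sequence) pairs and treats two sequences as identical when they match after removing whitespace and ignoring case. The function name `find_redundant` and the IDs are hypothetical and are not part of any real DB.

```python
from collections import defaultdict


def find_redundant(entries):
    """Group the IDs of entries whose sequences are identical.

    Sequences are compared after stripping whitespace and upper-casing,
    which is an assumption made for this sketch only.
    """
    groups = defaultdict(list)
    for seq_id, seq in entries:
        key = "".join(seq.split()).upper()
        groups[key].append(seq_id)
    # Keep only the sequences submitted under more than one ID.
    return [ids for ids in groups.values() if len(ids) > 1]


# Hypothetical submissions: the first two differ only in case and spacing.
entries = [
    ("ID001", "ATGCCGGA"),
    ("ID002", "atgc cgga"),
    ("ID003", "CCGATAAC"),
]
print(find_redundant(entries))
```

Real DB have to go further than exact matching, because overlapping or nearly identical fragments are also a source of redundancy.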
• As the size of genomes varies dramatically, from 10,000 bp for simple
viruses up to several billion bp in higher animals and plants, the
number of sequences covering the whole genome also varies very
significantly, from 10 to 10^6.
• DNA fragments presented in DB differ not only in
length but also in origin. Some are large fragments of a genome,
others represent genes or their fragments, some are repeats and non-coding sequences, etc.
• Many fragments have areas of overlaps.
• Many sequences are annotated. This means that their position on
genetic maps, the internal structure of genes (exon-intron) and their function
are known or predicted. However, in many cases such information is
missing.
Sequence formats
• It is important to ensure that sequence files do not
contain special characters recognisable only by text
editors. ASCII files are suitable for most sequence
programs.
• However, independent DB and some widely used
programs have developed slightly different formats for
sequences.
• Using the different formats correctly is critical, as
is the ability to recognise a format and convert a
sequence/file/entry from one format to another.
GenBank DNA sequence entry
Sequence formats
• There are many different (> 20) sequence formats, including
GenBank, EMBL, SwissProt, FASTA, Genetics Computer
Group (GCG) Sequence Format and several others.
1. FASTA/Pearson format
>seq1
agctagct actgg
>seq2
aactaact attcg
2. GenBank format
LOCUS seq1 16bp
DEFINITION seq1, 16 bases, 2688 checksum.
ORIGIN
1 agctagctag
//
LOCUS seq2 20bp
Conversion of one sequence format to another
• There are several computer programs able to convert
formats.
• READSEQ is one such program and is very useful.
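What a format converter does can be sketched in a few lines of Python. The sketch below is not READSEQ. It reads FASTA-formatted text and writes each record in a simplified GenBank-style layout (LOCUS, DEFINITION, ORIGIN, //). The function names and the exact column layout are assumptions made for illustration, and a real GenBank entry contains many more fields.

```python
def parse_fasta(text):
    """Return a list of (name, sequence) pairs from FASTA-formatted text."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            header = line[1:].split()
            name, chunks = (header[0] if header else "unnamed"), []
        elif name is not None:
            # Drop internal spaces so that blocks of bases are joined.
            chunks.append("".join(line.split()).lower())
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def to_genbank_like(name, seq):
    """Write one record in a simplified GenBank-style layout (not the full format)."""
    lines = [
        f"LOCUS       {name}  {len(seq)} bp",
        f"DEFINITION  {name}, {len(seq)} bases.",
        "ORIGIN",
    ]
    # 60 bases per row, in blocks of 10, each row prefixed by its start position.
    for start in range(0, len(seq), 60):
        row = seq[start:start + 60]
        blocks = " ".join(row[i:i + 10] for i in range(0, len(row), 10))
        lines.append(f"{start + 1:>9} {blocks}")
    lines.append("//")
    return "\n".join(lines)


fasta_text = ">seq1\nagctagct actgg\n>seq2\naactaact attcg\n"
for rec_name, rec_seq in parse_fasta(fasta_text):
    print(to_genbank_like(rec_name, rec_seq))
```

Parsing into a neutral (name, sequence) pair first and writing the target format second keeps the reader and the writer independent, so supporting another format needs only one new reader or writer.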
Sequence formats recognized by format
conversion program READSEQ:
1. Abstract Syntax Notation (ASN.1)
2. DNA Strider
3. EMBL
4. FASTA
5. Fitch (phylogenetic analysis)
6. GenBank
7. GCG
8. Intelligenetics
9. Multiple sequence format
10. Nat. Biomedical Research Foundation (NBRF)
11. Protein Information Resource (PIR)
12. And 6-8 additional specialised formats
READSEQ
Format conversion in GenBank
Storage of information in a sequence database
• There are millions of entries in the major DNA and protein DB, and
each entry usually contains a significant amount of information.
• This information is organised in tabular form, as is usually done
in relational DB. The number of columns (fields) in such DB is much
larger than in the table below.
• An index of these fields can be made, which allows a very fast search
of a DB using one or a few fields simultaneously.
• The information in one DB can be cross-referenced to that in
another DB. For instance, DNA, protein and reference DB have all
been cross-referenced, so that moving between them is readily
accomplished.
Accession No | Organism   | Reference | Name                    | Keywords                  | Sequence
123          | E. coli    | Medline1, | LexA protein            | SOS regulon, repressor,…  | ATGCCGG…
124          | H. sapiens | Medline2, | glucocorticoid receptor | transcriptional regulator | CCGATAAC
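A small sketch of tabular storage with an indexed field, using Python's built-in `sqlite3` module. The table name, column names and index name are assumptions chosen to mirror the example table, and the sequence values are the truncated fragments shown there, not complete sequences.

```python
import sqlite3

# In-memory database used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entries ("
    " accession INTEGER PRIMARY KEY,"
    " organism TEXT, reference TEXT, name TEXT, keywords TEXT, sequence TEXT)"
)
rows = [
    (123, "E. coli", "Medline1", "LexA protein",
     "SOS regulon, repressor", "ATGCCGG"),
    (124, "H. sapiens", "Medline2", "glucocorticoid receptor",
     "transcriptional regulator", "CCGATAAC"),
]
conn.executemany("INSERT INTO entries VALUES (?, ?, ?, ?, ?, ?)", rows)

# An index on a field lets the DB find matching rows without scanning the table.
conn.execute("CREATE INDEX idx_organism ON entries (organism)")


def search_by_organism(organism):
    """Return (accession, name) for every entry from the given organism."""
    cur = conn.execute(
        "SELECT accession, name FROM entries WHERE organism = ? ORDER BY accession",
        (organism,),
    )
    return cur.fetchall()


print(search_by_organism("E. coli"))
```

Cross-referencing between DB works in a similar way: a field such as the reference column stores the identifier of a record held in another table or another DB.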
Database Types
• There are several types of DB; the two principal types are
relational and object-oriented.
• A relational DB orders data in tables made up of rows, each giving a specific
item in the DB, and columns giving the features (attributes) of those
items. Careful indexing and cross-referencing are essential so that each
item in the DB has a unique set of identifying features.
• The object-oriented DB structure has been useful in the development of
biological DB. These DB are necessary to deal with complex and
constantly evolving biological objects like genetic maps. A sophisticated
architecture and the use of a unifying common language, like Interface
Definition Language (IDL), are required to make such DB functional and
united. To plan and construct such object-oriented DB, a specific set of
procedures called the Unified Modelling Language (UML) was devised.
Example of object-oriented DB
Sequence retrieval from the public Databases
• The essential step in providing access to a functional DB is the
development of software and Web pages that allow queries to be
made.
• The ENTREZ system implemented at NCBI is a good example of such a
retrieval system.
• Three major types of query search are implemented in a number
of DB: 1. ID search, 2. Molecule name or function search and 3.
Similarity search.
• The first two types are based on DB indexes and cross-references.
• The third is different, as it has to create comparative data and then
use these data for retrieval and DB searches.
DB searches for similar sequences
• Since Charles Darwin, the idea of the common origin of species has been
widely accepted; however, the level of similarity at the molecular
level between distant species remained unclear until the 1970s and 1980s.
• At that time it was established that many DNA and particularly protein
molecules retain significant (>60-70%) or high (>85%) similarity
hundreds of millions of years after separation from the common
ancestor.
• This discovery, as well as the practical need to search growing DB, led
to the development of effective methods of similarity search.
• Two programs that greatly facilitated similarity searching were
developed: FASTA (Pearson and Lipman 1988) and BLAST (Altschul
et al. 1990).
Basics of similarity searches
• The basic step in any similarity search is an alignment of two or
more sequences. Principles of alignment will be considered during the
next lecture.
• The search provides a list of DB sequences with which a query
sequence can be aligned. A scoring procedure is then applied,
which measures the degree of similarity, from 100% identity to a
loose similarity.
• A common reason for performing a DB search is to find a related
gene. A matched gene (or any other sequence) may provide a clue as
to function.
• An alternative task is to use a sequence with a known
function or role as a query for a search in the genome of a species.
• The search must be fast and sensitive enough.
FASTA
• FASTA is a program for rapid alignment of pairs of protein and DNA
sequences.
• Comparing all nucleotides or amino acids is not an option, even for
powerful computers. FASTA instead searches for matching sequence
patterns (“words”) called k-tuples. These patterns comprise k consecutive
matches in the compared sequences.
• Using k-tuples, FASTA builds a local alignment.
• Finally, FASTA scores this alignment and outputs a list of sequences
similar to the query, in descending order of score.
[Figure: two sequences compared, with gaps and matching k-tuples marked (k = 6, k = 8, k = 14)]
ATCGAACCTGGATCGTGGCCATCGAACCTGGATCGTGGCCATCGAACCTGGATCGTGGCC
GGCGAACCCCTATCGTGGCGTTACCGCCTTATTGACGGCCATCGAACCTGGATCGTGGCC
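The k-tuple idea can be sketched in Python as a word lookup followed by a count of hits per diagonal. This is a simplified illustration and not the FASTA implementation. The function names are hypothetical, and the two sequences are the ones shown on this slide.

```python
from collections import defaultdict


def shared_ktuples(seq_a, seq_b, k):
    """Return (pos_a, pos_b, word) for every k-tuple found in both sequences."""
    # Index every k-tuple of the first sequence by its start positions.
    index = defaultdict(list)
    for i in range(len(seq_a) - k + 1):
        index[seq_a[i:i + k]].append(i)
    hits = []
    # Look up every k-tuple of the second sequence in that index.
    for j in range(len(seq_b) - k + 1):
        word = seq_b[j:j + k]
        for i in index.get(word, []):
            hits.append((i, j, word))
    return hits


def best_diagonal(hits):
    """Return (diagonal, hit count) for the diagonal with the most hits.

    A diagonal is pos_a - pos_b; hits on one diagonal can be joined
    into a single ungapped region.
    """
    counts = defaultdict(int)
    for i, j, _ in hits:
        counts[i - j] += 1
    if not counts:
        return None
    return max(counts.items(), key=lambda item: item[1])


seq_a = "ATCGAACCTGGATCGTGGCC" * 3
seq_b = ("GGCGAACCCCTATCGTGGCG" "TTACCGCCTTATTGACGGCC" "ATCGAACCTGGATCGTGGCC")
hits = shared_ktuples(seq_a, seq_b, k=6)
print(len(hits), best_diagonal(hits))
```

A larger k gives fewer, more specific hits and a faster search; a smaller k finds weaker similarities at the cost of more hits to examine.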
FASTA
FASTA performs the following statistical tasks:
1. The average score for DB seq. of the same length is determined.
2. The average score is plotted against the log of the average seq. length in each length range.
3. The points are then fitted to a straight line by linear regression.
4. A z score, the number of standard deviations from the fitted line, is calculated for each score.
5. Low scoring seq. are removed.
6. A statistical comparison with the z distribution follows, from which the E() value is calculated. When E() is close to 0 and the z score is high, the two sequences are identical or nearly so; when E() is higher than a threshold level, no clear similarity is observed.
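The regression and z score steps can be sketched as follows. This is a simplified illustration under stated assumptions: it fits individual scores (rather than per-length-range averages) against the log of sequence length by ordinary least squares, and expresses each residual in units of the residual standard deviation. The function name and the toy data are made up for the example.

```python
import math


def z_scores(scores, lengths):
    """Return a z score for each (score, length) pair.

    Scores are regressed on log(length); z is the residual divided by
    the residual standard deviation.
    """
    xs = [math.log(n) for n in lengths]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, scores)]
    # n - 2 degrees of freedom, because two parameters were fitted.
    sd = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    return [r / sd for r in residuals]


# Made-up toy data: five scores that grow with log(length) and one outlier.
lengths = [100, 200, 400, 800, 1600, 400]
scores = [20, 24, 28, 32, 36, 90]
for length, score, z in zip(lengths, scores, z_scores(scores, lengths)):
    print(length, score, round(z, 2))
```

The sequence whose score lies far above the fitted line receives the highest z score, which is what marks it as a candidate for genuine similarity rather than a length effect.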
Methods used by FASTA to locate sequence similarity:
A. Rapid location of the 10 best matching regions in each
pair. For DNA seq. k = 4-6, for protein k = 1-2. The
highest-density matches are identified.
B. The highest-density regions are evaluated using
special scoring matrices (next lecture) and the best
initial regions (INIT1) are found (* marks the best).
C. Longer regions of identity with score INITN are
generated by joining INIT regions with scores higher than a
certain threshold; the scores include positive values for
similarity and negative values for gaps. An optimisation
procedure follows.
FASTA
Typical output of FASTA similarity search
Query: Motif2; #282 is a fragment from a DB
>>#282 (18 aa)
initn: 48 init1: 48 opt: 71 z-score: 191.0 E(): 6.9e-06
Smith-Waterman score: 71; 61.111% identity in 18 aa overlap
               10        20
Motif2 VKTYGFAATSVEEAKEVAEERGK
           X:.:::.::X.:.:.. :
#282       GFVATSAEEAEEIAKKLG
                   10
BLAST
• Basic Local Alignment Search Tool (BLAST) was developed as
a new way to perform seq. similarity search. BLAST is faster
than FASTA while being nearly as sensitive.
• The minimal “word” (k-tuple) length is slightly higher than in
FASTA, 3 for proteins and 11 for DNA.
BLAST procedure
• The steps used by the BLAST algorithm:
• 1) The seq. is optionally filtered to remove low-complexity regions
(AGAGAG…).
• 2) A list of words of a certain length is made.
• 3) Using substitution score matrices (like PAM or BLOSUM62), the query seq.
words are evaluated for matches with any DB seq., and these (log) scores are
added.
• 4) A cutoff score (T) is selected to reduce the number of matches to the most
significant ones.
• 5) The above procedure is repeated for each word in the query seq.
• 6) The remaining high-scoring words are organised into an efficient search tree and
rapidly compared to the DB seq.
• 7) If a good match is found, an alignment is extended from the match area in
both directions for as long as the score continues to grow. The latest version of
BLAST uses a more time-efficient method.
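The word-matching and extension steps can be sketched in Python for nucleotide sequences. This is a simplified illustration and not the BLAST implementation. It uses exact word matches and +1/-1 scores instead of a substitution matrix with threshold T, and it stops extending once the score falls more than `x_drop` below the best score seen. The function names, the scores, the short word length and the `x_drop` parameter are all assumptions chosen to keep the example small.

```python
def query_words(query, w):
    """Map every word of length w in the query to its start positions."""
    words = {}
    for i in range(len(query) - w + 1):
        words.setdefault(query[i:i + w], []).append(i)
    return words


def extend_one_side(query, subject, q, s, step, start_score,
                    match, mismatch, x_drop):
    """Walk in one direction (step = +1 or -1) without gaps.

    Returns (best score, number of residues added to reach it).
    """
    score = best = start_score
    gained = walked = 0
    while 0 <= q < len(query) and 0 <= s < len(subject) and best - score <= x_drop:
        score += match if query[q] == subject[s] else mismatch
        walked += 1
        if score > best:
            best, gained = score, walked
        q += step
        s += step
    return best, gained


def find_hsps(query, subject, w=4, match=1, mismatch=-1, x_drop=3):
    """Return unique (score, query_start, subject_start, length), best first."""
    words = query_words(query, w)
    hsps = set()
    for si in range(len(subject) - w + 1):
        for qi in words.get(subject[si:si + w], []):
            # Extend to the right of the word hit, then to the left.
            best, right = extend_one_side(query, subject, qi + w, si + w, 1,
                                          w * match, match, mismatch, x_drop)
            best, left = extend_one_side(query, subject, qi - 1, si - 1, -1,
                                         best, match, mismatch, x_drop)
            hsps.add((best, qi - left, si - left, w + left + right))
    return sorted(hsps, reverse=True)


print(find_hsps("AAGATTACAGG", "CGATTACATT"))
```

Several word hits on the same diagonal extend to the same segment, which is why the results are collected in a set before sorting.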
BLAST procedure
• The essence of this method is finding a diagonal connecting ungapped
alignments and extending them.
[Figure: axes labelled Query sequence and Database sequence]
BLAST procedure
• 8) The next step is to determine the high-scoring pairs (HSP) of seq. that
have a score greater than a cutoff score (S). S is determined empirically by
examining the range of scores found by comparing random seq. and by choosing
a value that is significantly greater.
• 9) BLAST then determines the statistical significance of each HSP score. The
probability p of observing a score S equal to or greater than x is given by the
equation $p(S \ge x) = 1 - \exp\left(-e^{-\lambda (x-u)}\right)$, where $u = [\log(K m' n')]/\lambda$, and K and λ
are parameters that BLAST calculates for the amino acid or nucleotide
substitution scoring matrix, n' is the effective length of the query seq. and m' is
the effective length of the database seq.
• 10) Finally, a statistical assessment is made for cases in which two or
more HSP regions are found, and the matching pairs are listed in the output file
in descending order of similarity score.
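The significance calculation in step 9 can be sketched numerically, assuming the extreme value form p(S ≥ x) = 1 − exp(−e^(−λ(x−u))) with u = log(K m' n')/λ. The function name is hypothetical, and the values of K, λ and the effective lengths in the example are illustrative assumptions, not values taken from this lecture.

```python
import math


def hsp_p_value(score, m_eff, n_eff, K, lam):
    """Probability of an HSP score >= `score` arising by chance.

    m_eff and n_eff are the effective lengths of the database and query
    sequences; K and lam depend on the scoring matrix.
    """
    u = math.log(K * m_eff * n_eff) / lam
    # -expm1(-y) equals 1 - exp(-y) but stays accurate when y is tiny.
    return -math.expm1(-math.exp(-lam * (score - u)))


# Illustrative parameter values (assumptions for this sketch).
K, lam = 0.134, 0.318
m_eff, n_eff = 1000, 100
for x in (30, 40, 50, 60):
    print(x, hsp_p_value(x, m_eff, n_eff, K, lam))
```

The probability falls steeply as the score rises above u, which is why a modest increase in HSP score can change a match from insignificant to highly significant.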
On line BLAST results