Sequence Alignment

advertisement
Sequence Alignment
Topics:
• Introduction
• Exact Algorithm
• Alignment Models
• BioPerl functions
Sequence Alignment
Motivation:
•Storing, retrieving and comparing DNA sequences in Databases.
•Comparing two or more sequences for similarities.
•Searching databases for related sequences and subsequences.
•Exploring frequently occurring patterns of nucleotides.
•Finding informative elements in protein and DNA sequences.
•Various experimental applications (reconstruction of DNA, etc.)
Seq.Align.
Gene1
Protein Function
Gene2
More than 25%
sequence identity
?
Similar 3D structure
?
Similar function
?
Similar sequences produce similar proteins
Alignment - inexact matching
•Substitution - replacing a sequence base by another.
•Insertion
- an insertion of a base (letter) or
several bases to the sequence.
•Deletion
- deleting a base (or more) from the
sequence.
(Insertion and deletion are the reverse of one another)
Seq. Align. Score
Commonly used matrices:
PAM250, BLOSUM64
Local Alignment
Local Alignment
INPUT: Two sequences S and T .
QUESTION: What is the maximum similarity between a
subsequence of S and a subsequence of T ?
Find most similar subsequences.
The IDEA
s[1…n]
t[1…m]
To align s[1...i] with t[1…j] we have three choices:
* align s[1…i-1] with t[1…j-1] and match s[i] with t[j]
s[1…i-1] i
t[1…j-1] j
* align s[1…i] with t[1…j-1] and match a space with t[j]
s[1… i ] t[1…j-1] j
* align s[1…i-1] with t[1…j] and match s[i] with a space
s[1…i-1] i
t[1… j ] -
Local Alignment
Smith-Waterman 1981
*Penalties should be negative*
s: xxxcde
t: abcxdex
match
2
mismatch -1
xxxcde- - - - - -abcxdex
Local
alignment
cxde
c-de
Sequence Alignment
Complexity:
Time O(n*m)
Space O(n*m) (exist algorithm with O(min(n,m)) )
Global Alignment
Global Alignment
INPUT: Two sequences S and T of roughly the same length.
QUESTION: What is the maximum similarity between them?
Find one of the best alignments.
Global Alignment
Needleman-Wunsch 1970
Alignment Score:
V(n,m)
Ends free alignment
Ends free alignment
INPUT: Two equences S and T (possibly of different length).
QUESTION: Find one of the best alignments between
subsequences of S and T when at least one of these
subsequences is a prefix of the original sequence and one
(not necessarily the other) is a suffix.
or
Ends free alignment
m
n
Gap Alignment
Definition: A gap is the maximal contiguous run of spaces in a single
sequence within a given alignment.The length of a gap is the number of
indel operations on it. A gap penalty function is a function that measure
the cost of a gap as a (nonlinear) function of its length.
Gap penalty
INPUT: Two sequences S and T (possibly of different length).
QUESTION: Find one of the best alignments between the two
sequences using the gap penalty function.
Affine Gap:
Wtotal = Wg + qWs
Wg – weight to open the gap
Ws – weight to extend the gap
BioPerl
“Bioperl is a collection of perl modules that facilitate the
development of perl scripts for bio-informatics applications.”
Bioperl is open source software that is still under active
development.
www.bioperl.org
Tutorial
Documentation
BioPerl
• Accessing sequence data from local and remote databases
• Transforming formats of database/ file records
• Manipulating individual sequences
• Searching for "similar" sequences
• Creating and manipulating sequence alignments
• Searching for genes and other structures on genomic DNA
• Developing machine readable sequence annotations
BioPerl library at TAU
BioPerl is NOT yet installed globally on CS network.
In each script you should add the following two lines:
use lib "/a/netapp/vol/vol0/home/silly6/mol/lib/BioPerl/lib";
use lib "/a/netapp/vol/vol0/home/silly6/mol/lib/BioPerl/lib/i686-linux";
Sequence Object
Seq – stores sequence, identification labels (id,
accession number, molecule type = DNA, RNA,
Protein, …), multiple annotations and associated
“sequence features”.
Sequence Object
$seq = Bio::Seq->new('-seq'=>'actgtggcgtcaact',
'-desc'=>'Sample Bio::Seq object',
'-display_id' => 'something',
'-accession_number' => 'accnum',
'-moltype' => 'dna' );
Usually Seq is not created this way.
Sequence Object
$seqobj->display_id(); # the human read-able id of the sequence
$seqobj->seq(); # string of sequence
$seqobj->subseq(5,10); # part of the sequence as a string
$seqobj->accession_number(); # when there, the accession number
$seqobj->moltype(); # one of 'dna','rna','protein'
$seqobj->primary_id(); # a unique id for this sequence irregardless
# of its display_id or accession number
Accessing Data Base
Databases: genbank, genpept, swissprot and gdb.
$gb = new Bio::DB::GenBank();
$seq1 = $gb->get_Seq_by_id('MUSIGHBA1');
$seq2 = $gb->get_Seq_by_acc('AF303112'));
$seqio = $gb->get_Stream_by_batch([ qw(J00522 AF303112
2981014)]));
Seq module
use Bio::DB::GenBank;
$gb = new Bio::DB::GenBank();
$seq1 = $gb->get_Seq_by_acc('AF303112');
$seq2=$seq1->trunc(1,90);
print $seq2->seq(), "\n";
$seq3=$seq2->translate;
print $seq3->seq(), “\n“;
ATGGAGCCCAAGCAAGGATACCTTCTTGTAAAATTGATAGAAGCTCGCAAGCTAGCA
TCTAAGGATGTGGGCGGAGGGTCAGATCCATAC
MEPKQGYLLVKLIEARKLASKDVGGGSDPY
SeqIO object
$seq = $gb->get_Seq_by_acc('AF303112'));
$out = Bio::SeqIO->new('-file' => ">f.fasta",
'-format' => 'Fasta');
$out->write_seq($seq);
SeqIO can read/write/transform data
in the following formats :
Fasta, EMBL. GenBank, Swissprot, PIR, GCG,
SCF, ACE, BSML
Transforming Sequence Files
$in = Bio::SeqIO->new('-file' => “f.fasta",
'-format' => 'Fasta');
$out = Bio::SeqIO->new('-file' => ">f.embl",
#better '-format' => ‘EMBL');
$in = Bio::SeqIO->new('-file' => “f.fasta",
$out->write_seq($in->next_seq());
'-format' => 'Fasta');
$out = Bio::SeqIO->new('-file' => ">f.embl",
#for several sequences
'-format' => ‘EMBL');
while ( my $seq = #$in->next_seq()
) { converter:
Fasta<->EMBL format
$out->write_seq($seq); }
print $out $_ while <$in>;
BioPerl: Pairwise Sequence Alignment
Smith-Waterman Algorithm
use Bio::Tools::pSW;
$factory = new Bio::Tools::pSW( '-matrix' => 'blosum62.bla',
'-gap' => 12,
'-ext' => 2, );
$factory->align_and_show($seq1, $seq2, STDOUT);
Currently works only on protein sequences.
Alignment Objects
SimpleAlign handles multiple alignments of sequences.
#pSW module
$aln = $factory->pairwise_alignment($seq1, $seq2);
foreach $seq ( $aln->eachSeq() ) {
print $seq->seq(), "\n"; }
$alnout = Bio::AlignIO->new(-format => 'fasta',
-fh => \*STDOUT);
$alnout->write_aln($aln);
Homework
Write a cgi script (using Perl) that performs pairwise Local/Global Alignment
for DNA sequences. All I/O is via HTML only.
Remarks:
1) you are allowed to use only linear space.
1. Choice for Local/Global alignment.
2)To compute z-score perform random shuffling:
2. Two sequences – text
boxes or two
accession
numbers.
srand(time|
$$);
#init, $$-proc.id
int(rand($i));
#returns rand. number between [0,$i].
3. Values for match, mismatch,
ins/dels.
3)Shuffling is done in windows (non-overlapping) of
4. Number of iterations for computing random scores.
10 bases length. Number of shuffling for each
Output:
window is random [0,10].
Input:
1.
Alignment score.
2.
z-score value (z= (score-average)/standard deviation.)
Download