Lecture 7: Biological sequences comparison

Bioinformatics
GBIO0009 -1
Biological Sequences
Biological Sequences Comparison
GBIO0009-1
Presented by
Kirill Bessonov
Oct 27, 2015
____________________________________________________________________________________________________________________
Lecture structure
1. Introduction
2. Global Alignment: Needleman-Wunsch
3. Local Alignment: Smith-Waterman
4. Illustrations
– Sequences identification via BLAST
– Retrieval of sequences via UniProt and R
– Alignment of sequences via R (seqinr)
– Detection of ORFs with R
– Comparing genomes of two species
____________________________________________________________________________________________________________________
Biological sequence
• Single continuous molecule
– DNA
– RNA
– protein
____________________________________________________________________________________________________________________
Biological problem
• Given the DNA sequence
AATCGGATGCGCGTAGGATCGGTAGGGTAGGCTTT
AAGATCATGCTATTTTCGAGATTCGATTCTAGCTA
• Answer
– Is it likely to be a gene?
– What is its possible expression level?
– What is the possible structure of the protein product?
– Can we get the protein?
– Can we figure out the key residues of the protein?
– …
____________________________________________________________________________________________________________________
Alphabet
– In the case of DNA
• A, C, T, G
– In the case of RNA
• A, C, U, G
– In the case of protein
• 20 amino acids
• Complete list is found here
____________________________________________________________________________________________________________________
Words
• Short strings of letters from an alphabet
• A word of length k is called a k-word or k-tuple
• Examples:
– 1-tuple: individual nucleotide
– 2-tuple: dinucleotide
– 3-tuple: codon
____________________________________________________________________________________________________________________
Patterns
• Recognizing motifs, sites, signals, domains
– functionally important regions
– a conserved motif - consensus sequence
– These terms (motif, site, signal, domain) are often used interchangeably
• A gene starts with an “ATG” codon
– Identify the number of potential gene start sites
AATCGGATGCGCGTAGGATCGGTAGGGTAGGCTTTAAGATCATGCTATTTTCGAGATT
CGATTCTAGCTAGGTTTAGCTTAGCTTAGTGCCAGAAATCGGATGCGCGTAGGATCGG
TAGGGTAGGCTTTAAGATCATGCTATTTTCGAGATTCGATTCTAGCTAGGTTTTTAGT
GCCAGAAATCGTTAGTGCCAGAAATCGATT
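
Counting the potential start sites can be sketched in base R. This is only an illustration: `find_pattern` is a hypothetical helper name, and the string `dna` is just the first line of the sequence shown above.

```r
# Return the 1-based start positions of a fixed pattern in a DNA string.
# Assumption: plain string matching is enough here ("ATG" cannot overlap itself).
find_pattern <- function(sequence, pattern = "ATG") {
  hits <- gregexpr(pattern, sequence, fixed = TRUE)[[1]]
  # gregexpr() returns -1 when the pattern does not occur
  if (hits[1] == -1) integer(0) else as.integer(hits)
}

dna <- "AATCGGATGCGCGTAGGATCGGTAGGGTAGGCTTTAAGATCATGCTATTTTCGAGATT"
starts <- find_pattern(dna)
length(starts)  # number of potential start sites in this line
starts          # their positions
```

Every "ATG" found this way is only a candidate; whether it really starts a gene is the question the ORF practical returns to.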
____________________________________________________________________________________________________________________
Probability
• What is the probability distribution of the number of times a given base N (A, C, T or G) occurs in a random DNA sequence of length Ln?
– Let X1, …, Xn be the bases of a sequence of length Ln
– Count the number of times N appears
– Calculate probabilities
• P(Xi = N) = pN = #{Xi = N} / Ln
• P(Xi ≠ N) = 1 – pN
Ex: Given ATCCTACTGT, Ln = 10
– P(Xi = A) = 2/10 = 0.2
– P(Xi = T or C or G) = 1 – 0.2 = 0.8
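
The same estimate can be reproduced with a few lines of base R. This is a minimal sketch; `base_probabilities` is a hypothetical helper name, not part of any package.

```r
# Estimate P(Xi = N) for each base as count / sequence length.
base_probabilities <- function(sequence) {
  bases <- strsplit(toupper(sequence), "")[[1]]
  # fix the levels so that absent bases get a probability of 0
  counts <- table(factor(bases, levels = c("A", "C", "G", "T")))
  counts / length(bases)
}

p <- base_probabilities("ATCCTACTGT")
p["A"]      # P(Xi = A)
1 - p["A"]  # P(Xi = T or C or G)
```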
____________________________________________________________________________________________________________________
Alignments
____________________________________________________________________________________________________________________
Biological context
• Proteins may be multifunctional
– Sequence determines protein function
– Assumptions
• Pairs of proteins with similar sequence also share
similar biological function(s)
____________________________________________________________________________________________________________________
Comparing sequences
• Sequence comparisons are important for a number of reasons
– establishing evolutionary relationships among organisms
– identifying functionally conserved sequences (e.g., DNA sequences controlling gene expression)
• ‘TATAAT’ box → transcription initiation
– developing models for human diseases
• identify corresponding genes in model organisms (e.g. yeast, mouse), which can be genetically manipulated
– e.g. gene knock-outs / silencing
____________________________________________________________________________________________________________________
Comparing two sequences
• There are two ways of pairwise comparison
– Global using Needleman-Wunsch algorithm (NW)
– Local using Smith-Waterman algorithm (SW)
• Global alignment (NW)
– Alignment of the “whole” sequence
• Local alignment (SW)
– tries to align portions (e.g. motifs)
– more flexible: considers “parts” of sequences
– works well on highly divergent sequences
[Figure labels: entire sequence, perfect match (global); unaligned sequence, aligned portion (local)]
____________________________________________________________________________________________________________________
Global alignment
____________________________________________________________________________________________________________________
Global alignment (NW)
• Sequences are aligned end-to-end along their entire length
• Many possible alignments are produced
– The alignment with the highest score is chosen
• Naïve algorithm is very inefficient (exponential time)
– To align sequences of length 15, need to consider
• 3 choices per position (match/mismatch, insertion, deletion): 3^15 ≈ 1.4 × 10^7 alignments
– Impractical for sequences of length > 20 nt
• Used to analyze homology/similarity of
– genes and proteins
– between species
____________________________________________________________________________________________________________________
Methodology of global alignment (1 of 4)
• Define scoring scheme for each event
– mismatch between a_i and b_j
• s(a_i, b_j) = −1 if a_i ≠ b_j
– gap (insertion or deletion)
• s(a_i, −) = s(−, b_j) = −2
– match between a_i and b_j
• s(a_i, b_j) = +2 if a_i = b_j
• Provide no restrictions on minimal score
• Start completing the alignment MxN matrix
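
The scoring scheme can be written as a small base-R function. This is a sketch under the slide's values (+2, −1, −2); `score_pair` and `score_alignment` are hypothetical helper names.

```r
# Score one aligned column: "-" stands for a gap.
score_pair <- function(a, b, match = 2, mismatch = -1, gap = -2) {
  if (a == "-" || b == "-") return(gap)
  if (a == b) match else mismatch
}

# Score a finished alignment given as two equal-length strings.
score_alignment <- function(x, y) {
  xs <- strsplit(x, "")[[1]]
  ys <- strsplit(y, "")[[1]]
  stopifnot(length(xs) == length(ys))
  sum(mapply(score_pair, xs, ys))
}

score_alignment("WHAT", "WH-Y")  # 2 + 2 - 2 - 1 = 1
```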
____________________________________________________________________________________________________________________
Methodology of global alignment (2 of 4)
• The matrix needs one extra column and one extra row
– M+1 columns, where M is the length of the first sequence
– N+1 rows, where N is the length of the second sequence
1. Initialize the matrix
– introduce the gap penalty (−2) at every position along the first row and first column
– scores at each cell are cumulative

        -    W    H    A    T
   -    0   -2   -4   -6   -8
   W   -2
   H   -4
   Y   -6
____________________________________________________________________________________________________________________
Methodology of global alignment (3 of 4)
2. Alignment possibilities for the first cell (W vs W)
– Gap (horizontal or vertical move): −2 − 2 = −4
– Match (W–W, diagonal move): 0 + 2 = +2
3. Select the maximum score for each cell
– Best alignment

        -    W    H    Y
   -    0   -2   -4   -6
   W   -2    2    0   -2
   H   -4    0    4    2
   A   -6   -2    2    3
   T   -8   -4    0    1
____________________________________________________________________________________________________________________
Methodology of global alignment (4 of 4)
4. Select the bottom-right cell
5. Trace the path(s) back to the top-left cell
– How was each cell value generated? From which neighbouring cell?

        -    W    H    Y
   -    0   -2   -4   -6
   W   -2    2    0   -2
   H   -4    0    4    2
   A   -6   -2    2    3
   T   -8   -4    0    1

Path 1:
WHAT
WHY-
Overall score = 1

Path 2:
WHAT
WH-Y
Overall score = 1
6. Select the best alignment(s)
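
Steps 1 to 6 can be put together in a compact base-R sketch. It is an illustration under the slide's scoring values, not the Biostrings implementation used later in the practical; `needleman_wunsch` is a hypothetical function name, and the traceback reports only one of the equally good paths.

```r
needleman_wunsch <- function(a, b, match = 2, mismatch = -1, gap = -2) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x); m <- length(y)
  # Step 1: (n+1) x (m+1) matrix with cumulative gap penalties on the borders
  S <- matrix(0, n + 1, m + 1, dimnames = list(c("-", x), c("-", y)))
  S[, 1] <- gap * (0:n)
  S[1, ] <- gap * (0:m)
  # Steps 2-3: each cell takes the best of diagonal, vertical and horizontal moves
  for (i in seq_len(n)) for (j in seq_len(m)) {
    s <- if (x[i] == y[j]) match else mismatch
    S[i + 1, j + 1] <- max(S[i, j] + s, S[i, j + 1] + gap, S[i + 1, j] + gap)
  }
  # Steps 4-5: trace back from the bottom-right cell to the top-left cell
  i <- n; j <- m
  ax <- character(0); ay <- character(0)
  while (i > 0 || j > 0) {
    s <- if (i > 0 && j > 0 && x[i] == y[j]) match else mismatch
    if (i > 0 && j > 0 && S[i + 1, j + 1] == S[i, j] + s) {
      ax <- c(x[i], ax); ay <- c(y[j], ay); i <- i - 1; j <- j - 1
    } else if (i > 0 && S[i + 1, j + 1] == S[i, j + 1] + gap) {
      ax <- c(x[i], ax); ay <- c("-", ay); i <- i - 1
    } else {
      ax <- c("-", ax); ay <- c(y[j], ay); j <- j - 1
    }
  }
  list(score = S[n + 1, m + 1], matrix = S,
       a = paste(ax, collapse = ""), b = paste(ay, collapse = ""))
}

res <- needleman_wunsch("WHAT", "WHY")
res$matrix
res$score          # overall score
c(res$a, res$b)    # one optimal alignment
```

Because the traceback prefers the diagonal move when several moves tie, it returns a single path; listing every optimal alignment would require following all tied moves.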
____________________________________________________________________________________________________________________
Local alignment
____________________________________________________________________________________________________________________
Local alignment (SW)
• Sequences are aligned to find regions where the
best alignment occurs (i.e. highest score)
• Assumes a local context (aligning parts of seq.)
• Ideal for finding short motifs, DNA binding sites
– helix-loop-helix (bHLH) - motif
– TATAAT box (a famous promoter region) – DNA binding site
• Works well on highly divergent sequences
____________________________________________________________________________________________________________________
Methodology of local alignment (1 of 4)
• The scoring system is similar with one exception
– The minimum possible score in the matrix is zero
– There are no negative scores in the matrix
• Let’s define the scoring system as in global
– mismatch between a_i and b_j
• s(a_i, b_j) = −1 if a_i ≠ b_j
– gap (insertion or deletion)
• s(a_i, −) = s(−, b_j) = −2
– match between a_i and b_j
• s(a_i, b_j) = +2 if a_i = b_j
____________________________________________________________________________________________________________________
Methodology of local alignment (2 of 4)
• Construct the alignment matrix with M+1 columns and N+1 rows
• Initialize the 1st row and 1st column with zeros (cumulative gap penalties would be negative, and the minimum value is zero)

        -    W    H    A    T
   -    0    0    0    0    0
   W    0
   H    0
   Y    0

s(a,b) ≥ 0 (min value is zero)
____________________________________________________________________________________________________________________
Methodology of local alignment (3 of 4)
• For each subsequent cell consider alignments
– Vertical s(I, −)
– Horizontal s(−, J)
– Diagonal s(I, J)
• For each cell select the highest score
– If the score is negative → assign zero

        -    W    H    A    T
   -    0    0    0    0    0
   W    0    2    0    0    0
   H    0    0    4    2    0
   Y    0    0    2    3    1
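
The fill rule, including the floor at zero, can be sketched in base R as follows. `sw_matrix` is a hypothetical helper name and the scoring values are the ones defined above.

```r
# Fill a Smith-Waterman matrix: rows follow sequence a, columns follow sequence b.
sw_matrix <- function(a, b, match = 2, mismatch = -1, gap = -2) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  H <- matrix(0, length(x) + 1, length(y) + 1,
              dimnames = list(c("-", x), c("-", y)))
  for (i in seq_along(x)) for (j in seq_along(y)) {
    s <- if (x[i] == y[j]) match else mismatch
    # the extra 0 is what makes the alignment local: negative scores are reset
    H[i + 1, j + 1] <- max(0, H[i, j] + s, H[i, j + 1] + gap, H[i + 1, j] + gap)
  }
  H
}

H <- sw_matrix("WHY", "WHAT")
H
max(H)  # highest local score
```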
____________________________________________________________________________________________________________________
Methodology of local alignment (4 of 4)
• Select the cell(s) with the highest score
• Trace back the path(s) until a score of zero is reached
– Trace back the cell values
– Check how each value originated (i.e. the path)

        -    W    H    Y
   -    0    0    0    0
   W    0    2    0    0
   H    0    0    4    2
   A    0    0    2    3
   T    0    0    0    1

Best local alignment (total score of 4):
WH
WH
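• Mathematically
– the local alignment score of sequences A and B is the maximum of S(I, J) over all sub-sequences I of A and J of B, where S(I, J) is the score for sub-sequences I and J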
____________________________________________________________________________________________________________________
Local alignment illustration (1 of 3)
• Determine the best local alignment and the
maximum alignment score for
• Sequence A: ACCTAAGG
• Sequence B: GGCTCAATCA
• Scoring conditions:
– s(a_i, b_j) = +2 if a_i = b_j,
– s(a_i, b_j) = −1 if a_i ≠ b_j and
– s(a_i, −) = s(−, b_j) = −2
____________________________________________________________________________________________________________________
Local alignment illustration (2 of 3)

        -   G   G   C   T   C   A   A   T   C   A
   -    0   0   0   0   0   0   0   0   0   0   0
   A    0   0   0   0   0   0   2   2   0   0   2
   C    0   0   0   2   0   2   0   1   1   2   0
   C    0   0   0   2   1   2   1   0   0   3   1
   T    0   0   0   0   4   2   1   0   2   1   2
   A    0   0   0   0   2   3   4   3   1   1   3
   A    0   0   0   0   0   1   5   6   4   2   3
   G    0   2   2   0   0   0   3   4   5   3   1
   G    0   2   4   2   0   0   1   2   3   4   2
____________________________________________________________________________________________________________________
Local alignment illustration (3 of 3)

        -   A   C   C   T   A   A   G   G
   -    0   0   0   0   0   0   0   0   0
   G    0   0   0   0   0   0   0   2   2
   G    0   0   0   0   0   0   0   2   4
   C    0   0   2   2   0   0   0   0   2
   T    0   0   0   1   4   2   0   0   0
   C    0   0   2   2   2   3   1   0   0
   A    0   2   0   1   1   4   5   3   1
   A    0   2   1   0   0   3   6   4   2
   T    0   0   1   0   2   1   4   5   3
   C    0   0   2   3   1   1   2   3   4
   A    0   2   0   1   2   3   3   1   2

Best local alignment (best score: 6):
CTCAA
CT-AA
The same alignment in the whole-sequence context (globally):
GGCTCAATCA
ACCT-AAGG
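
The illustration can be checked with a short base-R sketch of the full procedure (fill, pick the highest cell, trace back to zero). `smith_waterman` is a hypothetical function name; only the first highest-scoring cell and one traceback path are reported.

```r
smith_waterman <- function(a, b, match = 2, mismatch = -1, gap = -2) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  H <- matrix(0, length(x) + 1, length(y) + 1)
  for (i in seq_along(x)) for (j in seq_along(y)) {
    s <- if (x[i] == y[j]) match else mismatch
    H[i + 1, j + 1] <- max(0, H[i, j] + s, H[i, j + 1] + gap, H[i + 1, j] + gap)
  }
  # start at the (first) highest-scoring cell and walk back until a zero is reached
  pos <- which(H == max(H), arr.ind = TRUE)[1, ]
  i <- pos[1] - 1; j <- pos[2] - 1
  ax <- character(0); ay <- character(0)
  while (i > 0 && j > 0 && H[i + 1, j + 1] > 0) {
    s <- if (x[i] == y[j]) match else mismatch
    if (H[i + 1, j + 1] == H[i, j] + s) {
      ax <- c(x[i], ax); ay <- c(y[j], ay); i <- i - 1; j <- j - 1
    } else if (H[i + 1, j + 1] == H[i, j + 1] + gap) {
      ax <- c(x[i], ax); ay <- c("-", ay); i <- i - 1
    } else {
      ax <- c("-", ax); ay <- c(y[j], ay); j <- j - 1
    }
  }
  list(score = max(H), a = paste(ax, collapse = ""), b = paste(ay, collapse = ""))
}

res <- smith_waterman("GGCTCAATCA", "ACCTAAGG")
res$score
c(res$a, res$b)
```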
____________________________________________________________________________________________________________________
Aligning proteins
Globally and Locally
____________________________________________________________________________________________________________________
Biological context
• Find common functional units
– Structural motifs
• Helix-loop-helix
• Zinc finger
•…
• Phylogeny
– Distance between species
____________________________________________________________________________________________________________________
Protein Alignment
• Protein local and global alignment
– follows the same rules as we saw with DNA/RNA
• Differences (Δ)
– the protein alphabet has 20 standard residues (aa)
– scoring/substitution matrices are used (BLOSUM)
• protein properties are taken into account
– residues with very different properties, such as the charged (polar) Lysine and the apolar Glycine, are given a low score
____________________________________________________________________________________________________________________
Substitution matrices
• Protein sequences are more complex
– matrices = collection of scoring rules
• Matrices cover events such as
– mismatch and perfect match
• Need to define gap penalty separately
• E.g. BLOcks SUbstitution Matrix (BLOSUM)
____________________________________________________________________________________________________________________
BLOSUM-x matrices
• Constructed from aligned sequences with a specific x% similarity
– a matrix built using sequences with no more than 50% similarity is called BLOSUM-50
• For highly mutating / dissimilar sequences use
– BLOSUM-45 and lower
• For highly conserved / similar sequences use
– BLOSUM-62 and higher
____________________________________________________________________________________________________________________
BLOSUM 62
• What does the diagonal represent? A perfect match between a.a.
• What is the score for the substitution E→D (acidic a.a.)? Score = 2
• More drastic substitution K→I (basic to non-polar)? Score = -3
____________________________________________________________________________________________________________________
Practical problem:
Align following sequences both globally and locally
using BLOSUM 62 matrix with gap penalty of -8
Sequence A: AAEEKKLAAA
Sequence B: AARRIA
____________________________________________________________________________________________________________________
Aligning globally using BLOSUM 62

         -     A     A     R     R     I     A
   -     0    -8   -16   -24   -32   -40   -48
   A    -8     4    -4   -12   -20   -28   -36
   A   -16    -4     8     0    -8   -16   -24
   E   -24   -12     0     8     0    -8   -16
   E   -32   -20    -8     0     8     0    -8
   K   -40   -28   -16    -6     2     5    -1
   K   -48   -36   -24   -14    -4    -1     4
   L   -56   -44   -32   -22   -12    -2    -2
   A   -64   -52   -40   -30   -20   -10     2
   A   -72   -60   -48   -38   -28   -18    -6
   A   -80   -68   -56   -46   -36   -26   -14

AAEEKKLAAA
AA--RRIA--
Score: -14
Other alignment options? Yes
____________________________________________________________________________________________________________________
Aligning locally using BLOSUM 62

        -    A    A    E    E    K    K    L    A    A    A
   -    0    0    0    0    0    0    0    0    0    0    0
   A    0    4    4    0    0    0    0    0    4    4    4
   A    0    4    8    3    0    0    0    0    4    8    8
   R    0    0    3    8    3    2    2    0    0    3    7
   R    0    0    0    3    8    5    4    0    0    0    2
   I    0    0    0    0    0    5    2    6    0    0    0
   A    0    4    4    0    0    0    4    1   10    4    4

KKLA
RRIA
Score: 10
____________________________________________________________________________________________________________________
Practical 1 of 5:
Using BLAST for sequence identification
____________________________________________________________________________________________________________________
BLAST
• Basic Local Alignment Search Tool
• Many different types
• http://blast.ncbi.nlm.nih.gov/Blast.cgi
____________________________________________________________________________________________________________________
Types
• blastn
– nucleotide query vs nucleotide database
• blastp
– protein query vs protein DB
• blastx
– nucleotide query translated in 6 frames vs protein DB
____________________________________________________________________________________________________________________
Sequence identity
• Want to know which genes are coded by the genomic sequence
>human_genomic_seq
TGGACTCTGCTTCCCAGACAGTACCCCTGACAGTGACAGAACTGCCACTCTCCCCACCTG
ACCCTGTTAGGAAGGTACAACCTATGAAGAAAAAGCCAGAATACAGGGGACATGTGAGCC
ACAGACAACACAAGTGTGCACAACACCTCTGAGCTGAGCTTTTCTTGATTCAAGGGCTAG
TGAGAACGCCCCGCCAGAGATTTACCTCTGGTCTTCTGAGGTTGAGGGCTCGTTCTCTCT
TCCTGAATGTAAAGGTCAAGATGCTGGGCCTCAGTTTCCTCTTACATACTCACCAAAAGG
CTCTCCTGATCAGAGAAGCAGGATGCTGCACTTGTCCTCCTGTCGATGCTCTTGGCTATG
ACAAAATCTGAGCTTACCTTCTCTTGCCCACCTCTAAACCCCATAAGGGCTTCGTTCTGT
GTCTCTTGAGAATGTCCCTATCTCCAACTCTGTCATACGGGGGAGAGCGAGTGGGAAGGA
TCCAGGGCAGGGCTCAGACCCCGGCGCATGGACCTAGTCGGGGGCGCTGGCTCAGCCCC
GCCCCGCGCGCCCCCGTCGCAGCCGACGCGCGCTCCCGGGAGGCGGCGGCAGAGGCAG
CATCCACAGCATCAGCAGCCTCAGCTTCATCCCCGGGCGGTCTCCGGCGGGGAAGGCCGG
TGGGACAAACGGACAGAAGGCAAAGTGCCCGCAATGGAGGGAGCATCCTTTGGCGCGG
GCCGTGCGGGAGCTGCCTTTGATCCCGTGAGCTTTGCGCGGCGGCCCCAGACCCTGTTGC
GGGTCGTGTCCTGG
____________________________________________________________________________________________________________________
BLAST GUI
____________________________________________________________________________________________________________________
Results
Top hit
• Mus musculus targeted KO-first, conditional ready, lacZ-tagged mutant
allele Tbl3:tm1a(EUCOMM)Hmgu; transgenic
____________________________________________________________________________________________________________________
Practical 2 of 5 :
Sequence Retrieval and Analysis
via R (seqinr)
____________________________________________________________________________________________________________________
Protein database
• UniProt database (http://www.uniprot.org/) has high quality, manually curated protein data
• Each protein is assigned a UniProt ID
____________________________________________________________________________________________________________________
Manual retrieval from UniProt
• In the search field one can enter either a UniProt ID or a common protein name
– example: myelin basic protein
• We will retrieve data for UniProt ID P02686
____________________________________________________________________________________________________________________
FASTA format
• FASTA format is widely used and has the following structure
– The first line starts with a > sign followed by the sequence (protein) name
– The actual protein sequence starts from the 2nd line
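
A FASTA record can be read with a few lines of base R. This is a minimal sketch: `parse_fasta` is a hypothetical helper name, and the record below is a made-up toy example, not a real UniProt entry.

```r
# Parse FASTA text (a character vector of lines) into a named vector of sequences.
parse_fasta <- function(lines) {
  lines <- trimws(lines)
  lines <- lines[nzchar(lines)]
  is_header <- startsWith(lines, ">")
  stopifnot(length(lines) > 0, is_header[1])
  # every header line opens a new record; the lines after it hold the sequence
  record <- cumsum(is_header)
  seqs <- vapply(split(lines, record),
                 function(rec) paste(rec[-1], collapse = ""), character(1))
  names(seqs) <- sub("^>", "", lines[is_header])
  seqs
}

example <- c(">toy_protein hypothetical example record",
             "MKTAYIAKQR",
             "QISFVKSHFS")
parse_fasta(example)
```

For files on disk, the lines could be obtained with `readLines()`; seqinr also provides `read.fasta()` for the same purpose.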
____________________________________________________________________________________________________________________
Retrieving protein data with R and SeqinR
• Can “talk” programmatically to UniProt
database using R and seqinR library
– seqinR library is suitable for
• “Biological Sequences Retrieval and Analysis”
• Detailed manual could be found here
– Install this library in your R environment
install.packages("seqinr")
library("seqinr")
– Choose database to retrieve data from
choosebank("swissprot")
– Download data object for target protein (P02686)
MBP_HUMAN = query("MBP_HUMAN", "AC=P02686")
– See sequence of the object MBP_HUMAN
MBP_HUMAN_seq = getSequence(MBP_HUMAN); MBP_HUMAN_seq
____________________________________________________________________________________________________________________
Dot Plot (comparison of 2 sequences) (1 of 2)
• Each sequence is plotted on the vertical or horizontal dimension
– If two a.a. from the two sequences at given positions are identical, a dot is plotted
– Matching sequence segments appear as diagonal lines (which run parallel to the main diagonal if an insertion or gap is present)
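
The idea behind a dot plot can be sketched in base R without any package. `dot_matrix` is a hypothetical helper name; the seqinr function `dotPlot()` used in the practical draws the figure directly.

```r
# TRUE at [i, j] means residue i of sequence a is identical to residue j of b.
dot_matrix <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  outer(x, y, "==")
}

d <- dot_matrix("WHAT", "WHY")
dots <- which(d, arr.ind = TRUE)  # coordinates of the dots
dots
# To draw them: plot(dots[, "row"], dots[, "col"], pch = 16, xlab = "a", ylab = "b")
```

Here the two dots at (1, 1) and (2, 2) form the start of a diagonal, which is the shared "WH" segment.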
____________________________________________________________________________________________________________________
Dot Plot (comparison of 2 sequences) (2 of 2)
• Let’s compare two protein sequences
– Human MBP (Uniprot ID: P02686)
– Mouse MBP (Uniprot ID: P04370)
• Download 2nd mouse sequence
MBP_MOUSE = query("MBP_MOUSE", "AC=P04370");
MBP_MOUSE_seq = getSequence(MBP_MOUSE);
• Visualize dot plot
dotPlot(MBP_HUMAN_seq[[1]], MBP_MOUSE_seq[[1]], xlab = "MBP - Human", ylab = "MBP - Mouse")
[Figure: dot plot of human vs mouse MBP. Annotations: breaks in the diagonal line = regions of dissimilarity; shift in the diagonal line between identical regions = insertion in MBP-Human or gap in MBP-Mouse]
- Is there similarity between the human and mouse forms of the MBP protein?
- Where do the two sequences differ?
____________________________________________________________________________________________________________________
Practical 3 of 5:
Pairwise global and local alignments
via R and Biostrings
____________________________________________________________________________________________________________________
Installing Biostrings library
• Install library from Bioconductor
# biocLite() has been retired; BiocManager is the current Bioconductor installer
install.packages("BiocManager")
BiocManager::install("Biostrings")
library(Biostrings)
• Define substitution matrix (e.g. for DNA)
DNA_subst_matrix = nucleotideSubstitutionMatrix(match = 2,
mismatch = -1, baseOnly = TRUE)
• The scoring rules
DNA_subst_matrix
– Match: s(a_i, b_j) = 2 if a_i = b_j
– Mismatch: s(a_i, b_j) = -1 if a_i ≠ b_j
– Gap: s(a_i, −) = -2 or s(−, b_j) = -2
____________________________________________________________________________________________________________________
Global alignment using R and Biostrings
• Create two string vectors (i.e. sequences)
seqA = "GATTA"
seqB = "GTTA"
• Use pairwiseAlignment() and the defined rules
globalAlignAB = pairwiseAlignment(seqA, seqB,
substitutionMatrix = DNA_subst_matrix, gapOpening = -2,
gapExtension=0, scoreOnly = FALSE, type="global")
• Visualize best paths (i.e. alignments)
globalAlignAB
Global PairwiseAlignedFixedSubject (1 of 1)
pattern: [1] GATTA
subject: [1] G-TTA
score: 6
____________________________________________________________________________________________________________________
Local alignment using R and Biostrings
• Input two sequences
seqA = "AGGATTTTAAAA"
seqB = "TTTT"
• The scoring rules will be the same as we used
for global alignment
localAlignAB = pairwiseAlignment(seqA, seqB,
substitutionMatrix = DNA_subst_matrix, gapOpening = -2,
scoreOnly = FALSE, type="local")
• Visualize alignment
localAlignAB
Local PairwiseAlignedFixedSubject (1 of 1)
pattern: [5] TTTT
subject: [1] TTTT
score: 8
____________________________________________________________________________________________________________________
Aligning protein sequences
• Protein sequence alignments work the same way, except that a protein substitution matrix is specified
data(BLOSUM62)
BLOSUM62
• Will align sequences
seqA = "PAWHEAE"
seqB = "HEAGAWGHEE"
• Execute the global alignment
globalAlignAB <- pairwiseAlignment(seqA, seqB,
substitutionMatrix = "BLOSUM62", gapOpening = -2,
gapExtension = -8, scoreOnly = FALSE)
____________________________________________________________________________________________________________________
Practical 4 of 5:
Open Reading Frame (ORF)
identification
via R
(i.e. Gene Finding)
____________________________________________________________________________________________________________________
Gene expression sequences
• Central dogma
– DNA → RNA → protein
• Each codon codes for one amino acid (a.a.)
– residue = amino acid
• RNA polymerase II
– Synthesizes mRNA in the 5’ to 3’ direction
– 3 nucleotides code for 1 a.a.
• In the DNA context
– Start codon: ATG
– Stop codon: TAA, TGA, TAG
____________________________________________________________________________________________________________________
mRNA → Protein alphabet
• Codon table: 3 nucleotides code for 1 a.a.
____________________________________________________________________________________________________________________
Find ORF via R
install.packages("BiocManager")   # biocLite() has been retired
BiocManager::install("Biostrings")
library("Biostrings")
s1 <- "aaaatgcagtaacccatgccc"
# Find all ATGs in the sequence s1
matchPattern("atg", s1)
    start end width
[1]     4   6     3 [atg]
[2]    16  18     3 [atg]
1) Number of potential start codons: there are two “ATG”s in the sequence
2) Positions: nucleotides 4-6 and 16-18
____________________________________________________________________________________________________________________
Find stop codons
s1 <- "aaaatgcagtaacccatgccc";
stop.codons <- c("taa", "tga", "tag");
sapply(stop.codons,function(x){matchPattern(x,s1)});
$taa
Views on a 21-letter BString subject
subject: aaaatgcagtaacccatgccc
views:
    start end width
[1]    10  12     3 [taa]
$tga
Views on a 21-letter BString subject
subject: aaaatgcagtaacccatgccc
views: NONE
$tag
Views on a 21-letter BString subject
subject: aaaatgcagtaacccatgccc
views: NONE
____________________________________________________________________________________________________________________
Dengue virus example
• Want to find ORFs in Dengue virus
– find all potential start and stop codons
– 500 nucleotides of the genome sequence
• the DEN-1 Dengue virus (NCBI accession NC_001477)
____________________________________________________________________________________________________________________
Define getncbiseq()
getncbiseq <- function(accession)
{
  require("seqinr") # this function requires the SeqinR R package
  # first find which ACNUC database the accession is stored in:
  dbs <- c("genbank", "refseq", "refseqViruses", "bacterial")
  # build the query once, with no space between "AC=" and the accession
  thequery <- paste0("AC=", accession)
  for (db in dbs)
  {
    choosebank(db)
    # check if the sequence is in ACNUC database 'db':
    resquery <- try(query(".tmpquery", thequery), silent = TRUE)
    if (!inherits(resquery, "try-error"))
    {
      # the accession was found: retrieve the sequence of the first hit
      query2 <- query("query2", thequery)
      seq <- getSequence(query2$req[[1]])
      closebank()
      return(seq)
    }
    closebank()
  }
  print(paste("ERROR: accession", accession, "was not found"))
}
dengueseq <- getncbiseq("NC_001477")
____________________________________________________________________________________________________________________
Select portion of the sequence
• Take portion of the Dengue virus
library("seqinr");
# Take the first 500 nucleotides
dengueseqstart <- dengueseq[1:500]
dengueseqstartstring <- c2s(dengueseqstart);
____________________________________________________________________________________________________________________
Define findPotentialStartsAndStops()
findPotentialStartsAndStops <- function(sequence)
{
  require(Biostrings) # matchPattern() comes from the Biostrings package
  # Define a vector with the sequences of potential start and stop codons
  codons <- c("atg", "taa", "tag", "tga")
  # Find the number of occurrences of each type of potential start or stop codon
  for (i in 1:4)
  {
    codon <- codons[i]
    # Find all occurrences of codon "codon" in sequence "sequence"
    occurrences <- matchPattern(codon, sequence)
    # Find the start positions of all occurrences of "codon" in sequence "sequence"
    codonpositions <- start(occurrences)
    # Find the number of occurrences of this codon in sequence "sequence"
    numoccurrences <- length(codonpositions)
    if (i == 1)
    {
      # Make a copy of vector "codonpositions" called "positions"
      positions <- codonpositions
      # Make a vector "types" containing "numoccurrences" copies of "codon"
      types <- rep(codon, numoccurrences)
    }
    else
    {
      # Add the vector "codonpositions" to the end of vector "positions":
      positions <- append(positions, codonpositions, after = length(positions))
      # Add the vector "rep(codon, numoccurrences)" to the end of vector "types":
      types <- append(types, rep(codon, numoccurrences), after = length(types))
    }
  }
  # Sort the vectors "positions" and "types" in order of position along the input sequence:
  indices <- order(positions)
  positions <- positions[indices]
  types <- types[indices]
  # Return a list variable including vectors "positions" and "types":
  mylist <- list(positions, types)
  return(mylist)
}
____________________________________________________________________________________________________________________
Find ORFs
Sequence has 11 potential start and 22 potential stop codons
"agttgttagtctacgtggaccgacaagaacagtttcgaatcggaagcttgc
ttaacgtagttctaacagttttttattagagagcagatctctgatgaacaac
caacggaaaaagacgggtcgaccgtct … "
findPotentialStartsAndStops(dengueseqstartstring);
[[1]]
[1]   7  53  58  64  78  93  95  96 137 141 224 225 234 236 246
    255 264 295 298 318 365 369 375 377 378 399 404 413 444 470 471
    474 478
[[2]]
[1] "tag" "taa" "tag" "taa" "tag" "tga" "atg" "tga" "atg" "tga"
    "atg" "tga" "tga" "atg" "tag" "taa" "tag" "tag" "atg" "atg" "atg"
    "tga" "taa" "atg" "tga" "tga" "atg" "atg" "tga" "atg" "tga" "tag"
    "tag"
____________________________________________________________________________________________________________________
Three reading frames
• three different possible reading frames, set by the number of nucleotides between the start of the sequence and the codon
– +1 - an integer number of triplets (codons)
– +2 - an integer number of triplets, plus 1 nucleotide
– +3 - an integer number of triplets, plus 2 nucleotides
____________________________________________________________________________________________________________________
Identify reading frame
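• Get sub-sequence
substring(dengueseqstartstring,137,143);
[1] "atgctga"
• starts
– with a potential start codon (ATG) at position 137
• ends
– with a potential stop codon (TGA) at position 141
• before the ATG there is
– an integer number of triplets (45)
– plus one nucleotide
– so the ATG is in frame +2
• the TGA at position 141 is preceded by 46 triplets plus two nucleotides
– so it is in frame +3 and cannot terminate an ORF that starts at this ATG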
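The frame rule can be written as a one-line calculation. The sketch below is only an illustration: the helper name readingFrame is hypothetical (it is not part of the lecture code), and positions are assumed to be 1-based, as returned by findPotentialStartsAndStops().

```r
# Hypothetical helper: forward reading frame (+1, +2 or +3) of a codon
# that starts at 1-based position "pos". (pos - 1) is the number of
# nucleotides that precede the codon; its remainder modulo 3 gives the frame.
readingFrame <- function(pos) {
  ((pos - 1) %% 3) + 1
}

# Frames of the potential start codon (137) and stop codon (141)
frames <- readingFrame(c(137, 141))
frames

# A start and a stop codon can delimit an ORF only if they share a frame
sameframe <- readingFrame(137) == readingFrame(141)
sameframe
```

This is equivalent to the test used when pairing codons into ORFs, where the distance between the two codon positions must be divisible by 3.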
____________________________________________________________________________________________________________________
Find ORF in DNA
• Search for
– a potential start codon,
– followed by an integer number of codons,
– followed by a potential stop codon
– test all 3 frames
• Input 500bp of the DEN-1 virus
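The search described above can also be sketched frame by frame in base R. This is a simplified illustration under stated assumptions, not the lecture's function: the name findOrfsByFrame is hypothetical, the input is assumed to be a lower-case DNA string, only the forward strand is scanned, and every ATG is paired with the first in-frame stop codon after it (so two ATGs that share a stop codon are both reported).

```r
# Hypothetical helper: scan each of the three forward reading frames.
findOrfsByFrame <- function(sequence) {
  stops <- c("taa", "tag", "tga")
  result <- data.frame(start = integer(), stop = integer(), frame = integer())
  n <- nchar(sequence)
  for (frame in 1:3) {
    # Skip frames that do not contain a single complete codon
    if (n - frame + 1 < 3) next
    # Start positions of the consecutive codons in this frame
    codonstarts <- seq(frame, n - 2, by = 3)
    codons <- substring(sequence, codonstarts, codonstarts + 2)
    atgidx <- which(codons == "atg")
    stopidx <- which(codons %in% stops)
    for (a in atgidx) {
      # First stop codon downstream of this ATG in the same frame
      later <- stopidx[stopidx > a]
      if (length(later) > 0) {
        result <- rbind(result,
                        data.frame(start = codonstarts[a],
                                   stop = codonstarts[later[1]] + 2,
                                   frame = frame))
      }
    }
  }
  result
}

orfs <- findOrfsByFrame("aaaatgcagtaacccatgccc")
orfs
```

Splitting the sequence into codons per frame makes the "integer number of codons" condition automatic, at the cost of reading the sequence three times.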
____________________________________________________________________________________________________________________
Defining findORFsinSeq() 1 of 3
findORFsinSeq <- function(sequence)
{
require(Biostrings)
# Make vectors "positions" and "types" containing information on the
# positions of potential start and stop codons in the sequence:
mylist <- findPotentialStartsAndStops(sequence)
positions <- mylist[[1]]
types <- mylist[[2]]
# Make vectors "orfstarts" and "orfstops" to store the
# predicted start and stop codons of ORFs
orfstarts <- numeric()
orfstops <- numeric()
# Make a vector "orflengths" to store the lengths of the ORFs
orflengths <- numeric()
# Find the length of vector "positions"
numpositions <- length(positions)
____________________________________________________________________________________________________________________
Defining findORFsinSeq() 2 of 3
# There must be at least one start codon and one stop codon to have an ORF.
if (numpositions >= 2)
{
  for (i in 1:(numpositions-1))
  {
    posi <- positions[i]
    typei <- types[i]
    found <- 0
    while (found == 0)
    {
      for (j in (i+1):numpositions)
      {
        posj <- positions[j]
        typej <- types[j]
        posdiff <- posj - posi
        posdiffmod3 <- posdiff %% 3
        # Add in the length of the stop codon
        orflength <- posj - posi + 3
        # A start codon followed by an in-frame stop codon delimits an ORF
        if (typei == "atg" && (typej == "taa" || typej == "tag" || typej == "tga") && posdiffmod3 == 0)
        {
          # Check if we have already used the stop codon at posj+2 in an ORF
          numorfs <- length(orfstops)
          usedstop <- -1
          if (numorfs > 0)
          {
            for (k in 1:numorfs)
            {
              orfstopk <- orfstops[k]
              if (orfstopk == (posj + 2)) { usedstop <- 1 }
            }
          }
          if (usedstop == -1)
          {
            orfstarts <- append(orfstarts, posi, after=length(orfstarts))
            orfstops <- append(orfstops, posj+2, after=length(orfstops)) # Including the stop codon.
            orflengths <- append(orflengths, orflength, after=length(orflengths))
          }
          found <- 1
          break
        }
        if (j == numpositions) { found <- 1 }
      }
    }
  }
}
____________________________________________________________________________________________________________________
Defining findORFsinSeq() 3 of 3
# Sort the final ORFs by start position:
indices <- order(orfstarts)
orfstarts <- orfstarts[indices]
orfstops <- orfstops[indices]
# Find the lengths of the ORFs that we have
orflengths <- numeric()
numorfs <- length(orfstarts)
# seq_len() gives an empty loop when no ORF was found (1:0 would iterate over 1 and 0)
for (i in seq_len(numorfs))
{
orfstart <- orfstarts[i]
orfstop <- orfstops[i]
orflength <- orfstop - orfstart + 1
orflengths <- append(orflengths, orflength, after=length(orflengths))
}
mylist <- list(orfstarts, orfstops, orflengths)
return(mylist)
}
____________________________________________________________________________________________________________________
Find ORF start / end sites
s1 <- "aaaatgcagtaacccatgccc";
findORFsinSeq(s1);
[[1]]
[1] 4
[[2]]
[1] 12
[[3]]
[1] 9
The function returns a list of three vectors:
1. the start positions of the ORFs
2. the end positions of those ORFs
3. the lengths of the ORFs
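The returned coordinates can be used directly to extract the ORF. The following is a small sketch that takes the start (4) and end (12) positions from the output above and assumes nothing beyond base R; the variable names orfseq and codons are introduced here for illustration.

```r
s1 <- "aaaatgcagtaacccatgccc"
orfstart <- 4   # first element of the returned list
orfstop <- 12   # second element of the returned list

# Extract the ORF, including its start and stop codons
orfseq <- substring(s1, orfstart, orfstop)
orfseq

# A well-formed ORF has a length that is a multiple of 3
nchar(orfseq) %% 3 == 0

# Split the ORF into its codons
codons <- substring(orfseq,
                    seq(1, nchar(orfseq), by = 3),
                    seq(3, nchar(orfseq), by = 3))
codons
```

The extracted ORF is "atgcagtaa", that is the codons atg, cag and taa, which matches the reported length of 9.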
____________________________________________________________________________________________________________________
Practical 5 of 5:
comparing the genomes of
two different species
____________________________________________________________________________________________________________________
Comparison
• Comparative genomics
– Compares genomes
• two different species, or
• two different strains
• Do the two species have the same number of
genes?
• Which genes were gained or lost?
____________________________________________________________________________________________________________________
Tuning biomaRt
# Load the biomaRt package in R
library("biomaRt");
# List all databases that can be queried
listMarts();
1. The names and versions of the available databases are listed
2. Many different databases can be queried, including
WormBase, UniProt, Ensembl, etc.
____________________________________________________________________________________________________________________
Connect to database via biomaRt
• Once a particular Ensembl data set has been selected
– the query can be performed with the getBM() function
# Specify that we want to query the Ensembl Protists database
# (the mart name is release-specific; check the names reported by listMarts())
ensemblprotists <- useMart("protists_mart_29");
ensemblleishmania <- useDataset("lmajor_eg_gene",mart=ensemblprotists);
leishmaniaattributes <- listAttributes(ensemblleishmania);
attributenames <- leishmaniaattributes[[1]];
attributedescriptions <- leishmaniaattributes[[2]];
• A very long list of 292 features
– in the Leishmania major Ensembl data set
____________________________________________________________________________________________________________________
Query genes of Leishmania major
• Find all genes in Leishmania major
leishmaniagenes <- getBM(attributes = c("ensembl_gene_id"),
mart=ensemblleishmania);
# Show the first 10 rows (leishmaniagenes is a data frame with a single column)
head(leishmaniagenes, 10);
• Find only the protein-coding genes in Leishmania major
• “gene_biotype” specifies the type of sequence (e.g. protein-coding,
pseudogene, etc.):
leishmaniagenes2 <- getBM(attributes = c("ensembl_gene_id", "gene_biotype"),
mart=ensemblleishmania);
# Get the vector of the names of all L. major genes
leishmaniagenenames2 <- leishmaniagenes2[,1];
# Get the vector of the biotypes of all genes
leishmaniagenebiotypes2 <- leishmaniagenes2[,2];
table(leishmaniagenebiotypes2);
             ncRNA nontranslating_CDS     protein_coding         pseudogene
                84                  4               8308                 90
              rRNA             snoRNA              snRNA               tRNA
                63                741                  6                 83
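The table counts every biotype; to keep only the protein-coding genes, the data frame can be subset on the "gene_biotype" column. The sketch below uses a small mock data frame in place of the real getBM() result, so it runs without a database connection; the gene IDs are made up for illustration and are not real Leishmania major identifiers.

```r
# Mock stand-in for the data frame returned by getBM(); the IDs are hypothetical.
genes <- data.frame(
  ensembl_gene_id = c("geneA", "geneB", "geneC", "geneD"),
  gene_biotype = c("protein_coding", "tRNA", "protein_coding", "pseudogene"),
  stringsAsFactors = FALSE
)

# Keep only the rows whose biotype is "protein_coding"
proteincoding <- genes[genes$gene_biotype == "protein_coding", ]

# Vector of the names of the protein-coding genes, and how many there are
proteincodingnames <- proteincoding$ensembl_gene_id
length(proteincodingnames)
```

Applied to leishmaniagenes2, the same subsetting step would be expected to return the 8308 genes counted as protein_coding in the table.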
____________________________________________________________________________________________________________________
Query genes of Plasmodium falciparum
• Find protein coding genes in
– Plasmodium falciparum
ensemblpfalciparum <- useDataset("pfalciparum_eg_gene",mart=ensemblprotists);
pfalciparumgenes <- getBM(attributes = c("ensembl_gene_id",
"gene_biotype"), mart=ensemblpfalciparum);
# Get the names of the P. falciparum genes
pfalciparumgenenames <- pfalciparumgenes[[1]];
length(pfalciparumgenenames);
pfalciparumgenebiotypes <- pfalciparumgenes[[2]];
table(pfalciparumgenebiotypes);
             ncRNA nontranslating_cds     protein_coding
               712                 80               5349
              rRNA              snRNA               tRNA
                24                  3                 45
____________________________________________________________________________________________________________________
Comparing two species
• Plasmodium falciparum seems to have fewer
protein-coding genes (5349) than Leishmania
major (8308)
– Why? Possible explanations:
• there have been gene duplications in
the Leishmania major lineage
• completely new genes (not related to any
other Leishmania major gene) have arisen in the Leishmania major lineage
• genes have been lost from the Plasmodium
falciparum genome
____________________________________________________________________________________________________________________
Resources
• Online Tutorial on Sequence Alignment
– http://a-little-book-of-r-for-bioinformatics.readthedocs.org/en/latest/src/chapter4.html
• Graphical alignment of proteins
– http://www.itu.dk/~sestoft/bsa/graphalign.html
• Pairwise alignment of DNA and proteins using your rules:
– http://www.bioinformatics.org/sms2/pairwise_align_dna.html
• Documentation on libraries
– Biostrings: http://www.bioconductor.org/packages/2.10/bioc/manuals/Biostrings/man/Biostrings.pdf
– SeqinR: http://seqinr.r-forge.r-project.org/seqinr_2_0-7.pdf