Bioinformatics GBIO0009 -1 Biological Sequences Biological Sequences Comparison GBIO0009-1 Presented by Kirill Bessonov Oct 27, 2015 ____________________________________________________________________________________________________________________ Kirill Bessonov slide 1 Bioinformatics GBIO0009 -1 Biological Sequences Lecture structure 1. 2. 3. 4. Introduction Global Alignment: Needleman-Wunsch Local Alignment: Smith-Waterman Illustrations – Sequences identification via BLAST – Retrieval of sequences via UniProt and R – Alignment of sequences via R (seqinr) – Detection of ORFs with R – Comparing genomes of two species ____________________________________________________________________________________________________________________ Kirill Bessonov slide 2 Bioinformatics GBIO0009 -1 Biological Sequences Biological sequence • Single continuous molecule – DNA – RNA – protein ____________________________________________________________________________________________________________________ Kirill Bessonov slide 3 Bioinformatics GBIO0009 -1 Biological Sequences Biological problem • Given the DNA sequence AATCGGATGCGCGTAGGATCGGTAGGGTAGGCTTT AAGATCATGCTATTTTCGAGATTCGATTCTAGCTA • Answer – – – – – – Is it likely to be a gene? What is its possible expression level? What is the possible structure of the protein product? Can we get the protein? Can we figure out the key residues of the protein? …. ____________________________________________________________________________________________________________________ Kirill Bessonov slide 4 Bioinformatics GBIO0009 -1 Biological Sequences Alphabet – In the case of DNA • A, C, T, G – In the case of RNA • A,C, U, G – In the case of protein • 20 amino acids • Complete list is found here ____________________________________________________________________________________________________________________ Kirill Bessonov slide 5 Bioinformatics GBIO0009 -1 Biological Sequences Words • Short strings of letters from an alphabet • A word of length k is called a k-word or k-tuple • Examples: – 1-tuple: individual nucleotide – 2-tuple: dinucleotide – 3-tuple: codon ____________________________________________________________________________________________________________________ Kirill Bessonov slide 6 Bioinformatics GBIO0009 -1 Biological Sequences Patterns • Recognizing motifs, sites, signals, domains – functionally important regions – a conserved motif - consensus sequence – Often words (in bold) are used interchangeably • Gene starts with with an “ATG” codon – Identify # of potential gene start sites AATCGGATGCGCGTAGGATCGGTAGGGTAGGCTTTAAGATCATGCTATTTTCGAGATT CGATTCTAGCTAGGTTTAGCTTAGCTTAGTGCCAGAAATCGGATGCGCGTAGGATCGG TAGGGTAGGCTTTAAGATCATGCTATTTTCGAGATTCGATTCTAGCTAGGTTTTTAGT GCCAGAAATCGTTAGTGCCAGAAATCGATT ____________________________________________________________________________________________________________________ Kirill Bessonov slide 7 Bioinformatics GBIO0009 -1 Biological Sequences Probability • What is the probability distribution of the number of times a given base N (A,C,T or G) occurs in a random DNA sequence of length Ln? – Assume X1, …, Xn is the sequence of Ln – Count the number of times N appears – Calculate probabilities • P(Xi=N) = # {Xi=N}/ Ln • P(Xi= not N) = 1 – pN Ex: Given ATCCTACTGT Ln = 10 – P(Xi=A) = 2/10 = 0.2 – P(Xi=T or C or G) = 1 – 0.2 = 0.8 ____________________________________________________________________________________________________________________ Kirill Bessonov slide 8 Bioinformatics GBIO0009 -1 Biological Sequences Alignments ____________________________________________________________________________________________________________________ Kirill Bessonov slide 9 Bioinformatics GBIO0009 -1 Biological Sequences Biological context • Proteins may be multifunctional – Sequence determines protein function – Assumptions • Pairs of proteins with similar sequence also share similar biological function(s) ____________________________________________________________________________________________________________________ Kirill Bessonov slide 10 Bioinformatics GBIO0009 -1 Biological Sequences Comparing sequences • are important for a number of reasons. – used to establish evolutionary relationships among organisms – identification of functionally conserved sequences (e.g., DNA sequences controlling gene expression) • ‘TATAAT’ box ļ transcription initiation – develop models for human diseases • identify corresponding genes in model organisms (e.g. yeast, mouse), which can be genetically manipulated – E.g. gene knock outs / silencing ____________________________________________________________________________________________________________________ Kirill Bessonov slide 11 Bioinformatics GBIO0009 -1 Biological Sequences Comparing two sequences • There are two ways of pairwise comparison – Global using Needleman-Wunsch algorithm (NW) – Local using Smith-Waterman algorithm (SW) • Global alignment (NW) entire sequence • Alignment of the “whole” sequence perfect match • Local alignment (SW) • tries to align portions (e.g. motifs) • more flexible unaligned sequence – Considers sequences “parts” • works well on – highly divergent sequences aligned portion ____________________________________________________________________________________________________________________ Kirill Bessonov slide 12 Bioinformatics GBIO0009 -1 Biological Sequences Global alignment ____________________________________________________________________________________________________________________ Kirill Bessonov slide 13 Bioinformatics GBIO0009 -1 Biological Sequences Global alignment (NW) • Sequences are aligned end-to-end along their entire length • Many possible alignments are produced – The alignment with the highest score is chosen • Naïve algorithm is very inefficient (Oexp) – To align sequence of length 15, need to consider • (insertion, deletion, gap)15 = 315 = 1,4*107 – Impractical for sequences of length >20 nt • Used to analyze homology/similarity of – genes and proteins – between species ____________________________________________________________________________________________________________________ Kirill Bessonov slide 14 Bioinformatics GBIO0009 -1 Biological Sequences Methodology of global alignment (1 of 4) • Define scoring scheme for each event – mismatch between ai and bj • š šš , šš = −1 if šš ≠ šš – gap (insertion or deletion) • š šš , − = š −, šš = −2 – match between ai and bj • š šš , šš = +2 if šš = šš • Provide no restrictions on minimal score • Start completing the alignment MxN matrix ____________________________________________________________________________________________________________________ Kirill Bessonov slide 15 Bioinformatics GBIO0009 -1 Biological Sequences Methodology of global alignment (2 of 4) • The matrix should have extra column and row – M+1 columns , where M is the length sequence M – N+1 rows, where N is the length of sequence N 1. Initialize the matrix – introduce gap penalty at every initial position along rows and columns – Scores at each cell are cumulative W -2 H -4 A T -6 -2 -8 -2 -2 -2 0 -2 W -2 -2 H -4 -2 Y -6 ____________________________________________________________________________________________________________________ Kirill Bessonov slide 16 Bioinformatics GBIO0009 -1 Biological Sequences Methodology of global alignment (3 of 4) 2. Alignment possibilities Gap (horiz/vert) Match (W-W diag.) W H 0 -2 -4 -2 W H 0 -2 -4 +2 -2 W -2 -4 W -2 +2 3. Select the maximum score – Best alignment W H Y 0 -2 -4 -6 W -2 2 0 -2 H -4 0 4 2 A -6 -2 2 3 T -8 -4 0 1 ____________________________________________________________________________________________________________________ Kirill Bessonov slide 17 Bioinformatics GBIO0009 -1 Biological Sequences Methodology of global alignment (4 of 4) 4. Select the most very bottom right cell 5. Consider different path(s) going to very top left cell – How the next cell value was generated? From where? W H Y 0 -2 -4 -6 W -2 2 0 -2 H -4 0 4 2 A -6 -2 2 3 T -8 -4 0 1 WHAT WHYOverall score = 1 W H Y 0 -2 -4 -6 W -2 2 0 -2 H -4 0 4 2 A -6 -2 2 3 T -8 -4 0 1 WHAT WH-Y Overall score = 1 6. Select the best alignment(s) ____________________________________________________________________________________________________________________ Kirill Bessonov slide 18 Bioinformatics GBIO0009 -1 Biological Sequences Local alignment ____________________________________________________________________________________________________________________ Kirill Bessonov slide 19 Bioinformatics GBIO0009 -1 Biological Sequences Local alignment (SW) • Sequences are aligned to find regions where the best alignment occurs (i.e. highest score) • Assumes a local context (aligning parts of seq.) • Ideal for finding short motifs, DNA binding sites – helix-loop-helix (bHLH) - motif – TATAAT box (a famous promoter region) – DNA binding site • Works well on highly divergent sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 20 Bioinformatics GBIO0009 -1 Biological Sequences Methodology of local alignment (1 of 4) • The scoring system is similar with one exception – The minimum possible score in the matrix is zero – There are no negative scores in the matrix • Let’s define the scoring system as in global mismatch between seq. ai and bj š šš , šš = −š if šš ≠ šš gap (insertion or deletion) š šš , − = š −, šš = −š match between ai and bj š šš , šš = +š if šš = šš ____________________________________________________________________________________________________________________ Kirill Bessonov slide 21 Bioinformatics GBIO0009 -1 Biological Sequences Methodology of local alignment (2 of 4) • Construct the MxN alignment matrix with M+1 columns and N+1 rows • Initialize the matrix by introducing gap penalty at 1st row and 1st column 0 W 0 H 0 Y 0 W H A T 0 0 0 0 s(a,b) ≥ 0 (min value is zero) ____________________________________________________________________________________________________________________ Kirill Bessonov slide 22 Bioinformatics GBIO0009 -1 Biological Sequences Methodology of local alignment (3 of 4) • For each subsequent cell consider alignments – Vertical s(I, - ) – Horizontal s(-,J) – Diagonal s(I,J) • For each cell select the highest score – If score is negative ļ assign zero W H A T 0 0 0 0 0 W 0 2 0 0 0 H 0 0 4 2 0 Y 0 0 2 3 1 ____________________________________________________________________________________________________________________ Kirill Bessonov slide 23 Bioinformatics GBIO0009 -1 Biological Sequences Methodology of local alignment (4 of 4) • Select the initial cell with the highest score(s) • Consider different path(s) leading to score of zero – Trace-back the cell values – Look how the values were originated (i.e. path) I J W H Y 0 0 0 0 B • W 0 2 0 0 H 0 0 4 2 WH WH A 0 0 2 3 T 0 0 0 1 A total score of 4 Mathematically – where S(I, J) is the score for sub-sequences I and J ____________________________________________________________________________________________________________________ Kirill Bessonov slide 24 Bioinformatics GBIO0009 -1 Biological Sequences Local alignment illustration (1 of 2) • Determine the best local alignment and the maximum alignment score for • Sequence A: ACCTAAGG • Sequence B: GGCTCAATCA • Scoring conditions: – š šš , šš = +2 if šš = šš , – š šš , šš = −1 if šš ≠ šš and – š šš , − = š −, šš = −2 ____________________________________________________________________________________________________________________ Kirill Bessonov slide 25 Bioinformatics GBIO0009 -1 Biological Sequences Local alignment illustration (2 of 2) G G C T C A A T C A 0 0 0 0 0 0 0 0 0 0 0 A 0 0 0 0 0 0 0 2 2 0 0 2 C 0 0 0 2 0 2 0 1 1 2 0 C 0 0 0 2 1 2 1 0 0 2 3 1 T 0 0 0 0 4 2 1 0 2 0 1 1 A 0 0 0 0 2 3 4 3 1 1 2 A 0 0 0 0 0 1 5 6 4 2 3 G 0 2 2 0 0 0 3 4 5 3 1 G 0 2 4 1 0 0 1 2 3 4 2 ____________________________________________________________________________________________________________________ Kirill Bessonov slide 26 Bioinformatics GBIO0009 -1 Biological Sequences Local alignment illustration (3 of 3) A C C T A A G G 0 0 0 0 0 0 0 0 0 G 0 0 0 0 0 0 0 2 2 G 0 0 0 0 0 0 0 2 4 C 0 0 2 2 0 0 0 0 1 T 0 0 0 1 4 2 0 0 0 CTCAA CT-AA Best score: 6 locally C 0 0 2 2 2 3 1 0 0 A 0 2 0 1 1 4 5 3 1 A 0 2 1 0 0 3 6 4 2 T 0 0 1 0 2 1 4 5 3 C 0 0 2 3 1 1 2 3 4 A 0 2 0 1 1 2 3 1 2 GGCTCAATCA ACCT-AAGG in the whole seq. context (globally) ____________________________________________________________________________________________________________________ Kirill Bessonov slide 27 Bioinformatics GBIO0009 -1 Biological Sequences Aligning proteins Globally and Locally ____________________________________________________________________________________________________________________ Kirill Bessonov slide 28 Bioinformatics GBIO0009 -1 Biological Sequences Biological context • Find common functional units – Structural motifs • Helix-loop-helix • Zinc finger •… • Phylogeny – Distance between species ____________________________________________________________________________________________________________________ Kirill Bessonov slide 29 Bioinformatics GBIO0009 -1 Biological Sequences Protein Alignment • Protein local and global alignment – follows the same rules as we saw with DNA/RNA • Differences (ā) – alphabet of proteins is 22 residues (aa) long – scoring/substitution matrices used (BLOSUM) • protein proprieties are taken into account – residues that are totally different due to charge such as polar Lysine and apolar Glycine are given a low score ____________________________________________________________________________________________________________________ Kirill Bessonov slide 30 Bioinformatics GBIO0009 -1 Biological Sequences Substitution matrices • Protein sequences are more complex – matrices = collection of scoring rules • Matrices over events such as – mismatch and perfect match • Need to define gap penalty separately • E.g. BLOcks SUbstitution Matrix (BLOSUM) ____________________________________________________________________________________________________________________ Kirill Bessonov slide 31 Bioinformatics GBIO0009 -1 Biological Sequences BLOSUM-x matrices • Constructed from aligned sequences with specific x% similarity – matrix built using sequences with no more then 50% similarity is called BLOSUM-50 • For highly mutating / dissimilar sequences use – BLOSUM-45 and lower • For highly conserved / similar sequences use – BLOSUM -62 and higher ____________________________________________________________________________________________________________________ Kirill Bessonov slide 32 Bioinformatics GBIO0009 -1 Biological Sequences BLOSUM 62 • What diagonal represents?perfect match between a.a. • What is the score for substitution Eļ D (acid a.a.)? Score = 2 • More drastic substitution Kļ I (basic to non-polar)? Score = -3 ____________________________________________________________________________________________________________________ Kirill Bessonov slide 33 Bioinformatics GBIO0009 -1 Biological Sequences Practical problem: Align following sequences both globally and locally using BLOSUM 62 matrix with gap penalty of -8 Sequence A: AAEEKKLAAA Sequence B: AARRIA ____________________________________________________________________________________________________________________ Kirill Bessonov slide 34 Bioinformatics GBIO0009 -1 Biological Sequences Aligning globally using BLOSUM 62 A A R R I A 0 -8 -16 -24 -32 -40 -48 A -8 4 -4 -12 -20 -28 -36 A -16 -4 8 0 -8 -16 -24 E -24 -12 0 8 0 -8 -16 E -32 -20 -8 0 8 0 -8 K -40 -28 -16 -6 2 5 -1 K -48 -36 -24 -14 -4 -1 4 L -56 -44 -32 -22 -12 -2 -2 A -64 -52 -40 -30 -20 -10 2 A -72 -60 -48 -38 -28 -18 -6 A -80 -68 -56 -46 -36 -26 -14 AAEEKKLAAA AA--RRIA-Score: -14 Other alignment options? Yes ____________________________________________________________________________________________________________________ Kirill Bessonov slide 35 Bioinformatics GBIO0009 -1 Biological Sequences Aligning locally using BLOSUM 62 A A E E K K L A A A 0 0 0 0 0 0 0 0 0 0 0 A 0 4 4 0 0 0 0 0 4 4 4 A 0 4 8 3 0 0 0 0 4 8 8 R 0 0 3 8 3 2 2 0 0 3 7 R 0 0 0 3 8 5 4 0 0 0 2 I 0 0 0 0 0 5 2 6 0 0 0 A 0 4 4 0 0 0 4 1 10 4 4 KKLA RRIA Score: 10 ____________________________________________________________________________________________________________________ Kirill Bessonov slide 36 Bioinformatics GBIO0009 -1 Biological Sequences Practical 1 of 5: Using BLAST for sequence identification ____________________________________________________________________________________________________________________ Kirill Bessonov slide 37 Bioinformatics GBIO0009 -1 Biological Sequences BLAST • Basic Local Alignment Search Tool • Many different types • http://blast.ncbi.nlm.nih.gov/Blast.cgi ____________________________________________________________________________________________________________________ Kirill Bessonov slide 38 Bioinformatics GBIO0009 -1 Biological Sequences Types • blastn – nucleotide query vs nucleotide database • blastp – protein query vs protein DB • blastx – translated in 6 frames nucleotide query vs protein DB ____________________________________________________________________________________________________________________ Kirill Bessonov slide 39 Bioinformatics GBIO0009 -1 Biological Sequences Sequence identity • Want to know which genes are coded by the genomic sequence • >human_genomic_seq TGGACTCTGCTTCCCAGACAGTACCCCTGACAGTGACAGAACTGCCACTCTCCCCACCTG ACCCTGTTAGGAAGGTACAACCTATGAAGAAAAAGCCAGAATACAGGGGACATGTGAGCC ACAGACAACACAAGTGTGCACAACACCTCTGAGCTGAGCTTTTCTTGATTCAAGGGCTAG TGAGAACGCCCCGCCAGAGATTTACCTCTGGTCTTCTGAGGTTGAGGGCTCGTTCTCTCT TCCTGAATGTAAAGGTCAAGATGCTGGGCCTCAGTTTCCTCTTACATACTCACCAAAAGG CTCTCCTGATCAGAGAAGCAGGATGCTGCACTTGTCCTCCTGTCGATGCTCTTGGCTATG ACAAAATCTGAGCTTACCTTCTCTTGCCCACCTCTAAACCCCATAAGGGCTTCGTTCTGT GTCTCTTGAGAATGTCCCTATCTCCAACTCTGTCATACGGGGGAGAGCGAGTGGGAAGGA TCCAGGGCAGGGCTCAGACCCCGGCGCATGGACCTAGTCGGGGGCGCTGGCTCAGCCCC GCCCCGCGCGCCCCCGTCGCAGCCGACGCGCGCTCCCGGGAGGCGGCGGCAGAGGCAG CATCCACAGCATCAGCAGCCTCAGCTTCATCCCCGGGCGGTCTCCGGCGGGGAAGGCCGG TGGGACAAACGGACAGAAGGCAAAGTGCCCGCAATGGAGGGAGCATCCTTTGGCGCGG GCCGTGCGGGAGCTGCCTTTGATCCCGTGAGCTTTGCGCGGCGGCCCCAGACCCTGTTGC GGGTCGTGTCCTGG ____________________________________________________________________________________________________________________ Kirill Bessonov slide 40 Bioinformatics GBIO0009 -1 Biological Sequences BLAST GUI ____________________________________________________________________________________________________________________ Kirill Bessonov slide 41 Bioinformatics GBIO0009 -1 Biological Sequences Results Top hit • Mus musculus targeted KO-first, conditional ready, lacZ-tagged mutant allele Tbl3:tm1a(EUCOMM)Hmgu; transgenic ____________________________________________________________________________________________________________________ Kirill Bessonov slide 42 Bioinformatics GBIO0009 -1 Biological Sequences Practical 2 of 5 : Sequence Retrieval and Analysis via R (seqinr) ____________________________________________________________________________________________________________________ Kirill Bessonov slide 43 Bioinformatics GBIO0009 -1 Biological Sequences Protein database • UniProt database (http://www.uniprot.org/) has high quality protein data manually curated • It is manually curated • Each protein is assigned UniProt ID ____________________________________________________________________________________________________________________ Kirill Bessonov slide 44 Bioinformatics GBIO0009 -1 Biological Sequences Manual retrieval from • In search field one can enter either use UniProt ID or common protein name – example: myelin basic protein Uniprot ID • We will use retrieve data for P02686 ____________________________________________________________________________________________________________________ Kirill Bessonov slide 45 Bioinformatics GBIO0009 -1 Biological Sequences FASTA format • FASTA format is widely used and has the following parameters – Sequence name start with > sign – The fist line corresponds to protein name Actual protein sequence starts from 2nd line ____________________________________________________________________________________________________________________ Kirill Bessonov slide 46 Bioinformatics GBIO0009 -1 Biological Sequences Retrieving protein data with R and SeqinR • Can “talk” programmatically to UniProt database using R and seqinR library – seqinR library is suitable for • “Biological Sequences Retrieval and Analysis” • Detailed manual could be found here – Install this library in your R environment install.packages("seqinr") library("seqinr") – Choose database to retrieve data from choosebank("swissprot") – Download data object for target protein (P02686) MBP_HUMAN = query("MBP_HUMAN", "AC=P02686") – See sequence of the object MBP_HUMAN MBP_HUMAN_seq = getSequence(MBP_HUMAN); MBP_HUMAN_seq ____________________________________________________________________________________________________________________ Kirill Bessonov slide 47 Bioinformatics GBIO0009 -1 Biological Sequences Dot Plot (comparison of 2 sequences) (1of2) • Each sequence plotted on vertical or horizontal dimension – If two a.a. from two sequences at given positions are identical the dot is plotted – matching sequence segments appear as diagonal lines (that could be parallel to the absolute diagonal line if insertion or gap is present) ____________________________________________________________________________________________________________________ Kirill Bessonov slide 48 Bioinformatics GBIO0009 -1 Biological Sequences Dot Plot (comparison of 2 sequences) (2of2) INSERTION in MBP-Human or GAP in MBP-Mous • Let’s compare two protein sequences – Human MBP (Uniprot ID: P02686) – Mouse MBP (Uniprot ID: P04370) • Download 2nd mouse sequence MBP_MOUSE = query("MBP_MOUSE", "AC=P04370"); MBP_MOUSE_seq = getSequence(MBP_MOUSE); Breaks in diagonal line = regions of dissimilarity Shift in diagonal line (identical regions) • Visualize dot plot dotPlot(MBP_HUMAN_seq[[1]], MBP_MOUSE_seq[[1]],xlab="MBP - Human", ylab = "MBP - Mouse") - Is there similarity between human and mouse form of MBP protein? - Where is the difference in the sequence between the two isoforms? ____________________________________________________________________________________________________________________ Kirill Bessonov slide 49 Bioinformatics GBIO0009 -1 Biological Sequences Practical 3 of 5: Pairwise global and local alignments via R and Biostrings ____________________________________________________________________________________________________________________ Kirill Bessonov slide 50 Bioinformatics GBIO0009 -1 Biological Sequences Installing Biostrings library • Install library from Bioconductor source("http://bioconductor.org/biocLite.R") biocLite("Biostrings") library(Biostrings) • Define substitution martix (e.g. for DNA) DNA_subst_matrix = nucleotideSubstitutionMatrix(match = 2, mismatch = -1, baseOnly = TRUE) • The scoring rules DNA_subst_matrix – Match: š šš , šš = 2 if šš = šš – Mismatch : š šš , šš = -1 if šš ≠ šš – Gap: š šš , − = -2 or š −, šš = -2 ____________________________________________________________________________________________________________________ Kirill Bessonov slide 51 Bioinformatics GBIO0009 -1 Biological Sequences Global alignment using R and Biostrings • Create two sting vectors (i.e. sequences) seqA = "GATTA" seqB = "GTTA" • Use pairwiseAlignment() and the defined rules globalAlignAB = pairwiseAlignment(seqA, seqB, substitutionMatrix = DNA_subst_matrix, gapOpening = -2, gapExtension=0, scoreOnly = FALSE, type="global") • Visualize best paths (i.e. alignments) globalAlignAB Global PairwiseAlignedFixedSubject (1 of 1) pattern: [1] GATTA subject: [1] G-TTA score: 6 ____________________________________________________________________________________________________________________ Kirill Bessonov slide 52 Bioinformatics GBIO0009 -1 Biological Sequences Local alignment using R and Biostrings • Input two sequences seqA = "AGGATTTTAAAA" seqB = "TTTT" • The scoring rules will be the same as we used for global alignment localAlignAB = pairwiseAlignment(seqA, seqB, substitutionMatrix = DNA_subst_matrix, gapOpening = -2, scoreOnly = FALSE, type="local") • Visualize alignment localAlignAB Local PairwiseAlignedFixedSubject (1 of 1) pattern: [5] TTTT subject: [1] TTTT score: 8 ____________________________________________________________________________________________________________________ Kirill Bessonov slide 53 Bioinformatics GBIO0009 -1 Biological Sequences Aligning protein sequences • Protein sequences alignments are very similar except the substitution matrix is specified data(BLOSUM62) BLOSUM62 • Will align sequences seqA = "PAWHEAE" seqB = "HEAGAWGHEE" • Execute the global alignment globalAlignAB <- pairwiseAlignment(seqA, seqB, substitutionMatrix = "BLOSUM62", gapOpening = -2, gapExtension = -8, scoreOnly = FALSE) ____________________________________________________________________________________________________________________ Kirill Bessonov slide 54 Bioinformatics GBIO0009 -1 Biological Sequences Practical 4 of 5: Open Reading Frame (ORF) identification via R (i.e. Gene Finding) ____________________________________________________________________________________________________________________ Kirill Bessonov slide 55 Bioinformatics GBIO0009 -1 Biological Sequences Gene expression sequences • Central dogma – DNA ļ RNA ļ protein • Each codon codes for one amino acid (a.a.) – residue = amino acid • mRNA polymerase II – Reads from 5’ to 3’ direction – 3 nucleotides code for 1 a.a. • In the DNA context – Start codon: ATG – Stop codon: TAA, TGA, TAG ____________________________________________________________________________________________________________________ Kirill Bessonov slide 56 Bioinformatics GBIO0009 -1 Biological Sequences mRNA ļ Protein alphabet • Codon table: 3 nucleotides code for 1 a.a. ____________________________________________________________________________________________________________________ Kirill Bessonov slide 57 Bioinformatics GBIO0009 -1 Biological Sequences Find ORF via R source("https://bioconductor.org/biocLite.R"); biocLite("Biostrings"); library("Biostrings"); s1 <- "aaaatgcagtaacccatgccc"; # Find all ATGs in the sequence s1 matchPattern("atg", s1); start end width [1] 4 6 [2] 16 18 3 [atg] 3 [atg] 1) #ORFs: there are two “ATG”s in the sequence 2) Positions: nucleotides 4-6, and at nucleotides 16-18 ____________________________________________________________________________________________________________________ Kirill Bessonov slide 58 Bioinformatics GBIO0009 -1 Biological Sequences Find stop codons s1 <- "aaaatgcagtaacccatgccc"; stop.codons <- c("taa", "tga", "tag"); sapply(stop.codons,function(x){matchPattern(x,s1)}); $taa Views on a 21-letter BString subject subject: aaaatgcagtaacccatgccc views: start end width [1] 10 12 3 [taa] $tga Views on a 21-letter BString subject subject: aaaatgcagtaacccatgccc views: NONE $tag Views on a 21-letter BString subject subject: aaaatgcagtaacccatgccc views: NONE ____________________________________________________________________________________________________________________ Kirill Bessonov slide 59 Bioinformatics GBIO0009 -1 Biological Sequences Dengue virus example • Want to find ORFs in Dengue virus – find all potential start and stop codons – 500 nucleotides of the genome sequence • the DEN-1 Dengue virus (NCBI accession NC_001477) ____________________________________________________________________________________________________________________ Kirill Bessonov slide 60 Bioinformatics GBIO0009 -1 Biological Sequences Define getncbiseq() getncbiseq <- function(accession) { require("seqinr") # this function requires the SeqinR R package # first find which ACNUC database the accession is stored in: dbs <- c("genbank","refseq","refseqViruses","bacterial") numdbs <- length(dbs) for (i in 1:numdbs) { db <- dbs[i] choosebank(db) # check if the sequence is in ACNUC database 'db': resquery <- try(query(".tmpquery", paste("AC=", accession)), silent = TRUE) if (!(inherits(resquery, "try-error"))) { queryname <- "query2" thequery <- paste("AC=",accession,sep="") query2 <-query(`queryname`,`thequery`); # see if a sequence was retrieved: seq <- getSequence(query2$req[[1]]) closebank() return(seq); } closebank() } print(paste("ERROR: accession",accession,"was not found")) } dengueseq <- getncbiseq("NC_001477"); ____________________________________________________________________________________________________________________ Kirill Bessonov slide 61 Bioinformatics GBIO0009 -1 Biological Sequences Select portion of the sequence • Take portion of the Dengue virus library("seqinr"); # Take the first 500 nucleotides dengueseqstart <- dengueseq[1:500] dengueseqstartstring <- c2s(dengueseqstart); ____________________________________________________________________________________________________________________ Kirill Bessonov slide 62 Bioinformatics GBIO0009 -1 Biological Sequences Define findPotentialStartsAndStops() findPotentialStartsAndStops <- function(sequence) { # Define a vector with the sequences of potential start and stop codons codons <- c("atg", "taa", "tag", "tga") # Find the number of occurrences of each type of potential start or stop codon for (i in 1:4) { codon <- codons[i] # Find all occurrences of codon "codon" in sequence "sequence" occurrences <- matchPattern(codon, sequence) # Find the start positions of all occurrences of "codon" in sequence "sequence" codonpositions <- attr(occurrences,"ranges")@start # Find the total number of potential start and stop codons in sequence "sequence" numoccurrences <- length(codonpositions) if (i == 1) { # Make a copy of vector "codonpositions" called "positions" positions <- codonpositions # Make a vector "types" containing "numoccurrences" copies of "codon" types <- rep(codon, numoccurrences) } else { # Add the vector "codonpositions" to the end of vector "positions": positions <- append(positions, codonpositions, after=length(positions)) # Add the vector "rep(codon, numoccurrences)" to the end of vector "types": types <- append(types, rep(codon, numoccurrences), after=length(types)) } } # Sort the vectors "positions" and "types" in order of position along the input sequence: indices <- order(positions) positions <- positions[indices] types <- types[indices] # Return a list variable including vectors "positions" and "types": mylist <- list(positions,types) return(mylist) } ____________________________________________________________________________________________________________________ Kirill Bessonov slide 63 Bioinformatics GBIO0009 -1 Biological Sequences Find ORFs Sequence has 11 start and 22 stop codons "agttgttagtctacgtggaccgacaagaacagtttcgaatcggaagcttgc ttaacgtagttctaacagttttttattagagagcagatctctgatgaacaac caacggaaaaagacgggtcgaccgtct … “ findPotentialStartsAndStops(dengueseqstartstring); [[1]] [1] 7 53 58 64 78 93 95 96 137 141 224 225 234 236 246 255 264 295 298 318 365 369 375 377 378 399 404 413 444 470 471 474 478 [[2]] [1] "tag" "taa" "tag" "taa" "tag" "tga" "atg" "tga" "atg" "tga" "atg" "tga" "tga" "atg" "tag" "taa" "tag" "tag" "atg" "atg" "atg" "tga" "taa" "atg" "tga" "tga" "atg" "atg" "tga" "atg" "tga" "tag" "tag" ____________________________________________________________________________________________________________________ Kirill Bessonov slide 64 Bioinformatics GBIO0009 -1 Biological Sequences Thee reading frames • three different possible reading frames – +1 - integer number of triplets (codons) – +2 - integer number of triplets, plus 1 nucleotide – +3 - integer number of triplets, plus 2 nucleotides ____________________________________________________________________________________________________________________ Kirill Bessonov slide 65 Bioinformatics GBIO0009 -1 Biological Sequences Identify reading frame • Get sub-sequence substring(dengueseqstartstring,137,143); [1] "atgctga" • starts – a potential start codon (ATG) • ends – with a potential stop codon (TGA) • there is – an integer number of triplets – plus one nucleotide – frame +2 ____________________________________________________________________________________________________________________ Kirill Bessonov slide 66 Bioinformatics GBIO0009 -1 Biological Sequences Find ORF in DNA • Search for – a potential start codon, – followed by an integer number of codons, – followed by a potential stop codon – test all 3 frames • Input 500bp of the DEN-1 virus ____________________________________________________________________________________________________________________ Kirill Bessonov slide 67 Bioinformatics GBIO0009 -1 Biological Sequences Defining findORFsinSeq() 1of3 findORFsinSeq <- function(sequence) { require(Biostrings) # Make vectors "positions" and "types" containing information on the positions of ATGs in the sequence: mylist <- findPotentialStartsAndStops(sequence) positions <- mylist[[1]] types <- mylist[[2]] # Make vectors "orfstarts" and "orfstops" to store the predicted start and stop codons of ORFs orfstarts <- numeric() orfstops <- numeric() # Make a vector "orflengths" to store the lengths of the ORFs orflengths <- numeric() # Print out the positions of ORFs in the sequence: # Find the length of vector "positions" numpositions <- length(positions) ____________________________________________________________________________________________________________________ Kirill Bessonov slide 68 Bioinformatics GBIO0009 -1 Biological Sequences Defining findORFsinSeq()2of3 # There must be at least one start codon and one stop codon to have an ORF. if (numpositions >= 2) { for (i in 1:(numpositions-1)) { posi <- positions[i] typei <- types[i] found <- 0 while (found == 0) { for (j in (i+1):numpositions) { posj <- positions[j] typej <- types[j] posdiff <- posj - posi posdiffmod3 <- posdiff %% 3 # Add in the length of the stop codon orflength <- posj - posi + 3 if (typei == "atg" && (typej == "taa" || typej == "tag" || typej == "tga") && posdiffmod3 == 0) { # Check if we have already used the stop codon at posj+2 in an ORF numorfs <- length(orfstops) usedstop <- -1 if (numorfs > 0) { for (k in 1:numorfs) { orfstopk <- orfstops[k] if (orfstopk == (posj + 2)) { usedstop <- 1 } } } if (usedstop == -1) { orfstarts <- append(orfstarts, posi, after=length(orfstarts)) orfstops <- append(orfstops, posj+2, after=length(orfstops)) # Including the stop codon. orflengths <- append(orflengths, orflength, after=length(orflengths)) } found <- 1 break } if (j == numpositions) { found <- 1 } } } } } ____________________________________________________________________________________________________________________ Kirill Bessonov slide 69 Bioinformatics GBIO0009 -1 Biological Sequences Defining findORFsinSeq() 3of3 # Sort the final ORFs by start position: indices <- order(orfstarts) orfstarts <- orfstarts[indices] orfstops <- orfstops[indices] # Find the lengths of the ORFs that we have orflengths <- numeric() numorfs <- length(orfstarts) for (i in 1:numorfs) { orfstart <- orfstarts[i] orfstop <- orfstops[i] orflength <- orfstop - orfstart + 1 orflengths <append(orflengths,orflength,after=length(orflengths)) } mylist <- list(orfstarts, orfstops, orflengths) return(mylist) } ____________________________________________________________________________________________________________________ Kirill Bessonov slide 70 Bioinformatics GBIO0009 -1 Biological Sequences Find ORF start / end sites s1 <- "aaaatgcagtaacccatgccc"; findORFsinSeq(s1); [[1]] [1] 4 [[2]] [1] 12 [[3]] [1] 9 1. the start positions of ORFs 2. a vector of the end positions of those ORFs 3. a vector containing the lengths of the ORFs. ____________________________________________________________________________________________________________________ Kirill Bessonov slide 71 Bioinformatics GBIO0009 -1 Biological Sequences Practical 5 of 5: comparing the genomes of two different species ____________________________________________________________________________________________________________________ Kirill Bessonov slide 72 Bioinformatics GBIO0009 -1 Biological Sequences Comparison • Comparative genomics – Compares genomes • two different species, or • two different strains • Do the two species have the same number of genes? • Which genes were gained or lost? ____________________________________________________________________________________________________________________ Kirill Bessonov slide 73 Bioinformatics GBIO0009 -1 Biological Sequences Tuning biomaRt # Load the biomaRt package in R library("biomaRt"); # List all databases that can be queried listMarts(); 1. The names of the databases are listed and version 2. Can query many different databases including WormBase, UniProt, Ensembl, etc. ____________________________________________________________________________________________________________________ Kirill Bessonov slide 74 Bioinformatics GBIO0009 -1 Biological Sequences Connect to database via biomaRt • Once the particular Ensembl data is set – can perform the query using the getBM() function # Specify that we want to query the Ensembl Protists database ensemblprotists <- useMart("protists_mart_29"); ensemblleishmania <- useDataset("lmajor_eg_gene",mart=ensemblprotists); leishmaniaattributes <- listAttributes(ensemblleishmania); attributenames <- leishmaniaattributes[[1]]; attributedescriptions <- leishmaniaattributes[[2]]; • A very long list of 292 features – in the Leishmania major Ensembl data set ____________________________________________________________________________________________________________________ Kirill Bessonov slide 75 Bioinformatics GBIO0009 -1 Biological Sequences Query genes of Leishmania major • Find all genes in Leishmania major leishmaniagenes <- getBM(attributes = c("ensembl_gene_id"), mart=ensemblleishmania); leishmaniagenes[1:10]; • Find only protein coding genes in Leishmania major , • “gene_biotype” – specifies type of sequence (eg. protein-coding, pseudogene, etc.): leishmaniagenes2 <- getBM(attributes = c("ensembl_gene_id", "gene_biotype"), mart=ensemblleishmania); # Get the vector of the names of all L. major genes leishmaniagenenames2 <- leishmaniagenes2[,1]; # Get the vector of the biotypes of all genes leishmaniagenebiotypes2 <- leishmaniagenes2[,2]; table(leishmaniagenebiotypes2); ncRNA nontranslating_CDS 84 4 rRNA snoRNA 63 741 protein_coding 8308 snRNA 6 pseudogene 90 tRNA 83 ____________________________________________________________________________________________________________________ Kirill Bessonov slide 76 Bioinformatics GBIO0009 -1 Biological Sequences Query genes of Plasmodium falciparum • Find protein coding genes in – Plasmodium falciparum ensemblpfalciparum <useDataset("pfalciparum_eg_gene",mart=ensemblprotists); pfalciparumgenes <- getBM(attributes = c("ensembl_gene_id", "gene_biotype"), mart=ensemblpfalciparum); # Get the names of the P. falciparum genes pfalciparumgenenames <- pfalciparumgenes[[1]]; length(pfalciparumgenenames); pfalciparumgenebiotypes <- pfalciparumgenes[[2]]; table(pfalciparumgenebiotypes); ncRNA nontranslating_cds 712 80 rRNA snRNA 24 3 protein_coding 5349 tRNA 45 ____________________________________________________________________________________________________________________ Kirill Bessonov slide 77 Bioinformatics GBIO0009 -1 Biological Sequences Comparing two species • Plasmodium falciparum seems to have less protein-coding genes (5349) than Leishmania major (8308) – Why? • that there have been gene duplications in the Leishmania major lineage • that completely new genes (that are not related to any other Leishmania major gene) have arisen • that there have been genes lost from the Plasmodium falciparum genome ____________________________________________________________________________________________________________________ Kirill Bessonov slide 78 Bioinformatics GBIO0009 -1 Biological Sequences ____________________________________________________________________________________________________________________ Kirill Bessonov slide 79 Bioinformatics GBIO0009 -1 Biological Sequences Resources • Online Tutorial on Sequence Alignment – http://a-little-book-of-r-forbioinformatics.readthedocs.org/en/latest/src/chapter4.html • Graphical alignment of proteins – http://www.itu.dk/~sestoft/bsa/graphalign.html • Pairwise alignment of DNA and proteins using your rules: – http://www.bioinformatics.org/sms2/pairwise_align_dna.html • Documentation on libraries – Biostings: http://www.bioconductor.org/packages/2.10/bioc/manuals/Biostrings/man/Biostrings.pdf – SeqinR: http://seqinr.r-forge.r-project.org/seqinr_2_0-7.pdf ____________________________________________________________________________________________________________________ Kirill Bessonov slide 80