Sequence Analysis
Hemant Kelkar
Center for Bioinformatics
University of North Carolina
Chapel Hill, NC 27599

Scope of Series
Talk I
• Overview and BLAST
Talk II
• Protein analysis/Sequence Alignment
Talk III
• Evolution
• Genomics and challenges

Bioinformatics
• Mathematical, statistical and computational methods that are used for solving biological problems
• Glue that holds the “omics” data together

Help …
• Is “my sequence” in the databases?
• Is it similar to any sequence in the DB?
• Does it have any known motifs/domains that can help in identification?
• Is there a structural homolog?
• Are there any polymorphisms?
• Genetic map location?
Bioinformatics TOOLS!

Bioinformatics Tools
• Genetic Code: similarity search, e.g. BLAST, FASTA
• Protein Structure: http://restools.sdsc.edu/biotools/biotools9.html
• Protein Evolution: e.g. CLUSTALW, T-COFFEE, Phylip

Primary Sequence Databases
• GenBank (http://www.ncbi.nlm.nih.gov/Genbank/index.html)
• PIR (http://pir.georgetown.edu/)
• Swiss-Prot (http://us.expasy.org/sprot/)
Sequence information as generated in the laboratory

Derived Sequence Databases
Databases based on functional or phylogenetic analysis
• PFAM (http://www.sanger.ac.uk/Software/Pfam/): Protein families based on hidden Markov models (HMMs)
• InterPro (http://www.ebi.ac.uk/interpro/): Protein families and domains based on functional sites
• TransFac (http://www.gene-regulation.com/): Transcription factor database
• Cytochrome P450 database (http://drnelson.utmem.edu/CytochromeP450.html)

Derived Sequence Databases
Databases based on taxonomy
• Flybase (http://www.flybase.org/): Fly genome
• Wormbase (http://www.wormbase.org/): C.
elegans
• Genome Browser (http://genome.ucsc.edu/): Human and Mouse
• MGI (http://www.informatics.jax.org/): Mouse
• Microbial Genome Resource (http://www.tigr.org/tigrscripts/CMR2/CMRHomePage.spl)

Sequence Alignments
• Provide a measure of the relationship between nucleotide or protein sequences
• This allows us to decipher:
  Structural relationships
  Functional relationships
  Evolutionary relationships

Sequence Similarity Searches
• Information conserved evolutionarily
• DNA sequences NOT coding for proteins/rRNAs diverge rapidly
• When possible use protein sequences for similarity searches
• Non-homologous protein identification is much less reliable
• What is measured and what is inferred?

Similarity
• Is always based on an observable
• Usually expressed as % identity
• Quantifies the divergence of two sequences
  • Substitutions/insertions/deletions
• Residues crucial for structure and/or function

Homology
• Homology always implies that the molecules share a common ancestor
• Absolute answer
• Molecules ARE or ARE NOT homologous
• No degrees

How to Find Similar Sequences
• Global Sequence Alignments
  • Sequence comparison along entire length
  • Homolog of similar length
• Local Sequence Alignments
  • Similar regions in two sequences
  • Regions outside the local alignment excluded
  • Sequences of different length/similarity

Dotplot

Scoring Matrices
• Empirical weighting schemes
• Consider important biology
  • Side chain chemistry/structure/function
• Functional/Structural Conservation
  • Ile/Val – small and hydrophobic
  • Ser/Thr – both polar
• Size/Charge/Hydrophobicity

Nucleotide Matrix
     A   C   G   T
A    5  -4  -4  -4
C   -4   5  -4  -4
G   -4  -4   5  -4
T   -4  -4  -4   5

PAM Scoring Matrices
• Margaret Dayhoff (1978)
• Point accepted mutations (PAM)
• Patterns of substitutions in highly related proteins (>85% identical), based on multiple sequence alignments
• New side chains must function similarly
• 1 PAM = 1 AA change per 100 AA
• 1 PAM ~ 1% divergence

BLOSUM Matrices
• Henikoff and Henikoff
(1992)
• Blocks Substitution Matrices
• Differences in conserved ungapped regions
• Directly calculated, no extrapolations
• Sensitive to structural/functional substitutions
• Generally perform better for local similarity searches

Scoring Matrix – BLOSUM62

BLOSUM n
• Calculated from sequences sharing no more than n% identity
• Sequences with more than n% identity are clustered and weighted to 1
• Reducing the value of “n” yields more divergent/distantly related sequences
• BLOSUM62 used as default by many of the online search sites

Matrices and more
PAM Matrices (Altschul, 1991)
PAM 40      Short alignments                   >70%
PAM 120                                        >50%
PAM 250     Longer, weaker local areas         >30%
BLOSUM Matrices (Henikoff, 1993)
BLOSUM 90   Short alignments                   >60%
BLOSUM 80                                      >50%
BLOSUM 62   Commonly used                      >35%
BLOSUM 30   Longer, weaker local alignments

Gaps
• Compensate for insertions and deletions
• Improve alignments
• Must be kept to a reasonably small number
• 1 per 20 residues is logical
• Need a different scoring scheme

Gap Penalties
• Penalty for gap introduction
• Penalty for gap extension
Deduction for gap = G + Ln, where
                               Nuc   Prot
G = gap-opening penalty          5     11
L = gap-extension penalty        2      1
n = length of gap

BLAST
• Basic Local Alignment Search Tool
• Seeks high-scoring segment pairs (HSPs)
  • Sequences that can be aligned w/o gaps
  • Have a maximal aggregate score
  • Score must be above the score threshold S
• Many HSPs reported for ungapped BLAST

BLAST Algorithms
Program    Query                   Target
BLASTN     Nucleotide              Nucleotide
BLASTP     Protein                 Protein
BLASTX     Nucleotide (6-frame)    Protein
TBLASTN    Protein                 Nucleotide (6-frame)
TBLASTX    Nucleotide (6-frame)    Nucleotide (6-frame)

Neighborhood Words
Query Word (W = 3)
Query: SISPGQRVGLLGRTGSGKSTLLSAFLRMLN-IKGDIE
Neighborhood Score Threshold (T = 8)
STL 13 (= 4 + 5 + 4)
SAL 8
SNL 8
SVL 8
SBL 7
SCL 7
SDL 7
Etc.

High-Scoring Segment Pairs
STL 13
SAL 8
SNL 8
SVL 8
SBL 7
SCL 7
SDL 7
Etc.
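The neighborhood word scores listed above (STL 13 = 4 + 5 + 4, SAL 8, and so on) can be reproduced by summing one substitution score per position. The sketch below is illustrative only. It assumes the slide uses BLOSUM62, hard-codes just the few matrix entries needed for the query word STL, and uses hypothetical names (`THR_ROW`, `word_score`, `neighborhood`) that do not come from the talk or from BLAST itself.

```python
# Partial BLOSUM62 scores, hard-coded for illustration (assumed matrix).
# Only the pairs needed to score neighbors of the query word "STL".
SER_SER = 4   # S aligned with S
LEU_LEU = 4   # L aligned with L
# Scores for T (middle query residue) aligned with each candidate residue.
THR_ROW = {"T": 5, "A": 0, "N": 0, "V": 0, "B": -1, "C": -1, "D": -1}

T_THRESHOLD = 8  # neighborhood score threshold T from the slide


def word_score(candidate: str) -> int:
    """Score a 3-letter word of the form S?L against the query word STL."""
    if len(candidate) != 3 or candidate[0] != "S" or candidate[2] != "L":
        raise ValueError("this sketch only handles words of the form S?L")
    return SER_SER + THR_ROW[candidate[1]] + LEU_LEU


def neighborhood(candidates):
    """Keep candidates whose score is at or above the threshold T."""
    return [w for w in candidates if word_score(w) >= T_THRESHOLD]


words = ["STL", "SAL", "SNL", "SVL", "SBL", "SCL", "SDL"]
scores = {w: word_score(w) for w in words}
kept = neighborhood(words)
```

With T = 8, only the words scoring 8 or more remain in `kept` as candidate seeds; the words scoring 7 fall below the threshold.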
Query: SISPGQRVGLLGRTGSGKSTLLSAFLRMLN-IKGDIE
       ++  G  + ++G  G+GKS+LLSA L  L+ ++G +
Sbjct: TVPQGCLLAVVGPVGAGKSSLLSALLGELSKVEGFVS

Extension
[Figure: cumulative score plotted along the extension of the Query/Sbjct alignment above, with the thresholds T and S and the drop-off X marked]
Significance Decay
• Mismatches
• Gap penalties

Karlin Altschul Equation
$E = k\,m\,N\,e^{-\lambda S}$
m     Number of letters in query
N     Number of letters in db
mN    Size of search space
λS    Normalized score
k     Minor constant
http://www.ncbi.nlm.nih.gov
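To make the Karlin Altschul equation concrete, here is a minimal sketch that evaluates E = k m N e^(-λS) for a few inputs. The function name `expected_hits` and all numeric values (λ = 0.3, k = 0.1, the query and database lengths) are placeholders chosen for illustration; they are not taken from the talk or from any particular scoring matrix.

```python
import math


def expected_hits(raw_score: float, m: int, n: int, lam: float, k: float) -> float:
    """Karlin-Altschul expectation: E = k * m * n * exp(-lam * raw_score)."""
    return k * m * n * math.exp(-lam * raw_score)


# Illustrative inputs only; LAM and K are placeholder values,
# not parameters from the slides or from a real scoring system.
LAM = 0.3
K = 0.1
QUERY_LEN = 250        # m: number of letters in the query
DB_LEN = 1_000_000     # N: number of letters in the database

e_low = expected_hits(50, QUERY_LEN, DB_LEN, LAM, K)
e_high = expected_hits(100, QUERY_LEN, DB_LEN, LAM, K)
doubled_db = expected_hits(50, QUERY_LEN, 2 * DB_LEN, LAM, K)
```

Two properties of the equation show up directly: E falls exponentially as the score rises (`e_high` is far smaller than `e_low`), and E grows linearly with the search space mN (`doubled_db` is twice `e_low`).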