Summer Bioinformatics Workshop 2008 Sequence Alignments Chi-Cheng Lin, Ph.D. Associate Professor Department of Computer Science Winona State University – Rochester Center clin@winona.edu Summer Bioinformatics Workshop 2008 Sequence Alignments Cornerstone of bioinformatics What is a sequence? Nucleotide sequence Amino acid sequence Pairwise and multiple sequence alignments What alignments can help Determine function of a newly discovered gene sequence Determine evolutionary relationships among genes, proteins, and species Predict structure and function of protein 2 Summer Bioinformatics Workshop 2008 Why Align Sequences? The draft human genome is available Automated gene finding is possible Gene: AGTACGTATCGTATAGCGTAA What does it do? One approach: Is there a similar gene in another species? Align sequences with known genes Find the gene with the “best” match 3 Summer Bioinformatics Workshop 2008 Visualization of Sequence Alignment Dot Plot One of the simplest and oldest methods for sequence alignment Visualization of regions of similarity Assign one sequence on the horizontal axis Assign the other on the vertical axis Place dots on the space of matches Diagonal lines means adjacent regions of identity 4 Summer Bioinformatics Workshop 2008 A Simple Example Construct a simple dot plot for T A G T C G A T G T * TAGTCGATG TGGTCATC * G * * * G * * * T * The alignment is TAGTCGATG TGGTC-ATC * * C A T * C * * * * * * * 5 Summer Bioinformatics Workshop 2008 Genes Accumulate Mutations over Time Mistakes in gene replication or repair Deletions, duplications Insertions, inversions Translocations Point mutations Environmental factors Radiation Oxidation 6 Summer Bioinformatics Workshop 2008 Deletions Codon deletion: ACG ATA GCG TAT GTA TAG CCG… Effect depends on the protein, position, etc. Almost always deleterious Sometimes lethal Frame shift mutation: ACG ATA GCG TAT GTA TAG CCG… ACG ATA GCG ATG TAT AGC CG?… Almost always lethal 7 Summer Bioinformatics Workshop 2008 Indels Comparing two genes it is generally impossible to tell if an indel is an insertion in one gene, or a deletion in another, unless ancestry is known: ACGTCTGATACGCCGTATCGTCTATCT ACGTCTGAT---CCGTATCGTCTATCT 8 Summer Bioinformatics Workshop 2008 The Genetic Code Substitutions are mutations accepted by natural selection. Synonymous: CGC CGA Non-synonymous: GAU GAA 9 Summer Bioinformatics Workshop 2008 Point Mutation Example: Sickle-cell Disease Wild-type hemoglobin DNA 3’----CTT----5’ Mutant hemoglobin DNA 3’----CAT----5’ mRNA 5’----GAA----3’ mRNA 5’----GUA----3’ Normal hemoglobin ------[Glu]------ Mutant hemoglobin ------[Val]-----10 Summer Bioinformatics Workshop 2008 11 image credit: U.S. Department of Energy Human Genome Program, http://www.ornl.gov/hgmis. Summer Bioinformatics Workshop 2008 Comparing Two Sequences Point mutations, easy: ACGTCTGATACGCCGTATAGTCTATCT ACGTCTGATTCGCCCTATCGTCTATCT Indels are difficult, must align sequences: ACGTCTGATACGCCGTATAGTCTATCT CTGATTCGCATCGTCTATCT ACGTCTGATACGCCGTATAGTCTATCT ----CTGATTCGC---ATCGTCTATCT 12 Summer Bioinformatics Workshop 2008 Scoring a Sequence Alignment Example Match score: Mismatch score: Gap penalty: +1 +0 –1 ACGTCTGATACGCCGTATAGTCTATCT ||||| ||| || |||||||| ----CTGATTCGC---ATCGTCTATCT Matches: 18 × (+1) Mismatches: 2 × 0 Gaps: 7 × (– 1) Score = 18 + 0 + (-7) = +11 Various scoring scheme exist. 13 Summer Bioinformatics Workshop 2008 How can we find an optimal alignment? Finding the alignment is computationally hard: ACGTCTGATACGCCGTATAGTCTATCT CTGAT---TCG-CATCGTC--T-ATCT There are ~888,000 possibilities to align the two sequences given above. Algorithms using a technique called “dynamic programming” are used – out of the scope of this workshop. 14 Summer Bioinformatics Workshop 2008 Global and Local Alignments Global alignments – score the entire alignment Local alignment – find the best matching subsequence Why local sequence alignment? Global alignment is useful only if the sequences to be aligned are very similar Subsequence comparison between a DNA sequence and a genome Identify Conserved regions Protein function domains 15 Summer Bioinformatics Workshop 2008 Example Compare the two sequences: TTGACACCCTCCCAATT ACCCCAGGCTTTACACAG Global alignment (does it look good?) TTGACACCCTCC-CAATT || || || ACCCCAGGCTTTACACAG Local alignment (does it look good?) ---------TTGACACCCTCCCAATT || |||| ACCCCAGGCTTTACACAG-------16 Summer Bioinformatics Workshop 2008 Where do we get sequences to work with? Biological databases NCBI Entrez (http://www.ncbi.nlm.nih.gov/gquery/gquery.fcgi ?term=) Wet labs Simulations Other people’s results On-line education resources BEDROCK (http://www.bioquest.org/bedrock/) BLAST results 17