Bioinformatics in Practice
A Tutorial for DS2005, 8 Oct 2005
Wing-Kin Sung, Limsoon Wong

Practicing Bioinformatics

Tutorial Outline
• Intro to biology & bioinformatics apps (10 min, KS)
• DNA feature recognition (20 min, WLS)
• Protein function inference (20 min, WLS)
• Q&A/break (10 min)
• Whole genome alignment (20 min, KS)
• Phylogenetic network (20 min, KS)
• Peptide sequencing by mass spec (20 min, KS)
• Q&A/break (10 min)
• Disease treatment optimization (15 min, WLS)
• Mining errors in bio databases (15 min, WLS)
• Q&A (10 min)

Copyright © 2005 by Wing-Kin Sung and Limsoon Wong

Introduction to Biology & Bioinformatics Applications (KS, 10 min)

Body and Cell
• Our body consists of a number of organs
• Each organ is composed of a number of tissues
• Each tissue is composed of cells of the same type
• Cells perform two types of function
  – Chemical reactions needed to maintain our life
  – Pass info for maintaining life to the next generation
• In particular
  – Proteins perform chemical reactions
  – DNA stores & passes info
  – RNA is the intermediate between DNA & proteins

DNA
• Stores instructions needed by the cell to perform daily life functions
• Consists of two strands interwoven together to form a double helix
• Each strand is a chain of small molecules called nucleotides
[Photo: Francis Crick shows James Watson the model of DNA in their room number 103 of the Austin Wing at the Cavendish Laboratories, Cambridge]

Classification of Nucleotides
• 5 different nucleotides: adenine (A), cytosine (C), guanine (G), thymine (T), & uracil (U)
• A, G are purines. They have a 2-ring structure
• C, T, U are pyrimidines.
They have a 1-ring structure
• DNA only uses A, C, G, & T
[Figure: structures of A, C, G, T, and U]

Watson-Crick Rule
• DNA is double stranded in a cell
• One strand is the reverse complement of the other
• Complementary bases:
  – A with T (two hydrogen bonds)
  – C with G (three hydrogen bonds)
[Figure: A-T and C-G base pairs, each labeled 10Å]

Chromosome
• DNA is usually tightly wound around histone proteins and forms a chromosome
• The total info stored in all chromosomes constitutes a genome
• In most multi-cell organisms, every cell contains the same complete set of chromosomes
  – There may be some small differences due to mutation
• Human genome has 3G bases, organized in 23 pairs of chromosomes

Gene
• A gene is a sequence of DNA that encodes a protein or an RNA molecule
• About 30,000 – 35,000 (protein-coding) genes in the human genome
• For a gene that encodes a protein
  – In a prokaryotic genome, one gene corresponds to one protein
  – In a eukaryotic genome, one gene may correspond to more than one protein because of alternative splicing

Central Dogma
• Gene expression consists of two steps
  – Transcription: DNA → mRNA
  – Translation: mRNA → Protein

Genetic Code
• Start codon: ATG (codes for M)
• Stop codons: TAA, TAG, TGA

Protein
• A sequence composed from an alphabet of 20 amino acids
  – Length is usually 20 to 5000 amino acids
  – Average is around 350 amino acids
• Folds into a 3D shape, forming the building blocks & performing most of the chemical reactions within a cell

Classification of Amino Acids
• Amino acids can be
classified into 4 types
• Positively charged (basic)
  – Arginine (Arg, R)
  – Histidine (His, H)
  – Lysine (Lys, K)
• Negatively charged (acidic)
  – Aspartic acid (Asp, D)
  – Glutamic acid (Glu, E)
• Polar (overall uncharged, but with uneven charge distribution; can form hydrogen bonds with water; called hydrophilic)
  – Asparagine (Asn, N)
  – Cysteine (Cys, C)
  – Glutamine (Gln, Q)
  – Glycine (Gly, G)
  – Serine (Ser, S)
  – Threonine (Thr, T)
  – Tyrosine (Tyr, Y)
• Nonpolar (overall uncharged, with uniform charge distribution; cannot form hydrogen bonds with water; called hydrophobic)
  – Alanine (Ala, A)
  – Isoleucine (Ile, I)
  – Leucine (Leu, L)
  – Methionine (Met, M)
  – Phenylalanine (Phe, F)
  – Proline (Pro, P)
  – Tryptophan (Trp, W)
  – Valine (Val, V)

Bioinformatics Applications
• Bio data searching
• Gene finding
• Cis-regulatory DNA
• Gene/protein network
• Protein/RNA structure prediction
• Evolutionary tree construction
• Infer protein function
• Disease diagnosis, prognosis, & treatment optimization, ...

DNA Feature Recognition (WLS, 20 min)
A Case Study on Translation Initiation Sites

Translation Initiation Site (TIS)
[Figure: translation initiation site]

A Sample cDNA
299 HSU27655.1 CAT U27655 Homo sapiens
CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA
GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
............................................................
................................iEEEEEEEEEEEEEEEEEEEEEEEEEEE
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
• What makes the second ATG the TIS?

Approach
• Training data gathering
• Signal generation
  – k-grams, distance, domain know-how, …
• Signal selection
  – Entropy, χ², CFS, t-test, domain know-how, …
• Signal integration
  – SVM, ANN, PCL, CART, C4.5, kNN, ...

Training & Testing Data
• Vertebrate dataset of Pedersen & Nielsen [ISMB’97]
• 3312 sequences
• 13503 ATG sites
• 3312 (24.5%) are TIS
• 10191 (75.5%) are non-TIS
• Used for 3-fold cross-validation experiments

Signal Generation
• K-grams (i.e., k consecutive letters)
  – K = 1, 2, 3, 4, 5, …
  – Window size vs. fixed position
  – Upstream, downstream vs. anywhere in window
  – In-frame vs. any frame

Too Many Signals
• For each k, there are 4^k × 3 × 2 k-grams
• If we use k = 1, 2, 3, 4, 5, we have 24 + 96 + 384 + 1536 + 6144 = 8184 features!
• This is too many for most machine learning algorithms
• So we need to do signal selection
  – t-stats, χ², CFS, signal-to-noise, entropy, gini index, info gain, info gain ratio, ...
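The feature count on the "Too Many Signals" slide can be checked with a few lines of arithmetic. The sketch below assumes that the factor 3 stands for the three window regions (upstream, downstream, anywhere) and the factor 2 for the two frame choices (in-frame, any frame) listed under Signal Generation; that reading, and the names `REGIONS`, `FRAMES`, `count_kgram_features`, and `kgram_features`, are illustrative assumptions rather than part of the original tutorial.

```python
from itertools import product

# Assumed meaning of the factors 3 and 2 in the 4^k * 3 * 2 count.
REGIONS = ("upstream", "downstream", "anywhere")
FRAMES = ("in-frame", "any-frame")


def count_kgram_features(k):
    """Number of candidate features for one k: 4^k k-grams x 3 regions x 2 frames."""
    return 4 ** k * len(REGIONS) * len(FRAMES)


def kgram_features(k):
    """Enumerate hypothetical feature names such as ('ATG', 'upstream', 'in-frame')."""
    for letters in product("ACGT", repeat=k):
        for region in REGIONS:
            for frame in FRAMES:
                yield ("".join(letters), region, frame)


counts = [count_kgram_features(k) for k in range(1, 6)]
total = sum(counts)
print(counts, total)  # [24, 96, 384, 1536, 6144] 8184
```

Even at k = 5 the candidate set already exceeds eight thousand features, which is why the next slides turn to signal selection.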
Signal Selection: Basic Idea
• Choose a signal w/ low intra-class distance
• Choose a signal w/ high inter-class distance

Signal Selection by T-Statistics
[Figure: t-statistics slide]

Signal Selection by χ²
[Figure: χ² slide]

Sample K-Grams Selected by CFS
• Position –3 (Kozak consensus)
• In-frame upstream ATG (leaky scanning)
• In-frame downstream
  – TAA, TAG, TGA (stop codon)
  – CTG, GAC, GAG, and GCC (codon bias?)

Signal Integration
• kNN
  – Given a test sample, find the k training samples that are most similar to it. Let the majority class win
• SVM
  – Given a group of training samples from two classes, determine a separating plane that maximises the margin between the two classes
• Naïve Bayes, ANN, C4.5, CS4, ...

Results (3-fold cross-validation)
Classifier       TP/(TP+FN)   TN/(TN+FP)   TP/(TP+FP)   Accuracy
Naïve Bayes      84.3%        86.1%        66.3%        85.7%
SVM              73.9%        93.2%        77.9%        88.5%
Neural Network   77.6%        93.2%        78.8%        89.4%
Decision Tree    74.0%        94.4%        81.1%        89.4%

Technique Comparisons
• Pedersen & Nielsen [ISMB’97]
  – Neural network
  – No explicit features
• Zien [Bioinformatics’00]
  – SVM + kernel engineering
  – No explicit features
• Hatzigeorgiou [Bioinformatics’02]
  – Multiple neural networks
  – Scanning rule
  – No explicit features
• This approach
  – Explicit feature generation
  – Explicit feature selection
  – Use any machine learning method w/o any form of complicated tuning
  – Scanning rule is optional

mRNA → Protein
How about using k-grams from the translation?
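One way to picture k-grams from the translation is to translate the in-frame codons downstream of a candidate ATG and count amino-acid k-grams there. The sketch below uses the standard genetic code; the function names, the downstream-only window, and the default window size of 99 bases are assumptions made for illustration, not the feature definition used in the tutorial.

```python
from collections import Counter

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate codons starting at position 0; '*' marks a stop codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def amino_kgrams(dna, atg_pos, k=2, window=99):
    """Count amino-acid k-grams in the in-frame window that starts at a candidate ATG."""
    protein = translate(dna[atg_pos:atg_pos + window])
    return Counter(protein[i:i + k] for i in range(len(protein) - k + 1))


# The first ATG of the sample cDNA is followed by an in-frame stop codon.
first_atg = "ATGGCTGAACACTGA"
print(translate(first_atg))           # MAEH*
print(amino_kgrams(first_atg, 0, 2))
```

In this encoding a stop codon shows up as the letter `*`, so an in-frame downstream stop becomes just another amino-acid k-gram feature.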
[Figure: genetic code wheel mapping codons to amino acids and stop]

Amino-Acid Features
[Figures: amino-acid features, two slides]

Amino Acid K-Grams Discovered (by entropy)
[Figure: amino acid k-grams discovered]

Validation Results (on Chr X and Chr 21)
[Figure: our method vs. ATGpr]
• Using the top 100 features selected by entropy, with an SVM trained on Pedersen & Nielsen’s dataset

Protein Function Inference (WLS, 20 min)
• Guilt by Association
• Genome Phylogenetic Profiling

Motivations for Sequence Comparison
• DNA is the blueprint for living organisms
• Evolution is related to changes in DNA
• By comparing DNA sequences we can infer evolutionary relationships between the sequences w/o knowledge of the evolutionary events themselves
• Foundation for inferring function, active site, and key mutations

Sequence Alignment
• Key aspect of sequence comparison is sequence alignment
• A sequence alignment maximizes the number of positions that are in agreement in two sequences
[Figure: alignment of Sequence U and Sequence V, with match, mismatch, and indel positions labeled]

Sequence Alignment: Poor Example
• Poor seq alignment shows few matched positions
• The two proteins are not likely to be homologous
[Figure: no obvious match between Amicyanin and Ascorbate Oxidase]

Sequence Alignment: Good Example
• Good alignment has clusters of extensive matched positions
• The two proteins are likely to be homologous
[Figure] good match between Amicyanin and unknown M.
loti protein

Function Assignment to Protein Sequence
SPSTNRKYPPLPVDKLEEEINRRMADDNKLFREEFNALPACPIQATCEAASKEENKEKNR
YVNILPYDHSRVHLTPVEGVPDSDYINASFINGYQEKNKFIAAQGPKEETVNDFWRMIWE
QNTATIVMVTNLKERKECKCAQYWPDQGCWTYGNVRVSVEDVTVLVDYTVRKFCIQQVGD
VTNRKPQRLITQFHFTSWPDFGVPFTPIGMLKFLKKVKACNPQYAGAIVVHCSAGVGRTG
TFVVIDAMLDMMHSERKVDVYGFVSRIRAQRCQMVQTDMQYVFIYQALLEHYLYGDTELE
VT
• How do we attempt to assign a function to a new protein sequence?

Guilt-by-Association
1. Compare T with seqs of known function in a db
2. Assign to T the same function as its homologs
3. Confirm with suitable wet experiments; otherwise, discard this function as a candidate

Homologs Obtained by BLAST
[Figure: BLAST hits for the example sequence]
• Thus our example sequence could be a protein tyrosine phosphatase (PTP)

Example Alignment with PTP
[Figure: alignment of the example sequence with PTP]

Guilt-by-Association: Caveats
• Ensure that the effect of database size has been accounted for
• Ensure that the function of the homolog is not derived via invalid “transitive assignment”
• Ensure that the target sequence has all the key features associated with the function, e.g., active site and/or domain

Examples of Invalid Function Assignment: The IMP dehydrogenases (IMPDH)
[Table: a partial list of IMP dehydrogenase misnomers in complete genomes remaining in some public databases]

IMPDH Domain Structure
[Figure panels: IMPDH misnomer in Methanococcus jannaschii; IMPDH misnomers in Archaeoglobus fulgidus]
• Typical IMPDHs have 2 IMPDH domains that form the catalytic core and 2 CBS domains.
• A less common but functional IMPDH (E70218) lacks the CBS domains.
• Misnomers show similarity to the CBS domains

Invalid Transitive Assignment
[Figure: proteins A, B, and C; labels: root of invalid transitive assignment, mis-assignment of function, no IMPDH domain]

Protein Function Inference
What if no sequence homolog with annotated function can be found?

Phylogenetic Profiling
• Genes (and hence proteins) with identical patterns of occurrence across phyla tend to function together
• Even if no homolog with known function is available, it is still possible to infer the function of a protein

Phylogenetic Profiling: How it Works
[Figure: how phylogenetic profiling works]

Phylogenetic Profiling: Evidence
[Figure: Pellegrini et al., PNAS, 96:4285-4288, 1999]

Phylogenetic Profiling: Evidence
[Figure: Wu et al., Bioinformatics, 19:1524-1530, 2003; KEGG and COG results plotted against hamming distance (D)]
• hamming distance(X,Y) = #lineages where X occurs + #lineages where Y occurs – 2 × #lineages where both X and Y occur
• Proteins having low hamming distance (thus highly similar phylogenetic profiles) tend to share common pathways

Q&A / Break

Whole Genome Alignment (KS, 20 min)

Mouse vs Human
• Mouse and human are closely related species. They share a lot of gene pairs

Mouse Chr No.   Human Chr No.   # of Published Gene Pairs
2               15              51
7               19              192
14              3               23
14              8               38
15              12              80
15              22              72
16              16              31
16              21              64
16              22              30
17              6               150
17              16              46
17              19              30
18              5               64
19              9               22
19              11              93

Data is extracted from http://www.ncbi.nlm.nih.gov/Homology

Our Aim
• Suppose we are given the human and mouse genomes
• Our aim is to extract all the conserved gene pairs between human and mouse
• One possible solution: whole genome alignment!

Different Approaches
Approach                                   Reference             Coverage   Precision
MUM                                                              100%       Many false positives
LCS (MUMmer1)                              Delcher et al, 1999   Very low   Not many false positives
Clustering (MUMmer2,3)                     Delcher et al, 2002   76.6%      26.5%
Mutation-Sensitive Alignment (MSA)         Chan et al, 2004      91.3%      29.3%
MSA with 1-mismatch anchor                 Yiu et al, 2005       94.6%      30.1%

Observation 1
• Though a pair of conserved genes rarely contain exactly the same sequence, they share a lot of short common substrings, and some of them are indeed unique to this pair of genes!
• For example,
  Genome1: ACGACTCAGCTACTGGTCAGCTATTACTTACCGC
  Genome2: ACTTCTCTGCTACGGTCAGCTATTCACTTACCGC

Good News!
• In our experiments, we found that MUMs (maximal unique matches) cover nearly 100% of the known conserved gene pairs

Problem Solved?
• We can find MUMs in linear time! Is the problem solved?
• Ans: No!

Mouse Chr No.   Human Chr No.   # of Published Gene Pairs   # of MUMs
2               15              51                          96,473
7               19              192                         52,394
14              3               23                          58,708
14              8               38                          38,818
15              12              80                          88,305
15              22              72                          71,613
16              16              31                          66,536
16              21              64                          51,009
16              22              30                          61,200
17              6               150                         94,095
17              16              46                          29,001
17              19              30                          56,536
18              5               64                          131,850
19              9               22                          62,296
19              11              93                          29,814

• No. of MUMs >> no. of gene pairs! There is too much noise!
• How can we extract the right MUMs?
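To make the notion of a MUM concrete, here is a brute-force sketch that reports every match that occurs exactly once in each sequence and cannot be extended on either side. It is not the linear-time method the slides refer to; it runs in polynomial but far from linear time and is only meant for short strings. The function names and the `min_len` threshold are assumptions chosen for this illustration.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def find_mums(s, t, min_len=4):
    """Return (pos_in_s, pos_in_t, length) for every maximal unique match."""
    mums = []
    for i in range(len(s)):
        for j in range(len(t)):
            if s[i] != t[j]:
                continue
            # Skip pairs that could still be extended to the left.
            if i > 0 and j > 0 and s[i - 1] == t[j - 1]:
                continue
            # Extend the match to the right as far as possible.
            length = 0
            while (i + length < len(s) and j + length < len(t)
                   and s[i + length] == t[j + length]):
                length += 1
            if length < min_len:
                continue
            word = s[i:i + length]
            # Keep the match only if it is unique in both sequences.
            if count_occurrences(s, word) == 1 and count_occurrences(t, word) == 1:
                mums.append((i, j, length))
    return mums


print(find_mums("CCGATTACATT", "AAGATTACAGG"))  # [(2, 2, 7)]

genome1 = "ACGACTCAGCTACTGGTCAGCTATTACTTACCGC"
genome2 = "ACTTCTCTGCTACGGTCAGCTATTCACTTACCGC"
for pos1, pos2, length in find_mums(genome1, genome2):
    print(pos1, pos2, genome1[pos1:pos1 + length])
```

The second call runs the same routine on the two example genomes from Observation 1, so the reported substrings can be compared against the slide by eye.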
Observation 2
• Two related species should preserve the ordering of most conserved genes

Conserved Genes in Mouse Chromosome 16 and Human Chromosome 16
[Figure: conserved genes in mouse chromosome 16 and human chromosome 16]

Solution 2
• Instead of reporting all MUMs to the user,
  – Compute the longest common subsequence (LCS) of all MUMs
  – Report only the MUMs on the LCS

Example of LCS
[Figure: LCS of the MUM orders 12345678 and 41325768]

Problem of this Approach
• It assumes there exists a single long alignment
• However, this assumption is not always true
• Therefore, in many cases, LCS can only discover a few genes

Common Genes in Mouse Chromosome 16 and Human Chromosome 3
[Figure: common genes in mouse chromosome 16 and human chromosome 3]

Observation 3
• A pair of conserved genes is likely to correspond to a sequence of MUMs that are consecutive, close in both genomes, and have sufficient length
[Figure: MUMs numbered 1 to 7 in two genomes]
• The set of such substrings is called a cluster

Solution 3
• Based on Observation 3, MUMmer2 and MUMmer3 try to identify maximal clusters in the genomes
• This approach is quite good. In our experiment, MUMmer3 can identify ~76.6% of the published gene pairs

Can We Further Improve?
• Yes.
We propose the Similar Subsequence Problem
• In our experiment, we can identify ~91.3% of the published gene pairs

Observation 4
• If two genomes are closely related, one can be transformed into the other using a few transpositions/reversals

Example
• By two transposition/reversal operations, we can transform Mouse Chr 16 to Human Chr 16

Input
• Given two genomes S and T
• Assume we already know the n MUMs
• Let A=(a1,a2,…,an) and B=(b1,b2,…,bn), respectively, be the order of the n MUMs in S and T
  S: a1=1 a2=2 a3=3 a4=4 a5=5 a6=6 a7=7 a8=8
  T: b1=1 b2=6 b3=5 b4=4 b5=7 b6=2 b7=3 b8=8

Common Subsequence
• A seq C=(c1,c2,…,cm) is a common subseq of A and B if C is a subsequence of both A and B
• E.g., C=(1,2,3,8) is a common subseq of A and B in the example S and T above
• Weight of a common subseq is the total weight of its MUMs
• A maximum weight common subseq (MWCS) of A and B is a common subseq with the heaviest weight

Similar Subsequence
• A k-similar subseq consists of k blocks and a backbone
  – Backbone is a common subseq w/ k blocks inserted into it
  – Each block is a common subseq or a reversed common subseq, and all of them are disjoint
• K-similar subseq models k transpositions/reversals
[Figure: an example of a 2-similar subseq on the same S and T]

Similar Subsequence Problem
• Given two sequences A and B and a parameter k, the Similar Subsequence Problem finds a k-similar subsequence with the heaviest
weight
• This problem is NP-complete in general
• For a constant k, we can solve the problem in O(n^(2k+1) log n) time
• We devise a heuristic algorithm to solve it in O(n^2 (log n + k)) time

Solution 4
• Given two genomes S and T, the Mutation Sensitive Alignment (MSA) algorithm works as follows:
  1. Find all the MUMs
  2. Solve the similar subsequence problem
  3. Report all the MUMs on the k-similar subsequence (we set k=4)

Example
  S: a1=1 a2=2 a3=3 a4=4 a5=5 a6=6 a7=7 a8=8 a9=9
  T: b1=2 b2=1 b3=7 b4=6 b5=5 b6=8 b7=3 b8=4 b9=9

Experiment results
• We apply MUMmer3 and MSA to the following 15 pairs of chromosomes
• For MSA, we set k=4

Mouse Chr No.   Human Chr No.   # of Published Gene Pairs   # of MUMs
2               15              51                          96,473
7               19              192                         52,394
14              3               23                          58,708
14              8               38                          38,818
15              12              80                          88,305
15              22              72                          71,613
16              16              31                          66,536
16              21              64                          51,009
16              22              30                          61,200
17              6               150                         94,095
17              16              46                          29,001
17              19              30                          56,536
18              5               64                          131,850
19              9               22                          62,296
19              11              93                          29,814

Experiment results (II)
           Coverage              Precision
Exp. No.   MUMmer     MSA        MUMmer     MSA
1          76.50%     92.20%     21.70%     22.70%
2          71.40%     91.70%     21.30%     25.10%
3          87.00%     100.00%    24.80%     25.50%
4          76.30%     94.70%     27.40%     26.70%
5          92.50%     96.30%     32.50%     32.00%
6          72.20%     95.80%     31.20%     32.90%
7          67.70%     87.10%     13.50%     17.80%
8          78.10%     90.60%     37.20%     36.70%
9          80.00%     86.70%     40.70%     49.70%
10         82.00%     92.00%     30.90%     32.10%
11         65.20%     89.10%     30.50%     36.00%
12         60.00%     80.00%     27.50%     41.90%
13         89.10%     95.30%     18.20%     18.40%
14         72.70%     86.40%     10.40%     12.60%
15         78.50%     91.40%     30.00%     29.70%
average    76.60%     91.30%     26.50%     29.30%

• Coverage: % of published genes covered
• Precision: % of reported MUMs that reside in some published gene pair

Phylogenetic Network (KS, 20 min)

Phylogenetic Tree
• A phylogenetic tree is a tree whose leaves are labeled by species
• It assumes that each species evolved from ONE ancestor species
• Represented by a rooted tree, distinctly leaf-labeled
[Figure: two trees over C. tigris, D. dorsalis, C. draconoides, U. scoparia, and P. platyrhinos]

Limitation of Phylogenetic Tree
• Ford Doolittle (Science 1999) said
  – Molecular phylogeneticists will have failed to find the “true tree”, not because their methods are inadequate or because they have chosen the wrong genes, but because the history of life cannot properly be represented as a tree

More Realistic Assumption
• Evolution involves more than mutation. Other types of evolutionary events include:
  – Hybridization
    • E.g. tiger + lion → tiglon
  – Horizontal gene transfer
    • E.g.
Evolution of influenza
• Phylogenetic trees cannot model these types of evolution

Another model: Phylogenetic Network
• Generalization of phylogenetic tree in which internal nodes may have more than one parent
• A network N is a directed acyclic graph such that
  – Each node has indegree 1 or 2 (except the root)
  – Each node has outdegree at most 2
  – No node has both indegree 1 and outdegree 1
  – All nodes with outdegree 0 are distinctly labeled (“leaves”)
[Figure: a network with a root, a hybrid node, and leaves x1, x2, x3, x4]

A Special Case: Galled Phylogenetic Network
• When all cycles in the phylogenetic network are node-disjoint, the network is called a galled network
• The biological significance of this special case is described in [D. Gusfield, S. Eddhu, and C. Langley (CSB 2003)]
[Figure: a general network and a galled network]

Methods for Constructing Network
• Median-joining
• Split decomposition (SplitsTree)
• PYRAMIDS
• Statistical parsimony (TCS)
• Molecular-variance parsimony (Arlequin)
• Reticulogram (T-REX)
• Netting
• NeighborNet
• Perfect phylogeny-based methods
• Constructing galled network from triplets

Maddison Method for Building Network
• Maddison observed that
  – If a phylogenetic network for a set of species contains a single hybrid node, then each gene present in the species must evolve according to one of the two trees embedded in the network
• Hence, we have the following problem:
  – Input: a set of gene trees
  – Output: a network which refines all gene trees

Example
• Given a set of 2 trees T={T1, T2}, below is a galled network N which refines T
[Figure: trees T1 and T2 over leaves x1 to x6, and a galled network N that refines both]

Difficult to Construct the Network?
• Unfortunately, in general, this problem is NP-hard
• However, if the resulting network is a galled network, it can be constructed in polynomial time

Framework
• We propose a simple top-down and recursive framework to solve the problem
  1. Partition L (the set of leaves for T) into two subsets {X,Y} if possible; otherwise, three subsets {X,Y,Z}
  2. For each subset L’=X,Y,Z, recursively construct a solution network for T|L’
  3. Combine the solutions for T|L’ to obtain a network for T

Illustration (I)
Reason:
• T1|{x1,x2,x3,x4} & T2|{x1,x2,x3,x4} are subtrees of T1 & T2, resp.
• Similar for T1|{x5,x6} & T2|{x5,x6}
[Figure: T1 and T2 partitioned into {x1,x2,x3,x4} and {x5,x6}]

Illustration (II)
[Figure: T1|{x5,x6}, T2|{x5,x6}, and the network built for {x5,x6}]

Illustration (III)
Reason:
• T1|{x2,x3} and T2|{x2,x3} are proper subtrees
• Similar for T1|{x1} and T2|{x1}
• Similar for T1|{x4} and T2|{x4}
[Figure: the restricted trees on {x1,x2,x3,x4} and the network built for them]

Illustration (IV)
[Figure: T1, T2, and the combined network over x1 to x6]

Summary
• Given two trees T1 and T2, we can find a galled network N which refines T1 and T2 in polynomial time
• Since galled networks are biologically meaningful, this is a big step towards constructing phylogenetic networks in practice
• Open problem: Can we have a practically fast algorithm for building a general network for T1 and T2?
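The network definition given earlier in this section (indegree 1 or 2 except at the root, outdegree at most 2, no node with both indegree 1 and outdegree 1, distinctly labeled leaves, no directed cycles) can be written as a small validity check. The sketch below is only an illustration of that definition: the edge-list representation, the function name, and the assumption of exactly one root are choices made here, and it does not construct or refine networks.

```python
from collections import defaultdict, deque


def is_phylogenetic_network(edges, labels):
    """Check the degree, labeling, and acyclicity conditions of the definition.

    edges  : list of (parent, child) pairs
    labels : dict mapping each leaf node to its species label
    """
    children, indeg, nodes = defaultdict(list), defaultdict(int), set()
    for parent, child in edges:
        children[parent].append(child)
        indeg[child] += 1
        nodes.update((parent, child))

    # Assumption: a network has exactly one root (a node with indegree 0).
    roots = [n for n in nodes if indeg[n] == 0]
    if len(roots) != 1:
        return False

    for n in nodes:
        out = len(children[n])
        if n != roots[0] and indeg[n] not in (1, 2):
            return False
        if out > 2 or (indeg[n] == 1 and out == 1):
            return False

    # Leaves must carry distinct labels.
    leaves = [n for n in nodes if not children[n]]
    leaf_labels = [labels.get(n) for n in leaves]
    if None in leaf_labels or len(set(leaf_labels)) != len(leaves):
        return False

    # Kahn's algorithm: the graph is acyclic iff every node gets removed.
    remaining, queue, removed = dict(indeg), deque(roots), 0
    while queue:
        n = queue.popleft()
        removed += 1
        for c in children[n]:
            remaining[c] -= 1
            if remaining[c] == 0:
                queue.append(c)
    return removed == len(nodes)


# A small network with one hybrid node h (two parents, a and b).
edges = [("r", "a"), ("r", "b"), ("a", "x1"), ("a", "h"),
         ("b", "h"), ("b", "x3"), ("h", "x2")]
labels = {"x1": "species 1", "x2": "species 2", "x3": "species 3"}
print(is_phylogenetic_network(edges, labels))  # True
```

An ordinary rooted binary tree passes the same check, which matches the view of a network as a generalization of a tree.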
Peptide Sequencing by Mass Spec (KS, 20 min)

Peptide Sequencing
• Unlike DNA sequencing, deducing the amino acid sequence of a protein peptide is not easy
• The problem of finding the amino acid sequence of a protein peptide is known as the Peptide Sequencing Problem
• One solution is to use mass spectrometry

Idea of Sequencing by Mass Spectrum
• M = total weight of the peptide CTVFTEPREFK
• W1 = weight of CTVFT
• M – W1 = weight of EPREFK
• Fragmentation splits CTVFTEPREFK into CTVFT and EPREFK
  – CTVFT: W1 + 1 (mass of b-ion)
  – EPREFK: M – W1 + 19 (mass of y-ion)

An Example MS/MS Spectrum
[Figure: an example MS/MS spectrum]

Two Ways for Identifying the Amino Acid Sequence
• Given the spectrum M, there are two ways to identify the amino acid sequence
  – Database searching
    • Select the peptide from the database that best explains the spectrum M
  – De novo sequencing
    • Among all possible peptides, find the peptide that best explains the spectrum M

Basic Idea of De Novo Sequencing
• Input: a spectrum S
• Scoring function: For any peptide P, define a scoring function score(P,S) to measure the fitness between P and S
[Figure: is peptide CTVFTEPREFK similar to the spectrum?]
• Algorithm: Among all possible peptides, find a peptide P which maximizes score(P,S)

How to Compute Score(P,S)?
•E.g.
Consider a peptide P=SAG
  – wt(G)=57.05Da, wt(A)=71.08Da, wt(S)=87.08Da
  – y1 = wt(G)+19 = 76.05
  – y2 = wt(AG)+19 = 147.13
  – y3 = wt(SAG)+19 = 234.21
  – b1 = wt(S)+1 = 88.08
  – b2 = wt(SA)+1 = 159.16
  – b3 = wt(SAG)+1 = 216.21
[Figure: artificial spectrum of P; red peaks: artificial y-ions; green peaks: artificial b-ions]

How to Compute Score(P,S)?
• Simple solution:
  – Count the number of peaks in S whose masses equal some b-ions or y-ions of P
• For the following example,
  – Matched peaks = 4
  – Unmatched peaks = 2
[Figure: artificial spectrum above the real spectrum; black peaks: real peaks; red peaks: artificial y-ions; green peaks: artificial b-ions]

Factors Affecting Intensity (I)
• y-ions are more intense than b-ions
• More intense y-ion ⇒ more intense b-ion, & vice versa
[Figure: fragmentation of CTVFTEPREFK into CTVFT, with W1+1 (mass of b-ion), and EPREFK, with M-W1+19 (mass of y-ion)]

Factors Affecting Intensity (II)
• Mass of the fragment will affect its intensity
• Peaks in the middle of the spectrum have higher intensity

Factors Affecting Intensity (III)
• a1…aj aj+1…an (b-ion: a1…aj, y-ion: aj+1…an)
• Amino acid at the cleavage site affects intensity
  – E.g.
Low intensity for the b-ion if aj=P
• Presence of basic residues
• Precursor charge
• Hydrophobicity and helicity
• …

A Better Score Function
• We propose to model the factors using decision trees
• Then, we give a better score function
[Figure: decision tree for b-ion and decision tree for y-ion; annotations: terminal part has low intensity; P gives lower intensity; large mass cannot be detected]

Algorithm
• Among all possible peptides, find a peptide P that maximizes score(P,S)
• This problem can be solved by dynamic programming
• For instance, we can use
  – Sandwich algorithm proposed by Bin Ma; or
  – Spectrum graph algorithm proposed by Ting Chen

Experiment Results
• Data set
  – Training set: 1260 high-confidence spectra of doubly charged tryptic peptides (from Genome Inst of S’pore)
  – Testing set: 400 high-confidence spectra from the Open Proteomics Database
  – Length from 9 to 18 (average 13.7)
• Result
  – Accuracy = (no. of correctly predicted amino acids) / (no. of predicted amino acids)
  – Compare with two other algorithms:
    • PEAKS: one of the best de novo algorithms
    • PepNovo: de novo algorithm with intensity-based scoring function

Experiment Results
• Compare accuracy
• Compare maximal correct subsequence length
  – Proportions of subsequence length longer than l (3-10)
[Figure: comparison charts]

Q&A / Break

Disease Treatment Optimization (WLS, 15 min)
A Case Study on Childhood ALL

Childhood ALL
• Major subtypes are:
  – T-ALL, E2A-PBX, TEL-AML, MLL genome rearrangements, BCR-ABL, Hyperdiploid>50
• Diff subtypes respond differently to same Tx
  – Over-intensive Tx
    • Development of secondary cancers
    • Reduction of IQ
  – Under-intensive Tx
    • Relapse
• The subtypes look similar
• Conventional diagnosis
  – Immunophenotyping
  – Cytogenetics
  – Molecular diagnostics
• Unavailable in most ASEAN countries

Single-Test Platform of Microarray & Machine Learning
[Figure: microarray platform. Image credit: Affymetrix]

Overall Strategy
• Diagnosis of subtype → Subtype-dependent prognosis → Risk-stratified treatment intensity
• For each subtype, select genes to develop a classification model for diagnosing that subtype
• For each subtype, select genes to develop a prediction model for prognosis of that subtype

Subtype Diagnosis by PCL
• Gene expression data collection
• Gene selection by χ²
• PCL classifier training by emerging patterns
• Apply PCL for diagnosis of future cases

Emerging Patterns
• An emerging pattern is a set of conditions
  – usually involving several features
  – that most members of a class satisfy
  – but none or few of the other class satisfy

PCL:
Prediction by Collective Likelihood

Childhood ALL Subtype Diagnosis Workflow
• A tree-structured diagnostic workflow was recommended by our doctor collaborator

Training and Testing Sets

Accuracy of Various Classifiers
• The classifiers are all applied to the 20 genes selected by χ² at each level of the tree

Understandability of EP & PCL
• E.g., for T-ALL vs. OTHERS, one ideally discriminatory gene 38319_at was found, inducing these 2 EPs
• These give us the diagnostic rule

Conclusions
Conventional Tx:
• intermediate intensity to everyone
  – 10% suffer relapse
  – 50% suffer side effects
  – costs US$150m/yr
Our optimized Tx:
• high intensity to 10%
• intermediate intensity to 40%
• low intensity to 50%
• costs US$100m/yr
• High cure rate of 80%
• Less relapse
• Fewer side effects
• Save US$51.6m/yr

WLS, 15 min
Mining Errors in Bio Databases
A Case Study on GenBank

Data Cleansing, Koh et al, DBiBD 2005
• 11 types & 28 subtypes of data artifacts
  – Critical artifacts (vector contaminated sequences, duplicates, sequence structure violations)
  – Non-critical artifacts (misspellings, synonyms)
• > 20,000 seq records in public databases contain artifacts
• Identification of these artifacts is important for accurate knowledge discovery
• Sources of artifacts
  – Diverse sources of data
    • Repeated submissions of seqs to db’s
    • Cross-updating of db’s
  – Data annotation
    • Db’s have different ways of annotating data
    • Data entry errors can be introduced
    • Different interpretations
  – Lack of standardized
nomenclature
    • Variations in naming
    • Synonyms, homonyms, & abbreviations
  – Inadequacy of data quality control mechanisms

A Classification of Errors
[Figure: classification of data artifacts. Levels: ATTRIBUTE, RECORD, SINGLE SOURCE DATABASE, MULTIPLE SOURCE DATABASE. Artifact types shown: invalid values, ambiguity, dubious sequences, vector contaminated sequence, cross-annotation error, annotation error, sequence structure violation, sequence redundancy, data provenance flaws, erroneous data transformation, incompatible schema]

Example: Meaningless Seqs
[Figure: classification of data artifacts, highlighting uninformative sequences and undersized sequences]
• Among the 5,146,255 protein records queried using Entrez to the major protein or translated nucleotide databases, 3,327 protein sequences are shorter than four residues (as of Sep 2004)
• In Nov 2004, the total number of undersized protein sequences increased to 3,350
• Among 43,026,887 nucleotide records queried using Entrez to major nucleotide databases, 1,448 records contain sequences shorter than six bases (as of Sep 2004)
• In Nov 2004, the total number of undersized nucleotide sequences increased to 1,711
[Figure: Undersized protein sequences in major databases. Number of records by sequence length (1 to 3) for DDBJ, EMBL, GenBank, SwissProt, PDB, PIR]
[Figure: Undersized nucleotide sequences in major databases. Number of records by sequence length (1 to 5) for DDBJ, EMBL, GenBank, PDB]

Example: Overlapping Intron/Exon
[Figure: classification of data artifacts, highlighting overlapping intron/exon]
• Syn7 gene of putative polyketide
synthase in NCBI TPA record BN000507 has overlapping intron 5 and exon 6
• rpb7+ RNA polymerase II subunit in GenBank record AF055916 has overlapping exon 1 and exon 2

Example: Seqs w/ Identical Info
[Figure: classification of data artifacts, highlighting replication of sequence information, different views, and overlapping annotations of the same sequence]
• Submission of the same sequence to different databases
• Repeated submission of the same sequence to the same database
• Initially submitted by different groups
• Protein sequences may be translated from duplicate nucleotide sequences
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=protein&list_uids=11692005&dopt=GenPept

Association Rule Mining for De-duplication
• Select matching criteria
• Compute similarity scores from known duplicate pairs
• Generate association rules
• Detect duplicates using the rules

Features to Match

Association Rule Mining
Similarity scores of known duplicate pairs:
AAG39642  AAG39643  AC0.9 LE1.0 DE1.0 DB1 SP1 RF1.0 PD0 FT1.0 SQ1.0
AAG39642  Q9GNG8    AC0.1 LE1.0 DE0.4 DB0 SP1 RF1.0 PD0 FT0.1 SQ1.0
P00599    PSNJ1W    AC0.2 LE1.0 DE0.4 DB0 SP1 RF1.0 PD0 FT1.0 SQ1.0
P01486    NTSREB    AC0.0 LE1.0 DE0.3 DB0 SP1 RF1.0 PD0 FT1.0 SQ1.0
O57385    CAA11159  AC0.1 LE1.0 DE0.5 DB0 SP1 RF0.0 PD0 FT0.1 SQ1.0
S32792    P24663    AC0.0 LE1.0 DE0.4 DB0 SP1 RF0.5 PD0 FT1.0 SQ1.0
P45629    S53330    AC0.0 LE1.0 DE0.2 DB0 SP1 RF1.0 PD0 FT1.0 SQ1.0

Frequent item-sets with support, produced by association rule mining:
LE1.0 PD0 SQ1.0 (99.7%)
SP1 PD0 SQ1.0 (97.1%)
SP1 LE1.0 PD0 SQ1.0 (96.8%)
DB0 PD0 SQ1.0 (93.1%)
DB0 LE1.0 PD0 SQ1.0 (92.8%)
DB0 SP1 PD0 SQ1.0 (90.4%)
DB0 SP1 LE1.0 PD0 SQ1.0 (90.1%)
RF1.0 SP1 LE1.0 PD0 SQ1.0 (47.6%)
RF1.0 DB0 LE1.0 PD0 SQ1.0 (44.0%)

Dataset
• Source: Entrez (GenBank, GenPept, SwissProt, DDBJ, PIR, PDB)
• Query "scorpion AND (venom OR toxin)": scorpion venom dataset containing 520 records; expert annotation gives 251 duplicate pairs
• Query "serpentes AND venom AND PLA2": snake PLA2 venom dataset containing 780 records; expert annotation gives 444 duplicate pairs
• 695 duplicate pairs are collectively identified

Results
[Figure: Duplicates detected by association rules. Bar chart of FP% and FN% for Rules 1 to 7 (legend as extracted: FP%, FN% x 1000)]
• Rule 1. Identical sequences with the same sequence length and not originating from PDB are 99.7% likely to be duplicates.
• Rule 2. Identical sequences of the same species and not originating from PDB are 97.1% likely to be duplicates.
• Rule 3. Identical sequences with the same sequence length, of the same species, and not originating from PDB are 96.8% likely to be duplicates.

Rule 1: S(Seq)=1 ^ N(Seq Length)=1 ^ M(PDB)=0 (99.7%)
Rule 2: S(Seq)=1 ^ M(PDB)=0 ^ M(Species)=1 (97.1%)
Rule 3: S(Seq)=1 ^ N(Seq Length)=1 ^ M(Species)=1 ^ M(PDB)=0 (96.8%)
Rule 4: S(Seq)=1 ^ M(PDB)=0 ^ M(DB)=0 (93.1%)
Rule 5: S(Seq)=1 ^ N(Seq Length)=1 ^ M(PDB)=0 ^ M(DB)=0 (92.8%)
Rule 6: S(Seq)=1 ^ M(Species)=1 ^ M(PDB)=0 ^ M(DB)=0 (90.4%)
Rule 7: S(Seq)=1 ^ N(Seq Length)=1 ^ M(Species)=1 ^ M(PDB)=0 ^ M(DB)=0 (90.1%)

Q&A / Wrap Up

References (I)
• H. Liu & L.
Wong, “Data mining tools for biological sequences”, JBCB, 1:139-168, 2003
• J. Li et al., “Simple Rules Underlying Gene Expression Profiles of More than Six Subtypes of Acute Lymphoblastic Leukemia (ALL) Patients”, Bioinformatics, 19:71-78, 2003
• J. Koh et al., “A Classification of Biological Data Artifacts”, DBiBD, 2005

References (II)
• HL Chan, TW Lam, WK Sung, Prudence WH Wong, SM Yiu, and X Fan, “The mutated subsequence problem and locating conserved genes”, Bioinformatics, 21(10):2271-2278, 2005
• Trinh ND Huynh, J Jansson, WK Sung, and NB Nguyen, “Constructing a Smallest Refining Galled Phylogenetic Network”, RECOMB, 2005, pages 265-280
• W Shen, WK Sung, N Sze, “DTSeq: Decision Tree based De Novo peptide sequencing”, in preparation

Acknowledgements
• TIS Prediction: Huiqing Liu, Roland Yap, Fanfan Zeng
• Treatment Optimization for Childhood ALL: James Downing, Huiqing Liu, Jinyan Li, Allen Yeoh
• Mining Errors from Bio DB: Vladimir Brusic, Judice Koh, Mong Li Lee
• Whole genome alignment: Tak-Wah Lam, Siu-Ming Yiu, Ho-Leung Chan, Prudence WH Wong
• Phylogenetic network: Jansson Jesper, Trinh ND Huynh, Nguyen Bao Nguyen
• Protein peptide sequencing: Shen Wei, Newman Sze
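Appendix sketch for the de-duplication case study. The slides describe mining frequent item-sets from similarity-score tokens of known duplicate pairs, then using those item-sets as rules. The following is a minimal illustration of that idea, not the authors' implementation: the function names `frequent_itemsets` and `matches_rule`, the 0.8 support threshold, and the brute-force enumeration (rather than Apriori or another miner) are all assumptions made here. It runs only on the seven sample rows shown on the slide, so the supports it computes differ from the percentages reported for the full set of 695 duplicate pairs.

```python
from collections import Counter
from itertools import combinations

# The seven sample rows of similarity-score tokens shown on the slide
# (record identifiers omitted; only the tokens take part in mining).
SAMPLE_ROWS = [
    "AC0.9 LE1.0 DE1.0 DB1 SP1 RF1.0 PD0 FT1.0 SQ1.0",
    "AC0.1 LE1.0 DE0.4 DB0 SP1 RF1.0 PD0 FT0.1 SQ1.0",
    "AC0.2 LE1.0 DE0.4 DB0 SP1 RF1.0 PD0 FT1.0 SQ1.0",
    "AC0.0 LE1.0 DE0.3 DB0 SP1 RF1.0 PD0 FT1.0 SQ1.0",
    "AC0.1 LE1.0 DE0.5 DB0 SP1 RF0.0 PD0 FT0.1 SQ1.0",
    "AC0.0 LE1.0 DE0.4 DB0 SP1 RF0.5 PD0 FT1.0 SQ1.0",
    "AC0.0 LE1.0 DE0.2 DB0 SP1 RF1.0 PD0 FT1.0 SQ1.0",
]


def frequent_itemsets(rows, min_support=0.8, max_size=4):
    """Return {itemset: support} for every itemset of up to max_size tokens
    whose support (fraction of rows containing it) is at least min_support.

    Brute-force enumeration; hypothetical helper, fine for tiny inputs only.
    """
    transactions = [frozenset(row.split()) for row in rows]
    counts = Counter()
    for items in transactions:
        ordered = sorted(items)  # canonical order so tuples are comparable
        for size in range(1, max_size + 1):
            counts.update(combinations(ordered, size))
    total = len(transactions)
    return {
        itemset: n / total
        for itemset, n in counts.items()
        if n / total >= min_support
    }


def matches_rule(row, rule):
    """True if a candidate pair's tokens contain every token of the rule."""
    return set(rule) <= set(row.split())


itemsets = frequent_itemsets(SAMPLE_ROWS)
```

On these seven rows the item-set `("LE1.0", "PD0", "SQ1.0")` appears in every row, mirroring Rule 1, while `("DB0", "PD0", "SQ1.0")` misses only the first row, which has `DB1`. A candidate pair would be flagged as a likely duplicate when `matches_rule` holds for one of the mined item-sets.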