Bioinformatics in Practice

A Tutorial for DS2005, 8 Oct 2005
Wing-Kin Sung
Limsoon Wong
Practicing Bioinformatics
Tutorial Outline
• Intro to biology & bioinformatics apps (10 min, KS)
• DNA feature recognition (20 min, WLS)
• Protein function inference (20 min, WLS)
• Q&A/break (10 min)
• Whole genome alignment (20 min, KS)
• Phylogenetic network (20 min, KS)
• Peptide sequencing by mass spec (20 min, KS)
• Q&A/break (10 min)
• Disease treatment optimization (15 min, WLS)
• Mining errors in bio databases (15 min, WLS)
• Q&A (10 min)
Copyright © 2005 by Wing-Kin Sung and Limsoon Wong
Introduction to Biology & Bioinformatics Applications (KS, 10 min)
Body and Cell
• Our body consists of a
number of organs
• Each organ is composed
of a number of tissues
• Each tissue is composed
of cells of the same type
• Cells perform two types of
function
– Chemical reactions
needed to maintain our life
– Pass info for maintaining
life to next generation
• In particular
– Protein performs
chemical reactions
– DNA stores & passes info
– RNA is intermediate
between DNA & proteins
DNA
• Stores instructions needed
by the cell to perform daily
life function
• Consists of two strands
interwoven together to form
a double helix
• Each strand is a chain of
some small molecules
called nucleotides
Francis Crick shows James Watson the model of DNA
in their room number 103 of the Austin Wing at the
Cavendish Laboratories, Cambridge
Classification of Nucleotides
• 5 different nucleotides: adenine(A), cytosine(C), guanine(G),
thymine(T), & uracil(U)
• A, G are purines. They have a 2-ring structure
• C, T, U are pyrimidines. They have a 1-ring structure
• DNA only uses A, C, G, & T
[Figure: chemical structures of the bases A, C, G, T, and U]
Watson-Crick Rule
• DNA is double stranded in a cell
• One strand is reverse complement of the other
• Complementary bases:
– A with T (two hydrogen-bonds)
– C with G (three hydrogen-bonds)
[Figure: A-T and C-G base pairs, with 10 Å dimension labels]
Chromosome
• DNA is usually tightly wound around histone
proteins and forms a chromosome
• The total info stored in all chromosomes constitutes
a genome
• In most multi-cell organisms, every cell contains the
same complete set of chromosomes
– May have some small diff due to mutation
• Human genome has about 3 billion bases, organized in 23 pairs
of chromosomes
Gene
• A gene is a sequence of DNA that encodes a protein
or an RNA molecule
• About 30,000 – 35,000 (protein-coding) genes in
human genome
• For a gene that encodes a protein
– In a prokaryotic genome, one gene corresponds to one
protein
– In a eukaryotic genome, one gene may correspond to
more than one protein because of
“alternative splicing”
Central Dogma
• Gene expression consists
of two steps
–Transcription
DNA → mRNA
–Translation
mRNA → Protein
Genetic Code
• Start codon:
ATG (code for M)
• Stop codon:
TAA, TAG, TGA
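The start and stop codons above can be tried out with a small translation sketch. This is only an illustration: it assumes the standard genetic code, an upper-case DNA string over A, C, G, T, and the function name `translate_from_first_atg` is made up for this example.

```python
# Standard genetic code, with codons enumerated using base order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate_from_first_atg(dna):
    """Translate from the first ATG up to (not including) the first in-frame stop."""
    start = dna.find("ATG")
    if start < 0:
        return ""
    protein = []
    for pos in range(start, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[pos:pos + 3]]
        if amino_acid == "*":  # stop codon: TAA, TAG or TGA
            break
        protein.append(amino_acid)
    return "".join(protein)
```

Note that this naive version always starts at the first ATG, which is exactly the assumption the later TIS case study questions.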
Protein
• A sequence composed
from an alphabet of 20
amino acids
– Length is usually 20 to
5000 amino acids
– Average around 350
amino acids
• Folds into 3D shape,
forming the building block &
performing most of the
chemical reactions within a
cell
Classification of Amino Acids
• Amino acids can be
classified into 4 types
• Positively charged (basic)
–Arginine (Arg, R)
–Histidine (His, H)
–Lysine (Lys, K)
• Negatively charged (acidic)
–Aspartic acid (Asp, D)
–Glutamic acid (Glu, E)
• Polar (overall uncharged,
but with uneven charge
distribution; can form
hydrogen bonds with water,
so they are called hydrophilic)
–Asparagine (Asn, N)
–Cysteine (Cys, C)
–Glutamine (Gln, Q)
–Glycine (Gly, G)
–Serine (Ser, S)
–Threonine (Thr, T)
–Tyrosine (Tyr, Y)
• Nonpolar (overall uncharged
and with uniform charge
distribution; cannot form
hydrogen bonds with water,
so they are called hydrophobic)
–Alanine (Ala, A)
–Isoleucine (Ile, I)
–Leucine (Leu, L)
–Methionine (Met, M)
–Phenylalanine (Phe, F)
–Proline (Pro, P)
–Tryptophan (Trp, W)
–Valine (Val, V)
Bioinformatics Applications
• Bio Data Searching
• Gene finding
• Cis-regulatory DNA
• Gene/Protein Network
• Protein/RNA Struct Prediction
• Evolutionary Tree Construction
• Infer Protein Function
• Disease Diagnosis, Prognosis,
& Treatment Optimization, ...
DNA Feature Recognition (WLS, 20 min)
A Case Study on Translation Initiation Sites
Translation Initiation Site (TIS)
A Sample cDNA
299 HSU27655.1 CAT U27655 Homo sapiens
CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA
GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
............................................................
................................iEEEEEEEEEEEEEEEEEEEEEEEEEEE
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
• What makes the second ATG the TIS?
Approach
• Training data gathering
• Signal generation
– k-grams, distance, domain know-how, …
• Signal selection
– Entropy, χ², CFS, t-test, domain know-how…
• Signal integration
– SVM, ANN, PCL, CART, C4.5, kNN, ...
Training & Testing Data
• Vertebrate dataset of Pedersen & Nielsen [ISMB’97]
• 3312 sequences
• 13503 ATG sites
• 3312 (24.5%) are TIS
• 10191 (75.5%) are non-TIS
• Used for 3-fold cross-validation experiments
Signal Generation
• K-grams (i.e., k consecutive letters)
– K = 1, 2, 3, 4, 5, …
– Window size vs. fixed position
– Upstream, downstream vs. anywhere in window
– In-frame vs. any frame
Too Many Signals
• For each k, there are 4^k × 3 × 2 k-grams
• If we use k = 1, 2, 3, 4, 5, we have
24 + 96 + 384 + 1536 + 6144 = 8184 features!
• This is too many for most machine learning algo’s
⇒ Need to do signal selection
– t-stats, χ², CFS, signal-to-noise, entropy, gini index,
info gain, info gain ratio, ...
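The feature count can be checked with a few lines of arithmetic. The reading of the factors is an assumption: 3 is taken as the upstream / downstream / anywhere choices and 2 as the in-frame / any-frame choices from the previous slide; the function and parameter names are illustrative.

```python
def num_kgram_features(k, regions=3, frames=2):
    """Number of k-gram signals: 4^k possible k-grams times the location options."""
    return 4 ** k * regions * frames


counts = [num_kgram_features(k) for k in range(1, 6)]
total = sum(counts)
```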
Signal Selection Basic Idea
• Choose a signal w/ low intra-class distance
• Choose a signal w/ high inter-class distance
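One common way to turn this idea into a number is a t-statistic: the gap between the class means (inter-class distance) divided by a measure of the spread within the classes (intra-class distance). The sketch below uses the Welch form, which is an assumption; the exact formula on the next slide is not reproduced here. Each class needs at least two values.

```python
from math import sqrt
from statistics import mean, variance


def t_statistic(values_a, values_b):
    """Welch-style t-statistic of one signal over two classes (larger = better separated)."""
    n_a, n_b = len(values_a), len(values_b)
    # Intra-class spread: sample variances scaled by the class sizes.
    spread = sqrt(variance(values_a) / n_a + variance(values_b) / n_b)
    # Inter-class distance: absolute gap between the class means.
    return abs(mean(values_a) - mean(values_b)) / spread
```

Signals would then be ranked by this value and the top ones kept.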
Signal Selection by T-Statistics
Signal Selection by χ²
Sample K-Grams Selected by CFS
• Position –3 (Kozak consensus)
• In-frame upstream ATG (leaky scanning)
• In-frame downstream
–TAA, TAG, TGA (stop codon)
–CTG, GAC, GAG, and GCC (codon bias?)
Signal Integration
• kNN
– Given a test sample, find the k training samples that
are most similar to it. Let the majority class win
• SVM
– Given a group of training samples from two classes,
determine a separating hyperplane that maximises the
margin between the two classes
• Naïve Bayes, ANN, C4.5, CS4, ...
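The kNN rule above fits in a few lines. This sketch assumes each sample is a tuple of 0/1 signal indicators and uses Hamming distance as the similarity measure; both choices, and the names `hamming` and `knn_predict`, are illustrative rather than taken from the tutorial.

```python
from collections import Counter


def hamming(u, v):
    """Number of positions at which two equal-length feature tuples differ."""
    return sum(a != b for a, b in zip(u, v))


def knn_predict(train, query, k=3):
    """train is a list of (feature_tuple, label); return the majority label of the k nearest."""
    nearest = sorted(train, key=lambda item: hamming(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```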
Results (3-fold x-validation)
                 TP/(TP + FN)   TN/(TN + FP)   TP/(TP + FP)   Accuracy
Naïve Bayes      84.3%          86.1%          66.3%          85.7%
SVM              73.9%          93.2%          77.9%          88.5%
Neural Network   77.6%          93.2%          78.8%          89.4%
Decision Tree    74.0%          94.4%          81.1%          89.4%
Technique Comparisons
• Pedersen & Nielsen [ISMB’97]
– Neural network
– No explicit features
• Zien [Bioinformatics’00]
– SVM+kernel engineering
– No explicit features
• Hatzigeorgiou [Bioinformatics’02]
– Multiple neural networks
– Scanning rule
– No explicit features
• This approach
– Explicit feature generation
– Explicit feature selection
– Use any machine learning
method w/o any form of
complicated tuning
– Scanning rule is optional
mRNA → Protein
How about using k-grams
from the translation?
[Figure: genetic code chart mapping codons to amino acids]
Amino-Acid Features
Amino Acid K-Grams Discovered (by entropy)
Validation Results (on Chr X and Chr 21)
[Figure: results of our method compared with ATGpr]
• Using the top 100 features selected by entropy, with an
SVM trained on Pedersen & Nielsen’s dataset
Protein Function Inference (WLS, 20 min)
• Guilt by Association
• Genome Phylogenetic Profiling
Motivations for Sequence Comparison
• DNA is the blueprint for living organisms
• Evolution is related to changes in DNA
• By comparing DNA sequences we can infer
evolutionary relationships between the sequences w/o
knowledge of the evolutionary events themselves
• Foundation for inferring function, active site, and key
mutations
Sequence Alignment
• Key aspect of sequence
comparison is sequence
alignment
• A sequence alignment
maximizes the number of
positions that are in
agreement in two
sequences
[Figure: alignment of Sequence U and Sequence V, showing a match, a mismatch, and an indel]
Sequence Alignment: Poor Example
• Poor seq alignment shows few matched positions
• The two proteins are not likely to be homologous
No obvious match between
Amicyanin and Ascorbate Oxidase
Sequence Alignment: Good Example
• Good alignment has clusters of extensive matched positions
• The two proteins are likely to be homologous
good match between
Amicyanin and unknown M. loti protein
Function Assignment to Protein Sequence
SPSTNRKYPPLPVDKLEEEINRRMADDNKLFREEFNALPACPIQATCEAASKEENKEKNR
YVNILPYDHSRVHLTPVEGVPDSDYINASFINGYQEKNKFIAAQGPKEETVNDFWRMIWE
QNTATIVMVTNLKERKECKCAQYWPDQGCWTYGNVRVSVEDVTVLVDYTVRKFCIQQVGD
VTNRKPQRLITQFHFTSWPDFGVPFTPIGMLKFLKKVKACNPQYAGAIVVHCSAGVGRTG
TFVVIDAMLDMMHSERKVDVYGFVSRIRAQRCQMVQTDMQYVFIYQALLEHYLYGDTELE
VT
• How do we attempt to assign a function to a new
protein sequence?
Guilt-by-Association
• Compare T with seqs of known function in a db
• Assign to T same function as homologs
• Confirm with suitable wet experiments
• Discard this function as a candidate if the experiments do not confirm it
Homologs Obtained by BLAST
• Thus our example sequence could be a protein
tyrosine phosphatase (PTP)
Example Alignment with PTP
Guilt-by-Association: Caveats
• Ensure that the effect of database size has been
accounted for
• Ensure that the function of the homolog is not
derived via invalid “transitive assignment”
• Ensure that the target sequence has all the key
features associated with the function, e.g., active site
and/or domain
Examples of Invalid Function Assignment:
The IMP dehydrogenases (IMPDH)
A partial list of IMP dehydrogenase misnomers
in complete genomes remaining in some
public databases
IMPDH Domain Structure
IMPDH Misnomer in Methanococcus jannaschii
IMPDH Misnomers in Archaeoglobus fulgidus
• Typical IMPDHs have 2 IMPDH domains that form the
catalytic core and 2 CBS domains.
• A less common but functional IMPDH (E70218) lacks the
CBS domains.
• Misnomers show similarity to the CBS domains
Invalid Transitive Assignment
[Figure: sequences A, B, and C, showing the root of invalid transitive assignment: mis-assignment of function to a sequence with no IMPDH domain]
Protein Function Inference
What if no sequence homolog
with annotated function
can be found?
Phylogenetic Profiling
• Genes (and hence proteins) with identical patterns of
occurrence across phyla tend to function together
• Even if no homolog with known function is available,
it is still possible to infer function of a protein
Phylogenetic Profiling: How it Works
Phylogenetic Profiles: Evidence
Pellegrini et al., PNAS, 96:4285-4288, 1999
Phylogenetic Profiling: Evidence
Wu et al., Bioinformatics, 19:1524--1530, 2003
Hamming distance D(X,Y)
= (#lineages in which X occurs) + (#lineages in which Y occurs)
– 2 × (#lineages in which both X and Y occur)
[Figure: two panels, KEGG and COG; x-axis: Hamming distance (D)]
• Proteins having low hamming distance (thus highly similar
phylogenetic profiles) tend to share common pathways
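The Hamming distance between two phylogenetic profiles is easy to compute once each profile is written as a 0/1 vector over lineages. The sketch below assumes that representation (1 = the protein occurs in that lineage); the function name is illustrative.

```python
def profile_hamming(x, y):
    """Hamming distance between two 0/1 phylogenetic profiles of equal length."""
    occurs_x = sum(x)  # number of lineages in which X occurs
    occurs_y = sum(y)  # number of lineages in which Y occurs
    occurs_both = sum(a and b for a, b in zip(x, y))  # lineages with both
    return occurs_x + occurs_y - 2 * occurs_both
```

This is the same number as counting the lineages in which exactly one of the two proteins occurs.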
Q&A / Break
Whole Genome Alignment (KS, 20 min)
Mouse vs Human
• Mouse and human are closely related species. They
share a lot of gene pairs
Mouse Chr No.   Human Chr No.   # of Published Gene Pairs
2               15              51
7               19              192
14              3               23
14              8               38
15              12              80
15              22              72
16              16              31
16              21              64
16              22              30
17              6               150
17              16              46
17              19              30
18              5               64
19              9               22
19              11              93
Data is extracted from http://www.ncbi.nlm.nih.gov/Homology
Our Aim
• Suppose we are given human and mouse genomes
• Our aim is to extract all the conserved gene pairs
between human and mouse
• One possible solution --- Whole genome alignment!
Different Approaches
Approach                                              Coverage   Precision
MUM                                                   100%       Many false positives
LCS (MUMmer1), Delcher et al, 1999                    Very low   Not many false positives
Clustering (MUMmer2,3), Delcher et al, 2002           76.6%      26.5%
Mutation-Sensitive Alignment (MSA), Chan et al, 2004  91.3%      29.3%
MSA with 1-mismatch anchor, Yiu et al, 2005           94.6%      30.1%
Observation 1
• Though a pair of conserved genes rarely contain
the same entire sequence, they share a lot of short
common substrings and some of them are indeed
unique to this pair of genes!
• For example,
Genome1: ACGACTCAGCTACTGGTCAGCTATTACTTACCGC
Genome2: ACTTCTCTGCTACGGTCAGCTATTCACTTACCGC
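Such shared substrings that occur exactly once in each genome and cannot be extended are the MUMs (maximal unique matches) used in the following slides. The sketch below finds them by brute force, which is only practical for short strings; real tools such as MUMmer rely on suffix-tree style indexes. The function names and the `min_len` parameter are illustrative.

```python
def count_occurrences(text, pattern):
    """Count (possibly overlapping) occurrences of pattern in text."""
    return sum(text.startswith(pattern, i)
               for i in range(len(text) - len(pattern) + 1))


def find_mums(s, t, min_len=1):
    """Return (pos_in_s, pos_in_t, length) for every maximal unique match."""
    mums = []
    for i in range(len(s)):
        for j in range(len(t)):
            if s[i] != t[j]:
                continue
            if i > 0 and j > 0 and s[i - 1] == t[j - 1]:
                continue  # match could be extended to the left, so not maximal here
            length = 0
            while (i + length < len(s) and j + length < len(t)
                   and s[i + length] == t[j + length]):
                length += 1  # extend to the right as far as possible
            match = s[i:i + length]
            if (length >= min_len
                    and count_occurrences(s, match) == 1
                    and count_occurrences(t, match) == 1):
                mums.append((i, j, length))
    return mums
```

For the two genomes above one could call `find_mums(genome1, genome2, min_len=4)` to list the shared unique segments.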
Good News!
• In our experiments, we found that MUMs can
cover nearly 100% of the known conserved gene pairs
Problem Solved?
• We can find MUMs in linear time! Is the problem
solved? Ans: No!
Mouse Chr No.   Human Chr No.   # of Published Gene Pairs   # of MUMs
2               15              51                          96,473
7               19              192                         52,394
14              3               23                          58,708
14              8               38                          38,818
15              12              80                          88,305
15              22              72                          71,613
16              16              31                          66,536
16              21              64                          51,009
16              22              30                          61,200
17              6               150                         94,095
17              16              46                          29,001
17              19              30                          56,536
18              5               64                          131,850
19              9               22                          62,296
19              11              93                          29,814
No. of MUMs >> no. of gene pairs!
There is too much noise!
How can we extract the right MUMs?
Observation 2
• Two related species should preserve the ordering of
most conserved genes
Conserved Genes in Mouse Chromosome
16 and Human Chromosome 16
Solution 2
• Instead of reporting all MUMs to the user,
– Compute the longest common subsequence (LCS) of all MUMs
– Report only the MUMs on the LCS
Example of LCS
12345678
41325768
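The LCS of the two orderings in this example can be computed with the textbook dynamic program. This is a minimal quadratic-time sketch, not the method used inside MUMmer; the function name is illustrative, and when several LCSs exist it returns just one of them.

```python
def lcs(a, b):
    """Return one longest common subsequence of strings a and b."""
    n, m = len(a), len(b)
    # table[i][j] = LCS length of a[:i] and b[:j]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(m):
            if a[i] == b[j]:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])
    # Trace back from the bottom-right corner to recover one LCS.
    out = []
    i, j = n, m
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))
```

For the example, `lcs("12345678", "41325768")` has length 5, so only 5 of the 8 MUMs would be reported.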
Problem of this Approach
• It assumes there exists a single long alignment
• However, this assumption may not always hold
⇒ Therefore, in many cases, LCS can discover only a
few genes
Common Genes in Mouse Chromosome
16 and Human Chromosome 3
Observation 3
• A pair of conserved genes are likely to correspond
to a sequence of MUMs that are consecutive, close
in both genomes, and have sufficient length
[Figure: orderings of 7 MUMs in the two genomes, illustrating clusters]
• The set of such substrings is called a cluster
Solution 3
• Based on Observation 3, MUMmer2 and MUMmer3
try to identify maximal clusters in the genomes
• This approach is quite good. In our experiment,
MUMmer3 can identify ~76.6% of the published gene
pairs
Can We Further Improve?
• Yes. We propose the Similar Subsequence Problem
• In our experiment, we can identify ~91.3% of the
published gene pairs
Observation 4
• If two genomes are closely related, one can be
transformed into the other using a few
transpositions/reversals
Example
• By two transposition/reversal operations, we can
transform Mouse Chr 16 to Human Chr 16
Input
• Given two genomes S and T
• Assume we already know the n MUMs
• Let A=(a1,a2,…,an) and B=(b1,b2,…,bn), respectively,
be the order of the n MUMs in S and T
• E.g., A=(1,2,3,4,5,6,7,8) and B=(1,6,5,4,7,2,3,8)
[Figure: positions of the 8 MUMs in S and T]
Common Subsequence
• A seq C=(c1,c2,…,cm) is a common subseq of A and B
if C is a subsequence of both A and B
• E.g., C=(1,2,3,8) is a common subseq of A and B
• Weight of common subseq is total weight of the MUMs
• A maximum weight common subseq (MWCS) of A and
B is a common subseq with the heaviest weight
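When A is simply (1, 2, …, n), a common subsequence of A and B is an increasing subsequence of B, so an MWCS is a heaviest increasing subsequence. The sketch below relies on that assumption, takes the MUM weights as a dictionary (for instance MUM lengths, which is also an assumption), expects a non-empty B, and runs in quadratic time.

```python
def heaviest_common_subsequence(b, weight):
    """MWCS of A = (1..n) and B: returns (total weight, MUM ids in order)."""
    n = len(b)
    best = [weight[mum] for mum in b]  # best[i]: heaviest chain ending at b[i]
    prev = [-1] * n                    # predecessor index on that chain
    for i in range(n):
        for j in range(i):
            if b[j] < b[i] and best[j] + weight[b[i]] > best[i]:
                best[i] = best[j] + weight[b[i]]
                prev[i] = j
    end = max(range(n), key=lambda i: best[i])
    total = best[end]
    chain = []
    while end != -1:  # walk the predecessor links back to the start
        chain.append(b[end])
        end = prev[end]
    return total, chain[::-1]
```

With unit weights on B = (1,6,5,4,7,2,3,8) the best weight is 4, matching the length of C = (1,2,3,8); giving one MUM a larger weight can change which chain wins.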
Similar Subsequence
• A k-similar subseq consists of k blocks and a backbone
–Backbone is a common subseq w/ k blocks inserted into it
–Each block is a common subseq or reversed common
subseq while all of them are disjoint
• A k-similar subseq models k transpositions/reversals
[Figure: example of a 2-similar subseq of A and B]
Similar Subsequence Problem
• Given two sequences A and B and a parameter k,
the Similar Subsequence Problem finds a k-similar
subsequence with the heaviest weight
• This problem is NP-complete in general
• For a constant k, we can solve the problem in
O(n^(2k+1) log n) time
• We devise a heuristic algorithm to solve it in
O(n^2 (log n + k)) time
Solution 4
• Mutation Sensitive Alignment (MSA) Algorithm: given two genomes S and T,
1. Find all the MUMs
2. Solve the similar subsequence problem
3. Report all the MUMs on the k-similar
subsequence (we set k=4)
Example
[Figure: genomes S and T with MUM orders A=(1,2,…,9) and B=(2,1,7,6,5,8,3,4,9)]
Experiment results
• We apply MUMmer3 and MSA to the following 15
pairs of chromosomes
• For MSA, we set k=4
Mouse Chr No.   Human Chr No.   # of Published Gene Pairs   # of MUMs
2               15              51                          96,473
7               19              192                         52,394
14              3               23                          58,708
14              8               38                          38,818
15              12              80                          88,305
15              22              72                          71,613
16              16              31                          66,536
16              21              64                          51,009
16              22              30                          61,200
17              6               150                         94,095
17              16              46                          29,001
17              19              30                          56,536
18              5               64                          131,850
19              9               22                          62,296
19              11              93                          29,814
Experiment results (II)
Exp. No.   Coverage              Precision
           MUMmer     MSA        MUMmer     MSA
1          76.50%     92.20%     21.70%     22.70%
2          71.40%     91.70%     21.30%     25.10%
3          87.00%     100.00%    24.80%     25.50%
4          76.30%     94.70%     27.40%     26.70%
5          92.50%     96.30%     32.50%     32.00%
6          72.20%     95.80%     31.20%     32.90%
7          67.70%     87.10%     13.50%     17.80%
8          78.10%     90.60%     37.20%     36.70%
9          80.00%     86.70%     40.70%     49.70%
10         82.00%     92.00%     30.90%     32.10%
11         65.20%     89.10%     30.50%     36.00%
12         60.00%     80.00%     27.50%     41.90%
13         89.10%     95.30%     18.20%     18.40%
14         72.70%     86.40%     10.40%     12.60%
15         78.50%     91.40%     30.00%     29.70%
average    76.60%     91.30%     26.50%     29.30%
• Coverage: % of published
gene pairs covered
• Precision: % of reported MUMs
that reside in some published
gene pair
Phylogenetic Network (KS, 20 min)
Phylogenetic Tree
• Phylogenetic tree is a tree whose leaves are labeled
by some species
• It assumes that each species evolved from ONE
ancestor species
• Represented by a rooted tree, distinctly leaf-labeled
[Figure: phylogenetic tree with leaves C. tigris, D. dorsalis, C. draconoides, U. scoparia, and P. platyrhinos]
Limitation of Phylogenetic Tree
• Ford Doolittle (Science 1999) said
– Molecular phylogeneticists will have failed to find the
“true tree”, not because their methods are inadequate or
because they have chosen the wrong genes, but
because the history of life cannot properly be
represented as a tree
More Realistic Assumption
• Evolution involves more than mutation. There are
other types of evolutionary events, such as:
– Hybridization
• E.g. tiger + lion → tiglon
– Horizontal gene transfer
• E.g. Evolution of influenza
• Phylogenetic tree cannot model those types of
events
Another model: Phylogenetic Network
• Generalization of phylogenetic tree in which internal nodes
may have more than one parent
• A network N is a directed acyclic graph such that
–Each node has indegree 1 or 2 (except the root)
–Each node has outdegree at most 2
–No node has both indegree 1 and outdegree 1
–All nodes with outdegree 0 are distinctly labeled (“leaves”)
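The four conditions can be checked mechanically on an edge list. The sketch below additionally assumes exactly one root (a node with indegree 0) and treats node names as the labels, so leaves are distinctly labeled by construction; the function name is illustrative.

```python
from collections import defaultdict, deque


def is_phylogenetic_network(edges):
    """edges: list of (parent, child) pairs. True if the definition above holds."""
    indeg, outdeg = defaultdict(int), defaultdict(int)
    children = defaultdict(list)
    nodes = set()
    for parent, child in edges:
        outdeg[parent] += 1
        indeg[child] += 1
        children[parent].append(child)
        nodes.update((parent, child))
    roots = [x for x in nodes if indeg[x] == 0]
    if len(roots) != 1:  # assumption: a single root
        return False
    for x in nodes:
        if x != roots[0] and indeg[x] not in (1, 2):
            return False
        if outdeg[x] > 2:
            return False
        if indeg[x] == 1 and outdeg[x] == 1:
            return False
    # Acyclicity via Kahn's algorithm: every node must be reachable in topological order.
    remaining = {x: indeg[x] for x in nodes}
    queue = deque(roots)
    seen = 0
    while queue:
        u = queue.popleft()
        seen += 1
        for v in children[u]:
            remaining[v] -= 1
            if remaining[v] == 0:
                queue.append(v)
    return seen == len(nodes)
```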
[Figure: example network with a root, a hybrid node, and leaves x1, x2, x3, x4]
A Special Case:
Galled Phylogenetic Network
• When all cycles in the
phylogenetic network are
node-disjoint, the network is
called a galled network
• The biological significance
of this special case is
described in [D. Gusfield, S.
Eddhu, and C. Langley (CSB
2003)]
[Figure: a general network and a galled network]
Methods for Constructing Network
• Median-joining
• Split decomposition (SplitsTree)
• PYRAMIDS
• Statistical parsimony (TCS)
• Molecular-variance parsimony (Arlequin)
• Reticulogram (T-REX)
• Netting
• NeighborNet
• Perfect phylogeny-based methods
• Constructing galled network from triplets
Maddison Method for Building Network
• Maddison observed that
–If a phylogenetic network for a set of species contains
a single hybrid node then each gene present in the
species must evolve according to one of the two trees
embedded in the network
• Hence, we have the following problem:
–Input: a set of gene trees
–Output: a network which refines all gene trees
Example
• Given a set of 2 trees T={T1, T2}, below is a galled
network N which refines T
[Figure: trees T1 and T2 on leaves x1, …, x6, and a galled network N that refines both]
Difficult to Construct the Network?
• Unfortunately, in general, this problem is NP-hard
• However, if the resulting network is a galled network,
it can be constructed in polynomial time
Framework
• We propose a simple top-down and recursive
framework to solve the problem
1. Partition L (the set of leaves for T) into two
subsets {X,Y} if possible; otherwise, three subsets
{X,Y,Z}
2. For each subset L’=X,Y,Z, recursively construct a
solution network for T|L’
3. Combine the solutions for T|L’ to obtain a network
for T
Illustration (I)
Reason:
• T1|{x1,x2,x3,x4} & T2|{x1,x2,x3,x4}
are subtrees of T1 & T2, resp.
• Similar for T1|{x5,x6} & T2|{x5,x6}
[Figure: T1 and T2 with the leaves partitioned into {x1,x2,x3,x4} and {x5,x6}]
Illustration (II)
[Figure: T1|{x5,x6}, T2|{x5,x6}, and the network constructed for them]
Illustration (III)
Reason:
•T1|{x2,x3} and T2|{x2,x3} are
proper subtrees
•Similar for T1|{x1} and T2|{x1}
•Similar for T1|{x4} and T2|{x4}
[Figure: T1|{x1,x2,x3,x4}, T2|{x1,x2,x3,x4}, and the network constructed for them]
Illustration (IV)
[Figure: T1, T2, and the combined network on leaves x1, …, x6]
Summary
• Given two trees T1 and T2, we can find a galled
network N which refines T1 and T2 in polynomial time
• Since galled networks are biologically meaningful, this is
a big step towards practical construction of phylogenetic
networks
• Open problem: Can we have a practically fast
algorithm for building a general network for T1 and T2?
Peptide Sequencing by Mass Spec (KS, 20 min)
Peptide Sequencing
• Unlike DNA, deducing the amino acid sequence of a
protein peptide is not easy
• The problem of finding the amino acid sequence of a
protein peptide is known as the Peptide Sequencing
Problem
• One solution is to use mass spectrometry
Idea of Sequencing by Mass Spectrum
• M = total weight of the peptide CTVFTEPREFK
• W1 = weight of CTVFT
• M-W1 = weight of EPREFK
• Fragmentation yields
– CTVFT, a b-ion of mass W1+1
– EPREFK, a y-ion of mass M-W1+19
An Example MS/MS Spectrum
Two Ways for Identifying the Amino
Acid Sequence
• Given the spectrum M, there are two ways to
identify the amino acid sequence
– Database searching
• Select the peptide from the database that best
explains the spectrum M
– De Novo sequencing
• Among all possible peptides, find the peptide that
best explains the spectrum M
Basic Idea of De Novo Sequencing
• Input: a spectrum S
• Scoring function: For any peptide P, define a scoring
function score(P,S) to measure the fitness between P
and S
[Figure: is the spectrum predicted for CTVFTEPREFK similar to S?]
• Algorithm: Among all possible peptides, find a
peptide P which maximizes score(P,S)
How to Compute Score(P,S)?
•E.g. Consider a peptide P=SAG
–wt(G)=57.05Da, wt(A)=71.08Da, wt(S)=87.08Da
–y1 = wt(G)+19 = 76.05
–y2 = wt(AG)+19 = 147.13
–y3 = wt(SAG)+19 = 234.21
–b1 = wt(S)+1 = 88.08
–b2 = wt(SA)+1 = 159.16
–b3 = wt(SAG)+1 = 216.21
[Figure: artificial spectrum of SAG. Red peaks: artificial y-ions; green peaks: artificial b-ions]
How to Compute Score(P,S)?
• Simple solution:
–Count the number of
peaks in S whose masses
equal some b-ions or y-ions of P
• For the following example,
–Matched peaks = 4
–Unmatched peaks = 2
[Figure: artificial spectrum of SAG overlaid on the real spectrum. Black peaks: real peaks; red peaks: artificial y-ions; green peaks: artificial b-ions]
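The simple peak-counting score can be sketched as follows. It assumes average residue masses of 57.05 Da for G, 71.08 Da for A, and 87.08 Da for S, takes b-ions as prefix mass + 1 and y-ions as suffix mass + 19 as in the fragmentation slide, and matches peaks within a tolerance `tol`; the tolerance and function names are illustrative choices, not part of the tutorial.

```python
# Average residue masses (Da) for the three residues of the example peptide.
RESIDUE_MASS = {"G": 57.05, "A": 71.08, "S": 87.08}


def theoretical_ions(peptide):
    """Return (b_ions, y_ions): prefix masses + 1 and suffix masses + 19."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions = [sum(masses[:i]) + 1 for i in range(1, len(masses) + 1)]
    y_ions = [sum(masses[-i:]) + 19 for i in range(1, len(masses) + 1)]
    return b_ions, y_ions


def count_matches(peptide, peaks, tol=0.5):
    """Return (matched, unmatched) counts of spectrum peaks against the ions of peptide."""
    b_ions, y_ions = theoretical_ions(peptide)
    ions = b_ions + y_ions
    matched = sum(any(abs(peak - ion) <= tol for ion in ions) for peak in peaks)
    return matched, len(peaks) - matched
```

A de novo search would then look for the peptide P that maximizes the matched count, or a better intensity-aware score.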
Factors Affecting Intensity (I)
• y-ions are more intense than b-ions
• More intense y-ion ⇒ more intense b-ion, & vice versa
[Figure: fragmentation of CTVFTEPREFK into CTVFT, a b-ion of mass W1+1, and EPREFK, a y-ion of mass M-W1+19]
Factors Affecting Intensity (II)
• Mass of the fragment will affect its intensity
• Peaks in the middle of spectrum have higher intensity
Factors Affecting Intensity (III)
• a_1…a_j a_{j+1}…a_n (b-ion: a_1…a_j, y-ion: a_{j+1}…a_n)
• Amino acid at the cleavage site affects intensity
– E.g. Low intensity for the b-ion if a_j = P
• Presence of basic residues
• Precursor charge
• Hydrophobicity and helicity
• …
A Better Score Function
• We propose to model the factors using decision tree
• Then, we give a better score function
[Figure: decision trees for b-ion and y-ion intensity. Annotations: terminal part has low intensity; P gives lower intensity; large mass cannot be detected]
Algorithm
• Among all possible peptides, find a peptide P that
maximizes score(P,S)
• This problem can be solved by dynamic
programming
• For instance, we can use
–Sandwich algorithm proposed by Bin Ma; or
–Spectrum graph algorithm proposed by Ting Chen
Experiment Results
•Data set
–Training set: 1260 high
confident spectra of doubly
charged tryptic peptides
(from Genome Inst of
S’pore)
–Testing set: 400 high
confident spectra from
Open Proteomics
Database
–Length from 9 to 18
(Average 13.7)
• Result
– Accuracy = (no. of correctly predicted amino acids) / (no. of predicted amino acids)
– Compare with two other algorithms:
• Peaks: one of the best de novo algorithms
• PepNovo: de novo algorithm with intensity-based scoring function
Experiment Results
• Compare accuracy
• Compare maximal correct subsequence length
–Proportions of subsequence length longer than l (3-10)
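The two measures compared here can be sketched as follows. This is only an illustration: it assumes a predicted residue is "correct" when it equals the true residue at the same position, which may differ from the matching criterion actually used in the experiments. The function names are invented.

```python
def correct_flags(predicted, actual):
    """One flag per predicted residue: True if it equals the true residue at that index."""
    return [i < len(actual) and p == actual[i] for i, p in enumerate(predicted)]

def accuracy(predicted, actual):
    """No. of correctly predicted amino acids / no. of predicted amino acids."""
    flags = correct_flags(predicted, actual)
    return sum(flags) / len(flags)

def max_correct_run(predicted, actual):
    """Length of the longest run of consecutive correct residues."""
    best = run = 0
    for ok in correct_flags(predicted, actual):
        run = run + 1 if ok else 0
        best = max(best, run)
    return best

# Hypothetical prediction with one wrong residue at index 4.
print(accuracy("CTVFSEPR", "CTVFTEPR"), max_correct_run("CTVFSEPR", "CTVFTEPR"))
```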
Q&A / Break
•WLS,
•15 min
Disease Treatment
Optimization
A Case Study on Childhood ALL
Childhood ALL
• Major subtypes are:
– T-ALL, E2A-PBX, TEL-AML, MLL genome rearrangements, BCR-ABL, Hyperdiploid>50
• Diff subtypes respond differently to same Tx
→ Over-intensive Tx
– Development of sec cancers
– Reduction of IQ
→ Under-intensive Tx
– Relapse
• The subtypes look similar
• Conventional diagnosis
– Immunophenotyping
– Cytogenetics
– Molecular diagnostics
→ Unavailable in most ASEAN countries
Single-Test Platform of
Microarray & Machine Learning
Image credit: Affymetrix
Overall Strategy
Diagnosis of subtype → Subtype-dependent prognosis → Risk-stratified treatment intensity
• For each subtype, select genes to develop classification model for diagnosing that subtype
• For each subtype, select genes to develop prediction model for prognosis of that subtype
Subtype Diagnosis by PCL
• Gene expression data collection
• Gene selection by χ2
• PCL classifier training by emerging patterns
• Apply PCL for diagnosis of future cases
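The gene selection step can be illustrated with a minimal sketch of χ2 ranking. It assumes each gene's expression has already been discretized by a single cut point per gene; the `thresholds` argument is a hypothetical input, and the slides do not say how cut points are chosen. The function names are invented for illustration.

```python
def chi_square(values, labels, threshold):
    """Chi-square statistic of the 2x2 table (value > threshold) x (class label 0/1)."""
    observed = [[0, 0], [0, 0]]
    for v, y in zip(values, labels):
        observed[int(v > threshold)][int(y)] += 1
    n = len(values)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            row = sum(observed[i])
            col = observed[0][j] + observed[1][j]
            expected = row * col / n
            if expected > 0:  # skip empty rows or columns
                stat += (observed[i][j] - expected) ** 2 / expected
    return stat

def select_genes(expr, labels, thresholds, k):
    """Rank genes by chi-square and keep the top k. expr maps gene -> list of values."""
    scores = {g: chi_square(vals, labels, thresholds[g]) for g, vals in expr.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Made-up data: g1 separates the two classes perfectly, g2 not at all.
expr = {"g1": [1, 2, 8, 9], "g2": [1, 8, 2, 9]}
print(select_genes(expr, [0, 0, 1, 1], {"g1": 5, "g2": 5}, k=1))
```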
Emerging Patterns
• An emerging pattern is a set of conditions
– usually involving several features
– that most members of a class satisfy
– but none or few of the other class satisfy
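This definition can be turned into a small check. The sketch below treats a pattern as a list of conditions on a sample and compares how often it holds in each class. The cut-offs `min_home` and `max_other` are hypothetical stand-ins for "most" and "none or few", and the gene names and values are made up.

```python
def support(pattern, samples):
    """Fraction of samples that satisfy every condition in the pattern."""
    hits = sum(all(cond(s) for cond in pattern) for s in samples)
    return hits / len(samples)

def is_emerging(pattern, home, other, min_home=0.7, max_other=0.1):
    """True if the pattern is frequent in `home` and rare in `other` (assumed cut-offs)."""
    return support(pattern, home) >= min_home and support(pattern, other) <= max_other

# Hypothetical expression values for two made-up genes.
class_a = [{"g1": 9, "g2": 1}, {"g1": 8, "g2": 2}, {"g1": 7, "g2": 1}, {"g1": 9, "g2": 6}]
class_b = [{"g1": 2, "g2": 1}, {"g1": 3, "g2": 7}, {"g1": 8, "g2": 8}]
pattern = [lambda s: s["g1"] > 5, lambda s: s["g2"] < 5]

print(support(pattern, class_a), support(pattern, class_b))
print(is_emerging(pattern, class_a, class_b))
```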
PCL: Prediction by Collective Likelihood
Childhood ALL Subtype Diagnosis Workflow
A tree-structured diagnostic workflow was recommended by our doctor collaborator
Training and Testing Sets
Accuracy of Various Classifiers
The classifiers are all applied to the 20 genes selected by χ2 at each level of the tree
Understandability of EP & PCL
• E.g., for T-ALL vs. OTHERS, one ideally
discriminatory gene 38319_at was found, inducing
these 2 EPs
• These give us the diagnostic rule
Conclusions
Conventional Tx:
• intermediate intensity to everyone
→ 10% suffer relapse
→ 50% suffer side effects
→ costs US$150m/yr
Our optimized Tx:
• high intensity to 10%
• intermediate intensity to 40%
• low intensity to 50%
• costs US$100m/yr
• High cure rate of 80%
• Fewer relapses
• Fewer side effects
• Save US$51.6m/yr
•WLS,
•15 min
Mining Errors in
Bio Databases
A Case Study on GenBank
Data Cleansing, Koh et al, DBiBD 2005
• 11 types & 28 subtypes of data artifacts
– Critical artifacts (vector contaminated sequences, duplicates, sequence structure violations)
– Non-critical artifacts (misspellings, synonyms)
• > 20,000 seq records in public databases contain artifacts
• Identification of these artifacts is impt for accurate knowledge discovery
• Sources of artifacts
– Diverse sources of data
• Repeated submissions of seqs to db’s
• Cross-updating of db’s
– Data annotation
• Db’s have diff ways for data annotation
• Data entry errors can be introduced
• Different interpretations
– Lack of standardized nomenclature
• Variations in naming
• Synonyms, homonyms, & abbrevn
– Inadequacy of data quality control mechanisms
A Classification of Errors
Example Meaningless Seqs
[Side diagram: classification of errors across the ATTRIBUTE, RECORD, SINGLE SOURCE DATABASE, and MULTIPLE SOURCE DATABASE levels, covering invalid values, ambiguity, dubious sequences, vector contaminated sequence, cross-annotation error, annotation error, sequence structure violation, sequence redundancy, data provenance flaws, erroneous data transformation, and incompatible schema. Highlighted: uninformative sequences, undersized sequences]
• Among the 5,146,255 protein records retrieved by Entrez queries against the major protein or translated nucleotide databases, 3,327 protein sequences are shorter than four residues (as of Sep 2004).
• By Nov 2004, the total number of undersized protein sequences had increased to 3,350.
• Among the 43,026,887 nucleotide records retrieved by Entrez queries against the major nucleotide databases, 1,448 records contain sequences shorter than six bases (as of Sep 2004).
• By Nov 2004, the total number of undersized nucleotide sequences had increased to 1,711.
[Figure: bar chart, Undersized protein sequences in major databases. Number of records by sequence length (1 to 3) for DDBJ, EMBL, GenBank, SwissProt, PDB, and PIR]
[Figure: bar chart, Undersized nucleotide sequences in major databases. Number of records by sequence length (1 to 5) for DDBJ, EMBL, GenBank, and PDB]
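A check for undersized sequences is simple to sketch. The cut-offs below follow the slide (fewer than four residues for proteins, fewer than six bases for nucleotides); the record layout as `(id, sequence)` pairs and the example IDs are assumptions made for illustration.

```python
# Minimum lengths taken from the slide: shorter than this counts as undersized.
MIN_LENGTH = {"protein": 4, "nucleotide": 6}

def undersized(records, kind):
    """Return the IDs of records whose sequence is shorter than the cut-off for `kind`."""
    limit = MIN_LENGTH[kind]
    return [rec_id for rec_id, seq in records if len(seq) < limit]

# Hypothetical records, not real database entries.
proteins = [("rec1", "MK"), ("rec2", "MKVL"), ("rec3", "")]
print(undersized(proteins, "protein"))
```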
Example Overlapping Intron/Exon
[Side diagram: classification of errors. Highlighted: overlapping intron/exon]
• Syn7 gene of putative polyketide synthase in NCBI TPA record BN000507 has overlapping intron 5 and exon 6.
• rpb7+ RNA polymerase II subunit in GenBank record AF055916 has overlapping exon 1 and exon 2.
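Overlaps of this kind can be found by comparing feature coordinates. The sketch below assumes each feature is given as `(name, start, end)` with inclusive coordinates; the coordinates in the example are invented and are not taken from BN000507 or AF055916.

```python
def overlapping_features(features):
    """Return pairs of feature names whose inclusive [start, end] ranges overlap."""
    ordered = sorted(features, key=lambda f: f[1])  # sort by start coordinate
    clashes = []
    for i, (name_a, start_a, end_a) in enumerate(ordered):
        for name_b, start_b, end_b in ordered[i + 1:]:
            if start_b > end_a:
                break  # sorted by start, so no later feature can overlap name_a
            clashes.append((name_a, name_b))
    return clashes

# Hypothetical coordinates: intron 5 runs past the start of exon 6.
features = [("exon 5", 100, 200), ("intron 5", 201, 300), ("exon 6", 290, 400)]
print(overlapping_features(features))
```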
Example Seqs w/ Identical Info
Replication of sequence information
• Different views
• Overlapping annotations of the same sequence
• Submission of the same sequence to different databases
• Repeated submission of the same sequence to the same database
• Initially submitted by different groups
• Protein sequences may be translated from duplicate nucleotide sequences
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=protein&list_uids=11692005&dopt=GenPept
Association Rule Mining for De-duplication
Select matching criteria
Compute similarity scores from known duplicate pairs
Generate association rules
Detect duplicates using the rules
Features to Match
Association Rule Mining
Similarity scores of known duplicate pairs:
AAG39642 AAG39643 AC0.9 LE1.0 DE1.0 DB1 SP1 RF1.0 PD0 FT1.0 SQ1.0
AAG39642 Q9GNG8 AC0.1 LE1.0 DE0.4 DB0 SP1 RF1.0 PD0 FT0.1 SQ1.0
P00599 PSNJ1W AC0.2 LE1.0 DE0.4 DB0 SP1 RF1.0 PD0 FT1.0 SQ1.0
P01486 NTSREB AC0.0 LE1.0 DE0.3 DB0 SP1 RF1.0 PD0 FT1.0 SQ1.0
O57385 CAA11159 AC0.1 LE1.0 DE0.5 DB0 SP1 RF0.0 PD0 FT0.1 SQ1.0
S32792 P24663 AC0.0 LE1.0 DE0.4 DB0 SP1 RF0.5 PD0 FT1.0 SQ1.0
P45629 S53330 AC0.0 LE1.0 DE0.2 DB0 SP1 RF1.0 PD0 FT1.0 SQ1.0
Association rule mining
Frequent item-sets with support:
LE1.0 PD0 SQ1.0 (99.7%)
SP1 PD0 SQ1.0 (97.1%)
SP1 LE1.0 PD0 SQ1.0 (96.8%)
DB0 PD0 SQ1.0 (93.1%)
DB0 LE1.0 PD0 SQ1.0 (92.8%)
DB0 SP1 PD0 SQ1.0 (90.4%)
DB0 SP1 LE1.0 PD0 SQ1.0 (90.1%)
RF1.0 SP1 LE1.0 PD0 SQ1.0 (47.6%)
RF1.0 DB0 LE1.0 PD0 SQ1.0 (44.0%)
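The support figures can be reproduced in miniature. The sketch below treats the score tokens of each duplicate pair as a transaction and counts how often an item-set occurs. It uses only the seven example rows from this slide, so the supports it reports differ from the percentages above, which come from the full set of known duplicate pairs. The brute-force enumeration is an illustrative stand-in, not necessarily the mining algorithm the authors used.

```python
from itertools import combinations

# The seven example rows from the slide, without the accession pairs.
ROWS = [
    "AC0.9 LE1.0 DE1.0 DB1 SP1 RF1.0 PD0 FT1.0 SQ1.0",
    "AC0.1 LE1.0 DE0.4 DB0 SP1 RF1.0 PD0 FT0.1 SQ1.0",
    "AC0.2 LE1.0 DE0.4 DB0 SP1 RF1.0 PD0 FT1.0 SQ1.0",
    "AC0.0 LE1.0 DE0.3 DB0 SP1 RF1.0 PD0 FT1.0 SQ1.0",
    "AC0.1 LE1.0 DE0.5 DB0 SP1 RF0.0 PD0 FT0.1 SQ1.0",
    "AC0.0 LE1.0 DE0.4 DB0 SP1 RF0.5 PD0 FT1.0 SQ1.0",
    "AC0.0 LE1.0 DE0.2 DB0 SP1 RF1.0 PD0 FT1.0 SQ1.0",
]
TRANSACTIONS = [set(row.split()) for row in ROWS]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of the item-set."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def frequent_itemsets(transactions, size, min_support):
    """All item-sets of the given size with support >= min_support (brute force)."""
    items = sorted(set().union(*transactions))
    result = {}
    for combo in combinations(items, size):
        s = support(set(combo), transactions)
        if s >= min_support:
            result[combo] = s
    return result

print(support({"LE1.0", "PD0", "SQ1.0"}, TRANSACTIONS))
print(support({"DB0", "PD0", "SQ1.0"}, TRANSACTIONS))
```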
Dataset
Entrez (GenBank, GenPept, SwissProt, DDBJ, PIR, PDB)
• Query: scorpion AND (venom OR toxin)
– Scorpion venom dataset containing 520 records
– Expert annotation: 251 duplicate pairs
• Query: serpentes AND venom AND PLA2
– Snake PLA2 venom dataset containing 780 records
– Expert annotation: 444 duplicate pairs
695 duplicate pairs are collectively identified.
Results
Duplicates detected by association rules
[Figure: bar chart of FP% and FN% x 1000 for association Rules 1 to 7]
Rule 1. Identical sequences with the same sequence length and not originating from PDB are 99.7% likely to be duplicates.
Rule 2. Identical sequences with the same sequence length and of the same species are 97.1% likely to be duplicates.
Rule 3. Identical sequences with the same sequence length, of the same species and not originating from PDB are 96.8% likely to be duplicates.
Association rules
Rule 1: S(Seq)=1 ^ N(Seq Length)=1 ^ M(PDB)=0 (99.7%)
Rule 2: S(Seq)=1 ^ M(PDB)=0 ^ M(Species)=1 (97.1%)
Rule 3: S(Seq)=1 ^ N(Seq Length)=1 ^ M(Species)=1 ^ M(PDB)=0 (96.8%)
Rule 4: S(Seq)=1 ^ M(PDB)=0 ^ M(DB)=0 (93.1%)
Rule 5: S(Seq)=1 ^ N(Seq Length)=1 ^ M(PDB)=0 ^ M(DB)=0 (92.8%)
Rule 6: S(Seq)=1 ^ M(Species)=1 ^ M(PDB)=0 ^ M(DB)=0 (90.4%)
Rule 7: S(Seq)=1 ^ N(Seq Length)=1 ^ M(Species)=1 ^ M(PDB)=0 ^ M(DB)=0 (90.1%)
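Applying a rule to a candidate pair amounts to checking each of its conditions. The sketch below encodes Rule 1 as a dictionary and evaluates it on labelled pairs. The definitions used for FP% and FN% (false positives over non-duplicate pairs, false negatives over duplicate pairs) are an assumption, since the slides do not define them, and the example pairs are made up.

```python
# Rule 1 from the slide, written as required feature values.
RULE_1 = {"S(Seq)": 1, "N(Seq Length)": 1, "M(PDB)": 0}

def matches(rule, pair_scores):
    """True if the pair's similarity scores satisfy every condition of the rule."""
    return all(pair_scores.get(key) == value for key, value in rule.items())

def error_rates(rule, pairs):
    """Return (FP%, FN%) for pairs given as (scores, is_duplicate).

    Assumes at least one duplicate and one non-duplicate pair.
    """
    fp = sum(matches(rule, s) and not dup for s, dup in pairs)
    fn = sum((not matches(rule, s)) and dup for s, dup in pairs)
    negatives = sum(not dup for _, dup in pairs)
    positives = sum(dup for _, dup in pairs)
    return 100 * fp / negatives, 100 * fn / positives

# Hypothetical labelled pairs.
pairs = [
    ({"S(Seq)": 1, "N(Seq Length)": 1, "M(PDB)": 0}, True),
    ({"S(Seq)": 1, "N(Seq Length)": 1, "M(PDB)": 0}, False),
    ({"S(Seq)": 0, "N(Seq Length)": 1, "M(PDB)": 0}, False),
    ({"S(Seq)": 1, "N(Seq Length)": 1, "M(PDB)": 1}, True),
]
print(error_rates(RULE_1, pairs))
```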
Q&A / Wrap Up
References (I)
• H. Liu & L. Wong “Data mining tools for biological
sequences”, JBCB, 1:139-168, 2003
• J. Li et al., “Simple Rules Underlying Gene Expression
Profiles of More than Six Subtypes of Acute Lymphoblastic
Leukemia (ALL) Patients”, Bioinformatics, 19:71-78, 2003
• J. Koh et al., “A Classification of Biological Data
Artifacts”, DBiBD, 2005
References (II)
• HL Chan, TW Lam, WK Sung, Prudence WH Wong, SM
Yiu, and X Fan. “The mutated subsequence problem and
locating conserved genes”. Bioinformatics, 21(10):2271-2278, 2005
• Trinh ND Huynh, J Jansson, WK Sung, and NB Nguyen.
“Constructing a Smallest Refining Galled Phylogenetic
Network”. RECOMB, 2005, pages 265-280
• W Shen, WK Sung, N Sze. “DTSeq: Decision Tree based
De Novo peptide sequencing”. In preparation.
Acknowledgements
TIS Prediction: Huiqing Liu, Roland Yap, Fanfan Zeng
Treatment Optimization for Childhood ALL: James Downing, Huiqing Liu, Jinyan Li, Allen Yeoh
Mining Errors from Bio DB: Vladimir Brusic, Judice Koh, Mong Li Lee
Whole genome alignment: Tak-Wah Lam, Siu-Ming Yiu, Ho-Leung Chan, Prudence WH Wong
Phylogenetic network: Jansson Jesper, Trinh ND Huynh, Nguyen Bao Nguyen
Protein peptide sequencing: Shen Wei, Newman Sze