Knowledge Discovery in Biological Databases David Gilbert and Aik Choon Tan

Knowledge Discovery in
Biological Databases
David Gilbert and Aik Choon Tan
{drg,actan}@brc.dcs.gla.ac.uk
www.brc.dcs.gla.ac.uk
Bioinformatics Research Centre
University of Glasgow
Outline
• Introduction
• Motivation
• Overview of KDBD
• Machine Learning
– Domain representations
– Knowledge representations
– Search strategies
– Classification methods
– Evaluation & Interpretation
• Conclusion
Bioinformatics
•Bio - Molecular Biology
•Informatics - Computer Science
•Bioinformatics - the study of the application of
molecular biology, computer science, artificial
intelligence, statistics and mathematics
– to model, organise, understand and discover
interesting knowledge associated with the large
scale molecular biology databases,
– to guide assays for biological experiments.
• “Computational Biology” (USA)
Bioinformatics =
Machine Learning + Data Mining + Biological
Databases =
(?Knowledge Discovery in Databases?)
(?Knowledge Discovery in (Biological)Databases?)
Growth in Sequence Data
Growth in Structural Data
(Berman et al 2002)
Organisms
Physiology
Organs
Tissues
Cell signalling
Cell
Protein-protein interaction (pathways)
Protein functions
Protein Structures
Gene expressions
Nucleotide structures
Nucleotide sequences
Increasing complexity of Biological Data
Computational bottlenecks
Caused by
• Data characteristics
– Lots of it
– heterogeneous
– distributed
– incomplete
– dirty
• (Traditional) complexity issues: time, space
• Induction: constructing
discriminatory/descriptive functions from large
data sets
Computational bottlenecks
• Data representation
– sequences (DNA, RNA, amino-acid)
– trees (phylogenetic,…)
– graphs (protein structure, biochemical networks)
– matrices (micro-arrays, metabolic pathways)
Molecular biology overview
Biological activity: interaction!
Knowledge Discovery
Biology – A Classification Problem
• Biology - The division of physical science which
deals with organised beings or animals and plants,
their morphology, physiology, origin, and
distribution
(OED)
• Analysis via classification - steps:
–Organise examples into family
–Find common descriptions to characterise the family members
–Look for more members in the Universe
–If a new instance matches the characteristics of a family, infer
family properties to the new instance and add it as a member
Comparative
genomics
One aspect:
making inferences
(Eisenberg et al, 2000)
Data Explosion
Proteome
Genome
ggggctacgg ggggtggggc ttcgcgcccc gccggcctat aaaagcggcc gccgcggctc cgtgccgttg ccgaccttcg cctgcgccgc
tgctgcttcgcgcccgtcgc ctccgccatg gctcccagga agttcttcgt gggtggcaac tggaagatga acggcgacaa gaagagcttg
ggcgagctca tccacacgct gaatggcgcc aagctctcgg ccgacaccga ggtggtttgc ggagcccctt caatctacct tgattttgcc
cgccagaagc ttgatgcaaa gattggagtt gcagcacaaa actgttacaa ggtaccgaag ggtgctttca caggagagat cagcccagca
atgatcaaag atattggagc tgcatgggtg atcctgggcc actcagagcg gaggcatgtttttggagagt ctgatgagtt gattgggcag
aaggtggctc atgctcttgc tgaaggcctc ggtgtcatcg cctgcattgg ggagaagctg gatgagagag aagctggcat aacggagaag
gtggtctttg aacagaccaa agctattgct gataacgtga aggactggag taaggtggtt cttgcctatg agccagtttg ggctatcgga
actggtaaaa ctgctactcc ccaacaggct caggaggttc atgagaagct gagaggctgg ctcaaaagcc acgtgtctga tgctgttgct
cagtcaacta ggacgtcta tggaggttca gtcactggtg gcaactgtaa ggaactggcc tcccagcatg atgtggatgg cttccttgtt
ggtgggacgt ctctcaagcc agagtttgtg gatattatca atgcaaaaca ttaaagcagc ctgtgaggag cagtccctta cggttaagag
caagaaactg aagcaagaag ggaccttgtg ttgcacgtct ctcggtacag aggcttcttc tgaggctttc ccccaccacc acaattattg ttctagctgt
gctgctaacc cccaccacct tgttggagtc ccattagtgt gagcccatct cagcagagtc tcctttctga actggcaaaatccttggtta tctgttgagc
acgt
Data, information, knowledge …
• data : nucleotide sequence
• information : where are the “genes”.
[Figure: gene structure with labels: control statement, TATA box, start, gene, termination (stop)]
Found using classifier, pattern, rule which has been mined/discovered
• knowledge : facts and rules
If a gene X has a weak psi-blast assignment to a function F
–and that gene is in an expression cluster
–and sufficient members of that cluster are known to have function F,
⇒ then believe assignment of F to X.
Data, Information, Pattern, Knowledge
INFORMATION
Molecular Weight = 26528
Number of Residues = 247
Number of Alpha = 11
Number of Beta = 8
Content of Alpha = 43.32
Content of Beta = 17.00
PATTERN
[AV]-Y-E-P-[LIVM]-W-[SA]-I-G-T-[GK]
KNOWLEDGE
The DNA sequence encodes
an alpha-beta protein with
a barrel architecture.
The structure of the
protein is a TIM-barrel.
DATA
APRKFFVGGN WKMNGKRKSL GELIHTLDGA
KLSADTEVVC GAPSIYLDFARQKLDAKIGV
AAQNCYKVPK GAFTGEISPA MIKDIGAAWV
ILGHSERRHVFGESDELIGQ KVAHALAEGL
GVIACIGEKLDEREAGITEKVVFQETKAIADNVK
DWSKVVLAYEPVWAIGTGKTATPQQAQEVHE
KLRGWLKTHVSDAVAVQSRIIYGGSVTGGNCK
ELA SQHDVDGFLV GGASLKPEFV DIINAKH
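The link between the PATTERN and the DATA above can be checked mechanically. The following is a minimal sketch, assuming the pattern is read as a PROSITE-style expression and that the spaces in the DATA block are layout only; the variable names are illustrative and not part of the original slides.

```python
import re

# PATTERN from the slide, in PROSITE-style notation.
PROSITE_PATTERN = "[AV]-Y-E-P-[LIVM]-W-[SA]-I-G-T-[GK]"

# DATA from the slide, with layout whitespace removed (assumption: the
# spaces carry no meaning).
PROTEIN = (
    "APRKFFVGGNWKMNGKRKSLGELIHTLDGA"
    "KLSADTEVVCGAPSIYLDFARQKLDAKIGV"
    "AAQNCYKVPKGAFTGEISPAMIKDIGAAWV"
    "ILGHSERRHVFGESDELIGQKVAHALAEGL"
    "GVIACIGEKLDEREAGITEKVVFQETKAIADNVK"
    "DWSKVVLAYEPVWAIGTGKTATPQQAQEVHE"
    "KLRGWLKTHVSDAVAVQSRIIYGGSVTGGNCK"
    "ELASQHDVDGFLVGGASLKPEFVDIINAKH"
)

# This particular pattern has no x(n) wildcards, so removing the hyphens
# is enough to obtain an equivalent regular expression.
regex = re.compile(PROSITE_PATTERN.replace("-", ""))

match = regex.search(PROTEIN)
if match:
    # Report 1-based residue positions, as is usual for protein sequences.
    print(f"match at residues {match.start() + 1}-{match.end()}: {match.group()}")
else:
    print("no match")
```

The sequence length agrees with the "Number of Residues = 247" entry in the INFORMATION box, which gives some confidence that the data was transcribed consistently.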
An abstract view
• Given
{p:9, p:1, q:8, p:3, q:2, q:6, p:5, q:4, p:7, q:0}
• Cluster:
{p:9, p:1, p:3, p:5, p:7} {q:8, q:2, q:6, q:4, q:0}
• Background knowledge:
>, +, -
• Induce:
0 is q
X is q if X-2 is q and X > 0
X is p if not(X is q)
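The induced rules can be written down directly and checked against the given examples. This is a small sketch; the function names `is_q` and `label` are illustrative and not from the slides.

```python
def is_q(x: int) -> bool:
    """Induced rule: 0 is q; X is q if X-2 is q and X > 0."""
    if x == 0:
        return True
    if x > 0:
        return is_q(x - 2)
    return False  # for negative numbers neither clause applies


def label(x: int) -> str:
    """X is p if not(X is q)."""
    return "q" if is_q(x) else "p"


# The examples given on the slide, as (class, value) pairs.
GIVEN = [("p", 9), ("p", 1), ("q", 8), ("p", 3), ("q", 2),
         ("q", 6), ("p", 5), ("q", 4), ("p", 7), ("q", 0)]

for cls, value in GIVEN:
    print(value, "given:", cls, "induced:", label(value))
```

The rules generalise beyond the training set: `label(10)` is `q` and `label(11)` is `p`, although neither value appears among the given examples.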
What is a pattern?
Types of Pattern
• Deterministic
– is a boolean function which either matches a given object (i.e.
sequence, structure) or not
R-x-Y-[ST]
(e.g. regular expression for sequence pattern)
•Probabilistic
– assigns to each sequence the probability
that it was generated by the
model. The higher the probability,
the better the match between the
sequence and the pattern
(e.g. Profile for sequence pattern)
Position:  1  2  3  4  5  6  7  8  9  10
S1:        R  V  Q  R  A  Y  S  Y  V  N
S2:        P  L  M  R  A  Y  S  I  A  S
S3:        L  V  I  R  P  Y  T  P  V  S
S4:        L  C  M  R  A  Y  T  P  T  S
S5:        E  K  L  R  L  Y  S  I  A  S
Profile (observed frequency of each residue at each position):
1:  R=.2 P=.2 L=.4 E=.2
2:  V=.4 L=.2 C=.2 K=.2
3:  Q=.2 M=.4 I=.2 L=.2
4:  R=1
5:  A=.6 P=.2 L=.2
6:  Y=1
7:  S=.6 T=.4
8:  Y=.2 I=.4 P=.4
9:  V=.4 A=.4 T=.2
10: N=.2 S=.8
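A profile of this kind can be derived from the five aligned sequences S1 to S5 by counting residues column by column. The sketch below assumes the simplest scoring scheme, the product of the per-position frequencies; the function names are illustrative, and real profile methods typically add pseudocounts and log-odds scores, which are not shown here.

```python
from collections import Counter

# The five aligned sequences S1..S5 from the slide (positions 1-10).
ALIGNMENT = ["RVQRAYSYVN", "PLMRAYSIAS", "LVIRPYTPVS", "LCMRAYTPTS", "EKLRLYSIAS"]


def build_profile(seqs):
    """Return one {residue: frequency} dict per alignment column."""
    n = len(seqs)
    return [{aa: count / n for aa, count in Counter(column).items()}
            for column in zip(*seqs)]


def score(profile, seq):
    """Probability of seq under the profile (product of column frequencies)."""
    p = 1.0
    for column, aa in zip(profile, seq):
        p *= column.get(aa, 0.0)  # an unseen residue gives probability 0
    return p


profile = build_profile(ALIGNMENT)
for position, column in enumerate(profile, start=1):
    print(position, dict(sorted(column.items())))

# A sequence built from frequently observed residues scores higher than S1.
print(score(profile, "LVMRAYSIAS"), score(profile, ALIGNMENT[0]))
```

Because a single unobserved residue drives the product to zero, this plain version is only useful for illustration.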
Motifs
Motif : a pattern associated with some biological meaning (e.g. function)
1FDR:_  RVQRAYSYVNSP
1A8P:_  PLMRAYSIASPN
1NDH:_  LVIRPYTPVSSD
1CNF:_  LCMRAYTPTSMV
1B2R:A  EKLRLYSIASTR
1AMO:A  LQARYYSIASSS
FAD binding site
Sequence pattern
FAD ligand
RxY[ST]
Structural pattern
KDD in BIOINFORMATICS
Raw Data → Selection → Target Data → PreProcessing → PreProcessed Data → Transformation → Transformed Data → Machine Learning → Pattern → Interpretation/Evaluation → KNOWLEDGE!!
PreProcessed Data:
S1:ACAATG
S2:TCAACTATC
S3:ACACAGC
S4:AGAATC
S5:ACCGATC
Transformed Data:
S1:ACA---ATG
S2:TCAACTATC
S3:ACAC--AGC
S4:AGA---ATC
S5:ACCG--ATC
Characteristics of KDD
• Validity
• High-level Patterns/Languages understandable by humans
• Accuracy - measures of certainty
(probability)
• Interesting Results - novel, useful and
nontrivial to compute
• Efficiency - running times for large-sized
databases are predictable and acceptable
(Frawley et al., 1992)
Data preparation
• Select and identify target database
• Extract target data set
• Transform the target data set into the input format
of the learning algorithm
• Divide the target data set into groups (training, test
sets)
• Takes most of KDD process time
• Issues:
– Dealing with noisy data and missing attributes
– Filtering target data set (e.g. statistical analysis for gene
expression before performing clustering)
Machine learning tasks
(in bioinformatics as elsewhere…)
• Classification: predicting the class of an item
• Clustering: finding groups of items
• Characterisation: describing a group
• Deviation Detection: finding changes
• Linkage Analysis: finding relationships &
associations
• Visualisation: presenting data visually to facilitate
knowledge discovery by humans (human in the
loop)
Learning Approaches
• Unsupervised approach – given the
unassigned examples, group together the
examples with similar properties
• Supervised approach – given a set of positive
and negative examples with predefined
classes, construct classifiers that
distinguish between the classes
Issues
• Domain representation
• Knowledge representation
• Search strategy
• Classification method
Learning in bioinformatics context
• Automatically find pattern (given a training set)
• Characterisation: (positive examples only) patterns
describing “interesting” properties of a family
• Classification: (positive and negative examples) pattern
distinguishing S+ and S- .. Which may overlap...
• Formal language for descriptions (domain representation)
• Scoring function to rate descriptions (knowledge representation)
• Algorithm (search strategy and classification methods)
Protein comparison & motif discovery
Str comparison
Structure Prediction
Function Prediction
Str Classification
Str Motif Database
Str Database
Extract
features
Match
Str Description
Discover / Compare
Patterns
Eidhammer, Jonassen & Taylor,
“Structure Comparison and Structure
Patterns”, JCB, 7:5 pp 685-716, 2000.
Steps
• Pattern matching: input is 1 pattern & 1 structure; output is
“yes”/“no” (deterministic pattern) or score (probabilistic
pattern).
• Pattern discovery: find patterns matching some/all of
input structures (choose patterns with as high a fitness
value as possible with respect to the input structures)
• Comparison: input (pair of) structure descriptions, find
(local/global) similarities, optimise similarity measure,
output score.
– Similarity may be represented as a pattern
Pattern discovery in biosequences
• Group together sequences thought to have common
biological (structural, functional) properties
– families (biological - semantic level)
• Study their common syntactic properties ignoring
biological (semantic) properties
– patterns, clusters (mathematical - syntactic level)
• Test whether the discovered patterns make sense (back
to semantic level)
Approaches to pattern discovery
• Pattern driven:
enumerate all (or some) patterns up to
certain complexity (length), for each
calculate the score, and report the best
• Sequence driven:
look for patterns by aligning the given
sequences
Brazma et al, Approaches to the automatic discovery of patterns in
biosequences, Journal of Computational Biology, 5(2):277-303, 1998
Pattern driven algorithms
• Brute force - enumerate all patterns (for
instance, all substrings) up to a given length
(complexity)
• Evaluate their fitness with respect to the input
sequences and output the best
• Unrealistic for patterns of even modest size, even
for substring patterns (e.g., for substring patterns of length
10 over the amino acid alphabet, there are more than 10^13
different substrings to enumerate in this way)
• E.g. PRATT program (Jonassen, U.Bergen, via www.ebi.ac.uk)
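The brute-force, pattern-driven idea can be sketched in a few lines: enumerate every substring pattern up to a given length, score each against the input sequences, and report the best. This is an illustrative sketch only, not how any particular program works; the fitness measure (number of sequences containing the pattern, ties broken by length) and all function names are assumptions, and the input reuses the five short DNA sequences from the KDD process slide.

```python
from itertools import product

# Toy input: the sequences S1..S5 from the KDD process slide.
SEQUENCES = ["ACAATG", "TCAACTATC", "ACACAGC", "AGAATC", "ACCGATC"]


def enumerate_patterns(alphabet, max_len):
    """Yield every string over the alphabet with length 1..max_len."""
    for k in range(1, max_len + 1):
        for letters in product(alphabet, repeat=k):
            yield "".join(letters)


def support(pattern, seqs):
    """Assumed fitness: number of sequences containing the pattern."""
    return sum(pattern in s for s in seqs)


def best_patterns(seqs, alphabet, max_len, top=3):
    """Score all enumerated patterns; prefer high support, then length."""
    scored = [(support(p, seqs), len(p), p)
              for p in enumerate_patterns(alphabet, max_len)]
    scored.sort(key=lambda t: (-t[0], -t[1], t[2]))
    return scored[:top]


print(best_patterns(SEQUENCES, "ACGT", max_len=3))

# The search space explodes with alphabet size and pattern length:
print(20 ** 10)  # substrings of length 10 over the 20 amino acids
```

Over the DNA alphabet with a maximum length of 3 there are only 84 candidates, which is why the toy case runs instantly while the amino-acid case of length 10 does not.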
Sequence driven algorithms
• Group similar sequences together (e.g., in
pairs);
• For each group find a common pattern (e.g.,
by dynamic programming);
• Group similar patterns together and repeat
the previous step until there is only one
group left
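The grouping loop above can be illustrated with a deliberately simplified stand-in. In this sketch the "common pattern" of a pair is taken to be their longest common substring, computed by dynamic programming, and the most similar pair is the one with the longest such substring; both choices are assumptions made for brevity, and real tools use alignment-based patterns instead.

```python
from itertools import combinations


def longest_common_substring(a, b):
    """Dynamic programming: cur[j] = length of the common suffix of a[:i], b[:j]."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]


def sequence_driven(seqs):
    """Repeatedly merge the most similar pair into its common pattern."""
    groups = list(seqs)
    while len(groups) > 1:
        # Pick the pair whose common pattern is longest.
        i, j = max(combinations(range(len(groups)), 2),
                   key=lambda ij: len(longest_common_substring(groups[ij[0]],
                                                               groups[ij[1]])))
        pattern = longest_common_substring(groups[i], groups[j])
        # Replace the pair by its pattern and continue with the rest.
        groups = [g for k, g in enumerate(groups) if k not in (i, j)] + [pattern]
    return groups[0]


print(sequence_driven(["xxABCDyy", "zABCDz", "ABCDqq"]))
```

Whatever pattern survives to the end is, by construction, a substring of every input sequence, although it may be short or even empty when the sequences share little.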
Sequence driven approach
s1
s2
p1
p4
s3
p2
s4
s5
p3
Characteristic string function for
family F+
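function g : Σ* → {FALSE,TRUE}
$g(s) = \begin{cases} \text{TRUE} & \text{if } s \in F^{+} \\ \text{FALSE} & \text{if } s \in F^{-} \end{cases}$
[Figure: F+ and F- as subsets of Σ*]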
Classification & characterisation Problems
Classification: + and - examples
Characterisation: + examples
[Figure: Venn diagrams of the training sets S+ and S- against the families F+ and F- within Σ*, shown for clean training data and for noisy training data]
(Some) Performance Measurements
Sensitivity: $Sn = \frac{TP}{TP + FN}$, $0 \le Sn \le 1$
Specificity: $Sp = \frac{TN}{TN + FP}$, $0 \le Sp \le 1$
Positive Predicted Value: $PPV = \frac{TP}{TP + FP}$, $0 \le PPV \le 1$
Correlation Coefficient:
$cc = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(FP + TN)(TN + FN)(FN + TP)}}$, $-1 \le cc \le 1$
cc = 1.0: no FP or FN
cc = 0.0: when f is random with respect to S+ and S-
cc = -1.0: only FP and FN
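These measures follow directly from the four counts TP, TN, FP and FN. The sketch below assumes the correlation coefficient is the usual Matthews form with a square root in the denominator, and returns 0.0 when the denominator vanishes, which is a convention chosen here rather than something stated on the slide.

```python
from math import sqrt


def sensitivity(tp, fn):
    return tp / (tp + fn)


def specificity(tn, fp):
    return tn / (tn + fp)


def positive_predicted_value(tp, fp):
    return tp / (tp + fp)


def correlation_coefficient(tp, tn, fp, fn):
    denominator = sqrt((tp + fp) * (fp + tn) * (tn + fn) * (fn + tp))
    if denominator == 0:
        return 0.0  # convention assumed here for degenerate confusion matrices
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts, for illustration only.
tp, tn, fp, fn = 40, 45, 5, 10
print("Sn  =", sensitivity(tp, fn))
print("Sp  =", specificity(tn, fp))
print("PPV =", positive_predicted_value(tp, fp))
print("cc  =", correlation_coefficient(tp, tn, fp, fn))
```

The three reference points for cc can be reproduced with perfect, balanced-random and fully inverted predictions.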
Knowledge Representation
“If the predictive accuracies of two hypotheses are statistically
equivalent then the hypothesis with better explanatory
power will be preferred.
Otherwise the one with higher accuracy will be preferred.”
(Muggleton et al., 1998)
Input
Learner
Classifier
•High accuracy
•High explanatory power
Biological Sequences
-nucleotide sequences
-protein sequences
Domain representation –Example 1
[Figure: zinc finger sequence drawn as a loop of residues x, with two C and two H residues coordinating a Zn ion]
C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H
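A sequence pattern in this notation can be translated into a regular expression mechanically. The converter below is a sketch that handles only a subset of the PROSITE syntax (single residues, `[...]` sets, `{...}` exclusions, `x`, and `(n)` or `(m,n)` repeats); anchors and other features are not covered, and the function name is illustrative.

```python
import re

# One pattern element: a residue, a set, an exclusion or x, plus an
# optional (n) or (m,n) repeat count.
ELEMENT = re.compile(r"([A-Zx]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    parts = []
    for element in pattern.split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = m.groups()
        if core == "x":
            core = "."                      # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # any residue except these
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{" + low + "}"
        else:
            repeat = "{" + low + "," + high + "}"
        parts.append(core + repeat)
    return "".join(parts)


ZINC_FINGER = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
print(prosite_to_regex(ZINC_FINGER))

# The shorter FAD-binding pattern from the Motifs slide, applied to one
# of the sequences listed there.
print(re.search(prosite_to_regex("R-x-Y-[ST]"), "RVQRAYSYVNSP").group())
```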
Edit distance
• Levenshtein 1966
• Minimum number of edit operations to
transform 1 string into another
– insert, delete, substitute (1 symbol)
• Score is zero (identical) or positive
• E.g “AIMS” & “AMOS” (score=2 for each solution)
AIM-S        AIMS
A-MOS        AMOS
The possibilities?
AIM-S
| | |
A-MOS
Which is better?
AIMS
|  |
AMOS
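The edit distance itself is usually computed with dynamic programming. Here is a minimal sketch with unit cost for insert, delete and substitute, keeping only two rows of the table; the function name is illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit costs for insert, delete, substitute."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


print(edit_distance("AIMS", "AMOS"))
```

The distance alone does not say which of the two alignments is better: both reach the same score, one through a deletion and an insertion, the other through two substitutions.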
Multiple alignments
• Analyse gene families
– reveal (subtle) conserved family characteristics
[Table: multiple alignment of sequences S1 to S5 over character positions 1 to 10, with consensus row: y d G G AI VL V e A l]
Multiple alignment - methods
• Simultaneous: N-wise alignment (adapted from pairwise approach)
– uses N-dimension matrix.
– Complexity is
• O(m1·m2) [2 sequences of length m1 & m2]
• O(m^n) [n sequences of length m]
– Thus only good for short sequences.
• Manual (!)
• Progressive (heuristic) e.g. ClustalW:
– compute pairwise sequence identities
– construct binary tree (can output phylogenetic tree)
– align similar sequences in pairs, add distantly related ones later.
[Figure: guide tree joining sequences s1 to s5 through intermediate alignments a1 to a4]
Multiple sequence alignment (globins)
CLUSTAL W (1.81) multiple sequence alignment

Human     VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV 60
Gorilla   VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV 60
Rabbit    VHLSSEEKSAVTALWGKVNVEEVGGEALGRLLVVYPWTQRFFESFGDLSSANAVMNNPKV 60
Pig       VHLSAEEKEAVLGLWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSNADAVMGNPKV 60
          ***:.***.** .*******:****************************..:***.****

Human     KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK 120
Gorilla   KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFKLLGNVLVCVLAHHFGK 120
Rabbit    KAHGKKVLAAFSEGLSHLDNLKGTFAKLSELHCDKLHVDPENFRLLGNVLVIVLSHHFGK 120
Pig       KAHGKKVLQSFSDGLKHLDNLKGTFAKLSELHCDQLHVDPENFRLLGNVIVVVLARRLGH 120
          ******** :**:** **********.*******:********:*****:* **::::*:

Human     EFTPPVQAAYQKVVAGVANALAHKYH 146
Gorilla   EFTPPVQAAYQKVVAGVANALAHKYH 146
Rabbit    EFTPQVQAAYQKVVAGVANALAHKYH 146
Pig       DFNPNVQAAFQKVVAGVANALAHKYH 146
          :*.* ****:****************
sequence alignments
& phylogenetic trees
Pair             Score
Human-Gorilla    99
Human-Rabbit     90
Gorilla-Rabbit   89
Human-Pig        84
Gorilla-Pig      84
Rabbit-Pig       83
((Human:0.00000,Gorilla:0.00685):0.04110,Rabbit:0.05479,Pig:0.10959);
What can we do with multiple alignments?
• Create (databases of) profiles derived from multiple
alignments for protein families
– profile = multiple alignment + observed character
frequencies at each position
• Search with a sequence against a database of profiles
(e.g. PROSITE database)
– faster than sequence against sequence
– gives a more general result (“the input sequence matches
globin profile”)
• Search with a profile against a database of sequences
– PSI-BLAST : can identify more distant relationships
than by normal BLAST search
PSI-BLAST (position specific iterated BLAST)
Single protein sequence → Search database (BLAST) → Estimate statistical significance of local alignments → Multiple alignment → Profile → search database again with the profile (iterate until convergence)
Protein structure
Protein structure - levels
PRIMARY STRUCTURE (amino acid sequence)
VHLTPEEKSAVTALWGKVNVD
EVGGEALGRLLVVYPWTQRFF
ESFGDLSTPDAVMGNPKVKAH
GKKVLGAFSDGLAHLDNLKGTF
ATLSELHCDKLHVDPENFRLLG
NVLVCVLAHHFGKEFTPPVQAA
YQKVVAGVANALAHKYH
SECONDARY STRUCTURE (helices, strands)
TERTIARY STRUCTURE (fold)
QUATERNARY STRUCTURE
TOPS
Simplified descriptions of protein 3D
structures and their use in searching
and structural pattern recognition
Domain representation –Example 2
(TOPS Approach)
TOPS Example – 2bopA0
chirality
H-bond
2bopA0
α-helix
loop
β-strand
• Several
examples, with
common parts
highlighted
What is a pattern?
• A common description
Number of
insert SSEs
Plait motif
(0,N)
(0,N)
(0,N)
(0,N)
Correspondences
1,2,4,6,7,8
1,3,4,6,7,8
(0,N)
Pattern matching
2bop
(0,N)
Plait
motif
Alternative
matches
1,2,4,6,7,8
2bopA0
1,3,4,6,7,8
Plait
Discovering common patterns
and making multiple alignments
Pattern
P
P
P
Compression:
Send the pattern
once, and then for
each domain, send
the uncovered parts
Domain 1
Domain 2
Domain 3
Topological description
• Consider sequence of SSEs (strand, helices), plus spatial
adjacency within fold & approximate orientation
• Neglect details (lengths & structures of loops, exact lengths &
spatial orientations of SSEs, sequence information...)
√ simplicity
– implement very fast comparison algorithms, machine learning, ...
– detect distant structural relationships
X simplicity
– relate structures topologically which may have no meaningful biological
relationship.
Enhanced TOPS
TOPS
Sequence
Information
Pattern-Discovery/
Matching Algorithms
Biochemical
Features
TOPS+
&
Scoring Functions
Structure Comparison
Algorithm
TOPS+ BASED – PSSM/HMM PROFILES
Structural and Functional
Assignment
(Veeramalai, 2002)
Tops + Sequence with Biochemical features
Functional information
DNA
Ligand
DNA binding-site
Ligand binding-site
(Veeramalai, 2002)
Feature Extraction
A
E
(Veeramalai, 2002)
B
C
PSSM/HMM Profiles & Scoring Function
Key
Structure-Based
Sequence
&
Function
Extraction
For Protein
Domains
→ Ligand interaction
S1
→ Ligand interaction in loop
→ Ligand interacting aa’s
→ Seq segment of a helix
→ Seq segment of a strand
→ Seq segment of a loop
S2
Etc.,
Sn
S1
TOPS-BASED
Multiple Sequence
Alignment
Profile
Generation
S2
Sn
SAM/HMMER
IMPALA
HMM Profile PSSM Profile
Scoring
Function
(Veeramalai, 2002)
1vpt00
Methyltransferase
Superfamily
2admA1
1vid00
Key to TOPS
1xvaA2
Ligand binding site
in α-helix residues
Ligand binding site
in β-strand residues
1hmy01
Ligand binding site
in loop-region residues
Ion-binding site between
SSEs & loop regions
Conserved Structural Pattern
Structural/Functional
(TOPS) Pattern
(Veeramalai, 2002)
Comparing structures - NADP binding domains
dihydropteridine
reductase
homo sapiens
homo sapiens
rat
dihydrofolate
reductase
E.Coli
Dendrogram from pairwise comparisons & hierarchical clustering
Dihydropteridine reductase (human)
Dihydropteridine reductase (rat)
Lactate dehydrogenase (pig)
Lactate dehydrogenase (bacterial)
Malate dehydrogenase (pig)
Malate dehydrogenase (bacterial)
Quinone oxido-reductase (bacteria)
Alcohol dehydrogenase (human)
D-3-phosphoglycerate dehydrogenase (bacteria)
NADH peroxidase (bacteria)
D-glyceraldehyde-3-phosphate dehydrogenase
Dihydrofolate reductase (bacterial)
Dihydrofolate reductase (human)
NADH peroxidase (bacteria)
NAD comparisons
Sequence
Structure
(atomic coordinates)
Structure
(topology)
Hierarchical Machine Learning
•Integrate various machine learning techniques
•Incorporate patterns induced from different sources
•Produce user readable hypotheses
Gene Expression
Gene - informatics??
Phylogenetic
Inferences
Connectors To
Other Maps
Metabolic
Profiles
Cofactors &
Metabolites
Sequence Homologs
In Other Genomes
Metabolic Map Locator
Sequence
Functional
Chemistry
Gene X
Experimental
Data
Genome Location
Structure
Expression Info
Raw
Images
Numerical
Values
(Adapted from Gibas & Jambeck, 2001)
Cluster
Genes
Raw
Data
Electron
Density
Structure
Annotation
SS
Assignment
Gene expression
Pre-genomics era
p1
g1
p2
One gene = One gene product = One behaviour
Post-genomics era
p1’’
p1’
g1
g2
g3
p1
p2
p3
p4
Many genes = Many gene products = Many behaviours
Microarray experiment
Spotting the arrays
RED = Present (P) = highly expressed, detected by the detector
YELLOW = Marginal (M) = expressed, “not sure” for the detector
GREEN = Absent (A) = maybe expressed, not detected by the detector
Classification
Problem
(Golub et al 1999)
ALL = acute lymphoblastic leukemia
(lymphoid precursors)
AML = acute myeloid leukemia
(myeloid precursor)
Characterisation Problem
(Stuart et al 2001)
Temporal gene expression profiles during kidney development. Data are expressed as the mean at each time for clusters of genes as defined by k-means clustering (1-5). The distribution of individual profiles is also shown for the most heterogeneous group (2, all). 13, 15, 17, 19, embryonic days; N, newborn; W, 1 week old; A, adult.
Characterisation Problem
(Stuart et al 2001)
Functional associations of gene clusters. Gene clusters varied remarkably in terms of major functional classifications of component genes.
Group 1 expressed earlier in nephrogenesis was most notable for genes involved in DNA replication (D), RNA production (R), protein synthesis
(P), and morphogenesis (M), consistent with an actively proliferating tissue.
Group 2 (which peaked in midnephrogenesis) was most notable for genes of the extracellular matrix (E) as well as morphogenetic genes (M).
Group 3 (with a peak in neonatal life) was dominated by retrotransposon transcripts (RT).
Group 4 was most notable for transport (T) and energy metabolism (EN) related genes.
Group 5 genes (significantly up-regulated in the adult vs. all previous times) was more heterogeneous and included genes specifying catabolic
enzymes (C), defense and immune recognition (DE), homeostasis of the organism as a whole (H), detoxification (DT), oxidative stress (RD), and
transport (T).
Gene expression matrix
Rows = gene expression profiles
Columns = Different conditions/time points
Each replicate column gives Signal (intensity) followed by the Detection call.
Genes                        A1 TSu74a  A2 TSu74a  A3 V10_    B1 V12-A_  B2 V12-B_  B3 V12-C_  C1 P1-A_
AFFX-b-ActinMur/M12481_3_st  26.1 A     29.7 A     7.7 A      13.2 A     11.4 A     43.7 A     15.1
AFFX-YEL002c/WBP1_at         1.3 A      6.2 A      2.5 A      4.7 A      2.7 A      1.3 A      7
AFFX-YEL018w/_at             6.1 A      0.6 A      1.8 A      3.1 A      2.1 A      1.4 A      0.6
AFFX-YEL024w/RIP1_at         11.9 A     7.2 A      2.7 A      10.4 A     2.4 A      8.2 A      7.6
AFFX-YEL021w/URA3_at         11.8 A     6.6 A      2.6 A      12.4 A     6.9 A      7.6 A      6
92539_at                     2475.9 P   2091.3 P   1391.6 P   1407.9 P   1947.2 P   1572.9 P   1999.6
92540_f_at                   96.9 P     77.4 P     138.7 P    144.8 P    122.6 P    126.6 P    128.8
92541_at                     863.2 P    1920.6 P   1248.1 P   1384.9 P   268 P      352.3 P    856.4
92542_at                     702.4 P    868.3 P    558.4 P    613.1 P    631.8 P    602.1 P    548.3
92543_at                     56.7 P     56.7 P     75.5 P     61.6 P     72.5 P     76.6 P     56.2
Clustering Gene Expression Data
• A clustering problem consists of elements & a
characteristic vector for each element
• A measure of similarity is defined between pairs of such
vectors
• Elements = genes
• Vector = expression level of each gene
• Goal: Partition the elements into subsets (clusters) which
satisfy:
– Homogeneity: elements in the same cluster are highly
similar to each other
– Separation: elements from different clusters have low
similarity to each other
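The two criteria can be made concrete once a similarity measure is fixed. The sketch below assumes Pearson correlation between expression vectors as the similarity, and defines homogeneity and separation as mean similarities within and between clusters; the gene names and expression values are hypothetical, and the function names are illustrative.

```python
from itertools import combinations
from math import sqrt


def pearson(u, v):
    """Pearson correlation between two expression vectors (non-constant)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)


def homogeneity(clusters, profiles):
    """Mean similarity over pairs of genes in the same cluster."""
    sims = [pearson(profiles[a], profiles[b])
            for cluster in clusters for a, b in combinations(cluster, 2)]
    return sum(sims) / len(sims)


def separation(clusters, profiles):
    """Mean similarity over pairs of genes from different clusters."""
    sims = [pearson(profiles[a], profiles[b])
            for c1, c2 in combinations(clusters, 2) for a in c1 for b in c2]
    return sum(sims) / len(sims)


# Hypothetical expression levels over four conditions.
PROFILES = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.5],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [8.0, 6.0, 4.0, 1.5],
}
CLUSTERS = [["g1", "g2"], ["g3", "g4"]]

print("homogeneity:", homogeneity(CLUSTERS, PROFILES))
print("separation: ", separation(CLUSTERS, PROFILES))
```

A good partition has high homogeneity and low separation under this convention; a gene with a constant profile would need special handling, since its correlation is undefined.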
Hierarchical clustering
Different experimental conditions/time points
Genes
k-means clustering
Genes related with ‘casein’ in
mammary gland tissues
Lactation
Linking gene expression data with
morphological information
stage(A, pregnancy) :-
gene_id(A, g1),
gene_id(A, g2),…,
has_ducts(A, medium),
fat_pad(A, medium).
47% Molecular Function
38% Biological Process
15% Cellular Component
Challenges of
KDBD
Goal
“All possible data”
(in the universe)
Hypotheses
Current
Data
Learning in “Dirty” Biological
Databases
• Experimental errors
• Wrong interpretation by biologists
• Human error during annotation process
• Non standardised techniques
• Biased data
Expressive Capacity
Hypothesis for Glutathione reductase (GR) Family
Class(‘GR’,A) :-
protein(A,B,C,D),
Structure(C,bbasandwich),
Has_seq(strand1_helix1_strand2,B,C),
Function(D,oxidoreductases).
If the protein has sequence motif
GxG(x)2G(x)16-19[DE]
in β1-α1-β2 of the 3-layer β-β-α
sandwich structure and carries
out an oxidoreductase reaction, then
it belongs to the GR family.
GxG(x)2G(x)16-19[DE]
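The hypothesis combines sequence, structure and function conditions in a single rule, and the same conjunction can be written as an ordinary predicate. The sketch below is an illustration under several assumptions: the motif is read as G-x-G-x(2)-G-x(16,19)-[DE], and the `Protein` record with its `regions` mapping is a hypothetical encoding, not the representation used by the authors.

```python
import re
from dataclasses import dataclass, field

# Assumed reading of GxG(x)2G(x)16-19[DE] as a regular expression.
GR_MOTIF = re.compile(r"G.G.{2}G.{16,19}[DE]")


@dataclass
class Protein:
    sequence: str
    structure: str   # e.g. "bbasandwich"
    function: str    # e.g. "oxidoreductases"
    # Hypothetical encoding: region name -> (start, end), 0-based, end exclusive.
    regions: dict = field(default_factory=dict)


def is_gr(protein: Protein) -> bool:
    """All conditions of the rule must hold for the protein to be classed as GR."""
    if protein.structure != "bbasandwich":
        return False
    if protein.function != "oxidoreductases":
        return False
    span = protein.regions.get("strand1_helix1_strand2")
    if span is None:
        return False
    start, end = span
    # The motif must lie entirely inside the strand1-helix1-strand2 region.
    return GR_MOTIF.search(protein.sequence, start, end) is not None


# Synthetic example sequence, built only to exercise the rule.
sequence = "MK" + "GAGAAG" + "A" * 16 + "D" + "KK"
example = Protein(sequence, "bbasandwich", "oxidoreductases",
                  {"strand1_helix1_strand2": (2, 25)})
print(is_gr(example))
```

Writing the rule this way keeps it readable, which is the point of the slide: each condition can be inspected and challenged separately.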
Single Vs Multiple Methods
• Advantage - complement each other
• Increase expressive power - discover useful
& understandable knowledge
• Difficult to combine - lack of coherence
Open Question?
“All data”
Training
Set
Current data
(continues to expand)
Hypotheses
Conclusion
ggggctacgg
ccgaccttcg
gggtggcaac
aagctctcgg
ttgatgcaaa
ggggtggggc
cctgcgccgc
tggaagatga
ccgacaccga
gattggagtt
ttcgcgcccc gccggcctat aaaagcggcc gccgcggctc cgtgccgttg
tgctgcttcgcgcccgtcgc ctccgccatg gctcccagga agttcttcgt
acggcgacaa gaagagcttg ggcgagctca tccacacgct gaatggcgcc
ggtggtttgc ggagcccctt caatctacct tgattttgcc cgccagaagc
gcagcacaaa actgttacaa ggtaccgaag ggtgctttca caggagagat
Acknowledgements
• Gilleain Torrance, Mallika Veeramalai, Olivier Sand,
Ali Al-Shahib (Bioinformatics Research Centre,
University of Glasgow)
• David Westhead, (EBI), Ioannis Michalopoulos,
Leeds University
• Janet Thornton, UCL, Birkbeck, EBI
• Lorenz Wernisch, Birkbeck
• Juris Viksna, University of Latvia
• Inge Jonassen, Ingvar Eidhammer, U.Bergen
• Alvis Brazma, EBI
Bioinformatics Research Centre
• Provides an environment for collaborative
interdisciplinary research in Bioinformatics.
• Hosts researchers from
– Department of Computing Science
– Institute of Biomedical and Life Sciences.
• Physically located in the Institute of Biomedical and
Life Sciences (Spring 2003)
• Strong links with
– Sir Henry Wellcome Functional Genomics Facility.
– Statistical Bioinformatics
– Mathematical Biology
• Outreach programme (visitors etc)
The Scottish Bioinformatics Forum (SBF)
• Network of Bioinformatics researchers and industries in
Scotland
• A vehicle for developing Scotland as a Centre of
Bioinformatics Excellence
• Nodes in Glasgow, Edinburgh, Dundee, Aberdeen, ...
• Promoting collaborative research
• Development of a Bioinformatics educational programme
• www.sbforum.org, sbforum-general@sbforum.org
Contacts
{actan,drg}@brc.dcs.gla.ac.uk
Bioinformatics Research Centre
Department of Computing Science
University of Glasgow
http://www.brc.dcs.gla.ac.uk