Knowledge Discovery from
Biological and Clinical Data:
What is Datamining?
Jonathan’s blocks
Jessica’s blocks
Whose block
is this?
Jonathan’s rules: Blue or Circle
Jessica’s rules: All the rest
What is Datamining?
Question: Can you explain how?
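
One way to make "explain how" concrete is to search for the shortest rule that separates the two owners' blocks. The sketch below is only an illustration: the labelled blocks are hypothetical (the slides show pictures, not a table), and the brute-force search over OR-rules is an assumed technique, not the method used in the talk.

```python
from itertools import combinations

# Hypothetical labelled blocks (colour, shape, owner); not data from the slides.
BLOCKS = [
    ("blue", "square", "Jonathan"),
    ("blue", "triangle", "Jonathan"),
    ("red", "circle", "Jonathan"),
    ("green", "circle", "Jonathan"),
    ("blue", "circle", "Jonathan"),
    ("red", "square", "Jessica"),
    ("green", "triangle", "Jessica"),
    ("red", "triangle", "Jessica"),
    ("green", "square", "Jessica"),
]


def candidate_conditions(blocks):
    """Collect every (attribute index, value) test seen in the data."""
    conds = set()
    for colour, shape, _owner in blocks:
        conds.add((0, colour))
        conds.add((1, shape))
    return sorted(conds)


def matches(block, rule):
    """A rule is a tuple of conditions joined by OR."""
    return any(block[attr] == value for attr, value in rule)


def find_rules(blocks, owner, max_terms=2):
    """Return the shortest OR-rules that select exactly the owner's blocks."""
    conds = candidate_conditions(blocks)
    for n_terms in range(1, max_terms + 1):
        found = [
            rule
            for rule in combinations(conds, n_terms)
            if all(matches(b, rule) == (b[2] == owner) for b in blocks)
        ]
        if found:
            return found
    return []


rules = find_rules(BLOCKS, "Jonathan")
print(rules)
```

On this toy data the search recovers "Blue or Circle" for Jonathan; anything the rule does not match falls to Jessica, which mirrors "All the rest".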
Driving Forces: Genes, Proteins,
Interactions, Diagnosis, & Cures
• Complete genomes
are now available
• Knowing the genes is
not enough to
understand how
biology functions
• Proteins, not genes,
are responsible for
many cellular activities
• Proteins function by
interacting with other
proteins and
If we figure out how these work,
we get these Benefits
To the patient:
Better drug, better treatment
To the pharma:
Save time, save cost, make more $
To the scientist:
Better science
To figure these out,
we bet on...
“solution” =
Data Mgmt + Knowledge Discovery
Data Mgmt =
Integration + Transformation + Cleansing
Knowledge Discovery =
Statistics + Algorithms + Databases
Predict Epitopes,
Find Vaccine Targets
• Vaccines are often the
only solution for viral
• Finding & developing
effective vaccine targets
(epitopes) is a slow and
expensive process
• Develop systems to recognize
protein peptides that bind
MHC molecules
• Develop systems to recognize
hot spots in viral antigens
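
A minimal sketch of what "recognizing peptides that bind MHC molecules" might look like computationally: slide a 9-residue window along a protein and score each peptide against a position-specific matrix. Everything here is assumed for illustration: the matrix values are invented, the threshold is arbitrary, and the systems in the talk may use quite different models.

```python
# Toy position-specific scores for a 9-mer: {position (0-based): {residue: score}}.
# The values are invented for illustration, not measured binding data.
TOY_MATRIX = {
    1: {"L": 2.0, "M": 1.5},
    8: {"V": 2.0, "L": 1.5},
}
PEPTIDE_LEN = 9


def score_peptide(peptide, matrix=TOY_MATRIX):
    """Sum the per-position scores; residues not listed score 0."""
    return sum(matrix.get(i, {}).get(res, 0.0) for i, res in enumerate(peptide))


def scan_protein(sequence, threshold=3.0, length=PEPTIDE_LEN):
    """Return (start, peptide, score) for every window scoring >= threshold."""
    hits = []
    for start in range(len(sequence) - length + 1):
        peptide = sequence[start:start + length]
        score = score_peptide(peptide)
        if score >= threshold:
            hits.append((start, peptide, score))
    return hits


# Hypothetical antigen fragment, chosen so that exactly one window passes.
hits = scan_protein("AAGLTTTTTTVKK")
print(hits)
```

Runs of overlapping high-scoring windows along an antigen would be one crude way to flag candidate hot spots.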
Recognize Functional Sites,
Help Scientists
• Effective recognition of
initiation, control, and
termination of biological
processes is crucial to
speeding up and focusing
scientific experiments
• Data mining of bio seqs to
find rules for recognizing
& understanding
functional sites
Dragon’s 10x
reduction of
TSS recognition
false positives
Diagnose Leukaemia,
Benefit Children
• Childhood leukaemia is a
heterogeneous disease
• Treatment is based on subtype
• 3 different tests and 4 different
experts are needed for diagnosis
– Curable in USA,
fatal in Indonesia
• A single platform diagnosis
based on gene expression
• Data mining to discover
rules that are easy for
doctors to understand
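
To illustrate what an easy-to-understand rule over gene expression could look like, here is a sketch that searches for the single best "if gene > cut then subtype" rule. The gene names, subtype labels, and expression values are hypothetical, and a one-gene threshold search is a stand-in technique, not the rule-discovery method of the talk.

```python
def best_single_gene_rule(samples, labels):
    """Find the rule 'gene > cut => high_class' with the highest accuracy.

    samples: list of {gene name: expression value}; labels: list of class names.
    Returns (gene, cut, high_class, accuracy).
    """
    classes = sorted(set(labels))
    best = None
    for gene in samples[0]:
        values = sorted({s[gene] for s in samples})
        # Candidate cut points: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            cut = (lo + hi) / 2
            for high_class in classes:
                correct = sum(
                    (s[gene] > cut) == (y == high_class)
                    for s, y in zip(samples, labels)
                )
                accuracy = correct / len(samples)
                if best is None or accuracy > best[3]:
                    best = (gene, cut, high_class, accuracy)
    return best


# Hypothetical expression profiles; not patient data.
samples = [
    {"gene_A": 2.1, "gene_B": 7.9},
    {"gene_A": 1.8, "gene_B": 3.0},
    {"gene_A": 2.5, "gene_B": 5.5},
    {"gene_A": 6.9, "gene_B": 4.1},
    {"gene_A": 7.4, "gene_B": 8.2},
    {"gene_A": 6.2, "gene_B": 2.7},
]
labels = ["subtype_1"] * 3 + ["subtype_2"] * 3

rule = best_single_gene_rule(samples, labels)
gene, cut, high_class, accuracy = rule
print(f"if {gene} > {cut:.2f} then {high_class} (training accuracy {accuracy:.0%})")
```

A rule of this shape can be read directly by a clinician, which is the point of the bullet above; real expression data would need many genes, held-out evaluation, and more than one rule.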
Understand Proteins,
Fight Diseases
• Understanding function and role
of protein needs organised info
on interaction pathways
• Such info is often reported in
scientific papers but is seldom
found in structured databases
• Knowledge extraction
system to process free text
• extract protein names
• extract interactions
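
The two extraction steps above can be sketched with a small lexicon and a regular expression. This is only a toy: the protein lexicon, the verb list, and the example sentences are assumptions made for illustration, and a real knowledge extraction system would need far more robust name recognition and parsing.

```python
import re

# Hypothetical lexicon and interaction verbs, chosen for the example only.
PROTEIN_NAMES = ["p53", "MDM2", "BRCA1", "RAD51"]
INTERACTION_VERBS = ["binds", "interacts with", "phosphorylates", "inhibits"]

NAME_PAT = "|".join(re.escape(n) for n in PROTEIN_NAMES)
VERB_PAT = "|".join(re.escape(v) for v in INTERACTION_VERBS)

# Matches "<protein> <verb> <protein>" with the names as whole words.
INTERACTION_RE = re.compile(
    rf"\b({NAME_PAT})\b\s+({VERB_PAT})\s+\b({NAME_PAT})\b"
)


def extract_protein_names(text):
    """Return the distinct lexicon names mentioned in the text."""
    return sorted(set(re.findall(rf"\b(?:{NAME_PAT})\b", text)))


def extract_interactions(text):
    """Return (protein, verb, protein) triples found in the text."""
    return [m.groups() for m in INTERACTION_RE.finditer(text)]


# Made-up example sentences standing in for text from a paper.
text = (
    "We show that MDM2 binds p53 in vitro. "
    "BRCA1 interacts with RAD51 during repair."
)
names = extract_protein_names(text)
interactions = extract_interactions(text)
print(names)
print(interactions)
```

The triples could then be loaded into a structured database, which is the gap the slide points out.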
Direction & Plan
• Objectives
– Translate inspiration
from biological
systems into
advancement of life
and computing
– Advance data mining
technologies in
decision systems for
complex problems
• To work on practical
systems for
– data mining
– data cleansing
– knowledge extraction
• Applied to
gene regulation
protein interaction
clinical data analysis
E.g., How to Get More Out of the
Same Experiments?
• How to recognize false positives from two-hybrid and
other types of high-throughput protein interaction experiments?
• Some initial thoughts:
It seems that configuration
a is less likely than b. Can
we exploit this?
E.g., How to Improve Classifier
• SVM, ANN, etc.
– Good accuracy,
– but not easy to understand
• C4.5, CART, etc.
– Clear rules,
– but lower accuracy
• Why can’t we have a classifier algorithm that
– handles high dimension
– achieves high accuracy
– provides understandable rules?
Who will you be working with...
Vladimir Bajic
I²R
Vladimir Brusic
See-Kiong Ng
Jinyan Li
Limsoon Wong
Wynne Hsu
Mong Li Lee
Ken Sung