Knowledge Discovery from
Biological and Clinical Data:
BASIC BACKGROUND
What is Data Mining?
Jonathan’s blocks
Jessica’s blocks
Whose block is this?
Jonathan’s rules: Blue or Circle
Jessica’s rules: All the rest
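The two rules above can be written down as a tiny classifier. This is only an illustrative sketch: the function name `owner` and the sample blocks are hypothetical, and the colours and shapes other than "blue" and "circle" are made up for the example.

```python
def owner(colour: str, shape: str) -> str:
    """Apply the two rules: blue or circle -> Jonathan, everything else -> Jessica."""
    if colour == "blue" or shape == "circle":
        return "Jonathan"
    return "Jessica"


# Hypothetical blocks, for illustration only.
blocks = [("blue", "square"), ("red", "circle"), ("red", "square")]

for colour, shape in blocks:
    print(f"{colour} {shape} -> {owner(colour, shape)}")
```

Data mining is the reverse task: given only labelled blocks, recover rules like these.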
What is Data Mining?
Question: Can you explain how?
Knowledge Discovery from
Biological and Clinical Data:
MOTIVATION
Driving Forces: Genes, Proteins,
Interactions, Diagnosis, & Cures
• Complete genomes
are now available
• Knowing the genes is
not enough to
understand how
biology functions
• Proteins, not genes,
are responsible for
many cellular activities
• Proteins function by
interacting with other
proteins and
biomolecules
INTERACTOME
GENOME
PROTEOME
If we figure out how these work,
we get these benefits
To the patient:
Better drug, better treatment
To the pharma:
Save time, save cost, make more $
To the scientist:
Better science
To figure these out,
we bet on...
“solution” =
Data Mgmt + Knowledge Discovery
Data Mgmt =
Integration + Transformation + Cleansing
Knowledge Discovery =
Statistics + Algorithms + Databases
Knowledge Discovery from
Biological and Clinical Data:
ACCOMPLISHMENT
Predict Epitopes,
Find Vaccine Targets
• Vaccines are often the
only solution for viral
diseases
• Finding & developing
effective vaccine targets
(epitopes) is a slow and
expensive process
• Develop systems to recognize
protein peptides that bind
MHC molecules
• Develop systems to recognize
hot spots in viral antigens
Recognize Functional Sites,
Help Scientists
• Effective recognition of
initiation, control, and
termination of biological
processes is crucial to
speeding up and focusing
scientific experiments
• Data mining of biological
sequences to find rules for
recognizing & understanding
functional sites
Dragon’s 10x reduction of TSS recognition false positives
Diagnose Leukaemia,
Benefit Children
• Childhood leukaemia is a
heterogeneous disease
• Treatment is based on subtype
• 3 different tests and 4 different
experts are needed for diagnosis
→ Curable in the USA, fatal in Indonesia
• A single platform diagnosis
based on gene expression
• Data mining to discover
rules that are easy for
doctors to understand
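As a rough illustration of what an easy-to-read rule might look like, the sketch below searches for the best single-gene threshold rule (a decision stump) on toy data. Everything here is an assumption made for the example: the gene names, expression values, subtype labels, and the function `best_stump` are hypothetical, and this is not claimed to be the method behind the slide.

```python
from typing import Dict, List, Tuple

Sample = Tuple[Dict[str, float], str]

# Hypothetical toy data: expression level per gene, plus a subtype label.
samples: List[Sample] = [
    ({"gene_A": 8.1, "gene_B": 2.0}, "subtype_1"),
    ({"gene_A": 7.5, "gene_B": 5.5}, "subtype_1"),
    ({"gene_A": 2.2, "gene_B": 2.4}, "subtype_2"),
    ({"gene_A": 1.9, "gene_B": 5.1}, "subtype_2"),
]


def best_stump(data: List[Sample]) -> Tuple[str, float, str, str, float]:
    """Return (gene, threshold, label_if_above, label_if_below, accuracy)
    for the most accurate single-gene threshold rule on the data."""
    labels = sorted({label for _, label in data})
    genes = sorted(data[0][0])
    best = None
    for gene in genes:
        values = sorted({expr[gene] for expr, _ in data})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            for above in labels:
                for below in labels:
                    if above == below:
                        continue
                    correct = sum(
                        (above if expr[gene] > threshold else below) == label
                        for expr, label in data
                    )
                    accuracy = correct / len(data)
                    if best is None or accuracy > best[4]:
                        best = (gene, threshold, above, below, accuracy)
    if best is None:
        raise ValueError("need at least two labels and two distinct values")
    return best


gene, threshold, above, below, accuracy = best_stump(samples)
print(f"IF {gene} > {threshold:.2f} THEN {above} ELSE {below} "
      f"(training accuracy {accuracy:.0%})")
```

A rule of this shape can be read and checked by a clinician directly, which is the property the slide asks for; real expression data would need many genes and proper validation.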
Understand Proteins,
Fight Diseases
• Understanding the function and role
of a protein needs organised info
on interaction pathways
• Such info is often reported in
scientific papers but is seldom
found in structured databases
• Knowledge extraction
system to process free text
• extract protein names
• extract interactions
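The two extraction steps above can be sketched with a simple pattern matcher. This is a deliberately naive illustration, not the system described on the slide: the verb list, the protein-name pattern, the protein names, and the example sentence are all hypothetical assumptions.

```python
import re
from typing import List, Tuple

# Hypothetical interaction verbs; a real system would need a far richer lexicon.
VERBS = ("binds", "phosphorylates", "activates", "inhibits")

# Rough assumption: a protein name is a capitalised token containing a digit,
# e.g. "Abc1". Real protein names are much more varied than this.
NAME = r"[A-Z][A-Za-z0-9]*\d[A-Za-z0-9]*"

PATTERN = re.compile(
    r"\b(" + NAME + r")\s+(" + "|".join(VERBS) + r")\s+(" + NAME + r")\b"
)


def extract_interactions(text: str) -> List[Tuple[str, str, str]]:
    """Return (protein, verb, protein) triples found in free text."""
    return PATTERN.findall(text)


# Made-up sentence with made-up protein names.
text = "We found that Abc1 binds Xyz2. In addition, Xyz2 inhibits Def3 in vitro."

for source, verb, target in extract_interactions(text):
    print(source, verb, target)
```

The triples are the kind of structured record that could then be loaded into a database of interaction pathways.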
Knowledge Discovery from
Biological and Clinical Data:
OPPORTUNITY
Direction & Plan
• Objectives
– Translate inspiration
from biological
systems into
advancement of life
and computing
sciences
– Advance data mining
technologies in
decision systems for
complex problems
• To work on practical
systems for
– data mining
– data cleansing
– knowledge extraction
• Applied to
– gene regulation
– protein interaction
– clinical data analysis
– ligand-receptor interaction
E.g., How to Get More Out of the
Same Experiments?
• How to recognize false positives from two-hybrid and
other types of high-throughput protein interaction
experiments?
• Some initial thoughts:
It seems that configuration
a is less likely than b. Can
we exploit this?
[Figure: configurations a and b]
E.g., How to Improve Classifier
Algorithms?
• SVM, ANN, etc.
– Good accuracy,
– but not easy to understand
• C4.5, CART, etc.
– Clear rules,
– but lower accuracy
• Why can’t we have a classifier algorithm that
– handles high dimension
– achieves high accuracy
– provides understandable rules
Who will you be working with...
SOC
Vladimir Bajic
I2R
Vladimir Brusic
See-Kiong Ng
Jinyan Li
Limsoon Wong
Wynne Hsu
Mong Li Lee
Ken Sung