Data Mining in Bioinformatics

January 31, 2002
Peter Bajcsy, PhD
Automated Learning Group
National Center for Supercomputing Applications
University of Illinois
pbajcsy@ncsa.uiuc.edu
Outline
• Introduction
• Overview of Microarray Problem
• Image Analysis
• Data Mining
• Validation
• Summary
Introduction: Recommended Literature
1. Bioinformatics – The Machine Learning Approach by P. Baldi & S.
Brunak, 2nd edition, The MIT Press, 2001
2. Data Mining – Concepts and Techniques by J. Han & M. Kamber,
Morgan Kaufmann Publishers, 2001
3. Pattern Classification by R. Duda, P. Hart and D. Stork, 2nd edition,
John Wiley & Sons, 2001
Introduction: Microarray Problem in Bioinformatics Domain
• Problems in Bioinformatics Domain
— Data production at the levels of molecules, cells,
organs, organisms, populations
— Integration of structure and function data, gene
expression data, pathway data, phenotypic and
clinical data, …
— Prediction of Molecular Function and Structure
— Computational biology: synthesis (simulations) and
analysis (machine learning)
Microarray Problem: Major Objective
• Major Objective: Discover a comprehensive theory of
life’s organization at the molecular level
— The major actors of molecular biology: the nucleic
acids, DeoxyriboNucleic acid (DNA) and RiboNucleic
Acids (RNA)
— The central dogma of molecular biology
Proteins are complex molecules built from 20
different amino acids.
Input and Output of Microarray Data Analysis
• Input: Laser image scans (data) and underlying experiment
hypotheses or experiment designs (prior knowledge)
• Output:
— Conclusions about the input hypotheses or knowledge
about statistical behavior of measurements
— The theory of biological systems learnt automatically from
data (machine learning perspective)
– Model fitting, Inference process
Overview of Microarray Problem
[Diagram blocks: Biology Application Domain; Experiment Design and Hypothesis; Microarray Experiment; Image Analysis; Data Analysis; Validation; Data Warehouse; Artificial Intelligence (AI); Data Mining; Knowledge discovery in databases (KDD)]
Artificial Intelligence (AI) Community
Design Cycle of Predictive Modeling
• Steps: Collect Data, Choose Features, Choose Model, Train Classifier, Evaluate Classifier
• Issues:
— Prior knowledge (e.g., invariance)
— Model deviation from true model
— Sampling distributions
— Computational complexity
— Model complexity (overfitting)
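The design cycle can be walked through with a toy example. The sketch below is illustrative only: the data values, the choice of a single feature, and the nearest-mean model are assumptions made for the example and do not come from the slides.

```python
from statistics import mean

def collect_data():
    # Hypothetical measurements: ((feature_1, feature_2), class label).
    return [((1.0, 5.0), "low"), ((1.2, 4.0), "low"),
            ((3.0, 5.5), "high"), ((3.3, 4.2), "high"),
            ((0.9, 4.8), "low"), ((3.1, 5.1), "high")]

def choose_features(measurement):
    # Assumed prior knowledge: only the first measurement is informative.
    return measurement[0]

def train_classifier(training_set):
    # Chosen model (an assumption): one mean feature value per class.
    by_label = {}
    for measurement, label in training_set:
        by_label.setdefault(label, []).append(choose_features(measurement))
    return {label: mean(values) for label, values in by_label.items()}

def classify(model, measurement):
    # Assign the class whose mean is closest to the feature value.
    x = choose_features(measurement)
    return min(model, key=lambda label: abs(model[label] - x))

def evaluate_classifier(model, test_set):
    # Fraction of held-out samples that are classified correctly.
    hits = [classify(model, m) == label for m, label in test_set]
    return sum(hits) / len(hits)

data = collect_data()
model = train_classifier(data[:4])          # train on the first four samples
accuracy = evaluate_classifier(model, data[4:])  # evaluate on the remaining two
```

Evaluating on samples that were not used for training is what exposes the model complexity (overfitting) issue listed above.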
Knowledge Discovery in Databases (KDD) Community
Database
GeneFilter Comparison Report
GeneFilter 1 Name: O2#1 8-20-99adjfinal
GeneFilter 2 Name: N2#1finaladj

                                       RAW INTENSITIES   NORMALIZED INTENSITIES
ORF NAME  GENE NAME  CHRM  F  G  R     GF1      GF2      GF1        GF2        DIFFERENCE  RATIO
YAL001C   TFC3       1     1  A  1  2  12.03    7.38     403.83     209.79     194.04      1.92
YBL080C   PET112     2     1  A  1  3  53.21    35.62    1,786.11   1,013.13   772.98      1.76
YBR154C   RPB5       2     1  A  1  4  79.26    78.51    2,660.73   2,232.86   427.87      1.19
YCL044C              3     1  A  1  5  53.22    44.66    1,786.53   1,270.12   516.41      1.41
YDL020C   SON1       4     1  A  1  6  23.80    20.34    799.06     578.42     220.64      1.38
YDL211C              4     1  A  1  7  17.31    35.34    581.00     1,005.18   -424.18     -1.73
YDR155C   CPH1       4     1  A  1  8  349.78   401.84   11,741.98  11,428.10  313.88      1.03
YDR346C              4     1  A  1  9  64.97    65.88    2,180.87   1,873.67   307.21      1.16
YAL010C   MDM10      1     1  A  2  2  13.73    9.61     461.03     273.36     187.67      1.69
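The DIFFERENCE and RATIO values in the report are consistent with a simple rule applied to the normalized intensities: the difference is GF1 minus GF2, and the ratio is the larger value divided by the smaller, with a negative sign when GF2 exceeds GF1. This rule is inferred from the numbers shown, not stated in the report, so the sketch below is an assumption. The function name is hypothetical.

```python
def difference_and_ratio(gf1, gf2):
    """Return (difference, signed ratio) for two normalized intensities.

    Assumed convention, inferred from the report values:
    ratio = larger / smaller, negated when gf2 > gf1.
    """
    difference = gf1 - gf2
    if gf1 >= gf2:
        ratio = gf1 / gf2
    else:
        ratio = -(gf2 / gf1)
    return round(difference, 2), round(ratio, 2)

# Normalized intensities for ORFs YAL001C and YDL211C from the report.
up = difference_and_ratio(403.83, 209.79)
down = difference_and_ratio(581.00, 1005.18)
```

A signed ratio keeps up-regulated and down-regulated genes on a symmetric scale, which a plain GF1/GF2 quotient does not.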
Data Mining and Image Analysis Steps
• Image Analysis
— Normalization
— Grid Alignment
— Feature construction (selection and extraction)
• Data Mining
— Statistics
— Machine learning
— Pattern recognition
— Database techniques
— Optimization techniques
— Visualization
— Prior knowledge
• Validation
— Issues
— Cross validation techniques
[Slide background: a cropped copy of the GeneFilter Comparison Report (O2#1 8-20-99adjfinal, N2#1finaladj), continuing with ORFs YBL088C (TEL1), YBR162C, YCL052C (PBN1), YDL028C, YDL219W, YDR163W and YDR354W; gene names MPS1 and TRP4 also appear]
IMAGE ANALYSIS
Image Analysis: Normalization
[Figure: Cattle and Soy Controls. Red band and green band images, each annotated with its dynamic range. Labeled spots: Beta Actin, PKG, HPRT, Beta 2 microglobulin, Rubisco, AB binding protein, Major latex protein homologue (MSG).]
Array of cattle and soy spiking controls. 50 µg of cattle brain total RNA was labeled with Cy3 (green).
1 µl each of in vitro transcribed soy Rubisco (5 ng), AB binding protein (0.5 ng) and MSG (0.05 ng)
were labeled with Cy5. The two labeled samples were cohybridized on superamine slides (Telechem,
Inc.). To the right of each set of spots are five negative controls (water).
Solution: Reference points with reference values
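One possible reading of "reference points with reference values" is a per-channel gain correction: control spots with known expected values are measured, and a scale factor is chosen so that the measured controls match their reference values as closely as possible. The sketch below assumes a single multiplicative gain fitted by least squares. The slides do not specify the normalization model, and all names and numbers are hypothetical.

```python
def scale_factor(measured_controls, reference_values):
    """Least-squares gain g that minimizes sum((g * m - r) ** 2)."""
    numerator = sum(m * r for m, r in zip(measured_controls, reference_values))
    denominator = sum(m * m for m in measured_controls)
    return numerator / denominator

def normalize_channel(intensities, measured_controls, reference_values):
    """Rescale every spot intensity of one channel by the fitted gain."""
    gain = scale_factor(measured_controls, reference_values)
    return [gain * value for value in intensities]

# Hypothetical red-channel example: controls read at half their reference value.
normalized = normalize_channel([1.0, 2.0, 3.0], [10.0, 20.0], [20.0, 40.0])
```

Applying the same procedure to the red and green bands separately would bring both channels onto a common scale before ratios are formed.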
Image Analysis: Grid Alignment
Solution: Manual, semi-automatic and fully automatic alignment
based on fiducials and/or global grid fitting.
Image Analysis: Feature Selection
Features: mean, median, standard
deviation, ratios
Area: Sensitive to
background noise
Image Analysis: Feature Extraction
• Area is determined by image thresholding and used during
feature extraction
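A minimal sketch of threshold-based feature extraction, assuming each spot is given as a small 2D grid of pixel intensities and that a fixed threshold separates spot from background. The threshold value, the function names, and the pixel data are hypothetical. A real pipeline would choose the threshold from the image itself.

```python
from statistics import mean, median, pstdev

def spot_features(pixels, threshold):
    """Threshold one spot's pixel grid and summarize the foreground area."""
    foreground = [v for row in pixels for v in row if v > threshold]
    if not foreground:
        return None  # no pixel exceeded the threshold
    return {
        "area": len(foreground),        # number of pixels above threshold
        "mean": mean(foreground),
        "median": median(foreground),
        "std": pstdev(foreground),      # population standard deviation
    }

def channel_ratio(red_pixels, green_pixels, threshold):
    """Ratio of mean foreground intensities of the red and green bands."""
    red = spot_features(red_pixels, threshold)
    green = spot_features(green_pixels, threshold)
    if red is None or green is None:
        return None
    return red["mean"] / green["mean"]

features = spot_features([[1, 2, 1], [2, 10, 12], [1, 14, 2]], threshold=5)
ratio = channel_ratio([[1, 10], [12, 14]], [[6, 6], [1, 6]], threshold=5)
```

Because the area depends directly on the threshold, background noise near the threshold changes which pixels are counted, which is the sensitivity noted on the feature selection slide.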
[Figure annotations: 1102; Dist: 2004; Box: 902; Plane: 2632]
DATA MINING
Why Data Mining? Sequence Example
• Biology: Language and Goals
— A gene can be defined as a region of DNA.
— A genome is one haploid set of chromosomes with the genes
they contain.
— Perform competent comparison of gene sequences across
species and account for inherently noisy biological
sequences due to random variability amplified by evolution.
— Assumption: if a gene has high similarity to another gene,
then they perform the same function.
• Analysis: Language and Goals
— A feature is an extractable attribute or measurement (e.g.,
gene expression, location).
— Pattern recognition tries to characterize patterns in data
(e.g., similar gene expressions, equidistant gene locations).
— Data mining is about uncovering patterns, anomalies and
statistically significant structures in data (e.g., find two
similar gene expressions with confidence > x).
Data Mining Techniques
Data mining techniques draw from
Statistics
Machine learning
Database techniques
Pattern recognition
Optimization techniques
Visualization
Statistics
• Descriptive Statistics: Describe data
• Inductive Statistics: Make forecasts and inferences
— Are two sample sets identically distributed?
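One way to ask whether two sample sets are identically distributed is a permutation test. The slides do not name a specific test, so the choice here is an assumption: the sketch shuffles the pooled samples and counts how often a random split produces a difference of means at least as large as the observed one. A difference-of-means statistic is mainly sensitive to shifts in location, so other statistics would be needed to detect differences in spread or shape.

```python
import random
from statistics import mean

def permutation_test(a, b, n_permutations=2000, seed=0):
    """Estimate a p-value for the null hypothesis that a and b share one distribution."""
    rng = random.Random(seed)          # fixed seed for repeatable results
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)

# Hypothetical expression measurements under two conditions.
p_shifted = permutation_test([1.0, 1.1, 0.9, 1.2, 1.0, 0.8],
                             [3.0, 3.1, 2.9, 3.2, 3.0, 2.8])
p_same = permutation_test([1, 2, 3], [1, 2, 3])
```

A small p-value is evidence against identical distributions. A large one only means the data do not contradict the hypothesis.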
Machine Learning
• Unsupervised: “Natural groupings”
• Supervised
• Reinforced
[Figure: Examples]
Pattern Recognition
• Statistical Models
• Linear Correlation and Regression
• Locally Weighted Learning
• Decision Trees
• Neural Networks
— NN representation and gradient based optimization
— NN representation and genetic algorithm based optimization
• k-nearest neighbors, support vectors
Database Techniques
• Database Design and Modeling (tables, procedures,
functions, constraints)
• Database Interface to Data Mining System
• Efficient Import and Export of Data
• Database Data Visualization
• Database Clustering for Access Efficiency
• Database Performance Tuning (memory usage, query
encoding)
• Database Parallel Processing (multiple servers and
CPUs)
• Distributed Information Repositories (data warehouse)
Optimization Techniques
• Highly nonlinear search space (global versus local
maxima)
— Gradient based optimization
— Genetic algorithm based optimization
— Optimization with sampling
• Large search space
— Example: A genome with N genes can encode 2^N
states (each gene is either active or inactive; degrees of
regulation are not considered). Human genome ~ 2^30,000;
Nematode genome ~ 2^20,000 patterns.
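To get a feel for the size of this search space, the sketch below enumerates all on/off states for a tiny genome and computes how many decimal digits 2^N has for the gene counts quoted above. The function names are illustrative.

```python
import itertools
import math

def enumerate_states(n_genes):
    """All on/off patterns for n_genes genes (only practical for small n_genes)."""
    return list(itertools.product((0, 1), repeat=n_genes))

def decimal_digits_of_power_of_two(n):
    """Number of decimal digits of 2**n, computed without building the number."""
    return math.floor(n * math.log10(2)) + 1

states_for_3_genes = enumerate_states(3)                   # 2**3 patterns
human_digits = decimal_digits_of_power_of_two(30000)       # digits of 2^30,000
nematode_digits = decimal_digits_of_power_of_two(20000)    # digits of 2^20,000
```

2^30,000 is a number with more than nine thousand decimal digits, so exhaustive enumeration is out of the question. This is why sampling, gradient based, and genetic algorithm based searches appear on the list.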
Visualization
• Data: 3D cubes, distribution charts, curves, surfaces, link
graphs, image frames and movies, parallel coordinates
• Results: pie charts, scatter plots, box plots, association rules,
parallel coordinates, dendrograms, temporal evolution
[Figures: Pie chart; Parallel coordinates; Temporal evolution]
Prior Knowledge from Experiment Design
Complexity Levels of Microarray Experiments:
1. Compare single gene in a control situation versus a treatment situation
• Example: Is the level of expression (up-regulated or down-regulated)
significantly different in the two situations? (drug design application)
• Methods: t-test, Bayesian approach
2. Find multiple genes that share common functionalities
• Example: Find related genes that are dependent?
• Methods: Clustering (hierarchical, k-means, self-organizing maps,
neural network, support vector machines)
3. Infer the underlying gene and protein networks that are responsible
for the patterns and functional pathways observed
• Example: What is the gene regulation at system level?
• Directions: mining regulatory regions, modeling regulatory networks
on a global scale
Goal of Future Experiment Designs: Understand biology at the system level,
e.g., gene networks, protein networks, signaling networks, metabolic
networks, immune system and neuronal networks.
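For the first complexity level (one gene, control versus treatment), the t-test can be sketched as follows. The slides say only "t-test". The unequal-variance (Welch) form used here is an assumption, and the expression values are hypothetical. Turning the statistic into a p-value requires the t distribution, which is left out to keep the example to the standard library.

```python
import math
from statistics import mean, variance

def welch_t(control, treatment):
    """Welch's t statistic and approximate degrees of freedom."""
    n_c, n_t = len(control), len(treatment)
    # Squared standard errors of the two sample means.
    se2_c = variance(control) / n_c
    se2_t = variance(treatment) / n_t
    t = (mean(treatment) - mean(control)) / math.sqrt(se2_c + se2_t)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se2_c + se2_t) ** 2 / (se2_c ** 2 / (n_c - 1) + se2_t ** 2 / (n_t - 1))
    return t, df

# Hypothetical replicate measurements of one gene.
t_stat, dof = welch_t([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 3.0, 4.0, 5.0, 6.0])
```

A positive statistic suggests up-regulation under treatment and a negative one down-regulation. Whether the difference is significant depends on comparing the statistic with the t distribution at the computed degrees of freedom.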
Types of Expected Data Mining and Analysis Results
Hypothetical Examples:
• Binary answers using tests of hypotheses
— Drug treatment is successful with a confidence level x.
• Statistical behavior (probability distribution functions)
— A class of genes with functionality X follows a Poisson
distribution.
• Expected events
— As the amount of treatment increases, the gene
expression level decreases.
• Relationships
— Expression level of gene A is correlated with expression
level of gene B under varying treatment conditions (gene A
and B are part of the same pathway).
• Decision trees
— Classification of a new gene sequence by a “domain
expert”.
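The "Relationships" result type can be illustrated with the Pearson correlation coefficient between the expression levels of two genes measured under the same series of treatment conditions. The gene values below are hypothetical, and Pearson correlation is one possible measure. The slides do not specify which is meant.

```python
import math
from statistics import mean

def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    covariance_sum = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mx) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - my) ** 2 for y in ys))
    return covariance_sum / (spread_x * spread_y)

# Hypothetical expression levels under four treatment conditions.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]   # rises with gene A
gene_c = [8.0, 6.0, 4.0, 2.0]   # falls as gene A rises
r_ab = pearson_correlation(gene_a, gene_b)
r_ac = pearson_correlation(gene_a, gene_c)
```

A coefficient near +1 or -1 indicates a strong linear relationship. It does not by itself establish that the two genes belong to the same pathway.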
VALIDATION
Why Validation?
• Validation type:
— Within the existing data
— With newly collected data
• Errors and uncertainties:
— Systematic or random errors
— Unknown variables - number of classes
— Noise level - statistical confidence due to noise
— Model validity – error measure, model over-fit or under-fit
— Number of data points - measurement replicas
• Other issues:
— Experimental support of general theories
— Exhaustive sampling is not feasible
Cross Validation: Example
• One-tier cross validation
— Train on different data than test data
• Two-tier cross validation
— The score from one-tier cross validation is used by
the bias optimizer to select the best learning
algorithm parameters (# of control points). The
more you optimize the more you over-fit. The
second tier is to measure the level of over-fit
(unbiased measure of accuracy).
— Useful for comparing learning algorithms with
control parameters that are optimized.
— Number of folds is not optimized.
• Computational complexity:
— #folds of top tier × #folds of bottom tier ×
#control points × CPU of algorithm
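A minimal sketch of two-tier cross validation, under stated assumptions: the learner is a hypothetical one-dimensional k-nearest-neighbour classifier with binary labels, its control parameter is k, and folds are formed by taking every n-th sample. None of these choices come from the slides. The bottom tier picks the best k using only the top-tier training portion, and the top tier then scores that choice on data the selection never saw.

```python
from statistics import mean

def k_fold_indices(n, n_folds):
    """Yield (train, test) index lists; fold f tests every n_folds-th sample."""
    for fold in range(n_folds):
        test = [i for i in range(n) if i % n_folds == fold]
        train = [i for i in range(n) if i % n_folds != fold]
        yield train, test

def knn_predict(train_x, train_y, x, k):
    """Majority vote (labels 0/1) among the k nearest training points."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    votes = [train_y[i] for i in order[:k]]
    return 1 if 2 * sum(votes) > len(votes) else 0

def accuracy(xs, ys, train, test, k):
    """Fraction of test samples predicted correctly from the train samples."""
    train_x = [xs[i] for i in train]
    train_y = [ys[i] for i in train]
    hits = [knn_predict(train_x, train_y, xs[i], k) == ys[i] for i in test]
    return sum(hits) / len(hits)

def one_tier_score(xs, ys, subset, k, n_folds):
    """One-tier cross validation restricted to the samples listed in subset."""
    scores = []
    for train_pos, test_pos in k_fold_indices(len(subset), n_folds):
        train = [subset[p] for p in train_pos]
        test = [subset[p] for p in test_pos]
        scores.append(accuracy(xs, ys, train, test, k))
    return mean(scores)

def two_tier_score(xs, ys, control_points, top_folds, bottom_folds):
    """Return (top-tier accuracy, number of bottom-tier evaluations)."""
    top_scores, bottom_tier_runs = [], 0
    for train, test in k_fold_indices(len(xs), top_folds):
        # Bottom tier: optimize k without touching the top-tier test fold.
        best_k = max(control_points,
                     key=lambda k: one_tier_score(xs, ys, train, k, bottom_folds))
        bottom_tier_runs += bottom_folds * len(control_points)
        # Top tier: measure the selected k on the held-out fold.
        top_scores.append(accuracy(xs, ys, train, test, best_k))
    return mean(top_scores), bottom_tier_runs

# Hypothetical data: 20 samples, label switches from 0 to 1 at x = 10.
xs = [float(i) for i in range(20)]
ys = [0] * 10 + [1] * 10
score, runs = two_tier_score(xs, ys, control_points=[1, 3, 5],
                             top_folds=4, bottom_folds=3)
```

The run counter equals #folds of top tier × #folds of bottom tier × #control points, matching the complexity expression above. Each run additionally costs one execution of the learning algorithm.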
Summary
• Microarray problem
— Computational biology
— Major objective of microarray technology
— Input and output of data analysis
• Data mining and image analysis steps
— Image normalization, grid alignment, feature construction
— Data mining techniques
— Prior knowledge
— Expected results of data mining
• Validation
— Issues
— Cross validation techniques