Microarray Analysis 2

Flow chart of Affymetrix from sample to information

[Flow chart: Tissue -> Hyb. cRNA -> Hybridize to Affy arrays -> Generate Affy .dat file -> Output as Affy .chp file -> Text output; downstream analyses: functional annotation, pathway assignment, co-ordinate regulation, Self-Organized Maps (SOMs), promoter motif commonalities]
Microarray Data Analysis

- Data preprocessing and visualization
- Supervised learning
  - Machine learning approaches
- Unsupervised learning
  - Clustering and pattern detection
  - Gene regulatory region prediction based on coregulated genes
  - Linkage between gene expression data and gene sequence/function databases
…
Data preprocessing

- Data preparation or pre-processing
  - Normalization
  - Feature selection
    - Based on the quality of the signal intensity
    - Based on the fold change
    - T-test
…
Normalization

[Figure: chip images for Experiment 1 vs. Control and Experiment 2 vs. Control]
- Need to scale the red sample so that the overall intensities for each chip are equivalent
Normalization

- To ensure the data are comparable, normalization attempts to correct for the following variables:
  - Number of cells in the sample
  - Total RNA isolation efficiency
  - Signal measurement sensitivity
- Can use simple math
…
- Normalization by global scaling (bring each image to the same average brightness); see the sketch below
- Normalization by sectors
- Normalization to housekeeping genes
…
Active research area
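As a minimal sketch of the first method, global scaling in NumPy: every chip is rescaled so its mean intensity matches a common target. The array shapes, the gamma-distributed stand-in intensities, and the target mean of 500 are illustrative assumptions, not values from the slides.

```python
import numpy as np

def global_scale(chip: np.ndarray, target_mean: float = 500.0) -> np.ndarray:
    """Rescale one chip's intensities so their mean equals target_mean."""
    return chip * (target_mean / chip.mean())

# Two hypothetical chips with different overall brightness (stand-in data).
rng = np.random.default_rng(0)
experiment = rng.gamma(2.0, 300.0, size=7129)   # brighter chip
control = rng.gamma(2.0, 200.0, size=7129)      # dimmer chip

experiment_n = global_scale(experiment)
control_n = global_scale(control)
print(experiment_n.mean(), control_n.mean())    # both ~500 after scaling
```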
Basic Data Analysis

- Fold change (relative change in intensity for each gene); a small example follows

[Figure: example fold changes for Mn-SOD, Annexin IV, and Aminoacylase 1]
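Fold change is a one-line computation. In this sketch the gene names are reused from the figure, but the control/treated intensities are invented for illustration:

```python
import numpy as np

# Hypothetical normalized intensities (control, treated) for three genes.
genes = {
    "Mn-SOD": (250.0, 2100.0),
    "Annexin IV": (400.0, 180.0),
    "Aminoacylase 1": (600.0, 610.0),
}

for name, (control, treated) in genes.items():
    fold = treated / control        # relative change in intensity
    log2_fold = np.log2(fold)       # symmetric measure of up/down regulation
    print(f"{name}: fold={fold:.2f}, log2={log2_fold:+.2f}")
```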
Microarray Data Analysis

- Data preprocessing and visualization
- Supervised learning
  - Machine learning approaches
- Unsupervised learning
  - Clustering and pattern detection
  - Gene regulatory region prediction based on coregulated genes
  - Linkage between gene expression data and gene sequence/function databases
…
Microarrays: An Example

- Leukemia: Acute Lymphoblastic (ALL) vs. Acute Myeloid (AML); Golub et al., Science, v. 286, 1999
- 72 examples (38 train, 34 test), about 7,000 probes
- Well studied (CAMDA-2000); a good test example

[Images: ALL and AML blood smears - visually similar, but genetically very different]
Feature selection

Probe           AML1    AML2    AML3    ALL1    ALL2    ALL3    p-value
D21869_s_at     170.7   55.0    43.7    5.5     807.9   1283.5  0.243
D25233cds_at    605     31.0    629.2   441.7   95.3    205.6   0.487
D25543_at       2148.7  2303.0  1915.5  49.2    96.3    89.8    0.0026
L03294_g_at     241.8   721.5   77.2    66.1    107.3   132.5   0.332
J03960_at       774.5   3439.8  614.3   556     14.4    12.9    0.260
M81855_at       1087    1283.7  1372.1  1469    4611.7  3211.8  0.178
L14936_at       212.6   2848.5  236.2   260.5   2650.9  2192.2  0.626
L19998_at       367     3.2     661.7   629.4   151     193.9   0.941
L19998_g_at     65.2    56.9    29.6    434.0   719.4   565.2   0.022
AB017912_at     1813.7  9520.6  2404.3  3853.1  6039.4  4245.7  0.963
AB017912_g_at   385.4   2396.8  363.7   419.3   6191.9  5617.6  0.236
U86635_g_at     83.3    470.9   52.3    3272.5  3379.6  5174.6  0.022
…               …       …       …       …       …       …       …
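The p-value column comes from a two-sample t-test per probe. A sketch of that computation with SciPy; the expression matrix here is random stand-in data, not the Golub values:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
n_probes = 7129
aml = rng.normal(1000, 300, size=(3, n_probes))   # 3 AML samples (stand-in data)
all_ = rng.normal(900, 300, size=(3, n_probes))   # 3 ALL samples (stand-in data)

# Two-sample t-test for every probe at once (axis 0 runs over samples).
t_stat, p_val = ttest_ind(aml, all_, axis=0)

# Rank probes by p-value; the smallest p-values are the best discriminators.
top = np.argsort(p_val)[:51]
print("best probe index:", top[0], "p =", p_val[top[0]])
```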



Hypothesis Testing

- The null hypothesis is a hypothesis about a population parameter.
- Hypothesis testing assesses the viability of the null hypothesis for a set of experimental data.
- Example: test whether the time to respond to a tone is affected by the consumption of alcohol
  - Null hypothesis: µ1 - µ2 = 0
  - µ1 is the mean time to respond after consuming alcohol
  - µ2 is the mean time to respond otherwise
Z-test

- Theorem: If xi ~ N(µ, σ²) for i = 1, …, n, then U = Σ ai·xi has a normal distribution with mean E(U) = µ·Σ ai and variance D(U) = σ²·Σ ai².
- In particular, the sample mean x̄ = Σ xi / n ~ N(µ, σ²/n).
- Z-test: H0: µ = µ0 (µ0 and σ0 are known; assume σ = σ0)
- Example: What would one conclude about the null hypothesis that a sample of N = 46 with a mean of 104 could reasonably have been drawn from a population with the parameters µ = 100 and σ = 8?
  - z = (104 - 100) / (8/√46) ≈ 3.39, which exceeds the two-tailed critical value of 1.96 at α = 0.05.
  - Reject the null hypothesis.
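The same computation in Python; scipy.stats.norm supplies the two-tailed p-value (a small sketch confirming the slide's conclusion):

```python
from math import sqrt
from scipy.stats import norm

n, xbar, mu0, sigma = 46, 104.0, 100.0, 8.0
z = (xbar - mu0) / (sigma / sqrt(n))    # standardized test statistic
p = 2 * norm.sf(abs(z))                 # two-tailed p-value
print(f"z = {z:.2f}, p = {p:.5f}")      # z ~= 3.39, p ~= 0.0007 -> reject H0 at alpha = 0.05
```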
Histogram

[Figure: overlaid histograms of Set 1 and Set 2]
T-test

- Developed by William Sealy Gosset (1876-1937) of the Guinness Brewing Company, who published under the pseudonym "Student".
Project 3

- A training data set (38 samples, 7129 probes, 27 ALL, 11 AML)
- A testing data set (35 samples, 7129 probes, 22 ALL, 13 AML)
- Lab today: pick the top probes that can differentiate the two subtypes and process the testing data set
K Nearest Neighbor Classification

[Figure: scatter plot in the Feature 1 vs. Feature 2 plane; L = ALL, M = AML, and an unlabeled test sample is assigned the majority label of its k nearest neighbors]
Distance measures

- Euclidean distance: d(x, y) = √(Σi (xi - yi)²)
- Manhattan distance: d(x, y) = Σi |xi - yi|
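A compact kNN classifier supporting both distance measures, in pure NumPy. The toy two-feature points and labels below are invented for illustration:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3, metric="euclidean"):
    diff = X_train - x
    if metric == "euclidean":
        d = np.sqrt((diff ** 2).sum(axis=1))   # sqrt(sum (xi - yi)^2)
    else:                                      # manhattan
        d = np.abs(diff).sum(axis=1)           # sum |xi - yi|
    nearest = np.argsort(d)[:k]                # indices of the k closest samples
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy two-feature data: L = ALL, M = AML (values invented).
X = np.array([[1.0, 1.2], [0.8, 1.0], [1.1, 0.9],
              [3.0, 3.2], [2.8, 3.0], [3.1, 2.9]])
y = np.array(["L", "L", "L", "M", "M", "M"])
print(knn_predict(X, y, np.array([2.9, 3.1])))   # -> "M"
```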
Jury Decisions

- Use one feature at a time for the classification
- Combine the results from the top 51 features
- Majority decision, e.g. for one test sample:

  Feature0  Feature1  …  Feature50  Majority
  M         L         …  M          M
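The jury decision reduces to a majority vote over the 51 single-feature predictions. The vote counts below are illustrative:

```python
from collections import Counter

votes = ["M", "L"] + ["M"] * 30 + ["L"] * 19    # 51 illustrative single-feature calls
decision = Counter(votes).most_common(1)[0][0]  # majority decision
print(decision)                                 # -> "M"
```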
False Discovery

- Two possible errors can be made in a decision about the null hypothesis:
  1. We could reject the null hypothesis when it is actually true, i.e., our results were obtained by chance (Type I error).
  2. We could fail to reject the null hypothesis when it is actually false, i.e., our experiment failed to detect a true difference that exists (Type II error).
- We set α at a level which will minimize the chances of making either of these errors.
False Discovery

- A Type I error is a false discovery.
- The expected number of false discoveries is roughly the p-value cutoff of the t-test × the number of genes on the array:
  - For a p-value cutoff of 0.01 and 10,000 genes, 0.01 × 10,000 = 100 genes may look "different" by chance alone.
- You cannot eliminate false positives, but by choosing a more stringent p-value you can keep them manageable (try p = 0.001).
- The number of false discoveries must be smaller than the number of real differences that you find, which in turn depends on the size of the differences and the variability of the measured expression values.
RCC subtypes

- Clear cell RCC (70-80%)
- Papillary (15-20%)
- Chromophobe (4-5%)
- Collecting duct
- Oncocytoma
- Sarcomatoid RCC

Goal: identify a panel of discriminator genes for these subtypes.
Genetic Algorithm for Feature Selection

[Diagram: raw measurement data for one sample (e.g., clear cell RCC) is reduced to a feature vector (f1, f2, f3, f4, f5) = pattern]
Why Genetic Algorithm?

- Assume 2,000 relevant genes and 20 important discriminator genes (features).
- Cost of an exhaustive search for the optimal set of features:
  C(n, k) = n! / (k!(n-k)!)
  C(2000, 20) = 2000! / (20! · 1980!) ≥ 100^20 = 10^40
- Even if it took only one femtosecond (10^-15 second) to evaluate a set of features, it would take more than 3 × 10^17 years to find the optimal solution by exhaustive search.
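The count is easy to verify; math.comb does the factorial arithmetic exactly:

```python
import math

combos = math.comb(2000, 20)          # 2000! / (20! * 1980!)
print(f"{combos:.3e}")                # ~3.9e47, comfortably above the 1e40 lower bound
seconds = combos * 1e-15              # one femtosecond per evaluation
print(f"{seconds / 3.15e7:.1e} years")  # ~1e25 years; the slide's 3e17 figure uses the 1e40 bound
```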
Evolutionary Methods

- Based on the mechanics of Darwinian evolution
- A population of competing candidate solutions
  - Chromosomes (here, a set of features)
- Genetic operators (mutation, recombination, etc.) generate new candidate solutions
- Selection pressure directs the search: those that do well survive (selection) to form the basis for the next set of solutions
- The evolution of a solution is loosely based on biological evolution
A Simple Evolutionary Algorithm

[Diagram: repeating cycle of Evaluation -> Selection -> Genetic Operators]
Genetic Algorithm

Example populations (each chromosome is a set of five genes; the leading number is its fitness):

  Fitness  Chromosome
  4        g100  g2   g5   g7   g20
  3        g21   g3   g6   g1   g2
  2        g10   g12  g15  g7   g12
  1        g1    g21  g51  g17  g201
  5        g10   g23  g56  g72  g25   <- stop if fitness is good enough

After selection and the genetic operators, the next generation might be:

  g100  g2   g5   g1   g2
  g21   g3   g6   g7   g20
  g10   g22  g15  g7   g12
  g14   g23  g25  g7   g20
  g10   g23  g56  g72  g25
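A minimal evaluation -> selection -> genetic-operators loop matching the diagram and the populations above. The population size, mutation rate, and placeholder fitness function are assumptions for the sketch, not values from the slides:

```python
import random

def evolve(fitness, pop_size=20, chrom_len=5, n_genes=2000, generations=50):
    """Tiny GA: each chromosome is a list of gene indices (a feature subset)."""
    pop = [random.sample(range(n_genes), chrom_len) for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)   # evaluation
        parents = scored[: pop_size // 2]                 # truncation selection
        pop = parents[:]
        while len(pop) < pop_size:                        # genetic operators
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, chrom_len)
            child = a[:cut] + b[cut:]                     # one-point crossover
            if random.random() < 0.1:                     # mutation
                child[random.randrange(chrom_len)] = random.randrange(n_genes)
            pop.append(child)
    return max(pop, key=fitness)

# Placeholder fitness: prefer low gene indices (stands in for classifier accuracy).
best = evolve(lambda c: -sum(c))
print(best)
```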
Encoding

- The most difficult, and most important, part of any GA
- Encode so that illegal solutions are not possible
- Encode to simplify the "evolutionary" processes, e.g. to reduce the size of the search space
- Most GAs use a binary encoding of a solution, but other schemes are possible
GA Fitness

- At the core of any optimization approach is the function that measures the quality of a solution.
- It goes by several names:
  - Objective function
  - Fitness function
  - Error function
  - Measure
  - etc.
Genetic Operators

Mutation (randomly selected mutation site), e.g.:
  [10 30 50 70] -> [10 30 62 80]

Crossover (randomly selected crossover point), e.g.:
  [10 30 50 70] x [20 40 60 80] -> [10 30 60 80] and [20 40 50 70]

- Recombination is intended to produce promising individuals.
- Mutation maintains population diversity, preventing premature convergence.
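The two operators written out in Python; the parent vectors are the figure's example values, and the gene-value range is an assumption:

```python
import random

def mutate(chrom, n_genes=100):
    """Replace the value at a randomly selected mutation site."""
    site = random.randrange(len(chrom))
    child = chrom[:]
    child[site] = random.randrange(n_genes)
    return child

def crossover(a, b):
    """One-point crossover at a randomly selected point."""
    point = random.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]

p1, p2 = [10, 30, 50, 70], [20, 40, 60, 80]
print(crossover(p1, p2))   # e.g. ([10, 30, 60, 80], [20, 40, 50, 70]) when point = 2
print(mutate(p1))          # e.g. [10, 30, 62, 70] if site 2 mutates to 62
```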
Genetic Algorithm / K-Nearest Neighbor Algorithm

[Diagram: Microarray Database -> Download -> Feature Selection (GA) -> Classifier (kNN)]
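Putting the pieces together, a sketch of the GA/kNN pipeline: the GA proposes feature subsets, and leave-one-out kNN accuracy on the training data serves as the fitness. The data shapes mirror the Project 3 training set, but the matrix values, GA settings, and k are assumptions; real data would be downloaded from the microarray database.

```python
import random
import numpy as np
from collections import Counter

def knn_loo_accuracy(X, y, features, k=3):
    """Leave-one-out kNN accuracy restricted to the chosen features."""
    Xf = X[:, features]
    correct = 0
    for i in range(len(y)):
        d = np.sqrt(((Xf - Xf[i]) ** 2).sum(axis=1))   # Euclidean distances
        d[i] = np.inf                                  # exclude the held-out sample
        nearest = np.argsort(d)[:k]
        if Counter(y[nearest]).most_common(1)[0][0] == y[i]:
            correct += 1
    return correct / len(y)

def ga_knn(X, y, chrom_len=20, pop_size=30, generations=40):
    """GA searches over probe subsets; kNN accuracy is the fitness."""
    n_genes = X.shape[1]
    pop = [random.sample(range(n_genes), chrom_len) for _ in range(pop_size)]
    fit = lambda c: knn_loo_accuracy(X, y, c)
    for _ in range(generations):
        pop.sort(key=fit, reverse=True)                # evaluation
        parents = pop[: pop_size // 2]                 # selection
        pop = parents[:]
        while len(pop) < pop_size:                     # genetic operators
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, chrom_len)
            child = a[:cut] + b[cut:]
            if random.random() < 0.2:
                child[random.randrange(chrom_len)] = random.randrange(n_genes)
            pop.append(child)
    return max(pop, key=fit)

# Stand-in training matrix (38 samples x 7129 probes) with random values.
rng = np.random.default_rng(2)
X = rng.normal(size=(38, 7129))
y = np.array(["ALL"] * 27 + ["AML"] * 11)
panel = ga_knn(X, y, generations=5)                    # few generations for a quick demo
print("selected probes:", sorted(panel))
```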