Basic Principles in Bioinformatics: Understanding Microarrays

Pierre Farmer/Pascale Anderle
Swiss Institute for Bioinformatics/ISREC
Aim of This Course
Rapid overview of microarray technologies
Introduction to different bioinformatic solutions related to microarrays
Overview of the Course
Part I
Introduction into the microarray technology
Illustration of typical biological questions related to microarray studies
Short presentation of methods being used for the analysis of microarray data
Part II (TP)
Discussion of biological questions and how they could be answered applying
microarray data mining
Part III
Functional classification
Biological Problem
What is the difference between a tumor and
healthy tissue?
Are there different types of tumors?
Biological Fundamentals
Transcriptome: Genes
Proteome: Proteins
Microarrays
Genomics Fundamentals
Genomic DNA (ATGC): Exon 1, Intron 1, Exon 2, Intron 2, Exon 3, Intron 3
Transcription
Messenger RNA (AUGC): Exon 1, Exon 2, Exon 3
Reverse Transcription: RT-PCR
cDNA (ATGC): Exon 1, Exon 2, Exon 3
PCR
cDNA/PCR product (ATGC): Exon 2, Exon 3
Introduction into Microarray Technology : Sample
Tumor Tissue and Normal Tissue: Protein, mRNA, cDNA
mRNA: ……CCAGGCAAUAAAAAA and ……A U G AGUAAUAAAAAA
cDNA: ……CCAGGCAATAAAAAA and ……A T G AGTAATAAAAAA
Signal N < Signal T
Signal N << Signal T
Introduction into Microarray Technology
Figure: Gene A, Gene B and Gene C compared between Normal and Tumor
Introduction into Microarray Technology
Probes: oligomers or PCR products
Spotting: photolithography or printing
Physical support: glass slide, nylon membrane
Sample preparation and hybridization:
cRNA or cDNA
Single-labeling or dual-labeling
Fluorescence or radioactivity
Affymetrix: short oligo chip, single labeling
or
cDNA chip: oligos or PCR products, dual-labeling
Microarray: Definition
• Microarray analysis is a technology that allows the simultaneous detection of the expression of thousands of genes in a small sample.
• Microarrays are simply ordered sets of DNA molecules of known sequence fixed on a physical support.
Different Microarray Platforms
Definition of biological questions
Experimental design
Custom array
PCR products
Oligomers
Commercial array
Short oligos: Affymetrix
Long oligos: Agilent
Chip preparation
Probe design
Probe preparation
Printing
Sample preparation
cRNA/cDNA Labeling
Hybridization
Scanning
Data Acquisition and Data Analysis
Making the Chip: Probe Design
Choosing genes of interest for the experiment
Probe selection strategy should ensure:
everything! Or….
• Biologically meaningful results (The truth...)
• Coverage, sensitivity (... The whole truth...)
• Specificity (... And nothing but the truth)
• Annotation
Making the Chip: Probe Design
Coding region (ORF):
• Annotation relatively safe
• No problems with alternative polyA sites
• No repetitive elements or other funny sequences
• Danger of close isoforms
• Danger of alternative splicing
3’ untranslated region:
• Annotation less safe
• Danger of alternative polyA sites
• Danger of repetitive elements
• Less likely to cross-hybridize with isoforms
• Little danger of alternative splicing
5’ untranslated region:
• Close linkage to promoter
• Frequently not available
Figure labels: 5’ UTR, EXON A, INTRON, EXON B, 3’ UTR
Probe Design for Custom Array
Keywords, seed sequences
Search Pfam HMM db
HMM models
Run hmmsearch against GenPept db
Putative new genes
Filter genes (human only, set cut-off, eliminate redundant genes)
Transporters: 670; Channels: 263
Transporters: 316; Channels: 151
Contigs: 156
Positive controls: 9; Negative controls: 3; Controls (different oligos): 9
RGS: 75; FGF/RGF-like: 7; ADAM family: 18
Run Pick70
Multiple alignment and selection of representative genes
Run Pick70 (Tm = 70, Palindrome, Uniqueness = 15 bp)
236 contigs and singlets
Assemble contigs
Remove vector and characterized ESTs
Protein seed sequence
Converged PSI-Blast
Core protein family
Blast human EST db
EST nucleotide sequence
Brown et al. AAPS PharmSci. 2003
Anderle et al. Pharm Res. 2003
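The probe design slide lists a uniqueness criterion of 15 bp. The sketch below shows one possible reading of such a criterion: a candidate probe is rejected if it shares any 15-base stretch with another sequence. This is an illustrative assumption, not the actual Pick70 algorithm; the function names and the toy sequences are hypothetical, and reverse complements are ignored for brevity.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def is_unique(candidate, other_sequences, k=15):
    """True if the candidate shares no k-mer with any of the other sequences."""
    candidate_kmers = kmers(candidate, k)
    return all(candidate_kmers.isdisjoint(kmers(other, k))
               for other in other_sequences)

# Hypothetical toy sequences, not real transcripts.
probe = "ATGGCTAGCTAGGATCCGATCGATTAGC"
background = [
    "TTTTGGCTAGCTAGGATCCGATCAAAA",   # shares a long stretch with the probe
    "CCCCCCCCCCCCCCCCCCCCCCCCCCC",   # shares nothing with the probe
]
print(is_unique(probe, background))      # False
print(is_unique(probe, background[1:]))  # True
```

A real design program would also have to consider melting temperature and self-complementarity (the "Tm" and "Palindrome" settings on the slide).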
THE EXPERIMENT : Printing I
The microspotting is done by a robot called an “arrayer”
THE EXPERIMENT : Printing II
Microspotting
THE EXPERIMENT : Printing III
Oligo-spotting (Photolithography)
Summary
?
Microarray Analysis: Data Analysis
Definition of biological questions
Experimental design
Low level analysis:
Scanning and processing images
Calculation of expression values per probe set
Normalization across chips
High level analysis:
Statistical analysis of expression values
Clustering of expression values
Annotation of probe sets
Functional classification
Biological interpretation of data
Data Analysis: Processing of Images II
Addressing or gridding
Assigning coordinates to each of the spots
Segmentation
Classification of pixels either as foreground or as background
Intensity extraction (for each spot)
Foreground fluorescence intensity pairs (R, G)
Background intensities
Quality measures
Fluorescence Signal to Expression Level
GTTAAGCGTTCCGATGCTACTTACC PM
GTTAAGCGTTCCCATGCTACTTACC MM
Probes
Probes
mRNA reference sequence
= representative sequence
Consensus sequence
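The PM (perfect match) and MM (mismatch) probes shown above differ only at the central base of the 25-mer, where the MM probe carries the complementary base. A minimal sketch of that relationship, using the two sequences from the slide (the function name is a hypothetical helper, not part of any vendor software):

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def mismatch_probe(pm):
    """Return the MM probe: the PM sequence with its middle base complemented."""
    middle = len(pm) // 2  # index 12, i.e. the 13th base of a 25-mer
    return pm[:middle] + COMPLEMENT[pm[middle]] + pm[middle + 1:]

pm = "GTTAAGCGTTCCGATGCTACTTACC"
print(mismatch_probe(pm))  # GTTAAGCGTTCCCATGCTACTTACC
```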
Fluorescence Signal to Expression Level I
Example: Affymetrix
• ~30 % of MM signals > PM signal
• Probes of a given set mapped to different UniGene clusters
• Same probe mapped to different UniGene clusters
• Ca. 340 MM mapped to UniGene clusters
Computing Expression Values
• Microarray Suite (MAS 5.0):
$\text{signal} = \text{TukeyBiweight}\{\log(PM_j - MM^*_j)\}$
with $MM^*$ a version of MM that is never bigger than PM; the Tukey biweight is a type of robust estimator...
• Li and Wong model:
$PM_{ij} - MM_{ij} = \theta_i \phi_j + \varepsilon_{ij}$, $\varepsilon_{ij} \sim N(0, \sigma^2)$
$\theta_i$ is the gene expression in chip i, $\phi_j$ is the rate of increase of the PM response over MM (probe-specific effect)
• Robust multi-chip analysis (RMA):
$\log(PM_{ij} - BG) = a_i + b_j + \varepsilon_{ij}$
Uses only PM and ignores MM; assumes an additive model (on the log scale) and estimates chip effects $a_i$ and probe effects $b_j$ using a robust method (median polish)
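To make the median polish step concrete, here is a minimal sketch that fits the additive model (chip effect plus probe effect) to a small table of log-scale intensities for one probe set. It illustrates only the median polish idea; it is not the RMA implementation, which also involves background correction and normalization across chips. The table values are made-up toy numbers.

```python
from statistics import median

def median_polish(table, n_iter=10, tol=1e-9):
    """Fit table[i][j] ~ overall + row[i] + col[j] by alternately removing medians."""
    resid = [[float(v) for v in r] for r in table]
    n_rows, n_cols = len(resid), len(resid[0])
    overall, row, col = 0.0, [0.0] * n_rows, [0.0] * n_cols
    for _ in range(n_iter):
        # Sweep row medians out of the residuals into the row effects.
        row_med = [median(r) for r in resid]
        for i in range(n_rows):
            resid[i] = [v - row_med[i] for v in resid[i]]
            row[i] += row_med[i]
        shift = median(col)
        col = [c - shift for c in col]
        overall += shift
        # Sweep column medians out of the residuals into the column effects.
        col_med = [median(resid[i][j] for i in range(n_rows)) for j in range(n_cols)]
        for i in range(n_rows):
            resid[i] = [v - col_med[j] for j, v in enumerate(resid[i])]
        col = [c + m for c, m in zip(col, col_med)]
        shift = median(row)
        row = [r - shift for r in row]
        overall += shift
        if max(map(abs, row_med)) < tol and max(map(abs, col_med)) < tol:
            break
    return overall, row, col, resid

# Rows = chips i, columns = probes j (toy log-scale values).
chips_by_probes = [[7.0, 8.0, 8.5, 10.0],
                   [8.0, 9.0, 9.5, 11.0],
                   [9.0, 10.0, 10.5, 12.0]]
overall, chip_effect, probe_effect, resid = median_polish(chips_by_probes)
expression = [overall + a for a in chip_effect]  # one summary value per chip
print(expression)
```

Because medians are used instead of means, a single outlying probe has little influence on the per-chip summary.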
MAS 5 vs. RMA: A Values
MAS 5 vs. RMA: M vs. A Plot
RMA
MAS 5
Data Analysis: Transformation (Coding)
Log2 transformation
No transformation
Effect of different scheme of data transformation on the total distribution of expression
values. Data: Alon et al. PNAS 1999
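Ratios, untransformed: a 2-fold decrease (0.5) and a 2-fold increase (2) lie at different distances from 1 (0.5 and 1).
Log2 transformed: they lie at the same distance from 0 (-1 and 1).
$2^x = y$; $\log_2(y) = x$
$2^2 = 4$; $\log_2(4) = 2$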
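The symmetry gained by the log2 transformation can be checked in a few lines. The three ratios are illustrative values, not data from the course:

```python
import math

ratios = [0.5, 1.0, 2.0]  # 2-fold down, unchanged, 2-fold up
log_ratios = [math.log2(r) for r in ratios]
print(log_ratios)  # [-1.0, 0.0, 1.0]
```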
Data Analysis: Normalization I
Tentative separation of systematic sources of variation ("artefacts") that bias
the results from random sources of variation ("noise") that hide the truth.
Typical Statistical Approach:
Measured value = real value + systematic errors + noise
Corrected value = real value + noise
Analysis of corrected value => (unbiased) CONCLUSIONS
Examples of systematic errors:
Fluorochrome incorporation
Spatial bias
Data Analysis: Normalization II
Self-self hybridization: Non-normalized data
No Self-self hybridization: Non-normalized data
Scatter (MVA-)plots
Normalization: global
• Normalization based on a global adjustment:
$\log_2(R/G) \rightarrow \log_2(R/G) - c = \log_2(R/(kG))$
• Common choices for k or $c = \log_2 k$: c = median or mean of the log ratios for a particular gene set (e.g. all genes, or control or housekeeping genes)
• Another possibility is total intensity normalization, where $k = \sum_i R_i / \sum_i G_i$
Figure: median centering normalization (Ratio vs. Log2 Ratio)
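A minimal sketch of the two global choices described above, median centering of the log ratios and the total intensity factor k. The red and green intensities are hypothetical toy values, and the function names are illustrative:

```python
import math
from statistics import median

def median_center(log_ratios):
    """Subtract c = median(log ratios) so that the median log ratio becomes 0."""
    c = median(log_ratios)
    return [m - c for m in log_ratios]

def total_intensity_k(red, green):
    """k = sum(R_i) / sum(G_i), the total intensity normalization factor."""
    return sum(red) / sum(green)

# Hypothetical spot intensities for the two channels.
red = [200.0, 400.0, 800.0, 1600.0]
green = [100.0, 200.0, 400.0, 400.0]

log_ratios = [math.log2(r / g) for r, g in zip(red, green)]
normalized = median_center(log_ratios)
k = total_intensity_k(red, green)
print(normalized)  # [0.0, 0.0, 0.0, 1.0]
```

Here most spots share the same 2-fold dye bias, so median centering moves them to 0 and leaves only the fourth spot standing out.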
Data Analysis: Normalization III
Methods:
Median center: MEDIAN log2(CY3/CY5) = 0
Linear transformation (plots of CY5 vs. CY3)
Why is this not satisfactory? More noise with low-expressed genes
Data Analysis: Use of M vs A Plot
1. Logs stretch out the region we are most interested in.
2. Features of the data such as intensity-dependent variation and dye bias can be seen more clearly.
3. Differentially expressed genes are more easily identified.
4. Intuitive interpretation
$M = \log(R/G) = \log R - \log G$
$A = (\log R + \log G)/2$
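The M and A values defined above can be computed per spot as follows. The sketch assumes base-2 logarithms, in line with the log2 ratios used elsewhere in the course; the intensities are toy values:

```python
import math

def ma_values(red, green):
    """Return (M, A) for one spot: log ratio and mean log intensity."""
    m = math.log2(red) - math.log2(green)
    a = (math.log2(red) + math.log2(green)) / 2
    return m, a

print(ma_values(1024.0, 256.0))  # (2.0, 9.0)
```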
Data Analysis: Normalization IV
Figure: M vs. A plots (with magnification) before and after loess correction
Data Analysis: Normalization IV
Figure: M vs. A plots per array and per sub-array, showing regional variation (spatial bias)
Data Analysis: Normalization V
Use of spikes
Before normalization
After normalization
Data Analysis: Low Level Analysis
Summary:
Chip has been built!
Signals have been measured!
Systematic errors have been removed!
Data Analysis: Limitations
Problems in data analysis
Limitations of traditional biological interpretations:
Complexity (10 000 genes)
How to distinguish a true positive result from a false positive?
Methods:
1. Supervised learning: k-Nearest neighbor, LDA
2. Non-supervised learning: Clustering
Data Analysis: Clustering
Objectives
Gene discovery/Class identification (unsupervised learning):
Looking for characterization of the components of the data set, without a priori input on cases or genes
Labels are not used
Methods: Hierarchical clustering/Dendrograms, K-means clustering, Self organizing maps (SOM)
Sample/Gene classification (supervised learning):
Finding genes, combinations of genes or samples that match a particular a priori pattern
Labels are used
Methods: LDA, k-NN classification
Supervised vs. Unsupervised Learning: Examples
Example 1: Identification of genes that are responsible for the fact that some patients respond differently to a certain type of chemotherapy
Example 2: Identification of genes or group of genes that explain the difference between tumor tissue and non-tumor tissue based on the expression profile of ~100 samples (60 tumor tissue / 40 healthy tissue)
Example 3: Identify a group of genes that are co-regulated upon a given treatment
Unsupervised Learning Problems
Similar objects are grouped together. This is clustering!
How do we measure similarity?
Figure axes: circularity of spots, length of neck
Agglomerative Hierarchical Clustering I
Before doing such clustering, one has to define two things:
1- The similarity measure between two genes (or experiments)
Correlation: Distance = 1 - R
Euclidean: Distance = $\sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$
Figure axes: Sample 1, Sample 2, Sample 3
2- The distance measure between the new cluster and the others
Single Linkage: Distance between closest pair.
Complete Linkage: Distance between farthest pair.
Average Linkage: Distance between cluster centers
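The two ingredients above, a distance between profiles and a distance between clusters, can be sketched as plain functions. This is an illustrative sketch with hypothetical names and toy data; only single and complete linkage are shown.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def correlation_distance(u, v):
    """1 - Pearson correlation R (undefined for constant profiles)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return 1 - cov / (su * sv)

def single_linkage(c1, c2, dist):
    """Distance between the closest pair of members."""
    return min(dist(u, v) for u in c1 for v in c2)

def complete_linkage(c1, c2, dist):
    """Distance between the farthest pair of members."""
    return max(dist(u, v) for u in c1 for v in c2)

cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(4.0, 0.0), (6.0, 0.0)]
print(single_linkage(cluster_a, cluster_b, euclidean))    # 3.0
print(complete_linkage(cluster_a, cluster_b, euclidean))  # 6.0
```

Note that two profiles with the same shape but different magnitude, such as (1, 2, 3) and (2, 4, 6), have a correlation distance of about 0 but a non-zero Euclidean distance.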
Agglomerative Hierarchical Clustering II
Figure: data points 1 to 5 plotted by Gene 1 vs. Gene 2, and the corresponding dendrogram (vertical axis: distance between joined clusters)
The dendrogram induces a linear ordering of the data points
Clustering: Defining Clusters
Unsupervised Clustering: Example
Sorlie et al. Proc Natl Acad Sci U S A 2001 Sep 11;98(19):10869-74
Supervised Methods: Learning Problems
Which criteria should we use?
Supervised Methods: Examples
k-Nearest Neighbor (knn)
PCA
LDA
Figure: samples from the data matrix plotted by Gene 1 vs. Gene 2; "?" marks a sample of unknown class
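As an illustration of the k-nearest neighbor idea, the sketch below classifies a new sample by majority vote among its k closest training samples in the Gene 1 / Gene 2 plane. The expression values and labels are hypothetical, and Euclidean distance is an assumed choice.

```python
import math
from collections import Counter

def knn_predict(train, labels, query, k=3):
    """Predict the label of query by majority vote among its k nearest neighbors."""
    order = sorted(range(len(train)), key=lambda i: math.dist(train[i], query))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical (Gene 1, Gene 2) expression values for six labeled samples.
train = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
         (3.0, 3.0), (3.2, 2.9), (2.8, 3.1)]
labels = ["normal", "normal", "normal", "tumor", "tumor", "tumor"]

print(knn_predict(train, labels, (2.9, 3.0)))  # tumor
print(knn_predict(train, labels, (1.0, 0.9)))  # normal
```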
Supervised Learning: Problems
Supervised Methods: Cross-Validation
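Figure: the microarray data matrix (genes × tissues) and its labels are split into a training set and a test set. The training subset and its labels are used to build an LDA predictor; the labels predicted for the test subset are then compared with the test set labels (evaluation).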
Supervised Methods: Experimental Design
The 4 subsets (Subset 1 to Subset 4) are used for cross-validation (Data set from Alon et al. 1999). In each round, one subset is the test set and the other three form the learning set, giving four trained models.
Characteristics:
Test set: 15 Tissues
Training set: 45 Tissues
Always same proportions of Normal / Cancer Tissues
All data once (and only once) in test set
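The design above (four subsets, constant class proportions, every sample in a test set exactly once) corresponds to a stratified 4-fold split. A minimal sketch, assuming 60 tissues split into 40 tumor and 20 normal; these class sizes are an assumption chosen only so that each test set has 15 tissues:

```python
def stratified_folds(labels, n_folds=4):
    """Assign sample indices to folds so that each fold keeps the class proportions."""
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        members = [i for i, lab in enumerate(labels) if lab == cls]
        for pos, idx in enumerate(members):
            folds[pos % n_folds].append(idx)
    return folds

labels = ["tumor"] * 40 + ["normal"] * 20  # assumed class sizes
folds = stratified_folds(labels)

for test_idx in folds:
    held_out = set(test_idx)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    print(len(train_idx), len(test_idx))  # 45 15
```

In practice the samples would be shuffled within each class before being assigned to folds.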
Supervised Methods: Student’s Test →LDA I
Group A
Group B
t - Statistics
For all genes: compute the t statistic
LDA is run with the most differentially expressed gene, then with the two most differentially expressed genes, and so on (cumulative)
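The gene selection step can be sketched as follows: compute a two-sample Student's t statistic (pooled variance) per gene, rank genes by its absolute value, and build the cumulative gene sets that would be passed to LDA. The LDA step itself is not shown, and the gene names and values are hypothetical.

```python
import math
from statistics import mean, variance

def t_statistic(a, b):
    """Two-sample Student's t statistic with pooled variance."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))

def rank_genes(group_a, group_b):
    """Return gene names sorted from most to least differentially expressed."""
    t = {g: t_statistic(group_a[g], group_b[g]) for g in group_a}
    return sorted(t, key=lambda g: abs(t[g]), reverse=True)

# Hypothetical expression values per gene in the two groups.
group_a = {"gene1": [1.0, 2.0, 3.0], "gene2": [1.0, 2.0, 3.0], "gene3": [1.0, 2.0, 3.0]}
group_b = {"gene1": [4.0, 5.0, 6.0], "gene2": [1.5, 2.5, 3.5], "gene3": [10.0, 11.0, 12.0]}

ranked = rank_genes(group_a, group_b)
cumulative_sets = [ranked[:n] for n in range(1, len(ranked) + 1)]
print(cumulative_sets)
```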
Supervised Methods: Student’s Test →LDA II
Figure: Effect of the Number of Genes Selected with a Student's t-Test on the LDA Performance. X axis: number of genes (cumulative), 0 to 40; Y axis: percent of correct predictions, 0 to 120; series: Test Set, Training Set; labeled point: (12, 89).
Summary: Part I
Microarray analysis allows the simultaneous detection of the expression of thousands
of genes in a small sample.
Microarray experiments include:
- Experimental design
- Making of the chip
- Preparation of samples, hybridization, detection of fluorescence signals
- Low level analysis:
- Transformation of fluorescence signal measurement into an
expression level values
- Normalization
- High level analysis
- Clustering, statistical analysis, functional classification
Part II: Practical Course
1. Exercise
In which steps of a typical microarray experiment may optimized computational methods contribute
to an improvement ?
2. Exercise
What features would you include in a probe design program?
3. Exercise
Which methods do you think the authors applied to answer their questions described in the
abstracts?
4. Exercise
What are the principal objectives of a supervised or unsupervised learning method, respectively?
5. Exercise
What do you think are the major limitations of microarrays?
6. Exercise
When would you rather use RMA or MAS5, respectively?
7. Exercise
Why is normalization crucial for the analysis of microarray data?
8. Exercise
How can you relate microarray data and phenotypes?
Part II: Abstract A
Novel genes and functional relationships in the adult mouse gastrointestinal tract identified by microarray analysis.
Bates MD, Erwin CR, Sanford LP, Wiginton D, Bezerra JA, Schatzman LC, Jegga AG, Ley-Ebert C, Williams SS,
Steinbrecher KA, Warner BW, Cohen MB, Aronow BJ.
Division of Gastroenterology, Children's Hospital Medical Center, University of Cincinnati College of Medicine, Cincinnati,
Ohio 45229, USA. michael.bates@chmcc.org
BACKGROUND & AIMS: A genome-level understanding of the molecular basis of segmental gene expression along the
anterior-posterior (A-P) axis of the mammalian gastrointestinal (GI) tract is lacking. We hypothesized that functional patterning
along the A-P axis of the GI tract could be defined at the molecular level by analyzing expression profiles of large numbers of
genes. METHODS: Incyte GEM1 microarrays containing 8638 complementary DNAs (cDNAs) were used to define expression
profiles in adult mouse stomach, duodenum, jejunum, ileum, cecum, proximal colon, and distal colon. Highly expressed
cDNAs were classified based on segmental expression patterns and protein function. RESULTS: 571 cDNAs were expressed
2-fold higher than reference in at least 1 GI tissue. Most of these genes displayed sharp segmental expression boundaries, the
majority of which were at anatomically defined locations. Boundaries were particularly striking for genes encoding proteins that
function in intermediary metabolism, transport, and cell-cell communication. Genes with distinctive expression profiles were
compared with mouse and human genomic sequence for promoter analysis and gene discovery. CONCLUSIONS: The
anatomically defined organs of the GI tract (stomach, small intestine, colon) can be distinguished based on a genome-level
analysis of gene expression profiles. However, distinctions between various regions of the small intestine and colon are much
less striking. We have identified novel genes not previously known to be expressed in the adult GI tract. Identification of genes
coordinately regulated along the A-P axis provides a basis for new insights and gene discovery relevant to GI development,
differentiation, function, and disease.
Gastroenterology 2002 May;122(5):1467-82
Part II: Abstract B
Molecular classification of cancer: class discovery and class prediction by gene expression monitoring.
Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA,
Bloomfield CD, Lander ES.
Whitehead Institute/Massachusetts Institute of Technology Center for Genome Research, Cambridge, MA 02139, USA.
golub@genome.wi.mit.edu
Although cancer classification has improved over the past 30 years, there has been no general approach for identifying new cancer
classes (class discovery) or for assigning tumors to known classes (class prediction). Here, a generic approach to cancer
classification based on gene expression monitoring by DNA microarrays is described and applied to human acute leukemias as a
test case. A class discovery procedure automatically discovered the distinction between acute myeloid leukemia (AML) and acute
lymphoblastic leukemia (ALL) without previous knowledge of these classes. An automatically derived class predictor was able to
determine the class of new leukemia cases. The results demonstrate the feasibility of cancer classification based solely on gene
expression monitoring and suggest a general strategy for discovering and predicting cancer classes for other types of cancer,
independent of previous biological knowledge.
Science 1999 Oct 15;286(5439):531-7
Functional Classification
Microarray Data Analysis Workflow: (1) existing data (repository) -> (2) generate data -> (3) collect & manage data (microarray data management systems) -> (4) analyze interesting sequences -> (5) deposit into repositories
Functional Classification
Typical questions to be answered with functional classification:
• Whether a gene has a known function, and if so, in what class?
• Whether genes found to cluster together have been described as being functionally similar or related (promoter motifs, transcription factors)
• Whether homologs or orthologs have been found to be functionally related in any known physiological or pathological state
• Whether the resultant genes are known to be associated with the experimental conditions tested.
Functional Classification
Grouping and Clustering
Transcriptional ‘Signatures’
Identification of common promoter elements and regulatory networks
GO: Gene Ontology (gene product description: biological process, cellular component, molecular function)
Chromosomal Location
Metabolic Pathway Assignment

Name                                 Gene ID
Cytochrome p450 subfamily 4          Cyp 4A
HMG CoA synthase                     HMG-CoA Syn
Apolipoprotein CIII                  Apo CIII
Stearoyl-coenzyme desaturase         SCOD-1
Carnitine palmitoyl transferase-1    CPT-1
Fatty acid binding protein           FABP
Phosphoenolpyruvate carboxykinase    PEPCK
Cluster determinant 36               CD36

Columns T1 and T2 mark individual genes with an x.
Chromosomal Location: Annotation
Affymetrix
Representative sequence
Consensus sequence
BLAT against assembly sequence from UCSC
Comparison with UG DB
NetAffx
UniGene
Ensembl DB
Probes
Tagger
Exact mapping to UG and RefSeq DB
Exact mapping to temp cDNA DB
SIB annotation
4 quality levels
EnsMart DB
Representative Sequence: Chosen during chip design as a sequence which is best associated with the
transcribed region being interrogated
BLAT threshold: Only records with match / Qsize >= 75% and score >= 0.70 are kept, where
score = (match - mismatch - gap# x 5 - gap_size x 2) / Qsize. If a record has several mapping locations with score >
0.70, the highest one is chosen; if a record has several mapping locations with the same highest score, all mapping
locations are kept.
EnsMart Approach: cDNA sequence plus an additional length of downstream sequence immediately following
the most 3' exon. The individual probe sequences are mapped, by exact matching. If more than 50 % of probes
mapped, then listed as hits.
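The BLAT threshold rule described above can be written down directly. The sketch below is an illustrative reading of that rule; the record layout (a dict per mapping location) and the locus names are hypothetical and do not reflect an actual BLAT output format.

```python
def blat_score(match, mismatch, gap_count, gap_size, qsize):
    """score = (match - mismatch - gap# x 5 - gap_size x 2) / Qsize"""
    return (match - mismatch - gap_count * 5 - gap_size * 2) / qsize

def best_locations(hits):
    """Keep the highest-scoring mapping location(s) that pass both thresholds."""
    kept = []
    for h in hits:
        score = blat_score(h["match"], h["mismatch"], h["gap_count"],
                           h["gap_size"], h["qsize"])
        if h["match"] / h["qsize"] >= 0.75 and score >= 0.70:
            kept.append((score, h["locus"]))
    if not kept:
        return []
    top = max(score for score, _ in kept)
    return [locus for score, locus in kept if score == top]  # ties are all kept

# Hypothetical mapping locations for one sequence of length 100.
hits = [
    {"locus": "locus_A", "match": 95, "mismatch": 5, "gap_count": 0, "gap_size": 0, "qsize": 100},
    {"locus": "locus_B", "match": 95, "mismatch": 5, "gap_count": 0, "gap_size": 0, "qsize": 100},
    {"locus": "locus_C", "match": 80, "mismatch": 5, "gap_count": 1, "gap_size": 2, "qsize": 100},
    {"locus": "locus_D", "match": 90, "mismatch": 2, "gap_count": 0, "gap_size": 0, "qsize": 100},
]
print(best_locations(hits))  # ['locus_A', 'locus_B']
```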
Comparison of Various Annotations
Figure: Venn diagrams comparing the probe-set annotations from NetAffx, EnsMart and Tagger, with counts for chips A and B, for Mouse MOE A and B and for Human U133 A and B
Quality of Probe Sets
Mapped on: RefSeqs
Chip       High    Medium  Low     Undefined
HG-133A    13792   1663    1103    5657
HG-133B    3795    790     519     17473
Mu74v2A    5340    1283    1697    4102
Mu74v2B    2587    969     1190    7665
Mu74v2C    756     302     982     9828
MOE-A      12683   2395    1194    6354
MOE-B      2453    620     592     18846

Mapped on: RefSeqs, mRNAs, ESTs, HTCs
Chip       High    Medium  Low     Undefined
HG-133A    15703   1196    3983    1333
HG-133B    10096   2026    3125    7330
Mu74v2A    8015    615     2127    1665
Mu74v2B    7010    1421    2306    1674
Mu74v2C    2600    780     2555    5933
MOE-A      18070   1222    2383    951
MOE-B      11602   2376    2478    6055
Distribution: UGs per Probe Set
Figure: log-log plot of Number of Probe Sets (1 to 100000) against Number of UniGenes (1 to 100); series: EnsMart A, EnsMart B, Tagger A, Tagger B, NetAffx A, NetAffx B
Functional Classification
Gene Ontology Project
GO output for ABCB1: GO terms at levels L2 to L4 in the three ontologies (Cellular Component, Molecular Function, Biological Process)
Two pragmatic purposes of ontology:
1. Facilitate communication between people
and organizations
2. Improve interoperability between systems
Ontologies are structured vocabularies in the form
of directed acyclic graphs (DAGs) that represent a
network in which each term may be a “child” of one or
more than one ”parent”.
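A directed acyclic graph in which a term may have more than one parent can be represented as a mapping from each term to its list of parents. The sketch below collects all ancestors of a term; the term names are invented placeholders, not real GO identifiers.

```python
# Hypothetical terms; each term maps to its direct parents.
PARENTS = {
    "root_function": [],
    "root_process": [],
    "term_A": ["root_function"],
    "term_B": ["root_process"],
    "term_C": ["term_A", "term_B"],  # a child with two parents
}

def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upward."""
    found = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents[current])
    return found

print(sorted(ancestors("term_C")))
# ['root_function', 'root_process', 'term_A', 'term_B']
```

A gene annotated with a term is implicitly annotated with all of that term's ancestors, which is why the traversal has to follow every parent rather than a single path.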
Distribution: Probe Sets per UG
Figure: log-log plot of Number of UniGenes (1 to 100000) against Number of Probe Sets (1 to 100); series: U133A, U133B, U133AB, U74Av2, U74Bv2, U74Cv2, U74ABCv2, U74ABCv3_NA, MOE430A, MOE430B, MOE430AB
Functional Classification II
MAPPFinder – GenMAPP
Doniger et al. Genome Biology 2003
http://www.genmapp.org/
GenMAPP:
Gene Microarray Pathway Profiler
KEGG: Kyoto Encyclopedia of Genes and Genomes
The 3 main goals of the KEGG project:
1. Computerizing the current knowledge of genetics, biochemistry, and molecular and cellular biology in terms of the pathways of interacting molecules or genes
2. Collection of gene catalogs for all organisms with completely sequenced genomes and selected organisms with partial genomes (consistent annotation)
3. Catalog of chemical elements, compounds and other substances in living cells
Summary of KEGG release 8.0
Kanehisa et al. 2002, Nucleic Acids Research, Ogata et al. 1999, Nucleic Acids Research; http://www.genome.ad.jp/kegg/
Functional Classification II
Signaling Pathways
Similar to other nuclear hormone receptors, PPAR acts as a ligand-activated transcription factor. Upon binding fatty acids or hypolipidemic drugs, PPARα
interacts with RXR and regulates the expression of target genes. These genes are involved in the catabolism of fatty acids. Conversely, PPARγ is activated by
prostaglandins, leukotrienes and anti-diabetic thiazolidinediones and affects the expression of genes involved in the storage of the fatty acids. PPARβ is only
weakly activated by fatty acids, prostaglandins and leukotrienes and has no known physiologically relevant ligand. However, data from PPARβ null mice suggest
PPARβ does serve a role in fatty acid metabolism and perhaps in skin proliferation and cancer.
Genetic Network Models: Goals
• Must incorporate rule-based dependencies between genes
  - Rule-based dependencies may constitute important biological information.
• Must allow systematic study of global network dynamics
  - In particular, individual gene effects on long-run network behavior.
• Must be able to cope with uncertainty
  - Small sample size, noisy measurements, robustness
• Must permit quantification of the relative influence and sensitivity of genes in their interactions with other genes
  - This allows us to focus on individual (groups of) genes.
Microarray and Data Repositories
Name
Archival
Treatment
Visualization
Acuity
dual-color cDNA/oligo
dual-color cDNA/oligo
ArrayDB
dual-color cDNA/oligo
dual-color cDNA/oligo
dual-color cDNA/oligo. Dendrograms, 2-D interactive
plots, animated interactive 3-D plots, line graphs, scatter
plots.
dual-color cDNA/oligo
ArrayInformatics
dual-color cDNA/oligo
dual-color cDNA/oligo
dual-color cDNA/oligo, Affymetrix. Scatter, line and
series plots and a cluster image map. Does not support
XML yet.
BASE
dual-color cDNA/oligo,
Affymetrix, SAGE
Affymetrix
dual-color cDNA/oligo, Affymetrix, SAGE
Expressionist
dual-color cDNA/oligo,
Affymetrix, SAGE
Affymetrix
Affymetrix, dual-color cDNA/oligo
Normalization to LOWESS, total intensity, median
ratio or to a user generated gene list, graphing data
trends after normalization enabling examination of
data variability.
global mean or median ratio based normalization,
Lowess, MDS module
standard data processing and clustering
GeneDirector
dual-color cDNA/oligo
dual-color cDNA/oligo
dual-color cDNA/oligo, Affymetrix
ImaGene and GeneSight packages
GeNet
dual-color cDNA/oligo,
Affymetrix
filters, dual-color
cDNA/oligo, Affymetrix,
dual-color cDNA/oligo,
Affymetrix
filters, dual-color
cDNA/oligo, Affymetrix,
dual-color cDNA/oligo, Affymetrix
GeneSpring package
filters, dual-color cDNA/oligo, Affymetrix,
GeneX
dual-color cDNA/oligo,
Affymetrix
dual-color cDNA/oligo
dual-color cDNA/oligo, Affymetrix
Global normalization, z-score, Lowess
normalization, full and sub-grid, for Affymetrix,
alternative probe based protocol
R routines are available to manipulate the data
(normalization, clustering, etc.)
maxdSQL
dual-color cDNA/oligo,
Affymetrix
dual-color cDNA/oligo,
Affymetrix
Filtering based on numerical values. 2-D
correlation plot with overlay of cluster data,
multidimensional plots.
NOMAD
dual-color cDNA/oligo,
Axon scanner outcome
dual-color cDNA/oligo,
Axon scanner outcome
dual-color cDNA/oligo, Affymetrix, maxdView,
expression data class which represents results from one
or more hybridizations and any associated clusters of
genes. Profiles viewers.
dual-color cDNA/oligo, Axon scanner outcome
PartisanarrayLIMS
filters, dual-color
cDNA/oligo, Affymetrix,
Affymetrix, Nylon filters,
dual-color cDNA/oligo
filters, dual-color
cDNA/oligo, Affymetrix,
Affymetrix, Nylon filters,
dual-color cDNA/oligo
filters, dual-color cDNA/oligo, Affymetrix,
global mean or median ratio based normalization
Affymetrix, Nylon filters. Table Viewer: K-means, Kmedians clustering, and SOM algorithms.
dual-color cDNA/oligo
dual-color cDNA/oligo
dual-color cDNA/oligo
Error models with any experimental replicates
performed, P-values computed and error bars for
every gene expression measurement, ANOVA.
ScanAlyse package: global normalization
GeneTraffic(Multi)
Resolver
SMD
Data normalization protocols and data analyses
modules
global normalization, normalization on control
spots, spike controls, or subset of spots.
Hierarchical clustering, K-means, PCA, SOM.
global mean or median ratio based normalization
ScanAlyse package: global normalization
Microarray and Data Repositories
Name
GEO
RAD
ExpressDB
CleanEx
Gene
Expression
Database
SMD
Data Type
Microarray/
SAGE
Microarray/
SAGE
Tissue Type
Normal and
tumor
Normal and
tumor
Microarray/
SAGE
Microarray/
EST
libraries
Microarray
Yeast
Description
Gene expression and hybridization array data
repository
The ultimate goal is to allow comparative analysis of
experiments performed by different laboratories using
different platforms and investigating different
biological systems.
Collection of yeast RNA expression datasets
Normal and
tumor
Gene expression and hybridization array data
repository. SAGE will be added.
Tumor
Data from 60 cancer cell lines based on Affymetrix
and cDNA technology
http://discover.nci.nih.gov/arraytools
Microarray
Normal and
tumor
Normal and
tumor
Extensive collection of cDNA microarray data
http://genome-www.stanford.edu/microarray
http://www.ncbi.nlm.nih.gov/SAGE/
SAGEmap
SAGE
SAGE
SAGE
UniGene
EST
libraries
EST
libraries
EST
libraries
EST
libraries
CGAP/Tissue
BodyMap
TissueInfo
Normal and
tumor
Normal and
tumor
Normal and
tumor
Normal and
tumor
Normal
Data from one hundred SAGE (Serial Analysis of
Gene Expression) CGAP (Cancer Genome Anatomy
Project) libraries
SAGE data from over 600,000 transcripts including
SAGE data from human, mouse and yeast transcripts.
Collection of EST libraries from different species
Web address
http://www.ncbi.nlm.nih.gov/geo/
http://www.cbil.upenn.edu/RAD2/
http://arep.med.harvard.edu/cgi-bin/ExpressDByeast/EXDStart
http://www.epd.isb-sib.ch/cleanex/
http://www.sagenet.org/SAGEData/
sagedata.htm
http://www.ncbi.nlm.nih.gov/UniGene/
Information on CGAP and other cDNA libraries.
http://cgap.nci.nih.gov/Tissues/xProfiler
Database of expression information of human and
mouse genes in various tissues and cell types.
Information on tissue expression profile of a sequence
by comparing the given sequence against the EST
database. Each EST comes from a library derived
from a specific tissue type
http://bodymap.ims.u-tokyo.ac.jp
http://icb.mssm.edu/services/tissueinfo/query
Web Resources : General Information
Leung’s
Links page & software info
Davison’s
DNA Microarray Methodology - Flash Animation
gene-chips
Overview of the technique, papers…
Chips & microassays
General information
SMD guide
Stanford's links page, very complete
Introduction
Online introduction to microarrays (EBI)
Brown Lab Guide
Microarrays protocols and arrayer construction.
Web Resources : Data Analysis Tools
Expression Profiler
Online clustering and analysis tools (EBI)
GenEx
Database, repository and analysis tools (NCGR)
MAExplorer
MicroArray Explorer for data mining Gene
Expression, free download
ArrayDB
Downloadable tools, short online demo
MAXD
Downloadable data warehouse and visualisation
for expression data
Jexpress
Java tools for gene expression data analysis, free
download
Eisen Lab
Michael Eisen's suite for image quantitation and
data analysis (Scanalyze, Cluster, TreeView).
Downloadable.
Web Resources : Public Databases I
SMD
The Stanford Microarray Database
Chip DB
Searchable database on gene expression (MIT)
ExpressDB
Public queries of E. coli and yeast data
GEO
Gene expression data repository and online resource (NCBI)
RAD
RNA Abundance Database
Expression
Connection
Saccharomyces Genome Database expression data retrieval
EpoDB
Expression information retrieval for one gene at a time
yMGV
Public queries of yeast data
Web Resources : Public Databases II
AMAD
Downloadable web driven database system
ArrayExpress
Public data deposition and public queries (EBI)
maxdSQL
Downloadable data warehouse and visualization environment
GXD
Mouse expression data storage and integration
GeNet
Distribution and visualization of gene expression data from any
organism
Web Resources : Public Databases III
Drosophila microarray project Drosophila Metamorphosis Time Course Database
Samson Lab
Yeast Transcriptional Profiling Experiments
SageMap
NCBI SAGE data and analysis tools
NCI60 cancer project
Supplement to Ross et al. (Nat Genet., 2000).
Serum-response
Supplement to Iyer et al. (1999) Science 283:83-87
Breast cancer
Supplement to Perou et al. Nature 406:747-752(2000)
Cancer Molecular
Pharmacology
Integration of large databases on gene expression and
molecular pharmacology.
References
Interesting Books
Kohane et al., Microarrays for an integrative genomics, 2003 MIT
Baldi and Hatfield, DNA Microarrays and gene expression, 2002 Cambridge University Press
Jagota, Microarray data analysis and visualization, 2001 Bay Press