Query-driven search methods
for large microarray databases
Matt Hibbs
Troyanskaya Laboratory for
BioInformatics and Functional Genomics
Broad Goals/Challenges
• Characterize the function of proteins
• Learn the mechanisms of gene expression
and regulation under many conditions
– Growing amounts of data facilitate this goal
• Noise, heterogeneity, and biases in available
data must be addressed
Specific Goals
• Large collection of S. cerevisiae microarray
data
– From > 80 publications
– Totaling ~2400 conditions
– Divided into ~130 “datasets”
• How can such a large amount of data be
leveraged?
– What can we learn? Or not learn?
– Accessibility, usefulness to community
Outline
• Microarray methodology
• Analysis concerns
• Functional Biases
• Improved Approaches
• Preliminary Conclusions
Central Dogma
• Transcription factors (TFs) recruit or repress polymerase
• Transcription
– DNA → mRNA
• Translation
– mRNA → Proteins
• Proteins do work
[Figure labels: TF, DNA, Polymerase, mRNA, Ribosome, Proteins]
Molecular Measurements
• Measurements of protein abundance in a
variety of conditions can suggest function
– Difficult to measure accurately in a large-scale
manner
• One step removed: measure abundance of mRNA
transcripts as a proxy
– Much easier to measure on a large scale
– Several competing technologies reaching
maturity
Basic Microarray Methodology
Step 1: Prepare cDNA spots
Step 2: Add mRNA to slide for hybridization
– Reference mRNA: add green dye
– Test mRNA: add red dye
– Hybridize
Step 3: Scan hybridized array
Microarray Outputs
Measure amounts of green and red
dye on each spot
Represent level of expression as a
log ratio between these amounts
Raw image from Spellman et al., 1998
Microarray Outputs
• Log ratios in data matrix
• Missing values present
• Potentially high levels of noise
[Figure: data matrix with axes labeled Experiments and Genes]
Additional Technology
• Two-color (homemade, Agilent)
– Process just described, with 2 labeled samples
undergoing competitive hybridization
• Single-color (Affymetrix)
– Highly calibrated hybridization spots
– Match and Mis-match spots for each oligo
• Other techniques/tricks
– Randomized layouts, barcode arrays, tiling
arrays, etc.
Noise Sources
• Transcriptional noise
– mRNA transcripts not a direct reflection of protein
levels
– Process of isolating mRNA can stress cells
• Especially true of older protocols/data
• Chemical noise
– Fluorescent labels sensitive to environment
• Operator noise
– High variation between scientists running the
same experiment
Missing Values
• Several choices:
– Ignore missing values
– Remove genes with missing values
– Impute missing values
• KNN-Impute
– Replace missing values with a weighted average
of the K-nearest neighbors
– Used for analysis presented later
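The weighted-average idea behind KNN-Impute can be sketched as below. This is a minimal illustration, not the published implementation: the function name `knn_impute`, the root-mean-square distance over shared conditions, and the inverse-distance weights are all assumptions made for the example.

```python
import math

def knn_impute(matrix, k=2):
    """Fill None entries using the k nearest rows (genes) that have a value.

    matrix is a list of rows (genes); each row is a list of log ratios
    with None marking a missing value. Returns a filled copy.
    """
    def distance(a, b):
        # Root-mean-square difference over conditions observed in both rows.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [row[:] for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbors: other genes with an observed value in column j.
            candidates = [
                (distance(row, other), other[j])
                for n, other in enumerate(matrix)
                if n != i and other[j] is not None
            ]
            neighbors = sorted(c for c in candidates if math.isfinite(c[0]))[:k]
            if not neighbors:
                continue  # nothing to impute from; leave the gap
            # Assumed weighting: inverse distance (epsilon avoids division by zero).
            weights = [1.0 / (d + 1e-9) for d, _ in neighbors]
            total = sum(w * v for w, (_, v) in zip(weights, neighbors))
            filled[i][j] = total / sum(weights)
    return filled
```

For example, `knn_impute([[1.0, 2.0, 3.0], [1.0, 2.0, None], [5.0, 6.0, 7.0]], k=1)` fills the gap from the first row, which matches the second row on every shared condition.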
Normalization
• “Bright” arrays
– Whole arrays often normalized by average
intensity
• Two-color
– Choice of reference population can affect
measurements
– Avoid divide by zero errors
• Affymetrix
– Convert hybridization values to log ratios
• Divide by average value
• Log transform
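The single-channel conversion above (divide by the average, then log transform) might look like the following sketch. It assumes the average is taken per gene across arrays and uses base-2 logs; the `floor` parameter is a hypothetical guard for the divide-by-zero concern, not something specified in the slides.

```python
import math

def to_log_ratios(intensities, floor=1.0):
    """Convert one gene's single-channel intensities to log2 ratios vs. its mean."""
    # Clamp to a small positive floor so that neither the division nor the
    # logarithm sees a zero or negative value (floor is an assumed parameter).
    clamped = [max(x, floor) for x in intensities]
    mean = sum(clamped) / len(clamped)
    return [math.log2(x / mean) for x in clamped]
```

With intensities `[100, 200, 400, 100]` the mean is 200, so the ratios are 0.5, 1, 2, 0.5 and the log2 ratios are -1, 0, 1, -1.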
Clustering Analysis
• Distance metrics
– Euclidean
– Pearson
– Spearman
–…
• Algorithms
– Hierarchical
– K-means
– SOM
–…
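To make the difference between the listed metrics concrete, here is a small standard-library sketch of the three named measures. The function names are chosen for this example; Spearman is computed as Pearson on average ranks.

```python
import math

def euclidean(a, b):
    """Straight-line distance; sensitive to the magnitude of the profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson(a, b):
    """Linear correlation; ignores shifts and scaling of either profile."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            result[order[t]] = average_rank
        i = j + 1
    return result

def spearman(a, b):
    """Rank correlation; only the ordering of values matters."""
    return pearson(ranks(a), ranks(b))
```

Two profiles such as `[1, 2, 3, 4]` and `[2, 4, 6, 8]` are perfectly correlated under Pearson yet have a nonzero Euclidean distance, which is one reason the choice of metric changes clustering results.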
Megaclustering
• Combining data from multiple sources can
cause problems
– Normalization differences
– Technology differences
– Noise biases
• Requires unified pre-processing and smart
application of statistics
Apples to Apples
• Pearson correlation distributions are not always normal
– Large dependence on number of conditions
[Figure: histograms of Pearson correlation coefficients for a 6-condition dataset and a 40-condition dataset]
Apples to Apples
• Fisher’s Z-transform normalizes the distributions
– Z = ln[(1+r)/(1-r)] / 2, where r = Pearson corr. coeff.
[Figure: histograms of Z-scores for a 6-condition dataset and a 40-condition dataset]
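A short sketch of Fisher's Z-transform, Z = ln[(1+r)/(1-r)] / 2, follows. The clipping parameter `eps` and the helper `standardized_z` are assumptions added for the example; the standard error 1/sqrt(n-3) is the usual approximation for roughly bivariate-normal data and is used here only to illustrate why the number of conditions matters.

```python
import math

def fisher_z(r, eps=1e-6):
    """Fisher's Z-transform of a Pearson correlation coefficient."""
    # Clip r away from +/-1 so the logarithm stays finite (eps is assumed).
    r = max(min(r, 1 - eps), -1 + eps)
    return 0.5 * math.log((1 + r) / (1 - r))

def standardized_z(r, n_conditions):
    """Z divided by its approximate standard error, 1 / sqrt(n - 3)."""
    return fisher_z(r) * math.sqrt(n_conditions - 3)

# The same correlation is far weaker evidence in a small dataset:
small = standardized_z(0.9, 6)    # 6-condition dataset
large = standardized_z(0.9, 40)   # 40-condition dataset
```

Here `large` comes out several times bigger than `small`, which is the "apples to apples" problem the transform is meant to expose and correct.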
Evaluation Measurements
• Gene Ontology (GO)
– Hierarchical organization of biological processes,
molecular functions, and cellular components
– Cross-organism structure, organism-specific
annotations
– Closest available approximation of a “gold
standard”
• True Positives and False Positives can be
defined from the ontology
– Node size, depth, expert voting used for cutoffs
Precision / Recall
• Calculate and sort distances between all pairs of
genes
• Determine a cutoff, all pairs below cutoff are
predicted “true,” above “false”
• Given these predictions, can calculate precision
and recall
– Precision = TP / (TP + FP)
– Recall = TP / TotalPositives
• Slide the cutoff from smallest to largest distance
to create a curve of precision / recall pairs
– Ramp down from few, high confidence predictions to
many, low confidence predictions
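The sliding-cutoff procedure can be sketched as follows. The input format, a list of `(distance, is_true_positive)` tuples, is an assumption made for this example; ties in distance are not treated specially.

```python
def precision_recall_curve(pairs):
    """Return (precision, recall) points as the distance cutoff grows.

    pairs: list of (distance, is_true_positive) tuples, one per gene pair.
    """
    total_positives = sum(1 for _, label in pairs if label)
    if total_positives == 0:
        raise ValueError("need at least one positive pair")
    curve = []
    tp = fp = 0
    # Walk from the smallest distance (most confident) to the largest.
    for _, label in sorted(pairs, key=lambda p: p[0]):
        if label:
            tp += 1
        else:
            fp += 1
        curve.append((tp / (tp + fp), tp / total_positives))
    return curve

example = [(0.1, True), (0.2, True), (0.3, False), (0.4, True)]
points = precision_recall_curve(example)
```

Each point corresponds to one cutoff position, so the curve starts with few, high-confidence predictions and ends with every pair predicted "true" at a recall of 1.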
Example
Precision/Recall of various data types
Functional Biases
• Microarray experiments often targeted at a
particular process, pathway, or function
• However, several “global” signals are often
present
– Ribosomal response
– General Stress Response
• Some datasets do contain more targeted
“local” signals as well
Ribosome Bias
Precision/Recall of various data types
Ribosome Bias
Precision/Recall excluding Ribosome Biogenesis
Process-specific P/R
• Can generate PR-curves on a per-GO term basis
– TPs are pairs of genes both annotated to the term
– FPs are pairs with one gene in the term, whose
smallest common ancestor is a very large term
– Normalize by size of GO term
• Results for individual data sets can expose
functional biases
Per-dataset Biases
Typical Results
Per-dataset Biases
Poor Results
Per-dataset Biases
Diverse Results
Z-test for significance
• Difference between pair-wise distances for all
genes in a term vs. background
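One way such a test could be set up is sketched below. The exact statistic used in the talk is not given, so this is an assumed form: a one-sided Z-test asking whether the mean pairwise distance within a term is smaller than the background mean. It also treats pairwise distances as independent, which is a simplification.

```python
import math
import statistics

def term_z_test(term_distances, background_distances):
    """Z-score and one-sided P-value for 'term pairs are closer than background'."""
    n = len(term_distances)
    mu = statistics.mean(background_distances)
    sigma = statistics.pstdev(background_distances)
    # Standard error of the term mean under the background distribution.
    z = (statistics.mean(term_distances) - mu) / (sigma / math.sqrt(n))
    # Lower-tail normal probability: small when term distances are small.
    p = 0.5 * math.erfc(-z / math.sqrt(2))
    return z, p
```

A strongly negative Z (and a tiny P-value) would indicate that genes in the term are much closer to each other in that dataset than gene pairs in general.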
A Global View
Z-test P-values
[Figure: columns are datasets, rows are GO terms; red at a cutoff of 10^-10]
A Local View
[Figures omitted]
Bi-clustering
• Traditional clustering will be driven by “global”
signals and ignore “local” signals
• Bi-clustering identifies groups of genes and
conditions rather than just genes
[Figure panels: Traditional clustering, Bi-clustering]
Bi-clustering goals/issues
• Better capture biological reality
– Genes only cooperate in certain conditions
– Genes can have multiple functions
– Datasets have functional biases
• Computationally difficult problem
– Reducible to bi-clique finding
• NP-complete
• Heuristics, simplifications, approximations
– e.g. -biclusters, SAMBA, PISA
Bi-clustering goals/issues
• Microarray noise can lead to spurious output
– As compendia grow, more patterns appear
by chance
– Datasets have “smallest logical groupings”
• Restrict co-expression to these groups
• Long running times + large result sets
– Difficult to validate results
– Scientifically frustrating
Query-driven approach
• Allow users to specify a starting point for
search
– Leverages expert knowledge of domain
– Known to be useful in other contexts
• bioPIXIE
• Identify conditions/datasets of interest based
on the set of query genes
• Expand query set to include additional related
genes in these conditions
Query-driven approach
• Reduces problem complexity to allow for real-time results
• Fast results allow for user-driven refinement
of search criteria
• Extensible to larger data compendiums and
more complex organisms
– Locality sensitive hashing
– Pre-processing
Query Weighting
• Identify data conditions in which the query genes are related
– Average correlation, distance, etc.
– Signal to Noise ratio of query
– Centroid significance
• Additional genes related to query
– Correlation, distance, etc. weighted by identified
condition sets
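A minimal sketch of one possible weighting scheme follows: each dataset is weighted by the average pairwise correlation among the query genes, and every other gene is scored by the weighted average of its correlation to the query. The data layout (dataset name → gene → expression profile), the clipping of negative weights to zero, and the function names are assumptions for illustration, not the system described in the talk.

```python
import math
from itertools import combinations

def pearson(a, b):
    """Pearson correlation; returns 0.0 for a constant profile."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb) if va > 0 and vb > 0 else 0.0

def query_search(datasets, query):
    """Weight datasets by query coherence, then rank the non-query genes.

    datasets: {dataset_name: {gene: [expression values]}}
    query: list of at least two genes present in every dataset.
    """
    pairs = list(combinations(query, 2))
    weights = {}
    for name, data in datasets.items():
        avg = sum(pearson(data[g], data[h]) for g, h in pairs) / len(pairs)
        weights[name] = max(avg, 0.0)  # assumed: ignore anti-correlated datasets
    total = sum(weights.values())
    if total == 0:
        raise ValueError("query genes are not co-expressed in any dataset")

    candidates = set.intersection(*(set(d) for d in datasets.values())) - set(query)
    scores = {}
    for gene in candidates:
        score = 0.0
        for name, data in datasets.items():
            mean_corr = sum(pearson(data[gene], data[q]) for q in query) / len(query)
            score += weights[name] * mean_corr
        scores[gene] = score / total
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return weights, ranked
```

Because the weights depend only on the query genes, they can be computed quickly per query, and the ranking step touches each candidate gene once per dataset.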
Simple Scheme
• Weighted by correlation of query
Simple Scheme
• Results, weighted sum of correlation to query
[Figure: results ordered by decreasing correlation]
Ongoing Work
• Compare query weighting schemes
• UI challenges
• Scalability concerns
– Indexing, Locality Sensitive Hashing
– Human data
• Assess biological usefulness
Preliminary Conclusions
• Noise, functional biases, collection sizes
require consideration in microarray analysis
• Evaluation metrics can be influenced by
biases creating misleading results
• Query-driven approaches show promise
– Targeted search
– Computational feasibility / Real-time results
– Extensibility
Acknowledgements
• Olga Troyanskaya
• Chad Myers
• Curtis Huttenhower
• Kai Li and lab
• Botstein and Kruglyak labs
• Kara Dolinski, Maitreya Dunham