Differential Analysis

Marker selection
Given phenotypically distinct classes, find “markers” that distinguish these classes from one another.
[Figure: tumor vs. normal samples]

Gene Marker Selection
Hierarchy of difficulty

      Problem                                                Gene Markers   Error     Example
I.    Tissue or Cell Type (Normal vs. Abnormal)              ~1000-2000     ~0%       Normal vs. Renal carcinoma
II.   Morphological Type                                     ~200-500       ~0-5%     Leukemia ALL vs. AML
III.  Morphological Subtype (Multiclass Classification)      ~50-100        ~0-15%    ALL B- vs. T-Cell
IV.   Treatment Outcome (Drug Sensitivity)                   ~1-20          ~5-50%    AML Treatment Outcome

Degree of difficulty increases from I to IV.
Adapted from P. Tamayo.

Gene Marker Selection
Compute score for each gene
Dataset + phenotype/class labels → compute score (t-test, SNR, etc.) → ranked gene list
Signal-to-Noise Ratio (SNR): SNR = (μ_A − μ_B) / (σ_A + σ_B), where μ and σ are the mean and standard deviation of the gene's expression in classes A and B.

Gene Marker Selection
Challenges
• Small sample size.
• Each gene tested is a separate hypothesis, which increases the likelihood of false positives.
• Gene interaction not taken into account.

Gene Marker Selection
Small Sample Size
• Generate a 10,000 x 100 matrix from a Gaussian (mean = 0, SD = 0.5)
• Pick n columns (n = 6, 14, 30, 100)
• Assign sample labels yellow and green
• Select top 25 markers for yellow, top 25 markers for green
[Figure: top markers for yellow vs. green with 6, 14, 30, and 100 samples]
With a small sample size it is easy to find genes that appear correlated with phenotype, even in random data.

P-value calculation
• If a gene's expression is normally distributed, the t-score follows the t-distribution
  – What if it isn't normally distributed?
• Permutation test:
  – shuffle labels (class membership)
  – compute score for each gene (t-score, SNR, ...)
  – repeat many times to obtain an empirical null distribution of scores for each gene
• Compare observed score to empirical distribution.
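The permutation test above can be sketched in a few lines of Python with NumPy. This is a minimal illustration, not the GenePattern implementation: the function names (`snr_scores`, `permutation_pvalues`), the use of the sample standard deviation, the two-sided comparison, and the add-one correction are all assumptions made for the example. The test data is pure Gaussian noise as on the small-sample-size slide, reduced to 1,000 genes to keep it fast.

```python
import numpy as np

def snr_scores(data, labels):
    """Signal-to-noise ratio per gene: (mean_A - mean_B) / (sd_A + sd_B).

    data:   genes x samples array
    labels: boolean array over samples, True = class A, False = class B
    """
    a, b = data[:, labels], data[:, ~labels]
    return (a.mean(axis=1) - b.mean(axis=1)) / (
        a.std(axis=1, ddof=1) + b.std(axis=1, ddof=1)
    )

def permutation_pvalues(data, labels, n_perm=1000, seed=0):
    """Gene-specific, two-sided empirical p-values from label shuffling."""
    rng = np.random.default_rng(seed)
    observed = snr_scores(data, labels)
    exceed = np.zeros(data.shape[0])
    for _ in range(n_perm):
        shuffled = rng.permutation(labels)  # shuffle class membership
        permuted = snr_scores(data, shuffled)
        # count permutations whose score is at least as extreme as observed
        exceed += np.abs(permuted) >= np.abs(observed)
    # add-one correction (an assumption here) keeps p-values above zero
    return observed, (exceed + 1) / (n_perm + 1)

# Null data: no gene is truly associated with the labels.
rng = np.random.default_rng(42)
data = rng.normal(loc=0.0, scale=0.5, size=(1000, 14))
labels = np.array([True] * 7 + [False] * 7)

observed, pvals = permutation_pvalues(data, labels, n_perm=200)
print("largest |SNR| in random data:", np.abs(observed).max())
print("genes with p < 0.05 before any correction:", int((pvals < 0.05).sum()))
```

Even though the data is random, some genes reach small p-values, which is why the scores still need a correction for multiple hypotheses.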
[Figure: distribution of permuted scores for a given gene, with the observed score of the gene marked]
No distributional assumptions are made; gene-specific p-values are computed.

Permutation test and P-value
To determine how significant a gene’s statistical score is.
[Figure: the “true” classes (known class A samples, known class B samples) are relabeled as “called” class A and “called” class B in Permutation 1, Permutation 2, ..., Permutation n. The permuted scores generate a “null distribution” of values for this gene, which is compared with the “real” score for this gene.]

Marker Selection Process
Dataset + phenotype/class labels
→ Compute score: t-test, SNR, etc. (ranked gene list)
→ Measure significance: permutation test
→ Correct for multiple hypotheses: FDR, FWER, etc.
→ Markers

Multiple Hypotheses
What to control
• Bonferroni correction:
  – Most conservative correction
  – Divides the significance threshold by the number of hypotheses (equivalently, multiplies each p-value by the number of hypotheses)
• FWER (Family-Wise Error Rate): probability of making one or more false positive calls among all the hypotheses tested
• FDR (False Discovery Rate): expected proportion of false positives among the hypotheses called significant
• Try to reduce the number of hypotheses tested in the first place (i.e., filtering)

Exercise
ComparativeMarkerSelection Module
1. Choose module:
   • Gene List Selection: ComparativeMarkerSelection
2. Choose input file:
   • Next to “input file”, choose “Specify URL”
   • View datasets window in Web browser
   • Click and drag all_aml_train.preprocessed.gct
3. Choose class file:
   • Next to “cls file”, choose “Specify URL”
   • View datasets window in Web browser
   • Click and drag all_aml_train.cls
4. Click Run

Viewing Analysis Results

Differential Analysis
Cookbook
• Reduce number of hypotheses/genes by variation filtering (attempt at reducing false negatives)
• Choose test statistic (e.g., SNR, t-score, ...)
• If enough samples, compute p-values by permutation test (otherwise, compute asymptotic p-values using the standard t-distribution).
• Control for Multiple Hypothesis Testing (MHT) by using the FDR correction
  – Remember: if you choose FDR ≤ 0.05, you’re willing to accept up to 5% false positives among the genes called significant.
  – If the number of significant hypotheses/genes is “too large” even for very small threshold values, either:
    • use the maxT correction (possible with empirical p-values only).
    • use additional criteria (e.g., min fold-change, min expression value, etc.)

Differential Analysis
GenePattern modules
• Create expression data set
  – ExpressionFileCreator
• Reduce number of hypotheses/genes by variation filtering
  – PreprocessDataset
• Make class file
• Run Differential Analysis
  – ComparativeMarkerSelection
  – Choose test statistic (say, t-score)
• View results with ComparativeMarkerSelectionViewer
  – If enough samples, compute p-values by permutation test (otherwise, use asymptotic test).
  – Control for MHT by using the FDR correction
  – Use HeatMapViewer to view results for top genes
• Use GSEA to find gene sets (or pathways) that are enriched in your dataset.

Working with Samples and Features
Overview
• Extracting a set of samples
• Computing co-expressed genes
• Converting probe set ids to gene names
• Computing overlap between gene sets

Working with Samples and Features
1. From a combined dataset of cancer and normal samples, select the normal samples.
2. Within the normal samples, find the genes coexpressed with LRPPRC (Affymetrix probe M92439_at), a gene with mitochondrial function.
3. Compare these genes and those coexpressed with LRPPRC in another expression dataset to determine the coexpressed genes common to both datasets.

Workflow (modules and the files they produce):
GCM_Total.res
→ SelectFeaturesColumns → GCM_Normals.res
→ GeneNeighbors → GCM_Normals.markerdata.gct, GCM_Normals.markerlist.odf
→ GeneListSignificanceViewer
→ CollapseDataset → GCM_Total_Normals.markerdata.collapsed.gct
→ ExtractRowNames → GCM_Total_Normals.markerdata.collapsed.row.names.txt
→ VennDiagram

Exercise
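The three-step task above (select normal samples, find genes coexpressed with LRPPRC, intersect with a second gene list) can be sketched outside GenePattern as follows. This is an illustration under stated assumptions only: the toy expression values, the gene names other than LRPPRC, the second neighbor list, and the helper `gene_neighbors` are all hypothetical, and Pearson correlation is assumed as the similarity measure. Collapsing probe set ids to gene names (the CollapseDataset step) is not shown.

```python
import numpy as np

def gene_neighbors(data, gene_names, target, k=2):
    """Return the k genes most correlated (Pearson) with `target`.

    data: genes x samples array; gene_names: list of row names.
    """
    idx = gene_names.index(target)
    centered = data - data.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(centered, axis=1)
    corr = centered @ centered[idx] / (norms * norms[idx])
    # sort by decreasing correlation, skipping the target gene itself
    order = [i for i in np.argsort(-corr) if i != idx]
    return [gene_names[i] for i in order[:k]]

# Hypothetical toy dataset: 5 genes, 6 normal samples followed by 2 tumor samples.
gene_names = ["LRPPRC", "GENE_A", "GENE_B", "GENE_C", "GENE_D"]
sample_classes = np.array(["normal"] * 6 + ["tumor"] * 2)
base = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
zigzag = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
normal_part = np.vstack([base, 2 * base + 1, base + 0.5 * zigzag, -base, zigzag])
tumor_part = np.tile([9.0, 0.0], (5, 1))
data = np.hstack([normal_part, tumor_part])

# Step 1: keep only the normal samples (columns).
normals = data[:, sample_classes == "normal"]

# Step 2: genes coexpressed with LRPPRC within the normal samples.
neighbors = gene_neighbors(normals, gene_names, "LRPPRC", k=2)

# Step 3: overlap with a (hypothetical) neighbor list from another dataset.
other_neighbors = ["GENE_B", "GENE_E"]
common = sorted(set(neighbors) & set(other_neighbors))

print("neighbors in normals:", neighbors)
print("common to both datasets:", common)
```

A set intersection gives the same information as the central region of a two-set Venn diagram; the set differences give the two outer regions.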