Flow Chart of Affymetrix: From Sample to Information

[Flow chart: tissue → cRNA → hybridize to Affymetrix arrays → generate Affy .dat file (scanned image) → output as Affy .chp file (text); downstream analyses include functional annotation, pathway assignment, coordinate regulation, Self-Organizing Maps (SOMs), and promoter motif commonalities]

Microarray Data Analysis
- Data preprocessing and visualization
- Supervised learning
- Unsupervised learning: machine learning approaches, clustering and pattern detection
- Prediction of gene regulatory regions based on coregulated genes
- Linkage between gene expression data and gene sequence/function databases
- …

Data Preprocessing
- Data preparation or pre-processing
- Normalization
- Feature selection: based on the quality of the signal intensity, based on the fold change, t-test, …

Normalization

[Figure: scatter plots of Experiment 1 vs. Control and Experiment 2 vs. Control; the red sample must be scaled so that the overall intensities for each chip are equivalent]

- To ensure the data are comparable, normalization attempts to correct for the following variables (simple arithmetic often suffices):
  - number of cells in the sample
  - total RNA isolation efficiency
  - signal measurement sensitivity
  - …
- Common approaches: normalization by global scaling (bring each image to the same average brightness), normalization by sectors, normalization to housekeeping genes, …
- Still an active research area

Basic Data Analysis
- Fold change: the relative change in intensity for each gene
- [Figure: fold-change examples for Mn-SOD, Annexin IV, and Aminoacylase 1]

Microarrays: An Example
- Leukemia: acute lymphoblastic (ALL) vs. acute myeloid (AML); Golub et al., Science, vol. 286, 1999
- 72 samples (38 train, 34 test), about 7,000 probes
- Well studied (CAMDA-2000): a good test example
- [Images: ALL and AML blood smears look visually similar, but the diseases are genetically very different]

Feature Selection

Probe           AML1     AML2     AML3     ALL1     ALL2     ALL3     p-value
D21869_s_at      170.7     55.0     43.7      5.5    807.9   1283.5   0.243
D25233cds_at     605       31.0    629.2    441.7     95.3    205.6   0.487
D25543_at       2148.7   2303.0   1915.5     49.2     96.3     89.8   0.0026
L03294_g_at      241.8    721.5     77.2     66.1    107.3    132.5   0.332
J03960_at        774.5   3439.8    614.3    556       14.4     12.9   0.260
M81855_at       1087     1283.7   1372.1   1469     4611.7   3211.8   0.178
L14936_at        212.6   2848.5    236.2    260.5   2650.9   2192.2   0.626
L19998_at        367        3.2    661.7    629.4    151      193.9   0.941
L19998_g_at       65.2     56.9     29.6    434.0    719.4    565.2   0.022
AB017912_at     1813.7   9520.6   2404.3   3853.1   6039.4   4245.7   0.963
AB017912_g_at    385.4   2396.8    363.7    419.3   6191.9   5617.6   0.236
U86635_g_at       83.3    470.9     52.3   3272.5   3379.6   5174.6   0.022
…

Hypothesis Testing
- The null hypothesis is a hypothesis about a population parameter.
- Hypothesis testing assesses the viability of the null hypothesis for a set of experimental data.
- Example: test whether the time to respond to a tone is affected by the consumption of alcohol.
  Null hypothesis: µ1 − µ2 = 0, where µ1 is the mean time to respond after consuming alcohol and µ2 is the mean time to respond otherwise.

Z-test
- Theorem: if x_i ~ N(µ, σ²) independently for i = 1, …, n, then U = Σ a_i·x_i has a normal distribution with mean E(U) = µ·Σ a_i and variance D(U) = σ²·Σ a_i². In particular, the sample mean x̄ = (Σ x_i)/n ~ N(µ, σ²/n).
- Z-test of H0: µ = µ0, with µ0 and σ0 known (assume σ = σ0): z = (x̄ − µ0)/(σ/√n).
- Example: what would one conclude about the null hypothesis that a sample of N = 46 with a mean of 104 could reasonably have been drawn from a population with µ = 100 and σ = 8? Using α = 0.05: z = (104 − 100)/(8/√46) ≈ 3.39, well beyond the two-tailed cutoff of 1.96, so reject the null hypothesis.
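A minimal Python sketch tying together three steps above: global-scaling normalization, t-test-based feature selection, and the z-test arithmetic from the example. The matrix shapes, target mean, cutoff, and function names are illustrative assumptions, not details from the slides.

```python
# Sketch: global-scaling normalization and t-test feature selection.
# Rows = probes, columns = samples; the toy values are NOT from the
# Golub data set.
import numpy as np
from scipy import stats

def global_scale(intensities, target_mean=1000.0):
    """Scale each chip (column) so its mean intensity equals target_mean."""
    chip_means = intensities.mean(axis=0)
    return intensities * (target_mean / chip_means)

def select_probes(expr, aml_cols, all_cols, alpha=0.01):
    """Indices of probes whose AML/ALL means differ by a two-sample t-test."""
    t, p = stats.ttest_ind(expr[:, aml_cols], expr[:, all_cols], axis=1)
    return np.where(p < alpha)[0], p

# Toy example: 5 probes x 6 samples (3 AML, 3 ALL).
rng = np.random.default_rng(0)
expr = global_scale(rng.uniform(50, 3000, size=(5, 6)))
selected, p = select_probes(expr, aml_cols=[0, 1, 2], all_cols=[3, 4, 5])
print(selected, p.round(3))

# Z-test example from the slide: N = 46, sample mean 104, mu0 = 100, sigma = 8.
z = (104 - 100) / (8 / np.sqrt(46))
print(round(z, 2))  # ~3.39 > 1.96, so reject H0 at alpha = 0.05
```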
T-test

[Figure: overlapping histograms of Set 1 and Set 2]

- Developed by William Sealy Gosset (1876-1937) of the Guinness Brewing Company, who published under the pseudonym "Student".

Project 3
- A training data set: 38 samples, 7,129 probes (27 ALL, 11 AML)
- A testing data set: 35 samples, 7,129 probes (22 ALL, 13 AML)
- Lab today: pick the top probes that can differentiate the two subtypes, then process the testing data set.

K-Nearest-Neighbor Classification

[Figure: scatter plot over Feature 1 and Feature 2 with training samples labeled L (= ALL) and M (= AML); the test sample is assigned the majority label among its k nearest neighbors]

Distance measures
- Euclidean distance: d(x, y) = √(Σ_i (x_i − y_i)²)
- Manhattan distance: d(x, y) = Σ_i |x_i − y_i|

Jury Decisions
- Use one feature at a time for the classification.
- Combine the results from the top 51 features by majority decision, e.g., for one test sample (see the kNN/jury sketch after this part):

  Feature0  Feature1  …  Feature50  Majority
  M         L         …  M          M

False Discovery
- Two possible errors in making a decision about the null hypothesis:
  1. We could reject the null hypothesis when it is actually true, i.e., our results were obtained by chance (Type I error).
  2. We could fail to reject the null hypothesis when it is actually false, i.e., our experiment failed to detect the true difference that exists (Type II error).
- We set α at a level that minimizes the chances of making either of these errors.

False Discovery (cont.)
- A Type I error is a false discovery.
- False Discovery Rate (FDR): roughly, the p-value of the t-test × the number of genes on the array gives the expected number of false positives. For a p-value of 0.01 and 10,000 genes, that is 100 false "different" genes.
- You cannot eliminate false positives, but by choosing a more stringent p-value you can keep them manageable (try p = 0.001).
- The number of false discoveries must be smaller than the number of real differences you find, which in turn depends on the size of the differences and the variability of the measured expression values.

RCC Subtypes
- Clear cell RCC (70-80%)
- Papillary (15-20%)
- Chromophobe (4-5%)
- Collecting duct
- Oncocytoma
- Sarcomatoid RCC?
- Goal: identify a panel of discriminator genes.

Genetic Algorithm for Feature Selection

[Diagram: raw measurement data → feature vector (f1, f2, f3, f4, f5) = pattern → sample class (clear cell RCC, etc.)]

Why a Genetic Algorithm?
- Assume 2,000 relevant genes, of which 20 are important discriminator genes (features).
- Cost of an exhaustive search for the optimal set of features: C(n, k) = n!/(k!(n−k)!), and C(2000, 20) = 2000!/(20!·1980!) ≥ (2000/20)^20 = 100^20 = 10^40.
- Even at one femtosecond (10^-15 s) per feature set, finding the optimal solution would take more than 3×10^17 years.

Evolutionary Methods
- Based on the mechanics of Darwinian evolution:
  - a population of competing candidate solutions;
  - chromosomes (here, sets of features);
  - genetic operators (mutation, recombination, etc.) generate new candidate solutions;
  - selection pressure directs the search: solutions that do well survive to form the basis for the next generation.
- The evolution of a solution is thus loosely based on biological evolution.

A Simple Evolutionary Algorithm

[Diagram: loop of Evaluation → Selection → Genetic Operators → new population]

Genetic Algorithm

[Figure: a population of five chromosomes, each a gene subset such as g100 g2 g5 g7 g20, ranked by fitness; selection and the genetic operators produce the next generation, repeating until the stop criterion is met]

Encoding
- The most difficult, and most important, part of any GA.
- Encode so that illegal solutions are not possible.
- Encode to simplify the "evolutionary" processes, e.g., to reduce the size of the search space.
- Most GAs use a binary encoding of a solution, but other schemes are possible.

GA Fitness
- At the core of any optimization approach is the function that measures the quality of a solution.
- Variously called the objective function, fitness function, error function, or error measure.
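A minimal Python sketch of the kNN classifier and the jury decision described above. The two distance measures match the slides; the toy data, k, and helper names are illustrative assumptions.

```python
# Sketch: k-nearest-neighbor classification with Euclidean or Manhattan
# distance, plus a "jury decision" that majority-votes single-feature votes.
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, x, k=3, metric="euclidean"):
    """Label x by majority vote among its k nearest training samples."""
    diffs = train_X - x
    if metric == "euclidean":
        d = np.sqrt((diffs ** 2).sum(axis=1))
    else:  # Manhattan
        d = np.abs(diffs).sum(axis=1)
    nearest = np.argsort(d)[:k]
    return Counter(train_y[nearest]).most_common(1)[0][0]

def jury_decision(train_X, train_y, x, n_features=51, k=3):
    """One single-feature kNN vote per feature; return the majority label."""
    votes = [knn_predict(train_X[:, [j]], train_y, x[[j]], k=k)
             for j in range(min(n_features, train_X.shape[1]))]
    return Counter(votes).most_common(1)[0][0]

# Toy data: 6 training samples, 2 features, labels L (ALL) / M (AML).
train_X = np.array([[1, 2], [2, 1], [1, 1], [8, 9], [9, 8], [8, 8]], float)
train_y = np.array(["L", "L", "L", "M", "M", "M"])
print(knn_predict(train_X, train_y, np.array([8.5, 8.5])))    # -> M
print(jury_decision(train_X, train_y, np.array([1.5, 1.5])))  # -> L
```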
Genetic Operators
- Mutation: the value at a randomly selected mutation site is replaced, e.g., 50 → 62 in the chromosome {10, 30, 50, 70}.
- Crossover: the tails of two parents are swapped at a randomly selected crossover point, e.g., parents {10, 30, 50, 70} and {20, 40, 60, 80} crossed after the second gene yield offspring {10, 30, 60, 80} and {20, 40, 50, 70}.
- Recombination is intended to produce promising individuals; mutation maintains population diversity, preventing premature convergence.

Genetic Algorithm / K-Nearest-Neighbor Classifier

[Diagram: microarray database → feature selection (GA) → classifier (kNN)]
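A minimal sketch of the two operators and one GA generation, assuming a fixed-size gene-subset encoding. The population size, survivor count, and the fitness stub (which in the GA/kNN classifier would be kNN accuracy on the training samples restricted to the chosen genes) are illustrative assumptions, not the authors' implementation.

```python
# Sketch: mutation and one-point crossover for fixed-size gene subsets.
import random

N_GENES = 2000   # candidate genes on the array
SUBSET = 20      # chromosome = a subset of 20 gene indices

def mutate(chrom):
    """Replace the gene at one randomly selected mutation site."""
    child = list(chrom)
    site = random.randrange(len(child))
    child[site] = random.choice([g for g in range(N_GENES) if g not in child])
    return child

def crossover(mom, dad):
    """One-point crossover at a randomly selected point, with repair."""
    point = random.randrange(1, len(mom))
    kids = [mom[:point] + dad[point:], dad[:point] + mom[point:]]
    # Repair: crossing subset encodings can duplicate a gene; redraw any
    # duplicate so chromosomes stay legal (cf. the Encoding slide).
    for kid in kids:
        for i, g in enumerate(kid):
            if g in kid[:i]:
                kid[i] = random.choice([x for x in range(N_GENES) if x not in kid])
    return kids

def fitness(chrom):
    """Stub: in the GA/kNN classifier this would be leave-one-out kNN
    accuracy on the training samples using only these genes."""
    return random.random()  # placeholder

# One run: rank by fitness, keep the best half, refill with offspring.
population = [random.sample(range(N_GENES), SUBSET) for _ in range(10)]
for generation in range(5):
    population.sort(key=fitness, reverse=True)
    parents = population[:5]                 # selection pressure
    children = []
    while len(parents) + len(children) < 10:
        kids = crossover(*random.sample(parents, 2))
        children.extend(mutate(k) for k in kids)
    population = parents + children[:5]
```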