Project – 1: Mining Differentially Expressed Phenomena

Name: Masrur Sobhan
Date of Submission: September 04, 2020
Course title and code: CAP 6778: Advanced Topics in Data Mining

Abstract: In biology, and especially in disease detection, identifying variations in gene expression is highly significant. mRNA levels are generally measured with DNA microarrays or RNA-Seq technology. In this project, expression data from gastric cancer and paired normal tissues, obtained through expression profiling by array, were analyzed using the dataset available in GEO (Gene Expression Omnibus) under accession ID GSE79973. The dataset consists of 10 pairs of gastric cancer tissue and adjacent non-tumor mucosa (20 samples), and the number of features is 54674 (genes measured at a specific time). Genes were considered significant, or differentially expressed, if the absolute value of logFC was >= 2 and the p-value was <= 0.05. After the statistical calculation, only 773 genes were found to be differentially expressed, of which 586 were downregulated and 187 were upregulated. The features are thus reduced by more than 98%, which is noteworthy for further experiments. Both the GEO2R web interface and the Limma R package were used to validate the results, and both produced the identical final output.

Keywords: Gene Expression, DNA Microarray, RNA-Seq, GEO, Limma R Package, Differentially Expressed Genes.

1. Materials and Methods

Identifying differentially expressed genes is a great challenge because of the abundance of data. In this project, the data were extracted from GEO (Gene Expression Omnibus); the accession ID for the dataset is GSE79973 (GEO Accession viewer, n.d.). Total RNA was extracted from 10 pairs of gastric cancer tissue and adjacent non-tumor tissue.
The dimension of the data was 54675 × 21, with 54674 features and 20 samples (10 pairs, tumor and normal). To identify the Differentially Expressed Genes (DEG), I used the Limma R package, which has built-in functions dedicated to identifying such genes. I used RStudio to perform all of the tasks. The libraries "GEOquery", "Biobase" and "limma" (Bioconductor - limma, n.d.; Bioconductor - GEOquery, n.d.) were used to identify the genes together with their statistical values. Upregulated and downregulated genes were then easily identified based on these statistical values.

The analysis of a GEO dataset in LIMMA follows these steps (Altman, 2013):

a. Compute the pooled variance by computing the within-region variance of each gene in the dataset. Afterwards, the correlation among the region variances is calculated.
b. Create the coefficient matrix for the contrasts of the samples.
c. Calculate the estimated contrast values.
d. Identify the significant genes based on the adjusted p-values, p-values, logFC and B-values.

Computing the pooled variance involves two steps: creating a design matrix and then fitting the model to find the pooled variance. The design matrix holds the coefficients of the linear model, with value 1 if the sample belongs to the treatment of that column and 0 otherwise. The function model.matrix() is used in LIMMA to create the matrix. Given the type of dataset, my generated model matrix has 20 rows (the total number of samples) and 2 columns (tumor and normal). The model.matrix() function includes a column of all 1's representing $\mu$ in the ANOVA model $Y_{ij} = \mu + \alpha_i + \text{error}$. ANOVA identifies the mean decomposition of the observed data. As a result, the fitted values are the estimates of $\mu$ and $\alpha$. Afterwards, the pooled sample variance for each gene was calculated using the lmFit() function.
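The design matrix and pooled variance described above can be illustrated with a small hand calculation. The sketch below is written in plain Python rather than R and is not the limma implementation: `design_matrix` and `pooled_variance` are hypothetical helper names, the expression values are made up, and the layout with one indicator column per group (no separate intercept column) is an assumption about how the 20 × 2 matrix was parameterized.

```python
from statistics import mean

def design_matrix(labels, levels):
    """Return one row per sample with a 0/1 indicator column per group."""
    return [[1 if label == level else 0 for level in levels] for label in labels]

def pooled_variance(values, labels, levels):
    """Within-group residual sum of squares divided by (samples - groups)."""
    rss = 0.0
    for level in levels:
        group = [v for v, label in zip(values, labels) if label == level]
        center = mean(group)
        rss += sum((v - center) ** 2 for v in group)
    return rss / (len(values) - len(levels))

levels = ["tumor", "normal"]
labels = ["tumor", "normal"] * 10   # 10 pairs -> 20 samples (sample order is assumed)
X = design_matrix(labels, levels)   # 20 rows, 2 columns

# Made-up log2 expression values for a single gene, one value per sample.
toy_gene = [8.1, 6.0, 7.9, 6.2, 8.4, 5.8, 8.0, 6.1, 7.7, 6.3,
            8.2, 5.9, 8.3, 6.0, 7.8, 6.4, 8.1, 6.2, 8.0, 6.1]
print(len(X), len(X[0]), pooled_variance(toy_gene, labels, levels))
```

The pooled variance does not depend on whether the matrix uses an intercept column plus one group column or two indicator columns, because both parameterizations fit the same two group means.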
Then the coefficient matrix for the contrasts was computed. The LIMMA package has a function named makeContrasts() which helps to create the contrast matrix. Note that the treatment names are taken from the column names of the design matrix by default, but these names can be changed. The rules for a contrast matrix are that the coefficients of each contrast must sum to zero and that the number of rows of the matrix must equal the number of parameters in the model. LIMMA has another function named contrasts.fit() which is used to fit the contrast matrix to the previously fitted model.

Next, the t-test was performed on the fitted dataset using the eBayes() function, which determines the empirical Bayes pooled variance for each gene. In its basic form, the t-statistic is the ratio of the difference between group means to the variability within groups. This function also determines the associated p-values. Afterwards, the topTable() function was used to get the gene lists. I used the "fdr" / "BH" method to adjust the p-values when assembling the genes. topTable() accepts many other parameters (toptable function | R Documentation, n.d.) that support further and different analyses.

The mathematical intuition behind the functions used in the LIMMA package (Smyth, n.d.) is described here in brief. Let the linear model for gene $j$ have residual variance $\sigma_j^2$ with sample value $s_j^2$ and degrees of freedom $f_j$. The fit function computes the ordinary t-statistic for the $k$th contrast for gene $j$, which is $t_{jk} = \hat{\beta}_{jk} / (u_{jk} s_j)$, where $u_{jk}$ is the unscaled standard deviation. The covariance matrix for the estimated $\hat{\beta}_{jk}$ can be calculated as $\sigma_j^2 C^T (X^T V_j X)^{-1} C$, where $X$ is the design matrix, $C$ is the contrast matrix and $V_j$ is a weight matrix determined by prior weights. The empirical Bayes method assumes an inverse Chi-square prior for the $\sigma_j^2$ with mean $s_0^2$ and degrees of freedom $f_0$.
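The ordinary t-statistic $t_{jk} = \hat{\beta}_{jk} / (u_{jk} s_j)$ described above can be checked by hand for a single gene and a single tumor-versus-normal contrast. The following is a minimal sketch in plain Python, not the limma code path: `ordinary_t` is a hypothetical helper, the samples are treated as two unpaired groups with no prior weights, and under those assumptions the unscaled standard deviation reduces to $u = \sqrt{c_1^2 / n_1 + c_2^2 / n_2}$.

```python
from math import sqrt
from statistics import mean

def ordinary_t(tumor, normal, contrast=(1, -1)):
    """Return (beta, u, s, t) for a contrast between two group means."""
    if sum(contrast) != 0:
        raise ValueError("contrast coefficients must sum to zero")
    m_t, m_n = mean(tumor), mean(normal)
    # Estimated contrast value: weighted combination of the two group means.
    beta = contrast[0] * m_t + contrast[1] * m_n
    # Residual standard deviation s from the pooled within-group variance.
    rss = sum((v - m_t) ** 2 for v in tumor) + sum((v - m_n) ** 2 for v in normal)
    s = sqrt(rss / (len(tumor) + len(normal) - 2))
    # Unscaled standard deviation u for an unweighted two-group design.
    u = sqrt(contrast[0] ** 2 / len(tumor) + contrast[1] ** 2 / len(normal))
    return beta, u, s, beta / (u * s)

# Made-up log2 expression values for one gene.
beta, u, s, t = ordinary_t([8.1, 7.9, 8.4, 8.0], [6.0, 6.2, 5.8, 6.1])
print(beta, u, s, t)
```

With the contrast (1, -1), a positive `beta` means higher expression in the tumor group; reversing the contrast flips the sign of both `beta` and `t`.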
The posterior values for the residual variances are given by $\tilde{s}_j^2 = (f_0 s_0^2 + f_j s_j^2) / (f_0 + f_j)$, where $f_j$ is the residual degrees of freedom for the $j$th gene. The output from eBayes() contains $s_0^2$ and $f_0$.

Limma also provides the function topTable(), which summarizes the results of the linear model, performs hypothesis tests and adjusts the p-values for multiple testing. Results usually contain (log) fold changes, standard errors, t-statistics and p-values. The basic statistic used for significance analysis is the moderated t-statistic, which is computed for each probe and for each contrast. This has the same interpretation as an ordinary t-statistic except that the standard errors have been moderated across genes. Moderated t-statistics lead to p-values in the same way that ordinary t-statistics do (Ritchie, et al., 2015).

A number of summary statistics are presented by topTable() for the top genes and the selected contrast (Smyth, n.d.; Altman, 2013):

Types of value                          Description
M-value (M)                             Log2-fold change between two or more experimental conditions
A-value (A)                             Average log2-expression level for that gene across all the arrays
t-value                                 Moderated t-statistic
p-value                                 Associated p-value
Adjusted p-value (most common: "fdr")   p-value after adjustment for multiple testing. If all genes with an adjusted p-value below a threshold, say 0.05, are selected as differentially expressed, then the expected proportion of false discoveries in the selected group is controlled to be less than the threshold value, in this case 5% (Benjamini & Hochberg, 1995).

2. Results and Discussion

The dataset was collected from the Gene Expression Omnibus (GEO) database under accession ID GSE79973. It has 54674 features and 20 samples (10 pairs of tumor and normal data). First, we need to check whether the data are ready to use. For that purpose, a mean-variance trend plot (Figure 1) was drawn using the voom() function in the LIMMA R package.
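Two quantities from the Methods section, the posterior (moderated) variance and the Benjamini-Hochberg adjusted p-value, lend themselves to a short numerical illustration. The sketch below is plain Python rather than R and is not limma's implementation. It assumes the standard forms $\tilde{s}_j^2 = (f_0 s_0^2 + f_j s_j^2) / (f_0 + f_j)$ for the posterior variance and the step-up rule for BH; the function names and all numbers are hypothetical.

```python
from math import sqrt

def posterior_variance(s2, f, s0_2, f0):
    """Shrink a gene's sample variance s2 (f d.f.) toward the prior s0_2 (f0 d.f.)."""
    return (f0 * s0_2 + f * s2) / (f0 + f)

def moderated_t(beta, u, s2_post):
    """Same form as the ordinary t-statistic, using the posterior variance."""
    return beta / (u * sqrt(s2_post))

def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

# Made-up values: a noisy gene (s2 = 2.0) is pulled toward the prior (s0_2 = 1.0).
s2_post = posterior_variance(s2=2.0, f=18, s0_2=1.0, f0=4)
print(s2_post, moderated_t(beta=2.0, u=0.45, s2_post=s2_post))
print(bh_adjust([0.01, 0.04, 0.03, 0.005]))
```

The larger the prior degrees of freedom $f_0$ relative to $f_j$, the more strongly each gene's variance is pulled toward the common value $s_0^2$.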
In the mean-variance trend plot (Figure 1), the smoothed curve (the red line) is fitted to sqrt(residual standard deviation) by average expression, and there is noticeably little noise in this dataset. A boxplot (Figure 2) was also drawn to further check the usability of the dataset for analysis; the means are visibly very close to each other, which supports the result of Figure 1.

Afterwards, the dataset was fitted to the ANOVA model using the lmFit() function of the LIMMA package. Using the eBayes() function of LIMMA, a t-test was performed and the associated p-values were generated. Then the topTable() function of LIMMA was used to obtain the p-value, adjusted p-value, logFC value, B-value and more. As the purpose of this project is to identify the differentially expressed genes, the focus was only on the p-value and the logFC value, because genes are considered significant, or differentially expressed, if the absolute value of logFC is >= 2 and the p-value is <= 0.05. Based on the p-value and logFC value, the volcano plot (Figure 3) was drawn. The vertical and horizontal dashed lines mark the thresholds, and only the red dotted features are considered significant genes for the dataset according to the p-value and logFC conditions. Based on these criteria, only 773 genes were detected as significant, which reduces the features by more than 98% (the original number of features is 54674). Among the 773 features, 187 genes were upregulated and 586 were downregulated (Table 1). In Figures 4 and 5, the upregulated genes appear in the right part and the downregulated genes in the left part of each figure.

Figure 1: Mean-Variance Trend
Figure 2: Boxplot for 10 pairs of genes
Figure 3: Volcano Plot to categorize the significant genes (red dotted features)
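The selection rule behind the volcano plot and the reported feature reduction can be written down in a few lines. This is a plain Python sketch and not the R code used in the project: `classify` and `feature_reduction` are hypothetical helpers, and the probe names and values are invented. It assumes that both conditions must hold together, which is how the successive counts in Table 1 read, and that a positive logFC means higher expression in tumor tissue.

```python
def classify(log_fc, p_value, fc_cut=2.0, p_cut=0.05):
    """Label a gene as 'up', 'down' or 'not significant' from logFC and p-value."""
    if p_value <= p_cut and abs(log_fc) >= fc_cut:
        return "up" if log_fc > 0 else "down"
    return "not significant"

def feature_reduction(selected, total):
    """Fraction of the original features removed by the selection."""
    return 1.0 - selected / total

# Invented (logFC, p-value) pairs for four hypothetical probes.
toy_results = {
    "probe_A": (2.8, 0.0004),
    "probe_B": (-3.1, 0.0100),
    "probe_C": (2.4, 0.2000),   # large fold change, but p-value too high
    "probe_D": (0.6, 0.0010),   # small p-value, but fold change too small
}
for probe, (log_fc, p_value) in toy_results.items():
    print(probe, classify(log_fc, p_value))

# Counts reported in this project: 773 of 54674 features were selected.
print(round(100 * feature_reduction(773, 54674), 2))
```

The same arithmetic confirms that the 586 downregulated and 187 upregulated genes add up to the 773 selected features.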
In addition, the features extracted with the LIMMA package were validated against the freely available GEO2R web interface incorporated into GEO (Gene Expression Omnibus). The results were very similar, the only exception being a mismatch of 4 genes after applying the p-value condition. This may be due to the floating-point precision of the p-values (the floating-point precision is higher in LIMMA). Table 1 shows the comparison more clearly.

Figure 4: Scatterplot of upregulated (right) and downregulated (left) genes
Figure 5: Histogram of upregulated (right) and downregulated (left) genes

Table 1: Comparison of produced data using LIMMA R Package and GEO2R data

Types of Conditioning           Original Data from GEO2R   Produced Data using LIMMA R Package
No Condition                    54674                      54674
P.Value < 0.05                  13040                      13044
Absolute value of logFC >= 2    773                        773
Downregulated Genes             586                        586
Upregulated Genes               187                        187

3. Conclusion

Microarrays are a powerful tool for monitoring the expression of many genes, especially with the advancement of microarray technology. The challenging part is to analyze a large amount of microarray data and make biological sense of it. The LIMMA package helps to analyze a huge dataset and to give meaning to genes by detecting and analyzing the significant ones, such as the upregulated and downregulated genes determined by statistical values. I learned a lot about gene expression data and their significance for research and disease detection. The challenging part was to learn the LIMMA package and its functionality. The statistical values were very important in identifying the significant genes.
These statistical models can be applied to many real-life problems involving large amounts of data, and the LIMMA package can also be used for more complex gene expression datasets (two-factor data or more), since LIMMA powers differential expression analyses for RNA data.

References

Altman, N. (2013). Differential Expression Analysis using LIMMA.
Benjamini, Y., & Hochberg, Y. (1995). Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.
Bioconductor - GEOquery. (n.d.). Retrieved from https://www.bioconductor.org/packages/release/bioc/html/GEOquery.html
Bioconductor - limma. (n.d.). Retrieved from https://bioconductor.org/packages/release/bioc/html/limma.html
GEO Accession viewer. (n.d.). Retrieved from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE79973
Ritchie, M., Phipson, B., Wu, D., Hu, Y., Law, C., Shi, W., & Smyth, G. (2015). Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research, 43(7), e47.
Smyth, G. (n.d.). Limma: Linear Models for Microarray Data.
toptable function | R Documentation. (n.d.). Retrieved from https://www.rdocumentation.org/packages/limma/versions/3.28.14/topics/toptable