Project – 1: Mining Differentially Expressed Phenomena
Name: Masrur Sobhan
Date of Submission: September 04, 2020
Course title and code: CAP 6778: Advanced Topics in Data Mining
Abstract: In biology, and especially in disease detection, identifying variation in
gene expression is highly significant. mRNA levels are generally measured with
either DNA microarrays or RNA-Seq technology. In this project, expression data
from gastric cancer and paired normal tissues, obtained by expression profiling by
array, were analyzed using the dataset available in GEO (Gene Expression Omnibus)
under accession ID GSE79973. The dataset consists of 10 pairs of gastric cancer
tissue and adjacent non-tumor mucosa (20 samples) with 54674 features (genes
measured at a specific time). A gene was considered significant, or differentially
expressed, if the absolute value of its logFC was >= 2 and its p-value was <= 0.05.
After the statistical calculation, only 773 genes were found to be differentially
expressed, of which 586 were downregulated and 187 were upregulated. The feature
set is thereby reduced by more than 98%, which is noteworthy for further
experiments. Both the GEO2R web interface and the Limma R package were used to
validate the results, and both produced the same final gene counts.
Keywords: Gene Expression, DNA Microarray, RNA-Seq, GEO, Limma R Package,
Differentially Expressed Genes.
1. Materials and Methods
Identifying the differentially expressed genes is a great challenge because of the abundance
of data. In this project, the data were extracted from GEO (Gene Expression Omnibus);
the accession ID for the dataset is GSE79973 (GEO Accession viewer, n.d.). Total RNA
was extracted from 10 pairs of gastric cancer tissue and adjacent non-tumor tissue. The
dimension of the data was 54675 x 21, with 54674 features and 20 samples (10 pairs of
tumor and normal). To identify the Differentially Expressed Genes (DEG), I used the
Limma R package, which has built-in functions dedicated to this task. I used RStudio
to perform all of the analysis. The libraries "GEOquery", "Biobase" and "limma"
[(Bioconductor - limma, n.d.), (Bioconductor - GEOquery, n.d.)] were used to compute
the statistical values for the genes. Upregulated and downregulated genes were then
identified from these statistical values.
The analysis of the GEO dataset in LIMMA follows several steps (Altman, 2013):
a. Compute the pooled variance by computing the within-region variance of each gene
in the dataset. Afterwards, the correlation among the region variances is calculated.
b. Create the coefficient matrix for the contrasts of the samples.
c. Calculate the estimated contrast values.
d. Identify the significant genes based on the adjusted p-values, p-values, logFC and
B-values.
Computing the pooled variance involves two steps: creating a design matrix and then fitting
the model to obtain the pooled variance. The design matrix holds the coefficients of the
linear model, with value 1 if the sample in that row belongs to the treatment of that column
and 0 otherwise. The function model.matrix() is used with LIMMA to create the matrix. For
this dataset, the generated model matrix has 20 rows (the total number of samples) and 2
columns (tumor and normal). By default, model.matrix() includes a column of all 1's
representing μ in the ANOVA model Yij = μ + αi + error. ANOVA gives the mean decomposition
of the observed data, so the fitted values are the estimates of μ and α. Afterwards,
the pooled sample variance for each gene was calculated using the lmFit() function.
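The idea of an indicator design matrix and a pooled variance can be sketched for a single gene in plain Python. This is only an illustration, not the LIMMA implementation: the six expression values are made up, and the design uses one indicator column per group without an intercept column, which is an assumption chosen to keep the arithmetic easy to follow.

```python
# Hypothetical toy data: one gene measured in 3 normal and 3 tumor samples.
groups = ["normal", "normal", "normal", "tumor", "tumor", "tumor"]
expr = [5.0, 5.2, 4.8, 7.1, 6.9, 7.0]  # made-up log2 expression values

levels = ["normal", "tumor"]

# Design matrix without an intercept: one indicator column per group,
# 1 if the sample belongs to that group and 0 otherwise.
design = [[1 if g == lvl else 0 for lvl in levels] for g in groups]

# For this design the least-squares coefficients are the group means.
coef = {}
for k, lvl in enumerate(levels):
    values = [y for y, row in zip(expr, design) if row[k] == 1]
    coef[lvl] = sum(values) / len(values)

# Pooled residual variance: residual sum of squares divided by the
# residual degrees of freedom (samples minus fitted parameters).
fitted = [sum(row[k] * coef[lvl] for k, lvl in enumerate(levels)) for row in design]
rss = sum((y - f) ** 2 for y, f in zip(expr, fitted))
df_residual = len(expr) - len(levels)
pooled_var = rss / df_residual

print(design)
print(coef, pooled_var)
```

In the real analysis this fit is repeated for each of the 54674 features, which is what lmFit() does in one call.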
Then the coefficient matrix for the contrasts was computed. The LIMMA package has a
function named makeContrasts() which helps to create the contrast matrix. Note that the
treatment names are taken from the column names of the design matrix by default, but
these names can be changed. The rules for a contrast matrix are that the coefficients of
each contrast must sum to zero and that the number of rows must equal the number of
parameters in the model.
LIMMA has another function named contrasts.fit() which is used to apply the contrast matrix
to the previously fitted model.
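The two contrast rules can be checked on a small example. The sketch below assumes a two-group design with one coefficient per group; the coefficient values and group sizes are hypothetical, and the code only mimics the arithmetic, not the contrasts.fit() interface.

```python
import math

# Hypothetical per-gene coefficients (group means), in design-column order.
levels = ["normal", "tumor"]
coef = [5.0, 7.0]
n_per_group = [3, 3]

# Contrast "tumor - normal": one weight per model parameter.
contrast = [-1, 1]

# Rule 1: the weights sum to zero.
weights_sum = sum(contrast)
# Rule 2: there is exactly one weight (row) per model parameter.
rows_match = len(contrast) == len(levels)

# Estimated contrast value; on the log2 scale this is the logFC.
log_fc = sum(c * b for c, b in zip(contrast, coef))

# Unscaled standard deviation of the contrast for a group-means design:
# sqrt(sum of c_k^2 / n_k). Multiplying by the residual standard deviation
# gives the standard error of the contrast.
unscaled_sd = math.sqrt(sum(c * c / n for c, n in zip(contrast, n_per_group)))

print(weights_sum, rows_match, log_fc, unscaled_sd)
```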
Next, the t-test was performed on the fitted model using the eBayes() command, which
determines the empirical Bayes pooled variance for each gene. In its basic form, the t-test
divides the difference between group means by the variability within groups. This
command also determines the associated p-values. Afterwards, the topTable() function was
used to get the gene lists. I used the "fdr"/"BH" method to adjust the p-values when
assembling the genes. The function has many other parameters (toptable function |
R Documentation, n.d.) which can be used for further and different analyses.
The mathematical intuition behind the functions used in the LIMMA package (Smyth) is
described briefly here. Let the linear model for gene j have residual variance $\sigma_j^2$
with sample value $s_j^2$ and degrees of freedom $f_j$. The fit function computes the
ordinary t-statistic for the kth contrast of gene j, which is
$t_{jk} = \beta_{jk} / (u_{jk} s_j)$, where $u_{jk}$ is the unscaled standard deviation.
The covariance matrix of the estimated $\beta_{jk}$ is
$\sigma_j^2 C^T (X^T V_j X)^{-1} C$, where $X$ is the design matrix, $C$ is the contrast
matrix and $V_j$ is a weight matrix determined by prior weights. The empirical Bayes method
assumes an inverse chi-square prior for $\sigma_j^2$ with mean $s_0^2$ and degrees of
freedom $f_0$. The posterior values for the residual variances are given by
$$\tilde{s}_j^2 = \frac{f_0 s_0^2 + f_j s_j^2}{f_0 + f_j}$$
where $f_j$ is the residual degrees of freedom for the jth gene. The output from eBayes()
contains $s_0^2$ and $f_0$. LIMMA also provides the function topTable(), which summarizes
the results of the linear model, performs hypothesis tests and adjusts the p-values for
multiple testing.
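The shrinkage described above can be worked through numerically. In this sketch the helper name moderated_t and all input values are hypothetical; in particular the prior values s0^2 and f0 are simply assumed here, whereas eBayes() estimates them from all genes together.

```python
import math

def moderated_t(beta, unscaled_sd, s2, df_resid, s2_prior, df_prior):
    """Posterior variance and moderated t-statistic for one gene and contrast."""
    # Posterior variance: weighted average of prior and sample variance,
    # with the degrees of freedom as weights.
    s2_post = (df_prior * s2_prior + df_resid * s2) / (df_prior + df_resid)
    # Same form as the ordinary t-statistic, with s_j replaced by the
    # posterior standard deviation.
    t_mod = beta / (unscaled_sd * math.sqrt(s2_post))
    # The moderated t-statistic has f0 + fj degrees of freedom.
    return s2_post, t_mod, df_prior + df_resid

# Hypothetical inputs for one gene.
beta = 2.0                       # estimated contrast (logFC)
unscaled_sd = math.sqrt(2 / 3)   # two groups of three samples
s2, df_resid = 0.025, 4          # sample variance and residual df
s2_prior, df_prior = 0.05, 4     # assumed prior values

t_ordinary = beta / (unscaled_sd * math.sqrt(s2))
s2_post, t_mod, df_total = moderated_t(beta, unscaled_sd, s2, df_resid, s2_prior, df_prior)

print(t_ordinary, s2_post, t_mod, df_total)
```

Because this gene's sample variance is smaller than the assumed prior, the posterior variance is pulled upward and the moderated statistic is smaller than the ordinary one.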
Results usually contain (log) fold changes, standard errors, t-statistics and p-values. The
basic statistic used for significance analysis is the moderated t-statistic, which is computed
for each probe and for each contrast. It has the same interpretation as an ordinary
t-statistic, except that the standard errors have been moderated across genes. Moderated
t-statistics lead to p-values in the same way that ordinary t-statistics do (Ritchie, et al.,
2015).
topTable() presents a number of summary statistics for the top genes and the selected
contrast (Smyth) and (Altman, 2013):
Types of value | Description
M-value (M) | This represents a log2-fold change between two or more experimental conditions
A-value (A) | Average log2-expression level for that gene across all the arrays
t-value | Moderated t-statistic
p-value | Associated p-value after adjustment for multiple testing
Adjustment p-value [most common: "fdr"] | If all genes with p-value below a threshold, say 0.05, are selected as differentially expressed, then the expected proportion of false discoveries in the selected group is controlled to be less than the threshold value, in this case 5%. (Benjamini & Hochberg, 1995)
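The "fdr" adjustment in the last row is the Benjamini-Hochberg procedure. A minimal sketch of that procedure is given below; the helper name bh_adjust and the four p-values are made up for illustration.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, scaling each by m / rank and
    # keeping a running minimum so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.20]
adj = bh_adjust(raw)
selected = [i for i, p in enumerate(adj) if p <= 0.05]

print(adj)
print(selected)
```

With these values only the first gene stays below 0.05 after adjustment, even though three of the raw p-values were below 0.05.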
2. Results and Discussion
The dataset was collected from the Gene Expression Omnibus (GEO) database under
accession ID GSE79973. It has 54674 features and 20 samples (10 pairs of tumor and
normal data). First, we need to check whether the data are ready to use. For that
purpose, a mean-variance trend plot [Figure 1] was drawn using the voom() function in the
LIMMA R package. The smoothed curve (the red line) is fitted to sqrt(residual
standard deviation) against average expression, and there is noticeably little noise
in this dataset. A boxplot [Figure 2] was also drawn as a further check on the usability of
the dataset, and it shows that the mean for each of the genes is very close to the others,
which agrees with Figure 1. Afterwards, the dataset was fitted to the ANOVA model using
the lmFit() function of the LIMMA package. Using the eBayes() function of LIMMA, a t-test
was performed and the associated p-values were generated. Later, the topTable() function of
LIMMA was used to obtain the p-value, adjusted p-value, logFC value, B-value and other
statistics. As the purpose of this project is to identify the differentially expressed genes,
the focus was only on the p-value and the logFC value, because a gene is considered
significant, or differentially expressed, if the absolute value of its logFC is >= 2 and its
p-value is <= 0.05. Based on the p-value and logFC value, the volcano plot [Figure 3] was
drawn. The vertical and horizontal dashed lines mark the thresholds, and only the red
dotted features are considered significant genes for the dataset under the p-value and
logFC conditions. Under these criteria only 773 genes were detected as significant, which
reduces the features by more than 98% (the original number of features is 54674). Among
the 773 features, 187 genes were upregulated and 586 genes were downregulated.
Figure 1: Mean-Variance Trend
Figure 2: Boxplot for 10 pairs of genes
Figure 3: Volcano Plot to categorize the significant genes (red dotted features)
Figures 4 and 5 show the upregulated genes (right part) and the downregulated genes (left
part of each figure). The features extracted with the LIMMA package were then validated
against the freely available GEO2R web interface incorporated into GEO (Gene Expression
Omnibus). The two sets of results were very similar, with the only exception being 4 gene
mismatches after applying the p-value condition. This might be due to the floating-point
precision of the p-values (the precision is higher in LIMMA). Table 1 shows the
comparison more clearly.
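The selection rule and the size of the feature reduction can be sketched as follows. The sketch assumes that both thresholds must hold at once, which is what the counts in Table 1 suggest, and that a positive logFC means higher expression in tumor tissue. The function name classify and the example rows are hypothetical.

```python
def classify(log_fc, p_value, lfc_cut=2.0, p_cut=0.05):
    """Label one gene using the logFC and p-value thresholds."""
    if abs(log_fc) >= lfc_cut and p_value <= p_cut:
        return "up" if log_fc > 0 else "down"
    return "not significant"

# Hypothetical (logFC, p-value) rows, not taken from GSE79973.
rows = [(2.6, 0.001), (-3.1, 0.0004), (2.4, 0.20), (0.8, 0.01)]
labels = [classify(lfc, p) for lfc, p in rows]

# Feature reduction reported in the text: 773 genes kept out of 54674.
total_features = 54674
kept = 586 + 187
reduction_pct = 100.0 * (1 - kept / total_features)

print(labels)
print(kept, round(reduction_pct, 2))
```

The last two lines confirm that the downregulated and upregulated counts add up to 773 and that the reduction is a little under 98.6%.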
Figure 4: Scatterplot of upregulated (right) and downregulated (left) genes
Figure 5: Histogram of upregulated (right) and downregulated (left) genes
Table 1: Comparison of produced Data using LIMMA R Package and GEO2R Data
Types of Conditioning | Original Data from GEO2R | Produced Data using LIMMA R Package
No Condition | 54674 | 54674
P.Value < 0.05 | 13040 | 13044
Absolute value of logFC >= 2 | 773 | 773
Downregulated Genes | 586 | 586
Upregulated Genes | 187 | 187
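One way a precision difference could produce the small mismatch in the P.Value row is sketched below. This is only an illustration of the general effect: the p-values are invented, and rounding to four decimals is an assumption, not a documented behavior of GEO2R.

```python
# Hypothetical p-values lying very close to the 0.05 threshold.
p_values = [0.04991, 0.04996, 0.04999, 0.05003]

# Count at full precision.
count_full = sum(p < 0.05 for p in p_values)

# Count after rounding to four decimals (assumed export precision):
# 0.04996 and 0.04999 both become 0.05 and no longer pass "< 0.05".
count_rounded = sum(round(p, 4) < 0.05 for p in p_values)

print(count_full, count_rounded)
```

Values this close to the threshold are rare, which would be consistent with a difference of only 4 genes out of roughly 13000.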
3. Conclusion
Microarrays are a powerful tool for monitoring the expression of many genes, especially
with the advancement of microarray technology. The challenging part is to analyze a large
amount of microarray data and make biological sense of it. The LIMMA package helps
to analyze such huge datasets and to give meaning to the genes by detecting and analyzing
the significant ones, such as upregulated and downregulated genes, determined by
statistical values. I learned a lot about gene expression data and their significance for
research and disease detection. The challenging part was to learn the LIMMA package and
its functionality. The statistical values were very important in identifying the significant
genes. These statistical models can be applied to many real-life problems involving huge
data, and the LIMMA package can be used for more complex gene datasets (data with two
or more factors), as LIMMA powers differential expression analyses for RNA data.
References
Altman, N. (2013). Differential Expression Analysis using LIMMA.
Benjamini, Y., & Hochberg, Y. (1995). Controlling the False Discovery Rate: A Practical
and Powerful Approach to Multiple Testing.
Bioconductor - GEOquery. (n.d.). Retrieved from
https://www.bioconductor.org/packages/release/bioc/html/GEOquery.html
Bioconductor - limma. (n.d.). Retrieved from
https://bioconductor.org/packages/release/bioc/html/limma.html
GEO Accession viewer. (n.d.). Retrieved from
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE79973
Ritchie, M., Phipson, B., Wu, D., Hu, Y., Law, C., Shi, W., & Smyth, G. (2015). Limma
powers differential expression analyses for RNA-sequencing and microarray studies.
Nucleic Acids Research, 43(7), e47.
Smyth, G. (n.d.). Limma: Linear Models for Microarray Data.
toptable function | R Documentation. (n.d.). Retrieved from
https://www.rdocumentation.org/packages/limma/versions/3.28.14/topics/toptable