Manual for the MTOM software
Ai Li (1), Steve Horvath (1,2)
(1) Dept. of Biostatistics, School of Public Health, UCLA
(2) Dept. of Human Genetics, David Geffen School of Medicine, UCLA
Correspondence: liaipk@gmail.com or shorvath@mednet.ucla.edu
The MTOM software uses the multi-node topological overlap measure (MTOM) for
gene neighborhood analysis and for module detection.
To cite the MTOM software, please use the following references:

Li A, Horvath S (2006) Network Neighborhood Analysis with the multi-node topological
overlap measure. Bioinformatics. doi:10.1093/bioinformatics/btl581

Li A, Horvath S (2007) Network Module Detection: Affinity Search Technique with the
Topological Overlap Measure. Submitted.
Here we provide a brief user manual.
0. The Windows software can be downloaded from the following webpage
http://www.genetics.ucla.edu/labs/horvath/MTOM/
1. After double clicking the MTOM icon, you should see the following screen that allows you to
input the data.
Fig. 1
Note that the MTOM software has 3 tabs:
a) Data Input
b) Finding Gene Neighbors
c) Module Detection
In the following, we explain these tabs.
2. Data Input
The software allows one to input gene expression data and to compute a gene co-expression
network. Alternatively, one can input a co-expression similarity measure (assumed to take on values
in the unit interval). This similarity measure can be transformed into an adjacency measure by
raising it to a power (soft thresholding, which results in a weighted network adjacency) or by
dichotomizing it (hard thresholding, which results in an unweighted network).
As a third option, one can input an adjacency matrix directly. Input format: a square matrix in which all
entries take on values in the unit interval.
[1] Input Microarray Gene Expression Data
In most applications, the user will input gene expression data. In the following, we describe how
to input gene expression data and how to compute a corresponding gene co-expression network.
Later, we describe how to carry out network neighborhood analysis and module detection.
The file format is shown in Fig. 2.
The first column gives the names of the rows (genes). Including it is recommended because the
names are used as the gene IDs. If there is no such column, the software
assigns integers as IDs according to the order of the genes in the file.
The first row gives the names of the columns (samples). It is optional.
Fig. 2
The file should be a comma-delimited Excel “.csv” file. The first row can contain the column
names, and the first column can contain the gene names (or probeset IDs). Rows correspond to genes,
columns correspond to microarray samples. Missing data should be represented by a blank or by
“NA” (which stands for not available).
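To check a file against this layout before loading it into MTOM, a short script can help. The following is a minimal sketch in Python, not part of the MTOM software. The function name `read_expression_csv` is hypothetical, and the choice to start the integer IDs at 1 is an assumption, since the manual only says that IDs follow the order of the genes.

```python
import csv
import io

def read_expression_csv(text, has_header=True, has_gene_names=True):
    """Parse comma-delimited expression data: rows are genes, columns are samples."""
    rows = [r for r in csv.reader(io.StringIO(text)) if r]
    sample_names = None
    if has_header:
        header, rows = rows[0], rows[1:]
        sample_names = header[1:] if has_gene_names else header
    gene_ids, values = [], []
    for index, row in enumerate(rows, start=1):
        if has_gene_names:
            gene_ids.append(row[0])
            cells = row[1:]
        else:
            gene_ids.append(index)  # assumption: integer IDs start at 1
            cells = row
        # A blank cell or "NA" marks a missing measurement.
        values.append([None if c.strip() in ("", "NA") else float(c) for c in cells])
    return gene_ids, sample_names, values

example = "gene,s1,s2,s3\nG1,1.5,NA,2.0\nG2,0.3,,0.9\n"
gene_ids, sample_names, values = read_expression_csv(example)
```

If the number of parsed genes or samples is not what you expect, the file is probably not comma delimited.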
To input a gene expression data file, select “Expression” and click “Input”.
Then you will see the following input dialog:
If the first row contains the column names (microarray samples), select “Yes”; otherwise select “No”.
If the first column contains the gene names, select “Yes”.
Fig. 3
Select the file you want to input in the file-open dialog and click “Open”.
You may need to wait a few seconds if your file has more than 10,000 genes. A summary of
the input data file will appear.
Fig. 4
Click “OK” to proceed. But if the number of rows or columns does not make sense to you, check
the input data. Remember that this should be a comma delimited csv file!
Since neighborhood analysis can be computationally intensive, it is often advisable to reduce the
number of genes that are considered for the network construction. To do this, we provide multiple
filters for reducing the number of genes.
Click “OK” in the dialog that has just appeared.
Fig. 5
The software implements four filtering options:
a) coefficient of variation, which is defined as the standard deviation divided by the mean,
i.e., CV=sqrt(variance)/mean.
b) variance of the expression
c) mean expression
d) whole network connectivity
Usually, we filter genes based on the variance and the connectivity. The order in the panel below
matters. For example, when choosing a number of 10000 for the variance and a number of 4000
for the connectivity, the program will first select the 10000 most varying genes, next compute the
connectivity of each of these genes, and then restrict the analysis to the 4000 most connected
genes (among the 10000 most varying genes). Since modules are comprised of highly connected
genes, one does not lose much information when restricting the analysis to the most connected
genes.
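The two-stage filtering described above can be sketched as follows. This is an illustration in Python under stated assumptions, not the code used by MTOM. Connectivity is taken to be the sum of a gene's soft-thresholded absolute Pearson correlations with the other retained genes, and the power of 6 is an arbitrary illustrative value. All function names are hypothetical.

```python
import statistics

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def coefficient_of_variation(row):
    """CV = sqrt(variance) / mean, as defined for filter a)."""
    return statistics.stdev(row) / statistics.fmean(row)

def top_indices(scores, n):
    """Positions of the n largest scores, returned in ascending position order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n])

def filter_genes(expr, n_variance, n_connectivity, power=6):
    # Stage 1: keep the most varying genes.
    variances = [statistics.variance(row) for row in expr]
    kept = top_indices(variances, n_variance)
    # Stage 2: compute connectivity only among the genes kept in stage 1.
    connectivity = [
        sum(abs(pearson(expr[i], expr[j])) ** power for j in kept if j != i)
        for i in kept
    ]
    best = top_indices(connectivity, n_connectivity)
    return [kept[i] for i in best]

# Four toy genes measured on five samples.
expr = [
    [1, 2, 3, 4, 5],
    [2, 4, 6, 8, 10],
    [5, 5, 5, 5, 6],
    [10, 8, 6, 4, 1],
]
selected = filter_genes(expr, n_variance=3, n_connectivity=2)
```

Because the connectivity is computed after the variance filter, swapping the two stages would generally select a different set of genes.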
Select the filter(s) first. Multiple choices are allowed, and the genes satisfying the
filters will be used in subsequent analysis.
Note that the connectivity is computed based on either a soft thresholding approach
(weighted gene co-expression network) or a hard threshold.
Fig. 6
Click “Filter” after choosing the filtering criteria. The computation of the connectivity may take a
few minutes if you have more than 10,000 genes. After the filtering is done, you will be asked
whether you want to save the filtered data set into a separate file. Regardless of whether or not you
save the filtered data, the software will compute a co-expression network based on these filtered
genes.
Correlation matrix: The next step is to compute a correlation matrix which is a measure of
similarity between the gene expression profiles. We have implemented an option to compute a
leave one out correlation matrix which is the average correlation after leaving out microarray
samples one at a time. This may take a few minutes if you have more than 10,000 filtered genes.
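The leave-one-out correlation described above can be illustrated for a single pair of genes. The sketch below is written in Python and assumes Pearson correlation. The manual does not state which correlation coefficient the software uses, and the function names are hypothetical.

```python
def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def leave_one_out_correlation(x, y):
    """Average the correlation over all ways of dropping one sample."""
    n = len(x)
    total = 0.0
    for skip in range(n):
        xs = x[:skip] + x[skip + 1:]
        ys = y[:skip] + y[skip + 1:]
        total += pearson(xs, ys)
    return total / n

# The last sample is an outlier that drives the ordinary correlation up.
x = [1, 2, 3, 4, 10]
y = [4, 3, 2, 1, 10]
full = pearson(x, y)
loo = leave_one_out_correlation(x, y)
```

In this toy example the leave-one-out value is lower than the ordinary correlation, because dropping the outlying sample reverses the sign of the correlation among the remaining four samples.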
Adjacency matrix: After the correlation matrix is computed, you can choose a thresholding
approach. Soft thresholding with the power adjacency function results in a weighted gene
co-expression network (Zhang and Horvath 2005). Hard thresholding results in an unweighted
network. The standard approach considers the absolute value of the correlation matrix, i.e. the
co-expression information ignores the sign of the correlation. However, we have also implemented
an option to keep track of the sign of the correlation, which results in a signed network, i.e. the
neighbors will have positive correlations with the seed genes.
A standard (unsigned) weighted network is defined by
$a_{ij} = |\mathrm{cor}(x_i, x_j)|^{\beta}$
where $\beta$ is the soft threshold (the power).
A signed weighted network is defined by
$a_{ij} = |(1 + \mathrm{cor}(x_i, x_j)) / 2|^{\beta}$
Power adjacency function: input the power parameter here. All correlations will be raised to
this power in the adjacency matrix. The power adjacency function is recommended.
Hard threshold: input the hard threshold parameter here. Pairs of genes with
correlations above this threshold will be connected (adjacency 1); all other pairs get adjacency 0.
Check this box to get a signed network.
Fig. 7
It may take several minutes if you input a data set with more than 10,000 genes.
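For readers who want to see the thresholding options in numbers, here is a small Python sketch of the soft (power) and hard adjacency functions for a single correlation value. The function names are hypothetical. The manual does not describe how the hard threshold interacts with the signed option, so the hard version below simply takes a similarity in the unit interval.

```python
def soft_adjacency(cor, power, signed=False):
    """Power adjacency: raise the co-expression similarity to the given power."""
    if signed:
        # Signed network: map the correlation from [-1, 1] to [0, 1] first.
        return ((1 + cor) / 2) ** power
    # Standard network: the sign of the correlation is ignored.
    return abs(cor) ** power

def hard_adjacency(similarity, threshold):
    """Unweighted network: 1 if the similarity is above the threshold, else 0."""
    return 1 if similarity > threshold else 0

cor = -0.8
unsigned_weight = soft_adjacency(cor, power=6)             # 0.8 ** 6
signed_weight = soft_adjacency(cor, power=6, signed=True)  # 0.1 ** 6
connected = hard_adjacency(abs(cor), threshold=0.7)
```

A strongly negative correlation therefore yields a sizeable adjacency in the standard network but an adjacency close to 0 in the signed network.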
[2] Input a similarity matrix data file
A similarity matrix specifies how similar two nodes are. The program allows the user to turn this
similarity into a network by soft or hard thresholding. For example, the similarity measure could
be the absolute value of a correlation matrix. After inputting it, you need to choose a thresholding
approach to turn it into an adjacency matrix.
The format of a similarity matrix data file is the same as that of the expression file
described in Fig. 2. Just make sure that it is a symmetric square matrix without
missing entries.
After inputting a similarity matrix, calculate the adjacency matrix as described
in Fig. 7.
[3] Input an adjacency matrix data file
The input format is the same as for the similarity matrix and the expression file described in Fig. 2.
Just make sure that it is a symmetric square matrix without missing entries.
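The requirements on a similarity or adjacency matrix (square, symmetric, no missing entries, values in the unit interval) can be checked before the file is loaded. The following Python sketch is one way to do this. The function names and the numerical tolerance are assumptions made for illustration.

```python
def is_missing(value):
    """True for None or NaN (NaN is the only value not equal to itself)."""
    return value is None or value != value

def check_network_matrix(matrix, tol=1e-12):
    """Return a list of problems; an empty list means the matrix looks valid."""
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        return ["matrix is not square"]
    problems = []
    for i in range(n):
        for j in range(n):
            v = matrix[i][j]
            if is_missing(v):
                problems.append(f"missing entry at ({i}, {j})")
            elif not 0.0 <= v <= 1.0:
                problems.append(f"entry at ({i}, {j}) is outside [0, 1]")
            elif j > i and not is_missing(matrix[j][i]) and abs(v - matrix[j][i]) > tol:
                problems.append(f"entries ({i}, {j}) and ({j}, {i}) differ")
    return problems

good = [[1.0, 0.5], [0.5, 1.0]]
asymmetric = [[1.0, 0.2], [0.3, 1.0]]
good_report = check_network_matrix(good)
asymmetric_report = check_network_matrix(asymmetric)
```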
[4] Input an interaction data file
The software also allows the user to input the network in two-column format. For example, Fig. 8
presents an example of a protein-protein interaction network file. This format assumes an unweighted
network, i.e. adjacencies are either 1 or 0.
In Fig. 8, the two columns are Interaction Part A and Interaction Part B.
Fig. 8
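To make the relationship between the two-column format and an unweighted adjacency matrix concrete, here is a short Python sketch. It assumes a comma-delimited file without a header row and ignores self-interactions. Both are assumptions, since the manual does not specify them, and the function name is hypothetical.

```python
import csv
import io

def interactions_to_adjacency(text):
    """Turn two-column interaction pairs into a symmetric 0/1 adjacency matrix."""
    pairs = [
        (row[0].strip(), row[1].strip())
        for row in csv.reader(io.StringIO(text))
        if len(row) >= 2
    ]
    nodes = sorted({name for pair in pairs for name in pair})
    index = {name: i for i, name in enumerate(nodes)}
    adjacency = [[0] * len(nodes) for _ in nodes]
    for a, b in pairs:
        if a != b:  # assumption: self-interactions are ignored
            adjacency[index[a]][index[b]] = 1
            adjacency[index[b]][index[a]] = 1
    return nodes, adjacency

nodes, adjacency = interactions_to_adjacency("P1,P2\nP2,P3\n")
```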
2. Neighborhood Analysis
After inputting the data, we are now ready for neighborhood analysis.
Click the tab “Finding Gene Neighbors”
The tab shown in Fig. 9 has the following elements:
a) Seed genes: input your initial seed genes here; multiple seeds are separated by commas. For best results, use initial genes that are highly correlated or have high topological overlap.
b) Neighborhood size: input the number of closest neighboring genes you want.
c) Search approach: a recursive search is recommended, but a non-recursive search is much faster.
d) Search button: click here to begin the search.
e) Output: the list of closest neighbors and the corresponding MTOM values.
Fig. 9
After you start the search, the dialog box shown in Fig. 10 will appear if multiple seeds were input. Click “OK” to
proceed.
The dialog reports the correlation of the two seeds.
Fig. 10
This output allows one to determine whether seed genes have a reasonably high correlation with
each other. It does not make sense to input a pair of seed genes if they don’t have a reasonably
high correlation (say larger than .5 but this depends to some extent on the number of microarrays).
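The same check can be done outside the software before choosing seeds. The Python sketch below lists seed pairs whose correlation does not exceed a cutoff. It uses Pearson correlation on the raw (signed) values and the rule-of-thumb cutoff of 0.5 mentioned above. The function names and the toy data are hypothetical.

```python
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def weakly_correlated_seed_pairs(expr_by_gene, seeds, min_cor=0.5):
    """Return (seed_a, seed_b, correlation) for pairs at or below the cutoff."""
    weak = []
    for a, b in combinations(seeds, 2):
        r = pearson(expr_by_gene[a], expr_by_gene[b])
        if r <= min_cor:
            weak.append((a, b, r))
    return weak

# Toy expression profiles over five samples.
expr_by_gene = {
    "A": [1, 2, 3, 4, 5],
    "B": [2, 4, 6, 8, 10.5],
    "C": [5, 1, 4, 2, 3],
}
weak = weakly_correlated_seed_pairs(expr_by_gene, ["A", "B", "C"])
weak_names = [(a, b) for a, b, _ in weak]
```

Here genes A and B would make a sensible pair of seeds, whereas C is only weakly (in fact negatively) correlated with both.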
3. Module Detection.
This tab implements the Module Affinity Search Technique (MAST).
The procedure forms modules around a set of hub neighborhoods which can either be input
directly or automatically determined by the program.
The procedure is carried out in 3 steps, which are described in our article and in the figure below.
Don’t try to understand the details from the figure.
The main message is that you can either import a list of initial hub seeds or let the program find
initial seeds automatically (step 1).
In step 2, the hub seeds and neighborhoods are grown into preliminary modules.
In step 3, the preliminary modules may be merged if their relative similarity passes a threshold.
The module detection tab has the following controls:
a) Click here to let the software find hub neighborhoods automatically (step 1).
b) Click here to read in your own initial seeds of hub neighborhoods (step 1).
c) Click here to extend the hub neighborhoods to preliminary modules (step 2).
d) Merge threshold for step 3.
e) Step 3: merge the preliminary modules.
f) Click here to save the result.
g) These toggles allow you to choose which clustering results you want to output.
Regarding step 1:
The following file is not necessary when the initial hubs are chosen automatically. But if you want
to input your own initial seeds or hub neighborhoods directly, use a comma delimited csv file
in the following format.
The second column labels the initial seeds or hub neighborhoods.
In other words, if you input 10 distinct hubs, you end up with 10 rows whose labels run from
1 to 10. However, if you have 2 seeds for one initial neighborhood, you would have 2
corresponding rows with the same label.
Potential application: This kind of input allows you to first use hierarchical clustering in R to find
modules and then use the hub nodes in those modules as seeds for MTOM.
The first column gives the names of the genes/proteins in the hub neighborhoods.
The second column gives the membership of the corresponding gene/protein:
genes/proteins with 1 belong to the first hub neighborhood, genes/proteins with 2 belong to
the second hub neighborhood, etc.
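A seed file in this two-column layout can be generated from a list of hub neighborhoods, for instance after picking hub genes from modules found elsewhere. The Python sketch below assumes that the file has no header row, which the manual does not state. The function name and gene names are hypothetical.

```python
import csv
import io

def write_hub_seed_csv(neighborhoods):
    """Build the two-column seed file: gene name, hub neighborhood label."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # Labels run from 1 upward; seeds of the same neighborhood share a label.
    for label, genes in enumerate(neighborhoods, start=1):
        for gene in genes:
            writer.writerow([gene, label])
    return buffer.getvalue()

# One neighborhood with a single seed and one with two seeds.
seed_file_text = write_hub_seed_csv([["G1"], ["G7", "G9"]])
```

The resulting text can be saved with a “.csv” extension and read in through the step 1 import button.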
The output file assigns to each gene a cluster label.
Importantly, 0 is reserved for un-assigned nodes.
Sometimes we use the color grey in network analysis to denote these unclustered genes.
Output file:
Gene Names: the names of the genes.
Module Membership: 0 indicates that the corresponding gene is not in any module;
1 indicates that the corresponding gene is in module 1; 2 indicates
that the corresponding gene is in module 2, etc.
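For downstream analysis it is often convenient to group the genes by their module label. The Python sketch below assumes that the saved output is a comma-delimited file without a header row, with the gene name in the first column and the module label in the second. These layout details are assumptions, and the function name is hypothetical.

```python
import csv
import io
from collections import defaultdict

def group_by_module(text):
    """Split output rows into modules and the unassigned (label 0) genes."""
    modules = defaultdict(list)
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        gene, label = row[0].strip(), int(row[1])
        modules[label].append(gene)
    unassigned = modules.pop(0, [])  # label 0 is reserved for unassigned genes
    return dict(modules), unassigned

modules, unassigned = group_by_module("G1,1\nG2,0\nG3,2\nG4,1\n")
```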
If you want to learn more about the module detection, please contact us by email.
THE END