- UCLA Human Genetics

Manual for the MTOM software

Ai Li (1), Steve Horvath (1,2)
(1) Dept. of Biostatistics, School of Public Health, UCLA
(2) Dept. of Human Genetics, David Geffen School of Medicine, UCLA
Correspondence: liaipk@gmail.com or shorvath@mednet.ucla.edu

The MTOM software uses the multi-node topological overlap measure (MTOM) for gene neighborhood analysis and for module detection. To cite the MTOM software, please use the following references:

- Li A, Horvath S (2006) Network Neighborhood Analysis with the multi-node topological overlap measure. Bioinformatics. doi:10.1093/bioinformatics/btl581
- Li A, Horvath S (2007) Network Module Detection: Affinity Search Technique with the Topological Overlap Measure. Submitted.

Here we provide a brief user manual.

0. The Windows software can be downloaded from the following webpage: http://www.genetics.ucla.edu/labs/horvath/MTOM/

1. After double-clicking the MTOM icon, you should see the following screen, which allows you to input the data.

Fig. 1

Note that the MTOM software has 3 tabs:
a) Data Input
b) Finding Gene Neighbors
c) Module Detection
In the following, we explain these tabs.

2. Data Input

The software allows one to input gene expression data and to compute a gene co-expression network. Alternatively, one can input a co-expression similarity measure (assumed to take on values in the unit interval). This similarity measure can be transformed into an adjacency measure by raising it to a power (soft thresholding, which results in a weighted network adjacency) or by dichotomizing it (hard thresholding, which results in an unweighted network). Alternatively, one can input an adjacency matrix directly. Input format: square matrix where all entries take on values in the unit interval.

[1] Input Microarray Gene Expression Data

In most applications, the user will input gene expression data. In the following, we describe how to input gene expression data and how to compute a corresponding gene co-expression network.
Later, we describe how to carry out network neighborhood analysis and module detection.

The file format looks as follows:

Fig. 2
- The first column gives the names of the rows (genes). It is recommended because the names will be used as the gene IDs. If there is no such column, the software will assign integer IDs according to the order of the genes in the file.
- The first row gives the names of the columns (samples). It is not required.

The file should be a comma-delimited Excel “.csv” file. The first row can contain the column names, and the first column can contain the gene names (or probe set IDs). Rows correspond to genes, columns correspond to microarray samples. Missing data should be represented by a blank or by “NA” (which stands for not available).

To input a gene expression data file, select “Expression” and click “Input”. Then you will see the following input dialog:

Fig. 3
- If the first row contains the column (microarray sample) names, select “Yes”; otherwise select “No”.
- If the first column contains the gene names, select “Yes”.

Select the file you want to input in the file-open dialog and click “Open”. You may need to wait a few seconds if your file has more than 10,000 genes. A summary of the input data file will appear.

Fig. 4

Click “OK” to proceed. If the number of rows or columns does not look right, check the input data. Remember that this should be a comma-delimited csv file!

Since neighborhood analysis can be computationally intensive, it is often advisable to reduce the number of genes that are considered for the network construction. To do this, we provide multiple filters for reducing the number of genes. Click “OK” on the dialog that pops up.

Fig. 5

The software implements four filtering options:
a) coefficient of variation, which is defined as the standard deviation divided by the mean, i.e., CV=sqrt(variance)/mean
b) variance of the expression
c) mean expression
d) whole network connectivity

Usually, we filter genes based on the variance and the connectivity. The order in the panel below matters. For example, when choosing a number of 10000 for the variance and a number of 4000 for the connectivity, the program will first select the 10000 most varying genes, next compute the connectivity of each of these genes, and then restrict the analysis to the 4000 most connected genes (among the 10000 most varying genes). Since modules consist of highly connected genes, one does not lose much information when restricting the analysis to the most connected genes.

Fig. 6
- Select the filter(s) first. Multiple choices are allowed. The genes satisfying the filters will be used in subsequent analysis.
- Note that the connectivity is computed based on either a soft thresholding approach (weighted gene co-expression network) or a hard threshold.

Click “Filter” after choosing the filtering criteria. The computation of the connectivity may take a few minutes if you have more than 10,000 genes. After the filtering is done, you will be asked whether you want to save the filtered data set into a separate file. Regardless of whether or not you save the filtered data, the software will compute a co-expression network based on these filtered genes.

Correlation matrix: The next step is to compute a correlation matrix, which is a measure of similarity between the gene expression profiles. We have implemented an option to compute a leave-one-out correlation matrix, which is the average correlation after leaving out microarray samples one at a time. This may take a few minutes if you have more than 10,000 filtered genes.

Adjacency matrix: After the correlation matrix is computed, you can choose the thresholding approach. Soft thresholding with the power adjacency function will result in a weighted gene co-expression network (Zhang and Horvath 2005). Hard thresholding will result in an unweighted network.
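
The filter order and the two thresholding options described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the MTOM implementation: the function name `filter_then_threshold`, the default power of 6, the zeroed diagonal, and the definition of connectivity as the row sum of the adjacency matrix are all assumptions made for the example, and the input is assumed to contain no missing values.

```python
import numpy as np


def filter_then_threshold(expr, n_var, n_conn, power=6, hard_threshold=None):
    """Variance filter, then connectivity filter, on a genes x samples array.

    Hypothetical helper for illustration only. Returns the row indices of
    the retained genes and their adjacency matrix.
    """
    # Filter 1: keep the n_var most varying genes.
    variances = expr.var(axis=1, ddof=1)
    keep_var = np.argsort(variances)[::-1][:n_var]

    # Similarity: absolute Pearson correlation between expression profiles.
    sim = np.abs(np.corrcoef(expr[keep_var]))

    if hard_threshold is None:
        # Soft thresholding: raise the similarity to a power (weighted network).
        adj = sim ** power
    else:
        # Hard thresholding: dichotomize the similarity (unweighted network).
        adj = (sim > hard_threshold).astype(float)
    np.fill_diagonal(adj, 0.0)  # assumption: no self-connections

    # Filter 2: keep the n_conn most connected genes among those retained.
    # Connectivity is assumed here to be the row sum of the adjacency matrix.
    connectivity = adj.sum(axis=1)
    keep_conn = np.argsort(connectivity)[::-1][:n_conn]

    return keep_var[keep_conn], adj[np.ix_(keep_conn, keep_conn)]
```

Calling `filter_then_threshold(expr, n_var=10000, n_conn=4000)` would mirror the example in the text: the 4000 most connected genes among the 10000 most varying ones.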
The standard approach considers the absolute value of the correlation matrix, i.e. the co-expression information ignores the sign of the correlation. However, we have also implemented an option to keep track of the sign of the correlation, which results in a signed network, i.e. the neighbors will have positive correlations with the seed genes.

Standard weighted network:

$a_{ij} = |\mathrm{cor}(x_i, x_j)|^{\beta}$, where $\beta$ is a soft threshold.

A signed weighted network is defined by:

$a_{ij} = \left| \frac{1 + \mathrm{cor}(x_i, x_j)}{2} \right|^{\beta}$

Fig. 7
- Power adjacency function: input the power parameter here. All the correlations will be raised to this power in the adjacency matrix. The power adjacency function is recommended.
- Input the hard threshold parameter. Pairs of genes with correlations above this threshold will be connected (adjacency 1); otherwise the adjacency is 0.
- Check this box to get a signed network.

It may take several minutes if you input a data set with more than 10,000 genes.

[2] Input a similarity matrix data file

A similarity matrix specifies how similar two nodes are. The program allows the user to turn this similarity into a network by soft or hard thresholding. For example, the similarity measure could be the absolute value of a correlation matrix. After inputting it, you need to choose a thresholding approach to turn it into an adjacency matrix. The format of a similarity matrix data file is the same as that of the expression file described in Fig. 2. One just needs to make sure that it is a symmetric square matrix without missing entries. After inputting a similarity matrix, one needs to calculate the adjacency matrix just as described in Fig. 7.

[3] Input an adjacency matrix data file

Same input format as for a similarity matrix and for the expression file described in Fig. 2. Just make sure that it is a symmetric square matrix without missing entries.

[4] Input an interaction data file

The software also allows the user to input the network in two-column format. For example, Fig. 8 presents an example of a protein-protein interaction network file. This assumes an unweighted network, i.e. adjacencies are either 1 or 0.

Fig. 8 (columns: Interaction Part A, Interaction Part B)

3. Neighborhood Analysis

After inputting the data, we are now ready for neighborhood analysis. Click the tab “Finding Gene Neighbors”.

Fig. 9
- Input your initial seed genes here; multiple seeds are separated by a comma. To achieve the best result, highly correlated or topologically overlapping initial genes are recommended.
- Input the number of the closest neighboring genes you want (i.e. the neighborhood size).
- List of closest neighbors.
- The corresponding MTOM value.
- Click here to begin the search.
- Choose the search approach. A recursive search is recommended, but a non-recursive search is much faster.

After clicking search, the dialog box in Fig. 10 will appear if multiple seeds were input. Click “OK” to proceed.

Fig. 10
- Correlation of the two seeds.

This output allows one to determine whether the seed genes have a reasonably high correlation with each other. It does not make sense to input a pair of seed genes if they don’t have a reasonably high correlation (say larger than 0.5, but this depends to some extent on the number of microarrays).

4. Module Detection

This tab implements the Module Affinity Search Technique (MAST). The procedure forms modules around a set of hub neighborhoods, which can either be input directly or determined automatically by the program. The procedure is carried out in 3 steps, which are described in our article and in the figure below. Don’t try to understand the details from the figure. The main message is that you can either import a list of initial hub seeds or let the program find initial seeds automatically (step 1). In step 2, the hub seeds and neighborhoods are grown into preliminary modules. In step 3, the preliminary modules may be merged if their relative similarity passes a threshold.

The following tab describes module detection.
- Click here to let the software find hub neighborhoods automatically (step 1).
- Click to read in your own initial seeds of hub neighborhoods (step 1).
- Click here to extend the hub neighborhoods to preliminary modules (step 2).
- Merge threshold for step 3.
- Step 3: Merge preliminary modules.
- Click here to save the result.
- These toggles allow you to choose which clustering results you want to output.

Regarding step 1

The following file is not necessary when the initial hubs are chosen automatically. But if you want to input your own initial seeds or hub neighborhoods directly, use the following format: a comma-delimited csv file. The second column labels the initial seeds or hub neighborhoods. In other words, if you input 10 distinct hubs, then you end up with 10 rows whose labels run from 1 to 10. However, if you have 2 seeds for one initial neighborhood, then you would have 2 corresponding rows with the same label.

Potential application: This kind of input allows you to first use hierarchical clustering in R to find modules, and then to use hub nodes in those modules as seeds for MTOM.

- The first column gives the names of the genes/proteins in the hub neighborhoods.
- The second column gives the membership of the corresponding gene/protein: the genes/proteins with 1 belong to the first hub neighborhood, the genes/proteins with 2 belong to the second hub neighborhood, etc.

The output file assigns a cluster label to each gene. Importantly, 0 is reserved for unassigned nodes. Sometimes we use the color grey in network analysis to denote these unclustered genes.

Output file:
- Gene Names.
- Module Membership: 0 indicates that the corresponding gene is not in any module; 1 indicates that the corresponding gene is in module 1; 2 indicates that the corresponding gene is in module 2, etc.

If you want to learn more about module detection, please contact us by email.
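
The two-column file formats described above (gene name plus integer label) can be written and read with a few lines of Python. The sketch below is only an illustration: the function names `write_seed_file` and `read_module_file` are hypothetical, the gene names in the usage note are made up, and it assumes that the output file is comma-delimited and has no header row unless `has_header=True` is passed, neither of which the manual states.

```python
import csv
from collections import defaultdict


def write_seed_file(path, neighborhoods):
    """Write initial hub neighborhoods as a two-column csv file.

    `neighborhoods` is a list of lists of gene names; the first list gets
    label 1, the second label 2, and so on.
    """
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        for label, genes in enumerate(neighborhoods, start=1):
            for gene in genes:
                writer.writerow([gene, label])


def read_module_file(path, has_header=False):
    """Group gene names by module label; label 0 means 'not in any module'."""
    modules = defaultdict(list)
    with open(path, newline="") as fh:
        reader = csv.reader(fh)
        if has_header:
            next(reader, None)  # skip an assumed header row
        for row in reader:
            if len(row) < 2:
                continue  # skip blank or incomplete rows
            modules[int(row[1])].append(row[0])
    return dict(modules)
```

For example, `write_seed_file("seeds.csv", [["geneA", "geneB"], ["geneC"]])` writes two rows labeled 1 and one row labeled 2, matching the case of two seeds for one initial neighborhood.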
