Network Generation & Motif Detection in Biological Networks
Application of the MASS Library
Sai Badey
Winter 2014

This document outlines my research over the last quarter under Professor Fukuda. There are multiple ways to generate biological networks for comparison. This project identifies four key methods of network generation and their potential implications.

Applications of biological networks occur at both micro and macro levels. They map out connections between organic compounds within organisms, or represent the connections between different organisms or communities. The current work focuses on detecting the essential proteins in microorganisms. Existing software searches through networks, represented by vertices and edges, and locates recurring patterns. If a pattern occurs significantly more frequently in a network than in other similar networks, the pattern is labelled a network motif. These motifs are distinctive to each network and are invaluable in determining essential proteins in microorganisms.

Some important terminology is defined in this section. Biological networks are simply lists of connections between different components. Motifs are patterns emerging from these connections that are significant for being uniquely prevalent in a particular network. Isomorphs are different representations of a single pattern. They are difficult to detect, since many patterns seem unique until viewed from a specific angle. Vertices are the nodes, or agents, that form the core of the network. Edges represent the connections between the vertices. Although real-life connections are very complex, they are represented by edges that simply indicate whether or not a connection exists.

My work builds on a previous paper by Amala Ghandi, a former research student. Her work provides an efficient method of searching for patterns within a network.
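To make the terminology above concrete, the following is a minimal Python sketch. It is not taken from the project's MASS code, and every name in it (`normalize`, `are_isomorphs`, the example patterns) is hypothetical. It stores a pattern as a set of undirected edges and checks by brute force whether two small patterns are isomorphs, that is, the same pattern under a different vertex labelling. Trying every relabelling is only practical for motif-sized patterns of a few vertices, and the sketch assumes every vertex appears in at least one edge.

```python
from itertools import permutations

def normalize(edges):
    """Store undirected edges as a set of frozensets, so (A, B) equals (B, A)."""
    return {frozenset(e) for e in edges}

def are_isomorphs(edges_a, edges_b):
    """Brute-force check: does some relabelling of vertices map pattern a onto b?"""
    a, b = normalize(edges_a), normalize(edges_b)
    verts_a = sorted({v for e in a for v in e})
    verts_b = sorted({v for e in b for v in e})
    # Patterns with different vertex or edge counts can never match.
    if len(verts_a) != len(verts_b) or len(a) != len(b):
        return False
    for perm in permutations(verts_b):
        mapping = dict(zip(verts_a, perm))
        relabelled = {frozenset(mapping[v] for v in e) for e in a}
        if relabelled == b:
            return True
    return False

# Hypothetical example patterns
path_1 = [("A", "B"), ("B", "C")]              # A - B - C
path_2 = [("X", "Z"), ("Z", "Y")]              # X - Z - Y, same shape, other labels
triangle = [("A", "B"), ("B", "C"), ("A", "C")]
path_4 = [(1, 2), (2, 3), (3, 4)]              # chain of four vertices
star_4 = [(1, 2), (1, 3), (1, 4)]              # one hub with three neighbours

print(are_isomorphs(path_1, path_2))
print(are_isomorphs(path_1, triangle))
print(are_isomorphs(path_4, star_4))
```

Here `path_1` and `path_2` look different as edge lists but describe one pattern, while `path_4` and `star_4` have the same number of vertices and edges yet are different patterns. This is why isomorphs must be handled explicitly when counting.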
Ghandi's search method tracks patterns in which each vertex connects only to a higher-level vertex. This ensures that no pattern is counted more than once, which solves the problem of counting isomorphs.

My work explores the area of network generation. More specifically, it attempts to determine a proper method of network generation that creates networks that are random yet still contain the key characteristics of the original. Few previous studies of biological networks focus on the methods of network generation. This may be because the manner of network generation has no effect when scaled up to hundreds or thousands of graphs. Nonetheless, this is an important area to explore: if there is a difference, it will have great implications for many study results published thus far.

The research implements several methods of network generation, some of which are currently in use. The four main methods of random network generation are: (1) reordering the vertices while maintaining the same number of edges; (2) swapping the node degrees at random; (3) combining 1,000 graphs into a single graph from which only samples are drawn; and (4) "direct generation". All four methods ensure that the key aspects of the network are preserved. All of the vertices remain intact, and the number of interactions between the vertices is untouched as well. The only change is where the interactions occur. Some nodes have higher degrees of interaction than others. Such a node cannot suddenly become isolated, but which nodes it interacts with, and how many, can change. The first two methods capitalize on this by simply swapping the places of the nodes or the node degrees.
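As an illustration of randomizing where interactions occur while keeping the vertices and the number of edges fixed, the sketch below uses edge switching, a common degree-preserving randomization technique. This is an assumption chosen for illustration: it is not necessarily the exact procedure used for any of the four methods in this project, and the function names and the example graph are hypothetical. The input is assumed to be an undirected graph with no self-loops or duplicate edges.

```python
import random

def edge_switch(edges, n_swaps, seed=None, max_tries_per_swap=100):
    """Randomize an undirected simple graph while keeping every vertex degree.

    Repeatedly picks two edges (a, b) and (c, d) and rewires them to
    (a, d) and (c, b), skipping any swap that would create a self-loop
    or a duplicate edge.
    """
    rng = random.Random(seed)
    edge_list = [tuple(e) for e in edges]
    if len(edge_list) < 2:
        return edge_list  # nothing to swap
    edge_set = {frozenset(e) for e in edge_list}
    done, tries = 0, 0
    while done < n_swaps and tries < n_swaps * max_tries_per_swap:
        tries += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if rng.random() < 0.5:
            c, d = d, c  # pick the orientation of the second edge at random
        if len({a, b, c, d}) < 4:
            continue  # a shared endpoint would give a self-loop or a no-op
        new_1, new_2 = frozenset((a, d)), frozenset((c, b))
        if new_1 in edge_set or new_2 in edge_set:
            continue  # would duplicate an existing edge
        edge_set -= {frozenset((a, b)), frozenset((c, d))}
        edge_set |= {new_1, new_2}
        edge_list[i], edge_list[j] = (a, d), (c, b)
        done += 1
    return edge_list

def degrees(edges):
    """Count how many edges touch each vertex."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg

# Hypothetical example: a six-vertex ring with one chord
original = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1), (1, 4)]
shuffled = edge_switch(original, n_swaps=20, seed=42)
print(degrees(original) == degrees(shuffled))
```

Each accepted swap removes one edge at each of four vertices and adds one back, so no vertex gains or loses interactions and no vertex can become isolated. Only the partners change.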
The latter two of the four methods are already in use in several publications (referenced at the end of this paper), which supports their validity.

There are two main possible results: either there is no significant difference between the results of the network generation methods, or there is a significant difference. This is determined through z-score analysis and comparison of the results of the different generation methods. Since the methods differ so greatly from one another, a single seed cannot establish the validity of the results. Instead, several sets of graphs will be created for each of several networks, and the data will be drawn from these. This allows for a sound comparison of both performance time and results. If there is no significant difference between the generation methods, it is still possible that one method has a shorter performance time than the others. If that difference in performance time is significant, the faster method would be a viable standard for future network generation. However, if there is a significant difference between the methods, a larger issue arises: how to determine the accuracy of one network generation method over another?

This project has a wide range of applications across the fields of biology, both macro and micro. It can be essential in detecting certain types of diseases or cancers in medicine; it can help detect essential proteins in microbiology; and it can even be used to find key niches in ecosystems in zoology. Since all of these fields have massive amounts of information to process, a MASS-integrated approach to finding network motifs can have an enormous impact.

Currently, the scope of this project is to analyze a network through the MASS library. Then, several graphs will be generated through the four network generation methods outlined above using the MASS and JUNG libraries. These graphs will undergo different types of analysis (using the entire graph for smaller graphs, and using samples for larger ones).
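The z-score analysis mentioned above is commonly computed by comparing a pattern's count in the original network with the mean and standard deviation of its counts across the randomly generated networks. The report does not spell out its exact formula, so the sketch below assumes this standard definition. The function name and all counts are hypothetical, illustrative values and not results from this project.

```python
from statistics import mean, stdev

def motif_z_score(count_in_original, counts_in_random):
    """z = (original count - mean of random counts) / std of random counts."""
    mu = mean(counts_in_random)
    sigma = stdev(counts_in_random)  # sample standard deviation
    if sigma == 0:
        raise ValueError("random counts have zero variance; z-score is undefined")
    return (count_in_original - mu) / sigma

# Made-up counts of one pattern in five random graphs per generation method
random_counts_method_a = [10, 12, 11, 9, 13]
random_counts_method_b = [14, 15, 13, 16, 12]
count_in_original = 25  # made-up count of the same pattern in the original network

z_a = motif_z_score(count_in_original, random_counts_method_a)
z_b = motif_z_score(count_in_original, random_counts_method_b)
print(z_a, z_b)
```

Computing the z-score of the same pattern against graph sets from each generation method, over several seeds, gives one way to see whether the choice of method changes which patterns would be labelled motifs.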
In the future, this project will attempt to determine motifs across multiple graphs using the network motif processes, and to provide a fully fledged network comparison class that lets users choose the types of comparison and analysis they would like to perform.

Resources:
Amala Ghandi's research paper.
Sahand Khakabimamaghani, Iman Sharafuddin, Norbert Dichter, Ina Koch, and Ali Masoudi-Nejad. QuateXelero: An Accelerated Exact Network Motif Detection Algorithm (article).
Joseph Blitzstein and Persi Diaconis. A Sequential Importance Sampling Algorithm for Generating Random Graphs with Prescribed Degrees (article).
Bjorn H. Junker and Falk Schreiber. Analysis of Biological Networks (book).