Integrative Analysis of Heterogeneous Genomic Datasets to Discover Genetic Etiology of Autism Spectrum Disorders

by

Sumaiya Nazeen

B.Sc. in Computer Science and Engineering, Bangladesh University of Engineering and Technology (2011)

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Master of Science in Electrical Engineering and Computer Science at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY

September 2014

© Massachusetts Institute of Technology 2014. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, August 28, 2014

Certified by: Bonnie A. Berger, Professor of Applied Mathematics and Computer Science, Thesis Supervisor

Accepted by: Leslie A. Kolodziejski, Professor of Electrical Engineering, Chair, Department Committee on Graduate Students

Abstract

Understanding the genetic background of complex diseases is crucial to medical research, with implications for diagnosis, treatment, and drug development. As molecular approaches to this challenge are time-consuming and costly, computational approaches offer an efficient alternative. Such approaches aim at predicting and prioritizing genes for a particular disease of interest.
State-of-the-art gene prediction and prioritization methods rely on the observation that disease-causing genes have some sort of functional similarity based on either sequence, phenotype, protein-protein interaction (PPI) network, or functional annotation. Another increasingly accepted view is that human diseases result from perturbations of molecular networks, and genes causing the same or similar diseases tend to be close to one another in molecular networks. Such observations have built the basis for a large collection of computational approaches to find previously unknown genes associated with certain diseases. The majority of the methods are designed based on protein interactome networks, with integration of other large-scale omics data, to infer how likely it is that a gene is associated with a disease. In this thesis, we set out to address this outstanding challenge of understanding the genetic etiology of autism spectrum disorder (ASD), which refers to a group of complex neurodevelopmental disorders sharing the common feature of dysfunctional reciprocal social interaction. We introduce three novel methods for computing how likely a given gene is to be involved in ASDs based on copy number variations (CNVs), phenotype similarity, and protein interactome network topology. We also customize a random walk with restarts algorithm for ASD gene prioritization for the first time. Finally, we provide a novel integrative approach for combining CNV, phenotype similarity, and topology-related information with existing knowledge from literature. Our integrative approach outperforms the individual schemes in identifying and ranking ASD related genes. Our candidate gene set provides a number of interesting biological insights in that it is overrepresented in a number of interesting signaling, cell-adhesion and neurological pathways, molecular functions, and biological processes that are worth further investigation in connection with ASDs. 
We also find evidence for an interesting connection between gastrointestinal disorders, particularly inflammatory bowel diseases (IBD), and ASDs. The subnetworks we identify indicate the possible existence of subclasses of disorders along the autism spectrum.

Thesis Supervisor: Bonnie A. Berger
Title: Professor of Applied Mathematics and Computer Science

Acknowledgments

This thesis owes its existence to Professor Bonnie Berger. It has been an amazing experience to work with her. She has been an excellent source of encouragement and inspiration to me. She has been incredibly patient with me and always put my personal growth as a researcher first. I cannot thank her enough for teaching me how to approach the process of learning and research.

I am indebted to Dr. Rohit Singh for his constant help, advice, support, and mentorship in all aspects of my thesis. This work would not have been possible without his invaluable advice and support. I remember countless meetings with him in which I walked in frustrated, yet walked out encouraged and excited again. I'd like to thank Rohit for his warm support and patience in teaching me how to face the moments when progress seems slow.

I would like to thank Professor Isaac Kohane, Dr. Nathan Palmer, and Dr. Finale Doshi-Velez for sharing their knowledge of autism spectrum disorders with me.

Many thanks to the members of the Berger lab for sharing my exciting as well as frustrating moments. I'd like to thank Patrice for lightening up my days with her warm greetings. I am grateful to George, Hoon, Sean, and Christina for having discussions with me and encouraging me along in my research. Thanks to Andrew, Deniz, Jian, Noah, and William for being there whenever I needed help.

I owe my gratitude to the Bangladeshi Students Association at MIT, which has become my family in Boston. As always, I am ever grateful to my parents and siblings for their love and constant support.
Finally, I express my utmost gratitude to my greatest supporter: to the Almighty Allah, who has bestowed good health upon me, kept me free from anxiety, and filled my everyday with joy and hope.

Contents

Abstract
Acknowledgments
List of Figures
List of Tables
1 Introduction
  1.1 Motivation
  1.2 State of the art
    1.2.1 General Trends in Disease Gene Prediction
    1.2.2 Computational Advances in ASD Gene Prediction
  1.3 Contributions
  1.4 Outline
2 Predicting and Prioritizing Candidate Genes for ASD
  2.1 CNV Information Entropy based Prioritizer
    2.1.1 Copy Number Variation (CNV)
    2.1.2 Copy Number Variants in ASD
    2.1.3 Calculating Information Entropy Score from CNVs
    2.1.4 Quality of CNV Information Entropy based Prioritization
  2.2 ASD Similarity based Prioritizer
    2.2.1 Similarity of Phenotypes or Diseases
    2.2.2 Gene-Phenotype Association Data
    2.2.3 Calculating ASD Similarity Scores
    2.2.4 Performance of ASD Similarity based Prioritizer
  2.3 Diffusion State ASD Proximity based Prioritizer
    2.3.1 Diffusion State Distance (DSD) in PPI Network
    2.3.2 Calculating Diffusion State ASD Proximity (DSAP) of Genes
    2.3.3 Quality of DSAP-based Ranking
  2.4 Network Crosstalk based Prioritizer
    2.4.1 Motivation
    2.4.2 Problem Formulation
    2.4.3 Calculating Network Crosstalk Scores
    2.4.4 Dealing with Statistical Bias
    2.4.5 Performance of Network Crosstalk based Prioritizer
3 Integrative Approach for Identifying ASD Risk Genes
  3.1 Background
    3.1.1 Lasso-penalized Logistic Regression
  3.2 Predicting ASD Association via Logistic Regression based Integrative Approach
    3.2.1 Preparing Data for Training and Validation
    3.2.2 Constructing Lasso-regularized Binomial Regression Model
    3.2.3 Selecting Model Coefficients
    3.2.4 Creating Regularized Model and Making Predictions
  3.3 Performance Analysis
4 ASD Genetics: Implications from Candidate ASD Risk Genes
  4.1 Gene Sets for Analysis
  4.2 Hypergeometric Test for Enrichment
  4.3 Pathway Enrichment Analysis
    4.3.1 An Interesting Connection with Inflammatory Bowel Disease (IBD)
  4.4 Enrichment Analysis on GO gene sets
  4.5 Enrichment Analysis for Subnetworks
  4.6 Functional Analysis for Overlap with Diseases and Bio-functions
5 Conclusion
Appendix A SFARI Genes for Autism Spectrum Disorders
Appendix B Risk Genes for ASDs Identified by Integrative Approach
Appendix C Subnetworks in ASD Risk Gene Set
Bibliography

List of Figures

2-1 Copy number variations in a pair of chromosomes.
2-2 Steps in CNV-based prediction-prioritization of ASD genes.
2-3 Receiver operating characteristic curves for CNV-based prioritizer using different scaling factors.
2-4 Lift chart for CNV-based prioritizer.
2-5 Receiver operating characteristic curve for ASD similarity based prioritizer.
2-6 Lift chart for ASD similarity based prioritizer.
2-7 Receiver operating characteristic curve for Diffusion State ASD Proximity (DSAP) based prioritizer.
2-8 Lift chart for Diffusion State ASD Proximity (DSAP) based prioritizer.
2-9 Receiver operating characteristic curves for network crosstalk based prioritizer using different restart probabilities (r).
2-10 Lift chart for network crosstalk based prioritizer.
3-1 Performance curves for integrative approach on training data.
3-2 Receiver operating characteristic curves for different ASD gene prediction-prioritization methods.
3-3 Lift chart of integrative approach for ASD gene prediction-prioritization.
4-1 Significant GO biological processes associated with ASD risk gene set.
4-2 Significant GO molecular functions associated with ASD risk gene set.
4-3 Top four subnetworks in ASD risk gene set generated by QIAGEN's Ingenuity® Pathway Analysis (IPA).

List of Tables

1.1 Summary of general trends in disease gene prediction-prioritization methods.
3.1 Selected regression coefficients for the integrative approach from logistic regression in order of predictive value.
3.2 Selected logistic regression coefficients for integrating different ASD association scores in order of predictive value.
3.3 Selected logistic regression coefficients for integrating ASD-pathway membership information with weights in order of predictive value.
4.1 Canonical pathways having significant overlap with ASD risk genes.
4.2 IBD-related pathways having significant overlap with ASD risk genes.
4.3 Top 10 diseases having significant overlap with ASD risk genes found by QIAGEN's Ingenuity® Pathway Analysis (IPA).
4.4 Top 30 functions having significant overlap with ASD risk genes found by QIAGEN's Ingenuity® Pathway Analysis (IPA).
A.1 ASD risk genes reported by SFARI gene module.
B.1 Probabilities of association with ASDs for candidate genes identified by our integrative analysis approach.
C.1 Subnetworks in ASD risk gene set generated by QIAGEN's Ingenuity® Pathway Analysis (IPA).

Chapter 1

Introduction

1.1 Motivation

Identifying disease-causing genes is a fundamental challenge in human health with applications in understanding disease mechanisms, diagnosis, and therapy. Many approaches have been adopted for discovery of candidate genes to date [124]. Traditional genetic mapping methods include linkage analysis and genome-wide association studies (GWAS) of Mendelian diseases and complex traits.
While GWAS are powerful and effective, they face challenges in narrowing down long lists of candidate genes [5]. Furthermore, diseases often do not follow the simple genotype-phenotype model, but are rather the consequences of perturbations of multiple genes connected in a molecular network, induced by various factors such as genetic mutations, epigenetic changes, and pathogens [114]. Efforts towards discovering the properties of disease genes in molecular networks have shown that genes associated with the same or similar diseases tend to have some degree of functional similarity. Such similarity can be based on sequence [36], functional annotation [89], protein-protein interactions [34,56,85], etc. [84]. These findings became the basis for the development of computational approaches for predicting and prioritizing disease genes. While traditional disease-causing gene identification methods are time-consuming and costly, these computational approaches offer an efficient alternative.

Autism spectrum disorder (ASD) refers to a group of neurodevelopmental disorders defined by three categories of deficits: abnormal development or impairment of social interaction, abnormal development or impairment of communication skills, and stereotypic and repetitive behaviors [9]. Recent estimates show that ASDs are prevalent in 0.75% to 1% of the population [33,53,54]. Among the conditions encompassed by ASDs, pervasive developmental disorder-not otherwise specified (PDD-NOS) and autistic disorder are the most common, whereas Asperger syndrome appears less frequently. ASD is almost five times more common among boys (1 in 42) than among girls (1 in 189) [28], an effect that becomes even more pronounced in so-called high-functioning cases.

Before the 1970s, autism was not widely appreciated to have a strong genetic basis. Instead, various psychodynamic interpretations, including the role of a cold style of mothering, were considered as potential causes.
The importance of genetic contributions came to light in the 1980s, when the co-occurrence of chromosomal disorders and rare syndromes with ASDs was identified [16]. Subsequent twin and family studies provided support for a strong genetic component, but the lack of uniform diagnostic criteria limited the power of those studies. The development of validated diagnostic and assessment tools for ASDs, such as ADI-R and ADOS, in the 1990s has proven crucial to the advancement of ASD research, and since then the diagnosis of ASDs has been gaining in frequency. These tools, in concert with important technological advances, have made it possible to carry out a range of studies such as candidate gene association studies, resequencing studies, and genome-wide assessment of copy number variations (CNVs). This ability has led to the identification of a large number of autism susceptibility genes and to increased attention to the effects of de novo and inherited CNVs, thus supporting the notion that genetic factors are a predominant cause of ASDs. Moreover, higher ASD concordance rates in monozygotic twins (36-95%) compared to dizygotic twins (0-31%) [40,95,96,108] and increased risk (at least 2-18%) in families with a history of related disorders [46,86,106] also suggest a strong genetic component behind ASD.

However, genetic studies have been able to connect only 1-2% of autism cases to individual mutations in the autism susceptibility genes and loci, and about 20% of cases to their combined effect [2]. One difficulty in studying genetic causes of ASDs is that different conditions are caused by different genetic mutations. In addition, since a condition is caused by a combined effect of many mutations, the individual effect of each mutation is often small and thus hard to detect. An additional difficulty in studying ASD relates to its heterogeneous nature.
Specifically, the ASD population exhibits a wide range of conditions characterized by impairments in reciprocal social interaction and communication, as well as restricted and repetitive behaviors. Although some common pathways related to ASD have been identified [21,91,98], this heterogeneity makes genetic analysis of ASDs challenging. Furthermore, small sample sizes limit the statistical power of most studies. Thus, to comprehensively identify risk genes and molecular pathways in ASDs, we need to perform either molecular analysis with substantially larger sample sizes, stratifying patients into more homogeneous groups by diagnostic criteria, sex, or family history; or more sophisticated computational analysis.

Towards understanding the genetics of ASDs over the past two decades, researchers have mainly focused on linkage studies, genome-wide association studies, and microarray gene expression studies. Linkage studies aim at finding the rough location of a disease gene relative to another DNA sequence called a genetic marker, whose position is already known. Affected families are genotyped using a collection of genetic markers across the genome, and how those genetic markers segregate with the disease across multiple families is examined. Most autism-related linkage studies have identified linkage regions reaching the threshold of suggestive linkage at best [35]. Loci on most chromosomes have been suggested to harbor ASD risk, but only a few of them have been independently identified. To date, only loci 7q22-23 [80,81] and 17q11-21 [22,104,123] have been replicated and considered significant on a genome-wide scale. Currently, there are over 25 different loci that may be considered to contain autism susceptibility candidate genes (ASCG), and many more complicated loci are under observation [2]. The lack of genome-wide significant results in most published linkage studies is a consequence of small sample sizes.
Thus the establishment of collaborative groups, such as the International Molecular Genetic Study of Autism Consortium (IMGSAC) and the Autism Genome Project (AGP) Consortium [80,107], and of shared resources, such as the Autism Genetic Research Exchange (AGRE) Consortium [37], have become important steps in facilitating the identification of ASD candidate genes [14].

Unlike linkage studies, genome-wide association studies examine many common genetic variants in different individuals (either in case-control groups or within families) to see if any variant is associated with a disease. Association studies have identified a good number of genome-wide significant chromosomal variations, including CNVs (copy number variations: the presence of a variable number of copies of a particular gene in the genotype of an individual compared to a reference genome), which play an important role in the etiology of ASD [101]. De novo CNVs, hypothesized to be ASD-specific, have been found in up to 7-10% of sporadic ASD [14,74]. To date, more than two thousand CNV loci, harboring both rare and common variants [7,20,29,62,83,90,116], have been identified in more than three hundred studies, implicating a very large number of candidate genes. The challenge for CNV studies is to narrow down this list of candidate genes.

Besides linkage and association studies, microarray gene expression studies are also being conducted to provide important insights into genes and pathways that might be dysregulated across ASDs [12,15,39,41,78] and within individual subtypes of pervasive developmental disorders. Gene expression studies measure the activity (i.e., expression) of thousands of genes across the genome at once, to create a global picture of a specific cellular function or disease. But these studies often suffer from small sample sizes and from probe- and platform-specific artifacts [100].
However, the availability of vast collections of omics data from all these different types of studies motivates the development of sophisticated computational approaches to extract knowledge that will help us better understand the biological underpinnings of ASD. This goal is further motivated by the recent successes of computational methods in detecting and ranking causal genes for various complex diseases, including glioblastoma multiforme (GBM) [52], pancreatic cancer [120], and type 2 diabetes [111].

1.2 State of the art

In this section, we provide a brief overview of the computational methods currently available for predicting and prioritizing genes for diseases in general and ASDs in particular. As the challenge of predicting and prioritizing disease-causing genes is central to human health research, a large collection of computational methods has been developed to solve the general problem. The vast majority of these approaches are based on the human protein-protein interaction (PPI) network. We describe the main themes of these approaches as well as some representative methods. Researchers have also started to design methods specifically for the problem of ASD gene prediction and prioritization; we discuss the most recent work here.

1.2.1 General Trends in Disease Gene Prediction

General trends in designing computational methods for disease gene prediction-prioritization can be grouped loosely under four categories as discussed below (Table 1.1).

Methods Based on Protein Proximity in PPI Networks

Many of the current approaches for disease gene prioritization are based on the proximity of candidate genes to known disease genes within interactome networks using different scoring schemes. The intuition behind this is the 'guilt-by-association' hypothesis, which suggests that genes that are physically or functionally close to each other tend to be involved in the same biological pathways and have similar phenotypic effects [4,82].
Thus a key step in these approaches is to measure the distance between candidate genes and known disease genes in the PPI network. Approaches to measure proximity of elements in the PPI network are based on direct neighborhood, shortest path length, diffusion kernel, random walk with restart, propagation flow, etc. [117]. Oti et al. [85] predicted disease-causing genes in known disease loci by counting the number of known causative genes that are direct network neighbors (Table 1.1). The authors achieved approximately 10-fold enrichment by comparing their candidates to a random selection of candidate genes at the same locus. Krauthammer et al. [57] assigned known disease genes as seed nodes and computed the shortest path length between these and other nodes in the network. A node that is in close proximity to multiple seed nodes receives a higher score as a candidate disease gene. However, Köhler et al. [56] demonstrated that the closeness of two proteins cannot be fully captured by their shortest path length. Different network structures surrounding two proteins imply different degrees of closeness between them. This can be captured by global distance measures, such as random walk with restarts and similarity-based diffusion kernel, by allowing equal probability of each protein diffusing along the links of the PPI network. The authors tested 783 genes under 110 disease families and achieved an area under the Receiver Operating Characteristic (ROC) curve of up to 98% on simulated linkage intervals containing 100 genes. Navlakha and Kingsford [75] compared the performance of disease gene prediction using different proximity measurements, including network neighbors, random walk with restarts, propagation flow, unsupervised graph partitioning, Markov clustering, and semi-supervised graph partitioning. They reported random walk with restarts to give the best performance in terms of precision and recall.
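
To make the random walk with restarts measure concrete, the following is a minimal sketch of the commonly used iterative formulation p(t+1) = (1 - r) W p(t) + r p(0), where W is the column-normalized adjacency matrix of the PPI network, p(0) places equal probability on the known disease (seed) genes, and r is the restart probability. The function name, the toy network, and the default parameter values are illustrative assumptions and are not taken from any of the cited implementations.

```python
import numpy as np

def random_walk_with_restart(adjacency, seeds, restart_prob=0.7,
                             tol=1e-10, max_iter=1000):
    """Return steady-state visiting probabilities of a walker that
    restarts at the seed nodes with probability `restart_prob`.

    adjacency: square symmetric 0/1 (or weighted) matrix of the network.
    seeds: integer indices of known disease genes (hypothetical here).
    """
    A = np.asarray(adjacency, dtype=float)
    col_sums = A.sum(axis=0)
    col_sums[col_sums == 0] = 1.0      # isolated nodes keep an all-zero column
    W = A / col_sums                   # column-normalized transition matrix

    p0 = np.zeros(A.shape[0])
    p0[list(seeds)] = 1.0 / len(seeds)  # equal initial mass on each seed
    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1.0 - restart_prob) * (W @ p) + restart_prob * p0
        if np.abs(p_next - p).sum() < tol:  # L1 change below tolerance
            return p_next
        p = p_next
    return p

# Toy network (an assumption for illustration): a chain 0 - 1 - 2 - 3,
# with node 0 as the only known disease gene.
toy_network = [[0, 1, 0, 0],
               [1, 0, 1, 0],
               [0, 1, 0, 1],
               [0, 0, 1, 0]]
scores = random_walk_with_restart(toy_network, seeds=[0])
ranking = np.argsort(-scores)  # candidates ordered by proximity to the seed
```

In this sketch, candidate genes would be ranked by their steady-state probability, so nodes reachable from the seeds through many short paths score higher than nodes connected by a single long path, which is the property that shortest path length alone does not capture.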
Navlakha and Kingsford also proposed a consensus method combining all closeness measures, which could capture different topological properties of the PPIs and yielded better performance than the individual measures.

Methods Integrating Large-scale Genomic Data

In addition to being proximal in the interactome network, disease genes are assumed to share common features in gene ontology annotations, gene expression, protein sequences, and domains and are likely to be involved in similar biological and functional pathways [38]. Thus, a number of computational methods have been designed to integrate genomic data from multiple sources to achieve better performance [45]. Endeavour, a prioritization algorithm through genomic data fusion, integrates functional annotations, microarray expression, expressed sequence tag (EST) expression, literature, protein domains, PPIs, pathway membership, cis-regulatory modules, transcriptional motifs, sequence similarity, and user data, and ranks the candidate genes based on their similarity to known disease genes for each of these features [3]. A global ranking to prioritize candidate genes is generated by combining the ranks of individual features using order statistics. Prioritizer, a Bayesian classifier based tool, consolidates data from different sources, such as gene ontology, gene expression, and PPIs, onto functional networks [34]. The closeness in the functional network of a candidate gene in one susceptibility locus to genes residing in another locus is assessed, and a shorter distance is assigned a higher score. Prioritizer achieves 2.8-fold enrichment compared to random selection. While Prioritizer requires at least two susceptibility loci, Linghu et al. [65] performed genome-wide prioritization by constructing an evidence-weighted functional linkage network of 21,657 genes based on 16 data sources using a naive Bayes classifier. Candidate genes were assigned a score based on the sum of the weights of the network links to known disease genes.
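
The link-weight scoring rule just described can be illustrated with a short sketch. It assumes the weighted functional linkage network is already available as a nested dictionary of link weights; the gene names, weights, and function name below are hypothetical placeholders, and the naive Bayes step that would produce the weights is not shown.

```python
def linkage_score(candidate, disease_genes, link_weights):
    """Sum the weights of links from `candidate` to known disease genes."""
    neighbors = link_weights.get(candidate, {})
    return sum(w for gene, w in neighbors.items() if gene in disease_genes)

# Hypothetical evidence-weighted network: gene -> {neighbor: link weight}.
link_weights = {
    "GENE_A": {"DIS_1": 0.8, "DIS_2": 0.5, "GENE_B": 0.9},
    "GENE_B": {"DIS_1": 0.2, "GENE_A": 0.9},
    "GENE_C": {"GENE_B": 0.4},
}
disease_genes = {"DIS_1", "DIS_2"}

# Rank candidates by their summed link weight to the known disease genes.
ranking = sorted(
    link_weights,
    key=lambda g: linkage_score(g, disease_genes, link_weights),
    reverse=True,
)
```

Note that only links to known disease genes contribute: the strong GENE_A to GENE_B link does not raise either score in this sketch, because neither gene is in the known disease set.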
The method of Linghu et al. achieved a 62% success rate on monogenic, polygenic, and cancer disease families, a marked improvement over the 44% success rate achieved by PPI network-only methods, confirming the importance of data integration in prioritizing disease genes.

Methods Integrating Phenotype Similarity

Diseases with similar phenotypes often share either a common set of underlying genes or functionally related genes [38]. Several studies reported that the integration of disease phenotype networks and PPI networks outperforms other approaches in the gene prioritization task [24,36,58,63,113,121,122]. Wu et al. [121] used a simple linear regression method called CIPHER (Correlating protein Interaction network and PHEnotype network to pRedict disease genes) to model the correlation between the phenotype similarity profile and closeness profile in the PPI network. The algorithm used the phenotype similarity data from van Driel et al.'s [112] text mining results along with curated PPIs from the Human Protein Reference Database (HPRD), Biomolecular Interaction Network Database (BIND), Molecular Interaction Database (MINT), and Online Predicted Human Interaction Database (OPHID). For each disease-gene pair, it calculated the Pearson correlation coefficient between the disease similarity profile and the gene closeness profile, which was recorded as a concordance score representing the association of the gene with the disease. CIPHER's performance was shown to be reliable and comparable to Endeavour. Based on the same phenotype similarity metric computed by van Driel et al. [112], Vanunu et al. [113] developed a slightly different method named PRINCE (PRIoritizatioN and Complex Elucidation). They calculated association between a query disease and a gene with a known disease association using a logistic function dependent on the phenotype similarity between the query disease and the known disease.
This disease-gene association was then used as prior knowledge in the prioritization function and iteratively smoothed over the network using propagation flow. PRINCE outperformed CIPHER by approximately 10% in ranking the real disease gene as the top-scoring one when prioritizing genes for 1369 diseases with a known causal gene. Li and Patra [63] constructed a heterogeneous network by integrating the PPI network and phenotype network based on disease-gene relationships in the Online Mendelian Inheritance in Man (OMIM) database [1]. They developed an algorithm, RWRH (Random Walk with Restart on Heterogeneous network), which extends the random walk with restart algorithm from the PPI network alone to the entire heterogeneous network of PPIs and phenotypes. The authors reported that RWRH was superior to CIPHER in prioritizing disease genes under three circumstances: known disease genes and genetic loci, known disease genes but no known genetic loci, and no known disease genes or loci.

L~3 - integrating in PPI large-scale genomic data Methods network tein proximity et et et al. [3] Linghu et Radivojac al. [92] al. [65] et Franke at al. [34] Aerts Kingsford [751 and al. [56] al. [85] Navlakha Kohler Oti Reference - PhenoPred Prioritiser Endeavour - - - Name Method GO, SEQ,
PPIs are Shortest path length Curated HPRD, OPHID tive networks, then score candidate genes grated network baspd on distance to known disease genes Use a Bayesian classifier to build integraShortest path length in the inte- ing order statistics to obtain final rank. each feature, then combine the ranks us- similarity to known Rank each candidate gene based on their random forest classifier. scale experiments Direct neighbor for the ensemble decision trees using a surement score is above a threshold; (ii) ants partitioning, Markov clustering and their vari- graph with the disease and the network mea- neighbor, clustering, direct (i)Predict a gene as disease gene if it is flow, agation imity scores to known disease genes. Rank candidate genes based on the prox- locus lacking identified disease genes. ease gene and resides in a known disease if it directly interacts with a known dis- Predict a candidate gene as disease gene Prediction methods Random walk with restart, prop- walk with restart length, diffusion kernel, random [971, mapped Direct neighbor, shortest path Direct neighbor of BIND, HPRD, and large- BIND HPRD, OPHID fly, and yeast measurement elements In networks Proximity BioGrid, fly, from worm, mouse, fruit- STRING IntACT, HPRD, worm, yeast HPRD, human Y2H, PPI data source Curated PPI, Y2H, SEQ, DO PPI, GO, Structure, PPI, GO, EXP and others TRANSFAC, TOUCAN, PDS, (microarray TXT, EST), KEGG, and EXP PPI, PPI PPI PPI Features Summary of the trends in disease gene prediction-prioritisation methods. Methods based on pro- Category Table 1.1 C~3 - Continued. al. [24] Vanunu et al. [113] Li and Patra [63] et al. [122] et Wu Care al. [121] et Wu PRINCE RWRH - AlignPI CIPHER - PPI PPI PPI, SNPs PPI PPI PPI ease conditions PPI, EXP under dis- MINT, BIND, HPRD, Y2H PPI data source IntAct, for each candidate gene Candilate method to candidate genes. 
(continued on next page) use the final converged flow as scores for smooth flow in the PPI network and then propagation Use evidence weighted PPI network network networks. walker jump between PPI and phenotype eases simultaneously by allowing random and disUse RWRH heterogeneous network to score genes and deleterious SNPs. on PPI networks, phenotype similarities ing the same learning approach based forest, then predict disease genes us- a positive prediction. Predict deleterious SNPs using random highest scoring sub-network is taken as networks.The candidate gene with the associated and used in constructing sub- genes are first assumed to be disease- high scoring sub-networks. PPI and phenotype networks and obtain Use NetworkBlast algorithm to align the profile and PPI profile. based on linear regression of phenotype dance score Use correlation coefficients as concor- complex to the disease phenotype. phenotype caused by the genes in the gene by measuring the similarity between Random walk with restart on Direct neighbor - length Direct neighbor, shortest path weighted PPI network tual pull down, then score the candidate expectation gene cover algorithm. Construct candidate complexes by vir- (virtual down) counting in the evidence- neighbor disease related genes using the Maximum Find smallest set of genes that cover the Prediction methods pull Direct Shortest path length of periments BioGrid, BIND, measurement Network propagation flow in the IntAct, OPHID, Proximity elements in networks HPRD and large scale ex- HPRD MINT BIND, HPRD MINT HPRD, Reactome [27,72] al. [58] - Features tion et et at. [49] Name Method Pprel [47,481, Ecrel [47,48], Lage Karni Reference ease phenotype informa- Methods integrating dis- Category Table 1.1 - methods based Continued. Disease-module Category Table 1.1 - - Taylor et al. [109] Chen et al. [261 Gentrepid Name Method George et al. 
[36] Reference compar- First, known disease Second, net- disease- specific QTL work, co-expression - for each hub assess the average Pearson path length Identify causal subnetworks (continued on next page) works as causal genes. traits. Mark genes in the causal subnet- by testing for enrichment of expression hood test. wise regression and multivariate likeli- pleiotropic effects using forward step- works within these; Identify QTL with works, identify highly connected subnet- Combine the gene expression and genotype data to construct co-expression net- and disease-related genes. architecture, etc. For each hub, identify tion, linear motifs, globularity, domain ing hubs based on length, phosphoryla- move insignificant hubs Classify remain- for each interaction and the hub and re- correlation coefficient of co-expression Identify hubs in the global network, then betweenness centrality, shortest curated tervals. modules between proteins linking the in- ease to find common pathways or shared intervals associated with the same dis- comparing all the genes in the multiple candidate disease genes are predicted by without knowledge of the disease genes, ated with the same disease. genes in chromosomal intervals associ- genes are used to predict novel disease ease intervals. tion of disease genes within known dis- ule profiling for the automated predic- Combines two methods - common pathway scanning (CPS) and common mod- protein domains Prediction methods Common pathway, similarity of of OPHID, yeast, literature- OPHID measurement elements in networks Proximity ray) pathways, PPI data source PPI, EXP (microar- PPI ison, Domain Features 01 - Continued. [67] al. [301 al. et Liu et Dess6 Reference - GNEA Name Method - Proximity measurement PPI, EXP MetaCore [191 shortest path length interactions and identify signifi- Construct shortest path subnet- network. 
subnetwork as well as in the global PPI shortest paths through the gene in the the subnetwork based on the number of late topological score for each gene in shortest paths to disease genes. Calcu- works containing only the nodes in the work. genes and map them on to PPI net- background distribution. Identify differentially expressed disease enriched based on comparison against a the number of conditions in which it was For each gene set, assign a p-value to sentation in each identified subnetwork works. Test each gene set for over repre- cantly transcriptionally affected subnet- tein tein in a global network of proteinOpro- sets gene in each insulin resistance or di- Map the relative mRNA expression of ev- Prediction methods abetes condition to the associated pro- of ually curated gene cumulative expression level elements in networks ery HPRD PPI data source pression data, man- PPI, GO, DGAP ex- Features genes/proteins; QTL, quantitative trait loci. PPI, protein-protein interaction; Y2H, yeast two hybrid experiment; PDS, protein domain sharing; PG, phylogenetic profiles; GN, gene neighbor; GO, gene ontology; EXP, gene expression; KEGG, Kyoto encyclopedia for genes and genomes for pathway membership; TOUCAN, cis-regulatory modules; TRANSFAC, transcriptional motifs; SEQ, sequence similarity; DO, disease ontology; TXT, literature text mining; Masspec, mass spectrometry; DDI, domain-domain interactions; SNPs, single nucleotide polymorphisms; DIP, database of interacting proteins; STRING, search tool for the retrieval of interacting Category Table 1.1 Disease Module-based Methods In addition to generic candidate gene prioritization methods, significant efforts have been made towards the prediction of disease genes for individual diseases by constructing disease modules [11]. 
These methods start with identifying the disease modules or subnetworks, in which members would share similar functions, expression patterns or metabolic pathways, assuming that breakdown of one such module causes a disease. This concept has been applied to a wide range of diseases, including several different types of cancers [25,59,77,109], type 2 diabetes [67], obesity [26], asthma [44], neurological diseases [43,73,93], and psoriasis [30]. Liu et al. [67] used a network based approach to identify an insulin signaling module as well as a network of molecular receptors that play significant roles in type 2 diabetes. Chen et al. [26] identified subnetworks in liver and adipose tissues that contain genes for which variants associated with obesity and diabetes have been identified. Taylor et al. [109] constructed disease-associated protein interaction modules for adenocarcinoma of the breast, providing useful predictors for breast cancer outcome.

A slightly different approach was developed to prioritize disease-specific genes by constructing disease- and condition-specific subnetworks [30]. Disease-specific genes, differentially expressed under disease conditions, were mapped to the global PPI network. The shortest path subnetwork was then built by including only the nodes in the shortest path connecting the disease-specific genes. Each node in this subnetwork was evaluated and assigned a topological score by comparing the number of shortest paths through it in the subnetwork to the number of shortest paths in the global network. This scheme was able to identify novel candidate genes for psoriasis.

1.2.2 Computational Advances in ASD Gene Prediction

To implicate ASD risk genes, Liu et al. recently developed an algorithm called DAWN (Detecting Association With Networks) [66]. The algorithm is based on the intuition that ASD genes cluster within a co-expression network [87,119].
DAWN uses two kinds of data: rare variations from exome sequencing and gene co-expression in the mid-fetal prefrontal and motor-somatosensory neocortex. The algorithm casts the ensemble data as a Markov random field in which the graph structure is determined by gene co-expression and it combines these interrelationships with node-specific observations, namely gene identity, expression, genetic data, and the estimated effect on disease-risk.

The algorithm works as follows: first it identifies 'hot spots' within the co-expression network at which multiple ASD risk genes (identified from exome data) cluster together. For these hot spots, it uses evidence from neighboring genes to reinforce ASD signal, while in 'cooler' regions the absence of neighboring genes with evidence of ASD association downgrades the signal. By modeling this data, DAWN was able to identify 127 ASD risk genes, many of which are novel. It was also successful in predicting some known ASD genes that were not included in the genetic data used to create the model. In addition, the method was able to find three interesting sub-networks in support of the role of aberrant connectivity of neuronal circuits due to intrinsically abnormal synapses in ASD. Although DAWN's findings are currently limited by the power of test statistics derived from available samples with exome sequencing, its success shows that computational approaches hold sufficient promise in identifying ASD associated genes.

1.3 Contributions

To address the classic problem of disease-gene prediction in the context of ASDs, this thesis designs three novel computational methods, one modified random walk with restarts method, and a novel integrative method for combining these four with prior knowledge.
While the recent computational approach for solving the problem of ASD gene prediction focuses mainly on rare variations from exome sequencing and gene co-expression data, our methods focus on computationally extracting knowledge from other data sources, including copy number variations (CNVs), phenotype similarity to ASD, and proximity to ASD genes in the PPI network.

Our first method utilizes the copy number variations that have been observed in the ASD population as well as appropriate control groups. We calculate an information entropy based score for all the genes that can be mapped to the reported CNV loci, taking into account their frequency of occurrence in ASD case-control groups. To the best of our knowledge, this is the first information theoretic approach to extract knowledge from disease CNVs.

In our second method we incorporate phenotype similarity information to quantify functional association of ASD genes to the rest of the genes. Our method incorporates disease/phenotype similarity scores computed by van Driel et al. [112] and gene-phenotype relationships from the Online Mendelian Inheritance in Man (OMIM) database [1]. This method is seeded by high confidence ASD genes from the literature to identify ASD-like phenotypes in OMIM. Genes involved in diseases with phenotypes similar to ASDs are ranked highly by this algorithm.

In our third method, we use the power of topological proximity in the network. We introduce a new diffusion based proximity metric for the proteins in the PPI network, namely Diffusion State ASD Proximity (DSAP). DSAP is defined on diffusion state distances (DSDs) in the PPI network, which are superior to direct neighborhood and shortest path distances in capturing the functional association of proteins in the PPI network. DSAP of a gene is calculated based on its diffusion state distances to ASD seed genes.
Since random walks with restarts are one of the most effective approaches in solving the generic disease-gene prediction problem, we customize this approach specifically for the ASD context for the first time. Our approach uses the global PPI network structure and can be considered as a generalization of Google's PageRank algorithm. This method starts with identifying high confidence ASD genes from the literature and runs a random walk with restarts on the connected PPI network to simulate network crosstalk between the genes in the network. The simulated crosstalk gives a quantification of the functional association of ASD genes to the rest of the genes in the network. All these methods are shown to perform better than random selection.

Finally, we propose a novel integrative approach which incorporates CNV, phenotype similarity, and connectivity, proximity, and topological similarity in the PPI network with ASD-pathway knowledge from available literature. Each gene is assigned an association probability based on a logistic regression model. Lasso regularization with cross validation is performed to avoid over-fitting of the model. We show that the integrative approach significantly outperforms the above four methods.

We provide a number of interesting biological insights into the mechanism of ASDs by performing a series of analyses on the candidate genes selected by our integrative method. Pathway enrichment analysis reveals that our candidate gene set is overrepresented in a number of pathways related to signal transduction, cell adhesion, and nervous system development. These pathways can be useful in explaining the pathophysiology of ASDs. We also find an interesting link between ASDs and Inflammatory Bowel Diseases (IBD) in that our candidate gene set has significant overlap with the majority of the IBD-related pathways.
Furthermore, we identify a number of disjoint subnetworks in our candidate gene set, characterized by different categories of diseases and bio-functions, which provide an indication of the existence of subclasses of disorders in the autism spectrum. The topmost subnetwork, characterized by gastrointestinal disorders, is particularly interesting. Functional and gene ontology enrichment analyses help us identify a number of interesting molecular functions and biological processes in which the candidate genes are overrepresented. For some of these terms, their connection to ASDs is not so obvious and thus worth further investigation.

1.4 Outline

In Chapter 2, we describe three novel computational methods for predicting and prioritizing ASD genes. We also introduce a random walk based approach for solving the disease gene prediction problem in the context of ASDs for the first time. In Chapter 3, we describe a novel integrative analysis approach which outperforms the individual methods described in the previous chapter in identifying and ranking ASD genes. We select a set of candidate genes which are highly likely to be associated with ASDs. We perform a series of analyses to find significant pathways, bio-functions, diseases and subnetworks in which the candidate gene set is overrepresented. The methodology of the analyses as well as the results and their biological implications are discussed in Chapter 4. Finally, we present closing remarks and discussion in Chapter 5.

Chapter 2

Predicting and Prioritizing Candidate Genes for ASD

In this chapter we introduce three novel methods for gene prediction-prioritization for ASDs. The first one is based on the copy number variations observed in the ASD population as well as appropriate control groups. The second method incorporates disease similarity information with gene-phenotype mappings from OMIM to quantify the association of a gene to ASDs.
The third method computes functional association of ASD seed genes with the rest of the genes in the network based on a new diffusion based proximity measure. Finally, we customize a random walk with restarts based algorithm for ASDs which takes into consideration the proximity and connectivity information of the genes in the global PPI network to quantify the ASD-association of genes in the network.

The landscape of genes for our methods covers the largest connected component of the PPI network constructed using human PPIs collected from BioGRID [103] and ASD related PPIs collected from the SFARI Autism PIN module [13]. It comprises 22192 genes and 227341 interactions. In what follows we refer to this largest connected component of the PPI network as the "connected PPI network".

For measuring the performance of our methods, we need to consider a set of ASD genes as a gold standard. We collected a list of known ASD genes from the SFARI Human Gene Module [13]. As of June 2014, this module reported 606 known human genes in connection to ASDs, 548 of which reside in the largest connected component of our PPI network. We use these genes as our gold standard (Appendix A).

2.1 CNV Information Entropy based Prioritizer

2.1.1 Copy Number Variation (CNV)

For decades, it has been known to researchers that chromosomal rearrangements can result in a wide range of developmental disorders. However, technological and computational advances in the past decade have enabled the development of assays capable of identifying submicroscopic structural changes in chromosomes that could not have been detected by traditional cytogenetic analysis. Among the most heavily scrutinized of these structural variants are copy number variants, or CNVs.
CNVs refer to submicroscopic chromosomal deletions and/or duplications that are typically defined as DNA segments of 1000 base pairs or larger in size that are present in a varying (or zero) number of copies when compared to a reference genome [94] (Figure 2-1).

[Figure 2-1 panels: Deletion (pair of chromosomes with one copy of "C"); Normal pair of chromosomes; Duplication (pair of chromosomes with three copies of "C").]

Figure 2-1: Copy number variations in a pair of chromosomes. The pair of normal chromosomes (middle pair) each have sections A-B-C-D. However, the loss of section C from one of the chromosomes results in an abnormal chromosome with only sections A-B-D (left pair); an individual with this deletion has only one copy of section C in their chromosomes. On the other hand, the gain of an extra copy of section C on one of the chromosomes results in an abnormal chromosome with sections A-B-C-C-D (right pair); an individual with this duplication has three copies of section C in their chromosomes. Thus, both of the individuals (left and right) have CNVs involving section C - one has lost a copy, the other has gained a copy, but both have a varied number of copies of C when compared to the reference pair of chromosomes.

There are many CNVs throughout the human genome that have no adverse influence on the individual(s) harboring them in the general population. However, there are also a large number of CNVs that have been definitively linked with diseases. Evidence also indicates that interaction with additional genetic or environmental factors may influence whether CNVs have a detectable adverse effect on an individual.

2.1.2 Copy Number Variants in ASD

Analyses of large autistic populations over the past decade suggest that CNVs at specific locations in the genome result in increased susceptibility to ASD [69]. It has been estimated that 10-20% of ASD cases result from the presence of one or more pathogenic CNVs in an affected individual [2].
This finding implies that CNVs are one of the most, if not the most, common genetic causes of ASD.

In 2003, the Simons Foundation launched the project "Simons Foundation Autism Research Initiative (SFARI)" to advance the research of autism spectrum disorders. SFARI Gene [13] is a publicly available, curated, web-based, searchable, integrated resource, made available to the autism research community by SFARI. This resource is built on information extracted from the studies on molecular genetics and biology of ASD. SFARI Gene includes genetic, proteomic, and structural variation data from linkage and association studies, cytogenetic abnormalities, and specific mutations associated with ASD.

The Copy Number Variant (CNV) module of SFARI Gene is a comprehensive, up-to-date collection of all copy number variants associated with autism spectrum disorders (ASD). The content of the CNV module is compiled in a systematic way from available case studies, CNV studies, and large-scale, genome-wide CNV screens. CNVs from autistic case cohorts and, when available, unaffected control cohorts are reported by the module. CNVs in the module are organized based upon the locus (chromosomal region or band) in which they were observed in each study. As of March 2014, more than 1800 CNV loci have been reported in connection with ASDs. These CNVs map to thousands of genes, which is too large a number to be useful. Thus, we sought an intelligent approach to narrow down the number of ASD risk genes by utilizing the copy number variants reported in ASDs.

2.1.3 Calculating Information Entropy Score from CNVs

We downloaded the CNV loci and corresponding case-control occurrence data from the SFARI CNV module [13]. We collected sideband annotations for chromosomes from Ensembl [51]. Human gene-locus mapping information was collected from Entrez [68].
We designed a mapper that maps the CNVs to corresponding genes and calculates their frequency of occurrence in cases and controls using the aforementioned information. Then, for each mapped gene g, we calculated the information entropy score, $\rho_g$, using Formula 2.1. The work flow for our CNV-based prioritizer is shown in Figure 2-2.

$$\rho_g = K_g \times (1 - IE_g) + \text{offset} \tag{2.1}$$

Here, $f_g^{y}$ denotes the number of occurrences of gene g in disease group y, where $y \in \{case, control\}$; $p_g$ denotes the probability of gene g occurring in ASD cases; $K_g$ corresponds to the scaling factor corresponding to gene g; $IE_g$ denotes the information entropy of gene g. These terms are defined by Formula 2.2, where $K_g$ takes one of three alternative forms.

$$K_g \in \left\{ f_g^{case} + f_g^{control},\ \sqrt{(f_g^{case})^2 + (f_g^{control})^2},\ \sqrt{f_g^{case} + f_g^{control}} \right\} \tag{2.2a}$$

$$IE_g = -p_g \log_2(p_g) - (1 - p_g) \log_2(1 - p_g) \tag{2.2b}$$

$$p_g = \frac{f_g^{case}}{f_g^{case} + f_g^{control}} \tag{2.2c}$$

We calculate $\rho_g$ using the three different scaling factors $K_g$ and choose the one which gives the largest area (AUC) under the Receiver Operating Characteristic (ROC) curve. The selected scaling factor gives an AUC of 59.81% (Figure 2-3). We chose a small positive number such as 1e-6 as offset. All the genes in the human PPI network to which no CNV is mapped by the mapper were assigned a score equal to the offset value.

Figure 2-2: Steps in CNV-based prediction-prioritization of ASD genes. At first, our custom-built mapper maps CNV loci in ASD case-control groups from SFARI CNV module to genes using chromosome sideband annotations from Ensembl and gene-locus mapping information from Entrez. The mapper also counts the numbers of occurrences of each gene in the case group and control group separately. Next, the scorer calculates an information entropy based score for each gene based on its frequency of occurrence in ASD case-control groups. Genes are ranked in descending order of entropy based scores.

2.1.4 Quality of CNV Information Entropy based Prioritization

To measure the quality of our information entropy based ranking, we calculated the area (AUC) under the Receiver Operating Characteristic (ROC) curve (Figure 2-3). The true positive rate (TPR) or recall, and false positive rate (FPR) are calculated using Equations 2.3 and 2.4 respectively. Using any of the scaling factors we get an AUC of approximately 59%, which is better than the random case (AUC = 50%). Since we are more interested in identifying the ASD genes than the non-ASD ones, we look at the lift chart for this method (Figure 2-4). The lift chart shows how much more likely we are to identify ASD genes than if we make random guesses. For example, by considering only the top 2% of genes in the ranklist found by our method, we are able to identify 2.3 times as many known ASD genes, in comparison to using no method. This enrichment indicates a reasonable improvement considering the unbalanced nature of our dataset with ASD genes accounting for only 2.5% of the entire dataset. Here, the lift of a bucket, or a group of genes in the dataset, is calculated using Equation 2.5.

$$\text{recall} = \text{TPR} = \frac{\text{Number of ASD genes correctly identified by the method}}{\text{Total number of ASD genes in the dataset}} \tag{2.3}$$

$$\text{FPR} = \frac{\text{Number of non-ASD genes wrongly identified as ASD genes by the method}}{\text{Total number of non-ASD genes in the dataset}} \tag{2.4}$$

$$\text{lift of a bucket} = \frac{\text{Percentage of true ASD genes in the bucket identified by the method}}{\text{Percentage of ASD genes in the bucket selected randomly}} \tag{2.5}$$
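As a rough illustration of Formulas 2.1 and 2.2 and of the lift in Equation 2.5, the following Python sketch scores a handful of made-up genes and computes the lift of the top bucket. The gene names, the counts, and the function names `entropy_score` and `lift_at` are hypothetical. The scaling factor is passed in as a parameter and defaults to the total occurrence count, which is only an assumption and not necessarily the factor selected in the text.

```python
import math

def entropy_score(f_case, f_control, scale=None, offset=1e-6):
    """Information entropy score of one gene (Formulas 2.1 and 2.2)."""
    total = f_case + f_control
    if total == 0:
        return offset  # a gene with no mapped CNV receives the offset
    p = f_case / total  # probability of the gene occurring in ASD cases
    # Binary entropy, using the convention 0 * log2(0) = 0.
    ie = -sum(q * math.log2(q) for q in (p, 1.0 - p) if q > 0)
    # Assumed default scaling factor: total number of occurrences.
    k = total if scale is None else scale(f_case, f_control)
    return k * (1.0 - ie) + offset

def lift_at(ranked, known, top_fraction):
    """Lift of the top bucket of a ranked gene list (Equation 2.5)."""
    n_top = max(1, round(len(ranked) * top_fraction))
    hit_rate = sum(g in known for g in ranked[:n_top]) / n_top
    base_rate = sum(g in known for g in ranked) / len(ranked)
    return hit_rate / base_rate

# Hypothetical (case, control) occurrence counts per gene.
counts = {"G1": (10, 0), "G2": (5, 5), "G3": (3, 1), "G4": (0, 0)}
scores = {g: entropy_score(*c) for g, c in counts.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
known_asd = {"G1", "G2"}  # hypothetical gold standard
print(ranked, lift_at(ranked, known_asd, 0.25))
```

A lift above 1 for a bucket means that the bucket is enriched for known ASD genes relative to random selection. A gene seen equally often in cases and controls has maximal entropy and therefore falls back to the offset.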
[ROC plot; x-axis: False Positive Rate (FPR). Legend: Scaling Factor 1, AUC = 0.5930; Scaling Factor 2, AUC = 0.5981; Scaling Factor 3, AUC = 0.5977; Baseline, AUC = 0.5000.]

Figure 2-3: Receiver operating characteristic curves for CNV-based prioritizer using different scaling factors.

2.2 ASD Similarity based Prioritizer

2.2.1 Similarity of Phenotypes or Diseases

Similarity between phenotypes reflects biological modules of interacting functionally-related genes. These similarities are positively correlated with a number of measures of gene function, including relatedness at the level of protein sequence, protein motifs, functional annotation, and direct protein-protein interaction [112]. In fact, genes or proteins associated with similar diseases or phenotypes lie in close proximity in the PPI network. Furthermore, phenotype grouping reflects the modular nature of human disease genetics. These facts bring forth the idea of utilizing disease or phenotype similarity information for identification of disease genes.

[Lift chart; x-axis: % of Genes in the Ranklist.]

Figure 2-4: Lift chart for CNV-based prioritizer.

In 2006, van Driel et al. [112] introduced a text mining algorithm to compute disease or phenotype similarity information for 5080 phenotypes collected from the OMIM database. The steps of the algorithm can be summarized as follows.

* First, all the OMIM records are scanned for keywords that are present in the anatomy (A) and the disease (C) sections of the Medical Subject Headings (MeSH) vocabulary. MeSH is a controlled vocabulary of the U.S. National Library of Medicine. It is especially useful for applications that use information that contains different terminologies for identical medical concepts.
* Each OMIM record is then represented by a (0,1)-vector where each entry of the vector corresponds to whether a term is present (denoted by 1) or absent (denoted by 0) in the record.
* Similarity of two phenotypes is then computed by calculating the cosine of the angle between their respective feature vectors. The similarity score ranges from 0 to 1.

We collected the phenotype similarity matrix computed by van Driel et al., which is available through a web interface (http://www.cmbi.ru.nl/MimMiner/).

2.2.2 Gene-Phenotype Association Data

OMIM provides a publicly-accessible and comprehensive database of genotype-phenotype relationships in humans. We downloaded gene-phenotype relationship information from the OMIM database [1]. We retained only those gene-phenotype relationships where the phenotype also has a similarity score available in the disease similarity matrix computed by van Driel et al. [112]. We then mapped the genes associated with those phenotypes onto the connected PPI network. Note that multiple genes can be mapped to a single phenotype and one gene can be involved in multiple phenotypes. Thus, after this step, we are left with a total of 1474 genes mapped to 1999 OMIM phenotypes.

2.2.3 Calculating ASD Similarity Scores

We use the disease similarity matrix computed by van Driel et al. [112] and the gene-phenotype association data from the OMIM database [1] to compute the association between each gene and our disease of interest, ASD. We call this association the ASD similarity score of the gene.

Let $\delta = \{d_1, d_2, d_3, \ldots, d_t\}$ be the set of diseases for which similarity scores are available, and let $\phi(d_i, d_j)$ denote the similarity between diseases $d_i$ and $d_j$. Also let $\delta_g \subseteq \delta$ be the set of phenotypes associated with gene g. Let S be the set of seed genes which are known to be associated with ASD with high confidence. We select the genes that appear in eight or more ASD-related studies from our gold standard (Appendix A) as the seed set.
Thus our seed set S contains 106 genes. Let $\delta_S$ denote the set of phenotypes related to ASD genes. We compute the association of a gene to ASD by looking at the similarity of the phenotypes related to it to the ASD-like phenotypes, $\delta_S$ (Equation 2.6). The association score is normalized by the sum of pairwise similarity of ASD-like phenotypes. Thus we get an ASD similarity score, $\psi_g$, for each gene in the largest connected component in the PPI network. Genes with no phenotype mapping receive an ASD similarity score of zero.

$$\psi_g = \frac{\sum_{d_i \in \delta_g} \sum_{d_j \in \delta_S} \phi(d_i, d_j)}{0.5 \times \sum_{d_m \in \delta_S} \sum_{d_n \in \delta_S} \phi(d_m, d_n)} \tag{2.6}$$

2.2.4 Performance of ASD Similarity based Prioritizer

By sorting the genes in descending order of ASD similarity scores, we obtain the ASD similarity based ranking of genes. To measure the performance of ASD similarity based prioritizer, we calculate the area (AUC) under the Receiver Operating Characteristic (ROC) curve (Figure 2-5). The TPR and FPR are calculated using Equations 2.3 and 2.4 respectively as before. We measure the performance of this method on 22086 genes of the PPI network. These genes do not include the 106 ASD genes used in identifying ASD-like phenotypes. We achieve an AUC of 55.96% using this method, which is better than the random case (AUC = 50%). Figure 2-6 depicts the lift chart for this method, which shows how much more likely we are to identify ASD genes than if we make random guesses. By considering only the top 2% of genes in the ranklist found by our method, we are able to identify 3.62 times as many known ASD genes, in comparison to using no method. This enrichment indicates quite an improvement considering the imbalanced nature of our dataset with ASD genes accounting for roughly 2% of the entire dataset. Here, lift is calculated using Equation 2.5.

[ROC plot; x-axis: False Positive Rate (FPR). AUC = 0.5596.]

Figure 2-5: Receiver operating characteristic curve for ASD similarity based prioritizer.
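The score in Equation 2.6 can be illustrated with a small Python sketch. Everything named here is hypothetical: the phenotype labels, the similarity table `SIM`, and the function `asd_similarity`. It also assumes that the normalizing double sum runs over all ordered pairs of ASD-like phenotypes, including a phenotype paired with itself, which the text does not spell out.

```python
def asd_similarity(gene_phenotypes, asd_phenotypes, phi):
    """ASD similarity score of one gene, following Equation 2.6."""
    asd = sorted(asd_phenotypes)
    # Normalizer: half the sum of pairwise similarities among ASD-like
    # phenotypes (assumption: the sum covers all ordered pairs).
    norm = 0.5 * sum(phi(dm, dn) for dm in asd for dn in asd)
    if not gene_phenotypes or norm == 0:
        return 0.0  # genes with no phenotype mapping score zero
    return sum(phi(di, dj) for di in gene_phenotypes for dj in asd) / norm

# Hypothetical symmetric similarity table over three phenotypes.
SIM = {("A", "A"): 1.0, ("B", "B"): 1.0, ("C", "C"): 1.0,
       ("A", "B"): 0.5, ("A", "C"): 0.2, ("B", "C"): 0.4}

def phi(x, y):
    """Look up a similarity regardless of argument order."""
    return SIM.get((x, y), SIM.get((y, x), 0.0))

asd_like = {"A", "B"}  # phenotypes linked to the seed genes
gene_to_phenotypes = {"g1": {"C"}, "g2": set()}
scores = {g: asd_similarity(p, asd_like, phi)
          for g, p in gene_to_phenotypes.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, scores)
```

Sorting the genes by this score in descending order gives the kind of ranking that the ROC and lift analyses of this section evaluate.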
[Lift chart; x-axis: % of Genes in the Ranklist. Legend: Excluding Seeds; Including Seeds; Baseline.]

Figure 2-6: Lift chart for ASD similarity based prioritizer.

2.3 Diffusion State ASD Proximity based Prioritizer

2.3.1 Diffusion State Distance (DSD) in PPI Network

As discussed in Section 1.2.1, functional similarity of genes or proteins in the PPI network is often inferred based on direct interaction or some notion of network proximity in a local neighborhood. Most of the disease gene prediction-prioritization methods typically measure local proximity based on either direct neighborhood or shortest path distance, but this has only a limited ability to capture fine-grained neighborhood distinctions because most proteins are close to each other, and there are many ties in proximity. Also, the accuracy of these methods is often limited by the incomplete and noisy nature of the PPI data. Addressing these issues, Cao et al. [23] introduced Diffusion State Distance (DSD), a new distance metric based on the graph diffusion property. DSD captures fine-grained distinctions in proximity for transfer of functional annotation in the PPI network and is able to perform much better than the conventional distance metrics.

Definition of DSD Metric

Cao et al. [23] defined diffusion state distance (DSD) as follows: let G(V, E) be the undirected connected PPI network, where $V = \{v_1, v_2, v_3, \ldots, v_n\}$ is the set of genes or proteins in the network with $|V| = n$; $E = \{e_1, e_2, e_3, \ldots, e_m\}$ is the set of interactions, with an edge $(v_i, v_j)$ denoting the interaction between genes $v_i$ and $v_j$. Let $He^{\{k\}}(A, B)$ be the expected number of times that a random walk starting at node A and proceeding for k steps will visit node B. Assuming k is fixed, $He^{\{k\}}(A, B)$ can be simply denoted as $He(A, B)$.
The n-dimensional vector $He(v_i)$, $\forall v_i \in V$, is defined as $He(v_i) = (He(v_i, v_1), He(v_i, v_2), \ldots, He(v_i, v_n))$. Then, the DSD between two genes u and v, $\forall u, v \in V$, is given by Equation 2.7.

\[ DSD(u, v) = \|He(u) - He(v)\|_1 \qquad (2.7) \]

where $\|\cdot\|_1$ denotes the $L_1$ norm of a vector. Cao et al. [23] proved three lemmas establishing that DSD is a metric: it is symmetric, positive definite (non-zero whenever $u \neq v$), and obeys the triangle inequality. It also converges as k approaches infinity, and thus can be defined independently of k.

Lemma 1 DSD is a metric on V, where V is the vertex set of a simple connected graph $G(V, E)$.

Lemma 2 Let G be a connected graph whose random walk one-step transition probability matrix P is diagonalizable and ergodic as a Markov chain. Then for any $u, v \in V$, $DSD(u, v)$ converges as k, the length of the random walk, approaches infinity.

Lemma 3 Let G be a connected graph whose random walk one-step transition probability matrix P is diagonalizable and ergodic as a Markov chain. Then for any $u, v \in V$, we have

\[ \lim_{k \to \infty} DSD^{\{k\}}(u, v) = \left\| (b_u^T - b_v^T)(I - P + W)^{-1} \right\|_1 \]

where I is the identity matrix, W is the constant matrix in which each row is a copy of $\pi^T$, $\pi^T$ being the unique steady state distribution, and for any $i \in V$, $b_i^T$ is the ith basis vector, i.e., the row vector of all zeros except for a 1 in the ith position.

Proofs of these lemmas are beyond the scope of this discussion, but can be found in [23].

2.3.2 Calculating Diffusion State ASD Proximity (DSAP) of Genes

A key step in our approach towards ASD gene prediction-prioritization is to measure the proximity between candidate genes and known ASD genes in the connected PPI network. For this purpose, we define a new proximity measure based on DSD and call it Diffusion State ASD Proximity, or DSAP for short. Let $DSD(u, v)$ denote the pairwise diffusion state distance between any two nodes $u, v \in V$ in the connected PPI network $G(V, E)$, which is defined by Equation 2.7.
Let S be the set of genes known to be associated with ASD with high confidence. Out of the 548 genes in our gold standard (Appendix A), 106 genes appear in eight or more ASD related studies. We build our ASD gene set S using these genes. We define the pairwise diffusion state proximity (DSP) of two nodes $u, v \in V$ by a Gaussian kernel over $DSD(u, v)$ as follows.

\[ DSP(u, v) = e^{-\left(\frac{DSD(u, v)}{7}\right)^2} \]

Here, we divide $DSD(u, v)$ by 7 so that the value of $DSP(u, v)$ does not become too small, given that the median $DSD(u, v)$ for the connected human PPI network is found to be approximately 7. Then, we define the diffusion state ASD proximity of a gene $g \in V$ by Equation 2.8.

\[ DSAP(g) = \sum_{s \in S} DSP(g, s) \qquad (2.8) \]

We calculate DSAP scores for all the genes in the connected PPI network and sort them in descending order of DSAP score, which gives us the DSAP-based ranking of genes.

2.3.3 Quality of DSAP-based Ranking

To measure the performance of the DSAP-based prioritizer, we calculate the area under the Receiver Operating Characteristic (ROC) curve (AUC) (Figure 2-7). The TPR and FPR are calculated using Equations 2.3 and 2.4 respectively, as before. We measure the quality of ranking on 22086 genes of the PPI network. These genes do not include the 106 ASD genes used in measuring proximity to ASD. We achieve an AUC of 54.05% using this method, which is better than the random case (AUC = 50%). With this approach we achieve a lift of 1.1% of ASD genes (excluding seeds) in the top 4% of the ranklist over random selection. Although this measure is worse than that of the previous two methods, the method is still able to identify more non-seed ASD genes than random selection. Inclusion of seed genes boosts the lift up to 5.7-fold, which means that the seed genes are very close to each other in the network in terms of DSAP and hence make up a significant portion of the top 4% of the ranklist. The lift chart including the seeds is shown in Figure 2-8. Here, lift is calculated using Equation 2.5.
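To make Equations 2.7 and 2.8 concrete, the following Python sketch computes DSD, DSP, and DSAP on a toy graph. It is an illustration under stated assumptions, not the implementation used in this work: the expected visit counts are accumulated as the sum of the first k powers of the transition matrix with the starting step counted as a visit, the graph is a hypothetical four-node path, and all function names are invented for the example. The scale of 7 follows the text above.

```python
import math


def transition_matrix(adj):
    """Row-stochastic one-step transition matrix of a simple random walk."""
    return [[a / sum(row) for a in row] for row in adj]


def mat_mul(A, B):
    """Dense square matrix product (adequate for a toy graph)."""
    n = len(A)
    return [[sum(A[i][m] * B[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)]


def expected_visits(adj, k):
    """He[u][v]: expected number of visits to v by a k-step walk from u.

    Assumption: the start (step 0) counts as a visit, so He = I + P + ... + P^k.
    """
    n = len(adj)
    P = transition_matrix(adj)
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    He = [row[:] for row in power]
    for _ in range(k):
        power = mat_mul(power, P)
        for i in range(n):
            for j in range(n):
                He[i][j] += power[i][j]
    return He


def dsd(He, u, v):
    """Equation 2.7: L1 distance between the visit vectors of u and v."""
    return sum(abs(a - b) for a, b in zip(He[u], He[v]))


def dsap(He, g, seeds, scale=7.0):
    """Equation 2.8: sum of Gaussian-kernel proximities from g to the seeds."""
    return sum(math.exp(-(dsd(He, g, s) / scale) ** 2) for s in seeds)


# Hypothetical path graph 0 - 1 - 2 - 3 as an adjacency matrix.
toy_adj = [[0, 1, 0, 0],
           [1, 0, 1, 0],
           [0, 1, 0, 1],
           [0, 0, 1, 0]]
toy_He = expected_visits(toy_adj, k=5)
toy_dsap = [dsap(toy_He, g, seeds=[0, 1]) for g in range(4)]
```

On a network with tens of thousands of genes the dense matrix powers would be replaced by sparse linear algebra or by the closed form of Lemma 3; the nested lists here only keep the example self-contained.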
[Figure: ROC curve; x-axis: False Positive Rate (FPR)]
Figure 2-7: Receiver operating characteristic curve for Diffusion State ASD Proximity (DSAP) based prioritizer.

[Figure: lift chart; x-axis: % of Genes in the Ranklist; series: Excluding Seeds, Including Seeds, Baseline]
Figure 2-8: Lift chart for Diffusion State ASD Proximity (DSAP) based prioritizer.

2.4 Network Crosstalk based Prioritizer

2.4.1 Motivation

Functional associations between genes or proteins in the PPI network are often measured using diffusion kernel, random walk with restart, or propagation flow based algorithms. These approaches are global in nature in that they consider multiple alternate paths and the whole topology of the PPI network. Most of these approaches follow the same basic steps. First, identify seed genes that are significantly associated with the disease of interest. Next, map these seed genes onto the PPI network. Finally, quantify the functional association between genes in the PPI network and the seed genes based on network proximity and connectivity in a global manner. As discussed in Section 1.2.1, these approaches have recently been applied successfully to identify genes for a number of complex diseases, including different types of cancers, type-2 diabetes, neurological disorders, psoriasis, and asthma. Motivated by these successes, we aim to develop a global network-based scoring scheme to quantify functional association between ASD seed genes and the rest of the genes in the human PPI network. We redefine the notion of network crosstalk introduced by Nibbe et al. [76] in the context of ASDs and compute ASD association using an approach based on random walk with restarts.
To the best of our knowledge, this is the first attempt to capture functional association of ASD genes via network connectivity and proximity in a global manner.

2.4.2 Problem Formulation

Following closely the approach adopted by Nibbe et al. [76] for identifying candidate genes and subnetworks for human colorectal cancers, we reformulate the problem of disease gene prediction-prioritization for ASDs. Let $G = (V, E)$ be the connected PPI network, where V consists of the genes in the network, and an undirected edge $e(u, v) \in E$ represents an interaction between genes $u \in V$ and $v \in V$. Let $N(v)$ be the set of direct neighbors (i.e., interacting partners) of gene $v \in V$, i.e., $N(v) = \{u \in V : (u, v) \in E\}$. Let $S \subseteq V$ be the set of genes known to be associated with ASD with high confidence. Among the 548 gold standard ASD genes, 106 genes appear in eight or more ASD related studies. We build our ASD gene set S using these genes. Our goal is to compute a score $a(v)$ for each gene $v \in V$ that quantifies the network crosstalk between v and the genes in S, network crosstalk being the indicator of functional association between genes.

In order to develop a biologically sound measure of network crosstalk, Nibbe et al. [76] relied on two observations. (i) Functional similarity between proteins is significantly correlated with their network proximity, as measured by the number of hops between these proteins. (ii) Existence of multiple alternate paths between two proteins is an indicator of their functional association, since multiple functional paths are often conserved through evolution owing to their contribution to robustness against perturbations, as well as amplification of signals. Like Nibbe et al. [76], we compute network crosstalk scores for genes in the PPI network using an information flow approach based on random walks with restarts.
This approach incorporates both the number of hops and multiple alternate paths between genes into the assessment and can be considered a generalization of Google's well-known PageRank algorithm [17].

2.4.3 Calculating Network Crosstalk Scores

For a given ASD seed gene set, S, we calculate network crosstalk scores for all the genes in the PPI network by simulating a random walk as follows. The random walk starts at a randomly chosen gene in S. At each step, when the random walk is at some gene $v \in V$, it either moves to a neighbor of v with probability $1 - r$, or it restarts at a gene in S with probability r. Here, the parameter $0 \le r \le 1$ is called the restart probability. For each move, the neighbor to be moved to is chosen uniformly at random from $N(v)$. Similarly, for each restart, the gene to be restarted from is selected uniformly at random from S. The network crosstalk between the genes in S and each gene $v \in V$ can be computed as the relative amount of time spent at v by such an infinite random walk, or equivalently, the probability that the random walk will be at gene v at a randomly chosen time step after the random walk has proceeded for a sufficiently long time.

Formally, let $a_t$ be the $|V|$-dimensional vector such that $a_t(v)$ denotes the probability that the random walk will be at gene v at step t, where $\|a_t\|_1 = 1$ (here, $\|\cdot\|_1$ denotes the $L_1$ norm of a vector). Let P denote the stochastic matrix derived from network $G = (V, E)$, i.e., $P(u, v) = 1/|N(v)|$ if $(u, v) \in E$, and 0 otherwise. Then, at any step $t + 1$, the crosstalk score vector is given by Equation 2.9.

\[ a_{t+1} = (1 - r) P a_t + r \gamma \qquad (2.9) \]

where $\gamma$ denotes the restart vector with $\gamma(u) = 1/|S|$ for $u \in S$, and 0 otherwise. With initial crosstalk scores set to $a_0 = \gamma$, the vector of final crosstalk scores for the genes in the network is given by $a = \lim_{t \to \infty} a_t$. In our experiments, we stopped our iterations when we
encountered the criterion $\|a_{t+1} - a_t\|_1 < 10^{-9}$. As we can see, when $r = 0$, a is equal to the eigenvector of P that corresponds to its largest eigenvalue (with numerical value 1), i.e., $a(v)$ is exactly equal to the PageRank of v in G for all $v \in V$. Thus, the crosstalk score of a gene v is not only an indicator of its connectivity and proximity to ASD genes, but also reflects the centrality of the gene in the network.

2.4.4 Dealing with Statistical Bias

PPI networks are often noisy in that well-studied proteins or genes are highly connected, with many interactions, whereas less studied ones often miss interactions. Thus there is a high probability that the highly connected hub genes will be assigned artificially high crosstalk scores just by chance, skewing the result towards well-studied genes. However, we are interested in finding those genes that are less characterized but may provide novel insights into ASDs. To correct for this bias, we assign significance scores to the crosstalk scores using Monte Carlo simulations. We define a null model that accurately captures the degree distribution of the ASD seed genes in S as follows. For a given ASD seed set S, in order to generate a random instance $S^{(i)}$ representative of S, we first create, for every gene $u \in S$, a bucket $B(u)$ of genes in the network, such that $\bigcup_{u \in S} B(u) = V$ and $B(u) \cap B(u') = \emptyset$ for all distinct $u, u' \in S$. A gene $v \in V$ is assigned to bucket $B(u)$ if $\big| |N(v)| - |N(u)| \big| \le \big| |N(v)| - |N(u')| \big|$ for all $u' \in S$, with ties broken randomly. Next we choose one gene from each bucket uniformly at random to construct $S^{(i)}$, so that $|S^{(i)}| = |S|$. Note that each bucket consists of genes that have a similar number of interactions to a particular ASD seed gene; therefore each seed gene is represented in $S^{(i)}$ by exactly one gene in terms of its number of neighbors. Thus, the expected total degree of the genes in $S^{(i)}$ is likely to be very close to the total degree of the genes in S.
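The iteration of Equation 2.9, together with the stopping criterion stated above, can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in our experiments: the network is a hypothetical four-gene path, the function and variable names are invented for the example, and a cap on the number of iterations is added as a safeguard because with $r = 0$ the iteration need not settle on a bipartite graph.

```python
def crosstalk_scores(neighbors, seeds, r, tol=1e-9, max_iter=10000):
    """Random walk with restarts: a_{t+1} = (1 - r) P a_t + r * gamma.

    neighbors: dict mapping each gene to the list of its interacting partners.
    seeds: set of seed genes S.
    r: restart probability.
    P(u, v) = 1 / |N(v)| for each edge (u, v), so the mass at v is split
    evenly among the neighbors of v.
    """
    genes = sorted(neighbors)
    gamma = {g: (1.0 / len(seeds) if g in seeds else 0.0) for g in genes}
    a = dict(gamma)  # a_0 = gamma
    for _ in range(max_iter):
        nxt = {g: r * gamma[g] for g in genes}
        for v in genes:
            share = (1.0 - r) * a[v] / len(neighbors[v])
            for u in neighbors[v]:
                nxt[u] += share
        diff = sum(abs(nxt[g] - a[g]) for g in genes)  # L1 change
        a = nxt
        if diff < tol:
            break
    return a


# Hypothetical toy network: a path A - B - C - D with a single seed gene A.
toy_network = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B", "D"],
    "D": ["C"],
}
toy_scores = crosstalk_scores(toy_network, seeds={"A"}, r=0.5)
```

In this toy case the scores decrease with distance from the seed, which matches the intuition that crosstalk reflects proximity to S.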
After generating a random instance $S^{(i)}$, we compute the corresponding crosstalk vector $a^{(i)}$ by letting $\gamma^{(i)}(u) = 1/|S^{(i)}|$ for $u \in S^{(i)}$, and 0 otherwise. We repeat this procedure N times, where N is sufficiently large (we use N = 1000 in our experiments), to obtain a sample $\{a^{(1)}, a^{(2)}, a^{(3)}, \ldots, a^{(N)}\}$ of the null distribution of the crosstalk scores with respect to seed sets that are representative of S in terms of their sizes and degree distributions. Next, we estimate the mean $\mu_S = \frac{1}{N} \sum_{i=1}^{N} a^{(i)}$ and the standard deviation $\sigma_S$ of the null distribution of crosstalk scores for S using this sample. Finally, we compute the adjusted z-scores for the crosstalk scores using Equation 2.10.

\[ z_S(v) = \frac{a(v) - \mu_S(v)}{\sigma_S(v)}, \quad \forall v \in V \qquad (2.10) \]

These adjusted crosstalk scores represent the statistical significance of the crosstalk between each gene and the genes in the ASD seed set, accounting for the centrality and degree distribution of the genes in the PPI network. We sort the genes in our PPI network in descending order of the adjusted crosstalk scores, which gives us the network crosstalk based ranking of genes.

2.4.5 Performance of Network Crosstalk based Prioritizer

To measure the performance of our network crosstalk based prioritizer, we calculate the area under the Receiver Operating Characteristic (ROC) curve (AUC). As before, the TPR and FPR are calculated using Equations 2.3 and 2.4 respectively. We measure the quality of ranking on 22086 genes of the PPI network. These genes do not include the 106 ASD genes used as seeds. We measure AUC for different values of the parameter r: $r \in \{0, 0.25, 0.5, 0.75, 0.9\}$ (Figure 2-9). Figure 2-10 depicts the lift chart for this method, which shows how much more likely we are to identify ASD genes than if we made random guesses. By considering only the top 2% of genes in the ranklist found by our method, we are able to identify 2.37 times as many known ASD genes (excluding seeds) as if we selected randomly.
This gain is a substantial improvement considering the unbalanced nature of our dataset, with ASD genes accounting for roughly 2% of the entire dataset. Here, lift is calculated using Equation 2.5. Inclusion of seed genes boosts the gain up to 10.5-fold.

[Figure: ROC curves for r = 0.00 (AUC = 0.4474), r = 0.25 (AUC = 0.5611), r = 0.50, r = 0.75, r = 0.90 (AUC = 0.5525), and Baseline (AUC = 0.5000); x-axis: False Positive Rate (FPR)]
Figure 2-9: Receiver operating characteristic curves for network crosstalk based prioritizer using different restart probabilities (r).

[Figure: lift chart; x-axis: % Genes in the Ranklist; series: Excluding Seeds, Including Seeds, Baseline]
Figure 2-10: Lift chart for network crosstalk based prioritizer.

Chapter 3

Integrative Approach for Identifying ASD Risk Genes

3.1 Background

To recapitulate, we are interested in the problem of quantifying the association of a gene with ASD and ranking the genes based on the strength of association. Each of the methods we have discussed so far focuses on a single aspect of functional similarity of genes, based on either sequence, phenotype, or topological similarity. However, as discussed in Section 1.2.1, there is plenty of evidence in the literature that an integrative approach incorporating multiple aspects of functional similarity of genes simultaneously can perform better in predicting and prioritizing disease genes than methods focusing on a single aspect. Motivated by this fact, we propose a logistic regression based integrative approach for solving this problem in the context of ASDs. We use lasso-penalized logistic regression [31, 110] to develop a predictor that predicts the probability of a gene being associated with ASDs.
To avoid over-fitting the model, we used the adaptive lasso procedure, which simultaneously identifies influential variables and provides the model parameters. Our choice of variables includes ASD association scores computed by the methods described in Section 2 as well as information on ASD-pathway membership of genes.

3.1.1 Lasso-penalized Logistic Regression

Logistic regression measures the relationship between a categorical dependent variable and one or more independent variables, which are usually (but not necessarily) continuous, by using probability scores as the predicted values of the dependent variable. It is a simple but widely used approach for integrating predictors from multiple sources. It is often used with lasso regularization. Lasso is a shrinkage estimator, often used to identify important predictors, select among redundant parameters, and produce shrinkage estimates. Lasso estimates have potentially lower predictive errors than an ordinary maximum likelihood estimator. Thus, lasso is a useful alternative to stepwise regression and other dimensionality reduction techniques.

3.2 Predicting ASD Association via Logistic Regression based Integrative Approach

3.2.1 Preparing Data for Training and Validation

Our landscape of genes consists of all 22192 genes of the connected PPI network. Each gene in the set is labeled as an ASD gene if it belongs to the gold standard ASD gene set (Appendix A), or as non-ASD otherwise. We establish a training set of 4292 genes. It contains 106 high confidence ASD genes from the ASD gold standard gene set. These genes appear in eight or more ASD-related studies. These 106 genes make up roughly 19.34% of the total ASD genes in the dataset. To retain this proportion for non-ASD genes as well, we randomly select 4186 of the 21644 non-ASD genes in the connected PPI network for the training set. The rest of the ASD and non-ASD genes are set aside for validating the performance of the logistic regression based predictor.
Thus, the validation set consists of 17900 genes, of which 442 are ASD genes and the rest are non-ASD genes. Note that our dataset is highly unbalanced, and we are interested in accurately predicting the ASD genes rather than the non-ASD genes.

3.2.2 Constructing Lasso-regularized Binomial Regression Model

We formulate the logistic regression based predictor as follows. Let $V = \{v_1, v_2, v_3, \ldots, v_n\}$ be the set of genes. Let the dependent variable $p = \{p_1, p_2, p_3, \ldots, p_n\}$ be the vector of predictions, where $p_i$ denotes the probability that gene $v_i$ is associated with ASDs. $p_i$ can be any real value between 0 and 1 inclusive. We construct the set of independent variables, X = {CNVIE, AutSim, DSAP, NetCrTk, NeuronPath, SkeletalPath, SynapsePath, CaPath}, with eight predictors, where CNVIE refers to CNV information entropy based scores; AutSim, to autism similarity based scores; DSAP, to diffusion state ASD proximity scores; NetCrTk, to adjusted network crosstalk scores; and NeuronPath, SkeletalPath, SynapsePath, and CaPath refer to the membership information of genes in the neuron development pathway, skeletal development pathway, synapse pathway, and Calcium (Ca) signaling pathway, respectively. These four pathways have been associated with ASDs in recent studies [21, 91]. The gene membership information was extracted using the corresponding pathway gene sets from the Molecular Signatures Database (MSigDB) version 4.0 [105]. Here, X is an $n \times 8$ matrix where each row holds the values of the eight predictors for the corresponding gene. We fit a lasso regularized weighted binomial regression model with the aforementioned dependent and independent variables on the training data, using 100 values of the penalty term (Lambda) and 10-fold cross validation. Cross validation is used to correct for potential over-fitting bias. The weights are given by the number of ASD association studies related to each gene.
If a gene does not have any association study associated with it, it is given a very small weight of $10^{-6}$. For each non-negative value $\lambda$ in Lambda, lasso tries to minimize the deviance of the model (often computed from the negative log-likelihood) fit to the responses using the predictor coefficients as well as a constant term. We use the lassoglm function from Matlab (version 2012a) to fit the lasso regularized binomial regression model.

3.2.3 Selecting Model Coefficients

We select the constant term as well as the set of predictor coefficients such that the deviance of the model remains within one standard error of the minimum deviance found by lasso. The selected model coefficients are shown in Table 3.1 in order of predictive value. According to the fitted lasso penalized logistic regression model, all the predictors are informative to some extent.

Variable        Coefficient
AutSim          65.4212
DSAP            1.7637
SkeletalPath    1.4463
NeuronPath      1.2254
CaPath          1.1516
SynapsePath     1.1487
CNVIE           0.3529
NetCrTk         0.2223

Table 3.1: Selected regression coefficients for the integrative approach from logistic regression in order of predictive value.

3.2.4 Creating Regularized Model and Making Predictions

Let the constant term be denoted by $\beta_0$ and the vector of lasso regularized predictor coefficients be denoted by $\beta = \{\beta_1, \beta_2, \beta_3, \ldots, \beta_8\}^T$. The resulting regularized model is given by Equation 3.1.

\[ \mathrm{logit}(p) = \log\left(\frac{p}{1 - p}\right) = X\beta + \beta_0 \qquad (3.1) \]

Thus, the predictions are given by Equation 3.2.

\[ p = \frac{e^{(X\beta + \beta_0)}}{1 + e^{(X\beta + \beta_0)}} \qquad (3.2) \]

We evaluate the model predictions on the training and validation data using this equation. The Matlab function glmval is used for that purpose.

3.3 Performance Analysis

First we assess the accuracy of the model on the training data by measuring the area under the ROC curve as well as the area under the precision-recall curve. The TPR, or recall, and FPR are calculated using Equations 2.3 and 2.4 respectively. Precision is given by Equation 3.3.
\[ \text{Precision} = \frac{\text{Number of ASD genes correctly identified by the method}}{\text{Number of genes identified as ASD genes by the method}} \qquad (3.3) \]

Figure 3-1 shows the precision-recall curve and ROC curve for the model on training data. We achieve an area of 99.54% under the ROC curve and an area of 78.63% under the precision-recall curve, which indicate the high quality of the fit on training data. To assess the overall accuracy of the model, we measure the AUC for the ROC curve using the validation data set. It achieves an AUC of 65.34% (Figure 3-2). To compare its performance, we also fit two other lasso regularized logistic regression models: one integrating only the ASD association scores from the four methods described in Section 2, and the other integrating only the gene membership information in the four pathways and the weights from the literature. The standardized regression coefficients are listed in Tables 3.2 and 3.3.

[Figure: panel A, Precision-Recall Curve (x-axis: Recall); panel B, ROC Curve (x-axis: False Positive Rate (FPR))]
Figure 3-1: Performance curves for integrative approach on training data. A. Precision-recall curve with an AUC of 0.7863; B. ROC curve with an AUC of 0.9954.

Variable    Coefficient
AutSim      18.0409
DSAP        3.1580
CNVIE       0.2005
NetCrTk     0.1870

Table 3.2: Selected logistic regression coefficients for integrating different ASD association scores in order of predictive value.

Variable        Coefficient
CaPath          2.6001
SkeletalPath    1.7088
NeuronPath      0.9576
SynapsePath     0.6344

Table 3.3: Selected logistic regression coefficients for integrating ASD-pathway membership information with weights in order of predictive value.

We compute area under ROC curves (AUCs) for each of these models as well as the methods from Section 2 using the validation dataset.
As we can see in Figure 3-2, our integrative approach, which uses both the ASD association scores from different methods and the ASD-pathway membership information with weights, gives the best performance among all of them.

[Figure: ROC curves; x-axis: False Positive Rate (FPR); legend: IntApp: AUC = 0.6534; IntMthd: AUC = 0.6173; IntPath: AUC = 0.5309; CNVIE: AUC = 0.5954; AutSim: AUC = 0.5596; DSAP: AUC = 0.5416; NetCrTk: AUC = 0.5625; Baseline: AUC = 0.5000]
Figure 3-2: Receiver operating characteristic curves for different ASD gene prediction-prioritization methods. IntApp: Integrative approach incorporating different ASD association scores as well as ASD pathway membership information from the literature; IntMthd: Integrative approach incorporating different ASD association scores only; IntPath: Integrative approach incorporating only ASD pathway membership information from the literature; CNVIE: CNV Information Entropy based Prioritizer; AutSim: Autism Similarity based Prioritizer; DSAP: Diffusion State ASD Proximity based Prioritizer; NetCrTk: Network Crosstalk based Prioritizer.

Figure 3-3 depicts the lift chart for this method, which shows how much more likely we are to identify ASD genes than if we made random guesses. By considering only the top 2% of genes in the ranklist found by our method, we are able to identify 3 times as many known ASD genes (excluding seeds) as if we selected randomly. This gain is a substantial improvement considering the imbalanced nature of our dataset, with ASD genes accounting for roughly 2% of the entire dataset. Here, lift is calculated using Equation 2.5. Inclusion of seed genes boosts the gain up to 11.2-fold. Thus, we construct our risk gene set for ASDs using the genes from the top 2% of our ranklist, which yields 443 genes. Note that among these genes, 123 are known ASD genes (102 seeds and 21 non-seeds).
Among the 21 non-seed ASD genes identified by our integrative approach, DLGAP3, APC, GPC6, and NTRK1 have appeared in seven ASD association studies; AR, ATRX, and RPS6KA3 appear in six studies; SCN8A and TBR1 appear in five studies; GNAS and EGR2 appear in four studies; KCND2 and BIN1 appear in three studies; SETD2, TYR, and EPHB2 appear in two studies; and TBX1, PTPN11, DUSP22, BRCA2, and KIT appear in one study. Considering the high gain of known ASD genes in the candidate set, we can hypothesize that the other genes in it have a strong possibility of being associated with ASDs and are worth investigating. The complete list of risk genes, along with the probabilities of their association with ASDs, is given in Appendix B.

[Figure: lift chart; x-axis: % Genes in the Ranklist]
Figure 3-3: Lift chart of integrative approach for ASD gene prediction-prioritization.

Chapter 4

ASD Genetics: Implications from Candidate ASD Risk Genes

4.1 Gene Sets for Analysis

We downloaded prior knowledge-based gene sets consisting of gene symbols in Gene Matrix Transposed (GMT) format from MSigDB version 4.0 [105]. Of the available pathway gene sets, we collected 1320 expert curated ones, which include gene sets from Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways (http://www.genome.jp/kegg, [47, 48]), Reactome pathways (http://www.reactome.org/, [27, 72]), BioCarta pathways (http://www.biocarta.com), Pathway Interaction Database (PID) pathways [99], SigmaAldrich gene sets (http://www.sigmaaldrich.com/life-science.html), Signaling Gateway gene sets (http://www.signaling-gateway.org), Signal Transduction KE gene sets (http://stke.sciencemag.org), and SuperArray gene sets (http://www.superarray.com). From the collection, we filtered out the disease and drug related gene sets.
We also excluded very large (> 300 genes) and very small (< 10 genes) gene sets. Thus we were left with 1221 gene sets as of June 2014. From MSigDB version 4.0, we also collected Gene Ontology (GO) gene sets, which are derived from the controlled vocabulary of the GO project: "The Gene Ontology Consortium" [8]. These gene sets are based on GO terms and their associations to human genes. We collected gene sets belonging to two categories, C5:BP (biological process) and C5:MF (molecular function). After filtering out very small and very large gene sets, we were left with 763 and 382 gene sets respectively under these categories. From the collection, we excluded the KEGG calcium signaling pathway, neuron development, skeletal development, and synapse gene sets, since they have been used as prior knowledge in our integrative approach. We also filtered out the neurite development and axon guidance gene sets, as they are subsets of the neuron development pathway gene set.

4.2 Hypergeometric Test for Enrichment

We use a hypergeometric test to determine the statistical significance of the overlap of a pathway or GO gene set with our candidate gene set. The hypergeometric test uses the hypergeometric distribution to calculate the probability of more than k ASD risk genes (out of a set of n ASD risk genes in a dataset of N total genes) appearing in a specific pathway or GO gene set of size K. The probability mass function of the hypergeometric distribution is given by the following expression.

\[ P(X = k) = \frac{\binom{K}{k} \binom{N - K}{n - k}}{\binom{N}{n}} \]

This test helps identify whether the ASD risk gene set is overrepresented in a certain pathway or GO gene set, and provides us with a p-value. We use the phyper function from R (version 2.15.1) for computing the hypergeometric p-values for the pathways and GO gene sets in our gene set collection.
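The test can be illustrated with a short Python sketch that evaluates the probability mass function above with exact binomial coefficients. It is an illustration only and is not the R code used in this work. The sketch assumes the upper tail $P(X \ge k)$ as the enrichment p-value; note that R's phyper with lower.tail = FALSE returns the strict tail $P(X > q)$, so the two conventions differ by the single term $P(X = k)$. The function names and the toy numbers are hypothetical.

```python
from math import comb


def hypergeom_pmf(k, N, K, n):
    """P(X = k): exactly k of the n risk genes fall in a gene set of size K,
    out of N genes in total."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def enrichment_pvalue(k, N, K, n):
    """Upper-tail probability P(X >= k) (assumed convention)."""
    upper = min(K, n)  # the overlap cannot exceed either set size
    return sum(hypergeom_pmf(i, N, K, n) for i in range(k, upper + 1))


# Toy numbers: 20 genes in total, a pathway of 5 genes, 4 risk genes,
# 2 of which fall in the pathway.
toy_p = enrichment_pvalue(k=2, N=20, K=5, n=4)
```

When many gene sets are tested, each p-value is then adjusted for multiple testing; with a Bonferroni correction this amounts to multiplying by the number of gene sets tested and capping the result at 1.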
4.3 Pathway Enrichment Analysis

Performing hypergeometric enrichment analysis on pathway gene sets, we found a total of 32 canonical pathways in which ASD risk genes are overrepresented after Bonferroni correction (Table 4.1). Of note, there was an increased frequency of affected pathways associated with signal transduction pertinent to the brain, cellular assembly and communication, synaptic development, and neuronal development. Most of the affected signaling pathways (e.g., MAPK signaling, FGF signaling, SHP2 signaling, etc.) are found to be highly involved in processes such as cell growth and death, specifically neuron apoptosis, neurite outgrowth, inter neurite cell adhesion, etc. They also affect cell proliferation, differentiation, and migration processes. We found a number of affected pathways involving L1 family cell adhesion molecules (L1CAM), which play important roles in neuronal migration and synaptic formation [6]. Ankyrins (SHANK proteins) bind to L1 proteins and couple them to ion channel proteins, and thus mediate branching and synaptogenesis of cortical inhibitory neurons. Pathways involving neural cell adhesion molecules (NCAM) play important roles in formation and maintenance of the nervous system. Incorrect synapse development, incorrect neuron development, and erosion of synaptic function are widely considered to be key contributors to ASDs. Our results also show that pathways involved in immune response, protein catabolism or modification, tissue and organ morphogenesis, muscle cell differentiation, inflammatory response, etc. are associated with ASDs, and may be involved in immune related disorders, developmental regression, metabolic abnormalities, morphological impairments, and sleep disturbances.

Table 4.1: Canonical pathways having significant overlap with ASD risk genes.
Name¹ | Category | Functions | # Genes | Genes | Adjusted p-value
KEGG MAPK SIGNALING PATHWAY | Signal Transduction | cell proliferation, differentiation, migration | 21 | MEF2C, FLNA, BDNF, NF1, FLNB, FGFR3, FGFR1, FGF14, IKBKG, RPS6KA3, MAPT, CACNA1H, CACNA1G, CACNA1A, CACNA1C, CACNA1S, SOS1, FGFR2, NTF3, TP53, NTRK1 | 1.00E-04
PID FGF PATHWAY | Signal Transduction | cell death, neuron apoptosis, proteasomal ubiquitin dependent protein catabolism | 10 | FGFR1, PLCG1, FGFR2, FGFR3, RUNX2, PTPN11, MET, FRS2, HGF, SOS1 | 0.000108898
PID SHP2 PATHWAY | Signal Transduction | activation of IL10, IL2, FGF, ERBB1 signaling cascades | 10 | SOS1, NOS3, FRS2, IL2RG, NTRK3, NTRK1, BDNF, PTPN11, IL6, NTF3 | 0.000184167
REACTOME INTERACTION BETWEEN L1 AND ANKYRINS | Cell Adhesion | nervous system development, branching and synaptogenesis of cortical neurons | 7 | ANK3, L1CAM, SCN2A, SCN4A, SCN5A, SCN8A, SPTAN1 | 0.000204988
REACTOME NCAM SIGNALING FOR NEURITE OUT GROWTH | Cell Adhesion | neural cell adhesion, formation and maintenance of nervous system | 10 | COL1A1, COL2A1, COL4A3, FGFR1, PRNP, SOS1, CACNA1S, SPTAN1, CACNA1H, CACNA1G | 0.000480482
PID BETA CATENIN NUC PATHWAY | Signal Transduction | cell proliferation, immune response | 11 | AR, KRT1, TCF4, SALL4, PITX2, MITF, MED12, NCOA2, AES, CACNA1G, APC | 0.000492712
KEGG FOCAL ADHESION | Cell Adhesion | cell communication, motility, proliferation, differentiation, survival | 17 | HGF, FLT4, FLNA, RELN, FLNB, COL11A1, VWF, MET, COL11A2, PTEN, ITGA2B, CAV3, COL2A1, COL1A1, SOS1, BCL2, LAMA3 | 0.00059351
PID P53 DOWNSTREAM PATHWAY | Signal Transduction | cell growth and death | | FDXR, TP63, TSC2, BCL2, TP53, RFWD2, MET, PTEN, HGF, APC, RB1, VDR, NDRG1, DDB2 | 0.000603742
KEGG ECM RECEPTOR INTERACTION | Signal Transduction | tissue and organ morphogenesis, cell adhesion, proliferation, differentiation and morphogenesis | | GP1BA, VWF, HSPG2, ITGA2B, COL2A1, RELN, COL1A1, LAMA3, SDC3, COL11A1, COL11A2 | 0.000813133
REACTOME L1CAM INTERACTIONS | Cell Adhesion | neurite outgrowth, neurite fasciculation, inter neuronal adhesion | | DNM2, FGFR1, ANK3, ITGA2B, L1CAM, RPS6KA3, SCN4A, SCN2A, SCN5A, SCN8A, SPTAN1 | 0.001033363
REACTOME INSULIN RECEPTOR SIGNALLING CASCADE | Signal Transduction | activation of MAPK, Ras/RAF cascades | | FRS2, FGFR3, EIF4E, FGFR1, FGFR2, INSR, SOS1, STK11, PRKAG2, TSC1, TSC2 | 0.001161965
REACTOME PI3K CASCADE | Signal Transduction | insulin receptor substrate mediated signaling | | FRS2, FGFR3, EIF4E, FGFR1, FGFR2, PRKAG2, INSR, STK11, TSC1, TSC2 | 0.001293256
REACTOME SIGNALING BY INSULIN RECEPTOR | Signal Transduction | insulin binding | | FRS2, EIF4E, FGFR1, FGFR3, FGFR2, ATP6V0A2, PRKAG2, INSR, SOS1, STK11, TSC1, TSC2 | 0.001561111
PID SYNDECAN 1 PATHWAY | Signal Transduction | tumor necrosis factor mediated signaling, protein ubiquitination, degradation | | COL11A2, MET, COL7A1, COL11A1, COL4A3, HGF, COL1A1, COL2A1 | 0.002906631
PID SMAD 2,3 NUCLEAR PATHWAY | Signal Transduction | muscle cell differentiation, endothelial cell migration, negative regulation | | NCOA2, FOXO4, FOXH1, MEF2C, AR, ESR1, RUNX2, DLX1, FOXG1, VDR | 0.004928594
PID TCR PATHWAY | Protein Catabolism, Signal Transduction | protein catabolic process, activation of calcium signaling, NFKB signaling | | IKBKG, PLCG1, PTPN11, STIM1, WAS, FLNA, PTPRC, PTEN, SOS1 | 0.005740308
PID INTEGRIN1 PATHWAY | Cell Adhesion | integrin family cell surface interactions for adhesion | | COL11A1, TGFBI, COL7A1, COL2A1, COL1A1, COL4A3, LAMA3, FBN1, COL11A2 | 0.005740308
SIG PIP3 SIGNALING IN CARDIAC MYOCTES | Signal Transduction | cell growth and survival | | RPS6KA3, MET, ERBB4, SOS1, PTPN1, TSC1, INPPL1, TSC2, PTEN | 0.00651532
KEGG NEUROTROPHIN SIGNALING PATHWAY | Nervous System | differentiation and survival of neuronal cells, learning and memory | 12 | FRS2, BDNF, BCL2, PLCG1, RPS6KA3, SOS1, PSEN1, PTPN11, NTF3, TP53, NTRK1, NTRK3 | 0.007891206
REACTOME NCAM1 INTERACTIONS | Cell Adhesion | neuronal cell adhesion, cellular differentiation, survival, migration, synaptic plasticity | | COL1A1, COL2A1, COL4A3, PRNP, CACNA1S, CACNA1G, CACNA1H | 0.009759881
ST MYOCYTE AD PATHWAY | Signal Transduction | formation of interface between nervous system and cardiovascular system | | APC, PITX2, CAV3, RYR1, GNAQ, EPHB2 | 0.011606902
BIOCARTA GH PATHWAY | Signal Transduction | growth factor mediated signaling, dwarfism, activation of JAK-STAT, MAPK cascades | | HNF1A, GHR, GH1, PLCG1, INSR, SOS1 | 0.014524747
REACTOME NGF SIGNALLING VIA TRKA FROM THE PLASMA MEMBRANE | Signal Transduction | neuronal differentiation in response to neurotrophins | | DNM2, MEF2C, FOXO4, NTRK1, PLCG1, FRS2, MEF2A, PTEN, SOS1, PRKAR1A, RPS6KA3, TSC2 | 0.018499536
REACTOME INTEGRIN CELL SURFACE INTERACTIONS | Cell Adhesion, Protein Catabolism | cell adhesion to ECM, protein catabolic process | | COL2A1, COL4A3, FBN1, ITGA2B, COL1A1, PTPN1, RAPGEF4, SOS1, VWF | 0.025394674
PID P73 PATHWAY | Transcription, Protein Catabolism | apoptosis, proteasomal ubiquitination dependent protein catabolic process | | WT1, GATA1, BRCA2, BIN1, TP53AIP1, RB1, NTRK1, TP63, GNB2L1 | 0.025394674
ST DIFFERENTIATION PATHWAY IN PC12 CELLS | Signal Transduction | PC12 cell differentiation | | PTPN11, RPS6KA3, GNAQ, FRS2, NTRK1, EGR2, OPN1LW | 0.025972966
PID MET PATHWAY | Signal Transduction | growth factor mediated signaling | | PLCG1, PTPN1, PTPN11, HGF, EIF4E, INPPL1, APC, SOS1, MET | 0.028115505
PID TRKR PATHWAY | Signal Transduction | growth factor mediated signaling | | NTRK3, PLCG1, SOS1, FRS2, NTRK1, PTPN11, BDNF, NTF3 | 0.028499715
REACTOME DOWNSTREAM SIGNALING OF ACTIVATED FGFR | Signal Transduction | cell proliferation, differentiation, migration, survival and cell shape | | FRS2, FGFR1, FGFR3, FGFR2, FOXO4, PLCG1, PRKAR1A, PTEN, SOS1, TSC2 | 0.028989079
PID DELTA NP63 PATHWAY | Signal Transduction | calcium signaling | | TP63, VDR, DLX6, BRCA2, GNB2L1, KRT14, DLX5 | 0.03478047
PID NCADHERIN PATHWAY (TOLL PATHWAY) | Signal Transduction | inflammatory response, interferon production, cell proliferation and migration | | PTPN1, PLCG1, LRP5, FGFR1, GJA1, PTPN11 | 0.039242846
REACTOME NEURONAL SYSTEM | Nervous System | cell communication, neuronal development | 17 | CHRNA1, CHRNA7, GABRB3, GRIK2, GRIN2A, GRIN2B, KCND3, KCND2, KCNQ1, RPS6KA3, MAOA, SLC1A1, STXBP1, ABCC8, SYN1, CACNA1A, PICK1 | 0.046181684

¹ We excluded neuron development, neurite development, axon guidance, synapse development, and calcium signaling pathways as they were used as input knowledge in our integrative approach.

4.3.1 An Interesting Connection with Inflammatory Bowel Disease (IBD)

The fact that ASD patients often suffer from chronic inflammation of the gastrointestinal tract motivated us to look for possible shared pathogenesis between ASD and inflammatory bowel disease (IBD). We first identified a number of pathways related to IBD from an extensive literature review. These IBD pathways are often related to innate and adaptive immunity (T-cell signaling, chemokine signaling, NOD2 signaling, NF-κB signaling, IL23/Th17 signaling etc. [61]), autophagy (IL2 signaling, IL2RB signaling, IL10 signaling, IL6 signaling, TGF-β signaling etc.
[61,79]), necrosis (TNF signaling, TNFR1/2 signaling, etc.) and apoptosis (cytokine signaling [79,88]). A number of signaling pathways, such as ERK-MAPK signaling [18,118], WNT signaling [42], Notch signaling [70], adipocytokine signaling [50], integrin signaling, Hedgehog signaling [60], BMP signaling, Hippo signaling, and JAK-STAT signaling [10,32,102], have also been mentioned in relation to IBD. We collected the corresponding pathway gene sets from MSigDB. When we looked at the overlap of these pathways with our candidate gene set, we found that ASD risk genes are overrepresented in most of them. Table 4.2 lists the IBD-related pathways that have significant overlaps (p-value < 0.05) with ASD risk genes. This indicates that IBD and ASDs share some pathogenic mechanisms, which is worth further investigation.

Name | # Genes | Overlapped Genes | p-value
KEGG MAPK SIGNALING PATHWAY | 21 | MEF2C, FLNA, FLNB, BDNF, NF1, FGFR2, FGFR3, FGFR1, FGF14, RPS6KA3, IKBKG, MAPT, CACNA1H, CACNA1G, CACNA1A, CACNA1C, CACNA1S, SOS1, NTF3, TP53, NTRK1 | 1.09E-07
BIOCARTA ERK5 PATHWAY | 4 | MEF2C, MEF2A, PLCG1, NTRK1 | 0.000383858
BIOCARTA AKT PATHWAY | 4 | IKBKG, FOXO4, GHR, GH1 | 0.000861447
REACTOME CYTOKINE SIGNALING IN IMMUNE SYSTEM | 14 | EIF4E, FLNB, GH1, GHR, HGF, IL2RG, IL6, INPPL1, IRF6, PLCG1, PTPN1, SOS1, TRIM25, IKBKG | 0.001132398
REACTOME SIGNALLING TO ERKS | 4 | FRS2, NTRK1, PLCG1, SOS1 | 0.005568586
REACTOME PROLONGED ERK ACTIVATION EVENTS | 3 | FRS2, NTRK1, PLCG1 | 0.006035527
REACTOME PI3K AKT ACTIVATION | 4 | FOXO4, NTRK1, PTEN, TSC2 | 0.006763695
REACTOME ERK MAPK TARGETS | 3 | MEF2A, MEF2C, RPS6KA3 | 0.008043482
WNT SIGNALING | 6 | AES, APC, LRP5, PITX2, WNT2, HPRT1 | 0.008835118
BIOCARTA IL6 PATHWAY | 3 | PTPN11, IL6, SOS1 | 0.009177488
ST T CELL SIGNAL TRANSDUCTION | 4 | SOS1, PTPRC, PLCG1, EPHB2 | 0.012244256
PID TNF PATHWAY | 4 | SMPD1, GNB2L1, IKBKG, CYLD | 0.013204051
PID IL6 7 PATHWAY | 4 | PTPN11, IL6, MITF, SOS1 | 0.014210455
BIOCARTA TNFR1 PATHWAY | 3 | LMNA, SPTAN1, RB1 | 0.019653376
BIOCARTA TNFR1 PATHWAY | 3 | LMNA, SPTAN1, RB1 | 0.019653376
REACTOME NFKB ACTIVATION THROUGH FADD RIP1 PATHWAY MEDIATED BY CASPASE 8 AND 10 | 2 | TRIM25, IKBKG | 0.02298827
PID IL2 1 PATHWAY | 4 | SOS1, PTPN11, BCL2, IL2RG | 0.024014914
ST INTEGRIN SIGNALING PATHWAY | 5 | EPHB2, SOS1, PLCG1, WAS, PTEN | 0.024148774
KEGG HEDGEHOG SIGNALING PATHWAY | 4 | SHH, WNT2, BMP4, GLI3 | 0.025467375
ST ERK1 ERK2 MAPK PATHWAY | 3 | SOS1, RPS6KA3, EIF4E | 0.025536874
REACTOME APOPTOSIS | 7 | DSP, APC, LMNA, MAPT, BCL2, SPTAN1, TP53 | 0.029317822
BIOCARTA IL2RB PATHWAY | 3 | IL2RG, BCL2, SOS1 | 0.039815059
KEGG ADIPOCYTOKINE SIGNALING PATHWAY | 4 | PTPN11, PRKAG2, IKBKG, STK11 | 0.044909084
REACTOME IL 2 SIGNALING | 3 | IL2RG, INPPL1, SOS1 | 0.048181081

Table 4.2: IBD-related pathways having significant overlap with ASD risk genes.

4.4 Enrichment Analysis on GO gene sets

We performed hypergeometric enrichment analysis on the gene sets under the GO biological process and molecular function categories to find the biological processes and molecular functions in which our ASD risk genes are overrepresented. Significant biological processes and molecular functions were selected based on the hypergeometric p-values (< 0.05 after Bonferroni correction). As expected, the candidate gene set was overrepresented in a number of developmental processes related to the nervous system and brain. However, it is interesting to see that the risk gene set is significantly involved in processes such as tissue, muscle, epidermis, and ectoderm development, and organ morphogenesis. This finding might be supporting evidence for the fact that muscular dystrophy is a comorbid condition in many ASD cases [55]. Our candidate gene set is also found to be involved in a number of molecular functions related to ion channel activity, protein dimerization and binding, gated channel activity, sodium and calcium channel activity, etc. Figures 4-1 and 4-2 show the significant biological processes and molecular functions found by our analysis.
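The selection rule used in Sections 4.3 and 4.4 (hypergeometric p-value below 0.05 after Bonferroni correction) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the thesis: the function names, the toy gene identifiers, and the background size of 20 are hypothetical, and the background gene universe actually used is not stated in this section.

```python
from math import comb


def hypergeom_upper_tail(k, M, n, N):
    """P(X >= k) when N genes are drawn from a background of M genes,
    n of which belong to the gene set of interest."""
    upper = min(n, N)
    favourable = sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1))
    return favourable / comb(M, N)


def enrich(candidates, gene_sets, background_size, alpha=0.05):
    """Return {set name: (overlap size, Bonferroni-adjusted p)} for the
    gene sets whose adjusted p-value falls below alpha."""
    n_tests = len(gene_sets)  # Bonferroni: multiply each p by the number of tests
    significant = {}
    for name, members in gene_sets.items():
        overlap = len(candidates & members)
        p = hypergeom_upper_tail(overlap, background_size, len(members), len(candidates))
        p_adj = min(1.0, p * n_tests)
        if p_adj < alpha:
            significant[name] = (overlap, p_adj)
    return significant


# Hypothetical toy data: 5 candidate genes, 2 gene sets, 20 background genes.
toy_candidates = {"g1", "g2", "g3", "g4", "g5"}
toy_sets = {
    "SET_A": {"g1", "g2", "g3", "g4"},
    "SET_B": {"g5", "g6", "g7", "g8", "g9"},
}
toy_result = enrich(toy_candidates, toy_sets, background_size=20)
```

Exact integer binomial coefficients are used here so the sketch needs only the standard library; for genome-scale backgrounds a statistics package would normally be preferred.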
Figure 4-1: Significant GO biological processes associated with ASD risk gene set. The primary vertical axis shows the negative log of hypergeometric p-values. The secondary vertical axis shows the ratio of overlapped genes.

[Figure 4-2: significant GO molecular functions for the ASD risk gene set, plotted against -log(p-value) and ratio of overlap.]

4.5 Enrichment Analysis for Subnetworks

Analysis for subnetworks was performed using QIAGEN's Ingenuity® Pathway Analysis (IPA®, QIAGEN Redwood City, http://www.qiagen.com/ingenuity). IPA assembled subnetworks based on gene-to-gene connectivity, assuming that the more connected a gene is, the more influence it has and the more "important" it is. IPA selected a set of seed genes from our ASD risk gene set. Seeds with the most connections were then connected to other seeds to form a network. Non-seed genes as well as molecules from the IPA Knowledge Base were added to the network to fill or join the areas lacking connectivity.
For visualization purposes, we limited each subnetwork to a maximum of 35 nodes. Subnetworks were annotated with high level functional categories, scored, and sorted in descending order of score.

IPA network analysis revealed 25 significant subnetworks in our supplied ASD risk gene set. Figure 4-3 shows the top 4 subnetworks. The topmost subnetwork is characterized by tissue morphology and gastrointestinal disease terms. Nervous system development and function characterizes the second subnetwork. The third subnetwork is annotated by developmental, hereditary, and neurological disorders. Organismal injury and abnormalities as well as reproductive system disease characterize the fourth subnetwork. The complete list of significant subnetworks is given in Appendix C. These findings strongly suggest that subclasses of ASDs may exist, each characterized by one class of disorders (gastrointestinal disorders, developmental disorders, hereditary disorders, neurological disorders, organismal abnormalities, etc.), and call for further investigation.

4.6 Functional Analysis for Overlap with Diseases and Biofunctions

We performed functional analysis on our ASD risk gene set using QIAGEN's Ingenuity® Pathway Analysis (IPA®, QIAGEN Redwood City, http://www.qiagen.com/ingenuity). With the goal of providing a molecular understanding or model that could explain the functionality of the provided gene set, IPA analyzed it for diseases and functions using high quality GO information, manually curated information on diseases and disorders, and normal processes in abnormal tissues available in the IPA Knowledge Base. Significance of overlap between risk genes and genes in diseases and functions was calculated using Fisher's exact test.
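As an illustration of the overlap test named above, the sketch below computes a right-tailed Fisher's exact test on a 2x2 contingency table. It is an assumption that the right-tailed (overrepresentation) variant is the relevant one here; the text does not say which tail IPA reports. The function name and the cell labels are hypothetical.

```python
from math import comb


def fisher_exact_right_tailed(a, b, c, d):
    """Right-tailed Fisher's exact test for the 2x2 table

                        annotated   not annotated
        risk genes          a             b
        other genes         c             d

    Returns the probability of seeing at least `a` annotated risk genes
    when the row and column totals are held fixed.
    """
    risk_total = a + b          # first row total
    annotated_total = a + c     # first column total
    grand_total = a + b + c + d
    p = 0
    # Sum the hypergeometric probabilities of every table at least as extreme.
    for x in range(a, min(risk_total, annotated_total) + 1):
        p += comb(annotated_total, x) * comb(grand_total - annotated_total, risk_total - x)
    return p / comb(grand_total, risk_total)
```

With the margins fixed, this tail probability is the same quantity as the upper tail of a hypergeometric distribution, which is why the two tests are often used interchangeably for gene set overlap.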
IPA functional analysis revealed that our ASD risk gene set is significantly overrepresented in a number of diseases under different disease categories, including developmental and hereditary disorders, neurological disorders, connective tissue disorders, auditory disorders, gastrointestinal disorders, psychological disorders, dermatological disorders, inflammatory disorders, organismal abnormalities, cancers, etc. While the overlap of ASD with neurological, psychological, developmental, and hereditary disorders is obvious, its connection with gastrointestinal, auditory, and inflammatory disorders is less obvious and hence more interesting for further investigation. The top 10 diseases having significant overlap with our risk gene set are shown in Table 4.3.

Figure 4-3: Top four subnetworks in ASD risk gene set generated by QIAGEN's Ingenuity® Pathway Analysis (IPA). Panels: A. Network 1; B. Network 2; C. Network 3; D. Network 4.
Diseases | Categories | p-Value | # Genes
Autosomal Dominant Disease | Hereditary Disorder | 5.19E-66 | 106
Multiple Congenital Anomalies | Developmental Disorder | 1.14E-61 | 93
Congenital Anomaly of Musculoskeletal System | Developmental Disorder, Skeletal and Muscular Disorders | 2.42E-54 | 108
Dysplasia | Developmental Disorder | 3.90E-40 | 59
Cognitive Impairment | Neurological Disease | 1.79E-38 | 61
Autosomal Recessive Disease | Hereditary Disorder | 7.52E-36 | 95
Mental Retardation | Developmental Disorder, Neurological Disease | 4.05E-33 | 47
Congenital Anomaly of Limb | Developmental Disorder, Skeletal and Muscular Disorders | 3.41E-32 | 43
Dysplasia of Skeleton | Connective Tissue Disorders, Developmental Disorder, Skeletal and Muscular Disorders | 2.53E-29 | 37
Hypoplasia | Developmental Disorder | 3.89E-29 | 68

Table 4.3: Top 10 diseases having significant overlap with ASD risk genes found by QIAGEN's Ingenuity® Pathway Analysis (IPA).

Functions | Categories | p-Value | # Genes
Organismal Death | Organismal Survival | 5.39E-61 | 212
Differentiation of Cells | Cellular Development | 2.69E-55 | 190
Morphology of Head | Organismal Development | 4.37E-54 | 122
Cell Death | Cell Death and Survival | 2.76E-51 | 238
Abnormal Morphology of Head | Organismal Development | 2.81E-51 | 116
Morphology of Cells | Cell Morphology | 3.08E-50 | 179
Apoptosis | Cell Death and Survival | 2.97E-49 | 206
Morphology of Nervous System | Nervous System Development and Function | 3.70E-48 | 112
Development of Body Axis | Embryonic Development, Organismal Development | 1.16E-44 | 115
Development of Head | Embryonic Development, Organismal Development | 3.62E-44 | 109
Abnormal Morphology of Nervous System | Nervous System Development and Function | 2.02E-43 | 102
Development of Body Trunk | Embryonic Development, Organismal Development | 4.16E-41 | 115
Proliferation of Cells | Cellular Growth and Proliferation | 1.31E-40 | 231
Quantity of Cells | Tissue Morphology | 2.40E-40 | 151
Development of Central Nervous System | Nervous System Development and Function | 6.36E-40 | 87
Development of Neurons | Cellular Development, Nervous System Development and Function, Tissue Development | 1.45E-39 | 91
Development of Brain | Embryonic Development, Nervous System Development and Function, Organ Development, Organismal Development, Tissue Development | 3.42E-39 | 76
Length of Animal | Organismal Development | 5.02E-38 | 104
Necrosis | Cell Death and Survival | 9.09E-38 | 186
Size of Body | Organismal Development | 2.12E-37 | 103
Abnormal Morphology of Cells | Cell Morphology | 3.14E-37 | 127
Microtubule Dynamics | Cellular Assembly and Organization, Cellular Function and Maintenance | 8.73E-37 | 112
Morphology of Central Nervous System | Nervous System Development and Function | 3.16E-35 | 77
Behavior | Behavior | 3.95E-35 | 103
Morphology of Brain | Nervous System Development and Function, Organ Morphology, Organismal Development | 1.26E-34 | 73
Organization of Cytoskeleton | Cellular Assembly and Organization, Cellular Function and Maintenance | 1.15E-33 | 117
Abnormal Morphology of Brain | Nervous System Development and Function, Organ Morphology, Organismal Development | 2.56E-33 | 70
Morphology of Bone | Connective Tissue Development and Function, Embryonic Development, Organ Development, Organ Morphology, Organismal Development, Skeletal and Muscular System Development and Function, Tissue Development | 4.06E-33 | 71
Cell Movement | Cellular Movement | 4.66E-33 | 155
Abnormal Morphology of Central Nervous System | Nervous System Development and Function | 6.86E-33 | 72

Table 4.4: Top 30 functions having significant overlap with ASD risk genes found by QIAGEN's Ingenuity® Pathway Analysis (IPA).

ASD risk genes are also overrepresented in a number of functional categories, including nervous system development, cell death and survival, cellular development, embryonic development, organismal survival and development, cell and tissue morphology, behavior, etc. The top 30 functions having significant overlap with our risk gene set are shown in Table 4.4.
Chapter 5

Conclusion

In this thesis, we have explored different computational approaches for addressing the classic problem of disease gene prediction and prioritization in the context of autism spectrum disorders (ASD). We have introduced three novel computational methods, one ASD-specific generalized PageRank method, and a novel method that integrates the four, for solving the ASD gene prediction-prioritization problem.

Our first method calculates information entropy based scores for all the genes that can be mapped to the copy number variations that have ever been observed in the ASD population as well as in appropriate control groups, taking into account their frequency of occurrence in ASD case-control groups. Ranking the genes in descending order of CNV-based scores yields an area of 59.81% under the ROC curve and a 2.3-fold enrichment of ASD genes in the top 2% of the ranklist.

Our second method incorporates disease/phenotype similarity scores computed by van Driel et al. [112] and gene-phenotype relationships from the OMIM database. This method is seeded by high confidence ASD genes from the literature to identify ASD-like phenotypes in OMIM. Genes involved in diseases with phenotypes similar to ASDs are scored highly by this algorithm. This method achieves an area of 55.96% under the ROC curve excluding the seed genes. We are able to achieve a 3.62-fold gain in ASD genes in the top 2% of the ranklist.

In our third method, we introduce diffusion state ASD proximity (DSAP) for proteins, based on the diffusion state distance (DSD) metric, which is superior to direct neighborhood and shortest path distances in capturing the functional association of proteins in the PPI network. Genes are ranked in descending order of their diffusion state proximity to ASD seed genes. The DSAP-based prioritizer achieves an area of 54.05% under the ROC curve excluding the seed genes.
The top 4% of the ranklist shows a 1.1-fold enrichment of ASD genes (excluding seed genes). However, inclusion of seed genes boosts this enrichment up to 5.7-fold.

The fourth method we introduce is a generalization of Google's PageRank algorithm for ASDs. This approach uses the global PPI network structure to simulate network crosstalk between the genes in the network and high confidence ASD seed genes. The simulated crosstalk quantifies the functional association of ASD genes to the rest of the genes in the network. Genes are ranked in descending order of their association scores. We achieve an area of 56.11% under the ROC curve using this method. In the top 2% of the ranklist of genes, we achieve a 2.37-fold enrichment of ASD genes (excluding seeds).

Considering the unbalanced nature of our dataset, these methods can be considered to perform reasonably well, as each of them achieves an AUC above 50%. However, their performance is limited in that none of them gives an AUC above 60%. Thus, to increase the overall accuracy of ASD gene prediction, we propose a novel integrative approach which incorporates not only CNV, phenotype similarity, connectivity, proximity, and topological similarity in the PPI network, but also ASD pathway knowledge from the available literature. Each gene is assigned an association probability based on a simple yet powerful logistic regression model. Adaptive lasso penalization with cross validation is performed to avoid over-fitting of the model. Genes are ranked in descending order of their association probabilities. This integrative approach significantly outperforms the above four individual methods, achieving an area of 65.34% under the ROC curve on test data. The top 2% of the ranklist gives us a 3-fold enrichment of ASD genes (excluding seeds), which increases up to 11.2-fold with the inclusion of seed genes.
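The two evaluation measures quoted throughout this chapter, area under the ROC curve and fold enrichment in the top fraction of the ranklist, can be computed from a ranked gene list as sketched below. The definitions are common ones assumed here for illustration and may differ in detail from the formulas used in the thesis; the function names and any data passed to them are hypothetical.

```python
def fold_enrichment(ranked_genes, positives, top_fraction=0.02):
    """Rate of known ASD genes in the top slice of the ranklist divided by
    their rate in the whole list (assumes at least one positive is present)."""
    k = max(1, int(len(ranked_genes) * top_fraction))
    top_rate = sum(g in positives for g in ranked_genes[:k]) / k
    base_rate = sum(g in positives for g in ranked_genes) / len(ranked_genes)
    return top_rate / base_rate


def roc_auc(ranked_genes, positives):
    """AUC as the fraction of (positive, negative) gene pairs in which the
    positive is ranked above the negative. The ranklist is a strict order,
    so ties do not arise; assumes both classes are present."""
    positives_seen = 0
    correct_pairs = 0
    for gene in ranked_genes:  # best-ranked gene first
        if gene in positives:
            positives_seen += 1
        else:
            # every positive already seen outranks this negative
            correct_pairs += positives_seen
    n_pos = positives_seen
    n_neg = len(ranked_genes) - n_pos
    return correct_pairs / (n_pos * n_neg)
```

To reproduce the "excluding seed genes" figures, the seed genes would be removed from both the ranklist and the positive set before calling these functions.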
Thus we get a high quality candidate gene set for ASDs consisting of the top 2% of genes in the ranklist.

Our candidate gene set provides a number of interesting insights into the genetic background and pathophysiology of ASDs. Pathway enrichment analysis reveals that the candidate gene set is overrepresented in a number of signaling, cell adhesion, and neurological pathways, which can be used to better explain the pathophysiology of ASDs. We have been able to discover an interesting connection between ASDs and IBD by showing that our candidate gene set has significant overlap with the majority of the IBD-related pathways. We have also found several disjoint subnetworks in our candidate gene set characterized by different categories of diseases and bio-functions, which provide an indication of the existence of subclasses of disorders in the autism spectrum. The topmost subnetwork, characterized by gastrointestinal disorders, is particularly interesting and needs further investigation. Furthermore, we have identified a number of interesting molecular functions and biological processes by functional analysis and enrichment analysis on GO terms. For some of these (e.g., molecular functions related to metabolism, organ and tissue morphology, muscle cell differentiation, etc.), the connection to ASDs is not obvious and is thus worth further investigation.

There is considerable room for the further development of more sophisticated computational integrative approaches for combining ASD-related omics data from different sources. These techniques will become important as the omics data related to ASD are growing at a fast rate, given that more and more studies are being performed on larger ASD cohorts. Thus, sophisticated computational analysis is key to unraveling the etiology of ASDs. This thesis provides a significant step towards better understanding the biological underpinnings of ASDs.
Appendix A

SFARI Genes for Autism Spectrum Disorders

Table A.1 - ASD risk genes reported by SFARI gene module.¹

Gene Symbol | Gene Name | Chromosomal Location | # Reports
NRXN1 | neurexin 1 | 2p16.3 | 51
MECP2 | Methyl CpG binding protein 2 | Xq28 | 39
CNTNAP2 | contactin associated protein-like 2 | 7q35-q36 | 38
SHANK3 | SH3 and multiple ankyrin repeat domains 3 | 22q13.3 | 33
FMR1 | fragile X mental retardation 1 | Xq27.3 | 29
MET | met proto-oncogene (hepatocyte growth factor receptor) | 7q31 | 29
CACNA1C | calcium channel, voltage-dependent, L type, alpha 1C subunit | 12p13.3 | 27
RELN | Reelin | 7q22 | 27
FOXP2 | forkhead box P2 | 7q31 | 26
OXTR | oxytocin receptor | 3p25 | 26
DISC1 | disrupted in schizophrenia 1 | 1q42.1 | 24
DMD | dystrophin (muscular dystrophy, Duchenne and Becker types) | Xp21.2 | 22
NLGN3 | neuroligin 3 | Xq13.1 | 22
RBFOX1 | RNA binding protein, fox-1 homolog (C. elegans) 1 | 16p13.3 | 22
PTEN | phosphatase and tensin homolog (mutated in multiple advanced cancers 1) | 10q23.3 | 21
GABRB3 | gamma-aminobutyric acid (GABA) A receptor, beta 3 | 15q11.2-q12 | 20
NLGN4X | neuroligin 4, X-linked | Xp22.32-p22.31 | 20
SYNGAP1 | synaptic Ras GTPase activating protein 1 | 6p21.3 | 20
AUTS2 | autism susceptibility candidate 2 | 7q11.22 | 19
SCN1A | sodium channel, voltage-gated, type I, alpha subunit | 2q24.3 | 19
SLC6A4 | solute carrier family 6 (neurotransmitter transporter, serotonin), member 4 | 17q11.1-q12 | 19
DPP6 | dipeptidyl-peptidase 6 | 7q36.2 | 18
GRIN2B | glutamate receptor, ionotropic, N-methyl D-aspartate 2B | 12p12 | 18
GRIN2A | glutamate receptor, ionotropic, N-methyl D-aspartate 2A | 16p | 17
MBD5 | Methyl-CpG binding domain protein 5 | 2q23.1 | 17
EN2 | engrailed homolog 2 | 7q36 | 16
CDKL5 | cyclin-dependent kinase-like 5 | Xp22 | 15
HOXA1 | homeobox A1 | 7p15.3 | 15
NF1 | neurofibromin 1 (neurofibromatosis, von Recklinghausen disease, Watson disease) | 17q11.2 | 15
SCN2A | sodium channel, voltage-gated, type II, alpha subunit | 2q23-q24 | 15
SHANK2 | SH3 and multiple ankyrin repeat domains 2 | 11q13.3-q13.4 | 15
Gene Symbol Gene Name Chromosomal Location # Reports AHII Abelson helper integration site 1 6q23.3 14 CACNA1H calcium channel, voltage-dependent, alpha 1H subunit 16pl3.3 14 CNTN4 contactin 4 3p26-p25 14 ILlRAPL1 interleukin Xp22.1-p21.3 14 KCNMAI potassium large conductance calcium-activated channel, sub- 10q22.3 14 1 receptor accessory protein-like 1 family M, alpha member 1 RORA RAR-related orphan receptor A 15q22.2 14 SYNI Synapsin 1 Xp1l.23 14 TSC2 tuberous sclerosis 2 l6p13.3 14 MEF2C myocyte enhancer factor 2C 5q14 13 NLGN1 neuroligin 1 3q26.31 13 PCDH19 protocadherin 19 Xq13.3 13 SLC25A12 solute carrier family 25 (mitochondrial carrier, Aralar), mem- 2q24 13 9q34 15q11.2 13 12q14-q15 12 Xpll.22-pll.21 12 ber 12 TSC1 tuberous sclerosis 1 UBE3A ubiquitin protein ligase E3A AVPR1A arginine vasopressin receptor KDM5C Lysine (K)-specific demethylase NTRK3 neurotrophic tyrosine kinase, receptor, type 3 15q25 12 PARK2 Parkinson disease (autosomal recessive, juvenile) 2, parkin 6q25.2-q27 12 CACNA1G calcium channel, voltage-dependent, T type, alpha 17q22 11 1A 5C 1G sub- 13 unit DLX2 distal-less homeobox 2 2q32 11 ERBB4 v-erb-a erythroblastic leukemia viral oncogene homolog 4 2q33.3-q34 11 (avian) ITGB3 integrin, beta 3 (platelet glycoprotein Ilia, antigen CD61) 17q21.32 11 MACROD2 MACRO domain containing 2 20pl2.1 11 MAOA monoamine oxidase A Xpl1.3 11 MCPH1 microcephalin 8p23.1 11 MED12 mediator complex subunit 12 Xql3 11 MTHFR methylenetetrahydrofolate reductase (NAD(P)H) 1p36.3 11 RAPGEF4 Rap guanine nucleotide exchange factor (GEF) 4 2q31-q32 11 SLC1A1 solute carrier family 1 (neuronal/epithelial high affinity glu- 9p2 1 tamate transporter, system Xag), member 4 STXBP1 Syntaxin binding protein 1 9q34.1 TCF4 Transcription factor 4 18q21.1 ADRB2 adrenergic, beta-2-, receptor, surface 5q31-q32 AFF2 AF4/FMR2 family, member 2 Xq28 ANK3 Ankyrin 3, node of Ranvier (ankyrin G) 10q21 BAIAP2 BAll-associated protein 2 17q25 BCL2 B-cell EIF4E eukaryotic translation 
initiation factor 4E 4q21-q25 FOXPi forkhead box P1 3p14.1 GRIK2 glutamate receptor, ionotropic, kainate 2 6q16.3-q21 HDAC4 histone deacetylase 4 2q37.3 SYNEI spectrin repeat containing, nuclear envelope 1 ARID1B AT rich interactive domain ARNT2 aryl-hydrocarbon receptor nuclear translocator 2 15q24 ASTN2 astrotactin 2 9q33.1 DIAPH3 Diaphanous-related formin 3 13q21.2 DPP1O Dipeptidyl-peptidase 10 GRIPI glutamate receptor interacting protein IMMP2L IMP2 inner mitochondrial membrane peptidase-like (S. cere- CLL/lymphoma 11 1 2 18q21.3 1B (SWIl-like) 6q25 6q25.1 2q14.1 1 12q14.3 7q31 visiae) 1 OPHN1 oligophrenin SEMA5A sema domain, seven thrombospondin repeats (type 1 and type Xq12 9 5p15.2 9 1-like), transmembrane domain (TM) and short cytoplasmic domain, (semaphorin) 5A TTN titin 2q31 9 (continued on next page) 76 Table A.1 - Continued. Gene Symbol Gene Name Chromosomal Location # Reports WNT2 wingless-type MMTV integration site family member 2 7q31 9 ANKRD11 ankyrin repeat domain 11 16q24.3 8 ARX aristaless related homeobox Xp22.1-22.3 8 CADPS2 Ca2+-dependent activator protein for secretion 2 7q31.3 8 CHRNA7 cholinergic receptor, nicotinic, alpha 7 15q14 8 CTNNA3 catenin (cadherin-associated protein), alpha 3 10q22.2 8 DLX1 distal-less homeobox 1 2q32 8 DLX6 distal-less homeobox 6 7q22 8 ESR1 estrogen receptor 1 6q25.1 8 FHIT fragile histidine triad gene 3p14.2 8 FOXG1 Forkhead box G1 14q13 8 GLRA2 glycine receptor, alpha 2 Xp22.1-p21.3 8 HOXB1 homeobox Bi 17q21.3 8 HSD11B1 hydroxysteroid (11-beta) dehydrogenase 1q32-q41 8 NTNG1 netrin G1 lp13.3 8 PLCD1 phospholipase C, delta 3p22-p21.3 8 PTPRC protein tyrosine phosphatase, receptor type, C 1q31-q32 8 RAIl retinoic acid induced 8 RFWD2 ring finger and WD repeat domain 2 17plI.2 1q25.1-q25.2 RPLlO ribosomal protein L10 Xq28 8 SLC9A9 solute carrier family 9 (sodium/hydrogen exchanger), mem- 3q24 8 1 1 1 8 ber 9 SNDl staphylococcal nuclease and tudor domain containing 1 7q31.3 8 XPC xeroderma pigmentosum, 
complementation group C 3p25 8 ADORA2A adenosine A2a receptor 22q11.23 7 APC adenomatosis polyposis coli 5q21-q22 7 CADMI cell adhesion molecule 1 11q23.2 7 CDHIO cadherin 10, type 2 (T2-cadherin) 5p1 -p13 4 7 CHD7 chromodomain helicase DNA binding protein 7 Sq12.2 7 CNTNAP5 contactin associated protein-like 5 2ql4.3 7 CXCR3 chemokine (C-X-C motif) receptor 3 Xql3 7 DHCR7 7-dehydrocholesterol reductase 11q13.2-q13.5 7 DLGAP2 discs, large (Drosophila) homolog-associated protein 2 8p23 7 DLGAP3 Discs, large (Drosophila) homolog-associated protein 3 lp35.3-p34.1 7 DRD3 dopamine receptor D3 3q13.3 7 ESR2 estrogen receptor 2 (ER beta) 14q23.2 7 ESRRB estrogen-related receptor beta 12 41.0 cM 7 GPC6 glypican 6 13q32 7 GRPR Gastrin-releasing peptide receptor Xp22.2-p22.13 HRAS v-Ha-ras Harvey rat sarcoma viral oncogene homolog lip KCNJ10 Potassium inwardly-rectifying channel, subfamily J, member 1q23.2 7 1 5 5 . 7 7 10 MARK1 MAP/microtubule affinity-regulating kinase 1 1q41 7 MKL2 MKL/myocardin-like 2 16p13.12 7 NRXN3 neurexin 3 14q31 7 NTRK1 neurotrophic tyrosine kinase, receptor, type 1 1q21-q22 7 ROBO roundabout, axon guidance receptor, homolog 1 (Drosophila) 3p12 7 SLC9A6 solute carrier family 9 (sodium/hydrogen exchanger), mem- Xq26.3 7 ber 6 VPS13B vacuolar protein sorting 13 homolog B (yeast) 8q22.2 7 AFF4 AF4/FMR2 family, member 4 5q31 6 AR androgen receptor Xqll.2-q12 6 ATRX alpha thalassemia/mental retardation syndrome X-linked Xq2l.1 6 CA6 carbonic anhydrase VI 1p36.2 6 CACNA1D calcium channel, voltage-dependent, L type, alpha CDH8 cadherin 8, type 2 16q22.1 6 CDH9 cadherin 9, type 2 (Ti-cadherin) 5p14 6 CHD2 Chromodomain helicase DNA binding protein 2 15q26 6 CNR1 cannabinoid receptor 1 (brain) 1D 3 p 14 3 . 6ql4-q15 6 6 (continued on next page) 77 Table A.1 - Continu ed. 
Chromosomal Location # Reports 1p32-p31 6 doublecortex, lissencephaly, X-linked (doublecortin) Xq2i.3-q23 6 DPYD dihydropyrimidine dehydrogenase lp22 6 DYRK1A Dual-specificity tyrosine-(Y)-phosphorylation 21q22.13 6 Gene Symbol Gene Name DABI1 disabled homolog DCX 1 (Drosophila) regulated ki- nase 1A EPHA6 EPH receptor A6 3q11.2 6 GABRA4 gamma-aminobutyric acid (GABA) A receptor, alpha 4 4p12 6 GLO1 glyoxalase I 6p21.3-p21.l 6 GRID2 glutamate receptor, ionotropic, delta 2 4q22 6 GRM8 glutamate receptor, metabotropic 8 7q31.3-q32. 6 HTR1B 5-hydroxytryptamine (serotonin) receptor 1B 6q13 6 KCNQ2 Potassium voltage-gated channel, KQT-like subfamily, mem- 20q13.3 6 12q13-q14 6 2p25.3 6 ber 2 MYO1A myosin IA MYTlL Myelin transcription factor NOSIAP nitric oxide synthase 1 (neuronal) adaptor protein 1q23.3 6 NOS2A nitric oxide synthase 2A (inducible, hepatocytes) 17q11.2-q12 6 NRP2 neuropilin 2 2q33.3 6 PCDH9 protocadherin 9 13q21.32 6 PSMD1O proteasome (prosome, macropain) 26S subunit, non-ATPase, Xq22.3 6 1q23.1 6 Xp22.2-p22.1 6 Xq28 6 1-like 10 RGS7 regulator of G-protein signaling 7 RPS6KA3 Ribosomal protein SLC6A8 solute carrier family 6 (neurotransmitter transporter, crea- S6 kinase, 90kDa, polypeptide 3 tine), member 8 TBC1D5 TBC1 domain family, member 3p24.3 6 TH tyrosine hydroxylase 11p15.5 6 UPF3B UPF3 regulator of nonsense transcripts homolog B (yeast) Xq25-q26 6 WNK3 WNK lysine deficient protein kinase 3 Xpll.23-pll.21 6 ADA adenosine deaminase 20qI2-q13.11 5 AGAP1 ArfGAP with GTPase domain, ankyrin repeat and PH do- 2q37 5 main 1 aldehyde dehydrogenase 5 family, member Al semialdehyde dehydrogenase APBA2 (succinate- 6p22.2-p22.3 ) ALDH5A1 5 amyloid beta (A4) precursor protein-binding, family A, mem- 15q11-q12 ber 2 ARHGAP15 Rho GTPase activating protein 15 2q22.2-q22.3 5 BRAF v-raf murine sarcoma viral oncogene homolog B 7q34 5 C4B complement component 4B 6p21.3 5 CACNA1B Calcium channel, voltage-dependent, N type, alpha 9q34 5 1B sub- unit 18q12 CELF4 CUGBP, 
Elav-like family member 4 CTCF CCCTC-binding factor (sinc finger protein) CYFIP1 cytoplasmic FMR1 interacting protein DMPK dystrophia myotonica-protein kinase 19q13.3 DOCK4 Dedicator of cytokinesis 4 7q31.1 F13A1 coagulation factor XIII, Al polypeptide 6p25.3-p2 .3 FABP5 fatty acid binding protein 5 (psoriasis-associated) 8q21.13 GPXl glutathione peroxidase 1 3p21.3 GTF2I general transcription factor IIi 7q11.23 HEPACAM hepatic and glial cell adhesion molecule 11q24.2 HLA-A major histocompatibility complex, class I, A HS3ST5 heparan sulfate (glucosamine) 3-0-sulfotransferase 5 HTR2A 5-hydroxytryptamine (serotonin) receptor 2A HTR3C 5-hydroxytryptamine 16q21-q22.3 1 15q11 4 6 p21.3 6q21 13ql4-q21 (serotonin) receptor 3, family member 3q27.1 C HTR7 5-hydroxytryptamine (serotonin) receptor 7 (adenylate 10q21-q24 5 cyclase-coupled) IL1R2 interleukin 1 receptor, type II 2q12 5 (continued on next page) 78 Table A.l - Continued. Gene Symbol Gene Name Chromosomal Location # Reports ITGA4 integrin, alpha 4 (antigen CD49D, alpha 4 subunit of VLA-4 2q31.3 5 receptor) JARID2 Jumonji, AT rich interactive domain 2 6p24-p23 5 KCNQ3 Potassium voltage-gated channel, KQT-like subfamily, mem- 8q24 5 ber 3 LAMC3 laminin, gamma 3 9q31-q34 5 MAP2 microtubule-associated protein 2 2q34-q35 5 MBD1 methyl-CpG binding domain protein 1 18q21 5 MBD4 methyl-CpG binding domain protein 4 3q21-q22 5 MDGA2 MAM domain containing glycosylphosphatidylinositol anchor 14q21.3 5 MYO16 myosin XVI 13q33.3 5 NRCAM neuronal cell adhesion molecule 7q31.1-q31.2 5 PCDH10 protocadherin 10 4q28.3 5 PER1 period homolog 1 (Drosophila) l p PINX1 PIN2/TERF1 interacting, telomerase inhibitor 1 8p23 5 PITX1 paired-like homeodomain 1 5q31 5 PONI paraoxonase 1 7q21.3 5 PTCHD1 patched domain containing 1 Xp22.11 5 PTGS2 prostaglandin-endoperoxide synthase 2 (prostaglandin G/H 1q25.2-q25.3 5 7 13 .l-p12 5 synthase and cyclooxyge nase) SATB2 SATB homeobox 2 2q33 5 SCN8A sodium channel, voltage gated, type VIII, alpha 
subunit 12q13 5 SEZ6L2 SEZ6L2 seizure related 6 homolog (mouse)-like 2 16pll.2 5 SLC4A10 solute carrier family 4, sodium bicarbonate transporter-like, 2q23-q24 5 member 10 SNTG2 Syntrophin, gamma 2 2p25.3 ST8SIA2 ST8 alpha-N-acetyl-neuraminide alpha-2,8-sialyltransferase 2 15q26 STK39 serine threonine kinase 39 (STE20/SPS1 homolog, yeast) 2q24.3 TBR1 T-box, brain, 1 2q24 TGM3 traneglutaminase 3 20q11.2 TSPAN7 tetraspanin 7 Xpll.4 VIP Vasoactive intestinal peptide 6q25 ABAT 4-aminobutyrate aminotransferase 16p 3. ACYl Aminoacylase 1 3p2l.1 ADSL adenylosuccinate lyase 22q13.1, 22q13.2 ALOX5AP arachidonate 5-lipoxygenase-activating protein 13qI2 AP1S2 Adaptor-related protein complex 1, sigma 2 subunit Xp22.2 ATP2B2 ATPase, Ca++ transporting, plasma membrane 2 BZRAPI. bensodiasapine receptor (peripheral) associated protein CASC4 cancer susceptibility candidate 4 15q15.3 CD38 CD38 molecule 4p15 CDH22 cadherin-like 22 20q13.1 CHD8 chromodomain helicase DNA binding protein 8 14q11.2 CREBBP CREB binding protein 16p13.3 CTTNBP2 cortactin binding protein 2 7q31 CUL3 Cullin 3 CYPIBI cytochrome P450, family 11, subfamily B, polypeptide EGR2 early growth response 2 (Krox-20 homolog, Drosophila) 10q21.1 EPC2 Enhancer of polycomb homolog 2 (Drosophila) 2q23.1 EPHB6 EPH receptor B6 FBXO40 F-box protein 40 7q33-q35 3q13.33 GABRB1 gamma-aminobutyric acid (GABA) A receptor, beta 1 GALNT13 UDP-N-acetyl-alpha-D-galactosamine:polypeptide 1 2 3p25.3 1 17q22-q23 2q36.2 1 8q21 4 N- p12 2q23.3-q24.1 acetylgalactosaminyltransferase 13 (GalNAc-T13) GNAS GNAS complex locus 20q13.3 4 GPHN Gephyrin 14q23.3 4 GRM Glutamate receptor, metabotropic 1 6q24 4 HLA-DRB1 major histocompatibility complex, class II, DR beta 1 6p2l.3 4 (continued on next page) 79 Table A.1 - Continued. 
Gene Symbol Gene Name Chromosomal Location # Reports HUWE1 HECT, UBA and WWE domain containing 1, E3 ubiquitin Xp1l.22 4 protein ligase ICA1 islet cell autoantigen 1, 69kDa 7p22 4 JMJD1C jumonji domain containing 1C 10q21.2 4 KANK1 KN motif and ankyrin repeat domains 1 9p24.3 4 LRP2 Low density lipoprotein receptor-related protein 2 2q24-q31 4 LRRC1 leucine rich repeat containing 1 6p12.1 4 LZTS2 leucine 10q24 4 MBD3 methyl-CpG binding domain protein 3 19p13.3 4 MCC mutated in colorectal cancers 5q21 4 MTF1 metal-regulatory transcription factor 1 1p33 4 NBEA neurobeachin 13q13 4 NPAS2 neuronal PAS domain protein 2 2q11.2 4 NSD1 nuclear receptor binding SET domain protein 1 5q35 4 PHF8 PHD finger protein 8 Xp11.22 4 PIK3CG phosphoinositide-3-kinase, catalytic, gamma polypeptide 7q22.3 4 PLN phospholamban 6q22.1 4 PRICKLE1 Prickle homolog 12q12 4 PRKCB protein kinase C, beta 16p11.2 4 RAB39B RAB39B, member RAS oncogene family Xq28 4 SGSH N-sulfoglucosamine sulfohydrolase 17q25.3 4 SH3KBP1 SH3-domain kinase binding protein 1 Xp22.1-p21.3 4 SLC30A5 solute carrier family 30 5q12.1 4 SLC6A3 Solute carrier family 6 (neurotransmitter transporter), mem- 5pl5.3 4 sipper, putative tumor suppressor 2 1 (Drosophila) ber 3 SYN2 Synapsin II 3p25 4 TDO2 tryptophan 2,3-dioxygenase 4q31-q32 4 UBE3B ubiquitin protein ligase E3B 12q24.11 4 VASH1 vasohibin 1 14q24.3 4 ADNP Activity-dependent neuroprotector homeobox 20q13.13 3 AGBL4 ATP/GTP binding protein-like 4 1p AGTR2 angiotensin II receptor, type 2 Xq22-q23 3 ALDH1A3 Aldehyde dehydrogenase 15q26.3 3 APP Amyloid beta (A4) precursor protein 21q21.3 3 ASS1 argininosuccinate synthetase 9q34.1 3 BCKDK Branched chain ketoacid dehydrogenase kinase l6pl1.2 3 BIN1 Bridging integrator 1 2q14 3 C12orf57 Chromosome 12 open reading frame 57 12p13.31 3 C3orf58 chromosome 3 open reading frame 58 3q24 3 CAMTA1 calmodulin binding transcription activator 1p36.31-p36.23 3 CBS cystathionine beta-synthase 21q22.3 3 CD44 CD44 molecule (Indian blood 
group) 3 CEP290 Centrosomal protein 29OkDa lIp 12q21.32 CEP41 testis specific, 14 7q32 3 CMIP c-Maf inducing protein 16q23 3 DAPK1 death-associated protein kinase 1 9q34.1 3 DCTN5 dynactin 5 16pl2.2 3 DCUNID1 DCN1, defective in cullin neddylation 1, domain containing 3q26.3 3 1 1 family, member A3 1 33 13 3 3 (S. cerevisiae) DDX11 DEAD/H (Asp-Glu-Ala-Asp/His) box polypeptide 11 12pll 3 DDX53 DEAD (Asp-Glu-Ala-Asp) box polypeptide 53 Xp22.11 3 DRD1 Dopamine receptor D1 5q35.1 3 EHMT1 Euchromatic histone-lysine N-methyltransferase 1 9q34.3 3 EXT1 Exostosin 8q24.11 3 FATI FAT tumor suppressor homolog 1 (Drosophila) 4q35 3 FLT1 fms-related tyrosine kinase 1 (vascular endothelial growth 13q12 3 3 1 factor/vascular perme ability factor receptor) FRK fyn-related kinase 6q21-q22.3 FRMPD4 FERM and PDZ domain containing 4 Xp22.2 3 (continued on next page) 80 Table A.1 - Continued. Gene Symbol Gene Name Chromosomal Location # Reports GPD2 Glycerol-3-phosphate dehydrogenase 2 (mitochondrial) 2q24.1 3 GRIDI Glutamate receptor, ionotropic, delta 10q22 3 GRM5 Glutamate receptor, metabotropic 5 11ql4.3 3 GSTM1 glutathione S-transferase M1 1p13.3 3 HCFC1 Host cell factor C1 (VP16-accessory protein) Xq28 3 HNRNPH2 heterogeneous nuclear ribonucleoprotein H2 (H') Xq22 3 HOMER1 Homer homolog 1 (Drosophila) 5ql4.2 3 INPP1 inositol polyphosphate-l-phosphatase 2q32 3 IQSEC2 IQ motif Xpl1.22 3 ITGB7 integrin, beta 7 12q13.13 3 KCND2 Potassium voltage-gated channel, 7q31 3 and 1 Sec7 domain 2 Shal-related subfamily, member 2 KIAA1586 KIAA1586 6p12.1 3 NDNL2 necdin-like 2 15q13.1 3 NDUFA5 NADH dehydrogenase (ubiquinone) 1 alpha subcomplex, 5, l3kDa 7q32 3 1p31.3-p31.2 NFIA nuclear factor I/A NXF5 Nuclear RNA export factor 5 OPRM1 opioid receptor, mu OTX1 Orthodenticle homeobox 1 2p13 PCDHA11 Protocadherin alpha 11 PCDHA13 Protocadherin alpha 13 PCDHA2 Protocadherin alpha 2 5q31 5q31 5q31 PCDHA4 Protocadherin alpha 4 5q31 PCDHA5 Protocadherin alpha 5 5q31 PCDHA6 Protocadherin alpha 6 5q31 
PCDHA7 Protocadherin alpha 7 PCDHA9 Protocadherin alpha 9 5q31 5q31 PDZD4 PDZ domain containing 4 PLCB1 phospholipase C, beta POGZ Pogo transposable element with ZNF domain 1q21.3 PRICKLE2 Prickle homolog 2 (Drosophila) 3p1 .1 PRUNE2 prune homolog 2 (Drosophila) 9q21.2 PSD3 pleckstrin and Sec7 domain containing 3 8p2l.3 RBlCC1 RB1-inducible coiled-coil 1 Sql1 REEP3 receptor accessory protein 3 10q21.3 RHOXF1 Rhox homeobox family, member 1 Xq24 RIMS3 regulating synaptic membrane exocytosis 3 lpter-p22.2 RPS6KA2 ribosomal protein S6 kinase, 9OkDa, polypeptide 2 6q27 SDC2 syndecan 2 (heparan sulfate proteoglycan 8q22-q23 Xq22 1 6q24-q25 1 Xq28 (phosphoinositide-specific) 20p12 4 1, cell surface- ) associated, fibroglycan SOX5 SRY (sex determining region Y)-box 5 STX1A Syntaxin SUCLG2 succinate-CoA ligase, GDP-forming, beta subunit 3p1 .1 TAFIL TAF1 RNA polymerase II 9p21.1 TBClD7 TBC1 domain family, member 7 6p TLK2 tousled-like kinase 2 17q23 TMLHE trimethyllysine hydroxylase, epsilon Xq28 TOP1 Topoisomerase (DNA) I 20ql2-q13.1 TOP3B Topoisomerase (DNA) III beta 22q11.22 TRIP12 Thyroid hormone receptor interactor 12 2q36.3 TSN translin 2q21.1 TUBGCP5 tubulin, gamma complex associated protein 5 15q11.2 WNT1 Wingless-type MMTV integration site family, member 1 12q13 ADARB1 Adenosine deaminase, RNA-specific, BI 21q22.3 ADCY5 Adenylate cyclase 5 3q21.1 ADORAS Adenosine A3 receptor lpl3.2 ANK2 Ankyrin 2, neuronal 4q25-q27 ASXL3 Additional sex combs like 3 (Drosophila) 18qil 1A 12p12.1 (brain) 7q11.23 4 24 .1 (continued on next page) 81 Table A.1 Gene - Symbol CACNA1I Continued. Gene Name Chromosomal Location # Reports Calcium channel, voltage-dependent, T type, alpha 11 sub- 22q13.1 2 unit 13 CAPRIN1 Cell cycle associated protein 1 Ilp CCDC64 coiled-coil domain containing 64 12q24. 
3 2 CHRM3 Cholinergic receptor, muscarinic 3 1q43 2 CLTCL1 clathrin, heavy chain-like 1 22q11.21 2 CNTN3 contactin 3 (plasmacytoma associated) 3p12.3 2 CSMD1 CUB and Sushi multiple domains 1 8p23.2 2 CTNNB1 Catenin (cadherin-associated protein), beta 1, 88kDa 3p21 2 DDC Dopa decarboxylase (aromatic L-amino acid decarboxylase) 7pl2.2 2 DEPDC5 DEP domain containing 5 22ql2.3 2 DLG4 Discs, large homolog 4 (Drosophila) 17p13.1 2 DRD2 Dopamine receptor D2 11q23 2 EML1 echinoderm microtubule associated protein like 1 14q32 2 EP400 ElA binding protein p400 12q24.33 2 EPHB2 EPH receptor B2 1p36.1-p35 2 EXOC6B Exocyst complex component 6B 2pl3.2 2 FAM135B Family with sequence similarity 135, member B 8q24.23 2 FBXO33 F-box protein 33 14q21.1 2 FGD1 FYVE, RhoGEF and PH domain containing 1 Xpll.21 2 FOLHI Folate hydrolase (prostate-specific membrane antigen) 1 l1pI1.2 2 GALNT14 UDP-N-acetyl-alpha-D-galactosamine:polypeptide 2p23.1 2 3q13.3 2 15q13 2 2 2 N- acetylgalactosaminyltransferase 14 (GalNAc-T14) GSK3B Glycogen synthase kinase 3 beta HERC2 HECT and RLD domain containing E3 ubiquitin protein lig- ase 2 KATNAL2 Katanin p60 subunit A-like 2 18q21.1 2 KHDRBS2 KH domain containing, RNA binding, signal transduction as- 6q11.1 2 Xq13.3 2 2q23.1 2 sociated 2 KIAA2022 KIAA2022 KIF5C Kinesin family member LRPPRC Leucine-rich pentatricopeptide repeat containing 2p2I 2 MAOB Monoamine oxidase B Xp1l1.23 2 MC4R Melanocortin 4 receptor 18q22 2 NELL1 NEL-like 1 (chicken) 11p NIPA1 non imprinted in Prader-Willi/Angelman syndrome 1 15q11.2 2 NIPA2 non imprinted in Prader-Willi/Angelman syndrome 2 15q11.2 2 NIPBL Nipped-B homolog (Drosophila) 5pl3.2 2 NRXN2 neurexin 2 11q13 2 PCDH15 Protocadherin-related 15 10q21.1 2 PCDHAC2 Protocadherin alpha subfamily C, 2 5q31 2 PDE4A phosphodiesterase 4A, cAMP-specific 19pl3.2 2 PDE4B phosphodiesterase 4B, cAMP-specific lp3l 2 PEX7 peroxisomal biogenesis factor 7 6q23.3 POMGNT1 Protein 5C O-linked 1 mannose betal,2-N- 'p 5 1 . 34 1 2 2 . 
2 acetylglucosaminyltransferase PTPRT protein tyrosine phosphatase, receptor type, T 20q12-ql3 2 PXDN Peroxidasin homolog (Drosophila) 2p25 2 RAB11FIP5 RABIl family interacting protein 5 2p13 2 RBM8A RNA binding motif protein 8A 1q21.1 2 SAE1 SUMO1 activating ensyme subunit 1 19q13.32 2 SBF1 SET binding factor 1 22q13.33 2 SDK1 Sidekick cell adhesion molecule 7p22.2 2 SERPINE1 Serpin peptidase inhibitor, clade E (nexin, plasminogen acti- 7q21.3-q22 2 1 vator inhibitor type 1), member 1 3 SETD2 SET domain containing 2 SETDB2 SET domain, bifurcated 2 13q14 2 SGSM3 Small G protein signaling modulator 3 22q13.1-q13.2 2 SLIT3 Slit homolog 3 (Drosophila) 5q35 p21.31 2 2 (continued on next page) 82 Table A.1 - Continued. Gene Symbol Gene Name Chromosomal Location # Reports SODI Superoxide dismutase 1, soluble 21q22.11 2 ST7 suppression of tumorigenicity 7 7q31.1-q31.3 2 STXBP5 Syntaxin binding protein 5 (tomosyn) 6q24.3 2 SUV420H1 suppressor of variegation 4-20 homolog 1 (Drosophila) 11q13.2 2 SYAPI Synapse associated protein Xp22.2 2 SYT17 synaptotagmin 16p12.3 2 TBL1XR1 Transducin (beta)-like 1 X-linked receptor 1 3q26.32 2 TYR Tyrosinase (oculocutaneous albinism IA) 11ql4-q21 2 UBE3C Ubiquitin protein ligase E3C 7q36.3 2 YWHAE Tyrosine 3-monooxygenase/tryptophan 5-monooxygenase ac- 17p13.3 2 1 XVII tivation protein, epsilon polypeptide ABCA7 ATP-binding cassette, sub-family A (ABC1), member 7 19p13.3 ADK adenosine kinase 10qI1-q24 ARHGAP24 Rho GTPase activating protein 24 4q22.1 ATRNL1 Attractin-like 1 10q26 1 ATXN7 Ataxin 7 3p21.1-p12 1 BBS4 Bardet-Biedl syndrome 4 15q22.3-q23 BRCA2 breast cancer 2, early onset 13q12.3 BTAF1 RNA polymerase II, B-TFIID transcription factor-associated, 10q22-q23 1 170kDa (Motl homolog, S. 
cerevisiae) C15orf43 chromosome 15 open reading frame 43 15q21.1 1 CAMK4 Calcium/calmodulin-dependent protein kinase IV 5q21.3 1 CAMSAP2 calmodulin lq32.1 1 regulated spectrin-associated protein family, member 2 CD99L2 CD99 molecule-like 2 CDKN1B Cyclin-dependent kinase inhibitor CECR2 Cat eye syndrome chromosome region, candidate 2 CLSTN3 Calsyntenin 3 12p13.31 CNTNAP3 contactin associated protein-like 3 9p13.1 CSNK1D casein kinase 1, delta 17q25 DAPPI1 Dual adaptor of phosphotyrosine and 3-phosphoinositides 4q25-q27 DNAJC19 DnaJ (Hsp40) homolog, subfamily C, member 19 3q26.33 DNM1L Dynamin 1-like 12pl1.21 DOCK10 Dedicator of cytokinesis 10 2q36.2 DOLK Dolichol kinase 9q34.11 DST Dystonin 6p12.1 DUSP22 dual specificity phosphatase 22 6p25.3 DYDC1 DPY30 domain containing 1 DYDC2 DPY30 domain containing 2 10q23.1 10q23.1 EIF4EBP2 Eukaryotic translation initiation factor 4E binding protein 2 10q21-q22 EP300 ElA binding protein p300 22q13.2 EPS8 Epidermal growth factor receptor pathway substrate 8 12p12.3 ERG v-ets erythroblastosis virus E26 oncogene homolog (avian) 21q22.3 FANI FANCD2/FANCI-associated nuclease FBXO15 F-box protein 15 18q22.3 FER Fer (fps/fes related) tyrosine kinase 5q21 FGA Fibrinogen alpha chain 4q28 GABRA3 Gamma-aminobutyric acid (GABA) A receptor, alpha 3 Xq28 GAN Gigaxonin 16q24.1 GAP43 Growth associated protein 43 3q13.1-q13.2 GAS2 Growth arrest-specific 2 llp4.3 GNA14 Guanine nucleotide binding protein (G protein), alpha 14 9q21 GNBIL guanine 22q11.2 nucleotide polypeptide GPR37 Xq28 binding 1B (p27, Kipl) 12p13.1-p1 1 protein 2 22q11.2 15q13.2-q13.3 (G protein), beta 1-like G protein-coupled receptor 37 (endothelin receptor type B- 7q31 like) GRM4 Glutamate receptor, metabotropic 4 6p2l.3 GSN Gelsolin 9q33 GUCY1A2 Guanylate cyclase 1, soluble, alpha 2 11q21-q22 1 1 (contin ued 83 on next page) Table A.1 - Continued. 
Gene Symbol Gene Name Chromosomal Location # Reports HDAC6 Histone deacetylase 6 Xpl1.23 1 HMGN1 high mobility group nucleosome binding domain 1 21q22. HYDIN HYDIN, axonemal central pair apparatus protein 16q22.2 INADL InaD-like (Drosophila) lp31.3 KCTD13 Potassium channel tetramerisation domain containing 13 16pll.2 KIT V-kit Hardy-Zuckerman 4 feline sarcoma viral oncogene ho- 4q11-q12 2 1 1 1 molog KLC2 Kinesin light chain 2 11q13.2 KPTN Kaptin (actin binding protein) 19q13.32 LAMA1 Laminin, alpha 1 LAMB1 laminin, beta LEP Leptin 7q31.3 LMX1B LIM homeobox transcription factor 1, beta 9q33.3 LRRC7 Leucine rich repeat containing 7 1p31.1 MAGED1 Melanoma antigen family D, 1 Xp1l.23 MAGEL2 MAGE-like 2 MAPK1 Mitogen-activated protein kinase MAPK3 mitogen-activated protein kinase 3 16p11.2 MAPK8IP2 Mitogen-activated protein kinase 8 interacting protein 2 22q13.33 MBD6 Methyl-CpG binding domain protein 6 12q13 MSN Moesin Xql1.1 MSR1 macrophage scavenger receptor 1 8p22 MTR 5-methyltetrahydrofolate-homocysteine methyltransferase 1q43 MTX2 Metaxin 2 2q31.1 MYH4 Myosin, heavy chain 4, skeletal muscle 17pl3.1 NCKAP5L NCK-associated protein 5-like 12q13.12 NCKAP5 NCK-associated protein 5 2q21.2 NEFL Neurofilament, light polypeptide 8p2l ODF3L2 outer dense fiber of sperm tails 3-like 2 19p13.3 OGT O-linked Xql3 PAH Phenylalanine hydroxylase 12q22-q24.2 PARD3B Par-3 partitioning defective 3 homolog B (C. elegans) 2q33.3 PCDH8 protocadherin 8 13q21.1 PCDHGA11 protocadherin gamma subfamily A, 11 5q31 PECR peroxisomal trans-2-enoyl-CoA reductase 2q35 PIK3R2 Phosphoinositide-3-kinase, regulatory subunit 2 (beta) 19q13.2-q13.4 PLAUR Plasminogen activator, urokinase receptor 19q13 POTI Protection of telomeres 1 homolog (S. 
pombe) 7q31.33 PPFIA1 Protein tyrosine phosphatase, receptor type, f polypeptide 11ql3.3 18pll.3 1 7q22 15q11-q12 1 22q11.21 N-acetylglucosamine (GlcNAc) transferase (PTPRF), interacting protein (liprin), alpha 1 1B 17q12 PPP1R1B Protein phosphatase 1, regulatory (inhibitor) subunit PRKD1 Protein kinase Dl 14q11 PTGER3 Prostaglandin E receptor 3 (subtype EP3) lp3l.2 PTPN11 protein tyrosine phosphatase, non-receptor type 11 12q24 PTPRB Protein Tyrosine Phosphatase, Receptor Type, B 12q15-q21 RASD1 RAS, dexamethasone-induced 1 17pll.2 RASSF5 Ras association (RalGDS/AF-6) domain family member 5 1q32.1 RERE Arginine-glutamic acid dipeptide (RE) repeats 1p36.23 RNPS1 RNA binding protein S1, serine-rich domain l6p13.3 ROBO2 Roundabout, axon guidance receptor, homolog 2 (Drosophila) 3p12.3 RPP25 Ribonuclease P/MRP 25kDa subunit 15q24.2 SCFD2 seci family domain containing 2 4q12 SETDB1 SET domain, bifurcated 1 1q21 SHANKi SH3 and multiple ankyrin repeat domains 1 19q13.3 SLC16A3 solute carrier family 16, member 3 (monocarboxylic acid 17q25 transporter 4) SLC16A7 Solute carrier family 16, member 7 (monocarboxylic acid 12q13 transporter 2) (continued on next page) 84 Table A.1 - Continued. Gene Symbol Gene Name Chromosomal Location # Reports SLC25A14 Solute carrier family 25 (mitochondrial carrier, brain), mem- Xq24 1 lpl3.3 1 lp2l 1 1 ber 14 SLC25A24 Solute carrier family 25 (mitochondrial carrier; phosphate carrier), member 24 SLC35A3 Solute carrier family 35 (UDP-N-acetylglucosamine (UDP- GlcNAc) transporter), member A3 SLC38A1O solute carrier family 38, member 10 17q25.3 SLC39A11 Solute carrier family 39 (metal ion transporter), member 11 17q21.31 1 SMG6 Smg-6 homolog, nonsense mediated mRNA decay factor (C. 
17p13.3 1 elegans)
SNRPN  small nuclear ribonucleoprotein polypeptide N  15q11.2  1
SNX19  Sorting nexin 19  11q25  1
SPAST  Spastin  2p24-p21  1
SYN3  Synapsin III  22q12.3  1
SYT3  synaptotagmin III  19q13.33  1
TAF1C  TATA box binding protein (TBP)-associated factor, RNA polymerase I, C, 110kDa  16q24  1
TBL1X  transducin (beta)-like 1X-linked  Xp22.3  1
TBX1  T-box 1  22q11.21  1
THRA  Thyroid hormone receptor, alpha  17q11.2  1
TM4SF20  Transmembrane 4 L six family member 20  2q36.3  1
TNIP2  TNFAIP3 interacting protein 2  4p16.3  1
TOMM20  Translocase of outer mitochondrial membrane 20 homolog (yeast)  1q42  1
TPO  Thyroid peroxidase  2p25  1
TRIM33  Tripartite motif containing 33  1p13.1  1
TTI2  TELO2 interacting protein 2  8p12  1
UBA6  Ubiquitin-like modifier activating enzyme 6  4q13.2  1
UBE2H  ubiquitin-conjugating enzyme E2H (UBC8 homolog, yeast)  7q32  1
UBL7  ubiquitin-like 7 (bone marrow stromal cell-derived)  15q24.1  1
UBR5  Ubiquitin protein ligase E3 component n-recognin 5  8q22  1
UBR7  ubiquitin protein ligase E3 component n-recognin 7 (putative)  14q32.12  1
UPF2  UPF2 regulator of nonsense transcripts homolog (yeast)  10p14-p13  1
USP9Y  ubiquitin specific peptidase 9, Y-linked  Yq11.2  1
XPO1  Exportin 1 (CRM1 homolog, yeast)  2p15  1
YEATS2  YEATS domain containing 2  3q27.1  1
YTHDC2  YTH domain containing 2  5q22.2  1
ZBTB16  Zinc finger and BTB domain containing 16  11q23.1  1
ZNF18  zinc finger protein 18  17p11.2  1
ZNF407  Zinc finger protein 407  18q23  1
ZNF827  Zinc finger protein 827  4q31.22  1
ZSWIM5  zinc finger, SWIM-type containing 5  1p34.1  1

1. We consider only the genes which can be mapped to the largest connected component of our PPI network.

Appendix B

Risk Genes for ASDs Identified by Integrative Approach

Table B.1 - Probabilities of association with ASDs for candidate genes identified by our integrative analysis approach.
Gene Symbol     Association Probability
SHANK2          1.000000
HGF             1.000000
CACNA1H         1.000000
EN2             1.000000
MTHFR           1.000000
GRIN2A          1.000000
ANKRD11         1.000000
GATAD2B         1.000000
FBN1            1.000000
COBL            1.000000
BAIAP2          1.000000
TTN             1.000000
SLC1A1          1.000000
FOXP4           1.000000
GABRB3          1.000000
MACROD2         1.000000
TBX1            1.000000
STOX1           1.000000
TSC1            1.000000
SND1            1.000000
HSPC215         1.000000
GJB2            1.000000
STXBP1          1.000000
GRIN2B          1.000000
KRTAP5-9        1.000000
FAM154A         1.000000
RP11-220B22.3   1.000000
LGALS13         1.000000
HSFY1           1.000000
KRTAP26-1       1.000000
KRTAP3-3        1.000000
MIR10A          1.000000
ARX             1.000000
RP11-328M4.1    1.000000
SCT             1.000000
SCN1A           1.000000
NF1             1.000000
AVPR1A          1.000000
MIR223          1.000000
MIR181A1        1.000000
LMNA            1.000000
BCL2            1.000000
TCF4            1.000000
PAX5            1.000000
PAX6            1.000000
PTEN            1.000000
MEF2C           1.000000
MECP2           1.000000
MYH9            1.000000
TRPV4           1.000000
FGFR3           1.000000
NCKIPSD         1.000000
SYN1            1.000000
COL1A1          1.000000
CTNNA3          1.000000
RAI1            1.000000
MCPH1           1.000000
CPE             1.000000
RPL10           1.000000
TP63            1.000000
DIAPH3          1.000000
OPHN1           1.000000
MET             1.000000
SLC25A12        1.000000
SETD2           1.000000
DLX1            1.000000
HOXA1           1.000000
GLI3            1.000000
FGFR2           1.000000
COL2A1          1.000000
FLNA            1.000000
MED12           1.000000
REST            1.000000
GNAS            1.000000
MSX2            1.000000
CNTNAP2         1.000000
ANK3            1.000000
GRIK2           1.000000
NLGN3           1.000000
FOXP2           1.000000
GBA             1.000000
GJA1            1.000000
TWIST2          1.000000
TP53AIP1        1.000000
AVP             1.000000
SLC6A4          1.000000
FLNB            1.000000
DMD             1.000000
OXTR            1.000000
DLGAP3          0.999900
CHM             0.999900
ALPL            0.999900
FGF14           0.999900
PAX2            0.999900
SNCG            0.999900
RFWD2           0.999900
FGFR1           0.999900
COL7A1          0.999900
LRP5            0.999800
SMN1            0.999800
AR              0.999800
PRNP            0.999800
GLB1            0.999800
SCN8A           0.999700
RNU5A-1         0.999700
BDNF            0.999700
AES             0.999700
DSP             0.999700
HOXD13          0.999700
ZMIZ1           0.999700
ITLN1           0.999700
PLCD1           0.999600
MAPT            0.999600
XPC             0.999600
PAX3            0.999600
SCN5A           0.999500
WAS             0.999500
PITX2           0.999500
RP11-519K18.1   0.999500
MSX1            0.999500
UNQ640/PRO1270  0.999400
GATA1           0.999200
RELN            0.999100
CACNA1G         0.999100
RET             0.999100
MPZ             0.999000
KRT1            0.999000
CHRNA7          0.998800
RECQL4          0.998800
DLX2            0.998800
CNTN4           0.998700
IL1RAPL1        0.998700
DISC1           0.998600
DLX5            0.998400
PELO            0.998400
EDA             0.998300
CACNA1C         0.998200
FAM108A1        0.998200
COL11A2         0.998100
SLC26A2         0.998000
ABCC8           0.998000
NOG             0.997900
DLGAP1          0.997900
POLG            0.997800
MGAT3           0.997800
CHRNA1          0.997800
L1CAM           0.997600
CGI-17          0.997600
IKBKG           0.997500
COL11A1         0.997300
SDC3            0.997300
TDRD7           0.997200
TYR             0.997000
ASCL3           0.997000
CACNA1A         0.997000
GRIP1           0.996800
ESCO2           0.996700
FOXP1           0.996700
CADPS2          0.996700
AFF2            0.996600
FMR1            0.996500
LOC347475       0.996400
MYH7            0.996400
PARK2           0.996300
GNPTAB          0.996300
TP53            0.996300
FHIT            0.996300
RP11-298P3.3    0.996200
KDM5C           0.996200
WT1             0.996200
PRKAR1A         0.996100
RNU4-1          0.995900
CAPN3           0.995700
DNM2            0.995500
GP1BA           0.995500
ATP7A           0.995400
RORA            0.995400
RYR1            0.995300
RP11-394C23.1   0.995300
DPP10           0.995200
SCN9A           0.995100
ARID1B          0.995000
NCAPG2          0.994800
KCNMA1          0.994700
SYNE1           0.994700
AUTS2           0.994500
KRT14           0.994400
UGT1A1          0.994300
EFNB1           0.994300
AHI1            0.994200
ERBB4           0.994200
KLHL1           0.994100
SLC9A9          0.994000
TWIST1          0.993700
PMP22           0.993600
PCDH19          0.993600
SEMA5A          0.993500
CPT2            0.993500
ATRX            0.992900
ARNT2           0.992800
ATR             0.992800
HDAC4           0.992200
LBR             0.991800
PTPN11          0.991700
KCTD3           0.991600
NTRK3           0.991400
NLRP3           0.991200
UBE3A           0.991000
VDR             0.990800
ERCC6           0.990800
SRGAP2          0.990700
RP11-5F19.1     0.990600
NLGN1           0.990500
VLDLR           0.990300
EPHB2           0.990200
EDNRB           0.990100
GH1             0.990000
SCN2A           0.990000
NLGN4X          0.989500
CMC1            0.989500
RUNX2           0.989400
APC             0.989300
HSD11B1         0.989300
PSEN1           0.988700
ADRB2           0.988600
SPSB3           0.988500
CBP             0.988300
GNE             0.988300
EIF4E           0.988100
PTPRC           0.987700
DLX6            0.987200
NRXN1           0.986700
NDRG1           0.986400
DUSP22          0.986400
ERCC6L2         0.986200
EVC             0.986000
RMRP            0.985600
MGAT5B          0.985300
ERAG            0.984900
4-OCT           0.984900
SPINK5          0.984400
OFD1            0.984200
MBD5            0.983700
TRPS1           0.983500
SYNE3           0.983100
BSCL2           0.982600
TSC2            0.982400
FH              0.982100
COQ2            0.981700
BRCA2           0.981600
SALL4           0.980500
NCS-1           0.980200
FDXR            0.979900
CTR9            0.979400
MAOA            0.979100
EYA1            0.979000
HR              0.978900
TBCE            0.978200
TR-B            0.977400
HOXA13          0.977100
SPSB4           0.976700
MBNL2           0.976700
PC              0.976400
ANKH            0.976300
PRPS1           0.976200
PHLDA3          0.975900
EPB41L3         0.975800
HOXC8           0.975700
MYO7A           0.975400
SHANK3          0.974700
CAV3            0.974400
MYO5A           0.973400
SCN4A           0.973200
IL2RG           0.972800
FAM111A         0.972300
NR0B1           0.972200
DYSF            0.971900
HOXD3           0.971200
DPP6            0.971100
SOST            0.971000
RNU6-1          0.970500
DYT10           0.970500
GNB2L1          0.970200
ALG6            0.969900
WFS1            0.969600
XPA             0.966300
BUB1B           0.966300
INPPL1          0.966200
CDKL5           0.966100
IL6             0.966000
ZEB2            0.965700
RB1             0.965700
ALS2            0.965400
PDP1            0.964600
ESR1            0.962300
INSR            0.962200
LRIG1           0.962200
SHH             0.962100
GDAP1           0.961900
CD96            0.960800
HHF1            0.960000
CTSC            0.958600
PTPN1           0.958200
KCND3           0.956900
TGFBI           0.956600
FRS2            0.955700
THTPA           0.955300
CACNA1S         0.955200
ITGA2B          0.953700
GDF5            0.953300
TACC3           0.952700
GHR             0.952500
COL4A3          0.952400
DLL3            0.950700
SLC17A5         0.950700
WNT2            0.949700
FOXG1           0.949300
PIGL            0.949200
AIM1            0.949100
TADA3           0.948600
EMD             0.948200
MIF             0.948000
NTF3            0.948000
IRF6            0.947700
PAX8            0.947400
AASS            0.947300
HMGA2           0.947200
NF2             0.946400
NPHP1           0.946000
KCND2           0.945900
SMARCA2         0.945800
PEX5            0.945700
TBR1            0.944600
THRB            0.944600
SCARF2          0.944000
PANK2           0.943900
HSPG2           0.943600
ARSB            0.943100
FAM123B         0.942500
LAMA3           0.940800
SMS             0.940000
ABCA4           0.939900
SLC13A3         0.939100
KAT6B           0.938900
SMPD1           0.937900
ERCC2           0.937000
TREX1           0.936700
FLCN            0.936000
HOXB1           0.934500
SIM2            0.933800
SNTG1           0.933200
NKX2-1          0.933100
RP11-258C19.2   0.930600
PKHD1           0.929300
FCP1            0.928300
HPRT1           0.928300
ELANE           0.928100
NELL2           0.926300
PRRT2           0.925700
ROR2            0.925600
APC2            0.925500
FKRP            0.925300
OPA1            0.925200
SLC37A4         0.924400
MEF2A           0.923000
HBB             0.922400
STK11           0.922300
RAPGEF4         0.922000
EYA4            0.918500
CDH3            0.917900
ZNF81           0.917700
N4BP2L2         0.917500
DYM             0.917500
EGR2            0.917400
BIN1            0.917000
HNF1A           0.915500
NTNG1           0.910400
CPS1            0.910100
KIT             0.909700
AHNAK           0.908500
CFTR            0.907800
TDO2            0.907000
TTR             0.906300
AVEN            0.906300
MITF            0.904200
SPTAN1          0.903100
TBPL1           0.902200
GLRA2           0.901600
PTH             0.901600
SMOC1           0.900300
RPS6KA3         0.899000
ADCK4           0.897800
SEC23A          0.896300
ASTN2           0.892600
CYLD            0.892400
BMP4            0.892100
MBNL1           0.889700
FOXH1           0.888200
KCNQ1           0.887100
ATP2A2          0.885500
DES             0.884400
CASR            0.882600
FLT4            0.877300
BCL7B           0.874000
DAAP-218M18.8   0.872500
FOXO4           0.872500
SOX3            0.872500
SYNM            0.871800
RAB40B          0.871400
GNAQ            0.869900
RP11-419L10.1   0.869900
SMARCE1         0.869400
FAM189A1        0.868800
PHF11           0.866700
PICK1           0.866200
XIST            0.866100
GPC6            0.864900
ATP8B1          0.862900
HSD17B4         0.861100
TRIM25          0.861000
NTRK1           0.859300
PLP1            0.858700
TBC1D24         0.857300
PLCG1           0.857300
OPN1LW          0.857200
GBE1            0.855300
ELN             0.854300
DDX59           0.853800
FOXL2           0.853400
ABCC6           0.852900
LHCGR           0.848500
VWF             0.846900
NOS3            0.846500
TSSK2           0.846100
STIM1           0.845000
DDB2            0.844800
VHL             0.844800
ATP6V0A2        0.844700
PRKAG2          0.844700
NCOA2           0.844600
NPHP3           0.842200
SOS1            0.842000
ITM2B           0.841300

Appendix C

Subnetworks in ASD Risk Gene Set

Table C.1 - Subnetworks in ASD risk gene set generated by QIAGEN's Ingenuity® Pathway Analysis
(IPA).1

ID: 1
Molecules in Network: ADRB2, Ap1, ARNT2, ATP6V0A2, ATP8B1, CFTR, EGR2, FLNA, GBE1, GH1, GNB2L1, HBB, HNF1A, HPRT1, IL1RAPL1, INSR, Insulin, KIT, Mek, MET, p70 S6k, p85 (pik3r), PAX6, PDGF BB, Proinsulin, SLC37A4, SLC9A9, SND1, SNTG1, SPSB3, THRB, UGT1A1, VHL, WFS1, XIST
Score: 46
Seed Genes: 28
Top Diseases and Functions: Cancer, Tissue Morphology, Gastrointestinal Disease

ID: 2
Molecules in Network: ABHD17A, ADCK4, BSCL2, CHRNA1, CHRNA7, DUSP22, ERBB, ERK, FAM154A, FDXR, Hnf3, HOXA1, HOXD3, HSFY1/HSFY2, Igfbp, ITGA2B, KRTAP26-1, KRTAP3-3, KRTAP5-9, L1CAM, LAMA3, LGALS13, MGAT5B, N4BP2L2, Nicotinic acetylcholine receptor, NKX2-1, NRG (family), PAX8, PC, POLG, SERCA, sGC, SMOC1, SOST, TGFBI
Score: 44
Seed Genes: 27
Top Diseases and Functions: Cancer, Tissue Morphology, Nervous System Development and Function

ID: 3
Molecules in Network: AES, Akt, ANK3, ARX, Calbindin, CDKL5, CTNNSS-TCF/LEF, CYP19, DLX1, DLX2, DLX5, DLX6, FGF14, FOXG1, FOXH1, Foxo, FOXO4, GABRB3, HOXC8, KDM5C, MECP2, MIR124, MSX1, MSX2, PMP22, REST, SCN1A, SCN2A, SCN4A, SCN5A, SCN8A, SCN9A, SOX2-OCT4-NANOG, SYNM, voltage-gated sodium channel
Score: 43
Seed Genes: 27
Top Diseases and Functions: Developmental Disorder, Hereditary Disorder, Neurological Disease

ID: 4
Molecules in Network: ABCC6, ABCC8, ARID1B, ATPase, BCL7B, CADPS2, CTR9, cyclooxygenase, DES, DISC1, DMD, EMD, GDAP1, GNPTAB, IL23, KAT6B, LDL-cholesterol, LMNA, MEF2C, MYH7, MYH9, NCAPG2, OFD1, P38 MAPK, PELO, PHLDA3, SLC25A12, SMARCA2, SMARCE1, Spectrin, SPTAN1, SRGAP2, Tropomyosin, Troponin t, tubulin (family)
Score: 41
Seed Genes: 26
Top Diseases and Functions: Cancer, Organismal Injury and Abnormalities, Reproductive System Disease

ID: 5
Molecules in Network: 20s proteasome, AHI1, AMER1, APC, APC (complex), B-cell receptor, BUB1B, Ctbp, Eif4g, FH, GBA, GJB2, Glycogen synthase, Histone H1, INPPL1, KRT1, KRT14, Lamin, Mapk, MBD5, MYO7A, NF2, NPHP1, NPHP3, OPA1, PRKAC, PRKAR1A, Rab5, Snare, SNCG, STXBP1, SYN1, SYNE1, SYNE3, TACC3
Score: 34
Seed Genes: 23
Top Diseases and Functions: Hereditary Disorder, Auditory Disease, Neurological Disease

ID: 6
Molecules in Network: ASTN2, ATR, BRCA2, Cdc2, CNTNAP2, Cyclin B, DDB2, ERCC2, ERCC6, MBNL1, MBNL2, MCPH1, NCOA2, Nuclear factor 1, PARP, Pde4, PDP1, PHF11, PRKAG2, RECQL4, RFWD2, RNA polymerase I, RNA polymerase II, Rnr, RPA, SETD2, TDO2, TFIIH, TP53, TP53AIP1, TRIM25, Ube3, XPA, XPC, ZMIZ1
Score: 34
Seed Genes: 24
Top Diseases and Functions: Cancer, Dermatological Diseases and Conditions, Hereditary Disorder

ID: 7
Molecules in Network: 7S NGF, Arp2/3, BAIAP2, Beta Tubulin, BMP, CAPN3, COBL, COL4A3, DIAPH3, DLGAP1, DLGAP3, EDA, elastase, ETS, Filamin, G-Actin, Gli, GRIN2B, Integrin alpha 3 beta 1, KCND3, LBR, MYO5A, NCKIPSD, NFkB (complex), NLGN1, NLGN3, Profilin, RAPGEF4, RELN, RPS6KA3, SHANK2, SHANK3, SMS, TBR1, Wave
Score: 32
Seed Genes: 22
Top Diseases and Functions: Cell-To-Cell Signaling and Interaction, Nervous System Development and Function, Behavior

ID: 8
Molecules in Network: Alpha tubulin, APC2, APC/APC2, BMP4, CK1, Cyclin D, CYLD, Dgk, Dishevelled, Dynein, FLCN, GLI3, Hedgehog, IRS, Jnk, KCNQ1, LRP, LRP5, mir181, MPZ, PAX2, PAX3, PITX2, ROR2, Secretase gamma, SHH, SOX3, TBX1, TWIST2, Vdac, VLDL, VLDLR, Wnt, WNT2, ZEB2
Score: 28
Seed Genes: 20
Top Diseases and Functions: Embryonic Development, Organismal Development, Gene Expression

ID: 9
Molecules in Network: 14-3-3, ATP7A, c-Src, COL11A1, COL11A2, COL1A1, COL2A1, COL7A1, collagen, Collagen type II, Collagen Type XI, Collagen(s), Cpla2, DLL3, ELANE, ELN, EPB41L3, FBN1, Fc gamma receptor, Fibrin, GDF5, HOXD13, HSD11B1, HSD17B4, mir-10, MTORC2, Notch, PEX5, SMOOTH MUSCLE ACTIN, SPINK5, trypsin, TSC1, TSC2, Vegf, VWF
Score: 28
Seed Genes: 21
Top Diseases and Functions: Developmental Disorder, Connective Tissue Disorders, Skeletal and Muscular Disorders

ID: 10
Molecules in Network: 26s Proteasome, AR, ATRX, caspase, Cdk, Cyclin E, Cytochrome bc1, cytochrome C, cytochrome-c oxidase, EIF4E, ESCO2, GATA1, HDAC4, HOXA13, Hsp27, Hsp70, Hsp90, MAOA, MED12, Mitochondrial complex 1, NDRG1, PARK2, PP2A, PRPS1, Rb, SEC23A, SLC6A4, SRC (family), STK11, TBCE, TRPS1, TSSK2, TYR, Ubiquitin, WT1
Score: 27
Seed Genes: 20
Top Diseases and Functions: Cancer, Cell Death and Survival, Cell Cycle

ID: 11
Molecules in Network: ANKH, Atrial Natriuretic Peptide, Cacna1, cacn, CACNA1A, CACNA1C, CACNA1G, CACNA1H, CACNA1S, CAV3, DPP6, DPP10, ERK1/2, GLRA2, GNAS, Homer, ITPR, KCND2, KCNMA1, KLHL1, L-type Calcium Channel, MGAT3, NELL2, Neurotrophin, NOG, NRXN1, Pka catalytic subunit, Pkg, Pki, potassium channel, Presenilin, Ryr, RYR1, T-type Calcium Channel, voltage-gated calcium channel
Score: 26
Seed Genes: 19
Top Diseases and Functions: Molecular Transport, Cancer, Organismal Injury and Abnormalities

ID: 12
Molecules in Network: ABCA4, Ahr-aryl hydrocarbon-Arnt, ARSB, Bcl9-Cbp/p300-Ctnnb1-Lef/Tcf, Cbp/p300, CDH3, CPE, Cyclin A, DSP, E2f, EN2, ESR1, Esr1-Esr1-estrogen-estrogen, estrogen receptor, FMR1, FOXL2, glutathione peroxidase, Hat, HISTONE, Histone h4, HMGA2, NR0B1, RNase A, RPL10, SDC3, SEMA5A, Smad1/5/8, Smad2/3, Smad2/3-Smad4, SMN1/SMN2, TADA3, TBPL1, TP63, TWIST1, UBE3A
Score: 24
Seed Genes: 20
Top Diseases and Functions: Renal and Urological System Development and Function, Reproductive System Development and Function, Organismal Development

ID: 13
Molecules in Network: AHNAK, ATP2A2, Cebp, CPT1, CPT2, DYM, FOXP1, FOXP2, FOXP4, GATAD2B, HR, IL6, INTERLEUKIN, IRF6, ITLN1, JUN/JUNB/JUND, N-cor, Na+,K+-ATPase, Nr1h, PEPCK, Pmca, PRKAA, PTH, RAI1, Rar, Rbp, Rxr, SALL4, STIM1, SWI-SNF, Tcf 1/3/4, thymidine kinase, thyroid hormone receptor, TTR, VitaminD3-VDR-RXR
Score: 22
Seed Genes: 17
Top Diseases and Functions: Cancer, Gastrointestinal Disease, Hematological Disease
14 | ALS2, Ampa Receptor, CaMKII, Cofilin, Ctnna, EFNB1, EPHB2, F Actin, FHIT, FLNB, GNE, GRI, GRIK2, GRIN2A, GRIP1, Integrin alpha V beta 3, mGluR, Mic, Myosin, N-Cadherin, OPHN1, Pak, PCDH19, PICK1, Pkc(s), PLCD1, PP1 protein complex group, Pp2b, Rab11, RAB40B, Rap, Rap1, SLC1A1, TSH, TTN | 18 | 16 | Cell Death and Survival, Cancer, Gastrointestinal Disease
15 | amylase, chymotrypsin, Collagen type III, Cytokeratin, DNM2, EFNB, ENaC, Fgf, Fgfr, FGFR1, FGFR2, FGFR3, FLT4, FRS2, Gap, GHR, GP1BA, GPIIB-IIIA, growth factor receptor, Hspg, HSPG2, IRS1/2, NCK, NTRK1, NTRK3, Ntrk dimer, Pdgfr, PI3K (complex), PI3K p85, PLC gamma, PLCG1, PTPN11, RPS6KA, SCT, Vla-4 | 17 | 14 | Cancer, Cell Death and Survival, Cellular Function and Maintenance
16 | ABRACL, AFF2, ANKRD11, BCL6, C9orf78, CHM, CPS1, CUTA, CWC27, DYSF, GRB2, KCTD3, LSM5, LSM6, LSM12, MT-ATP8, NOS2, PANK2, POU2F3, PRPF8, RNU2-1, RNU4-1, RNU5A-1, RNU6-1, SCARF2, SLC17A5, SNRNP25, SPSB4, TDRD7, TESPA1, TTYH2, UBC, XIRP2, ZNF443, ZNF609 | 17 | 14 | Cancer, Organismal Injury and Abnormalities, Reproductive System Disease
17 | Adaptor protein 2, ADCY, ADRB, AIM1, Ap2 alpha, ASCL3, BDNF, BIN1, Caveolin, Ck2, Clathrin, Creb, DNA-methyltransferase, Dynamin, GLB1, Gm-csf, Go-coupled receptor, GTPase, Hdac, mGLUR Group I, MITF, MITF-p300/CBP, NF1, NTF3, Pias, PKHD1, Ppp2c, Ras, RET, Rsk, SIM2, SLC26A2, Syntaxin, TCF, TCF4 | 15 | 13 | Cell Death and Survival, Cellular Development, Cellular Growth and Proliferation
18 | Actin, calpain, CD3, Cg, CNTN4, Cel, DPY19L3, E130116L18Rik, ERBB4, FAM111A, FSH, GJA1, HIAT1, Histone h3, I kappa b kinase, Ikb, IKBKG, IKK (complex), Integrin, MAPT, mir-223, MTORC1, NLRP3, PRNP, PSEN1, PTEN, RB1, RORA, STAT, STEAP1, STOX1, SUN5, TCR, Tnf receptor, ZNF211 | 14 | 14 | Cell Morphology, Cellular Assembly and Organization, Cellular Development
19 | AUTS2, AVEN, BCL2, C1q, CTNNA3, CTSC, Ifn, IFN alpha/beta, IFN Beta, IFN type 1, Iga, Ige, IgG, IgG1, Igg3, Igm, IL-2R, IL12 (complex), IL12 (family), IL2RG, Immunoglobulin, Interferon alpha, ITM2B, Ldh (complex), LRIG1, mediator, MHC Class II (complex), MHC II, MIR101, PAX5, PLP1, snRNP, STAT5a/b, Tlr, TREX1 | 12 | 11 | Cellular Growth and Proliferation, Lymphoid Tissue Structure and Development, Organ Morphology
20 | ALG6, Basp1, CCND1, CD96, COQ2, DAG1, DONSON, DPH1, EPB41L4B, ESCO2, EVC, FKRP, GLI1, GPC6, H2AFY2, HACL1, HRAS, IMPA2, MACROD2, PKD1, POP1, POP4, PPCS, PRELID1, PTTG1IP, RAB23, RASSF6, RMRP, SFXN3, SLC13A3, TBC1D24, TENM3, THG1L, UBC, ZNF711 | 12 | 11 | Cancer, Developmental Disorder, Cellular Growth and Proliferation
21 | ALPL, BCR (complex), Calcineurin A, Calcineurin protein(s), EYA1, EYA4, Fcer1, Gsk3, HOXB1, JAK, Lh, MAP2K1/2, MEF2, MEF2A, NFAT (complex), Nfat (family), Pdgf (complex), phosphatase, PI3K (family), Pka, Ptk, PTPase, PTPN1, PTPRC, Raf, Shc, SHP, Sod, Sos, SOS1, SYK/ZAP, THTPA, TRPV4, tyrosine kinase, WAS | 10 | 11 | Post-Translational Modification, Organismal Survival, Organismal Development
22 | AASS, ALDOC, ATP2B2, BAIAP2, CIT, CMC1, CNP, DLG4, EPB41L3, GRID2, Grik, GRIK2, GRIK5, GRIN2C, GRIN2D, HTT, HUNK, KCNA1, KCNAB1, KCNJ2, KCNJ4, KCNJ12, MAP3K10, MBTPS1, NLGN4X, NRXN1, PCLO, PLP1, PRELP, SDHA, SFXN3, SRGAP3, STX1B, STXBP1, TYRO3 | 8 | 9 | Cell-To-Cell Signaling and Interaction, Nervous System Development and Function, Cancer
23 | Alp, Alpha catenin, ALT, AMPK, C/ebp, Collagen Alpha1, Collagen type I, Collagen type IV, creatine kinase, CYP, Fibrinogen, Focal adhesion kinase, Growth hormone, HDL, HDL-cholesterol, hemoglobin, HGF, Ifn gamma, IL1, JINK1/2, Laminin, LDL, MIF, MTHFR, Nos, NOS3, Pro-inflammatory Cytokine, Rock, RUNX2, SCARF2, Smad, SMPD1, Tgf beta, Tnf (family), VDR | 8 | 8 | Cell Death and Survival, Cellular Growth and Proliferation, Tissue Development
24 | Alpha Actinin, Angiotensin II receptor type 1, AVP, AVPR1A, Beta Arrestin, Calmodulin, CASR, chemokine, EDNRB, Endothelin, G protein, G protein alpha, G protein alphai, G protein beta gamma, G-protein beta, GNAQ, GNRH, Gpcr, IgG2a, IgG2b, LHCGR, Metalloprotease, Mmp, NMDA Receptor, OPN1LW, OXTR, PLC, Pld, Rac, Ras homolog, Relaxin, Sapk, Sfk, Trk Receptor, tubulin (complex) | 5 | 8 | Carbohydrate Metabolism, Molecular Transport, Small Molecule Biochemistry
25 | ERCC6L2, NEK6 | 2 | 1 | Cell Cycle, Cellular Movement, Cell Morphology

1. IPA®, QIAGEN Redwood City, http://www.qiagen.com/ingenuity