Integrative Analysis of Heterogeneous Genomic Datasets to
Discover Genetic Etiology of Autism Spectrum Disorders
by
Sumaiya Nazeen
B.Sc. in Computer Science and Engineering, Bangladesh University of Engineering
and Technology (2011)
Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of
Master of Science in Electrical Engineering and Computer Science
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
September 2014
© Massachusetts Institute of Technology 2014. All rights reserved.
Author: Signature redacted
Department of Electrical Engineering and Computer Science
August 28, 2014
Certified by: Signature redacted
Bonnie A. Berger
Professor of Applied Mathematics and Computer Science
Thesis Supervisor
Accepted by: Signature redacted
Leslie A. Kolodziejski
Professor of Electrical Engineering
Chair, Department Committee on Graduate Students
Integrative Analysis of Heterogeneous Genomic Datasets to Discover
Genetic Etiology of Autism Spectrum Disorders
by
Sumaiya Nazeen
Submitted to the Department of Electrical Engineering and Computer Science
on August 28, 2014, in partial fulfillment of the
requirements for the degree of
Master of Science in Electrical Engineering and Computer Science
Abstract
Understanding the genetic background of complex diseases is crucial to medical research,
with implications for diagnosis, treatment, and drug development. As molecular approaches
to this challenge are time consuming and costly, computational approaches offer an efficient
alternative. Such approaches aim at predicting and prioritizing genes for a particular disease
of interest. State-of-the-art gene prediction and prioritization methods rely on the observation that disease-causing genes have some sort of functional similarity based on either
sequence, phenotype, protein-protein interaction (PPI) network, or functional annotation.
Another increasingly accepted view is that human diseases result from perturbations of
molecular networks, and genes causing the same or similar diseases tend to be close to one
another in molecular networks. Such observations have built the basis for a large collection
of computational approaches to find previously unknown genes associated with certain diseases. The majority of the methods are designed based on protein interactome networks,
with integration of other large-scale omics data, to infer how likely it is that a gene is
associated with a disease.
In this thesis, we set out to address this outstanding challenge of understanding the
genetic etiology of autism spectrum disorder (ASD), which refers to a group of complex
neurodevelopmental disorders sharing the common feature of dysfunctional reciprocal social interaction. We introduce three novel methods for computing how likely a given gene
is to be involved in ASDs based on copy number variations (CNVs), phenotype similarity, and protein interactome network topology. We also customize a random walk with
restarts algorithm for ASD gene prioritization for the first time. Finally, we provide a novel
integrative approach for combining CNV, phenotype similarity, and topology-related information with existing knowledge from literature. Our integrative approach outperforms the
individual schemes in identifying and ranking ASD related genes. Our candidate gene set
provides a number of interesting biological insights in that it is overrepresented in a number
of interesting signaling, cell-adhesion and neurological pathways, molecular functions, and
biological processes that are worth further investigation in connection with ASDs. We also
find evidence for an interesting connection between gastrointestinal disorders, particularly
inflammatory bowel diseases (IBD), and ASDs. The subnetworks we identify suggest that
subclasses of disorders may exist along the autism spectrum.
Thesis Supervisor: Bonnie A. Berger
Title: Professor of Applied Mathematics and Computer Science
Acknowledgments
This thesis owes its existence to Professor Bonnie Berger. It has been an amazing experience
to work with her. She has been an excellent source of encouragement and inspiration to me.
She has been incredibly patient with me and always put my personal growth as a researcher
first. I cannot thank her more for teaching me how to approach the process of learning and
research.
I am indebted to Dr. Rohit Singh for his constant help, advice, support, and mentorship
in all aspects of my thesis. This work would not have been possible without his invaluable
advice and support. I remember countless meetings with him in which I walked in frustrated,
yet walked out encouraged and excited again. I'd like to thank Rohit for his warm support
and patience in teaching me how to face the moments when progress seems slow.
I would like to thank Professor Isaac Kohane, Dr. Nathan Palmer, and Dr. Finale Doshi-Velez for sharing their knowledge of autism spectrum disorders with me. Many thanks to
the members of Berger lab for sharing my exciting as well as frustrating moments. I'd like to
thank Patrice for brightening my days with her warm greetings. I am grateful to George,
Hoon, Sean, and Christina for having discussions with me and encouraging me along in my
research. Thanks to Andrew, Deniz, Jian, Noah, and William for being there whenever I
needed help.
I owe my gratitude to the Bangladeshi Students Association at MIT, which has become
my family in Boston. As always, I am ever grateful to my parents and siblings for their love
and constant support. Finally, I express my utmost gratitude to my greatest supporter: to
the Almighty Allah, who has bestowed good health upon me, kept me free from anxiety,
and filled my everyday with joy and hope.
Contents
Abstract
Acknowledgments
List of Figures
List of Tables
1 Introduction
  1.1 Motivation
  1.2 State of the art
    1.2.1 General Trends in Disease Gene Prediction
    1.2.2 Computational Advances in ASD Gene Prediction
  1.3 Contributions
  1.4 Outline
2 Predicting and Prioritizing Candidate Genes for ASD
  2.1 CNV Information Entropy based Prioritizer
    2.1.1 Copy Number Variation (CNV)
    2.1.2 Copy Number Variants in ASD
    2.1.3 Calculating Information Entropy Score from CNVs
    2.1.4 Quality of CNV Information Entropy based Prioritization
  2.2 ASD Similarity based Prioritizer
    2.2.1 Similarity of Phenotypes or Diseases
    2.2.2 Gene-Phenotype Association Data
    2.2.3 Calculating ASD Similarity Scores
    2.2.4 Performance of ASD Similarity based Prioritizer
  2.3 Diffusion State ASD Proximity based Prioritizer
    2.3.1 Diffusion State Distance (DSD) in PPI Network
    2.3.2 Calculating Diffusion State ASD Proximity (DSAP) of Genes
    2.3.3 Quality of DSAP-based Ranking
  2.4 Network Crosstalk based Prioritizer
    2.4.1 Motivation
    2.4.2 Problem Formulation
    2.4.3 Calculating Network Crosstalk Scores
    2.4.4 Dealing with Statistical Bias
    2.4.5 Performance of Network Crosstalk based Prioritizer
3 Integrative Approach for Identifying ASD Risk Genes
  3.1 Background
    3.1.1 Lasso-penalized Logistic Regression
  3.2 Predicting ASD Association via Logistic Regression based Integrative Approach
    3.2.1 Preparing Data for Training and Validation
    3.2.2 Constructing Lasso-regularized Binomial Regression Model
    3.2.3 Selecting Model Coefficients
    3.2.4 Creating Regularized Model and Making Predictions
  3.3 Performance Analysis
4 ASD Genetics: Implications from Candidate ASD Risk Genes
  4.1 Gene Sets for Analysis
  4.2 Hypergeometric Test for Enrichment
  4.3 Pathway Enrichment Analysis
    4.3.1 An Interesting Connection with Inflammatory Bowel Disease (IBD)
  4.4 Enrichment Analysis on GO gene sets
  4.5 Enrichment Analysis for Subnetworks
  4.6 Functional Analysis for Overlap with Diseases and Bio-functions
5 Conclusion
Appendix A SFARI Genes for Autism Spectrum Disorders
Appendix B Risk Genes for ASDs Identified by Integrative Approach
Appendix C Subnetworks in ASD Risk Gene Set
Bibliography
List of Figures
2-1 Copy number variations in a pair of chromosomes.
2-2 Steps in CNV-based prediction-prioritization of ASD genes.
2-3 Receiver operating characteristic curves for CNV-based prioritizer using different scaling factors.
2-4 Lift chart for CNV-based prioritizer.
2-5 Receiver operating characteristic curve for ASD similarity based prioritizer.
2-6 Lift chart for ASD similarity based prioritizer.
2-7 Receiver operating characteristic curve for Diffusion State ASD Proximity (DSAP) based prioritizer.
2-8 Lift chart for Diffusion State ASD Proximity (DSAP) based prioritizer.
2-9 Receiver operating characteristic curves for network crosstalk based prioritizer using different restart probabilities (r).
2-10 Lift chart for network crosstalk based prioritizer.
3-1 Performance curves for integrative approach on training data.
3-2 Receiver operating characteristic curves for different ASD gene prediction-prioritization methods.
3-3 Lift chart of integrative approach for ASD gene prediction-prioritization.
4-1 Significant GO biological processes associated with ASD risk gene set.
4-2 Significant GO molecular functions associated with ASD risk gene set.
4-3 Top four subnetworks in ASD risk gene set generated by QIAGEN's Ingenuity® Pathway Analysis (IPA).
List of Tables
1.1 Summary of general trends in disease gene prediction-prioritization methods.
3.1 Selected regression coefficients for the integrative approach from logistic regression in order of predictive value.
3.2 Selected logistic regression coefficients for integrating different ASD association scores in order of predictive value.
3.3 Selected logistic regression coefficients for integrating ASD-pathway membership information with weights in order of predictive value.
4.1 Canonical pathways having significant overlap with ASD risk genes.
4.2 IBD-related pathways having significant overlap with ASD risk genes.
4.3 Top 10 diseases having significant overlap with ASD risk genes found by QIAGEN's Ingenuity® Pathway Analysis (IPA).
4.4 Top 30 functions having significant overlap with ASD risk genes found by QIAGEN's Ingenuity® Pathway Analysis (IPA).
A.1 ASD risk genes reported by SFARI gene module.
B.1 Probabilities of association with ASDs for candidate genes identified by our integrative analysis approach.
C.1 Subnetworks in ASD risk gene set generated by QIAGEN's Ingenuity® Pathway Analysis (IPA).
Chapter 1
Introduction
1.1 Motivation
Identifying disease-causing genes is a fundamental challenge in human health with applications in understanding disease mechanisms, diagnosis, and therapy. Many approaches have
been adopted for discovery of candidate genes to date [124]. Traditional genetic mapping
methods include linkage analysis and genome-wide association studies (GWAS) of Mendelian
diseases and complex traits. While GWAS are powerful and effective, they face challenges in
narrowing down long lists of candidate genes [5]. Furthermore, diseases often do not follow
the simple genotype-phenotype model, but are rather the consequences of perturbations of
multiple genes connected in a molecular network, induced by various factors such as genetic
mutations, epigenetic changes, and pathogens [114]. Efforts towards discovering the properties of disease genes in molecular networks have shown that genes associated with the same
or similar diseases tend to have some degree of functional similarity. Such similarity can be
based on sequence [36], functional annotation [89], protein-protein interactions [34,56,85],
etc. [84]. These findings became the basis for the development of computational approaches
for predicting and prioritizing disease genes. While traditional disease-causing gene identification methods are time-consuming and costly, these computational approaches offer an
efficient alternative.
Autism spectrum disorder (ASD) refers to a group of neurodevelopmental disorders
defined by three categories of deficits: abnormal development or impairment of social interaction, abnormal development or impairment of communication skills, and stereotypic
and repetitive behaviors [9]. Recent estimates show that ASDs are prevalent in 0.75% to
1% of the population [33,53, 54]. Among the conditions encompassed by ASDs, pervasive
developmental disorder-not otherwise specified (PDD-NOS) and autistic disorder are the
most common, whereas Asperger syndrome appears less frequently. ASD is almost five times
more common among boys (1 in 42) than among girls (1 in 189) [28], an effect that becomes
even more pronounced in so-called high-functioning cases. Before the 1970s, autism was not
widely appreciated to have a strong genetic basis. Instead, various psychodynamic interpretations, including the role of a cold style of mothering, were considered as potential causes.
The importance of genetic contributions came to light in the 1980s, when the co-occurrence
of chromosomal disorders and rare syndromes with ASDs was identified [16]. Subsequent
twin and family studies provided support for a strong genetic component, but lack of uniform diagnostic criteria limited the power of those studies. The development of validated
diagnostic and assessment tools such as ADI-R and ADOS for ASDs in the 1990s has proven crucial
to the advancement of ASD research, and since then the diagnosis of ASDs has been gaining
in frequency. These tools, in concert with important technological advances, have made it possible to carry out a range of studies, such as candidate gene association studies, resequencing
studies, and genome-wide assessment of copy number variations (CNVs). This ability has led
to the identification of a large number of autism susceptibility genes and increased attention
to the effects of de novo and inherited CNVs, thus supporting the notion that genetic factors
are a predominant cause of ASDs. Moreover, higher ASD concordance rates in monozygotic
twins (36-95%) compared to dizygotic twins (0-31%) [40,95,96, 108] and increased risk (at
least 2-18%) in families with a history of related disorders [46,86,106] also suggest a strong
genetic component behind ASD. However, genetic studies have been able to connect only
1-2% of autism cases to individual mutations in the autism susceptibility genes and loci,
and about 20% of cases to their combined effect [2].
One difficulty in studying genetic causes of ASDs is that different conditions are caused
by different genetic mutations. In addition, since a condition is caused by a combined effect of many mutations, the individual effects of each mutation are often small and thus
hard to detect. An additional difficulty in studying ASD relates to its heterogeneous nature. Specifically, the ASD population exhibits a wide range of conditions characterized by
impairments in reciprocal social interaction and communication, as well as restricted and
repetitive behaviors. Although some common pathways related to ASD have been identified [21,91,98], this heterogeneity makes genetic analysis of ASDs challenging. Furthermore, small
sample sizes in studies limit their statistical power in most cases. Thus, to comprehensively
identify risk genes and molecular pathways in ASDs, we need to perform either molecular
analysis with substantially larger sample sizes, stratifying patients into more homogeneous
groups by diagnostic criteria, sex, or family history, or more sophisticated computational
analysis.
Towards understanding the genetics of ASDs over the past two decades, researchers have
mainly focused on linkage studies, genome-wide association studies, and microarray gene
expression studies. Linkage studies aim at finding out the rough location of a disease gene
relative to another DNA sequence called a genetic marker, whose position is already
known.
Affected families are genotyped using a collection of genetic markers across the
genome, and how those genetic markers segregate with the disease across multiple families
is examined. Most autism-related linkage studies have identified linkage regions reaching the
threshold of suggestive linkage at best [35]. Loci on most chromosomes have been suggested
to harbor ASD risk, but only a few of them have been independently identified. To date, only
loci 7q22-23 [80,81] and 17q11-21 [22,104,123] have been replicated and considered significant
on a genome-wide scale. Currently, there are over 25 different loci that may be considered to
contain autism susceptibility candidate genes (ASCG), and many more complicated loci are
under observation [2]. The lack of genome-wide significant results in most published linkage
studies is a consequence of small sample sizes.
Thus the establishment of collaborative
groups, such as the International Molecular Genetic Study of Autism Consortium (IMGSAC)
and Autism Genome Project (AGP) Consortium [80,107], and shared resources, such as the
Autism Genetic Research Exchange (AGRE) Consortium [37] have become important steps
in facilitating the identification of ASD candidate genes [14].
Unlike linkage studies, genome-wide association studies examine many common genetic
variants in different individuals (either in case-control groups or within families) to see if any
variant is associated with a disease. Association studies have identified a good number of
genome-wide significant chromosomal variations, including CNVs (copy number variations
- presence of variable number of copies of a particular gene in the genotype of an individual
compared to a reference genome), which play an important role in the etiology of ASD [101].
De novo CNVs, hypothesized to be ASD-specific, have been found in up to 7-10% of sporadic
ASD [14,74]. To date more than two thousand CNV loci, harboring both rare and common
variants [7,20, 29,62,83,90,116], have been identified in more than three hundred studies
implicating a very large number of candidate genes. The challenge for CNV studies is to narrow
down this list of candidate genes.
Besides linkage and association studies, microarray gene expression studies are also being
conducted to provide important insights into genes and pathways that might be dysregulated
across ASDs [12,15, 39,41,78] and within individual subtypes of pervasive developmental
disorders.
Gene expression studies measure the activity (i.e., expression) of thousands of
genes across the genome at once, to create a global picture of a specific cellular function or
disease. But these studies often suffer from the problem of small sample sizes, and probe
and platform specific artifacts [100].
However, the availability of vast collections of omics data from all these different types of
studies motivates the development of sophisticated computational approaches to extract knowledge
that will help us better understand the biological underpinnings of ASD. This goal is further
motivated by the recent successes of using computational methods in detecting and ranking
causal genes for various complex diseases, including glioblastoma multiforme (GBM) [52],
pancreatic cancer [120], type 2 diabetes [111], and so on.
1.2 State of the art
In this section, we provide a brief overview of the computational methods currently available
for predicting and prioritizing genes for diseases in general and ASDs in particular. As the
challenge of predicting and prioritizing disease-causing genes is central to human health
research, a large collection of computational methods have been developed to solve the
general problem. The vast majority of these approaches are based on the human protein-protein interaction (PPI) network. We describe the main themes of these approaches as
well as some representative methods. Researchers have also started to design
methods specifically for the problem of ASD gene prediction and prioritization. We discuss the
most recent work here.
1.2.1 General Trends in Disease Gene Prediction
General trends in designing computational methods for disease gene prediction-prioritization
can be grouped loosely under four categories as discussed below (Table 1.1).
Methods Based on Protein Proximity in PPI Networks
Many of the current approaches for disease gene prioritization are based on the proximity of
candidate genes to known disease genes within interactome networks using different scoring
schemes. The intuition behind this is the 'guilt-by-association' hypothesis, which suggests
that genes that are physically or functionally close to each other tend to be involved in the
same biological pathways and have similar phenotypic effects [4,82]. Thus a key step in
these approaches is to measure the distance between candidate genes and known disease
genes in the PPI network. Approaches to measure proximity of elements in the PPI network
are based on direct neighborhood, shortest path length, diffusion kernel, random walk with
restart, propagation flow, etc. [117].
Oti et al. [85] predicted disease-causing genes in known disease loci by counting the
number of known causative genes that are direct network neighbors (Table 1.1). The authors achieved approximately 10-fold enrichment by comparing their candidates to a random
selection of candidate genes at the same locus. Krauthammer et al. [57] assigned known disease genes as seed nodes and computed the shortest path length between these and other
nodes in the network. A node that has close proximity to multiple seed nodes receives a
higher score as a candidate disease gene. However, Köhler et al. [56] demonstrated that the
closeness of two proteins cannot be fully captured by their shortest path length. Different
network structures surrounding two proteins imply different degrees of closeness between
them. This can be captured by global distance measures, such as random walk with restarts
and similarity-based diffusion kernel, by allowing equal probability of each protein diffusing
along the links of the PPI network. The authors tested 783 genes under 110 disease families
and achieved an area under the Receiver Operating Characteristic (ROC) curve up to 98%
on simulated linkage intervals containing 100 genes. Navlakha and Kingsford [75] compared
the performance of disease gene prediction using different proximity measurements including
network neighbors, random walk with restarts, propagation flow, unsupervised graph partitioning, Markov clustering, or semi-supervised graph partitioning. They reported random
walk with restarts to give the best performance in terms of precision and recall. They also
proposed a consensus method combining all closeness measures, which could capture different topological properties of the PPIs and yielded better performance than the individual
measures.
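
To make the idea of a global proximity measure concrete, the following is a minimal sketch of random walk with restarts on a toy undirected network. It is an illustration only, not the implementation used in any of the cited studies: the function name, the toy network, the seed gene, and the restart probability of 0.3 are all assumptions chosen for the example.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Return the steady-state visiting probability of every node.

    adjacency: dict mapping each node to the set of its neighbors (undirected).
    seeds: set of known disease genes; every seed must be a key of adjacency.
    restart: probability of jumping back to a seed at each step (assumed value).
    """
    nodes = sorted(adjacency)
    start = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    prob = dict(start)
    for _ in range(max_iter):
        # With probability `restart` the walker returns to the seed set.
        nxt = {n: restart * start[n] for n in nodes}
        for u in nodes:
            neighbors = adjacency[u]
            if not neighbors:
                # An isolated node keeps its own walking mass.
                nxt[u] += (1.0 - restart) * prob[u]
                continue
            # Otherwise the walker moves to a uniformly chosen neighbor.
            share = (1.0 - restart) * prob[u] / len(neighbors)
            for v in neighbors:
                nxt[v] += share
        change = sum(abs(nxt[n] - prob[n]) for n in nodes)
        prob = nxt
        if change < tol:
            break
    return prob


# Hypothetical toy network: a chain A - B - C - D - E with A as the known disease gene.
toy_network = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B", "D"},
    "D": {"C", "E"},
    "E": {"D"},
}
scores = random_walk_with_restart(toy_network, seeds={"A"})
# Candidates (non-seed genes) ranked by decreasing proximity to the seed.
ranking = sorted((g for g in toy_network if g != "A"), key=scores.get, reverse=True)
```

In this sketch the steady-state probability plays the role of the proximity score: candidates that the walker visits more often are ranked higher. Unlike the shortest path length, the score depends on all paths between a candidate and the seeds, which is the property the global measures above exploit.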
Methods Integrating Large-scale Genomic Data
In addition to being proximal in the interactome network, disease genes are assumed to
share common features in gene ontology annotations, gene expression, protein sequences,
and domains and are likely to be involved in similar biological and functional pathways [38].
Thus, a number of computational methods have been designed to integrate genomic data
from multiple sources to achieve better performance [45].
Endeavour, a prioritization algorithm through genomic data fusion, integrates functional
annotations, microarray expression, expressed sequence tag (EST) expression, literature,
protein domains, PPIs, pathway membership, cis-regulatory modules, transcriptional motifs,
sequence similarity, and user-data and ranks the candidate genes based on their similarity
to known disease genes for each of these features [3]. A global ranking to prioritize candidate genes is generated by combining the ranks of individual features using order statistics.
Prioritizer, a Bayesian classifier based tool, consolidates data from different sources, such
as gene ontology, gene expression, and PPIs, onto functional networks [34]. It assesses the closeness
in the functional network of a candidate gene in one susceptible locus to genes residing in
another locus, and assigns a higher score for a shorter distance. Prioritizer
achieves 2.8-fold enrichment compared to random selection. While at least two susceptible
loci are desired by Prioritizer, Linghu et al. [65] performed genome-wide prioritization by
constructing an evidence-weighted functional linkage network of 21657 genes based on 16
data sources using a naive Bayes classifier. Candidate genes were assigned a score based on
the sum of the weights of the network links to known disease genes. The method was able to
achieve a 62% success rate on monogenic, polygenic, and cancer disease families, which was
a marked improvement over the 44% success rate achieved by PPI network-only methods,
confirming the importance of data integration in prioritizing disease genes.
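
The scoring step described for the functional linkage network, summing the weights of the links between a candidate and known disease genes, can be sketched as follows. This is a simplified illustration under assumptions: the gene names and link weights are hypothetical, and the construction of the weights themselves (the naive Bayes integration of 16 data sources) is not shown.

```python
def score_by_link_weights(weighted_links, known_disease_genes):
    """Score each candidate by the summed weight of its links to known disease genes.

    weighted_links: dict mapping an undirected gene pair (g1, g2) to a link weight.
    known_disease_genes: set of genes already associated with the disease.
    """
    scores = {}
    for (g1, g2), weight in weighted_links.items():
        # Check both orientations because the links are undirected.
        for candidate, other in ((g1, g2), (g2, g1)):
            if other in known_disease_genes and candidate not in known_disease_genes:
                scores[candidate] = scores.get(candidate, 0.0) + weight
    return scores


# Hypothetical evidence-weighted links; D1 and D2 stand for known disease genes.
links = {
    ("G1", "D1"): 0.8,
    ("G1", "D2"): 0.5,
    ("G2", "D1"): 0.3,
    ("G2", "G3"): 0.9,  # a link between two candidates contributes nothing
}
link_scores = score_by_link_weights(links, known_disease_genes={"D1", "D2"})
link_ranking = sorted(link_scores, key=link_scores.get, reverse=True)
```

Here G1 outranks G2 because it is linked to two known disease genes with larger weights, while G3 receives no score at all since it has no direct link to a known disease gene.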
Methods Integrating Phenotype Similarity
Diseases with similar phenotypes often share either a common set of underlying genes or
functionally related genes [38]. Several studies reported that the integration of disease phenotype networks and PPI networks outperforms other approaches in the gene prioritization
task [24,36,58,63,113,121,122].
Wu et al. [121] used a simple linear regression method called
CIPHER (Correlating protein Interaction network and PHEnotype network to pRedict disease
genes) to model the correlation between the phenotype similarity profile and closeness profile
in the PPI network. The algorithm used the phenotype similarity data from van Driel et
al.'s [112] text mining results along with curated PPIs from the Human Protein Reference
Database (HPRD), Biomolecular Interaction Network Database (BIND), Molecular Interaction Database (MINT), and Online Predicted Human Interaction Database (OPHID). For
each disease-gene pair, it calculated the Pearson correlation coefficient between the disease
similarity profile and the gene closeness profile, and recorded it as a concordance score
representing the association of the gene with the disease. CIPHER's performance was shown to be
reliable and comparable to Endeavour.
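
The concordance score described above can be illustrated with a short sketch that computes the Pearson correlation between a phenotype similarity profile and each candidate gene's closeness profile. The profiles below are made-up numbers, and how the closeness values are derived from the PPI network is deliberately left out, so this should be read as an assumption-laden illustration rather than the CIPHER implementation.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no correlation information
    return cov / math.sqrt(var_x * var_y)


def concordance_scores(phenotype_similarity, gene_closeness):
    """Correlate the query disease's similarity profile with each gene's closeness profile.

    phenotype_similarity: similarity of the query disease to each reference phenotype.
    gene_closeness: dict mapping a gene to its closeness to the genes of each
                    reference phenotype, listed in the same phenotype order.
    """
    return {gene: pearson(phenotype_similarity, profile)
            for gene, profile in gene_closeness.items()}


# Hypothetical profiles over four reference phenotypes.
similarity_profile = [1.0, 0.8, 0.2, 0.1]
closeness_profiles = {
    "geneA": [0.9, 0.6, 0.3, 0.0],  # close to genes of the similar phenotypes
    "geneB": [0.1, 0.2, 0.8, 0.9],  # close to genes of the dissimilar phenotypes
}
concordance = concordance_scores(similarity_profile, closeness_profiles)
best_candidate = max(concordance, key=concordance.get)
```

A gene whose network closeness rises and falls with phenotype similarity (geneA here) obtains a concordance score near 1, whereas a gene with the opposite pattern (geneB) obtains a negative score and is ranked lower.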
Based on the same phenotype similarity metric computed by van Driel et al. [112],
Vanunu et al. [113] developed a slightly different method named PRINCE (PRIoritizatioN
and Complex Elucidation). They calculated association between a query disease and a gene
with a known disease association using a logistic function dependent on the phenotype
similarity between the query disease and the known disease. This disease-gene association
was then used as prior knowledge in the prioritization function and iteratively smoothed
over the network using propagation flow. In prioritizing genes for 1369 diseases with a
known causal gene, PRINCE outperformed CIPHER by approximately 10%
in ranking the true disease gene as the top-scoring one. Li and Patra [63] constructed
a heterogeneous network by integrating the PPI network and phenotype network based on
disease-gene relationships in the Online Mendelian Inheritance in Man (OMIM) database [1].
They developed an algorithm, RWRH (Random Walk with Restart on Heterogeneous network),
which extends the random walk with restart algorithm from the PPI network alone to the entire
heterogeneous network of PPIs and phenotypes. The authors reported that RWRH was superior
to CIPHER in prioritizing disease genes under three circumstances: known disease genes and
genetic loci, known disease genes but no known genetic loci, and no known disease genes or
loci.
Table 1.1: Summary of the trends in disease gene prediction-prioritization methods.

Category: Methods based on protein proximity in PPI network

Reference: Oti et al. [85]
Method name: -
Features: PPI
PPI data source: HPRD, human Y2H, fly, worm, yeast
Proximity measurement of elements in networks: Direct neighbor
Prediction methods: Predict a candidate gene as disease gene if it directly interacts with a known disease gene and resides in a known disease locus lacking identified disease genes.

Reference: Köhler et al. [56]
Method name: -
Features: PPI
PPI data source: HPRD, BIND, BioGrid, IntACT, DIP [115], STRING [97], mapped from worm, mouse, fruit-fly, and yeast
Proximity measurement of elements in networks: Direct neighbor, shortest path length, diffusion kernel, random walk with restart
Prediction methods: Rank candidate genes based on the proximity scores to known disease genes.

Reference: Navlakha and Kingsford [75]
Method name: -
Features: PPI
PPI data source: HPRD, OPHID
Proximity measurement of elements in networks: Random walk with restart, propagation flow, direct neighbor, clustering, graph partitioning, Markov clustering and their variants
Prediction methods: (i) Predict a gene as disease gene if it is located in a locus known to be associated with the disease and the network measurement score is above a threshold; (ii) Combine all 13 closeness measurements for the ensemble decision trees using a random forest classifier.

Category: Methods integrating large-scale genomic data

Reference: Aerts et al. [3]
Method name: Endeavour
Features: PPI, GO, EXP (microarray and EST), KEGG, TXT, PDS, TOUCAN, TRANSFAC, SEQ, and others
PPI data source: BIND
Proximity measurement of elements in networks: Direct neighbor
Prediction methods: Rank each candidate gene based on their similarity to known disease genes for each feature, then combine the ranks using order statistics to obtain final rank.

Reference: Franke et al. [34]
Method name: Prioritizer
Features: PPI, GO, EXP
PPI data source: BIND, HPRD, and large-scale experiments
Proximity measurement of elements in networks: Shortest path length in the integrated network
Prediction methods: Use a Bayesian classifier to build integrative networks, then score candidate genes based on distance to known disease genes using Gaussian kernel scoring function.

Reference: Radivojac et al. [92]
Method name: PhenoPred
Features: PPI, GO, Structure, SEQ, DO
PPI data source: HPRD, OPHID
Proximity measurement of elements in networks: Shortest path length
Prediction methods: Employ support vector machines for prediction.

Reference: Linghu et al. [65]
Method name: -
Features: Curated PPI, Y2H, Masspec, DDI, EXP, TXT, PDS, PG, GN, GO (molecular function and cellular component)
PPI data source: Curated PPIs are from HPRD, BIND, BioGrid, IntACT, MIPS [71], DIP, MINT [64], STRING, yeast, worm, fly, mouse-rat
Proximity measurement of elements in networks: Direct neighbor
Prediction methods: Use a Bayesian classifier to construct a weighted functional linkage network through integrating large-scale genomic data. The weights of the links to known disease genes are summed to score the candidate genes.
Table 1.1: Continued.

Category: Methods integrating disease phenotype information

Reference: Karni et al. [49]
Name: -
Features: PPI, EXP under disease conditions
PPI data source: HPRD, Y2H
Proximity measurement of elements in networks: Shortest path length
Prediction methods: Find smallest set of genes that cover the disease related genes using the Maximum expectation gene cover algorithm.

Reference: Lage et al. [58]
Name: -
Features: PPI
PPI data source: MINT, BIND, IntAct, Pprel [47,48], Ecrel [47,48], Reactome [27,72]
Proximity measurement of elements in networks: Direct neighbor (virtual pull down) counting in the evidence-weighted PPI network
Prediction methods: Construct candidate complexes by virtual pull down, then score the candidate gene by measuring the similarity between phenotype caused by the genes in the complex to the disease phenotype.

Reference: Wu et al. [121]
Name: CIPHER
Features: PPI
PPI data source: HPRD, OPHID, BIND, MINT
Proximity measurement of elements in networks: Direct neighbor, shortest path length
Prediction methods: Use correlation coefficients as concordance score for each candidate gene based on linear regression of phenotype profile and PPI profile.

Reference: Wu et al. [122]
Name: AlignPI
Features: PPI
PPI data source: HPRD
Proximity measurement of elements in networks: -
Prediction methods: Use NetworkBlast algorithm to align the PPI and phenotype networks and obtain high scoring sub-networks. Candidate genes are first assumed to be disease-associated and used in constructing sub-networks. The candidate gene with the highest scoring sub-network is taken as a positive prediction.

Reference: Care et al. [24]
Name: -
Features: PPI, SNPs
PPI data source: BioGrid, BIND, IntAct, MINT
Proximity measurement of elements in networks: Direct neighbor
Prediction methods: Predict deleterious SNPs using random forest, then predict disease genes using the same learning approach based on PPI networks, phenotype similarities and deleterious SNPs.

Reference: Li and Patra [63]
Name: RWRH
Features: PPI
PPI data source: HPRD
Proximity measurement of elements in networks: Random walk with restart on heterogeneous network
Prediction methods: Use RWRH to score genes and diseases simultaneously by allowing random walker jump between PPI and phenotype networks.

Reference: Vanunu et al. [113]
Name: PRINCE
Features: PPI
PPI data source: HPRD and large scale experiments
Proximity measurement of elements in networks: Network propagation flow in the evidence weighted PPI network
Prediction methods: Use network propagation method to smooth flow in the PPI network and then use the final converged flow as scores for candidate genes.
Table 1.1: Continued.

Category: Disease-module based methods

Reference: George et al. [36]
Name: Gentrepid
Features: Domain comparison, pathways, PPI
PPI data source: OPHID
Proximity measurement of elements in networks: Common pathway, similarity of protein domains
Prediction methods: Combines two methods - common pathway scanning (CPS) and common module profiling for the automated prediction of disease genes within known disease intervals. First, known disease genes are used to predict novel disease genes in chromosomal intervals associated with the same disease. Second, without knowledge of the disease genes, candidate disease genes are predicted by comparing all the genes in the multiple intervals associated with the same disease to find common pathways or shared modules between proteins linking the intervals.

Reference: Chen et al. [26]
Name: -
Features: co-expression network, disease-specific QTL
PPI data source: -
Proximity measurement of elements in networks: -
Prediction methods: Combine the gene expression and genotype data to construct co-expression networks, identify highly connected subnetworks within these; Identify QTL with pleiotropic effects using forward stepwise regression and multivariate likelihood test. Identify causal subnetworks by testing for enrichment of expression traits. Mark genes in the causal subnetworks as causal genes.

Reference: Taylor et al. [109]
Name: -
Features: PPI, EXP (microarray)
PPI data source: OPHID, yeast, literature-curated
Proximity measurement of elements in networks: betweenness centrality, shortest path length
Prediction methods: Identify hubs in the global network, then for each hub assess the average Pearson correlation coefficient of co-expression for each interaction and the hub and remove insignificant hubs. Classify remaining hubs based on length, phosphorylation, linear motifs, globularity, domain architecture, etc. For each hub, identify [...] and disease-related genes.
Table 1.1: Continued.

Category: Disease-module based methods (continued)

Reference: Liu et al. [67]
Name: GNEA
Features: PPI, GO, DGAP expression data, manually curated gene sets
PPI data source: HPRD
Proximity measurement of elements in networks: cumulative expression level
Prediction methods: Map the relative mRNA expression of every gene in each insulin resistance or diabetes condition to the associated protein in a global network of protein-protein interactions and identify significantly transcriptionally affected subnetworks. Test each gene set for over representation in each identified subnetwork. For each gene set, assign a p-value to the number of conditions in which it was enriched based on comparison against a background distribution.

Reference: Dezső et al. [30]
Name: -
Features: PPI, EXP
PPI data source: MetaCore [19]
Proximity measurement of elements in networks: shortest path length
Prediction methods: Identify differentially expressed disease genes and map them on to PPI network. Construct shortest path subnetworks containing only the nodes in the shortest paths to disease genes. Calculate topological score for each gene in the subnetwork based on the number of shortest paths through the gene in the subnetwork as well as in the global PPI network.

PPI, protein-protein interaction; Y2H, yeast two hybrid experiment; PDS, protein domain sharing; PG, phylogenetic profiles; GN, gene neighbor; GO, gene ontology; EXP, gene expression; KEGG, Kyoto encyclopedia for genes and genomes for pathway membership; TOUCAN, cis-regulatory modules; TRANSFAC, transcriptional motifs; SEQ, sequence similarity; DO, disease ontology; TXT, literature text mining; Masspec, mass spectrometry; DDI, domain-domain interactions; SNPs, single nucleotide polymorphisms; DIP, database of interacting proteins; STRING, search tool for the retrieval of interacting genes/proteins; QTL, quantitative trait loci.
Disease Module-based Methods
In addition to generic candidate gene prioritization methods, significant efforts have been
made towards the prediction of disease genes for individual diseases by constructing disease
modules [11]. These methods start with identifying the disease modules or subnetworks, in
which members would share similar functions, expression patterns or metabolic pathways
assuming that breakdown of one such module causes a disease. This concept has been applied
to a wide range of diseases, including several different types of cancers [25,59,77,109], type
2 diabetes [67], obesity [26], asthma [44], neurological diseases [43,73,93], and psoriasis [30].
Liu et al. [67] used a network based approach to identify an insulin signaling module
as well as a network of molecular receptors that play significant roles in type 2 diabetes.
Chen et al. [26] identified subnetworks in liver and adipose tissues that contain genes for
which variants associated with obesity and diabetes have been identified. Taylor et al. [109]
constructed disease-associated protein interaction modules for adenocarcinoma of the breast,
providing useful predictors for breast cancer outcome.
A slightly different approach was
developed to prioritize disease-specific genes by constructing disease- and condition-specific
subnetworks [30]. Disease-specific genes, differentially expressed under disease conditions,
were mapped to the global PPI network.
The shortest path subnetwork was then built by
including only the nodes in the shortest path connecting the disease-specific genes. Each
node in this subnetwork was evaluated and assigned a topological score by comparing the
number of shortest paths through it in the subnetwork to the number of shortest paths in
the global network. This scheme was able to identify novel candidate genes for psoriasis.
1.2.2
Computational Advances in ASD Gene Prediction
Recently, Liu et al. developed DAWN (Detecting Association With Networks), an algorithm
for implicating ASD risk genes [66].
The algorithm is based on the intuition that
ASD genes cluster within a co-expression network [87,119]. DAWN uses two kinds of data:
rare variations from exome sequencing and gene co-expression in the mid-fetal prefrontal and
motor-somatosensory neocortex. The algorithm casts the ensemble data as a Markov random
field in which the graph structure is determined by gene co-expression and it combines these
interrelationships with node-specific observations, namely gene identity, expression, genetic
data, and the estimated effect on disease-risk.
The algorithm works as follows: first it
identifies 'hot spots' within the co-expression network at which multiple ASD risk genes
(identified from exome data) cluster together. For these hot spots, it uses evidence from
neighboring genes to reinforce ASD signal, while in 'cooler' regions the absence of neighboring
genes with evidence of ASD association downgrades the signal. By modeling this data, DAWN
was able to identify 127 ASD risk genes, many of which are novel. It was also successful in
predicting some known ASD genes, not included in the genetic data used to create the model.
In addition, the method was able to find three interesting sub-networks in support of the role
of aberrant connectivity of neuronal circuits due to intrinsically abnormal synapses in ASD.
Although currently DAWN's findings are limited by the power of test statistics derived from
available samples with exome sequencing, its success shows that computational approaches
hold sufficient promise in identifying ASD associated genes.
1.3
Contributions
To address the classic problem of disease-gene prediction in the context of ASDs, this thesis
designs three novel computational methods, one modified random walk with restarts method,
and a novel integrative method for combining these four with prior knowledge. While the
recent computational approach for solving the problem of ASD gene prediction focuses
mainly on rare variations from exome sequencing and gene co-expression data, our methods
focus on computationally extracting knowledge from other data sources, including copy
number variations (CNVs), phenotype similarity to ASD, and proximity to ASD genes in
the PPI network.
Our first method utilizes the copy number variations that have ever been observed in
the ASD population as well as appropriate control groups. We calculate an information
entropy based score for all the genes that can be mapped to the reported CNV loci, taking
into account their frequency of occurrence in ASD case-control groups. To the best of our
knowledge, this is the first information theoretic approach to extract knowledge from disease
CNVs.
In our second method, we incorporate phenotype similarity information to quantify the functional association of ASD genes to the rest of the genes.
Our method incorporates disease/phenotype similarity scores computed by van Driel et al. [112] and gene-phenotype
relationships from the Online Mendelian Inheritance in Man (OMIM) database [1]. This
method is seeded by high confidence ASD genes from the literature to identify ASD-like
phenotypes in OMIM. Genes involved in diseases with phenotypes similar to ASDs are
ranked highly by this algorithm.
In our third method, we use the power of topological proximity in the network. We
introduce a new diffusion based proximity metric for the proteins in the PPI network, namely
Diffusion State ASD Proximity (DSAP). DSAP is defined on diffusion state distances (DSDs)
in the PPI network, which capture the functional association of proteins better than direct
neighborhood and shortest path distances. DSAP of
a gene is calculated based on its diffusion state distances to ASD seed genes.
Since random walks with restarts are one of the most effective approaches in solving
the generic disease-gene prediction problem, we customize this approach specifically for the
ASD context for the first time. Our approach uses the global PPI network structure and can
be considered as a generalization of Google's PageRank algorithm. This method starts by
identifying high confidence ASD genes from the literature and then runs a random walk with
restarts on the connected PPI network to simulate network crosstalk between the genes in
the network. The simulated crosstalk gives a quantification of the functional association of
ASD genes to the rest of the genes in the network. All these methods are shown to perform
better than random selection.
Finally, we propose a novel integrative approach which incorporates CNV, phenotype
similarity, and connectivity, proximity, and topological similarity in the PPI network with
ASD-pathway knowledge from available literature.
Each gene is assigned an association
probability based on a logistic regression model. Lasso regularization with cross validation
is performed to avoid over-fitting of the model. We show that the integrative approach
significantly outperforms the above four methods.
We provide a number of interesting biological insights into the mechanism of ASDs by
performing a series of analyses on the candidate genes selected by our integrative method.
Pathway enrichment analysis reveals that our candidate gene set is overrepresented in a
number of pathways related to signal transduction, cell adhesion, and nervous system development. These pathways can be useful in explaining the pathophysiology of ASDs. We also
find an interesting link between ASDs and Inflammatory Bowel Diseases (IBD) in that our
candidate gene set has significant overlap with the majority of the IBD-related pathways.
Furthermore, we identify a number of disjoint subnetworks in our candidate gene set,
characterized by different categories of diseases and bio-functions, which provide an indication
of the existence of subclasses of disorders in the autism spectrum. The topmost subnetwork
characterized by gastrointestinal disorders, is particularly interesting. Functional and gene
ontology enrichment analyses help us identify a number of interesting molecular functions
and biological processes in which the candidate genes are overrepresented. For some of these
terms, their connection to ASDs is not so obvious and thus worth further investigation.
1.4
Outline
In Chapter 2, we describe three novel computational methods for predicting and prioritizing
ASD genes. We also introduce a random walk based approach for solving the disease gene
prediction problem in the context of ASDs for the first time. In Chapter 3, we describe a
novel integrative analysis approach which outperforms the individual methods described in
the previous chapter in identifying and ranking ASD genes. We select a set of candidate
genes which are highly likely to be associated with ASDs. We perform a series of analyses
to find significant pathways, bio-functions, diseases and subnetworks in which the candidate
gene set is overrepresented. The methodology of the analyses as well as the results and their
biological implications are discussed in Chapter 4. Finally, we present closing remarks and
discussion in Chapter 5.
Chapter 2
Predicting and Prioritizing Candidate
Genes for ASD
In this chapter we introduce three novel methods for gene prediction-prioritization for ASDs.
The first one is based on the copy number variations observed in the ASD population as
well as appropriate control groups. The second method incorporates disease similarity information with gene-phenotype mappings from OMIM to quantify the association of a gene to
ASDs. The third method computes functional association of ASD seed genes with the rest
of the genes in the network based on a new diffusion based proximity measure. Finally, we
customize a random walk with restarts based algorithm for ASDs which takes into consideration the proximity and connectivity information of the genes in the global PPI network
to quantify the ASD-association of genes in the network.
The landscape of genes for our methods covers the largest connected component of the
PPI network constructed using human PPIs collected from BioGRID [103] and ASD related
PPIs collected from the SFARI Autism PIN module [13]. It comprises 22192 genes and
227341 interactions. In what follows we refer to this largest connected component of the
PPI network as "connected PPI network". For measuring the performance of our methods,
we need to consider a set of ASD genes as a gold standard. We collected a list of known
ASD genes from SFARI Human Gene Module [13]. As of June 2014, this module reported
606 known human genes in connection to ASDs, 548 of which reside in the largest connected
component of our PPI network. We use these genes as our gold standard (Appendix A).
2.1
CNV Information Entropy based Prioritizer
2.1.1
Copy Number Variation (CNV)
For decades, it has been known to researchers that chromosomal rearrangements can result
in a wide range of developmental disorders.
However, technological and computational
advances in the past decade have enabled the development of assays capable of identifying
submicroscopic structural changes in chromosomes that could not have been detected by
traditional cytogenetic analysis. Among the most heavily scrutinized of these structural
variants are copy number variants, or CNVs. CNVs refer to submicroscopic chromosomal
deletions and/or duplications that are typically defined as DNA segments of 1000 base pairs
or larger in size that are present in a varying (or zero) number of copies when compared to
a reference genome [94] (Figure 2-1).
[Figure 2-1 panels: Deletion (pair of chromosomes with one copy of "C"); Normal pair of chromosomes; Duplication (pair of chromosomes with three copies of "C")]
Figure 2-1: Copy number variations in a pair of chromosomes. The pair of normal
chromosomes (middle pair) each have sections A-B-C-D. However, the loss of section C from
one of the chromosomes results in an abnormal chromosome with only sections A-B-D (left
pair); an individual with this deletion has only one copy of section C in their chromosomes.
On the other hand, the gain of an extra copy of section C on one of the chromosomes results
in an abnormal chromosome with sections A-B-C-C-D (right pair); an individual with this
duplication has three copies of section C in their chromosomes. Thus, both of the individuals
(left and right) have CNVs involving section C - one has lost a copy, the other has gained a
copy, but both have a varied number of copies of C when compared to the reference pair of
chromosomes.
There are many CNVs throughout the human genome that have no adverse influence on
the individual(s) harboring them in the general population. However, there are also a large
number of CNVs that have been definitively linked with diseases. Evidence also indicates
that interaction with additional genetic or environmental factors may influence whether
CNVs have a detectable adverse effect on an individual.
2.1.2
Copy Number Variants in ASD
Analyses of large autistic populations over the past decade suggest that CNVs at specific
locations in the genome result in increased susceptibility to ASD [69]. It has been estimated
that 10-20% of ASD cases result from the presence of one or more pathogenic CNVs in an
affected individual [2]. This finding implies that CNVs are one of the most, if not the
most, common genetic causes of ASD.
In 2003, the Simons Foundation launched the project "Simons Foundation Autism Research
Initiative (SFARI)" to advance the research of autism spectrum disorders. SFARI Gene [13]
is a publicly available, curated, web-based, searchable, integrated resource, made available
to the autism research community by SFARI. This resource is built on information extracted
from the studies on molecular genetics and biology of ASD. SFARI Gene includes genetic,
proteomic, and structural variation data from linkage and association studies, cytogenetic
abnormalities, and specific mutations associated with ASD. The Copy Number Variant
(CNV) module of SFARI Gene is a comprehensive, up-to-date collection of all copy number
variants associated with autism spectrum disorders (ASD). The content of the CNV module
is compiled in a systematic way from available case studies, CNV studies, and large-scale,
genome-wide CNV screens. CNVs from autistic case cohorts and, when available, unaffected
control cohorts are reported by the module. CNVs in the module are organized based upon
the locus (chromosomal region or band) in which they were observed in each study. As of
March 2014, more than 1800 CNV loci have been reported in connection with ASDs. These
CNVs map to thousands of genes, which is too large a number to be useful. Thus, we sought
an intelligent approach to narrow down the number of ASD risk genes by utilizing the copy
number variants reported in ASDs.
2.1.3
Calculating Information Entropy Score from CNVs
We downloaded the CNV loci and corresponding case-control occurrence data from SFARI
CNV module [13]. We collected sideband annotations for chromosomes from Ensembl [51].
Human gene-locus mapping information was collected from Entrez [68]. We designed a mapper that maps the CNVs to corresponding genes and calculates their frequency of occurrence
in cases and controls using the aforementioned information. Then, for each mapped gene g,
we calculated the information entropy score $\rho_g$ using Formula 2.1. The workflow for our
CNV-based prioritizer is shown in Figure 2-2.
\[ \rho_g = K_g \times (1 - IE_g) + \text{offset} \tag{2.1} \]
Here, $f_g^y$ denotes the number of occurrences of gene $g$ in disease group $y$, where
$y \in \{\text{case}, \text{control}\}$; $p_g$ denotes the probability of gene $g$ occurring in ASD cases;
$K_g$ is the scaling factor corresponding to gene $g$; and $IE_g$ denotes the information entropy of
gene $g$. These terms are defined by Formula 2.2.
(
fcase +fontrol
V(f"ase)
Kg
-
IEg =
'I
2
t
+(fgon ro1)
2
(2.2a)
"-fc**r*
+fcontrol
-case
\[ IE_g = -p_g \log_2(p_g) - (1 - p_g) \log_2(1 - p_g) \tag{2.2b} \]
\[ p_g = \frac{f_g^{\text{case}}}{f_g^{\text{case}} + f_g^{\text{control}}} \]
We calculate $\rho_g$ using three different scaling factors $K_g$ and choose the one that gives the
largest area under the Receiver Operating Characteristic (ROC) curve (AUC). The selected scaling factor is:
K9
fcasefcontrol
V/cas
onero
as it gives an AUC of 59.81% (Figure 2-3). We chose a small positive
number, such as 1e-6, as the offset. All genes in the human PPI network to which the mapper
maps no CNV were assigned a score equal to the offset value.
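The scoring step can be illustrated with a minimal Python sketch. It follows the structure of Formula 2.1 and the binary entropy term of Formula 2.2, but the function names are hypothetical and the scaling factor is left as a caller-supplied function: `unit_scale` below is only a placeholder for illustration and is not one of the scaling factors evaluated in this work.

```python
import math


def binary_entropy(p):
    """Base-2 Shannon entropy of a Bernoulli(p) variable; 0 when p is 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def cnv_score(f_case, f_control, scale, offset=1e-6):
    """Score of the form scale * (1 - entropy) + offset, as in Formula 2.1.

    f_case and f_control are the gene's occurrence counts in the case and
    control groups. `scale` is a hypothetical stand-in for the scaling factor.
    """
    total = f_case + f_control
    if total == 0:
        # No CNV mapped to this gene: it receives only the offset.
        return offset
    p_case = f_case / total
    return scale(f_case, f_control) * (1.0 - binary_entropy(p_case)) + offset


def unit_scale(f_case, f_control):
    """Placeholder scaling factor (assumption for illustration only)."""
    return 1.0


# A gene seen only in cases has zero entropy and keeps the full scaled score;
# a gene seen equally often in cases and controls has entropy 1 and keeps
# only the offset.
case_only = cnv_score(10, 0, unit_scale)
balanced = cnv_score(5, 5, unit_scale)
```

Passing the scaling factor as a function keeps the entropy term separate from the choice of scaling, so alternative factors can be compared by AUC without touching the rest of the computation.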
[Figure 2-2: workflow diagram. Inputs: chromosome sideband annotations from Ensembl; CNV loci in cases and controls from SFARI; gene-locus mappings from Entrez. Stages: Mapper (gene frequencies in cases and controls), then Scorer (information entropy based scores).]
Figure 2-2: Steps in CNV-based prediction-prioritization of ASD genes. At first,
our custom-built mapper maps CNV loci in ASD case-control groups from SFARI CNV
module to genes using chromosome sideband annotations from Ensembl and gene-locus
mapping information from Entrez. The mapper also counts the numbers of occurrences of
each gene in the case group and control group separately. Next, the scorer calculates an
information entropy based score for each gene based on its frequency of occurrence in ASD
case-control groups. Genes are ranked in descending order of entropy based scores.
2.1.4
Quality of CNV Information Entropy based Prioritization
To measure the quality of our information entropy based ranking, we calculated the area (AUC)
under the Receiver Operating Characteristic (ROC) curve (Figure 2-3). The true positive
rate (TPR) or recall, and false positive rate (FPR) are calculated using Equations 2.3 and
2.4 respectively.
Using any of the scaling factors we get an AUC of approximately 59%,
which is better than the random case (AUC = 50%).
Since we are more interested in identifying the ASD genes than the non-ASD ones, we
look at the lift chart for this method (Figure 2-4). The lift chart shows how much more likely
we are to identify ASD genes than if we make random guesses. For example, by considering
only the top 2% of genes in the ranklist found by our method, we are able to identify 2.3
times as many known ASD genes, in comparison to using no method.
This enrichment
indicates a reasonable improvement considering the unbalanced nature of our dataset with
ASD genes accounting for only 2.5% of the entire dataset. Here, the lift of a bucket, or a
group of genes in the dataset is calculated using Equation 2.5.
\[ \text{recall} = \text{TPR} = \frac{\text{Number of ASD genes correctly identified by the method}}{\text{Total number of ASD genes in the dataset}} \tag{2.3} \]
\[ \text{FPR} = \frac{\text{Number of non-ASD genes wrongly identified as ASD genes by the method}}{\text{Total number of non-ASD genes in the dataset}} \tag{2.4} \]
\[ \text{lift of a bucket} = \frac{\text{Percentage of true ASD genes in the bucket identified by the method}}{\text{Percentage of ASD genes in the bucket selected randomly}} \tag{2.5} \]
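As an illustration of Equations 2.3 to 2.5, the following Python sketch computes one ROC point and the lift of the top bucket from a ranked gene list. The function names and the toy data are hypothetical, and the sketch assumes that every gold-standard gene appears in the ranked list.

```python
def roc_point(ranked_genes, positives, top_k):
    """Return (TPR, FPR) when the top_k ranked genes are predicted as ASD genes.

    Assumes `positives` is a set whose members all appear in `ranked_genes`.
    """
    predicted = set(ranked_genes[:top_k])
    n_pos = len(positives)
    n_neg = len(ranked_genes) - n_pos
    true_pos = len(predicted & positives)
    false_pos = len(predicted) - true_pos
    return true_pos / n_pos, false_pos / n_neg


def lift(ranked_genes, positives, top_fraction):
    """Rate of ASD genes in the top bucket divided by their rate overall."""
    k = max(1, int(round(top_fraction * len(ranked_genes))))
    bucket = ranked_genes[:k]
    rate_in_bucket = sum(1 for gene in bucket if gene in positives) / k
    base_rate = len(positives) / len(ranked_genes)
    return rate_in_bucket / base_rate


# Toy ranking of ten genes, two of which are gold-standard genes.
ranked = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8", "g9", "g10"]
gold = {"g1", "g4"}
tpr, fpr = roc_point(ranked, gold, top_k=2)
top_bucket_lift = lift(ranked, gold, top_fraction=0.2)
```

Sweeping `top_k` over the whole ranking yields the points of the ROC curve, and a lift above 1 for a small top fraction means that the head of the ranking is enriched in known ASD genes relative to random selection.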
[Figure 2-3 plot; x-axis: False Positive Rate (FPR). Scaling Factor 1: AUC = 0.5930; Scaling Factor 2: AUC = 0.5981; Scaling Factor 3: AUC = 0.5977; Baseline: AUC = 0.5000]
Figure 2-3: Receiver operating characteristic curves for CNV-based prioritizer using different
scaling factors.
2.2
ASD Similarity based Prioritizer
2.2.1
Similarity of Phenotypes or Diseases
Similarity between phenotypes reflects biological modules of interacting functionally-related
genes. These similarities are positively correlated with a number of measures of gene function, including relatedness at the level of protein sequence, protein motifs, functional annotation, and direct protein-protein interaction [112]. In fact, genes or proteins associated
with similar diseases or phenotypes lie in close proximity in the PPI network. Furthermore,
phenotype grouping reflects the modular nature of human disease genetics.
[Figure 2-4 plot; x-axis: % of Genes in the Ranklist]
Figure 2-4: Lift chart for CNV-based prioritizer.
These facts bring forth the idea of utilizing disease or phenotype similarity information for identification
of disease genes. In 2006, van Driel et al. [112] introduced a text mining algorithm to compute disease or phenotype similarity information for 5080 phenotypes collected from the OMIM
database. The steps of the algorithm can be summarized as follows.
* First, all OMIM records are scanned for keywords that are present in the anatomy (A) and disease (C) sections of the Medical Subject Headings (MeSH) vocabulary. MeSH is a controlled vocabulary of the U.S. National Library
of Medicine. It is especially useful for applications whose input uses
different terminologies for identical medical concepts.
* Each OMIM record is then represented by a (0,1)-vector where each entry of the vector
corresponds to whether a term is present (denoted by 1) or absent (denoted by 0) in
the record.
* Similarity of two phenotypes is then computed by calculating the cosine of the angle
between their respective feature vectors. The similarity score ranges from 0 to 1.
We collected the phenotype similarity matrix computed by van Driel et al., which is available
through a web interface (http://www.cmbi.ru.nl/MimMiner/).
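The last two steps above can be sketched in a few lines of Python. This is only an illustration of cosine similarity on (0,1)-vectors, not the MimMiner implementation; the vocabulary and the two records are made-up examples.

```python
import math


def binary_vector(record_terms, vocabulary):
    """(0,1)-vector marking which vocabulary terms occur in a record."""
    return [1 if term in record_terms else 0 for term in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical vocabulary and records, for illustration only.
vocabulary = ["brain", "seizures", "heart", "liver"]
record_a = {"brain", "seizures"}
record_b = {"brain", "heart"}

similarity = cosine_similarity(
    binary_vector(record_a, vocabulary),
    binary_vector(record_b, vocabulary),
)
```

Because the entries are non-negative, the cosine stays between 0 and 1, which matches the stated range of the similarity score.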
2.2.2
Gene-Phenotype Association Data
OMIM provides a publicly-accessible and comprehensive database of genotype-phenotype
relationships in humans.
We downloaded gene-phenotype relationship information from
OMIM database [1]. We retained only those gene-phenotype relationships where the phenotype also has a similarity score available in the disease similarity matrix computed by van
Driel et al. [112]. We then mapped the genes associated with those phenotypes onto the
connected PPI network. Note that multiple genes can be mapped to a single phenotype and
one gene can be involved in multiple phenotypes. Thus, after this step, we are left with a
total of 1474 genes mapped to 1999 OMIM phenotypes.
2.2.3
Calculating ASD Similarity Scores
We use the disease similarity matrix computed by van Driel et al. [112] and the gene-phenotype association data from the OMIM database [1] to compute the association between
each gene and our disease of interest, ASD. We call this association the ASD similarity score
of the gene. Let $\Theta = \{d_1, d_2, d_3, \ldots, d_t\}$ be the set of diseases for which similarity scores are
available, and let $\phi(d_i, d_j)$ denote the similarity between diseases $d_i$ and $d_j$. Also let $\Theta_g \subseteq \Theta$
be the set of phenotypes associated with gene $g$. Let $S$ be the set of seed genes which are
known to be associated with ASD with high confidence. We select the genes that appear
in eight or more ASD-related studies from our gold standard (Appendix A) as the seed set.
Thus our seed set $S$ contains 106 genes. Let $\Theta_S$ denote the set of phenotypes related to
the ASD seed genes. We compute the association of a gene to ASD by looking at the similarity of
the phenotypes related to it to the ASD-like phenotypes, $\Theta_S$ (Equation 2.6). The association
score is normalized by the sum of pairwise similarity of ASD-like phenotypes. Thus we get
an ASD similarity score, $\psi_g$, for each gene in the largest connected component in the PPI
network. Genes with no phenotype mapping receive an ASD similarity score of zero.
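The normalized score referred to as Equation 2.6 can be sketched in Python as follows. The sketch assumes that the score is the summed similarity between a gene's phenotypes and the ASD-like phenotypes, divided by half the summed pairwise similarity among the ASD-like phenotypes. The function names, phenotype identifiers, and similarity values are hypothetical.

```python
def asd_similarity_score(gene_phenotypes, asd_phenotypes, similarity):
    """Summed similarity to ASD-like phenotypes, normalized by half the
    summed pairwise similarity among the ASD-like phenotypes themselves."""
    numerator = sum(
        similarity(d_i, d_j) for d_i in gene_phenotypes for d_j in asd_phenotypes
    )
    denominator = 0.5 * sum(
        similarity(d_m, d_n) for d_m in asd_phenotypes for d_n in asd_phenotypes
    )
    if denominator == 0.0:
        return 0.0
    return numerator / denominator


# Toy symmetric similarity table (made-up values, for illustration only).
toy_table = {
    ("p1", "p1"): 1.0,
    ("p2", "p2"): 1.0,
    ("p3", "p3"): 1.0,
    ("p1", "p2"): 0.5,
    ("p1", "p3"): 0.2,
    ("p2", "p3"): 0.4,
}


def toy_similarity(a, b):
    """Look up a pair in either order; unknown pairs have similarity 0."""
    return toy_table.get((a, b), toy_table.get((b, a), 0.0))


asd_like = ["p1", "p2"]
score_with_phenotype = asd_similarity_score(["p3"], asd_like, toy_similarity)
score_without_phenotype = asd_similarity_score([], asd_like, toy_similarity)
```

A gene with no mapped phenotype contributes an empty numerator and therefore scores zero, in line with the convention stated above.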
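\[ \psi_g = \frac{\sum_{d_i \in \Theta_g} \sum_{d_j \in \Theta_S} \phi(d_i, d_j)}{0.5 \times \sum_{d_m \in \Theta_S} \sum_{d_n \in \Theta_S} \phi(d_m, d_n)} \tag{2.6} \]
2.2.4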
Performance of ASD Similarity based Prioritizer
By sorting the genes in descending order of ASD similarity scores, we obtain the ASD
similarity based ranking of genes. To measure the performance of the ASD similarity based
prioritizer, we calculate the area (AUC) under the Receiver Operating Characteristic (ROC)
curve (Figure 2-5). The TPR and FPR are calculated using Equations 2.3 and 2.4 respectively as before. We measure the performance of this method on 22086 genes of the PPI
network.
These genes do not include the 106 ASD genes used in identifying ASD-like
phenotypes.
We achieve an AUC of 55.96% using this method, which is better than the
random case (AUC = 50%).
Figure 2-6 depicts the lift chart for this method which shows how much more likely we
are to identify ASD genes than if we make random guesses. By considering only the top 2%
of genes in the ranklist found by our method, we are able to identify 3.62 times as many
known ASD genes, in comparison to using no method. This enrichment indicates quite an
improvement considering the imbalanced nature of our dataset with ASD genes accounting
for roughly 2% of the entire dataset. Here, lift is calculated using Equation 2.5.
[Figure 2-5 plot; x-axis: False Positive Rate (FPR); AUC = 0.5596]
Figure 2-5: Receiver operating characteristic curve for ASD similarity based prioritizer.
[Figure 2-6 plot; x-axis: % of Genes in the Ranklist; series: Excluding Seeds, Including Seeds, Baseline]
Figure 2-6: Lift chart for ASD similarity based prioritizer.
2.3
Diffusion State ASD Proximity based Prioritizer
2.3.1
Diffusion State Distance (DSD) in PPI Network
As discussed in Section 1.2.1, functional similarity of genes or proteins in the PPI network
is often inferred based on direct interaction or some notion of network proximity in a local
neighborhood. Most disease gene prediction-prioritization methods measure
local proximity based on either direct neighborhood, or shortest path distance, but this has
only a limited ability to capture fine-grained neighborhood distinctions because most proteins are close to each other, and there are many ties in proximity. Also, the accuracy of
these methods is often limited by the incomplete and noisy nature of the PPI data. Addressing these issues, Cao et al. [23] introduced Diffusion State Distance (DSD), a new distance
metric based on the graph diffusion property.
DSD captures fine-grained distinctions in
proximity for transfer of functional annotation in the PPI network and is able to perform
much better than the conventional distance metrics.
Definition of DSD Metric
Cao et al. [23] defined diffusion state distance (DSD) as follows. Let $G(V, E)$ be the undirected
connected PPI network, where $V = \{v_1, v_2, v_3, \ldots, v_n\}$ is the set of genes or proteins in the
network with $|V| = n$, and $E = \{e_1, e_2, e_3, \ldots, e_m\}$ is the set of interactions, with an edge $(v_i, v_j) \in E$
denoting the interaction between genes $v_i$ and $v_j$. Let $He^{\{k\}}(A, B)$ be the expected number
of times that a random walk starting at node $A$ and proceeding for $k$ steps will visit node $B$.
Assuming $k$ is fixed, $He^{\{k\}}(A, B)$ can be denoted simply as $He(A, B)$. The $n$-dimensional
vector $He(v_i)$, $\forall v_i \in V$, is defined as
$$He(v_i) = (He(v_i, v_1), He(v_i, v_2), \ldots, He(v_i, v_n)).$$
Then, the DSD between two genes $u$ and $v$, $\forall u, v \in V$, is given by Equation 2.7.
$$DSD(u, v) = \|He(u) - He(v)\|_1 \qquad (2.7)$$
where $\|\cdot\|_1$ denotes the $L_1$ norm of a vector.
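The definition can be made concrete with a small Python sketch that computes $He$ for a fixed $k$ by accumulating powers of the one-step transition matrix and then takes the $L_1$ distance between rows. It assumes that the starting position (step 0) counts as a visit; the function names and the four-node toy graph are hypothetical, and the dense matrix arithmetic is only suitable for small graphs.

```python
def transition_matrix(adj):
    """Row-stochastic one-step transition matrix of a simple random walk."""
    return [[a / sum(row) for a in row] for row in adj]


def expected_visits(adj, k):
    """he[i][j]: expected visits to node j by a k-step walk started at node i.

    Assumption: the start itself (step 0) is counted as one visit.
    """
    n = len(adj)
    P = transition_matrix(adj)
    step = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # P^0
    he = [row[:] for row in step]
    for _ in range(k):
        # Advance the walk by one step: step <- step * P.
        step = [[sum(step[i][m] * P[m][j] for m in range(n)) for j in range(n)]
                for i in range(n)]
        for i in range(n):
            for j in range(n):
                he[i][j] += step[i][j]
    return he


def dsd(he, u, v):
    """Diffusion state distance: L1 norm of He(u) - He(v)."""
    return sum(abs(a - b) for a, b in zip(he[u], he[v]))


# Toy undirected graph with edges 0-1, 1-2, 1-3, 2-3.
toy_adj = [[0, 1, 0, 0],
           [1, 0, 1, 1],
           [0, 1, 0, 1],
           [0, 1, 1, 0]]
toy_he = expected_visits(toy_adj, 5)
toy_distance = dsd(toy_he, 0, 2)
```

Because DSD is an $L_1$ distance between visit-count vectors, symmetry and the triangle inequality hold by construction, which the toy graph can be used to confirm numerically.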
Cao et al. [23] proved three lemmas establishing that DSD is a metric: it
is symmetric, positive definite and non-zero whenever $u \neq v$, and it obeys the triangle
inequality. It also converges as $k$ approaches infinity, and thus can be defined independently
of $k$.
Lemma 1 DSD is a metric on $V$, where $V$ is the vertex set of a simple connected graph
$G(V, E)$.
Lemma 2 Let $G$ be a connected graph whose random walk one-step transition probability
matrix $P$ is diagonalizable and ergodic as a Markov chain; then for any $u, v \in V$, $DSD(u, v)$
converges as $k$, the length of the random walk, approaches infinity.
Lemma 3 Let $G$ be a connected graph whose random walk one-step transition probability
matrix $P$ is diagonalizable and ergodic as a Markov chain; then for any $u, v \in V$, we have
$$\lim_{k \to \infty} DSD^{\{k\}}(u, v) = \|(b_u^T - b_v^T)(I - P + W)^{-1}\|_1,$$
where $I$ is the identity matrix, $W$
is the constant matrix in which each row is a copy of $\pi^T$, $\pi^T$ being the unique steady state
distribution, and for any $i \in V$, $b_i^T$ is the $i$th basis vector, i.e., the row vector of all zeros
except for a 1 in the $i$th position.
Proofs of these lemmas are out of the scope of this discussion, but can be found in [23].
2.3.2
Calculating Diffusion State ASD Proximity (DSAP) of Genes
A key step in our approach towards ASD gene prediction-prioritization is to measure the
proximity between candidate genes and known ASD genes in the connected PPI network.
For this purpose, we define a new proximity measure based on DSD and call it Diffusion
State ASD Proximity, or DSAP for short. Let $DSD(u, v)$ denote the pairwise diffusion state
distance between any two nodes $u, v \in V$ in the connected PPI network $G(V, E)$, as
defined by Equation 2.7. Let $S$ be the set of genes known to be associated with ASD with
high confidence. Out of the 548 genes in our gold standard (Appendix A), 106 genes appear
in eight or more ASD related studies. We build our ASD gene set $S$ using these genes. We
define the pairwise diffusion state proximity (DSP) of two nodes $u, v \in V$ by a Gaussian kernel
over $DSD(u, v)$ as follows.
$$DSP(u, v) = e^{-\left(\frac{DSD(u, v)}{7}\right)^2}$$
Here, we divide $DSD(u, v)$ by 7 so that the $DSP(u, v)$ value does not become too small,
given that the median $DSD(u, v)$ for the connected human PPI network is found to be approximately equal to 7. Then, we define the diffusion state ASD proximity of a gene $g \in V$ by
Equation 2.8.
$$DSAP(g) = \sum_{s \in S} DSP(g, s) \qquad (2.8)$$
We calculate DSAP scores for all the genes in the connected PPI network and sort them in
descending order of DSAP scores, which gives us the DSAP-based ranking of genes.
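The scoring and ranking steps can be sketched in Python as follows. The sketch assumes that the kernel has the form $e^{-(DSD/7)^2}$ and that pairwise DSD values are available in a nested dictionary; the names `dsp`, `dsap_scores`, `rank_by_dsap` and the toy distances are hypothetical.

```python
import math


def dsp(dsd_value, scale=7.0):
    """Gaussian-kernel proximity for one DSD value (assumed kernel form)."""
    return math.exp(-((dsd_value / scale) ** 2))


def dsap_scores(dsd_lookup, genes, seeds, scale=7.0):
    """DSAP(g): sum of DSP(g, s) over all seed genes s."""
    return {g: sum(dsp(dsd_lookup[g][s], scale) for s in seeds) for g in genes}


def rank_by_dsap(scores):
    """Genes sorted in descending order of DSAP score."""
    return sorted(scores, key=scores.get, reverse=True)


# Toy distances from three genes to a single seed gene "A".
toy_dsd = {"A": {"A": 0.0}, "B": {"A": 3.5}, "C": {"A": 14.0}}
toy_scores = dsap_scores(toy_dsd, ["A", "B", "C"], ["A"])
toy_ranking = rank_by_dsap(toy_scores)
```

With the scale set to the median DSD, a gene at the median distance from a seed contributes $e^{-1}$ to the sum, so typical contributions stay well away from zero.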
2.3.3
Quality of DSAP-based Ranking
To measure the performance of the DSAP-based prioritizer, we calculate the area (AUC)
under the Receiver Operating Characteristic (ROC) curve (Figure 2-7). The TPR and FPR
are calculated using Equations 2.3 and 2.4 respectively as before. We measure the quality of
ranking on 22086 genes of the PPI network. These genes do not include the 106 ASD genes
used in measuring proximity to ASD. We achieve an AUC of 54.05% using this method,
which is better than the random case (AUC = 50%).
With this approach we achieve a lift of 1.1% of ASD genes (excluding seeds) in the top
4% of the ranklist over random selection. Although this is worse than the previous
two methods, it still identifies more non-seed ASD genes than random selection.
Inclusion of seed genes boosts the lift up to 5.7-fold, which means that the seed genes are very
close to each other in the network in terms of DSAP and hence make up a significant portion
of the top 4% of the ranklist. The lift chart including the seeds is shown in Figure 2-8. Here,
lift is calculated using Equation 2.5.
[Figure 2-7 plot: ROC curve; horizontal axis: False Positive Rate (FPR).]
Figure 2-7: Receiver operating characteristic curve for Diffusion State ASD Proximity
(DSAP) based prioritizer.
[Figure 2-8 plot: lift curves for Excluding Seeds, Including Seeds, and Baseline; horizontal axis: % of Genes in the Ranklist.]
Figure 2-8: Lift chart for Diffusion State ASD Proximity (DSAP) based prioritizer.
2.4
Network Crosstalk based Prioritizer
2.4.1
Motivation
Functional association between genes or proteins in the PPI network is often measured using diffusion kernel, random walk with restart, or propagation flow based algorithms. These
approaches are global in nature in that they consider multiple alternate paths and the whole
topology of the PPI network. The basic steps for most of these approaches are as follows. First, identify
seed genes that are significantly associated with the disease of interest. Next, map these
seed genes onto the PPI network. Finally, quantify the functional association between genes
in the PPI network and the seed genes based on network proximity and connectivity in a
global manner. As discussed in Section 1.2.1, these approaches have recently been successfully applied to identify genes for a number of complex diseases, including different types
of cancers, type-2 diabetes, neurological disorders, psoriasis, and asthma. Motivated
by these successes, we aim to develop a global network-based scoring scheme to quantify
functional association between ASD seed genes and the rest of the genes in the human PPI
network. We redefine the notion of network crosstalk introduced by Nibbe et al. [76] in
the context of ASDs and compute ASD association with an approach based on random walk
with restarts. To the best of our knowledge, this is a first attempt to capture functional
association of ASD genes via network connectivity and proximity in a global manner.
2.4.2
Problem Formulation
Following closely the approach adopted by Nibbe et al. [76] for identifying candidate genes
and subnetworks for human colorectal cancers, we reformulate the problem of disease gene
prediction-prioritization for ASDs. Let $G = (V, E)$ be the connected PPI network, where
$V$ consists of the genes in the network, and an undirected edge $(u, v) \in E$ represents an
interaction between genes $u \in V$ and $v \in V$. Let $N(v)$ be the set of direct neighbors (i.e.,
interacting partners) of gene $v \in V$, i.e., $N(v) = \{u \in V : (u, v) \in E\}$. Let $S \subset V$ be the
set of genes known to be associated with ASD with high confidence. Among the 548 gold
standard ASD genes, 106 genes appear in eight or more ASD related studies. We build our
ASD gene set $S$ using these genes. Our goal is to compute a score $a(v)$ for each gene $v \in V$
to quantify network crosstalk between $v$ and the genes in $S$, network crosstalk being the
indicator of functional association between genes.
In order to develop a biologically sound measure of network crosstalk, Nibbe et al. [76]
relied on two observations.
(i) Functional similarity between proteins is significantly correlated with their network
proximity, as measured by the number of hops between these proteins.
(ii) Existence of multiple alternate paths between two proteins is an indicator of their
functional association, since functional multiple paths are often conserved through
evolution owing to their contribution to robustness against perturbations, as well as
amplification of signals.
Like Nibbe et al. [76], we compute network crosstalk scores for genes in the PPI network
using an information flow approach based on random walks with restarts. This approach
incorporates both the number of hops and multiple alternate paths between genes into
the assessment and can be considered a generalization of Google's well-known PageRank
algorithm [17].
2.4.3
Calculating Network Crosstalk Scores
For a given ASD seed gene set $S$, we calculate network crosstalk scores for all the genes
in the PPI network by simulating a random walk as follows. The random walk starts at a
randomly chosen gene in $S$. At each step, when the random walk is at some gene $v \in V$, it
either moves to a neighbor of $v$ with probability $1 - r$, or it restarts at a gene in $S$ with
probability $r$. Here, the parameter $0 \le r \le 1$ is called the restart probability. For each
move, the neighbor to be moved to is chosen uniformly at random from $N(v)$. Similarly, for
each restart, the gene to be restarted from is selected uniformly at random from $S$.
The network crosstalk between the genes in $S$ and each gene $v \in V$ can be computed as
the relative amount of time spent at $v$ by such an infinite random walk, or equivalently, the
probability that the random walk will be at gene $v$ at a randomly chosen time step after the
random walk proceeds for a sufficiently long time. Formally, let $a_t$ be the $|V|$-dimensional
vector such that $a_t(v)$ denotes the probability that the random walk will be at gene $v$ at
step $t$, where $\|a_t\|_1 = 1$ (here, $\|\cdot\|_1$ denotes the $L_1$ norm of a vector). Let $P$ denote the
stochastic matrix derived from network $G = (V, E)$, i.e., $P(u, v) = 1/|N(v)|$ if $(u, v) \in E$, and 0
otherwise. Then, at any step $t + 1$, the crosstalk score vector is given by Equation 2.9.
$$a_{t+1} = (1 - r) P a_t + r \gamma \qquad (2.9)$$
where $\gamma$ denotes the restart vector with $\gamma(u) = 1/|S|$ for $u \in S$, and 0 otherwise. With
initial crosstalk scores set to $a_0 = \gamma$, the vector of final crosstalk scores for each gene in the
network is given by $a = \lim_{t \to \infty} a_t$. In our experiments, we stopped iterating once
$\|a_{t+1} - a_t\|_1 < 10^{-9}$.
As we can see, when $r = 0$, $a$ is equal to the eigenvector of $P$ that corresponds to its
largest eigenvalue (with numerical value 1), i.e., $a(v)$ is exactly equal to the page rank of
$v$ in $G$ for all $v \in V$. Thus, the crosstalk score of a gene $v$ is not only an indicator of its
connectivity and proximity to ASD genes, but it also reflects the centrality
of the gene in the network.
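The iteration in Equation 2.9 can be sketched in Python as follows. This is a minimal illustration on an adjacency-list graph, not the implementation used in our experiments; the function name `crosstalk_scores`, the `max_iter` safeguard, and the toy path graph are assumptions.

```python
def crosstalk_scores(neighbors, seeds, r=0.5, tol=1e-9, max_iter=10000):
    """Random walk with restart: iterate a <- (1 - r) P a + r * gamma.

    neighbors: dict mapping each gene to the list of its interacting partners
               (undirected graph, every gene has at least one neighbor).
    seeds: set of seed genes; restarts land uniformly on these.
    """
    genes = list(neighbors)
    gamma = {g: (1.0 / len(seeds) if g in seeds else 0.0) for g in genes}
    a = dict(gamma)  # a_0 = gamma
    for _ in range(max_iter):
        nxt = {g: r * gamma[g] for g in genes}
        for v in genes:
            # Gene v passes its mass to each neighbor with weight 1 / |N(v)|.
            share = (1.0 - r) * a[v] / len(neighbors[v])
            for u in neighbors[v]:
                nxt[u] += share
        diff = sum(abs(nxt[g] - a[g]) for g in genes)  # L1 change
        a = nxt
        if diff < tol:
            break
    return a


# Toy path graph A - B - C - D with a single seed gene "A".
toy_graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
toy_scores = crosstalk_scores(toy_graph, {"A"}, r=0.5)
```

The `max_iter` bound is included because with $r = 0$ the iteration need not converge on every graph (for example, on a bipartite graph the walk is periodic), whereas any $r > 0$ makes the update a contraction in the $L_1$ norm.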
2.4.4
Dealing with Statistical Bias
PPI networks are often noisy in that well-studied proteins or genes are highly connected,
having many interactions, whereas less studied ones often miss interactions. Thus there is
a high probability that the highly connected hub genes will be assigned artificially high crosstalk
scores just by chance, skewing the result towards well-studied genes. However, we are
interested in finding those genes that are less characterized but may provide novel insights
into ASDs.
To correct for this bias, we assign significance scores to the crosstalk scores using Monte
Carlo simulations. We define a null model that accurately captures the degree distribution
of the ASD seed genes in $S$ as follows. For a given ASD seed set $S$, in order to generate a
random instance $S^{(i)}$ representative of $S$, first, for every gene $u \in S$, we create a bucket $B(u)$
of genes in the network, such that $\bigcup_{u \in S} B(u) = V$ and $B(u) \cap B(u') = \emptyset$ for all distinct $u, u' \in S$.
A gene $v \in V$ is assigned to bucket $B(u)$ if $\big||N(v)| - |N(u)|\big| \le \big||N(v)| - |N(u')|\big|$ for all $u' \in S$,
and ties are broken randomly. Next, we choose one gene from each bucket uniformly at
random to construct $S^{(i)}$, so that $|S^{(i)}| = |S|$. Note that each bucket consists of genes that
have a similar number of interactions to a particular ASD seed gene; therefore each seed
gene is represented in $S^{(i)}$ by exactly one gene in terms of its number of neighbors. Thus,
the expected total degree of genes in $S^{(i)}$ is likely to be very close to the total degree of
the genes in $S$. After generating a random instance $S^{(i)}$, we compute the corresponding
crosstalk vector $a^{(i)}$ by letting $\gamma^{(i)}(u) = 1/|S^{(i)}|$ for $u \in S^{(i)}$, and 0 otherwise.
We repeat this procedure $N$ times, where $N$ is sufficiently large (we use $N = 1000$ in
our experiments), to obtain a sampling $\{a^{(1)}, a^{(2)}, a^{(3)}, \ldots, a^{(N)}\}$ of the null distribution of the
crosstalk scores, with respect to seed sets that are representative of $S$ in terms of their
sizes and degree distributions. Next, we estimate the mean $\mu_S = \frac{1}{N}\sum_{1 \le i \le N} a^{(i)}$ and the
standard deviation $\sigma_S$ of the null distribution of crosstalk scores for $S$
using this sample. Finally, we compute the adjusted z-scores for the crosstalk scores using
Equation 2.10.
$$z_S(v) = \frac{a(v) - \mu_S(v)}{\sigma_S(v)}, \quad \forall v \in V \qquad (2.10)$$
These adjusted crosstalk scores represent the statistical significance of the crosstalk between each gene and the genes in the ASD seed set, accounting for the centrality and degree
distribution of the genes in the PPI network. We sort the genes in our PPI network in descending order of the adjusted crosstalk scores, which gives us the network crosstalk based
ranking of genes.
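The null model and the z-score adjustment can be sketched in Python as follows. Two choices here are assumptions rather than part of the description above: each seed is placed in its own bucket first so that no bucket is ever empty, and the standard deviation is the sample standard deviation with an $N - 1$ denominator. All function names and the toy degrees are hypothetical.

```python
import random
import statistics


def degree_buckets(degree, seeds, rng):
    """Assign every gene to the seed whose degree is closest to its own.

    degree: dict mapping gene -> number of neighbors.
    Ties between equally close seeds are broken at random.
    Assumption: each seed anchors its own bucket, so no bucket is empty.
    """
    seeds = sorted(seeds)
    buckets = {s: [s] for s in seeds}
    for v, d in degree.items():
        if v in buckets:
            continue
        best = min(abs(d - degree[s]) for s in seeds)
        closest = [s for s in seeds if abs(d - degree[s]) == best]
        buckets[rng.choice(closest)].append(v)
    return buckets


def sample_seed_set(buckets, rng):
    """One random, degree-matched seed set: one gene drawn from each bucket."""
    return {rng.choice(members) for members in buckets.values()}


def z_scores(observed, null_samples):
    """Adjusted score per gene: (observed - null mean) / null std."""
    z = {}
    for v, a_v in observed.items():
        values = [sample[v] for sample in null_samples]
        mu = statistics.mean(values)
        sigma = statistics.stdev(values)  # assumption: N - 1 denominator
        z[v] = (a_v - mu) / sigma if sigma > 0 else 0.0
    return z


rng = random.Random(0)
toy_degree = {"a": 1, "b": 2, "c": 5, "d": 6, "e": 2, "f": 5}
toy_buckets = degree_buckets(toy_degree, {"a", "c"}, rng)
toy_null_seeds = sample_seed_set(toy_buckets, rng)
```

In a full run, each sampled seed set would be passed to the random walk with restart to obtain one null crosstalk vector, and the collection of such vectors would then be passed to `z_scores` together with the observed scores.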
2.4.5
Performance of Network Crosstalk based Prioritizer
To measure the performance of our network crosstalk based prioritizer, we calculate the
area under the Receiver Operating Characteristic (ROC) curve (AUC). As before, the TPR
and FPR are calculated using Equations 2.3 and 2.4 respectively. We measure the quality
of ranking on 22086 genes of the PPI network. These genes do not include the 106
ASD genes used as seeds. We measure AUC for different values of the parameter $r$:
$r \in \{0, 0.25, 0.5, 0.75, 0.9\}$ (Figure 2-9).
Figure 2-10 depicts the lift chart for this method, which shows how much more likely we
are to identify ASD genes than if we make random guesses. By considering only the top 2%
of genes in the ranklist found by our method, we are able to identify 2.37 times as many
known ASD genes (excluding seeds) as if we selected randomly. This gain indicates quite an
improvement considering the unbalanced nature of our dataset, with ASD genes accounting
for roughly 2% of the entire dataset. Here, lift is calculated using Equation 2.5. Inclusion
of seed genes boosts the gain up to 10.5-fold.
[Figure 2-9 plot: ROC curves; horizontal axis: False Positive Rate (FPR). Legible legend entries: r = 0.00: AUC = 0.4474; r = 0.25: AUC = 0.5611; r = 0.90: AUC = 0.5525; Baseline: AUC = 0.5000.]
Figure 2-9: Receiver operating characteristic curves for network crosstalk based prioritizer
using different restart probabilities (r).
[Figure 2-10 plot: lift curves for Excluding Seeds, Including Seeds, and Baseline; horizontal axis: % Genes in the Ranklist.]
Figure 2-10: Lift chart for network crosstalk based prioritizer.
Chapter 3
Integrative Approach for Identifying
ASD Risk Genes
3.1
Background
To recapitulate, we are interested in quantifying the association of a gene
with ASD and ranking the genes by the strength of that association. Each of the methods
discussed so far focuses on a single aspect of functional similarity of genes,
based on either sequence, phenotype, or topological similarity. However, as discussed
in Section 1.2.1, there is plenty of evidence in the literature that an integrative approach
incorporating multiple aspects of functional similarity of genes simultaneously can perform
better in predicting and prioritizing disease genes than methods focusing on
a single aspect. Motivated by this, we propose a logistic regression based integrative
approach for solving this problem in the context of ASDs. We use lasso-penalized logistic
regression [31, 110] to develop a predictor that estimates the probability of a gene being
associated with ASDs. To avoid over-fitting the model, we used the adaptive lasso procedure,
which simultaneously identifies influential variables and provides the model parameters. Our
choice of variables includes the ASD association scores computed by the methods described in
Chapter 2 as well as information on ASD-pathway membership of genes.
3.1.1
Lasso-penalized Logistic Regression
Logistic regression measures the relationship between a categorical dependent variable and
one or more independent variables, which are usually (but not necessarily) continuous, by
using probability scores as the predicted values of the dependent variable. It is a simple
but widely used approach for integrating predictors from multiple sources. It is often used
with lasso regularization. Lasso is a shrinkage estimator, often used to identify important
predictors, select among redundant parameters, and produce shrinkage estimates. Lasso
estimates have potentially lower predictive errors than an ordinary maximum likelihood
estimator. Thus, lasso is a useful alternative to stepwise regression and other dimensionality
reduction techniques.
3.2
Predicting ASD Association via Logistic Regression based
Integrative Approach
3.2.1
Preparing Data for Training and Validation
Our landscape of genes consists of all 22192 genes of the connected PPI network. Each
gene in the set is labeled as an ASD gene if it belongs to the gold standard ASD gene set (Appendix A), or as non-ASD otherwise. We establish a training set of 4292 genes. It includes the
106 high confidence ASD genes from the ASD gold standard gene set, i.e., the genes that appear
in eight or more ASD-related studies. These 106 genes make up roughly 19.34% of the total
ASD genes in the dataset. To retain this proportion for non-ASD genes as well, we randomly select 4186 of the 21644 non-ASD genes in the connected PPI network for the training
set. The rest of the ASD and non-ASD genes are set aside for validating the performance
of the logistic regression based predictor. Thus, the validation set consists of 17900 genes,
of which 442 are ASD genes and the rest are non-ASD genes. Note that our dataset under
consideration is highly unbalanced, and we are interested in accurately predicting the
ASD genes rather than the non-ASD genes.
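The set sizes quoted above follow from simple arithmetic, which the short Python check below reproduces. It assumes that the seed fraction is rounded to 19.34% before being applied to the non-ASD genes; the variable names are hypothetical.

```python
TOTAL_GENES = 22192        # genes in the connected PPI network
GOLD_STANDARD_ASD = 548    # gold standard ASD genes
SEED_ASD = 106             # ASD genes appearing in eight or more studies

non_asd_total = TOTAL_GENES - GOLD_STANDARD_ASD            # non-ASD genes
seed_fraction = round(SEED_ASD / GOLD_STANDARD_ASD, 4)     # share of ASD genes used for training
train_non_asd = round(seed_fraction * non_asd_total)       # non-ASD genes sampled for training
train_size = SEED_ASD + train_non_asd
validation_size = TOTAL_GENES - train_size
validation_asd = GOLD_STANDARD_ASD - SEED_ASD
validation_non_asd = validation_size - validation_asd
```

The validation set therefore keeps roughly the same ASD to non-ASD ratio as the full network, at about 2.5% ASD genes.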
3.2.2
Constructing Lasso-regularized Binomial Regression Model
We formulate the logistic regression based predictor as follows. Let $V = \{v_1, v_2, v_3, \ldots, v_n\}$
be the set of genes. Let the dependent variable $\mu = \{\mu_1, \mu_2, \mu_3, \ldots, \mu_n\}$ be the vector of
predictions, where $\mu_i$ denotes the probability that gene $v_i$ is associated with ASDs. $\mu_i$ can
be any real value between 0 and 1 inclusive. We construct the set of independent variables, X = {CNVIE, AutSim, DSAP, NetCrTk, NeuronPath, SkeletalPath, SynapsePath,
CaPath}, with eight predictors, where CNVIE refers to CNV information entropy based
scores; AutSim, to autism similarity based scores; DSAP, to diffusion state ASD proximity scores;
NetCrTk, to adjusted network crosstalk scores; and NeuronPath, SkeletalPath, SynapsePath,
and CaPath refer to the membership information of genes in the neuron development pathway,
skeletal development pathway, synapse pathway, and Calcium (Ca) signaling pathway, respectively. These four pathways have been associated with ASDs in recent studies [21,91].
The gene membership information was extracted using the corresponding pathway gene sets
from Molecular Signatures Database (MSigDB) version 4.0 [105]. Here, X is an n x 8 matrix
where each row holds the values of the eight predictors for the corresponding gene.
We fit a lasso regularized weighted binomial regression model with the aforementioned
dependent and independent variables on the training data using 100 penalty terms, Lambda,
and 10-fold cross validation. Cross validation is used to correct for potential over-fitting bias.
The weights are given by the number of ASD association studies related to each gene. If a
gene does not have any association study associated with it, it is given a very small weight
of 1e-06. For each non-negative value $\lambda$ in Lambda, lasso estimates the predictor coefficients
and a constant term by minimizing the deviance of the model fit to the responses (a measure of
lack of fit derived from the negative log-likelihood) plus a penalty of $\lambda$ times the $L_1$ norm of the
predictor coefficients. We use the lassoglm function from Matlab
(version 2012a) to fit the lasso regularized binomial regression model.
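To illustrate what a single lasso fit at one penalty value involves, the following Python sketch minimizes a weighted logistic loss plus an $L_1$ penalty by proximal gradient descent. It is not the algorithm used by the Matlab function and it omits the search over 100 penalty values and the cross validation; the function names, step size, iteration count, and toy data are all assumptions.

```python
import math


def soft_threshold(x, t):
    """Proximal operator of the L1 penalty: shrink x towards zero by t."""
    return math.copysign(max(abs(x) - t, 0.0), x)


def fit_lasso_logistic(X, y, weights, lam, step=0.1, n_iter=500):
    """Weighted logistic regression with an L1 penalty on the coefficients.

    Minimizes the weighted mean negative log-likelihood plus
    lam * sum(|beta_j|); the constant term beta0 is not penalized.
    """
    p = len(X[0])
    beta0, beta = 0.0, [0.0] * p
    total_w = sum(weights)
    for _ in range(n_iter):
        grad0, grad = 0.0, [0.0] * p
        for xi, yi, wi in zip(X, y, weights):
            eta = beta0 + sum(b * x for b, x in zip(beta, xi))
            mu = 1.0 / (1.0 + math.exp(-eta))
            err = wi * (mu - yi) / total_w
            grad0 += err
            for j in range(p):
                grad[j] += err * xi[j]
        beta0 -= step * grad0
        # Gradient step followed by shrinkage of every coefficient.
        beta = [soft_threshold(b - step * g, step * lam)
                for b, g in zip(beta, grad)]
    return beta0, beta


# Toy data: the first feature separates the classes, the second is constant.
toy_X = [[0.0, 0.5], [0.1, 0.5], [0.9, 0.5], [1.0, 0.5]]
toy_y = [0, 0, 1, 1]
toy_w = [1.0, 1.0, 1.0, 1.0]
_, sparse_beta = fit_lasso_logistic(toy_X, toy_y, toy_w, lam=10.0)
_, dense_beta = fit_lasso_logistic(toy_X, toy_y, toy_w, lam=0.01)
```

With a large penalty every coefficient is shrunk to exactly zero, while a small penalty lets the informative predictor enter the model; this is the variable selection behaviour that the lasso path explores across its penalty values.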
3.2.3
Selecting Model Coefficients
We select the constant term as well as the set of predictor coefficients such that the deviance
of the model remains within one standard error of the minimum deviance found by lasso.
The selected model coefficients are shown in Table 3.1 in order of predictive value. According
to the fitted lasso penalized logistic regression model, all the predictors are informative to
some extent.
Variable | Coefficient
--------------------------
AutSim | 65.4212
DSAP | 1.7637
SkeletalPath | 1.4463
NeuronPath | 1.2254
CaPath | 1.1516
SynapsePath | 1.1487
CNVIE | 0.3529
NetCrTk | 0.2223
Table 3.1: Selected regression coefficients for the integrative approach from logistic regression
in order of predictive value.
3.2.4
Creating Regularized Model and Making Predictions
Let the constant term be denoted by $\beta_0$ and the vector of lasso regularized predictor coefficients be denoted by
$\beta = \{\beta_1, \beta_2, \beta_3, \ldots, \beta_8\}^T$. The resulting regularized model is given by
Equation 3.1.
$$\mathrm{logit}(\mu) = \log\left(\frac{\mu}{1 - \mu}\right) = X\beta + \beta_0 \qquad (3.1)$$
Thus, the predictions are given by Equation 3.2.
$$\mu = \frac{e^{(X\beta + \beta_0)}}{1 + e^{(X\beta + \beta_0)}} \qquad (3.2)$$
We evaluate the model predictions on the training and validation data using this equation.
The Matlab function glmval is used for that purpose.
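For readers without Matlab, the prediction step of Equation 3.2 can be sketched in Python as follows. The coefficients are those listed in Table 3.1; the constant term is not listed there, so `intercept` is a hypothetical parameter that the caller must supply, and the function name and the example feature values are likewise assumptions.

```python
import math

# Predictor coefficients from Table 3.1.
COEFFICIENTS = {
    "AutSim": 65.4212,
    "DSAP": 1.7637,
    "SkeletalPath": 1.4463,
    "NeuronPath": 1.2254,
    "CaPath": 1.1516,
    "SynapsePath": 1.1487,
    "CNVIE": 0.3529,
    "NetCrTk": 0.2223,
}


def predict_probability(features, intercept):
    """Inverse logit of the linear predictor (Equation 3.2).

    features: dict of predictor name -> value; missing predictors count as 0.
    intercept: the constant term (hypothetical here, not given in Table 3.1).
    """
    eta = intercept + sum(COEFFICIENTS[name] * features.get(name, 0.0)
                          for name in COEFFICIENTS)
    # Evaluate the logistic function in a numerically stable way.
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)


# Illustrative values only: a gene in the calcium signaling pathway.
example = predict_probability({"CaPath": 1.0, "DSAP": 0.3}, intercept=-5.0)
```

Since all listed coefficients are positive, increasing any predictor while holding the others fixed can only raise the predicted probability.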
3.3
Performance Analysis
First we assess the accuracy of the model on the training data by measuring the area under
the ROC curve as well as the area under the precision-recall curve. The TPR, or recall,
and FPR are calculated using Equations 2.3 and 2.4 respectively. Precision is given by
Equation 3.3.
$$\text{Precision} = \frac{\text{Number of ASD genes correctly identified by the method}}{\text{Number of genes identified as ASD genes by the method}} \qquad (3.3)$$
Figure 3-1 shows the precision-recall curve and ROC curve for the model on training data.
We achieve an area of 99.54% under the ROC curve and an area of 78.63% under the
precision-recall curve, which indicate the high quality of the fit on training data. To assess the
overall accuracy of the model, we measure the AUC for the ROC curve using the validation data
set. It achieves an AUC of 65.34% (Figure 3-2). To compare its performance, we also fit two
other lasso regularized logistic regression models: one integrating only the ASD association
scores from the four methods described in Chapter 2, and the other integrating only the
gene membership information in the four pathways and the weights from the literature. The
standardized regression coefficients are listed in Tables 3.2 and 3.3.
Figure 3-1: Performance curves for integrative approach on training data.
A. Precision-recall curve with an AUC of 0.7863; B. ROC curve with an AUC of 0.9954.
Variable | Coefficient
--------------------------
AutSim | 18.0409
DSAP | 3.1580
CNVIE | 0.2005
NetCrTk | 0.1870
Table 3.2: Selected logistic regression coefficients for integrating different ASD association
scores in order of predictive value.
Variable | Coefficient
--------------------------
CaPath | 2.6001
SkeletalPath | 1.7088
NeuronPath | 0.9576
SynapsePath | 0.6344
Table 3.3: Selected logistic regression coefficients for integrating ASD-pathway membership
information with weights in order of predictive value.
We compute the area under the ROC curve (AUC) for each of these models as well as for the
methods from Chapter 2 using the validation dataset. As we can see in Figure 3-2, our
integrative approach, which uses both the ASD association scores from different methods
and the ASD-pathway membership information with weights, gives the best performance
among all of them.
[Figure 3-2 plot: ROC curves; horizontal axis: False Positive Rate (FPR). Legend: IntApp: AUC = 0.6534; IntMthd: AUC = 0.6173; IntPath: AUC = 0.5309; CNVIE: AUC = 0.5954; AutSim: AUC = 0.5596; DSAP: AUC = 0.5416; NetCrTk: AUC = 0.5625; Baseline: AUC = 0.5000.]
Figure 3-2: Receiver operating characteristic curves for different ASD gene
prediction-prioritization methods. IntApp: Integrative approach incorporating different ASD association scores as well as ASD pathway membership information from the
literature; IntMthd: Integrative approach incorporating different ASD association scores
only; IntPath: Integrative approach incorporating only ASD pathway membership information from the literature; CNVIE: CNV Information Entropy based Prioritizer; AutSim:
Autism Similarity based Prioritizer; DSAP: Diffusion State ASD Proximity based Prioritizer;
NetCrTk: Network Crosstalk based Prioritizer.
Figure 3-3 depicts the lift chart for this method, which shows how much more likely
we are to identify ASD genes than if we make random guesses. By considering only the top
2% of genes in the ranklist found by our method, we are able to identify 3 times as many
known ASD genes (excluding seeds) as if we selected randomly. This gain indicates quite an
improvement considering the imbalanced nature of our dataset, with ASD genes accounting
for roughly 2% of the entire dataset. Here, lift is calculated using Equation 2.5. Inclusion
of seed genes boosts the gain up to 11.2-fold. Thus, we construct our risk gene set for ASDs
using the genes from the top 2% of our ranklist, which yields 443 genes. Note that among
these genes, 123 are known ASD genes (102 seeds and 21 non-seeds). Among the 21 non-seed
ASD genes identified by our integrative approach, DLGAP3, APC, GPC6, and NTRK1
appear in seven ASD association studies; AR, ATRX, and RPS6KA3 appear in six studies;
SCN8A and TBR1 appear in five studies; GNAS and EGR2 appear in four studies; KCND2
and BIN1 appear in three studies; SETD2, TYR, and EPHB2 appear in two studies; and
TBX1, PTPN11, DUSP22, BRCA2, and KIT appear in one study. Considering the high
gain of known ASD genes in the candidate set, we hypothesize that the other genes in it
have a strong possibility of being associated with ASDs and are worth investigating. The
complete list of risk genes along with the probabilities of their association to ASDs is given
in Appendix B.
[Figure 3-3 plot: lift curves; horizontal axis: % Genes in the Ranklist.]
Figure 3-3: Lift chart of integrative approach for ASD gene prediction-prioritization.
Chapter 4
ASD Genetics: Implications from
Candidate ASD Risk Genes
4.1
Gene Sets for Analysis
We downloaded prior knowledge-based gene sets consisting of gene symbols in Gene Matrix
Transposed (GMT) format from MSigDB version 4.0 [105]. Of the available pathway gene
sets, we collected 1320 expert curated ones, which include gene sets from Kyoto Encyclopedia
of Genes and Genomes (KEGG) pathways (http://www.genome.jp/kegg, [47,48]), Reactome pathways (http://www.reactome.org/ [27, 72]), BioCarta pathways (http://www.biocarta.com), Pathway Interaction Database (PID) pathways [99], SigmaAldrich gene
sets (http://www.sigmaaldrich.com/life-science.html), Signaling Gateway gene sets
(http://www.signaling-gateway.org), Signal Transduction KE gene sets (http://stke.sciencemag.org), and SuperArray gene sets (http://www.superarray.com). From the
collection, we filtered out the disease and drug related gene sets. We also excluded very
large (> 300 genes) and very small (< 10 genes) gene sets. Thus we were left with 1221
gene sets as of June 2014.
From MSigDB version 4.0, we also collected Gene Ontology (GO) gene sets, which
are derived from the controlled vocabulary of the GO project: "The Gene Ontology Consortium" [8].
These gene sets are based on GO terms and their associations to human
genes. We collected gene sets belonging to two categories: C5:BP (biological process) and
C5:MF (molecular function). After filtering out very small and very large gene sets, we were
left with 763 and 382 gene sets respectively under these categories.
From the collection, we excluded the KEGG calcium signaling pathway, neuron development, skeletal development, and synapse gene sets since they have been used as prior
knowledge in our integrative approach. We also filtered out the neurite development and axon
guidance gene sets as they are subsets of the neuron development pathway gene set.
4.2
Hypergeometric Test for Enrichment
We use a hypergeometric test to determine the statistical significance of overlap of a pathway
or GO gene set with our candidate gene set. The hypergeometric test uses the hypergeometric distribution to calculate the probability of more than $k$ ASD risk genes (out of a set of $n$
ASD risk genes in a dataset of $N$ total genes) appearing in a specific pathway or GO gene
set of size $K$. The probability mass function of the hypergeometric distribution is given by
the following expression.
$$P(X = k) = \frac{\binom{K}{k}\binom{N - K}{n - k}}{\binom{N}{n}}$$
This test helps identify whether the ASD risk gene set is overrepresented in a certain pathway
or GO gene set, and provides us with a p-value. We use the phyper function from R (version
2.15.1) for computing the hypergeometric p-values for the pathways and GO gene sets in
our gene set collection.
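The probability mass function and an upper-tail p-value can be sketched in Python with exact integer binomial coefficients. The sketch assumes the common convention of reporting the probability of observing at least $k$ overlapping genes; whether the tail includes $k$ itself depends on how the R function is called, so this is an illustration rather than a reproduction of our computation. The function names and the toy numbers are hypothetical.

```python
from math import comb


def hypergeom_pmf(k, N, K, n):
    """P(X = k): k of the n risk genes fall in a gene set of size K out of N genes."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def enrichment_p_value(k, N, K, n):
    """Upper tail P(X >= k) (assumed convention for the enrichment p-value)."""
    upper = min(K, n)
    return sum(hypergeom_pmf(i, N, K, n) for i in range(k, upper + 1))


# Toy example: 10 genes in total, a gene set of 4, a risk set of 3, overlap of 2.
toy_p = enrichment_p_value(2, N=10, K=4, n=3)
```

A small p-value means that an overlap at least as large as the observed one would be unlikely if the risk genes were drawn at random from all genes.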
4.3
Pathway Enrichment Analysis
Performing hypergeometric enrichment analysis on pathway gene sets, we found a total of
32 canonical pathways in which ASD risk genes are overrepresented after Bonferroni correction (Table 4.1). Of note, there was an increased frequency of affected pathways associated
with signal transduction pertinent to brain, cellular assembly and communication, synaptic development, and neuronal development. Most of the affected signaling pathways (e.g.,
MAPK signaling, FGF signaling, SHP2 signaling, etc.) are found to be highly involved in
processes such as cell growth and death, specifically neuron apoptosis, neurite outgrowth,
inter neurite cell adhesion, etc. They also affect cell proliferation, differentiation, and migration processes. We found a number of affected pathways involving L1 family cell adhesion
molecules (L1CAM), which play important roles in neuronal migration and synaptic formation [6]. Ankyrins (SHANK proteins) bind to L1 proteins and couple them to ion channel
proteins, and thus mediate branching and synaptogenesis of cortical inhibitory neurons.
Pathways involving neural cell adhesion molecules (NCAM) play important roles in formation and maintenance of the nervous system. Incorrect synapse development, incorrect neuron
development, and erosion of synaptic function are widely considered to be key contributors
to ASDs.
Our results also show that pathways involved in immune response, protein catabolism
or modification, tissue and organ morphogenesis, muscle cell differentiation, inflammatory
response, etc. are associated with ASDs, and may be involved in immune related disorders,
developmental regression, metabolic abnormalities, morphological impairments, and sleep
disturbances.
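The Bonferroni correction mentioned above can be sketched in a few lines of Python. The sketch assumes that each raw p-value is multiplied by the number of gene sets tested and capped at 1, and that a cutoff of 0.05 is applied; neither the number of tests nor the cutoff is stated here, so both are illustrative, as are the function name and the toy p-values.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Toy example with three tested gene sets and an assumed cutoff of 0.05.
adjusted = bonferroni([0.01, 0.5, 0.001])
significant = [p <= 0.05 for p in adjusted]
```

Because the adjustment scales with the number of gene sets tested, a pathway needs a very small raw p-value to remain significant when more than a thousand gene sets are examined.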
Table 4.1: Canonical pathways having significant overlap with ASD risk genes.
Name | Category | Functions | # Genes | Genes | Adjusted p-value
--------------------------------------------------------------------------------
KEGG MAPK SIGNALING PATHWAY | Signal Transduction | cell proliferation, differentiation, migration | 21 | MEF2C, FLNA, BDNF, NF1, FLNB, FGFR3, FGFR1, FGF14, IKBKG, RPS6KA3, MAPT, CACNA1H, CACNA1G, CACNA1A, CACNA1C, CACNA1S, SOS1, FGFR2, NTF3, TP53, NTRK1 | 1.00E-04
PID FGF PATHWAY | Signal Transduction | cell death, neuron apoptosis, proteosomal ubiquitin dependent protein catabolism | 10 | FGFR1, PLCG1, FGFR2, FGFR3, RUNX2, PTPN11, MET, FRS2, HGF, SOS1 | 0.000108898
PID SHP2 PATHWAY | Signal Transduction | activation of IL10, IL2, FGF, ERBB1 signaling cascades | 10 | SOS1, NOS3, FRS2, IL2RG, NTRK3, NTRK1, BDNF, PTPN11, IL6, NTF3 | 0.000184167
REACTOME INTERACTION BETWEEN L1 AND ANKYRINS | Cell Adhesion | nervous system development, branching and synaptogenesis of cortical neurons | 7 | ANK3, L1CAM, SCN2A, SCN4A, SCN5A, SCN8A, SPTAN1 | 0.000204988
REACTOME NCAM SIGNALING FOR NEURITE OUT GROWTH | Cell Adhesion | neural cell adhesion, formation and maintenance of nervous system | 10 | COL1A1, COL2A1, COL4A3, FGFR1, PRNP, SOS1, CACNA1S, SPTAN1, CACNA1H, CACNA1G | 0.000480482
PID BETA CATENIN NUC PATHWAY | Signal Transduction | cell proliferation, immune response | 11 | AR, KRT1, TCF4, SALL4, PITX2, MITF, MED12, NCOA2, AES, CACNA1G, APC | 0.000492712
(continued on next page)
Table 4.1
Continued.
-
Name
Category
Functions
# Genes
Genes
Cell Adhesion
cell
17
HGF,
Adjusted
p-value
KEGG
FOCAL
communication,
motility,
ADHESION
tion,
prolifera-
differentiation,
survival
FLT4,
FLNA,
RELN,
FLNB,
COLilA1,
VWF,
0.00059351
MET,
COL11A2,
PTEN,
ITGA2B,
CAV3,
COL2A1,
COLiAl,
SOS1,
BCL2,
LAMA3
PID P53 DOWNSTREAM
Signal Transduction
cell growth and death
PATH-
FDXR,
TP63,
WAY
TSC2,
BCL2,
TP53,
0.000603742
RFWD2,
MET, PTEN, HGF, APC,
RB1,
VDR,
NDRG1,
DDB2
KEGG ECM RE-
Signal Transduction
tissue
and
organ
CEPTOR INTER-
morphogenesis,
ACTION
adhesion,
tion,
cell
prolifera-
differentiation
GP1BA,
VWF,
HSPG2,
0.000813133
ITGA2B, COL2A1, RELN,
COLlA1,
LAMA3, SDC3,
COL11Al, COL11A2
and morphogenesis
REACTOME
L1CAM
Cell Adhesion
INTER-
ACTIONS
neurite
outgrowth,
neurite
fascination,
DNM2,
FGFR1,
ANK3,
ITGA2B,
0.001033363
L1CAM,
inter neuronal adhe-
RPS6KA3,
sion
SCN4A,
SCN2A,
SCN5A, SCN8A,
SPTAN1
REACTOME
INSULIN
Signal Transduction
RE-
CEPTOR
SIG-
NALLING
CAS-
activation of MAPK,
FRS2,
Ras/RAF cascades
FGFR3,
EIF4E,
FGFR1,
FGFR2,
INSR,
SOS1,
STK11,
PRKAG2,
0.001161965
TSC1, TSC2
CADE
REACTOME
Signal Transduction
P13K CASCADE
insulin receptor sub-
FRS2,
strate mediated
FGFR3,
sig-
naling
EIF4E,
FGFR1,
FGFR2,
PRKAG2,
0.001293256
INSR,
STK11, TSC1,
TSC2
Signal Transduction
REACTOME
SIGNALING
insulin binding
FRS2,
BY
EIF4E,
FGFR1,
FGFR3,
0.001561111
FGFR2,
INSULIN RECEP-
ATP6VOA2,
TOR
PRKAG2,
INSR,
SOS1,
STK11,
TSC1, TSC2
PID
SYNDECAN
Signal Transduction
1 PATHWAY
tumor
necrosis
fac-
COL11A2, MET, COL7A1,
tor mediated signal-
COLIlAl,
ing,
COL4A3, HGF, COLlA1
protein ubiqui-
0.002906631
COL2A1,
tation, degradation
PID
SMAD
muscle cell differenti-
NCOA2, FOXO4, FOXH1,
NUCLEAR PATH-
2,3
Signal Transduction
ation, endothelial cell
MEF2C,
AR,
ESRI,
WAY
migration,
RUNX2,
DLX1,
FOXG1,
negative
regulation
PID TCR PATH-
Protein Catabolism,
protein
WAY
Signal Transduction
process,
PID
INTEGRIN1
Cell Adhesion
PATHWAY
0.004928594
VDR
catabolytic
IKBKG,
activation
PLCG1,
of calcium signaling,
PTPN11,
NFKB signaling
STIMI
integrin
family
surface
interactions
cell
for adhesion
WAS,
FLNA,
0.005740308
PTPRC,
PTEN,
COL11Al,
SOS1,
TGFBI,
COL7A1,
COL2A1,
COLlA1,
COL4A3,
0.005740308
LAMA3, FBN1, COL11A2
SIG
PIP3
NALING
CARDIAC
SIGIN
Signal Transduction
cell growth and sur-
RPS6KA3, MET, ERBB4,
vival
SOS1,
MY-
PTPN1,
0.00651532
TSC1,
INPPL1, TSC2, PTEN
OCTES
(continued on next page)
Table 4.1
Continued.
-
Name
Category
Functions
KEGG
NEU-
#
Nervous System
Genes
Genes
Adjusted
I p-value
I_
I
I
differentiation
and
12
FRS2,
BDNF,
BCL2,
ROTROPHIN
survival of neuronal
PLCG1,
SIGNALING
cells,
RPS6KA3, SOSI, PSEN1,
PATHWAY
memory
learning
and
0.007891206
PTPN11,
NTF3,
TP53,
NTRK1,
NTRK3
REACTOME
NCAMI
Cell Adhesion
neuronal
-cell
hesion,
INTER-
ACTIONS
ad-
COLIA1,
COL2A1,
cellular
COL4A3,
PRNP,
differ-
CACNAlS,
survival,
CACNA1G
migration,
entiation,
0.009759881
CACNA1H,
synaptic plasticity
ST
MYOCYTE
Signal Transduction
AD PATHWAY
formation of interface
APC,
between nervous sys-
PITX2, CAV3, RYR1
GNAQ,
EPHB2,
0.011606902
tem and cardiovascular system
BIOCARTA
GH
Signal Transduction
PATHWAY
growth
factor
diated
dwarfism,
tion
me-
signaling,
of
HNF1A,
GHR,
GH1,
0.014524747
MEF2A,
0.018499536
PLCG1, INSR, SOS1
activaJAK-STAT,
MAPK cascades
REACTOME NGF
Signal Transduction
neuronal
differentia-
SIGNALLING
tion in
VIA TRKA FROM
neurotrophins
THE
response
to
PTEN,
MEMBRANE
TEGRIN
IN-
cell adhesion to ECM,
RAPGEF4,
Cell Adhesion
protein
COL2A1, COL4A3, FBN1,
IN-
P73
catabolitic
process
ITGA2B,
PATH-
Transcription,
Pro-
tein Catabolism
apoptosis,
somal
proteo-
ubiquitation
dependent
DIFFER-
Signal Transduction
ENTIATION
PATHWAY
COLlA1,
PTPN1,
0.025394674
SOS1,
VWF
WAY
ST
SOSI,
Protein Catabolism,
TERACTIONS
PID
PRKAR1A,
RPS6KA3,
TSC2
CELL
SURFACE
DNM2,
MEF2C, FOXO4, NTRK1,
PLCGI,
PLASMA
REACTOME
FRS2,
protein
WT1,
GATAI,
BRCA2,
BIN1,
TP53AIP1, RB1, NTRK1,
catabolitic process
TP63
PC12 cell differentia-
PTPN11,
tion
RPS6KA3, GNAQ, FRS2,
IN
0.025394674
GNB2L1,
NTRK1,
0.025972966
EGR2, OPN1LW
PC12 CELLS
PID MET PATH-
Signal Transduction
WAY
growth factor medi-
PLCG1, PTPN1, PTPN11,
ated signaling
HGF,
EIF4E,
0.028115505
INPPL1,
APC, SOSI, MET
PID TRKR PATH-
Signal Transduction
WAY
growth factor medi-
NTRK3,
ated signaling
PLCGI,
SOS1,
FRS2,
0.028499715
NTRK1,
PTPN11, BDNF, NTF3
REACTOME
Signal Transduction
cell
proliferation,
FRS2,
FGFRI,
FGFR3,
DOWNSTREAM
differentiation,
mi-
FGFR2, FOXO4, PLCG1,
SIGNALING
gration, survival and
PRKARIA, PTEN, SOSI,
cell shape
TSC2
calcium signaling
TP63,
VDR,
DLX6,
BRCA2,
OF
ACTIVATED
0.028989079
FGFR
PID DELTA NP63
Signal Transduction
PATHWAY
GNB2LI,
0.03478047
KRT14,
DLX5
PID
NCAD-
HERIN
PATH-
WAY
(TOLL
PATHWAY)
Signal Transduction
inflammatory
sponse,
re-
interferon
production,
cell
proliferation
and
PTPN1,
PLCG1,
LRP5,
0.039242846
FGFR1, GJA1, PTPN11
migration
(continued on next page)
Table 4.1
-
Continued.
Name
Category
Functions
# Genes
Genes
REACTOME
Nervous System
cell
17
CHRNA1,
CHRNA7,
GABRB3,
GRIK2,
GRIN2A,
GRIN2B,
KCND3,
Adjusted
p-value
communication,
NEURONAL
neuronal
SYSTEM
ment
develop-
KCND2,
KCNQ1,
RPS6KA3,
0.046181684
MAOA,
SLC1A1,
STXBP1, ABCC8,
SYN1,
CACNA1A, PICKI
1. We excluded neuron development, neurite development, axon guidance, synapse development, and calcium signaling pathways
as they were used as input knowledge in our integrative approach.
4.3.1
An Interesting Connection with Inflammatory Bowel Disease (IBD)
The fact that ASD patients often suffer from chronic inflammation of the gastrointestinal tract
motivated us to look for possible shared pathogenesis between ASD and inflammatory bowel
disease (IBD). We first identified a number of pathways related to IBD from an extensive literature review. These IBD pathways are often related to innate and adaptive immunity
(T-cell signaling, chemokine signaling, NOD2 signaling, NF-κB signaling, IL23/Th17 signaling, etc. [61]), autophagy (IL2 signaling, IL2RB signaling, IL10 signaling, IL6 signaling,
TGF-β signaling, etc. [61,79]), necrosis (TNF signaling, TNFR1/2 signaling, etc.), and apoptosis (cytokine signaling [79,88]). A number of signaling pathways, such as ERK-MAPK
signaling [18,118], WNT signaling [42], Notch signaling [70], Adipocytokine signaling [50],
Integrin signaling, Hedgehog signaling [60], BMP signaling, Hippo signaling, and JAK-STAT signaling [10,32,102], have also been mentioned in relation to IBD. We collected the corresponding
pathway gene sets from MSigDB. When we looked at the overlap of these pathways with
our candidate gene set, we found that ASD risk genes are overrepresented in most of these
pathways. Table 4.2 lists the IBD-related pathways that have significant overlaps (p-value
< 0.05) with ASD risk genes. This suggests that IBD and ASDs share some aspects of their
pathogenesis, which is worth further investigation.
Name | # Genes | Overlapped Genes | p-value
KEGG MAPK SIGNALING PATHWAY | 21 | MEF2C, FLNA, FLNB, BDNF, NF1, FGFR2, FGFR3, FGFR1, FGF14, RPS6KA3, IKBKG, MAPT, CACNA1H, CACNA1G, CACNA1A, CACNA1C, CACNA1S, SOS1, NTF3, TP53, NTRK1 | 1.09E-07
BIOCARTA ERK5 PATHWAY | 4 | MEF2C, MEF2A, PLCG1, NTRK1 | 0.000383858
BIOCARTA AKT PATHWAY | 4 | IKBKG, FOXO4, GHR, GH1 | 0.000861447
REACTOME CYTOKINE SIGNALING IN IMMUNE SYSTEM | 14 | EIF4E, FLNB, GH1, GHR, HGF, IL2RG, IL6, INPPL1, IRF6, PLCG1, PTPN1, SOS1, TRIM25, IKBKG | 0.001132398
REACTOME SIGNALLING TO ERKS | 4 | FRS2, NTRK1, PLCG1, SOS1 | 0.005568586
REACTOME PROLONGED ERK ACTIVATION EVENTS | 3 | FRS2, NTRK1, PLCG1 | 0.006035527
REACTOME PI3K AKT ACTIVATION | 4 | FOXO4, NTRK1, PTEN, TSC2 | 0.006763695
REACTOME ERK MAPK TARGETS | 3 | MEF2A, MEF2C, RPS6KA3 | 0.008043482
WNT SIGNALING | 6 | AES, APC, LRP5, PITX2, WNT2, HPRT1 | 0.008835118
BIOCARTA IL6 PATHWAY | 3 | PTPN11, IL6, SOS1 | 0.009177488
ST T CELL SIGNAL TRANSDUCTION | 4 | SOS1, PTPRC, PLCG1, EPHB2 | 0.012244256
PID TNF PATHWAY | 4 | SMPD1, GNB2L1, IKBKG, CYLD | 0.013204051
PID IL6 7 PATHWAY | 4 | PTPN11, IL6, MITF, SOS1 | 0.014210455
BIOCARTA TNFR1 PATHWAY | 3 | LMNA, SPTAN1, RB1 | 0.019653376
BIOCARTA TNFR1 PATHWAY | 3 | LMNA, SPTAN1, RB1 | 0.019653376
REACTOME NFKB ACTIVATION THROUGH FADD RIP1 PATHWAY MEDIATED BY CASPASE 8 AND 10 | 2 | TRIM25, IKBKG | 0.02298827
PID IL2 1 PATHWAY | 4 | SOS1, PTPN11, BCL2, IL2RG | 0.024014914
ST INTEGRIN SIGNALING PATHWAY | 5 | EPHB2, SOS1, PLCG1, WAS, PTEN | 0.024148774
KEGG HEDGEHOG SIGNALING PATHWAY | 4 | SHH, WNT2, BMP4, GLI3 | 0.025467375
ST ERK1 ERK2 MAPK PATHWAY | 3 | SOS1, RPS6KA3, EIF4E | 0.025536874
REACTOME APOPTOSIS | 7 | DSP, APC, LMNA, MAPT, BCL2, SPTAN1, TP53 | 0.029317822
BIOCARTA IL2RB PATHWAY | 3 | IL2RG, BCL2, SOS1 | 0.039815059
KEGG ADIPOCYTOKINE SIGNALING PATHWAY | 4 | PTPN11, PRKAG2, IKBKG, STK11 | 0.044909084
REACTOME IL 2 SIGNALING | 3 | IL2RG, INPPL1, SOS1 | 0.048181081
Table 4.2: IBD-related pathways having significant overlap with ASD risk genes.
4.4
Enrichment Analysis on GO gene sets
We performed hypergeometric enrichment analysis on the gene sets under the GO biological process and molecular function categories to find the biological processes and molecular
functions in which our ASD risk genes are overrepresented. Significant biological processes and
molecular functions were selected based on the hypergeometric p-values (< 0.05 after Bonferroni correction). As expected, the candidate gene set was overrepresented in a number of
developmental processes related to the nervous system and brain. However, it is interesting
to see that the risk gene set is also significantly involved in processes such as tissue, muscle,
epidermis, and ectoderm development, and organ morphogenesis. This finding may
support the observation that muscular dystrophy is a comorbid condition in many
ASD cases [55]. Our candidate gene set is also found to be involved in a number of molecular
functions related to ion channel activity, protein dimerization and binding, gated channel
activity, sodium and calcium channel activity, etc. Figures 4-1 and 4-2 show the significant
biological processes and molecular functions found by our analysis.
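The enrichment test described in this section can be sketched in a few lines. The following is a minimal illustration, not the code used in the thesis: the function names, the toy gene sets, and the universe size of 100 genes are all hypothetical, and the upper-tail hypergeometric probability is computed directly from binomial coefficients.

```python
from math import comb


def hypergeom_tail(k, population, successes, draws):
    """Return P(X >= k) for a hypergeometric random variable X."""
    upper = min(successes, draws)
    total = comb(population, draws)
    favorable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return favorable / total


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def enrich(candidates, gene_sets, universe_size, alpha=0.05):
    """Return (name, overlap size, adjusted p) for gene sets passing alpha."""
    names = sorted(gene_sets)
    raw = []
    for name in names:
        overlap = len(candidates & gene_sets[name])
        raw.append(hypergeom_tail(overlap, universe_size,
                                  len(gene_sets[name]), len(candidates)))
    adjusted = bonferroni(raw)
    return [
        (name, len(candidates & gene_sets[name]), p)
        for name, p in zip(names, adjusted)
        if p < alpha
    ]


# Toy example with made-up gene identifiers (assumption, not thesis data).
toy_candidates = {"A", "B", "C"}
toy_sets = {"SET1": {"A", "B", "C", "D"}, "SET2": {"X", "Y"}}
toy_hits = enrich(toy_candidates, toy_sets, universe_size=100)
```

In this sketch the universe size stands for the number of genes from which both the candidate set and the gene sets are drawn; the choice of that background is an assumption that strongly affects the resulting p-values.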
Figure 4-1: Significant GO biological processes associated with ASD risk gene set.
The primary vertical axis shows the negative log of hypergeometric p-values. The secondary
vertical axis shows the ratio of overlapped genes.
[Figure 4-2: GO molecular functions plotted against -log(p-value) and ratio of overlap. Functions shown: TRANSCRIPTION ACTIVATOR ACTIVITY; COPPER ION BINDING; STRUCTURAL CONSTITUENT OF CYTOSKELETON; PROTEIN COMPLEX BINDING; CALCIUM CHANNEL ACTIVITY; PROTEIN DOMAIN SPECIFIC BINDING; ADENYL NUCLEOTIDE BINDING; ADENYL RIBONUCLEOTIDE BINDING; VOLTAGE GATED CALCIUM CHANNEL ACTIVITY; ATP BINDING; SODIUM CHANNEL ACTIVITY; PROTEIN N TERMINUS BINDING; TRANSMEMBRANE RECEPTOR PROTEIN KINASE ACTIVITY; PROTEIN TYROSINE KINASE ACTIVITY; CATION TRANSMEMBRANE TRANSPORTER ACTIVITY; TRANSMEMBRANE RECEPTOR PROTEIN TYROSINE KINASE ACTIVITY; PROTEIN DIMERIZATION ACTIVITY; PROTEIN HOMODIMERIZATION ACTIVITY; VOLTAGE GATED SODIUM CHANNEL ACTIVITY; CATION CHANNEL ACTIVITY; VOLTAGE GATED CHANNEL ACTIVITY; SUBSTRATE SPECIFIC CHANNEL ACTIVITY; ION CHANNEL ACTIVITY; VOLTAGE GATED CATION CHANNEL ACTIVITY; ION TRANSMEMBRANE TRANSPORTER ACTIVITY; METAL ION TRANSMEMBRANE TRANSPORTER ACTIVITY; GATED CHANNEL ACTIVITY.]
4.5
Enrichment Analysis for Subnetworks
Analysis for subnetworks was performed using QIAGEN's Ingenuity® Pathway Analysis
(IPA®, QIAGEN Redwood City, http://www.qiagen.com/ingenuity). IPA assembled subnetworks based on gene-to-gene connectivity, assuming that the more connected a gene is,
the more influence it has and the more "important" it is. IPA selected a set of seed genes
from our ASD risk gene set. Seeds with the most connections were then connected to other
seeds to form a network. Non-seed genes as well as molecules from the IPA Knowledge Base were
added to the network to fill or join the areas lacking connectivity. For visualization purposes,
we limited each subnetwork to a maximum of 35 nodes. Subnetworks were annotated with
high-level functional categories, scored, and sorted in descending order of score.
IPA network analysis revealed 25 significant subnetworks in our supplied ASD risk gene
set. Figure 4-3 shows the top 4 subnetworks. The topmost subnetwork is characterized
by tissue morphology and gastrointestinal disease terms. Nervous system development
and function characterizes the second subnetwork. The third subnetwork is annotated by
developmental, hereditary, and neurological disorders. Organismal injury and abnormalities
as well as reproductive system disease characterize the fourth subnetwork. The complete
list of significant subnetworks is given in Appendix C. These findings suggest
that ASDs may comprise subclasses, each characterized by one class of
disorders, such as gastrointestinal disorders, developmental disorders, hereditary disorders,
neurological disorders, or organismal abnormalities, and they call for further investigation.
4.6
Functional Analysis for Overlap with Diseases and Biofunctions
We performed functional analysis on our ASD risk gene set using QIAGEN's Ingenuity®
Pathway Analysis (IPA®, QIAGEN Redwood City, http://www.qiagen.com/ingenuity).
To provide a molecular model that could explain the functionality of the provided gene set, IPA analyzed it for diseases and functions using high-quality
GO information, manually curated information on diseases and disorders, and information on normal processes in abnormal tissues available in the IPA Knowledge Base. The significance of the overlap
between risk genes and genes in diseases and functions was calculated using Fisher's exact
test.
Figure 4-3: Top four subnetworks in ASD risk gene set generated by QIAGEN's Ingenuity® Pathway
Analysis (IPA). Panels: A. Network 1; B. Network 2; C. Network 3; D. Network 4.
IPA functional analysis revealed that our ASD risk gene set is significantly overrepresented in a number of diseases under different disease categories, including developmental and
hereditary disorders, neurological disorders, connective tissue disorders, auditory disorders,
gastrointestinal disorders, psychological disorders, dermatological disorders, inflammatory
disorders, organismal abnormalities, cancers, etc. While the overlap of ASD with neurological, psychological, developmental, and hereditary disorders is expected, its connection with
gastrointestinal, auditory, and inflammatory disorders is less obvious and hence more interesting for further investigation. The top 10 diseases having significant overlap with our risk
gene set are shown in Table 4.3.
Diseases | Categories | p-Value | # Genes
Autosomal Dominant Disease | Hereditary Disorder | 5.19E-66 | 106
Multiple Congenital Anomalies | Developmental Disorder | 1.14E-61 | 93
Congenital Anomaly of Musculoskeletal System | Developmental Disorder, Skeletal and Muscular Disorders | 2.42E-54 | 108
Dysplasia | Developmental Disorder | 3.90E-40 | 59
Cognitive Impairment | Neurological Disease | 1.79E-38 | 61
Autosomal Recessive Disease | Hereditary Disorder | 7.52E-36 | 95
Mental Retardation | Developmental Disorder, Neurological Disease | 4.05E-33 | 47
Congenital Anomaly of Limb | Developmental Disorder, Skeletal and Muscular Disorders | 3.41E-32 | 43
Dysplasia of Skeleton | Connective Tissue Disorders, Developmental Disorder, Skeletal and Muscular Disorders | 2.53E-29 | 37
Hypoplasia | Developmental Disorder | 3.89E-29 | 68
Table 4.3: Top 10 diseases having significant overlap with ASD risk genes found by QIAGEN's Ingenuity® Pathway Analysis (IPA).
Functions | Categories | p-Value | # Genes
Organismal Death | Organismal Survival | 5.39E-61 | 212
Differentiation of Cells | Cellular Development | 2.69E-55 | 190
Morphology of Head | Organismal Development | 4.37E-54 | 122
Cell Death | Cell Death and Survival | 2.76E-51 | 238
Abnormal Morphology of Head | Organismal Development | 2.81E-51 | 116
Morphology of Cells | Cell Morphology | 3.08E-50 | 179
Apoptosis | Cell Death and Survival | 2.97E-49 | 206
Morphology of Nervous System | Nervous System Development and Function | 3.70E-48 | 112
Development of Body Axis | Embryonic Development, Organismal Development | 1.16E-44 | 115
Development of Head | Embryonic Development, Organismal Development | 3.62E-44 | 109
Abnormal Morphology of Nervous System | Nervous System Development and Function | 2.02E-43 | 102
Development of Body Trunk | Embryonic Development, Organismal Development | 4.16E-41 | 115
Proliferation of Cells | Cellular Growth and Proliferation | 1.31E-40 | 231
Quantity of Cells | Tissue Morphology | 2.40E-40 | 151
Development of Central Nervous System | Nervous System Development and Function | 6.36E-40 | 87
Development of Neurons | Cellular Development, Nervous System Development and Function, Tissue Development | 1.45E-39 | 91
Development of Brain | Embryonic Development, Nervous System Development and Function, Organ Development, Organismal Development, Tissue Development | 3.42E-39 | 76
Length of Animal | Organismal Development | 5.02E-38 | 104
Necrosis | Cell Death and Survival | 9.09E-38 | 186
Size of Body | Organismal Development | 2.12E-37 | 103
Abnormal Morphology of Cells | Cell Morphology | 3.14E-37 | 127
Microtubule Dynamics | Cellular Assembly and Organization, Cellular Function and Maintenance | 8.73E-37 | 112
Morphology of Central Nervous System | Nervous System Development and Function | 3.16E-35 | 77
Behavior | Behavior | 3.95E-35 | 103
Morphology of Brain | Nervous System Development and Function, Organ Morphology, Organismal Development | 1.26E-34 | 73
Organization of Cytoskeleton | Cellular Assembly and Organization, Cellular Function and Maintenance | 1.15E-33 | 117
Abnormal Morphology of Brain | Nervous System Development and Function, Organ Morphology, Organismal Development | 2.56E-33 | 70
Morphology of Bone | Connective Tissue Development and Function, Embryonic Development, Organ Development, Organ Morphology, Organismal Development, Skeletal and Muscular System Development and Function, Tissue Development | 4.06E-33 | 71
Cell Movement | Cellular Movement | 4.66E-33 | 155
Abnormal Morphology of Central Nervous System | Nervous System Development and Function | 6.86E-33 | 72
Table 4.4: Top 30 functions having significant overlap with ASD risk genes found by QIAGEN's Ingenuity® Pathway Analysis (IPA).
ASD risk genes are also overrepresented in a number of functional categories, including
nervous system development, cell death and survival, cellular development, embryonic development, organismal survival and development, cell and tissue morphology, behavior, etc.
The top 30 functions having significant overlap with our risk gene set are shown in Table 4.4.
Chapter 5
Conclusion
In this thesis, we have explored different computational approaches for addressing the classic
problem of disease gene prediction and prioritization in the context of autism spectrum
disorders (ASDs). We have introduced three novel computational methods, one ASD-specific
generalized PageRank method, and a novel method that integrates the four, for solving the
ASD gene prediction-prioritization problem.
Our first method calculates information entropy based scores for all the genes that can
be mapped to the copy number variations that have ever been observed in ASD population
as well as appropriate control groups by taking into account their frequency of occurrence
in ASD case-control groups. Ranking the genes in descending order of CNV-based scores
helps us achieve an area of 59.81% under the ROC curve, and 2.3-fold enrichment of ASD
genes in the top 2% of the ranklist.
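The two evaluation measures quoted throughout this chapter, the area under the ROC curve and the fold enrichment in the top fraction of the ranklist, can be illustrated with a small sketch. The function names and the toy scores below are hypothetical, and fold enrichment is assumed here to mean the rate of known ASD genes in the top fraction divided by their rate in the whole list; the thesis may define it slightly differently.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a positive outranks a negative.

    Ties count as one half. This pairwise form is quadratic in the
    list length, which is fine for a small illustration.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def fold_enrichment(scores, labels, top_fraction):
    """Positive rate in the top fraction divided by the overall positive rate."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    top_n = max(1, int(round(top_fraction * len(ranked))))
    top_rate = sum(y for _, y in ranked[:top_n]) / top_n
    base_rate = sum(labels) / len(labels)  # assumes at least one positive
    return top_rate / base_rate


# Toy ranklist: 10 genes, 2 of them known positives (made-up values).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
toy_auc = roc_auc(toy_scores, toy_labels)
toy_fold = fold_enrichment(toy_scores, toy_labels, top_fraction=0.2)
```

With heavily unbalanced labels, as in the ASD gene sets discussed here, the fold enrichment at a small top fraction is often more informative for prioritization than the AUC alone.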
Our second method incorporates disease/phenotype similarity scores computed by van
Driel et al. [112] and gene-phenotype relationships from the OMIM database. This method
is seeded by high confidence ASD genes from the literature to identify ASD like phenotypes
in OMIM. Genes involved in diseases with phenotypes similar to ASDs are scored highly by
this algorithm. This method achieves an area of 55.96% under the ROC curve excluding
the seed genes. We are able to achieve a 3.62-fold gain in ASD genes in the top 2% of the
ranklist.
In our third method, we introduce diffusion state ASD proximity (DSAP) for the proteins
based on the diffusion state distance (DSD) metric, which is superior to direct neighborhood
and shortest path distances in capturing the functional association of proteins in the PPI
network. Genes are ranked in descending order of their diffusion state proximity to ASD seed
genes. The DSAP-based prioritizer achieves an AUC of 54.05% excluding
the seed genes. The top 4% of the ranklist shows a 1.1-fold enrichment of
ASD genes (excluding seed genes). However, inclusion of seed genes boosts this enrichment
up to 5.7-fold.
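To make the underlying DSD metric concrete, the sketch below computes it on a toy graph. It assumes the usual definition of DSD as the L1 distance between vectors of expected visit counts of a k-step random walk (counting the starting step); how DSAP turns these distances into a proximity to the ASD seed genes is defined elsewhere in the thesis and is not reproduced here. The function names and the three-node graph are hypothetical, and every node is assumed to have at least one neighbor.

```python
def expected_visits(adj, k):
    """Return (nodes, he) where he[i][j] is the expected number of visits
    to node j by a k-step random walk started at node i (step 0 included)."""
    nodes = sorted(adj)
    index = {node: i for i, node in enumerate(nodes)}
    n = len(nodes)
    he = []
    for start in nodes:
        dist = [0.0] * n
        dist[index[start]] = 1.0
        visits = dist[:]
        for _ in range(k):
            nxt = [0.0] * n
            for node, mass in zip(nodes, dist):
                if mass == 0.0:
                    continue
                share = mass / len(adj[node])  # uniform step to a neighbor
                for nb in adj[node]:
                    nxt[index[nb]] += share
            dist = nxt
            visits = [a + b for a, b in zip(visits, dist)]
        he.append(visits)
    return nodes, he


def dsd(adj, u, v, k=5):
    """L1 distance between the expected-visit vectors of u and v."""
    nodes, he = expected_visits(adj, k)
    iu, iv = nodes.index(u), nodes.index(v)
    return sum(abs(a - b) for a, b in zip(he[iu], he[iv]))


# Toy undirected path graph a - b - c (made-up example).
toy_graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
toy_dsd = dsd(toy_graph, "a", "c", k=1)
```

Because the distance compares whole visit profiles rather than single paths, two proteins that reach the rest of the network in similar ways are close even when many short paths pass through high-degree hubs.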
The fourth method we introduce is a generalization of Google's PageRank algorithm for
ASDs. This approach uses the global PPI network structure to simulate network crosstalk
between the genes in the network and high confidence ASD seed genes. The simulated
crosstalk gives a quantification of the functional association of ASD genes to the rest of the
genes in the network. Genes are ranked in descending order of their association scores. We
achieve an AUC of 56.11% using this method. In the top 2% of the
ranklist of genes, we achieve a 2.37-fold enrichment of ASD genes (excluding seeds).
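The idea of seeding a PageRank-style walk with known disease genes can be sketched as a random walk with restart (personalized PageRank). This is a generic illustration under stated assumptions, not the thesis's exact generalization: the function name, the restart probability of 0.15, the fixed iteration count, and the toy graph are all hypothetical, and every node is assumed to have at least one neighbor.

```python
def personalized_pagerank(adj, seeds, restart=0.15, iterations=100):
    """Random walk with restart: at each step the walker jumps back to a
    uniformly chosen seed with probability `restart`, otherwise it moves
    to a uniformly chosen neighbor. Returns a score per node."""
    nodes = sorted(adj)
    seed_set = set(seeds)
    restart_vec = {
        n: (1.0 / len(seed_set) if n in seed_set else 0.0) for n in nodes
    }
    rank = dict(restart_vec)
    for _ in range(iterations):
        nxt = {n: restart * restart_vec[n] for n in nodes}
        for node in nodes:
            share = (1.0 - restart) * rank[node] / len(adj[node])
            for nb in adj[node]:
                nxt[nb] += share
        rank = nxt
    return rank


# Toy undirected path s - a - b - c with a single seed "s" (made-up example).
toy_ppi = {"s": ["a"], "a": ["s", "b"], "b": ["a", "c"], "c": ["b"]}
toy_rank = personalized_pagerank(toy_ppi, seeds=["s"])
toy_order = sorted(toy_rank, key=toy_rank.get, reverse=True)
```

Ranking the non-seed nodes by these scores gives a prioritized list in which genes that are well connected to the seeds through the global network structure come first.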
Considering the unbalanced nature of our dataset, these methods can be considered to
perform reasonably well, as each of them achieves an AUC of more than 50%.
However, their performance is limited in that none of them
gives an AUC of more than 60%. Thus, to increase the overall accuracy of ASD gene prediction, we
propose a novel integrative approach which incorporates not only CNV, phenotype similarity,
connectivity, proximity, and topological similarity in the PPI network, but also ASD pathway
knowledge from the available literature. Each gene is assigned an association probability based
on a simple, yet powerful logistic regression model. Adaptive lasso penalization with cross
validation is performed to avoid over-fitting of the model. Genes are ranked in descending
order of their association probabilities. This integrative approach significantly outperforms
the above four individual methods, achieving an AUC of 65.34% on
test data. The top 2% of the ranklist gives us a 3-fold enrichment of ASD genes (excluding
seeds), which increases up to 11.2-fold with the inclusion of seed genes. Thus we get a high
quality candidate gene set for ASDs consisting of the top 2% of genes in the ranklist.
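For reference, a logistic regression model with an adaptive lasso penalty is commonly written as follows. This is the standard textbook form, given here as an assumption about the general shape of the model rather than the exact formulation used in this thesis; the symbols x_{ij} (feature j of gene i), y_i (known ASD label), the initial estimate \tilde{\beta}_j, and the tuning parameters \lambda and \gamma are introduced only for illustration.

```latex
p_i = \frac{1}{1 + \exp\!\left(-\beta_0 - \sum_{j=1}^{m} \beta_j x_{ij}\right)}

\hat{\beta} = \arg\min_{\beta} \; -\sum_{i=1}^{n}
  \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]
  + \lambda \sum_{j=1}^{m} w_j \, |\beta_j|,
\qquad w_j = \frac{1}{|\tilde{\beta}_j|^{\gamma}}
```

Here p_i plays the role of the association probability by which genes are ranked, and \lambda would typically be the quantity chosen by cross validation.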
Our candidate gene set provides a number of interesting insights into the genetic background and pathophysiology of ASDs. Pathway enrichment analysis reveals that the candidate gene set is overrepresented in a number of signaling, cell adhesion, and neurological
pathways, which can be used to better explain the pathophysiology of ASDs. We have been
able to discover an interesting connection between ASDs and IBD by showing that our candidate gene set has significant overlap with the majority of the IBD-related pathways. We
have also found several disjoint subnetworks in our candidate gene set characterized by different categories of diseases and bio-functions, which provide an indication of the existence
of subclasses of disorders in the autism spectrum. The topmost subnetwork, characterized by
gastrointestinal disorders, is particularly interesting and needs further investigation. Furthermore, we have identified a number of interesting molecular functions and biological processes
by functional analysis and enrichment analysis on GO terms. For some of these (e.g., molecular functions related to metabolism, organ and tissue morphology, muscle cell differentiation,
etc.), the connection to ASDs is not so obvious and is thus worth further investigation.
There is considerable room for developing more sophisticated computational integrative approaches for combining ASD-related omics data from different sources.
These techniques will become increasingly important because omics data related to ASD are growing at
a fast rate as more and more studies are performed on larger ASD cohorts.
Thus, sophisticated computational analysis is key to understanding the complex etiology
of ASDs. This thesis provides a significant step towards better understanding the biological underpinnings of ASDs.
Appendix A
SFARI Genes for Autism Spectrum Disorders
Table A.1 - ASD risk genes reported by SFARI gene module.
Gene Symbol | Gene Name | Chromosomal Location | # Reports
NRXN1 | neurexin 1 | 2p16.3 | 51
MECP2 | Methyl CpG binding protein 2 | Xq28 | 39
CNTNAP2 | contactin associated protein-like 2 | 7q35-q36 | 38
SHANK3 | SH3 and multiple ankyrin repeat domains 3 | 22q13.3 | 33
FMR1 | fragile X mental retardation 1 | Xq27.3 | 29
MET | met proto-oncogene (hepatocyte growth factor receptor) | 7q31 | 29
CACNA1C | calcium channel, voltage-dependent, L type, alpha 1C subunit | 12p13.3 | 27
RELN | Reelin | 7q22 | 27
FOXP2 | forkhead box P2 | 7q31 | 26
OXTR | oxytocin receptor | 3p25 | 26
DISC1 | disrupted in schizophrenia 1 | 1q42.1 | 24
DMD | dystrophin (muscular dystrophy, Duchenne and Becker types) | Xp21.2 | 22
NLGN3 | neuroligin 3 | Xq13.1 | 22
RBFOX1 | RNA binding protein, fox-1 homolog (C. elegans) 1 | 16p13.3 | 22
PTEN | phosphatase and tensin homolog (mutated in multiple advanced cancers 1) | 10q23.3 | 21
GABRB3 | gamma-aminobutyric acid (GABA) A receptor, beta 3 | 15q11.2-q12 | 20
NLGN4X | neuroligin 4, X-linked | Xp22.32-p22.31 | 20
SYNGAP1 | synaptic Ras GTPase activating protein 1 | 6p21.3 | 20
AUTS2 | autism susceptibility candidate 2 | 7q11.22 | 19
SCN1A | sodium channel, voltage-gated, type I, alpha subunit | 2q24.3 | 19
SLC6A4 | solute carrier family 6 (neurotransmitter transporter, serotonin), member 4 | 17q11.1-q12 | 19
DPP6 | dipeptidyl-peptidase 6 | 7q36.2 | 18
GRIN2B | glutamate receptor, ionotropic, N-methyl D-aspartate 2B | 12p12 | 18
GRIN2A | glutamate receptor, ionotropic, N-methyl D-aspartate 2A | 16p13.2 | 17
MBD5 | Methyl-CpG binding domain protein 5 | 2q23.1 | 17
EN2 | engrailed homolog 2 | 7q36 | 16
CDKL5 | cyclin-dependent kinase-like 5 | Xp22 | 15
HOXA1 | homeobox A1 | 7p15.3 | 15
NF1 | neurofibromin 1 (neurofibromatosis, von Recklinghausen disease, Watson disease) | 17q11.2 | 15
SCN2A | sodium channel, voltage-gated, type II, alpha subunit | 2q23-q24 | 15
SHANK2 | SH3 and multiple ankyrin repeat domains 2 | 11q13.3-q13.4 | 15
(continued on next page)
Table A.1 - Continued.
Gene Symbol
Gene Name
Chromosomal Location
# Reports
AHII
Abelson helper integration site 1
6q23.3
14
CACNA1H
calcium channel, voltage-dependent, alpha 1H subunit
16pl3.3
14
CNTN4
contactin 4
3p26-p25
14
ILlRAPL1
interleukin
Xp22.1-p21.3
14
KCNMAI
potassium large conductance calcium-activated channel, sub-
10q22.3
14
1
receptor accessory protein-like 1
family M, alpha member
1
RORA
RAR-related orphan receptor A
15q22.2
14
SYNI
Synapsin 1
Xp1l.23
14
TSC2
tuberous sclerosis 2
l6p13.3
14
MEF2C
myocyte enhancer factor 2C
5q14
13
NLGN1
neuroligin 1
3q26.31
13
PCDH19
protocadherin 19
Xq13.3
13
SLC25A12
solute carrier family 25 (mitochondrial carrier, Aralar), mem-
2q24
13
9q34
15q11.2
13
12q14-q15
12
Xpll.22-pll.21
12
ber 12
TSC1
tuberous sclerosis 1
UBE3A
ubiquitin protein ligase E3A
AVPR1A
arginine vasopressin receptor
KDM5C
Lysine (K)-specific demethylase
NTRK3
neurotrophic tyrosine kinase, receptor, type 3
15q25
12
PARK2
Parkinson disease (autosomal recessive, juvenile) 2, parkin
6q25.2-q27
12
CACNA1G
calcium channel, voltage-dependent, T type, alpha
17q22
11
1A
5C
1G
sub-
13
unit
DLX2
distal-less homeobox 2
2q32
11
ERBB4
v-erb-a erythroblastic leukemia viral oncogene homolog 4
2q33.3-q34
11
(avian)
ITGB3
integrin, beta 3 (platelet glycoprotein Ilia, antigen CD61)
17q21.32
11
MACROD2
MACRO domain containing 2
20pl2.1
11
MAOA
monoamine oxidase A
Xpl1.3
11
MCPH1
microcephalin
8p23.1
11
MED12
mediator complex subunit 12
Xql3
11
MTHFR
methylenetetrahydrofolate reductase (NAD(P)H)
1p36.3
11
RAPGEF4
Rap guanine nucleotide exchange factor (GEF) 4
2q31-q32
11
SLC1A1
solute carrier family 1 (neuronal/epithelial high affinity glu-
9p2
1
tamate transporter, system Xag), member
4
STXBP1
Syntaxin binding protein 1
9q34.1
TCF4
Transcription factor 4
18q21.1
ADRB2
adrenergic, beta-2-, receptor, surface
5q31-q32
AFF2
AF4/FMR2 family, member 2
Xq28
ANK3
Ankyrin 3, node of Ranvier (ankyrin G)
10q21
BAIAP2
BAll-associated protein 2
17q25
BCL2
B-cell
EIF4E
eukaryotic translation initiation factor 4E
4q21-q25
FOXPi
forkhead box P1
3p14.1
GRIK2
glutamate receptor, ionotropic, kainate 2
6q16.3-q21
HDAC4
histone deacetylase 4
2q37.3
SYNEI
spectrin repeat containing, nuclear envelope 1
ARID1B
AT rich interactive domain
ARNT2
aryl-hydrocarbon receptor nuclear translocator 2
15q24
ASTN2
astrotactin 2
9q33.1
DIAPH3
Diaphanous-related formin 3
13q21.2
DPP1O
Dipeptidyl-peptidase 10
GRIPI
glutamate receptor interacting protein
IMMP2L
IMP2 inner mitochondrial membrane peptidase-like (S. cere-
CLL/lymphoma
11
1
2
18q21.3
1B
(SWIl-like)
6q25
6q25.1
2q14.1
1
12q14.3
7q31
visiae)
1
OPHN1
oligophrenin
SEMA5A
sema domain, seven thrombospondin repeats (type 1 and type
Xq12
9
5p15.2
9
1-like), transmembrane domain (TM) and short cytoplasmic
domain, (semaphorin) 5A
TTN
titin
2q31
9
(continued on next page)
Table A.1
-
Continued.
Gene Symbol
Gene Name
Chromosomal Location
# Reports
WNT2
wingless-type MMTV integration site family member 2
7q31
9
ANKRD11
ankyrin repeat domain 11
16q24.3
8
ARX
aristaless related homeobox
Xp22.1-22.3
8
CADPS2
Ca2+-dependent activator protein for secretion 2
7q31.3
8
CHRNA7
cholinergic receptor, nicotinic, alpha 7
15q14
8
CTNNA3
catenin (cadherin-associated protein), alpha 3
10q22.2
8
DLX1
distal-less homeobox 1
2q32
8
DLX6
distal-less homeobox 6
7q22
8
ESR1
estrogen receptor 1
6q25.1
8
FHIT
fragile histidine triad gene
3p14.2
8
FOXG1
Forkhead box G1
14q13
8
GLRA2
glycine receptor, alpha 2
Xp22.1-p21.3
8
HOXB1
homeobox Bi
17q21.3
8
HSD11B1
hydroxysteroid (11-beta) dehydrogenase
1q32-q41
8
NTNG1
netrin G1
lp13.3
8
PLCD1
phospholipase C, delta
3p22-p21.3
8
PTPRC
protein tyrosine phosphatase, receptor type, C
1q31-q32
8
RAIl
retinoic acid induced
8
RFWD2
ring finger and WD repeat domain 2
17plI.2
1q25.1-q25.2
RPLlO
ribosomal protein L10
Xq28
8
SLC9A9
solute carrier family 9 (sodium/hydrogen exchanger), mem-
3q24
8
1
1
1
8
ber 9
SNDl
staphylococcal nuclease and tudor domain containing 1
7q31.3
8
XPC
xeroderma pigmentosum, complementation group C
3p25
8
ADORA2A
adenosine A2a receptor
22q11.23
7
APC
adenomatosis polyposis coli
5q21-q22
7
CADMI
cell adhesion molecule 1
11q23.2
7
CDHIO
cadherin 10, type 2 (T2-cadherin)
5p1 -p13
4
7
CHD7
chromodomain helicase DNA binding protein 7
Sq12.2
7
CNTNAP5
contactin associated protein-like 5
2ql4.3
7
CXCR3
chemokine (C-X-C motif) receptor 3
Xql3
7
DHCR7
7-dehydrocholesterol reductase
11q13.2-q13.5
7
DLGAP2
discs, large (Drosophila) homolog-associated protein 2
8p23
7
DLGAP3
Discs, large (Drosophila) homolog-associated protein 3
lp35.3-p34.1
7
DRD3
dopamine receptor D3
3q13.3
7
ESR2
estrogen receptor 2 (ER beta)
14q23.2
7
ESRRB
estrogen-related receptor beta
12 41.0 cM
7
GPC6
glypican 6
13q32
7
GRPR
Gastrin-releasing peptide receptor
Xp22.2-p22.13
7
HRAS
v-Ha-ras Harvey rat sarcoma viral oncogene homolog
11p15.5
7
KCNJ10
Potassium inwardly-rectifying channel, subfamily J, member 10
1q23.2
7
MARK1
MAP/microtubule affinity-regulating kinase 1
1q41
7
MKL2
MKL/myocardin-like 2
16p13.12
7
NRXN3
neurexin 3
14q31
7
NTRK1
neurotrophic tyrosine kinase, receptor, type 1
1q21-q22
7
ROBO1
roundabout, axon guidance receptor, homolog 1 (Drosophila)
3p12
7
SLC9A6
solute carrier family 9 (sodium/hydrogen exchanger), member 6
Xq26.3
7
VPS13B
vacuolar protein sorting 13 homolog B (yeast)
8q22.2
7
AFF4
AF4/FMR2 family, member 4
5q31
6
AR
androgen receptor
Xq11.2-q12
6
ATRX
alpha thalassemia/mental retardation syndrome X-linked
Xq21.1
6
CA6
carbonic anhydrase VI
1p36.2
6
CACNA1D
calcium channel, voltage-dependent, L type, alpha 1D
3p14.3
6
CDH8
cadherin 8, type 2
16q22.1
6
CDH9
cadherin 9, type 2 (T1-cadherin)
5p14
6
CHD2
Chromodomain helicase DNA binding protein 2
15q26
6
CNR1
cannabinoid receptor 1 (brain)
6q14-q15
6
DAB1
disabled homolog 1 (Drosophila)
1p32-p31
6
DCX
doublecortex, lissencephaly, X-linked (doublecortin)
Xq22.3-q23
6
DPYD
dihydropyrimidine dehydrogenase
1p22
6
DYRK1A
Dual-specificity tyrosine-(Y)-phosphorylation regulated kinase 1A
21q22.13
6
EPHA6
EPH receptor A6
3q11.2
6
GABRA4
gamma-aminobutyric acid (GABA) A receptor, alpha 4
4p12
6
GLO1
glyoxalase I
6p21.3-p21.1
6
GRID2
glutamate receptor, ionotropic, delta 2
4q22
6
GRM8
glutamate receptor, metabotropic 8
7q31.3-q32.
6
HTR1B
5-hydroxytryptamine (serotonin) receptor 1B
6q13
6
KCNQ2
Potassium voltage-gated channel, KQT-like subfamily, member 2
20q13.3
6
MYO1A
myosin IA
12q13-q14
6
MYT1L
Myelin transcription factor 1-like
2p25.3
6
NOS1AP
nitric oxide synthase 1 (neuronal) adaptor protein
1q23.3
6
NOS2A
nitric oxide synthase 2A (inducible, hepatocytes)
17q11.2-q12
6
NRP2
neuropilin 2
2q33.3
6
PCDH9
protocadherin 9
13q21.32
6
PSMD10
proteasome (prosome, macropain) 26S subunit, non-ATPase, 10
Xq22.3
6
RGS7
regulator of G-protein signaling 7
1q23.1
6
RPS6KA3
Ribosomal protein S6 kinase, 90kDa, polypeptide 3
Xp22.2-p22.1
6
SLC6A8
solute carrier family 6 (neurotransmitter transporter, creatine), member 8
Xq28
6
TBC1D5
TBC1 domain family, member 5
3p24.3
6
TH
tyrosine hydroxylase
11p15.5
6
UPF3B
UPF3 regulator of nonsense transcripts homolog B (yeast)
Xq25-q26
6
WNK3
WNK lysine deficient protein kinase 3
Xp11.23-p11.21
6
ADA
adenosine deaminase
20q12-q13.11
5
AGAP1
ArfGAP with GTPase domain, ankyrin repeat and PH domain 1
2q37
5
ALDH5A1
aldehyde dehydrogenase 5 family, member A1 (succinate-semialdehyde dehydrogenase)
6p22.2-p22.3
5
APBA2
amyloid beta (A4) precursor protein-binding, family A, member 2
15q11-q12
ARHGAP15
Rho GTPase activating protein 15
2q22.2-q22.3
5
BRAF
v-raf murine sarcoma viral oncogene homolog B
7q34
5
C4B
complement component 4B
6p21.3
5
CACNA1B
Calcium channel, voltage-dependent, N type, alpha 1B subunit
9q34
5
CELF4
CUGBP, Elav-like family member 4
18q12
CTCF
CCCTC-binding factor (zinc finger protein)
16q21-q22.3
CYFIP1
cytoplasmic FMR1 interacting protein 1
15q11
DMPK
dystrophia myotonica-protein kinase
19q13.3
DOCK4
Dedicator of cytokinesis 4
7q31.1
F13A1
coagulation factor XIII, A1 polypeptide
6p25.3-p24.3
FABP5
fatty acid binding protein 5 (psoriasis-associated)
8q21.13
GPX1
glutathione peroxidase 1
3p21.3
GTF2I
general transcription factor IIi
7q11.23
HEPACAM
hepatic and glial cell adhesion molecule
11q24.2
HLA-A
major histocompatibility complex, class I, A
6p21.3
HS3ST5
heparan sulfate (glucosamine) 3-O-sulfotransferase 5
6q21
HTR2A
5-hydroxytryptamine (serotonin) receptor 2A
13q14-q21
HTR3C
5-hydroxytryptamine (serotonin) receptor 3, family member C
3q27.1
HTR7
5-hydroxytryptamine (serotonin) receptor 7 (adenylate cyclase-coupled)
10q21-q24
5
IL1R2
interleukin 1 receptor, type II
2q12
5
ITGA4
integrin, alpha 4 (antigen CD49D, alpha 4 subunit of VLA-4 receptor)
2q31.3
5
JARID2
Jumonji, AT rich interactive domain 2
6p24-p23
5
KCNQ3
Potassium voltage-gated channel, KQT-like subfamily, member 3
8q24
5
LAMC3
laminin, gamma 3
9q31-q34
5
MAP2
microtubule-associated protein 2
2q34-q35
5
MBD1
methyl-CpG binding domain protein 1
18q21
5
MBD4
methyl-CpG binding domain protein 4
3q21-q22
5
MDGA2
MAM domain containing glycosylphosphatidylinositol anchor
14q21.3
5
MYO16
myosin XVI
13q33.3
5
NRCAM
neuronal cell adhesion molecule
7q31.1-q31.2
5
PCDH10
protocadherin 10
4q28.3
5
PER1
period homolog 1 (Drosophila)
17p13.1-p12
5
PINX1
PIN2/TERF1 interacting, telomerase inhibitor 1
8p23
5
PITX1
paired-like homeodomain 1
5q31
5
PON1
paraoxonase 1
7q21.3
5
PTCHD1
patched domain containing 1
Xp22.11
5
PTGS2
prostaglandin-endoperoxide synthase 2 (prostaglandin G/H synthase and cyclooxygenase)
1q25.2-q25.3
5
SATB2
SATB homeobox 2
2q33
5
SCN8A
sodium channel, voltage gated, type VIII, alpha subunit
12q13
5
SEZ6L2
SEZ6L2 seizure related 6 homolog (mouse)-like 2
16p11.2
5
SLC4A10
solute carrier family 4, sodium bicarbonate transporter-like, member 10
2q23-q24
5
SNTG2
Syntrophin, gamma 2
2p25.3
ST8SIA2
ST8 alpha-N-acetyl-neuraminide alpha-2,8-sialyltransferase 2
15q26
STK39
serine threonine kinase 39 (STE20/SPS1 homolog, yeast)
2q24.3
TBR1
T-box, brain, 1
2q24
TGM3
transglutaminase 3
20q11.2
TSPAN7
tetraspanin 7
Xp11.4
VIP
Vasoactive intestinal peptide
6q25
ABAT
4-aminobutyrate aminotransferase
16p13.2
ACY1
Aminoacylase 1
3p21.1
ADSL
adenylosuccinate lyase
22q13.1, 22q13.2
ALOX5AP
arachidonate 5-lipoxygenase-activating protein
13q12
AP1S2
Adaptor-related protein complex 1, sigma 2 subunit
Xp22.2
ATP2B2
ATPase, Ca++ transporting, plasma membrane 2
3p25.3
BZRAP1
benzodiazapine receptor (peripheral) associated protein 1
17q22-q23
CASC4
cancer susceptibility candidate 4
15q15.3
CD38
CD38 molecule
4p15
CDH22
cadherin-like 22
20q13.1
CHD8
chromodomain helicase DNA binding protein 8
14q11.2
CREBBP
CREB binding protein
16p13.3
CTTNBP2
cortactin binding protein 2
7q31
CUL3
Cullin 3
2q36.2
CYP11B1
cytochrome P450, family 11, subfamily B, polypeptide 1
8q21
EGR2
early growth response 2 (Krox-20 homolog, Drosophila)
10q21.1
EPC2
Enhancer of polycomb homolog 2 (Drosophila)
2q23.1
EPHB6
EPH receptor B6
7q33-q35
FBXO40
F-box protein 40
3q13.33
GABRB1
gamma-aminobutyric acid (GABA) A receptor, beta 1
4p12
GALNT13
UDP-N-acetyl-alpha-D-galactosamine:polypeptide N-acetylgalactosaminyltransferase 13 (GalNAc-T13)
2q23.3-q24.1
GNAS
GNAS complex locus
20q13.3
4
GPHN
Gephyrin
14q23.3
4
GRM1
Glutamate receptor, metabotropic 1
6q24
4
HLA-DRB1
major histocompatibility complex, class II, DR beta 1
6p21.3
4
HUWE1
HECT, UBA and WWE domain containing 1, E3 ubiquitin protein ligase
Xp11.22
4
ICA1
islet cell autoantigen 1, 69kDa
7p22
4
JMJD1C
jumonji domain containing 1C
10q21.2
4
KANK1
KN motif and ankyrin repeat domains 1
9p24.3
4
LRP2
Low density lipoprotein receptor-related protein 2
2q24-q31
4
LRRC1
leucine rich repeat containing 1
6p12.1
4
LZTS2
leucine zipper, putative tumor suppressor 2
10q24
4
MBD3
methyl-CpG binding domain protein 3
19p13.3
4
MCC
mutated in colorectal cancers
5q21
4
MTF1
metal-regulatory transcription factor 1
1p33
4
NBEA
neurobeachin
13q13
4
NPAS2
neuronal PAS domain protein 2
2q11.2
4
NSD1
nuclear receptor binding SET domain protein 1
5q35
4
PHF8
PHD finger protein 8
Xp11.22
4
PIK3CG
phosphoinositide-3-kinase, catalytic, gamma polypeptide
7q22.3
4
PLN
phospholamban
6q22.1
4
PRICKLE1
Prickle homolog 1 (Drosophila)
12q12
4
PRKCB
protein kinase C, beta
16p11.2
4
RAB39B
RAB39B, member RAS oncogene family
Xq28
4
SGSH
N-sulfoglucosamine sulfohydrolase
17q25.3
4
SH3KBP1
SH3-domain kinase binding protein 1
Xp22.1-p21.3
4
SLC30A5
solute carrier family 30
5q12.1
4
SLC6A3
Solute carrier family 6 (neurotransmitter transporter), member 3
5p15.3
4
SYN2
Synapsin II
3p25
4
TDO2
tryptophan 2,3-dioxygenase
4q31-q32
4
UBE3B
ubiquitin protein ligase E3B
12q24.11
4
VASH1
vasohibin 1
14q24.3
4
ADNP
Activity-dependent neuroprotector homeobox
20q13.13
3
AGBL4
ATP/GTP binding protein-like 4
1p
AGTR2
angiotensin II receptor, type 2
Xq22-q23
3
ALDH1A3
Aldehyde dehydrogenase
15q26.3
3
APP
Amyloid beta (A4) precursor protein
21q21.3
3
ASS1
argininosuccinate synthetase
9q34.1
3
BCKDK
Branched chain ketoacid dehydrogenase kinase
16p11.2
3
BIN1
Bridging integrator 1
2q14
3
C12orf57
Chromosome 12 open reading frame 57
12p13.31
3
C3orf58
chromosome 3 open reading frame 58
3q24
3
CAMTA1
calmodulin binding transcription activator
1p36.31-p36.23
3
CBS
cystathionine beta-synthase
21q22.3
3
CD44
CD44 molecule (Indian blood group)
3
CEP290
Centrosomal protein 290kDa
lIp
12q21.32
CEP41
testis specific, 14
7q32
3
CMIP
c-Maf inducing protein
16q23
3
DAPK1
death-associated protein kinase 1
9q34.1
3
DCTN5
dynactin 5
16p12.2
3
DCUN1D1
DCN1, defective in cullin neddylation 1, domain containing
3q26.3
3
1
1 family,
member A3
1
33
13
3
3
(S. cerevisiae)
DDX11
DEAD/H (Asp-Glu-Ala-Asp/His) box polypeptide 11
12p11
3
DDX53
DEAD (Asp-Glu-Ala-Asp) box polypeptide 53
Xp22.11
3
DRD1
Dopamine receptor D1
5q35.1
3
EHMT1
Euchromatic histone-lysine N-methyltransferase 1
9q34.3
3
EXT1
Exostosin 1
8q24.11
3
FAT1
FAT tumor suppressor homolog 1 (Drosophila)
4q35
3
FLT1
fms-related tyrosine kinase 1 (vascular endothelial growth factor/vascular permeability factor receptor)
13q12
3
FRK
fyn-related kinase
6q21-q22.3
3
FRMPD4
FERM and PDZ domain containing 4
Xp22.2
3
GPD2
Glycerol-3-phosphate dehydrogenase 2 (mitochondrial)
2q24.1
3
GRID1
Glutamate receptor, ionotropic, delta 1
10q22
3
GRM5
Glutamate receptor, metabotropic 5
11q14.3
3
GSTM1
glutathione S-transferase M1
1p13.3
3
HCFC1
Host cell factor C1 (VP16-accessory protein)
Xq28
3
HNRNPH2
heterogeneous nuclear ribonucleoprotein H2 (H')
Xq22
3
HOMER1
Homer homolog 1 (Drosophila)
5q14.2
3
INPP1
inositol polyphosphate-1-phosphatase
2q32
3
IQSEC2
IQ motif and Sec7 domain 2
Xp11.22
3
ITGB7
integrin, beta 7
12q13.13
3
KCND2
Potassium voltage-gated channel, Shal-related subfamily, member 2
7q31
3
KIAA1586
KIAA1586
6p12.1
3
NDNL2
necdin-like 2
15q13.1
3
NDUFA5
NADH dehydrogenase (ubiquinone) 1 alpha subcomplex, 5, 13kDa
7q32
3
NFIA
nuclear factor I/A
1p31.3-p31.2
NXF5
Nuclear RNA export factor 5
Xq22
OPRM1
opioid receptor, mu 1
6q24-q25
OTX1
Orthodenticle homeobox 1
2p13
PCDHA11
Protocadherin alpha 11
5q31
PCDHA13
Protocadherin alpha 13
5q31
PCDHA2
Protocadherin alpha 2
5q31
PCDHA4
Protocadherin alpha 4
5q31
PCDHA5
Protocadherin alpha 5
5q31
PCDHA6
Protocadherin alpha 6
5q31
PCDHA7
Protocadherin alpha 7
5q31
PCDHA9
Protocadherin alpha 9
5q31
PDZD4
PDZ domain containing 4
Xq28
PLCB1
phospholipase C, beta 1 (phosphoinositide-specific)
20p12
POGZ
Pogo transposable element with ZNF domain
1q21.3
PRICKLE2
Prickle homolog 2 (Drosophila)
3p14.1
PRUNE2
prune homolog 2 (Drosophila)
9q21.2
PSD3
pleckstrin and Sec7 domain containing 3
8p21.3
RB1CC1
RB1-inducible coiled-coil 1
8q11
REEP3
receptor accessory protein 3
10q21.3
RHOXF1
Rhox homeobox family, member 1
Xq24
RIMS3
regulating synaptic membrane exocytosis 3
1pter-p22.2
RPS6KA2
ribosomal protein S6 kinase, 90kDa, polypeptide 2
6q27
SDC2
syndecan 2 (heparan sulfate proteoglycan 1, cell surface-associated, fibroglycan)
8q22-q23
SOX5
SRY (sex determining region Y)-box 5
12p12.1
STX1A
Syntaxin 1A (brain)
7q11.23
SUCLG2
succinate-CoA ligase, GDP-forming, beta subunit
3p14.1
TAF1L
TAF1 RNA polymerase II
9p21.1
TBC1D7
TBC1 domain family, member 7
6p24.1
TLK2
tousled-like kinase 2
17q23
TMLHE
trimethyllysine hydroxylase, epsilon
Xq28
TOP1
Topoisomerase (DNA) I
20q12-q13.1
TOP3B
Topoisomerase (DNA) III beta
22q11.22
TRIP12
Thyroid hormone receptor interactor 12
2q36.3
TSN
translin
2q21.1
TUBGCP5
tubulin, gamma complex associated protein 5
15q11.2
WNT1
Wingless-type MMTV integration site family, member 1
12q13
ADARB1
Adenosine deaminase, RNA-specific, B1
21q22.3
ADCY5
Adenylate cyclase 5
3q21.1
ADORA3
Adenosine A3 receptor
1p13.2
ANK2
Ankyrin 2, neuronal
4q25-q27
ASXL3
Additional sex combs like 3 (Drosophila)
18q11
CACNA1I
Calcium channel, voltage-dependent, T type, alpha 1I subunit
22q13.1
2
CAPRIN1
Cell cycle associated protein 1
11p13
CCDC64
coiled-coil domain containing 64
12q24. 3
2
CHRM3
Cholinergic receptor, muscarinic 3
1q43
2
CLTCL1
clathrin, heavy chain-like 1
22q11.21
2
CNTN3
contactin 3 (plasmacytoma associated)
3p12.3
2
CSMD1
CUB and Sushi multiple domains 1
8p23.2
2
CTNNB1
Catenin (cadherin-associated protein), beta 1, 88kDa
3p21
2
DDC
Dopa decarboxylase (aromatic L-amino acid decarboxylase)
7p12.2
2
DEPDC5
DEP domain containing 5
22q12.3
2
DLG4
Discs, large homolog 4 (Drosophila)
17p13.1
2
DRD2
Dopamine receptor D2
11q23
2
EML1
echinoderm microtubule associated protein like 1
14q32
2
EP400
E1A binding protein p400
12q24.33
2
EPHB2
EPH receptor B2
1p36.1-p35
2
EXOC6B
Exocyst complex component 6B
2p13.2
2
FAM135B
Family with sequence similarity 135, member B
8q24.23
2
FBXO33
F-box protein 33
14q21.1
2
FGD1
FYVE, RhoGEF and PH domain containing 1
Xp11.21
2
FOLH1
Folate hydrolase (prostate-specific membrane antigen) 1
11p11.2
2
GALNT14
UDP-N-acetyl-alpha-D-galactosamine:polypeptide N-acetylgalactosaminyltransferase 14 (GalNAc-T14)
2p23.1
2
GSK3B
Glycogen synthase kinase 3 beta
3q13.3
2
HERC2
HECT and RLD domain containing E3 ubiquitin protein ligase 2
15q13
2
2
2
KATNAL2
Katanin p60 subunit A-like 2
18q21.1
2
KHDRBS2
KH domain containing, RNA binding, signal transduction associated 2
6q11.1
2
KIAA2022
KIAA2022
Xq13.3
2
KIF5C
Kinesin family member
2q23.1
2
LRPPRC
Leucine-rich pentatricopeptide repeat containing
2p21
2
MAOB
Monoamine oxidase B
Xp11.23
2
MC4R
Melanocortin 4 receptor
18q22
2
NELL1
NEL-like 1 (chicken)
11p
NIPA1
non imprinted in Prader-Willi/Angelman syndrome 1
15q11.2
2
NIPA2
non imprinted in Prader-Willi/Angelman syndrome 2
15q11.2
2
NIPBL
Nipped-B homolog (Drosophila)
5p13.2
2
NRXN2
neurexin 2
11q13
2
PCDH15
Protocadherin-related 15
10q21.1
2
PCDHAC2
Protocadherin alpha subfamily C, 2
5q31
2
PDE4A
phosphodiesterase 4A, cAMP-specific
19p13.2
2
PDE4B
phosphodiesterase 4B, cAMP-specific
1p31
2
PEX7
peroxisomal biogenesis factor 7
6q23.3
POMGNT1
Protein
5C
O-linked
1
mannose
betal,2-N-
'p
5 1
.
34 1
2
2
.
2
acetylglucosaminyltransferase
PTPRT
protein tyrosine phosphatase, receptor type, T
20q12-q13
2
PXDN
Peroxidasin homolog (Drosophila)
2p25
2
RAB11FIP5
RAB11 family interacting protein 5
2p13
2
RBM8A
RNA binding motif protein 8A
1q21.1
2
SAE1
SUMO1 activating enzyme subunit 1
19q13.32
2
SBF1
SET binding factor 1
22q13.33
2
SDK1
Sidekick cell adhesion molecule 1
7p22.2
2
SERPINE1
Serpin peptidase inhibitor, clade E (nexin, plasminogen activator inhibitor type 1), member 1
7q21.3-q22
2
SETD2
SET domain containing 2
3p21.31
2
SETDB2
SET domain, bifurcated 2
13q14
2
SGSM3
Small G protein signaling modulator 3
22q13.1-q13.2
2
SLIT3
Slit homolog 3 (Drosophila)
5q35
2
SOD1
Superoxide dismutase 1, soluble
21q22.11
2
ST7
suppression of tumorigenicity 7
7q31.1-q31.3
2
STXBP5
Syntaxin binding protein 5 (tomosyn)
6q24.3
2
SUV420H1
suppressor of variegation 4-20 homolog 1 (Drosophila)
11q13.2
2
SYAP1
Synapse associated protein 1
Xp22.2
2
SYT17
synaptotagmin XVII
16p12.3
2
TBL1XR1
Transducin (beta)-like 1 X-linked receptor 1
3q26.32
2
TYR
Tyrosinase (oculocutaneous albinism IA)
11q14-q21
2
UBE3C
Ubiquitin protein ligase E3C
7q36.3
2
YWHAE
Tyrosine 3-monooxygenase/tryptophan 5-monooxygenase activation protein, epsilon polypeptide
17p13.3
2
ABCA7
ATP-binding cassette, sub-family A (ABC1), member 7
19p13.3
ADK
adenosine kinase
10q11-q24
ARHGAP24
Rho GTPase activating protein 24
4q22.1
ATRNL1
Attractin-like 1
10q26
1
ATXN7
Ataxin 7
3p21.1-p12
1
BBS4
Bardet-Biedl syndrome 4
15q22.3-q23
BRCA2
breast cancer 2, early onset
13q12.3
BTAF1
RNA polymerase II, B-TFIID transcription factor-associated,
10q22-q23
1
170kDa (Mot1 homolog, S. cerevisiae)
C15orf43
chromosome 15 open reading frame 43
15q21.1
1
CAMK4
Calcium/calmodulin-dependent protein kinase IV
5q21.3
1
CAMSAP2
calmodulin regulated spectrin-associated protein family, member 2
1q32.1
1
CD99L2
CD99 molecule-like 2
CDKN1B
Cyclin-dependent kinase inhibitor
CECR2
Cat eye syndrome chromosome region, candidate 2
CLSTN3
Calsyntenin 3
12p13.31
CNTNAP3
contactin associated protein-like 3
9p13.1
CSNK1D
casein kinase 1, delta
17q25
DAPP1
Dual adaptor of phosphotyrosine and 3-phosphoinositides
4q25-q27
DNAJC19
DnaJ (Hsp40) homolog, subfamily C, member 19
3q26.33
DNM1L
Dynamin 1-like
12p11.21
DOCK10
Dedicator of cytokinesis 10
2q36.2
DOLK
Dolichol kinase
9q34.11
DST
Dystonin
6p12.1
DUSP22
dual specificity phosphatase 22
6p25.3
DYDC1
DPY30 domain containing 1
DYDC2
DPY30 domain containing 2
10q23.1
10q23.1
EIF4EBP2
Eukaryotic translation initiation factor 4E binding protein 2
10q21-q22
EP300
E1A binding protein p300
22q13.2
EPS8
Epidermal growth factor receptor pathway substrate 8
12p12.3
ERG
v-ets erythroblastosis virus E26 oncogene homolog (avian)
21q22.3
FAN1
FANCD2/FANCI-associated nuclease
FBXO15
F-box protein 15
18q22.3
FER
Fer (fps/fes related) tyrosine kinase
5q21
FGA
Fibrinogen alpha chain
4q28
GABRA3
Gamma-aminobutyric acid (GABA) A receptor, alpha 3
Xq28
GAN
Gigaxonin
16q24.1
GAP43
Growth associated protein 43
3q13.1-q13.2
GAS2
Growth arrest-specific 2
llp4.3
GNA14
Guanine nucleotide binding protein (G protein), alpha 14
9q21
GNB1L
guanine
22q11.2
nucleotide
polypeptide
GPR37
Xq28
binding
1B
(p27, Kipl)
12p13.1-p1
1
protein
2
22q11.2
15q13.2-q13.3
(G
protein),
beta
1-like
G protein-coupled receptor 37 (endothelin receptor type B-
7q31
like)
GRM4
Glutamate receptor, metabotropic 4
6p21.3
GSN
Gelsolin
9q33
GUCY1A2
Guanylate cyclase 1, soluble, alpha 2
11q21-q22
1
1
HDAC6
Histone deacetylase 6
Xp11.23
1
HMGN1
high mobility group nucleosome binding domain 1
21q22.
HYDIN
HYDIN, axonemal central pair apparatus protein
16q22.2
INADL
InaD-like (Drosophila)
1p31.3
KCTD13
Potassium channel tetramerisation domain containing 13
16p11.2
KIT
V-kit Hardy-Zuckerman 4 feline sarcoma viral oncogene ho-
4q11-q12
2
1
1
1
molog
KLC2
Kinesin light chain 2
11q13.2
KPTN
Kaptin (actin binding protein)
19q13.32
LAMA1
Laminin, alpha 1
18p11.3
LAMB1
laminin, beta 1
7q22
LEP
Leptin
7q31.3
LMX1B
LIM homeobox transcription factor 1, beta
9q33.3
LRRC7
Leucine rich repeat containing 7
1p31.1
MAGED1
Melanoma antigen family D, 1
Xp11.23
MAGEL2
MAGE-like 2
15q11-q12
MAPK1
Mitogen-activated protein kinase 1
22q11.21
MAPK3
mitogen-activated protein kinase 3
16p11.2
MAPK8IP2
Mitogen-activated protein kinase 8 interacting protein 2
22q13.33
MBD6
Methyl-CpG binding domain protein 6
12q13
MSN
Moesin
Xq11.1
MSR1
macrophage scavenger receptor 1
8p22
MTR
5-methyltetrahydrofolate-homocysteine methyltransferase
1q43
MTX2
Metaxin 2
2q31.1
MYH4
Myosin, heavy chain 4, skeletal muscle
17p13.1
NCKAP5L
NCK-associated protein 5-like
12q13.12
NCKAP5
NCK-associated protein 5
2q21.2
NEFL
Neurofilament, light polypeptide
8p21
ODF3L2
outer dense fiber of sperm tails 3-like 2
19p13.3
OGT
O-linked N-acetylglucosamine (GlcNAc) transferase
Xq13
PAH
Phenylalanine hydroxylase
12q22-q24.2
PARD3B
Par-3 partitioning defective 3 homolog B (C. elegans)
2q33.3
PCDH8
protocadherin 8
13q21.1
PCDHGA11
protocadherin gamma subfamily A, 11
5q31
PECR
peroxisomal trans-2-enoyl-CoA reductase
2q35
PIK3R2
Phosphoinositide-3-kinase, regulatory subunit 2 (beta)
19q13.2-q13.4
PLAUR
Plasminogen activator, urokinase receptor
19q13
POT1
Protection of telomeres 1 homolog (S. pombe)
7q31.33
PPFIA1
Protein tyrosine phosphatase, receptor type, f polypeptide (PTPRF), interacting protein (liprin), alpha 1
11q13.3
PPP1R1B
Protein phosphatase 1, regulatory (inhibitor) subunit 1B
17q12
PRKD1
Protein kinase D1
14q11
PTGER3
Prostaglandin E receptor 3 (subtype EP3)
1p31.2
PTPN11
protein tyrosine phosphatase, non-receptor type 11
12q24
PTPRB
Protein Tyrosine Phosphatase, Receptor Type, B
12q15-q21
RASD1
RAS, dexamethasone-induced 1
17p11.2
RASSF5
Ras association (RalGDS/AF-6) domain family member 5
1q32.1
RERE
Arginine-glutamic acid dipeptide (RE) repeats
1p36.23
RNPS1
RNA binding protein S1, serine-rich domain
16p13.3
ROBO2
Roundabout, axon guidance receptor, homolog 2 (Drosophila)
3p12.3
RPP25
Ribonuclease P/MRP 25kDa subunit
15q24.2
SCFD2
sec1 family domain containing 2
4q12
SETDB1
SET domain, bifurcated 1
1q21
SHANK1
SH3 and multiple ankyrin repeat domains 1
19q13.3
SLC16A3
solute carrier family 16, member 3 (monocarboxylic acid
17q25
transporter 4)
SLC16A7
Solute carrier family 16, member 7 (monocarboxylic acid
12q13
transporter 2)
SLC25A14
Solute carrier family 25 (mitochondrial carrier, brain), member 14
Xq24
1
SLC25A24
Solute carrier family 25 (mitochondrial carrier; phosphate carrier), member 24
1p13.3
1
SLC35A3
Solute carrier family 35 (UDP-N-acetylglucosamine (UDP-GlcNAc) transporter), member A3
1p21
1
SLC38A10
solute carrier family 38, member 10
17q25.3
1
SLC39A11
Solute carrier family 39 (metal ion transporter), member 11
17q21.31
1
SMG6
Smg-6 homolog, nonsense mediated mRNA decay factor (C. elegans)
17p13.3
1
SNRPN
small nuclear ribonucleoprotein polypeptide N
15q11.2
1
SNX19
Sorting nexin 19
11q25
1
SPAST
Spastin
2p24-p21
1
SYN3
Synapsin III
22q12.3
1
SYT3
synaptotagmin III
19q13.33
1
TAF1C
TATA box binding protein (TBP)-associated factor, RNA polymerase I, C, 110kDa
16q24
1
TBL1X
transducin (beta)-like 1X-linked
Xp22.3
1
TBX1
T-box 1
22q11.21
1
THRA
Thyroid hormone receptor, alpha
17q11.2
1
TM4SF20
Transmembrane 4 L six family member 20
2q36.3
1
TNIP2
TNFAIP3 interacting protein 2
4p16.3
1
TOMM20
Translocase of outer mitochondrial membrane 20 homolog (yeast)
1q42
1
TPO
Thyroid peroxidase
2p25
1
TRIM33
Tripartite motif containing 33
1p13.1
1
TTI2
TELO2 interacting protein 2
8p12
1
UBA6
Ubiquitin-like modifier activating enzyme 6
4q13.2
1
UBE2H
ubiquitin-conjugating enzyme E2H (UBC8 homolog, yeast)
7q32
1
UBL7
ubiquitin-like 7 (bone marrow stromal cell-derived)
15q24.1
1
UBR5
Ubiquitin protein ligase E3 component n-recognin 5
8q22
1
UBR7
ubiquitin protein ligase E3 component n-recognin 7 (putative)
14q32.12
1
UPF2
UPF2 regulator of nonsense transcripts homolog (yeast)
10p14-p13
1
USP9Y
ubiquitin specific peptidase 9, Y-linked
Yq11.2
1
XPO1
Exportin 1 (CRM1 homolog, yeast)
2p15
1
YEATS2
YEATS domain containing 2
3q27.1
1
YTHDC2
YTH domain containing 2
5q22.2
1
ZBTB16
Zinc finger and BTB domain containing 16
11q23.1
1
ZNF18
zinc finger protein 18
17p11.2
1
ZNF407
Zinc finger protein 407
18q23
1
ZNF827
Zinc finger protein 827
4q31.22
1
ZSWIM5
zinc finger, SWIM-type containing 5
1p34.1
1
1. We consider only the genes which can be mapped to the largest connected component of our PPI network.
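The filtering step described in this footnote can be sketched as follows. This is a minimal illustration, not the code used in the thesis: the function names, the toy edge list and the toy gene list are all hypothetical, and the PPI network is assumed to be available as a list of undirected gene-gene pairs.

```python
from collections import deque


def largest_connected_component(edges):
    """Return the node set of the largest connected component of an undirected graph."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    seen = set()
    best = set()
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from `start`.
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in component:
                    component.add(neighbor)
                    queue.append(neighbor)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


def filter_to_component(genes, edges):
    """Keep only the genes that lie in the largest connected component."""
    lcc = largest_connected_component(edges)
    return [gene for gene in genes if gene in lcc]


# Toy interactions, chosen for illustration only (not taken from the thesis network).
toy_edges = [
    ("SHANK3", "DLGAP3"),
    ("DLGAP3", "DLG4"),
    ("DLG4", "NLGN3"),
    ("TTN", "CAPN3"),
]
toy_genes = ["SHANK3", "NLGN3", "TTN", "WNT2"]
kept = filter_to_component(toy_genes, toy_edges)
```

In this toy example TTN is dropped because it sits in a smaller component, and WNT2 is dropped because it does not appear in the network at all.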
Appendix B
Risk Genes for ASDs Identified by Integrative Approach
Table B.1 - Probabilities of association with ASDs for candidate genes identified by our integrative analysis approach.
Gene Symbol
Association Probability
SHANK2
1.000000
HGF
1.000000
CACNA1H
1.000000
EN2
1.000000
MTHFR
1.000000
GRIN2A
1.000000
ANKRD11
1.000000
GATAD2B
1.000000
FBN1
1.000000
COBL
1.000000
BAIAP2
1.000000
TTN
1.000000
SLC1A1
1.000000
FOXP4
1.000000
GABRB3
1.000000
MACROD2
1.000000
TBX1
1.000000
STOX1
1.000000
TSC1
1.000000
SND1
1.000000
HSPC215
1.000000
GJB2
1.000000
STXBP1
1.000000
GAN2B
1.000000
KRTAP5-9
1.000000
FAM154A
1.000000
RP11-220B22.3
1.000000
LGALS13
1.000000
HSPY1
1.000000
KRTAP26-1
1.000000
KRTAP3-3
1.000000
MIR10A
1.000000
ARX
1.000000
RP11-328M4.1
1.000000
SCT
1.000000
SCN1A
1.000000
NF1
1.000000
AVPR1A
1.000000
MIR223
1.000000
MIR181A1
1.000000
LMNA
1.000000
BCL2
1.000000
TCF4
1.000000
PAX5
1.000000
PAX6
1.000000
PTEN
1.000000
MEF2C
1.000000
MECP2
1.000000
MYH9
1.000000
TRPV4
1.000000
FGFR3
1.000000
NCKIPSD
1.000000
SYN1
1.000000
COL1A1
1.000000
CTNNA3
1.000000
RAI1
1.000000
MCPH1
1.000000
CPE
1.000000
RPL10
1.000000
TP63
1.000000
DIAPH3
1.000000
OPHN1
1.000000
MET
1.000000
SLC25A12
1.000000
SETD2
1.000000
DLX1
1.000000
HOXA1
1.000000
GLI3
1.000000
FGFR2
1.000000
COL2A1
1.000000
FLNA
1.000000
MED12
1.000000
REST
1.000000
GNAS
1.000000
MSX2
1.000000
CNTNAP2
1.000000
ANK3
1.000000
GRIK2
1.000000
NLGN3
1.000000
FOXP2
1.000000
GBA
1.000000
GJA1
1.000000
TWIST2
1.000000
TP53AIP1
1.000000
AVP
1.000000
SLC6A4
1.000000
FLNB
1.000000
DMD
1.000000
OXTR
1.000000
DLGAP3
0.999900
CHM
0.999900
ALPL
0.999900
FGF14
0.999900
PAX2
0.999900
SNCG
0.999900
RFWD2
0.999900
FGFR1
0.999900
COL7A1
0.999900
LRP5
0.999800
SMN1
0.999800
AR
0.999800
PRNP
0.999800
GLB1
0.999800
SCN8A
0.999700
RNU5A-1
0.999700
BDNF
0.999700
AES
0.999700
DSP
0.999700
HOXD13
0.999700
ZMIZ1
0.999700
ITLN1
0.999700
PLCD1
0.999600
MAPT
0.999600
XPC
0.999600
PAX3
0.999600
SCN5A
0.999500
WAS
0.999500
PITX2
0.999500
RP11-519K18.1
0.999500
MSX1
0.999500
UNQ640/PRO1270
0.999400
GATA1
0.999200
RELN
0.999100
CACNA1G
0.999100
RET
0.999100
MPZ
0.999000
KRT1
0.999000
CHRNA7
0.998800
RECQL4
0.998800
DLX2
0.998800
CNTN4
0.998700
IL1RAPL1
0.998700
DISC1
0.998600
DLX5
0.998400
PELO
0.998400
EDA
0.998300
CACNA1C
0.998200
FAM108A1
0.998200
COL11A2
0.998100
SLC26A2
0.998000
ABCC8
0.998000
NOG
0.997900
DLGAP1
0.997900
POLG
0.997800
MGAT3
0.997800
CHRNA1
0.997800
L1CAM
0.997600
CGI-17
0.997600
IKBKG
0.997500
COL11A1
0.997300
SDC3
0.997300
TDRD7
0.997200
TYR
0.997000
ASCL3
0.997000
CACNA1A
0.997000
GRIP1
0.996800
ESCO2
0.996700
FOXP1
0.996700
CADPS2
0.996700
AFF2
0.996600
FMR1
0.996500
LOC347475
0.996400
MYH7
0.996400
PARK2
0.996300
GNPTAB
0.996300
TP53
0.996300
FHIT
0.996300
RP11-298P3.3
0.996200
KDM5C
0.996200
WT1
0.996200
PRKAR1A
0.996100
RNU4-1
0.995900
CAPN3
0.995700
DNM2
0.995500
GP1BA
0.995500
ATP7A
0.995400
RORA
0.995400
RYR1
0.995300
RP11-394C23.1
0.995300
DPP10
0.995200
SCN9A
0.995100
ARID1B
0.995000
NCAPG2
0.994800
KCNMA1
0.994700
SYNE1
0.994700
AUTS2
0.994500
KRT14
0.994400
UGT1A1
0.994300
EFNB1
0.994300
AHI1
0.994200
ERBB4
0.994200
KLHL1
0.994100
SLC9A9
0.994000
TWIST1
0.993700
PMP22
0.993600
PCDH19
0.993600
SEMA5A
0.993500
CPT2
0.993500
ATRX
0.992900
ARNT2
0.992800
ATR
0.992800
HDAC4
0.992200
LBR
0.991800
PTPN11
0.991700
KCTD3
0.991600
NTRK3
0.991400
NLRP3
0.991200
UBE3A
0.991000
VDR
0.990800
ERCC6
0.990800
SRGAP2
0.990700
RP11-5F19.1
0.990600
NLGN1
0.990500
VLDLR
0.990300
HPHB2
0.990200
EDNRB
0.990100
GH1
0.990000
SCN2A
0.990000
NLGN4X
0.989500
CMC1
0.989500
RUNX2
0.989400
APC
0.989300
HSD11B1
0.989300
PSEN1
0.988700
ADRB2
0.988600
SPSB3
0.988500
CBP
0.988300
GNE
0.988300
EIF4E
0.988100
PTPRC
0.987700
DLX6
0.987200
NRXN1
0.986700
NDRG1
0.986400
DUSP22
0.986400
ERCC6L2
0.986200
EVC
0.986000
RMRP
0.985600
MGAT5B
0.985300
ERAG
0.984900
4-OCT
0.984900
SPINK5
0.984400
OFD1
0.984200
MBD5
0.983700
TRPS1
0.983500
SYNE3
0.983100
BSCL2
0.982600
TSC2
0.982400
FH
0.982100
COQ2
0.981700
BRCA2
0.981600
SALL4
0.980500
NCS-1
0.980200
FDXR
0.979900
CTR9
0.979400
MAOA
0.979100
EYA1
0.979000
HR
0.978900
TBCE
0.978200
TR-B
0.977400
HOXA13
0.977100
SPSB4
0.976700
MBNL2
0.976700
PC
0.976400
ANKH
0.976300
PRPS1
0.976200
PHLDA3
0.975900
EPB41L3
0.975800
HOXC8
0.975700
MYO7A
0.975400
SHANK3
0.974700
CAV3
0.974400
MYO5A
0.973400
SCN4A
0.973200
IL2RG
0.972800
FAM111A
0.972300
NR0B1
0.972200
DYSF
0.971900
HOXD3
0.971200
DPP6
0.971100
SOST
0.971000
RNU6-1
0.970500
DYT10
0.970500
GNB2L1
0.970200
ALG6
0.969900
WFS1
0.969600
XPA
0.966300
BUB1B
0.966300
INPPL1
0.966200
CDKL5
0.966100
IL6
0.966000
ZEB2
0.965700
RB1
0.965700
ALS2
0.965400
PDP1
0.964600
ESR1
0.962300
INSR
0.962200
LRIG1
0.962200
SHH
0.962100
GDAP1
0.961900
CD96
0.960800
HHF1
0.960000
CTSC
0.958600
PTPN1
0.958200
KCND3
0.956900
TGFBI
0.956600
FRS2
0.955700
THTPA
0.955300
CACNA1S
0.955200
ITGA2B
0.953700
GDF5
0.953300
TACC3
0.952700
GHR
0.952500
COL4A3
0.952400
DLL3
0.950700
SLC17A5
0.950700
WNT2
0.949700
FOXG1
0.949300
PIGL
0.949200
AIM1
0.949100
TADA3
0.948600
EMD
0.948200
MIF
0.948000
NTF3
0.948000
IRF6
0.947700
PAX8
0.947400
AASS
0.947300
HMGA2
0.947200
NF2
0.946400
NPHP1
0.946000
KCND2
0.945900
SMARCA2
0.945800
PEX5
0.945700
TBR1
0.944600
THRB
0.944600
SCARF2
0.944000
PANK2
0.943900
HSPG2
0.943600
ARSB
0.943100
FAM123B
0.942500
LAMA3
0.940800
SMS
0.940000
ABCA4
0.939900
SLC13A3
0.939100
KAT6B
0.938900
SMPD1
0.937900
ERCC2
0.937000
TREX1
0.936700
FLCN
0.936000
HOXB1
0.934500
SIM2
0.933800
SNTG1
0.933200
NKX2-1
0.933100
RP11-258C19.2
0.930600
PKHD1
0.929300
FCP1
0.928300
HPRT1
0.928300
ELANE
0.928100
NELL2
0.926300
PRRT2
0.925700
ROR2
0.925600
APC2
0.925500
FKRP
0.925300
OPA1
0.925200
SLC37A4
0.924400
MEF2A
0.923000
HBB
0.922400
STK11
0.922300
RAPGEF4
0.922000
EYA4
0.918500
CDH3
0.917900
ZNF81
0.917700
N4BP2L2
0.917500
DYM
0.917500
EGR2
0.917400
BIN1
0.917000
HNF1A
0.915500
NTNG1
0.910400
CPS1
0.910100
KIT
0.909700
AHNAK
0.908500
CFTR
0.907800
TDO2
0.907000
TTR
0.906300
AVEN
0.906300
MITF
0.904200
SPTAN1
0.903100
TBPL1
0.902200
GLRA2
0.901600
PTH
0.901600
SMOC1
0.900300
RPS6KA3
0.899000
ADCK4
0.897800
SEC23A
0.896300
ASTN2
0.892600
CYLD
0.892400
BMP4
0.892100
MBNL1
0.889700
FOXH1
0.888200
KCNQ1
0.887100
ATP2A2
0.885500
DES
0.884400
CASR
0.882600
FLT4
0.877300
BCL7B
0.874000
DAAP-218M18.8
0.872500
FOXO4
0.872500
SOX3
0.872500
SYNM
0.871800
RAB40B
0.871400
GNAQ
0.869900
RP11-419L10.1
0.869900
SMARCE1
0.869400
FAM189A1
0.868800
PHF11
0.866700
PICK1
0.866200
XIST
0.866100
GPC6
0.864900
ATP8B1
0.862900
HSD17B4
0.861100
TRIM25
0.861000
NTRK1
0.859300
PLP1
0.858700
TBC1D24
0.857300
PLCG1
0.857300
OPN1LW
0.857200
GBE1
0.855300
ELN
0.854300
DDX59
0.853800
FOXL2
0.853400
ABCC6
0.852900
LHCGR
0.848500
VWF
0.846900
NOS3
0.846500
TSSK2
0.846100
STIM1
0.845000
DDB2
0.844800
VHL
0.844800
ATP6V0A2
0.844700
PRKAG2
0.844700
NCOA2
0.844600
NPHP3
0.842200
SOS1
0.842000
ITM2B
0.841300
Appendix C
Subnetworks in ASD Risk Gene Set
Table C.1 - Subnetworks in ASD risk gene set generated by QIAGEN's Ingenuity Pathway Analysis (IPA).1
ID
Molecules in Network
Score
Seed Genes
Top Diseases and Functions
1
ADRB2, Ap1, ARNT2, ATP6V0A2, ATP8B1, CFTR, EGR2, FLNA, GBE1, GH1, GNB2L1, HBB, HNF1A, HPRT1, IL1RAPL1, INSR, Insulin, KIT, Mek, MET, p70 S6k, p85 (pik3r), PAX6, PDGF BB, Proinsulin, SLC37A4, SLC9A9, SND1, SNTG1, SPSB3, THRB, UGT1A1, VHL, WFS1, XIST
46
28
Cancer, Tissue Morphology, Gastrointestinal Disease
2
ABHD17A, ADCK4, BSCL2, CHRNA1, CHRNA7, DUSP22, ERBB, ERK, FAM154A, FDXR, Hnf3, HOXA1, HOXD3, HSFY1/HSFY2, Igfbp, ITGA2B, KRTAP26-1, KRTAP3-3, KRTAP5-9, L1CAM, LAMA3, LGALS13, MGAT5B, N4BP2L2, Nicotinic acetylcholine receptor, NKX2-1, NRG (family), PAX8, PC, POLG, SERCA, sGC, SMOC1, SOST, TGFBI
44
27
Cancer, Tissue Morphology, Nervous System Development and Function
3
AES, Akt, ANK3, ARX, Calbindin, CDKL5, CTNNβ-TCF/LEF, CYP19, DLX1, DLX2, DLX5, DLX6, FGF14, FOXG1, FOXH1, Foxo, FOXO4, GABRB3, HOXC8, KDM5C, MECP2, MIR124, MSX1, MSX2, PMP22, REST, SCN1A, SCN2A, SCN4A, SCN5A, SCN8A, SCN9A, SOX2-OCT4-NANOG, SYNM, voltage-gated sodium channel
43
27
Developmental Disorder, Hereditary Disorder, Neurological Disease
4 | ABCC6, ABCC8, ARID1B, ATPase, BCL7B, CADPS2, CTR9, cyclooxygenase, DES, DISC1, DMD, EMD, GDAP1, GNPTAB, IL23, KAT6B, LDL-cholesterol, LMNA, MEF2C, MYH7, MYH9, NCAPG2, OFD1, P38 MAPK, PELO, PHLDA3, SLC25A12, SMARCA2, SMARCE1, Spectrin, SPTAN1, SRGAP2, Tropomyosin, Troponin t, tubulin (family) | 41 | 26 | Cancer, Organismal Injury and Abnormalities, Reproductive System Disease
5 | 20s proteasome, AHI1, AMER1, APC, APC (complex), B-cell receptor, BUB1B, Ctbp, Eif4g, FH, GBA, GJB2, Glycogen synthase, Histone H1, INPPL1, KRT1, KRT14, Lamin, Mapk, MBD5, MYO7A, NF2, NPHP1, NPHP3, OPA1, PRKAC, PRKAR1A, Rab5, Snare, SNCG, STXBP1, SYN1, SYNE1, SYNE3, TACC3 | 34 | 23 | Hereditary Disorder, Auditory Disease, Neurological Disease
6 | ASTN2, ATR, BRCA2, Cdc2, CNTNAP2, Cyclin B, DDB2, ERCC2, ERCC6, MBNL1, MBNL2, MCPH1, NCOA2, Nuclear factor 1, PARP, Pde4, PDP1, PHF11, PRKAG2, RECQL4, RFWD2, RNA polymerase I, RNA polymerase II, Rnr, RPA, SETD2, TDO2, TFIIH, TP53, TP53AIP1, TRIM25, Ube3, XPA, XPC, ZMIZ1 | 34 | 24 | Cancer, Dermatological Diseases and Conditions, Hereditary Disorder
7 | 7S NGF, Arp2/3, BAIAP2, Beta Tubulin, BMP, CAPN3, COBL, COL4A3, DIAPH3, DLGAP1, DLGAP3, EDA, elastase, ETS, Filamin, G-Actin, Gli, GRIN2B, Integrin alpha 3 beta 1, KCND3, LBR, MYO5A, NCKIPSD, NFkB (complex), NLGN1, NLGN3, Profilin, RAPGEF4, RELN, RPS6KA3, SHANK2, SHANK3, SMS, TBR1, Wave | 32 | 22 | Cell-To-Cell Signaling and Interaction, Nervous System Development and Function, Behavior
8 | Alpha tubulin, APC2, APC/APC2, BMP4, CK1, Cyclin D, CYLD, Dgk, Dishevelled, Dynein, FLCN, GLI3, Hedgehog, IRS, Jnk, KCNQ1, LRP, LRP5, mir181, MPZ, PAX2, PAX3, PITX2, ROR2, Secretase gamma, SHH, SOX3, TBX1, TWIST2, Vdac, VLDL, VLDLR, Wnt, WNT2, ZEB2 | 28 | 20 | Embryonic Development, Organismal Development, Gene Expression
9 | 14-3-3, ATP7A, c-Src, COL11A1, COL11A2, COL1A1, COL2A1, COL7A1, collagen, Collagen type II, Collagen Type XI, Collagen(s), Cpla2, DLL3, ELANE, ELN, EPB41L3, FBN1, Fc gamma receptor, Fibrin, GDF5, HOXD13, HSD11B1, HSD17B4, mir-10, MTORC2, Notch, PEX5, SMOOTH MUSCLE ACTIN, SPINK5, trypsin, TSC1, TSC2, Vegf, VWF | 28 | 21 | Developmental Disorder, Connective Tissue Disorders, Skeletal and Muscular Disorders
10 | 26s Proteasome, AR, ATRX, caspase, Cdk, Cyclin E, Cytochrome bc1, cytochrome C, cytochrome-c oxidase, EIF4E, ESCO2, GATA1, HDAC4, HOXA13, Hsp27, Hsp70, Hsp90, MAOA, MED12, Mitochondrial complex 1, NDRG1, PARK2, PP2A, PRPS1, Rb, SEC23A, SLC6A4, SRC (family), STK11, TBCE, TRPS1, TSSK2, TYR, Ubiquitin, WT1 | 27 | 20 | Cancer, Cell Death and Survival, Cell Cycle
11 | ANKH, Atrial Natriuretic Peptide, cacn, Cacna1, CACNA1A, CACNA1C, CACNA1G, CACNA1H, CACNA1S, CAV3, DPP6, DPP10, ERK1/2, GLRA2, GNAS, Homer, ITPR, KCND2, KCNMA1, KLHL1, L-type Calcium Channel, MGAT3, NELL2, Neurotrophin, NOG, NRXN1, Pka catalytic subunit, Pkg, Pki, potassium channel, Presenilin, Ryr, RYR1, T-type Calcium Channel, voltage-gated calcium channel | 26 | 19 | Molecular Transport, Cancer, Organismal Injury and Abnormalities
12 | ABCA4, Ahr-aryl hydrocarbon-Arnt, ARSB, Bcl9-Cbp/p300-Ctnnb1-Lef/Tcf, Cbp/p300, CDH3, CPE, Cyclin A, DSP, E2f, EN2, ESR1, Esr1-Esr1-estrogen-estrogen, estrogen receptor, FMR1, FOXL2, glutathione peroxidase, Hat, Histone h4, HISTONE, HMGA2, NR0B1, RNase A, RPL10, SDC3, SEMA5A, Smad1/5/8, Smad2/3, Smad2/3-Smad4, SMN1/SMN2, TADA3, TBPL1, TP63, TWIST1, UBE3A | 24 | 20 | Renal and Urological System Development and Function, Reproductive System Development and Function, Organismal Development
13 | AHNAK, ATP2A2, Cebp, CPT1, CPT2, DYM, FOXP1, FOXP2, FOXP4, GATAD2B, HR, IL6, INTERLEUKIN, IRF6, ITLN1, JUN/JUNB/JUND, N-cor, Na+,K+-ATPase, Nr1h, PEPCK, Pmca, PRKAA, PTH, RAI1, Rar, Rbp, Rxr, SALL4, STIM1, SWI-SNF, Tcf 1/3/4, thymidine kinase, thyroid hormone receptor, TTR, VitaminD3-VDR-RXR | 22 | 17 | Cancer, Gastrointestinal Disease, Hematological Disease
14 | ALS2, Ampa Receptor, CaMKII, Cofilin, Ctnna, EFNB1, EPHB2, F Actin, FHIT, FLNB, GNE, GRI, GRIK2, GRIN2A, GRIP1, Integrin alpha V beta 3, mGluR, Mic, Myosin, N-Cadherin, OPHN1, Pak, PCDH19, PICK1, Pkc(s), PLCD1, PP1 protein complex group, Pp2b, Rab11, RAB40B, Rap, Rap1, SLC1A1, TSH, TTN | 18 | 16 | Cell Death and Survival, Cancer, Gastrointestinal Disease
15 | amylase, chymotrypsin, Collagen type III, Cytokeratin, DNM2, EFNB, ENaC, Fgf, Fgfr, FGFR1, FGFR2, FGFR3, FLT4, FRS2, Gap, GHR, GP1BA, GPIIB-IIIA, growth factor receptor, Hspg, HSPG2, IRS1/2, NCK, NTRK1, NTRK3, Ntrk dimer, Pdgfr, PI3K (complex), PI3K p85, PLC gamma, PLCG1, PTPN11, RPS6KA, SCT, Vla-4 | 17 | 14 | Cancer, Cell Death and Survival, Cellular Function and Maintenance
16 | ABRACL, AFF2, ANKRD11, BCL6, C9orf78, CHM, CPS1, CUTA, CWC27, DYSF, GRB2, KCTD3, LSM5, LSM6, LSM12, MT-ATP8, NOS2, PANK2, POU2F3, PRPF8, RNU2-1, RNU4-1, RNU5A-1, RNU6-1, SCARF2, SLC17A5, SNRNP25, SPSB4, TDRD7, TESPA1, TTYH2, UBC, XIRP2, ZNF443, ZNF609 | 17 | 14 | Cancer, Organismal Injury and Abnormalities, Reproductive System Disease
17 | Adaptor protein 2, ADCY, ADRB, AIM1, Ap2 alpha, ASCL3, BDNF, BIN1, Caveolin, Ck2, Clathrin, Creb, DNA-methyltransferase, Dynamin, GLB1, Gm-csf, Go-coupled receptor, GTPase, Hdac, mGLUR Group I, MITF, MITF-p300/CBP, NF1, NTF3, Pias, PKHD1, Ppp2c, Ras, RET, Rsk, SIM2, SLC26A2, Syntaxin, TCF, TCF4 | 15 | 13 | Cell Death and Survival, Cellular Development, Cellular Growth and Proliferation
18 | Actin, calpain, CD3, Cg, CNTN4, Cel, DPY19L3, E130116L18Rik, ERBB4, FAM111A, FSH, GJA1, HIAT1, Histone h3, I kappa b kinase, Ikb, IKBKG, IKK (complex), Integrin, MAPT, mir-223, MTORC1, NLRP3, PRNP, PSEN1, PTEN, RB1, RORA, STAT, STEAP1, STOX1, SUN5, TCR, Tnf receptor, ZNF211 | 14 | 14 | Cell Morphology, Cellular Assembly and Organization, Cellular Development
19 | AUTS2, AVEN, BCL2, C1q, CTNNA3, CTSC, Ifn, IFN alpha/beta, IFN Beta, IFN type 1, Iga, Ige, IgG, IgG1, Igg3, Igm, IL-2R, IL12 (complex), IL12 (family), IL2RG, Immunoglobulin, Interferon alpha, ITM2B, Ldh (complex), LRIG1, mediator, MHC Class II (complex), MHC II, MIR101, PAX5, PLP1, snRNP, STAT5a/b, Tlr, TREX1 | 12 | 11 | Cellular Growth and Proliferation, Lymphoid Tissue Structure and Development, Organ Morphology
20 | ALG6, Basp1, CCND1, CD96, COQ2, DAG1, DONSON, DPH1, EPB41L4B, ESCO2, EVC, FKRP, GLI1, GPC6, H2AFY2, HACL1, HRAS, IMPA2, MACROD2, PKD1, POP1, POP4, PPCS, PRELID1, PTTG1IP, RAB23, RASSF6, RMRP, SFXN3, SLC13A3, TBC1D24, TENM3, THG1L, UBC, ZNF711 | 12 | 11 | Cancer, Developmental Disorder, Cellular Growth and Proliferation
21 | ALPL, BCR (complex), Calcineurin A, Calcineurin protein(s), EYA1, EYA4, Fcer1, Gsk3, HOXB1, JAK, Lh, MAP2K1/2, MEF2, MEF2A, NFAT (complex), Nfat (family), Pdgf (complex), phosphatase, PI3K (family), Pka, Ptk, PTPase, PTPN1, PTPRC, Raf, Shc, SHP, Sod, Sos, SOS1, SYK/ZAP, THTPA, TRPV4, tyrosine kinase, WAS | 10 | 11 | Post-Translational Modification, Organismal Survival, Organismal Development
22 | AASS, ALDOC, ATP2B2, BAIAP2, CIT, CMC1, CNP, DLG4, EPB41L3, GRID2, Grik, GRIK2, GRIK5, GRIN2C, GRIN2D, HTT, HUNK, KCNA1, KCNAB1, KCNJ2, KCNJ4, KCNJ12, MAP3K10, MBTPS1, NLGN4X, NRXN1, PCLO, PLP1, PRELP, SDHA, SFXN3, SRGAP3, STX1B, STXBP1, TYRO3 | 8 | 9 | Cell-To-Cell Signaling and Interaction, Nervous System Development and Function, Cancer
23 | Alp, Alpha catenin, ALT, AMPK, C/ebp, Collagen Alpha1, Collagen type I, Collagen type IV, creatine kinase, CYP, Fibrinogen, Focal adhesion kinase, Growth hormone, HDL, HDL-cholesterol, hemoglobin, HGF, Ifn gamma, IL1, JINK1/2, Laminin, LDL, MIF, MTHFR, Nos, NOS3, Pro-inflammatory Cytokine, Rock, RUNX2, SCARF2, Smad, SMPD1, Tgf beta, Tnf (family), VDR | 8 | 8 | Cell Death and Survival, Cellular Growth and Proliferation, Tissue Development
24 | Alpha Actinin, Angiotensin II receptor type 1, AVP, AVPR1A, Beta Arrestin, Calmodulin, CASR, chemokine, EDNRB, Endothelin, G protein, G protein alpha, G protein alphai, G protein beta gamma, G-protein beta, GNAQ, GNRH, Gpcr, IgG2a, IgG2b, LHCGR, Metalloprotease, Mmp, NMDA Receptor, OPN1LW, OXTR, PLC, Pld, Rac, Ras homolog, Relaxin, Sapk, Sfk, Trk Receptor, tubulin (complex) | 5 | 8 | Carbohydrate Metabolism, Molecular Transport, Small Molecule Biochemistry
25 | ERCC6L2, NEK6 | 2 | 1 | Cell Cycle, Cellular Movement, Cell Morphology
1. IPA®, QIAGEN Redwood City, http://www.qiagen.com/ingenuity
Bibliography
[1] Online Mendelian Inheritance in Man, OMIM®, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University (Baltimore, MD), March 2014. World Wide
Web URL: http://omim.org/.
[2] Brett S Abrahams and Daniel H Geschwind. Advances in autism genetics: on the
threshold of a new neurobiology. Nature Reviews Genetics, 9(5):341-355, 2008.
[3] Stein Aerts, Diether Lambrechts, Sunit Maity, Peter Van Loo, Bert Coessens, Frederik De Smet, Leon-Charles Tranchevent, Bart De Moor, Peter Marynen, Bassem
Hassan, et al. Gene prioritization through genomic data fusion. Nature Biotechnology,
24(5):537-544, 2006.
[4] David Altshuler, Mark Daly, and Leonid Kruglyak. Guilt by association. Nature
Genetics, 26(2):135-138, 2000.
[5] David Altshuler, Mark J Daly, and Eric S Lander. Genetic mapping in human disease.
Science, 322(5903):881-888, 2008.
[6] JY An, AS Cristino, Q Zhao, J Edson, SM Williams, D Ravine, J Wray, VM Marshall,
A Hunt, AJO Whitehouse, et al. Towards a molecular characterization of autism spectrum disorders: an exome sequencing and systems approach. Translational Psychiatry,
4(6):e394, 2014.
[7] Richard Anney, Lambertus Klei, Dalila Pinto, Joana Almeida, Elena Bacchelli, Gillian
Baird, Nadia Bolshakova, Sven Bölte, Patrick F Bolton, Thomas Bourgeron, et al. Individual common variants exert weak effects on the risk for autism spectrum disorders.
Human Molecular Genetics, 21(21):4781-4792, 2012.
[8] Michael Ashburner, Catherine A Ball, Judith A Blake, David Botstein, Heather Butler,
J Michael Cherry, Allan P Davis, Kara Dolinski, Selina S Dwight, Janan T Eppig, et al.
Gene Ontology: tool for the unification of biology. Nature Genetics, 25(1):25-29, 2000.
[9] American Psychiatric Association. The Diagnostic and Statistical Manual of Mental
Disorders (5th ed.). American Psychiatric Publishing, 2013.
[10] Samy A Azer. Overview of molecular pathways in inflammatory bowel disease associated with colorectal cancer development. European Journal of Gastroenterology &
Hepatology, 25(3):271-281, 2013.
[11] Albert-László Barabási, Natali Gulbahce, and Joseph Loscalzo. Network medicine:
a network-based approach to human disease. Nature Reviews Genetics, 12(1):56-68,
2011.
[12] Colin A Baron, Clifford G Tepper, Stephenie Y Liu, Ryan R Davis, Nicholas J Wang,
N Carolyn Schanen, and Jeffrey P Gregg. Genomic and functional profiling of duplicated chromosome 15 cell lines reveal regulatory alterations in UBE3A-associated
ubiquitin-proteasome pathway processes. Human Molecular Genetics, 15(6):853-869,
2006.
[13] Saumyendra N Basu, Ravi Kollu, and Sharmila Banerjee-Basu. AutDB: a gene reference resource for autism research. Nucleic Acids Research, 37(suppl 1):D832-D836,
2009.
[14] Brent R Bill and Daniel H Geschwind. Genetic advances in autism: heterogeneity
and convergence on shared pathways. Current Opinion in Genetics & Development,
19(3):271-278, 2009.
[15] Douglas C Bittel, Nataliya Kibiryeva, and Merlin G Butler. Whole genome microarray
analysis of gene expression in subjects with fragile X syndrome. Genetics in Medicine,
9(7):464-472, 2007.
[16] Hans K Blomquist, Michael Bohman, Sven Olof Edvinsson, Christopher Gillberg,
Karl-Henrik Gustavson, Gösta Holmgren, Jan Wahlström, et al. Frequency of the
fragile X syndrome in infantile autism. Clinical Genetics, 27(2):113-117, 1985.
[17] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual Web
search engine. Computer Networks and ISDN Systems, 30(1):107-117, 1998.
[18] OJ Broom, B Widjaya, J Troelsen, Jorgen Olsen, and OH Nielsen. Mitogen activated protein kinases: a role in inflammatory bowel disease? Clinical & Experimental
Immunology, 158(3):272-280, 2009.
[19] Andrej Bugrim, Tatiana Nikolskaya, and Yuri Nikolsky. Early prediction of drug
metabolism and toxicity: systems biology approach and modeling. Drug Discovery
Today, 9(3):127-135, 2004.
[20] Joseph D Buxbaum. Multiple rare variants in the etiology of autism spectrum disorders. Dialogues in Clinical Neuroscience, 11(1):35, 2009.
[21] Malcolm G Campbell, Isaac S Kohane, and Sek Won Kong. Pathway-based outlier
method reveals heterogeneous genomic structure of autism in blood transcriptome.
BMC Medical Genomics, 6(1):34, 2013.
[22] Rita M Cantor, Naoko Kono, Jackie A Duvall, Ana Alvarez-Retuerto, Jennifer L Stone,
Maricela Alarcón, Stanley F Nelson, and Daniel H Geschwind. Replication of autism
linkage: fine-mapping peak at 17q21. The American Journal of Human Genetics,
76(6):1050-1056, 2005.
[23] Mengfei Cao, Hao Zhang, Jisoo Park, Noah M Daniels, Mark E Crovella, Lenore J
Cowen, and Benjamin Hescott. Going the Distance for Protein Function Prediction:
A New Distance Metric for Protein Interaction Networks. PloS One, 8(10):e76339,
2013.
[24] MA Care, JR Bradford, CJ Needham, AJ Bulpitt, and DR Westhead. Combining the
interactome and deleterious SNP predictions to improve disease gene identification.
Human Mutation, 30(3):485-492, 2009.
[25] Wenjun Chang, Liye Ma, Liping Lin, Liqiang Gu, Xiaokang Liu, Hui Cai, Yongwei
Yu, Xiaojie Tan, Yujia Zhai, Xingxing Xu, et al. Identification of novel hub genes
associated with liver metastasis of gastric cancer. International Journal of Cancer,
125(12):2844-2853, 2009.
[26] Yanqing Chen, Jun Zhu, Pek Yee Lum, Xia Yang, Shirly Pinto, Douglas J MacNeil,
Chunsheng Zhang, John Lamb, Stephen Edwards, Solveig K Sieberts, et al. Variations
in DNA elucidate molecular networks that cause disease. Nature, 452(7186):429-435,
2008.
[27] David Croft, Antonio Fabregat Mundo, Robin Haw, Marija Milacic, Joel Weiser, Guanming Wu, Michael Caudy, Phani Garapati, Marc Gillespie, Maulik R Kamdar, et al.
The Reactome pathway knowledgebase. Nucleic Acids Research, 42(D1):D472-D477,
2014.
[28] Developmental Disabilities Monitoring Network Surveillance Year 2010 Principal Investigators, et al. Prevalence of autism spectrum disorder among children aged 8 years - autism and developmental disabilities monitoring network, 11 sites, United States,
2010. Morbidity and Mortality Weekly Report. Surveillance Summaries (Washington,
DC: 2002), 63:1, 2014.
[29] Bernie Devlin, Nadine Melhem, and Kathryn Roeder. Do common variants play a role
in risk for autism? Evidence and theoretical musings. Brain Research, 1380:78-84,
2011.
[30] Zoltán Dezső, Yuri Nikolsky, Tatiana Nikolskaya, Jeremy Miller, David Cherba, Craig
Webb, and Andrej Bugrim. Identifying disease-specific genes based on their topological
significance in protein networks. BMC Systems Biology, 3(1):36, 2009.
[31] Annette J Dobson. An introduction to generalized linear models. CRC press, 2001.
[32] Lynnette R Ferguson. Nutrigenetics, nutrigenomics and inflammatory bowel diseases.
Expert Review of Clinical Immunology, 9(8):717-726, 2013.
[33] Eric Fombonne. Epidemiology of autistic disorder and other pervasive developmental
disorders. The Journal of Clinical Psychiatry, 66:3-8, 2004.
[34] Lude Franke, Harm van Bakel, Like Fokkens, Edwin D De Jong, Michael Egmont-Petersen, and Cisca Wijmenga. Reconstruction of a functional human gene network,
with an application for prioritizing positional candidate genes. The American Journal
of Human Genetics, 78(6):1011-1025, 2006.
[35] Christine M Freitag. The genetics of autistic disorders and its clinical relevance: a
review of the literature. Molecular Psychiatry, 12(1):2-22, 2006.
[36] Richard A George, Jason Y Liu, Lina L Feng, Robert J Bryson-Richardson, Diane
Fatkin, and Merridee A Wouters. Analysis of protein sequence and interaction data
for candidate disease gene prediction. Nucleic Acids Research, 34(19):e130-e130, 2006.
[37] Daniel H Geschwind, Janice Sowinski, Catherine Lord, Portia Iversen, Jonathan Shestack, Patrick Jones, Lee Ducat, Sarah J Spence, AGRE Steering Committee, et al.
The autism genetic resource exchange: a resource for the study of autism and related
neuropsychiatric conditions. American Journal of Human Genetics, 69(2):463, 2001.
[38] Kwang-Il Goh, Michael E Cusick, David Valle, Barton Childs, Marc Vidal, and Albert-László Barabási. The human disease network. Proceedings of the National Academy
of Sciences, 104(21):8685-8690, 2007.
[39] Jeffrey P Gregg, Lisa Lit, Colin A Baron, Irva Hertz-Picciotto, Wynn Walker, Ryan A
Davis, Lisa A Croen, Sally Ozonoff, Robin Hansen, Isaac N Pessah, et al. Gene
expression changes in children with autism. Genomics, 91(1):22-29, 2008.
[40] Joachim Hallmayer, Sue Cleveland, Andrea Torres, Jennifer Phillips, Brianne Cohen,
Tiffany Torigoe, Janet Miller, Angie Fedele, Jack Collins, Karen Smith, et al. Genetic
heritability and shared environmental factors among twin pairs with autism. Archives
of General Psychiatry, 68(11):1095-1102, 2011.
[41] Valerie W Hu, Bryan C Frank, Shannon Heine, Norman H Lee, and John Quackenbush. Gene expression profiling of lymphoblastoid cell lines from monozygotic twins
discordant in severity of autism reveals differential regulation of neurologically relevant
genes. BMC Genomics, 7(1):118, 2006.
[42] KR Hughes, F Sablitzky, and YR Mahida. Expression profiling of Wnt family of
genes in normal and inflammatory bowel disease primary human intestinal myofibroblasts and normal human colonic crypt epithelial cells. Inflammatory Bowel Diseases,
17(1):213-220, 2011.
[43] Daehee Hwang, Inyoul Y Lee, Hyuntae Yoo, Nils Gehlenborg, Ji-Hoon Cho, Brianne
Petritis, David Baxter, Rose Pitstick, Rebecca Young, Doug Spicer, et al. A systems
approach to prion disease. Molecular Systems Biology, 5(1), 2009.
[44] Sohyun Hwang, Seung-Woo Son, Sang Cheol Kim, Young Joo Kim, Hawoong Jeong,
and Doheon Lee. A protein interaction network associated with asthma. Journal of
Theoretical Biology, 252(4):722-731, 2008.
[45] Ronald Jansen, Haiyuan Yu, Dov Greenbaum, Yuval Kluger, Nevan J Krogan, Sambath Chung, Andrew Emili, Michael Snyder, Jack F Greenblatt, and Mark Gerstein. A
Bayesian networks approach for predicting protein-protein interactions from genomic
data. Science, 302(5644):449-453, 2003.
[46] LB Jorde, SJ Hasstedt, ER Ritvo, A Mason-Brothers, BJ Freeman, C Pingree,
WM McMahon, B Petersen, WR Jenson, and A Mo. Complex segregation analysis of autism. The American Journal of Human Genetics, 49(5):932, 1991.
[47] Minoru Kanehisa and Susumu Goto. KEGG: kyoto encyclopedia of genes and genomes.
Nucleic Acids Research, 28(1):27-30, 2000.
[48] Minoru Kanehisa, Susumu Goto, Yoko Sato, Masayuki Kawashima, Miho Furumichi,
and Mao Tanabe. Data, information, knowledge and principle: back to metabolism in
KEGG. Nucleic Acids Research, 42(D1):D199-D205, 2014.
[49] Shaul Karni, Hermona Soreq, and Roded Sharan. A network-based method for predicting disease-causing genes. Journal of Computational Biology, 16(2):181-189, 2009.
[50] Arthur Kaser and Herbert Tilg. "Metabolic aspects" in inflammatory bowel diseases.
Current Drug Delivery, 9(4):326-332, 2012.
[51] Paul Julian Kersey, James E Allen, Mikkel Christensen, Paul Davis, Lee J Falin,
Christoph Grabmueller, Daniel Seth Toney Hughes, Jay Humphrey, Arnaud Kerhornou, Julia Khobova, et al. Ensembl Genomes 2013: scaling up access to genome-wide
data. Nucleic Acids Research, 42(D1):D546-D552, 2014.
[52] Yoo-Ah Kim, Stefan Wuchty, and Teresa M Przytycka. Identifying causal genes
and dysregulated pathways in complex diseases. PLoS Computational Biology,
7(3):e1001095, 2011.
[53] Young Shin Kim, Bennett L Leventhal, Yun-Joo Koh, Eric Fombonne, Eugene Laska,
Eun-Chung Lim, Keun-Ah Cheon, Soo-Jeong Kim, Young-Key Kim, HyunKyung Lee,
et al. Prevalence of autism spectrum disorders in a total population sample. American
Journal of Psychiatry, 168(9):904-912, 2011.
[54] Michael D Kogan, Stephen J Blumberg, Laura A Schieve, Coleen A Boyle, James M
Perrin, Reem M Ghandour, Gopal K Singh, Bonnie B Strickland, Edwin Trevathan,
and Peter C van Dyck. Prevalence of parent-reported diagnosis of autism spectrum
disorder among children in the US, 2007. Pediatrics, 124(5):1395-1403, 2009.
[55] Isaac S Kohane, Andrew McMurry, Griffin Weber, Douglas MacFadden, Leonard Rappaport, Louis Kunkel, Jonathan Bickel, Nich Wattanasin, Sarah Spence, Shawn Murphy, et al. The co-morbidity burden of children and young adults with autism spectrum
disorders. PloS One, 7(4):e33224, 2012.
[56] Sebastian Köhler, Sebastian Bauer, Denise Horn, and Peter N Robinson. Walking the
interactome for prioritization of candidate disease genes. The American Journal of
Human Genetics, 82(4):949-958, 2008.
[57] Michael Krauthammer, Charles A Kaufmann, T Conrad Gilliam, and Andrey Rzhetsky. Molecular triangulation: bridging linkage and molecular-network information
for identifying candidate genes in Alzheimer's disease. Proceedings of the National
Academy of Sciences of the United States of America, 101(42):15148-15153, 2004.
[58] Kasper Lage, E Olof Karlberg, Zenia M Størling, Páll Í Ólason, Anders G Pedersen,
Olga Rigina, Anders M Hinsby, Zeynep Tümer, Flemming Pociot, Niels Tommerup,
et al. A human phenome-interactome network of protein complexes implicated in
genetic disorders. Nature Biotechnology, 25(3):309-316, 2007.
[59] Eunjung Lee, Hyunchul Jung, Predrag Radivojac, Jong-Won Kim, and Doheon Lee.
Analysis of AML genes in dysregulated molecular networks. BMC Bioinformatics,
10(Suppl 9):S2, 2009.
[60] Charles William Lees. Role of the hedgehog signalling pathway in inflammatory bowel
disease. PhD thesis, University of Edinburgh, 2009.
[61] CW Lees, JC Barrett, M Parkes, and J Satsangi. New IBD genetics: common pathways
with other diseases. Gut, 60(12):1739-1753, 2011.
[62] Dan Levy, Michael Ronemus, Boris Yamrom, Yoon-ha Lee, Anthony Leotta, Jude
Kendall, Steven Marks, B Lakshmi, Deepa Pai, Kenny Ye, et al. Rare de novo and
transmitted copy-number variation in autistic spectrum disorders. Neuron, 70(5):886-897,
2011.
[63] Yongjin Li and Jagdish C Patra. Genome-wide inferring gene-phenotype relationship
by walking on the heterogeneous network. Bioinformatics, 26(9):1219-1224, 2010.
[64] Luana Licata, Leonardo Briganti, Daniele Peluso, Livia Perfetto, Marta Iannuccelli,
Eugenia Galeota, Francesca Sacco, Anita Palma, Aurelio Pio Nardozza, Elena Santonico, et al. MINT, the molecular interaction database: 2012 update. Nucleic Acids
Research, 40(D1):D857-D861, 2012.
[65] Bolan Linghu, Evan S Snitkin, Zhenjun Hu, Yu Xia, and Charles DeLisi. Genome-wide
prioritization of disease genes and identification of disease-disease associations from
an integrated human functional linkage network. Genome Biology, 10(9):R91, 2009.
[66] Li Liu, Jing Lei, Stephan J Sanders, Arthur Jeremy Willsey, Yan Kou, Abdullah Ercument Cicek, Lambertus Klei, Cong Lu, Xin He, Mingfeng Li, et al. DAWN: a framework to identify autism genes and subnetworks using gene expression and genetics.
Molecular Autism, 5(1):22, 2014.
[67] Manway Liu, Arthur Liberzon, Sek Won Kong, Weil R Lai, Peter J Park, Isaac S
Kohane, and Simon Kasif. Network-based analysis of affected biological processes in
type 2 diabetes models. PLoS Genetics, 3(6):e96, 2007.
[68] Donna Maglott, Jim Ostell, Kim D Pruitt, and Tatiana Tatusova. Entrez Gene: gene-centered information at NCBI. Nucleic Acids Research, 39(suppl 1):D52-D57, 2011.
[69] Christian R Marshall and Stephen W Scherer. Detection and characterization of copy
number variation in autism spectrum disorder. In Genomic Structural Variants, pages
115-135. Springer, 2012.
[70] Douglas R Mathern, Avantika Chitre, Lloyd Mayer, and Stephanie Dahan. The Notch
signaling pathway mediates tight junction protein stoichiometry in IBD: P-203. Inflammatory Bowel Diseases, 17:S72, 2011.
[71] Hans-Werner Mewes, Sabine Dietmann, Dmitrij Frishman, Richard Gregory, Gertrud
Mannhaupt, Klaus FX Mayer, Martin Münsterkötter, Andreas Ruepp, Manuel Spannagl, Volker Stümpflen, et al. MIPS: analysis and annotation of genome information
in 2007. Nucleic Acids Research, 36(suppl 1):D196-D201, 2008.
[72] Marcela K Monaco, Joshua Stein, Sushma Naithani, Sharon Wei, Palitha Dharmawardhana, Sunita Kumari, Vindhya Amarasinghe, Ken Youens-Clark, James
Thomason, Justin Preece, et al. Gramene 2013: comparative plant genomics resources.
Nucleic Acids Research, 42(D1):D1193-D1199, 2014.
[73] Linda B Moran and Manuel B Graeber. Towards a pathway definition of Parkinson's disease: a complex disorder with links to cancer, diabetes and inflammation.
Neurogenetics, 9(1):1-13, 2008.
[74] Eric M Morrow, Seung-Yun Yoo, Steven W Flavell, Tae-Kyung Kim, Yingxi Lin,
Robert Sean Hill, Nahit M Mukaddes, Soher Balkhy, Generoso Gascon, Asif Hashmi,
et al. Identifying autism loci and genes by tracing recent shared ancestry. Science,
321(5886):218-223, 2008.
[75] Saket Navlakha and Carl Kingsford. The power of protein interaction networks for
associating genes with diseases. Bioinformatics, 26(8):1057-1063, 2010.
[76] Rod K Nibbe, Mehmet Koyutürk, and Mark R Chance. An integrative-omics approach
to identify functional sub-networks in human colorectal cancer. PLoS Computational
Biology, 6(1):e1000639, 2010.
[77] Rod K Nibbe, Sanford Markowitz, Lois Myeroff, Rob Ewing, and Mark R Chance.
Discovery and scoring of protein interaction subnetworks discriminative of late stage
human colon cancer. Molecular & Cellular Proteomics, 8(4):827-845, 2009.
[78] Yuhei Nishimura, Christa L Martin, Araceli Vazquez-Lopez, Sarah J Spence, Ana Isabel Alvarez-Retuerto, Marian Sigman, Corinna Steindler, Sandra Pellegrini, N Carolyn Schanen, Stephen T Warren, et al. Genome-wide expression profiling of lymphoblastoid cell lines distinguishes different forms of autism and reveals shared pathways. Human Molecular Genetics, 16(14):1682-1698, 2007.
[79] Tiago Nunes, Claudio Bernardazzi, and Heitor S de Souza. Cell Death and Inflammatory Bowel Diseases: Apoptosis, Necrosis, and Autophagy in the Intestinal Epithelium.
BioMed Research International, 2014, 2014.
[80] International Molecular Genetic Study of Autism Consortium et al. A full genome
screen for autism with evidence for linkage to a region on chromosome 7q. Human
Molecular Genetics, 7(3), 1998.
[81] International Molecular Genetic Study of Autism Consortium et al. A genomewide
screen for autism: strong evidence for linkage to chromosomes 2q, 7q, and 16p. American Journal of Human Genetics, 69(3):570, 2001.
[82] Stephen Oliver. Proteomics: guilt-by-association goes global. Nature, 403(6770):601-603,
2000.
[83] Brian J O'Roak, Laura Vives, Santhosh Girirajan, Emre Karakoc, Niklas Krumm,
Bradley P Coe, Roie Levy, Arthur Ko, Choli Lee, Joshua D Smith, et al. Sporadic
autism exomes reveal a highly interconnected protein network of de novo mutations.
Nature, 485(7397):246-250, 2012.
[84] Martin Oti and Han G Brunner. The modular nature of genetic diseases. Clinical
Genetics, 71(1):1-11, 2007.
[85] Martin Oti, Berend Snel, Martijn A Huynen, and Han G Brunner. Predicting disease
genes using protein-protein interactions. Journal of Medical Genetics, 43(8):691-698,
2006.
[86] Sally Ozonoff, Gregory S Young, Alice Carter, Daniel Messinger, Nurit Yirmiya, Lonnie Zwaigenbaum, Susan Bryson, Leslie J Carver, John N Constantino, Karen Dobkins,
et al. Recurrence risk for autism spectrum disorders: a Baby Siblings Research Consortium study. Pediatrics, 128(3):e488-e495, 2011.
[87] Neelroop N Parikshak, Rui Luo, Alice Zhang, Hyejung Won, Jennifer K Lowe, Vijayendran Chandran, Steve Horvath, and Daniel H Geschwind. Integrative functional
genomic analyses implicate specific molecular pathways and circuits in autism. Cell,
155(5):1008-1021, 2013.
[88] Luca Pastorelli, Carlo De Salvo, Marissa A Cominelli, Maurizio Vecchi, and Theresa T
Pizarro. Novel cytokine signaling pathways in inflammatory bowel disease: insight into
the dichotomous functions of IL-33 during chronic intestinal inflammation. Therapeutic
Advances in Gastroenterology, 4(5):311-323, 2011.
[89] Carolina Perez-Iratxeta, Peer Bork, and Miguel A Andrade-Navarro. Update of the
G2D tool for prioritization of gene candidates to inherited diseases. Nucleic Acids
Research, 35(suppl 2):W212-W216, 2007.
[90] Dalila Pinto, Alistair T Pagnamenta, Lambertus Klei, Richard Anney, Daniele Merico,
Regina Regan, Judith Conroy, Tiago R Magalhaes, Catarina Correia, Brett S Abrahams, et al. Functional impact of global rare copy number variation in autism spectrum
disorders. Nature, 466(7304):368-372, 2010.
[91] G Poelmans, B Franke, DL Pauls, JC Glennon, and JK Buitelaar. AKAPs integrate
genetic findings for autism spectrum disorders. Translational Psychiatry, 3(6):e270,
2013.
[92] Predrag Radivojac, Kang Peng, Wyatt T Clark, Brandon J Peters, Amrita Mohan,
Sean M Boyle, and Sean D Mooney. An integrated approach to inferring gene-disease associations in humans. Proteins: Structure, Function, and Bioinformatics,
72(3):1030-1037, 2008.
[93] Monika Ray, Jianhua Ruan, and Weixiong Zhang. Variations in the transcriptome
of Alzheimer's disease reveal molecular networks involved in cardiovascular diseases.
Genome Biology, 9(10):R148, 2008.
[94] Richard Redon, Shumpei Ishikawa, Karen R Fitch, Lars Feuk, George H Perry,
T Daniel Andrews, Heike Fiegler, Michael H Shapero, Andrew R Carson, Wenwei Chen, et al. Global variation in copy number in the human genome. Nature,
444(7118):444-454, 2006.
[95] Angelica Ronald, Francesca Happé, Patrick Bolton, Lee M Butcher, Thomas S Price,
Sally Wheelwright, Simon Baron-Cohen, and Robert Plomin. Genetic heterogeneity
between the three components of the autism spectrum: a twin study. Journal of the
American Academy of Child & Adolescent Psychiatry, 45(6):691-699, 2006.
[96] Rebecca E Rosenberg, J Kiely Law, Gayane Yenokyan, John McGready, Walter E
Kaufmann, and Paul A Law. Characteristics and concordance of autism spectrum
disorders among 277 twin pairs. Archives of Pediatrics & Adolescent Medicine,
163(10):907-914, 2009.
[97] Lukasz Salwinski, Christopher S Miller, Adam J Smith, Frank K Pettit, James U
Bowie, and David Eisenberg. The database of interacting proteins: 2004 update.
Nucleic Acids Research, 32(suppl 1):D449-D451, 2004.
[98] Rodney C Samaco, Amber Hogart, and Janine M LaSalle. Epigenetic overlap in
autism-spectrum neurodevelopmental disorders: MECP2 deficiency causes reduced
expression of UBE3A and GABRB3. Human Molecular Genetics, 14(4):483-492, 2005.
[99] Carl F Schaefer, Kira Anthony, Shiva Krupa, Jeffrey Buchoff, Matthew Day, Timo
Hannay, and Kenneth H Buetow. PID: the pathway interaction database. Nucleic
Acids Research, 37(suppl 1):D674-D679, 2009.
[100] Patrick R Schmid, Nathan P Palmer, Isaac S Kohane, and Bonnie Berger. Making
sense out of massive data by going beyond differential expression. Proceedings of the
National Academy of Sciences, 109(15):5594-5599, 2012.
[101] Jonathan Sebat, B Lakshmi, Dheeraj Malhotra, Jennifer Troge, Christa Lese-Martin,
Tom Walsh, Boris Yamrom, Seungtai Yoon, Alex Krasnitz, Jude Kendall, et al. Strong
association of de novo copy number mutations with autism. Science, 316(5823):445-449,
2007.
[102] David Q Shih and Stephan R Targan. Insights into IBD pathogenesis. Current Gastroenterology Reports, 11(6):473-480, 2009.
[103] Chris Stark, Bobby-Joe Breitkreutz, Teresa Reguly, Lorrie Boucher, Ashton Breitkreutz, and Mike Tyers. BioGRID: a general repository for interaction datasets.
Nucleic Acids Research, 34(suppl 1):D535-D539, 2006.
[104] Jennifer L Stone, Barry Merriman, Rita M Cantor, Amanda L Yonan, T Conrad
Gilliam, Daniel H Geschwind, and Stanley F Nelson. Evidence for sex-specific risk
alleles in autism spectrum disorder. American Journal of Human Genetics, 75(6):1117-1123,
2004.
[105] Aravind Subramanian, Pablo Tamayo, Vamsi K Mootha, Sayan Mukherjee, Benjamin L Ebert, Michael A Gillette, Amanda Paulovich, Scott L Pomeroy, Todd R
Golub, Eric S Lander, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National
Academy of Sciences of the United States of America, 102(43):15545-15550, 2005.
[106] Satoshi Sumi, Hiroko Taniai, Taishi Miyachi, and Mitsuyo Tanemura. Sibling risk of
pervasive developmental disorder estimated by means of an epidemiologic survey in
Nagoya, Japan. Journal of Human Genetics, 51(6):518-522, 2006.
[107] Peter Szatmari, Andrew D Paterson, Lonnie Zwaigenbaum, Wendy Roberts, Jessica
Brian, Xiao-Qing Liu, John B Vincent, Jennifer L Skaug, Ann P Thompson, Lill
Senman, et al. Mapping autism risk loci using genetic linkage and chromosomal rearrangements. Nature Genetics, 39(3):319-328, 2007.
[108] Hiroko Taniai, Takeshi Nishiyama, Taishi Miyachi, Masayuki Imaeda, and Satoshi
Sumi. Genetic influences on the broad spectrum of autism: Study of proband-ascertained twins. American Journal of Medical Genetics Part B: Neuropsychiatric
Genetics, 147(6):844-849, 2008.
[109] Ian W Taylor, Rune Linding, David Warde-Farley, Yongmei Liu, Catia Pesquita,
Daniel Faria, Shelley Bull, Tony Pawson, Quaid Morris, and Jeffrey L Wrana. Dynamic modularity in protein interaction networks predicts breast cancer outcome.
Nature Biotechnology, 27(2):199-204, 2009.
[110] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the
Royal Statistical Society. Series B (Methodological), pages 267-288, 1996.
[111] Nicki Tiffin, Euan Adie, Frances Turner, Han G Brunner, Marc A van Driel, Martin Oti, Nuria Lopez-Bigas, Christos Ouzounis, Carolina Perez-Iratxeta, Miguel A
Andrade-Navarro, et al. Computational disease gene identification: a concert of methods prioritizes type 2 diabetes and obesity candidate genes. Nucleic Acids Research,
34(10):3067-3081, 2006.
[112] Marc A van Driel, Jorn Bruggeman, Gert Vriend, Han G Brunner, and Jack AM
Leunissen. A text-mining analysis of the human phenome. European Journal of Human
Genetics, 14(5):535-542, 2006.
[113] Oron Vanunu, Oded Magger, Eytan Ruppin, Tomer Shlomi, and Roded Sharan. Associating genes and protein complexes with disease via network propagation. PLoS
Computational Biology, 6(1):e1000641, 2010.
[114] Marc Vidal. A unifying view of 21st century systems biology. FEBS Letters,
583(24):3891-3894, 2009.
[115] Christian Von Mering, Lars J Jensen, Michael Kuhn, Samuel Chaffron, Tobias Doerks,
Beate Krüger, Berend Snel, and Peer Bork. STRING 7-recent developments in the
integration and prediction of protein interactions. Nucleic Acids Research, 35(suppl
1):D358-D362, 2007.
[116] Kai Wang, Haitao Zhang, Deqiong Ma, Maja Bucan, Joseph T Glessner, Brett S
Abrahams, Daia Salyakina, Marcin Imielinski, Jonathan P Bradfield, Patrick MA
Sleiman, et al. Common genetic variants on 5p14.1 associate with autism spectrum
disorders. Nature, 459(7246):528-533, 2009.
[117] Xiujuan Wang, Natali Gulbahce, and Haiyuan Yu. Network-based methods for human
disease gene prediction. Briefings in Functional Genomics, 10(5):280-293, 2011.
[118] Jia Wei and Jiexiong Feng. Signaling pathways associated with inflammatory bowel
disease. Recent Patents on Inflammation & Allergy Drug Discovery, 4(2):105-117,
2010.
[119] A Jeremy Willsey, Stephan J Sanders, Mingfeng Li, Shan Dong, Andrew T
Tebbenkamp, Rebecca A Muhle, Steven K Reilly, Leon Lin, Sofia Fertuzinhos,
Jeremy A Miller, et al. Coexpression networks implicate human midfetal deep cortical
projection neurons in the pathogenesis of autism. Cell, 155(5):997-1007, 2013.
[120] Christof Winter, Glen Kristiansen, Stephan Kersting, Janine Roy, Daniela Aust,
Thomas Knösel, Petra Rimmele, Beatrix Jahnke, Vera Hentrich, Felix Rickert, et al.
Google goes cancer: improving outcome prediction for cancer patients by network-based ranking of marker genes. PLoS Computational Biology, 8(5):e1002511, 2012.
[121] Xuebing Wu, Rui Jiang, Michael Q Zhang, and Shao Li. Network-based global inference of human disease genes. Molecular Systems Biology, 4(1), 2008.
[122] Xuebing Wu, Qifang Liu, and Rui Jiang. Align human interactome with phenome
to identify causative genes and networks underlying disease families. Bioinformatics,
25(1):98-104, 2009.
[123] Amanda L Yonan, Maricela Alarcon, Rong Cheng, Patrik KE Magnusson, Sarah J
Spence, Abraham A Palmer, Adina Grunn, Suh-Hang Hank Juo, Joseph D Terwilliger,
Jianjun Liu, et al. A genomewide screen of 345 families for autism-susceptibility loci.
The American Journal of Human Genetics, 73(4):886-897, 2003.
[124] Mengjin Zhu and Shuhong Zhao. Candidate gene identification approach: progress
and challenges. International Journal of Biological Sciences, 3(7):420, 2007.