Phenotypic classification of genetic skin diseases reveals

advertisement
The expanded human disease network combining the protein-protein interaction
information
Xuehong Zhang1,†, Ruijie Zhang1,†,*, Yongshuai Jiang1,†, Peng Sun1,†, Guoping Tang1,
Xing Wang1, Hongchao Lv1, Xia Li1, 2,**
Supplementary Information
1. Disease-gene association and protein-protein interaction data
The Genetic Association Database (GAD)1 collected, standardized and archived almost
all of the genetic association study data stored in many published literatures. Each entry
of the GAD is composed of fifteen fields, disease phenotype (broad phenotype, narrow
phenotype and molecular phenotype), MeSH disease terms, disease expert, disease class,
official gene symbols, association tag ("Y" or "N"), chromosomal location (chromosome
and chromosome band), genomic DNA position, p value, reference and its corresponding
PubMed id and OMIM id. We downloaded the disorder-gene association file as of June 6,
2009. Out of 39,930 GAD entries, we selected 11,571 entries with "Y" tag, for which
there is positive association with the disorder. Then, we removed the records whose gene
symbol is not mapped onto an Entrez ID. Finally, we obtained 6,350 entries by merging
multiple records which reported the same disease gene associated with same disease
published in different literatures into one disease-gene association. We then parsed these
6,350 disorder terms into 1,336 distinct disorders by merging disease subtypes of a single
disease, based on their given disorder names as Goh et al in 20062. Each disease was
assigned a unique disease ID.
To obtain comprehensive and highly reliable human protein-protein interaction (PPI) data,
we selected the Human Protein Reference Database (HPRD)3, in which PPIs are obtained
from literature by manual curation. HPRD_Release_7_09012007 file was downloaded on
July 14, 2009, which contains 37,107 interactions. After eliminating the self-interacting,
and the interactions whose corresponding Entrez ID are tagged as "None", 35,000
interactions between 9,303 genes used in our study are available.
2. Construction of the eHDN
We classified each disorder into 19 disorder classes, following the classification scheme
shown in Fig. 1. The classification is based on the "disease class" tag provided in GAD.
Each gene symbol was mapped onto an Entrez ID, generating the list of disease-gene
associations, which is defined as GAD diseasome. Starting from the GAD diseasome
bipartite graph, we generated the original human disease network (oHDN) by projecting
onto disease spaces2. In the oHDN, nodes represent disorders, and two disorders are
connected to each other if they shared at least one gene.
We constructed the expanded diseasome map by combining the disease-gene information
in GAD with the protein-protein interaction information in HPRD. First, we obtained 559
genes (227 genes belong to the GAD diseasome map) whose corresponding protein
products interact with at least two proteins encoded by genes associated with a disorder.
Second, the remaining 332 genes were added to GAD diseasome, generating the
expanded list of disease-gene associations which is defined GAD-HPRD2 diseasome
available as Supplementary Table S1. Similar to the oHDN, we also got the eHDN
projection (Fig. 1).
3. Gene expression microarray data
The Gene Expression Omnibus (GEO) repository4 at the National Center for
Biotechnology Information (NCBI) archived and freely disseminated microarray and
other forms of high-throughput data generated by the scientific community. To calculate
the coexpression correlation between wild-type human gene transcripts, we used
microarray data available for normal human tissues. GSE7307 and GSE3526 were used
in follow-up analysis.
GSE7307: Normal and diseased human tissues were profiled for gene expression using
the Affymetrix U133 plus 2.0 arrays. It is composed of 677 samples, representing over 90
distinct tissue types. We selected 504 samples involved in 83 healthy tissues to carry out
subsequent analysis. First of all, we used ArrayTools (3.7.1 versions)5 to match Entrez
gene ID, and 41,558 microarray probes (76% of total probes) were assigned to 20,080
Entrez gene ID. In the GAD and GAD-HPRD2 diseasome, 1,594 (97% of 1,639 human
disease genes) and 1,925 (98% of 1,971 genes) genes have corresponding microarray
probes, respectively. Secondly, we followed the way in the previous works6 to determine
the tissue-selective genes using the Significance Analysis of Microarrays (SAM)
algorithm7.
GSE3526: Normal human tissue samples from ten post-mortem donors were processed to
generate total RNA, which was subsequently analyzed for gene expression using
Affymetrix U133 plus 2.0 arrays. Donor information: Donor 1-25 year old male; donor 2
- 38 year old male; donor 3-39 year old female; donor 4-30 year old male; donor 5-35
year old male; donor 6-52 year old male; donor 7-50 year old female; donor 8-48 year old
female; donor 9-53 year old female; donor 10-23 year old female. It was composed of
353 samples, representing 65 distinct tissue types. We carried out subsequent analysis
using 63 healthy tissues which contain at least two samples. Similarly, we used
ArrayTools to match Entrez gene ID and determined the tissue-selective genes using
SAM algorithm, and 34,000 microarray probes (76% of total probes) were assigned to
17,906 Entrez gene ID. In the GAD and GAD-HPRD2 diseasome, 1,471 (89% of 1,639
human disease genes) and 1,781 (90% of 1,971 genes) genes have corresponding
microarray probes, respectively.
4. Computation of dyadicity D and heterophilicity H
Similar to Jiang, X., et al.8, we defined the value of a phenotype/disease based on whether
it belongs (1) or does not belong (0) to a disease class. Thus three types of links between
phenotype/disease exist: 1-1, 1-0, and 0-0; the number of these links are termed m11 ,
m10 and m00 ,
respectively.
The two parameters dyadicity D and heterophilicity H are defined as:
D
m11
and H  m10 ,
m11
m10
where m11 , m10 represent the expected values of m11 , m10 respectively. These parameters
can successfully characterize the modular structure of protein-protein interaction
networks9.
The expected value of m11 and m10 is computed next. If we take cancer as an example, we
can call n1 the number of phenotypes/diseases belonging to cancer and n0 the number of
other phenotypes/diseases. N  n1  n0 is the total number of phenotypes/diseases and
is the total number of edges in the network. Let
p
2M
N ( N  1)
M
represent the competence that
indicates the average probability that two phenotypes/diseases are connected in the
network. The value of a phenotype/disease depends on whether it belongs to a cancer
class (1), or does not (0). The three varieties of link styles between phenotypes are 1-1, 10, and 0-0, and the number of these links can be labeled as m11 , m10 and m00 respectively.
If any phenotype/disease in the network has an equal chance of being cancer, the
expected values of m11 and m10 are m11 and m10 respectively 9.
n 
n (n  1)
m11 =  1   p  1 1
p,
2
2 
 n  n 
m10 =  1   0   p  n1 ( N  n1 ) p .
1 1 
Statistically significant deviations of m11 and m10 from their expected values of m11 and
m10 imply that cancer are not distributed randomly in the HDN network.
Dyadicity
D  1( D  1)
indicates that phenotypes/diseases in the disease class tend to
connect more (less) densely among themselves than expected for a random configuration.
Similarly, heterophilicity
H 1( H  1 )
means that phenotypes/diseases in the disease
class have more (fewer) connections to phenotypes in other classes than expected
randomly.
5. Gene ontology homogeneity analysis
The HDN based on the Online Mendelian Inheritance in Man (OMIM)10 has been
reported previously2 that it showed modular organization then a group of genes
associated with the same common disorder share similar cellular and functional
characteristics, as annotated in Gene Ontology11. Here, we also investigated the GO
homogeneity (GH) for the original and expanded HDN. The GH of a disorder i is defined
as the maximum fraction of genes in the same disorder that have the same GO terms,
GHi = maxj [nji/ni],
where ni denotes the number of genes in the disorder i that have any GO annotations, and
nji the number of genes that have the specific GO term j. We calculated GHi not only
considering the GO annotations as a whole (GW), but also calculated GHi separately for
each branch of GO, biological process (BP), molecular function (MF), and cellular
component (CC).
To obtain the random control of the GO homogeneity distribution for each disorder, we
picked the same number of genes randomly in the GO annotation data and calculated
their GO homogeneity. We generated 104 random instances to reach statistical
significance.
6. KEGG (Kyoto Encyclopedia of Genes and Genomes) homogeneity analysis
Functionally related genes generally exhibited genes which belong to the same cellular
pathways8,12, such as annotated in Kyoto Encyclopedia of Genes and Genomes
(KEGG)13. To investigate whether a group of genes associated with the same disorder
have a tendency to share the same cellular pathways, we measured the KEGG
homogeneity (KH) of each disorder here the maximum fraction of genes in the same
disorder that have the same KEGG terms. Similar to GH, it is defined as
KHi = maxj [nji/ni],
where in this case ni denotes the number of genes in the disorder i that have any KEGG
annotations, and nji the number of genes that have the specific KEGG term j. To reduce
the bias, we removed the corresponding disease pathways from KEGG when the KEGG
homogeneity is calculated for a certain disease.
To obtain the random control of the KEGG homogeneity distribution for each disorder,
we picked the same number of genes randomly in the KEGG annotation data and
calculated their KEGG homogeneity. We generated 104 random instances to reach
statistical significance.
7. Subcellular location analysis
The function of a protein and its role in a cell are closely correlated with its subcellular
locations or environments14. For example, the target proteins and non-target proteins are
different in subcellular locations15. Here, we further analyzed whether the corresponding
protein products of the genes in the common disease module have a tendency to gather in
the same subcellular location based on another annotation. Swiss-Prot16 is a manually
annotated protein sequence and knowledge database that is valued for its high quality
annotation, the usage of standardized nomenclature, integration with other databases and
minimal redundancy. We extracted the subcellular location information from the SwissProt database on September 10, 2009. However, only 45% of the proteins have
subcellular location annotations derived from the experimental observation for the all
human proteins stored in the Swiss-Prot database. We used the Hum-mPLoc Classifier14
to predict their subcellular location for those proteins which have not annotations or
annotated with uncertain labels such as probable’’, ‘‘potential’’, ‘‘perhaps’’, and ‘‘by
similarity’’.
Then, we measured the subcellular location homogeneity (SLH) of each disorder here the
maximum fraction of genes in the same disorder that have the same subcellular location.
Similarly, it is defined as
SLHi = maxj [nji/ni],
where in this case ni denotes the number of genes in the disorder i that have any
subcellular location annotations, and nji the number of genes that have the specific
subcellular location j.
To obtain the random control of the subcellular location homogeneity distribution for
each disorder, we picked the same number of genes randomly in the subcellular location
annotation data and calculated their subcellular location homogeneity. We generated 104
random instances to reach statistical significance.
8. Tissue homogeneity
The tissue homogeneity (TH) coefficient proposed by Goh, et al.2 quantified whether
genes that are implicated in the same disorders tend to be expressed in similar human
tissues. The TH of a disorder i, is defined as
THi = maxj [nji/ni],
where ni denotes the number of genes in the disorder i that are expressed in at least one
tissue, nji the number of genes that are expressed in the tissue j among them, and maxj [×]
denotes the function returning the maximum-value argument across j. TH has the
maximal value 1 if all of the genes are expressed together in at least one tissue, and takes
the minimum value 1/n when all are expressed in different tissues. To obtain the random
control of the tissue homogeneity distribution, we picked the same number of genes
randomly in the microarray data for each disorder and calculated their tissue
homogeneity. We generated 104 random instances to reach statistical significance.
9. Random controls for gene expression analysis
For the eHDN, to obtain the random control of the Pearson correlation coefficient (PCC)
distributions for the gene expression in Fig. 2F and G, we calculated the distribution of
all gene pairs in the microarray data (Fig. 2F) and the average PCC between the same
number of genes chosen randomly from the microarray data (Fig. 2G). Similarly, we also
obtained the random controls of the cosin correlation distance and the corresponding
average for the eHDN. We performed 104 independent runs to obtain significant
statistics.
10. Comparative analysis
To illustrate the credibility of our eHDN, we carried out a comparative analysis of eHDN
and oHDN. We found that the eHDN is highly consistent with the oHDN over all
topological and functional characteristics. To further demonstrate the reliability of eHDN,
we eliminated the real disease genes from the GAD-HPRD2 diseasome and obtained the
HPRD2 diseasome and its HDN projection. We then analyzed the topological and
functional properties of the HDN projection and found it to be consistent with the oHDN.
Firstly, the distribution of k and C(k) are also both significantly different from the
random controls (p value <5.7e-6, p value <2.2e-16). For the GO and KEGG homogeneity,
we also found a significant elevation with respect to random expectations (p value <2.0e11
, p value <1.0e-5). In the HDN projection of HPRD2 diseasome, subcellular location
homogeneity and tissue homogeneity all significantly differ from the random controls (p
value <1.0e-5, p value <2.7e-5). In addition, the coexpression and synchronized expression
features are significantly different from the random controls (p value <1.0e-5). This result,
to some extent, conforms that the properties of HDN projection obtained by eliminating
the true disease genes from GAD-HPRD2 diseasome are similar to oHDN. Therefore, we
feel confident that our eHDN constructed by combining disease gene information with
PPI information is reliable.
SUPPLEMENTAL REFERENCES
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
12.
13.
14.
15.
16.
Becker KG, Barnes KC, Bright TJ, Wang SA: The genetic association database.
Nat Genet 2004; 36: 431-432.
Goh KI, Cusick ME, Valle D, Childs B, Vidal M, Barabasi AL: The human
disease network. Proc Natl Acad Sci U S A 2007; 104: 8685-8690.
Peri S, Navarro JD, Kristiansen TZ et al: Human protein reference database as a
discovery resource for proteomics. Nucleic Acids Res 2004; 32: D497-501.
Edgar R, Domrachev M, Lash AE: Gene Expression Omnibus: NCBI gene
expression and hybridization array data repository. Nucleic Acids Res 2002; 30:
207-210.
Xu X, Zhao Y, Simon R: Gene Set Expression Comparison kit for BRBArrayTools. Bioinformatics 2008; 24: 137-139.
Hsiao LL, Dangond F, Yoshida T et al: A compendium of gene expression in
normal human tissues. Physiol Genomics 2001; 7: 97-104.
Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to
the ionizing radiation response. Proc Natl Acad Sci U S A 2001; 98: 5116-5121.
Jiang X, Liu B, Jiang J et al: Modularity in the genetic disease-phenotype
network. FEBS Lett 2008; 582: 2549-2554.
Park J, Barabasi AL: Distribution of node characteristics in complex networks.
Proc Natl Acad Sci U S A 2007; 104: 17916-17920.
Hamosh A, Scott AF, Amberger JS, Bocchini CA, McKusick VA: Online
Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and
genetic disorders. Nucleic Acids Res 2005; 33: D514-517.
Harris MA, Clark J, Ireland A et al: The Gene Ontology (GO) database and
informatics resource. Nucleic Acids Res 2004; 32: D258-261.
Ravasz E, Barabasi AL: Hierarchical organization in complex networks. Phys Rev
E Stat Nonlin Soft Matter Phys 2003; 67: 026112.
Kanehisa M, Goto S: KEGG: kyoto encyclopedia of genes and genomes. Nucleic
Acids Res 2000; 28: 27-30.
Shen HB, Chou KC: Hum-mPLoc: an ensemble classifier for large-scale human
protein subcellular location prediction by incorporating samples with multiple
sites. Biochem Biophys Res Commun 2007; 355: 1006-1011.
Bakheet TM, Doig AJ: Properties and identification of human protein drug
targets. Bioinformatics 2009; 25: 451-457.
Boeckmann B, Bairoch A, Apweiler R et al: The SWISS-PROT protein
knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res 2003; 31:
365-370.
Download