Role of Systems Biology in the discovery of genetic basis of complex cardiovascular diseases

Roozbeh Arshadi

Abstract: Detection of the genetic variants contributing to complex, polygenic cardiovascular diseases is inherently difficult, since most forms of these diseases result from many genes with small effects, further complicated by the effects of the environment and of other genes. This paper discusses recombination-mapping techniques, both linkage and association studies, as the currently prevalent methods for detecting the genetic basis of complex disease traits. Despite some success, these methods encounter many problems and obstacles. In recent years, systems biology inspired tools and methodologies, coupled with advances in genomics, have attempted to overcome some of these shortcomings. These include, among others, integrative approaches such as the combination of linkage mapping with physiological profiling, novel pattern detection algorithms, and the use of neural networks as classifiers in association studies. Finally, future progress requires not only genomic studies but also an integration of transcriptome, proteome, and phenome data to give a more complete picture of the complex interacting networks contributing to the disease.

Index Terms: Association studies, Cardiovascular disease, Linkage mapping, Systems Biology.

I. INTRODUCTION

Each year, cardiovascular diseases claim the lives of close to one million people in the US alone [9]. Among cardiovascular diseases (CVD), hypertension and atherosclerosis have received a great deal of attention. Hypertension, affecting more than 50 million people in the US alone [7], is a multifactorial disease that develops as a consequence of errors in the biological systems that determine blood pressure [10]. Atherosclerosis, a primary cause of coronary heart disease, is best described as a buildup of fatty substances, cholesterol, cellular waste products, calcium and other substances in the inner lining of an artery.
The existence of a genetic basis for these diseases has been well established. In the case of hypertension, genetic determinants account for 30 to 50% of blood pressure variation among individuals [5]. Twin studies have also confirmed that, for coronary heart disease, 40 to 60% of the variance of the disease correlates with genetic differences [4].

These common cardiovascular diseases are complex in that they involve an interplay of many genetic variations in molecular and biochemical pathways and their interactions with environmental factors [8]. Whereas in monogenic diseases a single gene or allele determines the disease phenotype, in polygenic diseases such as atherosclerosis the phenotype is a product of many genes with small effects, further complicated by environmental interactions. Studies of mice have revealed that over 100 genes influence the development of atherosclerotic lesions [4].

Using traditional positional cloning approaches, over 2000 different Mendelian disease genes have been identified [4]. However, the non-Mendelian, polygenic nature of most CVD makes it difficult to locate the influential genes. Hence, there is a great need to tailor the methodology, using advances in genomics and systems biology, to make it more effective in dealing with such complexity.

Recombination mapping has been used extensively, with varying degrees of success, to identify disease-causing genes and mutations [1]. It is important to have a clear understanding of the fundamentals of the methodology before delving into its strengths and limitations in dealing with complex disease phenotypes, and ultimately discussing its potential as a discovery tool within a systems biology framework.

II. FUNDAMENTALS OF RECOMBINATION MAPPING

Human chromosomes exist in homologous pairs (with corresponding DNA sequences, each from a different parent).
A source of diversity between generations is the occurrence of crossovers between homologous chromosomes during meiosis [3]. The closer two loci are on a chromosome, the lower the chance of a crossover between them and the higher the chance that they will stay together in the next generation (i.e. cosegregate).

This phenomenon forms the basis of recombination mapping as a tool for determining the regions of the chromosome or the alleles linked with a particular disease phenotype. In simplified form, the method is as follows: one first genotypes (determines the specific alleles of) a sample population at a number of locations on the chromosome (markers) and then, based on a statistical examination of the results, determines whether a particular locus/allele cosegregates with the disease trait. If such evidence is found, one can infer that a locus (location on the chromosome) influencing the trait is near the locus that cosegregated with the disease trait [1].

Large-scale experiments and statistical analysis are key to recombination mapping. The identification of genes underlying complex CVD requires a truly multidisciplinary approach involving geneticists, molecular biologists (development of assays and markers), bioinformaticists (to store and manipulate the data), and statisticians (development of algorithms to assess cosegregation), to name a few [1].

There are two prevalent methods of recombination mapping: linkage mapping and association (linkage disequilibrium) studies. Regardless of the methodology (linkage or association), recombination mapping follows one of two strategies: the candidate-gene approach or the total genome scan [8]. The first is a hypothesis-testing approach in which a suspected gene or region of the chromosome is tested. The other, the total genome scan, is a hypothesis-generating approach.
In this case, a great number of markers (polymorphisms on a chromosome to be tested for linkage/association) along the genome are used to locate regions which might contain genes influencing the trait.

A. Linkage Mapping

Linkage between two locations on a chromosome is a function of their distance: the closer they are, the higher the probability that they will not be separated by recombination events. Therefore, linkage of two loci can be tested by counting the frequency of recombination between them [3]. The lower the recombination frequency, the higher the probability that they are located close to each other on the chromosome. This principle, combined with the use of known markers (variants on the chromosomes with known locations), can be used to identify the chromosomal location of gene variants related to a given disease [8].

Linkage analysis requires family-based sample collections [3]. As an example, a common linkage study is performed in the following manner [3]: affected sibling pairs are genotyped and the degree of similarity between them at a specific number of genetic markers is assessed. If the degree of similarity at a specific marker is significantly different from that expected from Mendelian segregation (where alleles do not cosegregate), then one can infer that the disease is linked to that marker (i.e. the disease-causing region is close to the marker and therefore cosegregates with it). The result of a linkage study is typically the identification of portions of the chromosome linked to a particular trait/phenotype. These regions are referred to as QTL (quantitative trait loci).

B. Association Studies

While linkage maps attempt to determine locations (loci) on the chromosomes linked to a particular disease trait, association studies attempt to determine the association of a particular allele to the disease trait [3].
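On the linkage side of this contrast, the recombination-counting principle described under Linkage Mapping can be illustrated with a short Python sketch. It is a minimal illustration and not a method taken from the cited studies: the function names and the example counts are hypothetical, and the LOD score (the base-10 log of the likelihood ratio between linkage at the estimated recombination fraction and free recombination at 0.5) is a standard statistic brought in here as an assumption, since the text itself does not name one. The sketch also assumes that every meiosis can be scored unambiguously as recombinant or non-recombinant, which real pedigree data rarely allow.

```python
import math


def recombination_fraction(recombinants: int, meioses: int) -> float:
    """Fraction of informative meioses in which marker and trait locus separated."""
    if meioses <= 0 or not 0 <= recombinants <= meioses:
        raise ValueError("need meioses > 0 and 0 <= recombinants <= meioses")
    return recombinants / meioses


def log10_likelihood(theta: float, recombinants: int, meioses: int) -> float:
    """log10 of theta^r * (1 - theta)^(n - r), guarding against log10(0)."""
    non_recombinants = meioses - recombinants
    if (theta == 0.0 and recombinants) or (theta == 1.0 and non_recombinants):
        return float("-inf")
    total = 0.0
    if recombinants:
        total += recombinants * math.log10(theta)
    if non_recombinants:
        total += non_recombinants * math.log10(1.0 - theta)
    return total


def lod_score(recombinants: int, meioses: int) -> float:
    """LOD at the estimated recombination fraction versus no linkage (0.5)."""
    # Estimates above 0.5 carry no evidence of linkage, so cap them at 0.5.
    theta_hat = min(recombination_fraction(recombinants, meioses), 0.5)
    return (log10_likelihood(theta_hat, recombinants, meioses)
            - log10_likelihood(0.5, recombinants, meioses))


if __name__ == "__main__":
    # Hypothetical counts: (recombinants, informative meioses) per marker.
    for label, r, n in [("marker near locus", 10, 100),
                        ("unlinked marker", 50, 100)]:
        print(label, recombination_fraction(r, n), round(lod_score(r, n), 2))
```

With these made-up counts, the marker that recombines with the trait locus in 10 of 100 meioses gives a strongly positive LOD score, while the marker that recombines half the time gives a score of zero, matching the statement that a lower recombination frequency indicates closer loci. Association studies, in contrast, compare allele frequencies between groups rather than counting recombinants within families.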
This is accomplished by genotyping cases (diseased) and controls (non-diseased) and comparing the frequency of specific DNA variants (polymorphisms) between the two groups. Association studies by definition do not require family-based sample collections [8].

III. OBSTACLES FACING RECOMBINATION STUDIES

Several fundamental issues have hampered the effectiveness of linkage and association studies in recent years. Some of these problems have been addressed to a certain extent by new integrative approaches, some of which will be discussed in the next section. The problems alluded to in published papers include a focus on a limited number of complex phenotypes, population stratification and non-homogeneity, lack of resolution resulting in missed associations, and the inadequacy of one-locus approaches [1, 4, 10, 11].

One of the problems with many linkage studies has been the focus on a limited number of complex, high-level phenotypes [1]. For example, in many rat studies of hypertension, blood pressure and heart rate have been used as phenotypes. The result has been the identification of loci on almost every rat chromosome, with confirmed locations on chromosomes 1, 2, 3, 5, 10, and 12 [10]. Therefore, for complex phenotypes, where the contribution of any one gene/variant to the phenotype can be obscured by others [1], linkage mapping might yield wide QTL regions (regions that likely contain genes affecting a trait) on many chromosomes.

Population stratification is another major concern, casting doubt on the validity of some association studies [3]. In association studies, the homogeneity of samples is a very important issue. For example, in a case-control association study, suppose that the sample population is a mixture of two populations, one at high risk for the disease/phenotype in question and one at low risk, with different allele frequencies for a gene used as a marker.
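This mixture scenario can be made concrete with a small numerical sketch in Python. All numbers are made up for illustration (two hypothetical subpopulations of 1000 people each, with disease risks of 0.30 and 0.05 and marker allele frequencies of 0.8 and 0.2), and the marker is constructed to be unrelated to the disease within each subpopulation. The sketch works with expected allele counts rather than sampled data and uses the Pearson chi-square statistic for a 2x2 table without continuity correction, which is an assumed choice of test rather than one prescribed by the cited studies.

```python
from dataclasses import dataclass


@dataclass
class Stratum:
    name: str
    people: int
    disease_risk: float   # probability that a member is a case
    marker_freq: float    # marker allele frequency, identical in cases and controls


def allele_table(s: Stratum) -> tuple:
    """Expected counts: (case marker, case other, control marker, control other)."""
    case_alleles = 2 * s.people * s.disease_risk          # two alleles per person
    control_alleles = 2 * s.people * (1.0 - s.disease_risk)
    return (case_alleles * s.marker_freq,
            case_alleles * (1.0 - s.marker_freq),
            control_alleles * s.marker_freq,
            control_alleles * (1.0 - s.marker_freq))


def chi_square(a: float, b: float, c: float, d: float) -> float:
    """Pearson chi-square for the 2x2 table [[a, b], [c, d]], no correction."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denominator if denominator else 0.0


def pooled(tables: list) -> tuple:
    """Add the strata cell by cell, as a study blind to ancestry would."""
    return tuple(sum(cells) for cells in zip(*tables))


# Hypothetical subpopulations; none of these values come from the cited studies.
STRATA = [
    Stratum("high-risk", people=1000, disease_risk=0.30, marker_freq=0.8),
    Stratum("low-risk", people=1000, disease_risk=0.05, marker_freq=0.2),
]

if __name__ == "__main__":
    tables = [allele_table(s) for s in STRATA]
    for s, t in zip(STRATA, tables):
        print(s.name, "chi-square:", round(chi_square(*t), 3))
    print("pooled chi-square:", round(chi_square(*pooled(tables)), 3))
```

Within each subpopulation the statistic is zero, because cases and controls carry the marker at the same frequency. In the pooled table it is far above 3.84, the usual 5% critical value for one degree of freedom, because most cases come from the high-risk subpopulation, where the marker allele happens to be common.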
The high frequency of our marker allele in the diseased (case) portion might lead the association study to falsely associate the marker gene (although unrelated) with the disease phenotype.

Despite its success in dealing with monogenic phenotypes, the study of complex disease phenotypes through one or a few polymorphisms has shown its limitations [11]. In many cases, a single genetic variant (single locus) might not show observable coinheritance (cosegregation) with the phenotype [1]. To detect the combinatorial effects of multiple variants (loci) simultaneously, since variants in certain combinations might influence the phenotype, a multi-locus approach is preferred. This approach requires the development of statistical methods that can handle multiple variable loci [11].

Another issue with association studies is the presence of false negatives (missed actual associations) due to lack of resolution of linkage disequilibrium. As discussed previously, random recombination events from generation to generation tend to separate regions of the chromosome. Therefore, in many cases, the disease-influencing gene has to be very close to the marker allele in order not to be separated from it by recombination, and hence to show linkage disequilibrium or association. If there is a high degree of recombination and not enough markers in the region, the association might not be detected, leading to a false negative. It may be necessary to detect and genotype every variant in a particular gene to eliminate the possibility of a false negative [3].

IV. INTEGRATIVE GENOMIC SYSTEMS-BIOLOGY SOLUTIONS

Considering the obstacles discussed previously, the discovery of the genetic variants contributing to complex CVD seems like a daunting task. To address some of these problems, there have been attempts at more integrative approaches to the disease-gene identification problem. One of the solutions proposed is to refine the definition of phenotype.
The use of intermediate phenotypes has been advocated by several publications [1, 10]. For example, in linkage/association studies of hypertension, instead of a broad phenotype such as blood pressure, an intermediate phenotype such as catecholamine levels should be used [1], increasing the chance that a specific locus will be linked to it. Also, a joint analysis of a group of functionally related intermediate phenotypes would increase the power to detect a contributing gene, since a gene might affect a network of correlated phenotypes [1].

A useful integrative strategy is to combine linkage maps relating to a multitude of intermediate phenotypes with patterns of correlations between these phenotypes (physiological profiling) [10]. In one study [10], 239 cardiovascular and renal phenotypes were measured in normal and stressed rats, and 125 of these phenotypes were mapped to chromosomal regions using linkage analysis. At each marker allele, these 125 phenotypes were incorporated into a visual profile of correlation coefficients between the traits. Figure 1 demonstrates the methodology in a simplified manner (Note: the figure serves to demonstrate the integrative methodology and not published results of the study in question). Essentially, what has been created is a ‘systems biology map for cardiovascular traits’ (in F2 rats) and a physiological profiling tool to assess the complex relationships between the phenotypes as a function of genotype [10]. Combining the genetic linkage maps and the physiological profiles facilitates relating genetic information to functional pathways. One of the conclusions inferred from this methodology was the relationship between alleles of nitric oxide synthase (NOS) and arterial pressure response [10].

As discussed previously, the heterogeneity of the sample population used for a study can cast doubt on the results. One approach is to use younger, genetic isolate populations in these studies.
In these cases, greater environmental homogeneity tends to lessen the effect of environmental factors, genetic homogeneity is greater, and the smaller number of generations implies fewer recombination events and hence stronger association [1]. Using an isolate population, one can then combine linkage and association studies. For example, at deCODE, the gene encoding PDE4D in ischemic stroke (caused by atherosclerosis) was identified by a combination of linkage and association studies [4]. First, linkage analysis of families was used to map the gene to a portion of chromosome 5, and then association of the gene was confirmed by saturating the region with genetic markers [4].

One of the problems discussed previously was that of missed associations, or false negatives, due to lack of resolution in association studies. This problem could be alleviated by an increase in the number of markers (known genetic segments used as comparison points in linkage and association studies). A larger number of markers would also allow for genome-wide association studies and a hypothesis-generating approach. Until recently, many association studies have been restricted to candidate genes (regions suspected by biochemists to be involved in a particular pathway) [4]. Such an association study is a hypothesis-testing approach that attempts to associate a particular allele at the suspected locus with the disease phenotype in question. The identification of genetic differences throughout the genome in recent years (especially SNPs, single nucleotide polymorphisms) will create a larger marker pool which, together with high-throughput genotyping techniques, will allow for whole-genome association studies.

Figure 1 - Integrating Linkage Mapping with physiological profiling

Another effort that aims to facilitate gene discovery through recombination mapping is the Human Haplotype Map Initiative.
A haplotype is a sparse representation of DNA representing the alleles on a chromosome [4]. The goal of the initiative is to determine the size and structure of these common chromosome segments across any set of individuals [4]. The haplotype map would ideally allow association studies to be performed by counting how often diseased vs. non-diseased individuals carry a certain haplotype. Those haplotypes that show a statistically significant difference in frequency between the diseased and non-diseased are likely to contain the disease-causing gene or mutation [4].

It was previously discussed that approaches ignoring the combinatorial effects of genes on a complex phenotype have shown significant limitations [11]. To that effect, multi-locus approaches have been advocated with the aim of detecting, among all measured polymorphisms, the ones that individually or in combination with others influence the complex phenotype [11]. Such large-scale, multi-locus association studies require sophisticated data mining tools, statistical analysis and pattern detection algorithms. For example, the Combinatorial Partitioning Method (CPM) [6] identifies partitions of two-locus genotypes that are most predictive of the phenotype variability. Tahri-Daizadeh et al. have proposed an automated Detection of Informative Combined Effects (DICE) algorithm for use in association studies to extract combinatorial effects of several polymorphisms and non-genetic covariates [11]. Curtis et al. outline the use of artificial neural networks and their pattern-recognition properties in detecting association between disease phenotypes and ‘multiple marker genotypes’ [2]. The goal is for the neural network to be able to classify the subjects in a case-control study based on their marker genotypes.

V. FUTURE – BEYOND THE GENOME

The explosion of genome information in recent years has empowered many of the aforementioned techniques for discovering the genetic basis of complex CVD phenotypes. A milestone was the completion of the human genome sequence in April 2003 [12]. In addition, the ongoing discovery of SNP markers and high-throughput genotyping technology pave the way for informative genome-wide association studies.

There is growing recognition that emergent, integrative behavior, which applies to most complex CVD phenotypes, is a result of dynamic interactions between many components. It is apparent that understanding integrative behavior is essential for progress, a feat which cannot be accomplished by genomic studies alone. Current literature [8, 12] advocates exploring the transcriptome (messenger RNA associated with cellular response to disease) using expression profiling, to give us a better understanding of patterns of altered gene and protein expression during disease development or progression. The hope is that the integration of expression profiling information at a variety of time points, phenotypes, and environmental conditions with genomic information will give us a better understanding of the gene regulatory networks [12]. Yet, even the transcriptome is not fully representative of the set of proteins encoded by the genome (proteome) [12].

Figure 2 is an illustration of the various layers of integration starting from the genome and leading to the phenome. The integration of data obtained from one layer with another using a systems biology approach has been proposed [12]. In fact, in an example of such a study [10], discussed previously, the integration of genetic linkage maps (genome) with physiological profiling (phenome) in model rats revealed functional interactions between traits not apparent from linkage analysis alone. Hence a multi-layered approach, guided by systems biology principles, will likely dominate the future landscape of cardiovascular research.

Figure 2 - Multi-layered approach to cardiovascular studies [Adapted from 8]. Layers listed in the figure: quantitative description of integrated functions of organism; regulatory networks and signaling pathways; collection of all encoded proteins; messenger RNA; DNA.

REFERENCES

[1] Broeckel U, Schork NJ. Identifying genes and genetic variation underlying human diseases and complex phenotypes via recombination mapping. J Physiol 2003 554(1): 40-45
[2] Curtis D, North BV, Sham PC. Use of an artificial neural network to detect association between a disease and multiple marker genotypes. Ann. Hum. Genet. 65: 95-107 Part 1, JAN 2001
[3] Keavney B., Genetic association studies in complex diseases. J. Hum. Hypertens. 14 (2000), pp. 361–367
[4] Lusis, A.J, et al. Genetic basis of atherosclerosis: part I: new genes and pathways. Circulation. 2004 Sep 28; 110(13): 1868-73
[5] McBride, Martin W., et al. Functional genomics in rodent models of hypertension, J Physiol 2003 554(1): 56-63
[6] Nelson, M.R., et al. 2001. A combinatorial partitioning method to identify multilocus genotypic partitions that predict quantitative trait variation. Genome Res. 11: 458-470
[7] NHLBI Working Group. Future Directions for Hypertension Research Executive Summary, 2004
[8] Podgoreanu M.V. and Schwinn D.A., 2004. Genomics and the circulation. Br J Anaesth 93 (1): 140-148 JUL 2004
[9] Smith, I.K. Protect Your Heart. Newsweek, July 19, 2002.
[10] Stoll M, Cowley AW, Jr, Tonellato PJ, et al. A genomic-systems biology map for cardiovascular function. Science 2001; 294: 1723 – 6
[11] Tahri-Daizadeh N, et al. Automated Detection of Informative Combined Effects in Genetic Association Studies of Complex Traits. Genome Res. 2003 Aug; 13(8): 1952-6
[12] Winslow, R.L. and Boguski, M.S., 2003. Genome informatics: current status and future prospects. Circ. Res. 92, pp. 953–961