REMBRANDT (REpository for Molecular BRAin Neoplasia DaTa)
Empowering Translational Research…

Draft Domain Model Document
Prepared for HL7 Clinical Genomics SIG

By Subha Madhavan
NCI, Center for Bioinformatics
madhavas@mail.nih.gov
Phone: 301-451-2882
(May 19, 2004)

Background and Rationale:

A critical factor in the advancement of biomedical research is the ease with which data can be integrated, redistributed and analyzed both within and across functional domains. The mission of the National Cancer Institute Center for Bioinformatics (NCICB) is to provide informatics infrastructure and scientific applications that support advanced translational research in cancer biology and medicine.

Rembrandt (REpository for Molecular BRAin Neoplasia DaTa) is one such translational informatics effort, aimed at producing a national molecular/genetic/clinical database of several thousand primary brain tumors that is fully open and accessible to all investigators (intramural and extramural). It is envisioned to provide informatics support for molecularly characterizing a large number of adult and pediatric primary brain tumors and for correlating those data with extensive retrospective and prospective clinical data. Specific data types hosted include gene expression profiles, CGH and SNP information, sequencing data, tissue array results, and patients’ response to various drug treatments. This comprehensive brain tumor data portal will allow easy ad hoc querying across multiple domains such as clinical data, functional genomics data and proteomics data, helping physician-scientists make better-informed decisions during patient treatment and in clinical research.

Research goals:

The goal of this molecular diagnostics initiative study is to integrate gene expression patterns with chromosomal abnormalities and clinical observations in order to classify tumors into biologically meaningful and clinically useful categories and to identify molecular signatures for specific tumor types.
In addition, a long-term goal of this project is the identification of target genes for novel diagnostic, prognostic or therapeutic approaches.

SNP domain model:

The focus of this discussion is the subset of use cases from this project that require a robust, extensible model to capture SNP-related information within the scope of the Rembrandt goals.

Comparative genomic hybridization (CGH) has been used extensively to document gains and losses of genomic DNA in diseases such as cancer. The recent development of CGH using arrays of either genomic or cDNA clones has improved the resolution of these analyses, allowing better detection and mapping of localized changes such as gene amplification or homozygous deletions. CGH by these methods only catalogs the number of copies of a DNA sequence. It cannot, for example, distinguish one copy of each parental chromosome from two copies of one parental chromosome, both of which generate a signal equivalent to two copies. However, in cancer and other human diseases, the provenance of the chromosome or genomic region undergoing copy number alteration is often important. Therefore, this study uses a platform, SNP arrays, that provides information on both copy number and the status of each parental allele.

Our research partners at the Neuro-Oncology Branch, National Cancer Institute, are using GeneChip® Mapping 100K Arrays as the platform to study chromosomal aberrations in samples from patients with gliomas. This analysis of the tumor samples, together with the need to model these functional genomics data and the relationships among them, forms the backdrop for the discussion that follows.

Use-case-driven modeling:

Our experience with modeling biological entities led us to believe that UML modeling was a good way to describe translational research.
Our model does not focus on messaging SNP data between organizations; it is intended as a model that helps describe our area of interest as derived from our use cases.

A typical Rembrandt data portal search use case is as follows:

    Show me all tumor samples that have amplification of 13q11.3, deletion of 10p21, D7S522 and the FHIT region confirmed by SNP chips and CGH analysis.
    Display regions with LOH for these samples.
    Which genes are under-expressed in these tumor samples with respect to normal?
    Does this subset of tumors have a better survival?
    Do they segregate to a certain age group, geographical area or ethnicity?

Our Model:

The draft UML model for the Rembrandt SNP data domain is shown in Figure 1. We reviewed the Clinical Genomics reusable genotype model to assess the R-MIM as a starting point for modeling the Rembrandt use cases. Modifications and extensions to the Reusable Genotype R-MIM, V0.4, 200403-14 are shown in purple. SNP domain-related classes such as Polymorphism and Genotype are shown in the diagram with all their associations and cardinalities. We have designed three packages (GeneExpression, Clinical and Population) to define related but separate domains. One noteworthy point is that we have modeled Probe (from the GeneExpression package) as an interface, to abstract over specific technologies such as cDNA arrays and oligo arrays.

To address the Rembrandt use cases (such as the one mentioned above), it was critical for us to model SNPs in the context of three main areas, described below.

SNPs as markers on the genome:

Because SNPs occur frequently throughout the genome and tend to be relatively stable genetically, they serve as excellent biological markers. Biological markers are segments of DNA with an identifiable physical location that can be easily tracked and used for constructing a chromosome map that shows the positions of known genes, or other markers, relative to each other.
These maps allow researchers to study and pinpoint traits resulting from the interaction of more than one gene. Hence, there is a need to model SNPs in relation to chromosomal aberrations and as markers on the genome.

Relevant classes (refer to Figure 1): Classes Chromosome, MapLocation and LengthPolymorphism (to capture insertions and deletions) are included.

Annotations and external cross-references:

Rich annotations are available for SNP data through different projects such as CGAP, GAI (Genetic Annotation Initiative) and TSC (The SNP Consortium). We require the model to allow for extensive data annotation through internal or external cross-references.

Relevant classes (refer to Figure 1): Classes Population and SNPFrequency are included, as a start, for annotating the SNP data.

Experimental observations:

o LOH: Human cancers arise by a combination of genetic changes, including activation of cellular oncogenes and inactivation of tumor suppressor genes (TSGs). Chromosomal regions demonstrating a high rate of loss of genetic material are frequently found to harbor putative TSGs. The classic model of TSG inactivation is a two-hit process in which one allele is mutated and the other allele is lost through one of a number of possible mechanisms, resulting in loss of heterozygosity (LOH) at the affected locus. Hence, there is a need to include experimental observations such as LOH in the SNP domain model to identify allelic imbalances.

  Relevant class attribute: Class attribute heterozygousStatus is added to class SNP.

o Signal Values: There is a need to capture the signal/intensity value for SNP elements on arrays, to correlate with DNA copy number.

  Relevant classes (refer to Figure 1): A new package called GeneExpression is included, along with an interface Probe (with a class attribute called signalValue).

Conclusions:

SNP arrays are being widely used in cancer research to study chromosomal amplification, deletion, and loss of heterozygosity (LOH).
Rembrandt Phase 1 is focusing on using the SNP chip data for amplification/deletion studies and for LOH. Future uses would include patient genotyping to find common alleles of susceptibility. For Phase 1 needs, our use-case-driven design goal was to model a space that allows researchers to traverse from the observation values associated with a SNP marker they wish to interrogate to any annotations associated with that particular genomic region. We plan to achieve this objective by extending our biomedical object model, caBIO. We hope this kind of use-case-driven modeling will support easy navigation between related biological data and allow for easy extensibility in the future, to accommodate the growing needs of the cancer research community.

Figure 1: Object Model for Rembrandt to capture SNP-related information

Rembrandt biological and operational definitions for classes shown in the above diagram:

Allele
One of two or more alternative forms of a gene, resulting in different gene products and thus different phenotypes. A single allele for each gene locus is inherited separately from each parent (e.g., at a locus for eye color the allele might result in blue or brown eyes). An organism is homozygous for a gene if the alleles are identical, and heterozygous if they are different. In the SNP context, we refer to the allele of a particular SNP rather than of a gene.
An object that represents the alternative forms of a gene/SNP/chromosome location in a particular sample. The typical case would be an allele pair in a sample, one from each parent. A third allele could be present in a sample containing three copies of a chromosome, as in Down’s syndrome.

Chromosome
A structure of compact intertwined molecules of DNA found in the nucleus of cells. Chromosomes contain the cell's genetic information.
An object representing a specific chromosome for a specific taxon; provides access to all known genes contained in the chromosome and to the taxon.

ClinicalPhenotype
Observable clinical characteristics of an organism produced by the organism's genotype interacting with the environment.
An object that represents the clinical observations for a patient, collected on the Case Report Forms during a patient visit.

Clone
A section of DNA that has been inserted into a vector molecule, such as a plasmid or a phage chromosome, and then replicated to form many identical copies.
An object used to hold information pertaining to I.M.A.G.E./BAC clones; provides access to sequence information, associated trace files, and the clone's library.

Gene
An ordered sequence of nucleotides located in a particular position (locus) on a particular chromosome that encodes a specific functional product (the gene product, i.e. a protein or RNA molecule). It includes regions involved in regulation of expression and regions that code for a specific functional product.
Gene objects are the effective portal to most of the genomic information provided by the caBIO data services; organs, diseases, chromosomes, pathways, sequence data, and expression experiments are among the many objects accessible via a gene.

Genotype
The hereditary constitution of an individual, or of particular nuclei within its cells.
An object that represents the complete set of allelotypes for all the SNPs being studied for that sample/patient.

Haplotype
A particular pattern of sequential SNPs (or alleles) found on a single chromosome.
An object that represents a subset of allelotypes (from the complete genotype) in a particular sample/patient.

LengthPolymorphism
Differences in DNA sequence that involve more than a single nucleotide substitution. Includes insertions and deletions of nucleotides and repetitive sequences.
An object representing insertions and deletions in a sequence.

MapLocation
The position of a gene, or of another chromosomal marker, on a chromosome.
An object associated with a Gene object; represents the physical map location of the gene.

Polymorphism
Difference in DNA sequence among individuals. Applied to many situations, ranging from genetic traits or disorders in a population to variation in the sequence of DNA or proteins.
An abstract class from which SNP and LengthPolymorphism are derived.

SNP
A Single Nucleotide Polymorphism, or SNP (pronounced "snip"), is a small genetic change, or variation, that can occur within a person's DNA sequence. The genetic code is specified by the four nucleotide "letters" A (adenine), C (cytosine), T (thymine), and G (guanine). SNP variation occurs when a single nucleotide, such as an A, replaces one of the other three nucleotide letters (C, G, or T). Because SNPs occur frequently throughout the genome and tend to be relatively stable genetically, they serve as excellent biological markers.
An object representing a Single Nucleotide Polymorphism; provides access to the clones and the trace files from which it was identified, the two most common substitutions at that position, the offset of the SNP in the parent sequence, and a confidence score.

SNPFrequency
Rate of occurrence of a particular SNP in a Population.
An association class between SNP and Population. Given the SNP object and the Population, one can traverse this association and obtain the SNP frequency in that particular population.

Acknowledgements:

I thank Himanso Sahni and Smita Hastak for their ongoing contributions to the Rembrandt modeling effort; Peter Covitz and Carl Schaefer for their valuable comments on this document; and Jean-Claude Zenklusen and Howard Fine for clearly articulating the informatics goals of this Glioma molecular diagnostic initiative.
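
Illustrative sketch (not part of the Rembrandt or caBIO code base):

The SNPFrequency association class defined above, together with the heterozygousStatus attribute on SNP, can be illustrated with a minimal Python sketch. The class names SNP, Population and SNPFrequency follow Figure 1. Everything else is an assumption made only for illustration: the attribute names snp_id, name and frequency, the helper class FrequencyRegistry, the identifiers, and the frequency value.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Population:
    """A population used to annotate SNP data ('name' is a hypothetical attribute)."""
    name: str


@dataclass
class SNP:
    """A SNP marker. heterozygous_status mirrors the heterozygousStatus attribute."""
    snp_id: str                                 # hypothetical identifier
    heterozygous_status: Optional[bool] = None  # None means not yet observed


@dataclass
class SNPFrequency:
    """Association class linking one SNP to one Population."""
    snp: SNP
    population: Population
    frequency: float  # rate of occurrence, assumed here to lie in [0, 1]


class FrequencyRegistry:
    """Hypothetical helper that stores SNPFrequency links and traverses them."""

    def __init__(self) -> None:
        self._links: Dict[Tuple[str, str], SNPFrequency] = {}

    def add(self, link: SNPFrequency) -> None:
        if not 0.0 <= link.frequency <= 1.0:
            raise ValueError("frequency must lie between 0 and 1")
        self._links[(link.snp.snp_id, link.population.name)] = link

    def lookup(self, snp: SNP, population: Population) -> Optional[float]:
        """Given a SNP and a Population, return the frequency, or None if unknown."""
        link = self._links.get((snp.snp_id, population.name))
        return None if link is None else link.frequency


# Made-up example data; no real SNP identifiers or measured frequencies are used.
snp = SNP(snp_id="example-snp-1", heterozygous_status=False)
population = Population(name="example-population-A")
registry = FrequencyRegistry()
registry.add(SNPFrequency(snp=snp, population=population, frequency=0.25))

print(registry.lookup(snp, population))  # 0.25
```

Keeping SNPFrequency as its own class, rather than as a field on SNP, mirrors the association-class design in the model: the frequency belongs to the pairing of a SNP and a Population, not to either one alone.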