REMBRANDT (REpository for Molecular BRAin Neoplasia DaTa)
Empowering Translational Research…
Draft Domain Model Document
Prepared for
HL7 Clinical Genomics SIG
By
Subha Madhavan
NCI, Center for Bioinformatics
madhavas@mail.nih.gov
Phone: 301-451-2882
(May 19, 2004)
• Background and Rationale: A critical factor in the advancement of biomedical
research is the ease with which data can be integrated, redistributed and analyzed both
within and across functional domains. The mission of the National Cancer Institute
Center for Bioinformatics (NCICB) is to provide informatics infrastructure and
scientific applications that support advanced translational research in cancer biology
and medicine. Rembrandt (REpository for Molecular BRAin Neoplasia DaTa) is
one such translational informatics effort that is aimed at producing a national
molecular/genetic/clinical database of several thousand primary brain tumors that is
fully open and accessible to all investigators (intramural and extramural). It is
envisioned to provide informatics support to molecularly characterize a large number
of adult and pediatric primary brain tumors and to correlate those data with extensive
retrospective and prospective clinical data. Specific data types hosted include gene
expression profiles, CGH and SNP information, sequencing data, tissue array results,
and patients' responses to various drug treatments. This comprehensive brain tumor
data portal will allow easy ad hoc querying across multiple domains such as
clinical data, functional genomics data and proteomics data, helping
physician-scientists make better-informed decisions during patient treatment and
in clinical research.
• Research goals: The goal of this molecular diagnostics initiative is to integrate
gene expression patterns with chromosomal abnormalities and clinical observations
in order to classify tumors into biologically meaningful and clinically useful
categories and to identify molecular signatures for specific tumor types. In addition, a long-term
goal of this project is the identification of target genes for novel diagnostic,
prognostic or therapeutic approaches.
• SNP domain model: The focus of this discussion is the subset of use cases from this
project that require a robust, extensible model to capture SNP-related information
within the scope of Rembrandt goals. Comparative genomic hybridization (CGH) has
been used extensively to document gains and losses of genomic DNA in diseases
such as cancer. The recent development of CGH using arrays of either genomic or
cDNA clones has improved the resolution of these analyses, allowing better detection
and mapping of localized changes such as gene amplification or homozygous
deletions. CGH by these methods only catalogs the number of copies of a DNA
sequence. It cannot, for example, distinguish one copy of each parental chromosome
from two copies of one parental chromosome, both of which will generate a signal
equivalent to two copies. However, in cancer and other human diseases, the
provenance of the chromosome or genomic region undergoing copy number alteration
is often important. Therefore, a platform such as SNP arrays, which provides
information on both copy number and the status of each parental allele, is
used in this study. Our research partners at the Neuro-Oncology Branch, National
Cancer Institute, are using GeneChip® Mapping 100K Arrays as the platform to study
chromosomal aberrations in samples from patients with gliomas. The above-mentioned
analysis of the tumor samples, and the need to model these functional genomics
data and the relationships among them, form the backdrop for
our discussion going forward.
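
The limitation of copy-number-only methods described above can be illustrated with a minimal sketch. The `LocusCopies` record and its field names are hypothetical and are not part of the Rembrandt model; the sketch only shows that two loci with the same total copy number can differ in parental origin.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocusCopies:
    """Hypothetical per-locus copy counts, split by parent of origin."""
    maternal: int
    paternal: int

    @property
    def total_copy_number(self) -> int:
        # The only quantity a copy-number measurement such as array CGH reflects.
        return self.maternal + self.paternal

    @property
    def both_parents_present(self) -> bool:
        # Allele-level information of the kind a SNP array can add.
        return self.maternal > 0 and self.paternal > 0


one_from_each_parent = LocusCopies(maternal=1, paternal=1)
two_from_one_parent = LocusCopies(maternal=2, paternal=0)

# Both loci give the same two-copy signal, yet they differ in parental origin.
same_copy_number = (
    one_from_each_parent.total_copy_number == two_from_one_parent.total_copy_number
)
```

Here `same_copy_number` is true even though only the first locus retains a copy from each parent, which is the distinction the SNP platform is chosen to capture.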
• Use-case-driven modeling: Our experience with modeling biological entities has
led us to believe that UML is a good way to describe translational
research. Our model does not focus on messaging SNP data between
organizations; it is intended to describe our area of interest as
derived from our use cases.
• A typical Rembrandt data portal search use case is as follows:
o Show me all tumor samples that have amplification of 13q11.3, deletion of
10p21, D7S522 and the FHIT region confirmed by SNP chips and CGH
analysis.
o Display regions with LOH for these samples.
o Which genes are under-expressed in these tumor samples with respect to
normal?
o Does this subset of tumors have a better survival?
o Do they segregate to a certain age group, geographical area or ethnicity?
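
The first step of the use case above can be sketched as a simple filter. This is only an illustration under stated assumptions: the `Sample` record, the platform labels and the two toy samples are hypothetical, and the actual Rembrandt portal is not implemented this way as far as this document states. The region names are taken from the use case text.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Sample:
    """Hypothetical tumor-sample record; not the Rembrandt schema."""
    sample_id: str
    # Each dict maps a region name to the set of platforms reporting the aberration.
    amplified: Dict[str, Set[str]] = field(default_factory=dict)
    deleted: Dict[str, Set[str]] = field(default_factory=dict)


REQUIRED_PLATFORMS = {"SNP", "CGH"}  # assumed labels for the two platforms


def confirmed(calls: Dict[str, Set[str]], region: str) -> bool:
    """True if every required platform reports the aberration in this region."""
    return REQUIRED_PLATFORMS <= calls.get(region, set())


def find_samples(samples: List[Sample],
                 amplified_regions: List[str],
                 deleted_regions: List[str]) -> List[Sample]:
    """Return samples in which all requested aberrations are confirmed."""
    return [
        s for s in samples
        if all(confirmed(s.amplified, r) for r in amplified_regions)
        and all(confirmed(s.deleted, r) for r in deleted_regions)
    ]


# Toy records for illustration only; they do not describe real patients.
both = {"SNP", "CGH"}
samples = [
    Sample("T1", amplified={"13q11.3": both},
           deleted={"10p21": both, "D7S522": both, "FHIT": both}),
    Sample("T2", amplified={"13q11.3": {"SNP"}}, deleted={"10p21": both}),
]

hits = find_samples(samples, ["13q11.3"], ["10p21", "D7S522", "FHIT"])
```

The remaining questions in the use case (LOH regions, under-expressed genes, survival, demographics) would then be asked of the samples in `hits`, which is why the model has to link SNP observations to gene expression and clinical data.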
Our Model: The draft UML model for the Rembrandt SNP data domain is shown
in Figure 1. We reviewed the clinical genomics reusable genotype model, treating
the R-MIM as a starting point for modeling the Rembrandt use cases.
Modifications/extensions to the Reusable Genotype R-MIM, V0.4, 2004-03-14, are
shown in purple. SNP domain-related classes such as Polymorphism, Genotype, etc.
are shown in the diagram with all their associations and cardinalities. We have
designed three packages, GeneExpression, Clinical and Population, to define
related but separate domains. One noteworthy point is that we have modeled Probe
(from the GeneExpression package) as an interface to abstract specific
technologies such as cDNA arrays, oligo arrays, etc.
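
A minimal sketch of what modeling Probe as an interface could look like is given here. Only the name `Probe` and its signal value attribute come from the model; the method name `signal_value`, the two implementing classes and the way each computes its value are assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from statistics import mean
from typing import List


class Probe(ABC):
    """Technology-neutral probe interface (the method name is an assumption)."""

    @abstractmethod
    def signal_value(self) -> float:
        """Return one summary signal/intensity value for this probe."""


class CDNAProbe(Probe):
    """Hypothetical cDNA-array probe with a single spot intensity."""

    def __init__(self, spot_intensity: float) -> None:
        self.spot_intensity = spot_intensity

    def signal_value(self) -> float:
        return self.spot_intensity


class OligoProbe(Probe):
    """Hypothetical oligo-array probe summarised as the mean of its features."""

    def __init__(self, feature_intensities: List[float]) -> None:
        self.feature_intensities = feature_intensities

    def signal_value(self) -> float:
        return mean(self.feature_intensities)


def signals(probes: List[Probe]) -> List[float]:
    """Client code depends only on the Probe interface, not on the technology."""
    return [p.signal_value() for p in probes]


values = signals([CDNAProbe(120.0), OligoProbe([90.0, 110.0])])
```

The design benefit is that a new array technology only needs a new implementing class; code written against `Probe`, such as `signals`, does not change.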
In order to address the Rembrandt use cases (such as the one mentioned above), it
was critical for us to model SNPs in the context of three main areas, as described
below:
• SNPs as markers on the genome: Because SNPs occur frequently throughout the
genome and tend to be relatively stable genetically, they serve as excellent
biological markers. Biological markers are segments of DNA with an
identifiable physical location that can be easily tracked and used for constructing
a chromosome map that shows the positions of known genes, or other markers,
relative to each other. These maps allow researchers to study and pinpoint traits
resulting from the interaction of more than one gene. Hence, there is a need to
model SNPs in relation to chromosomal aberrations and as markers on the
genome.
Relevant classes (see Figure 1): Chromosome, MapLocation and
LengthPolymorphism (to capture insertions and deletions) are included.
• Annotations and external cross-references: Rich annotations are available for
SNP data through different projects such as CGAP (Cancer Genome Anatomy
Project), GAI (Genetic Annotation Initiative) and TSC (The SNP Consortium). We
require the model to allow for extensive data annotations through internal or
external cross-references.
Relevant classes (see Figure 1): Population and SNPFrequency are
included, as a start, for annotating the SNP data.
• Experimental observations:
o LOH: Human cancers arise by a combination of genetic changes,
including activation of cellular oncogenes and inactivation of tumor
suppressor genes (TSGs). Chromosomal regions demonstrating a high rate
of loss of genetic material are frequently found to harbor putative TSGs.
The classic model of TSG inactivation is a two-hit process in
which one allele is mutated and the other allele is lost through one of a
number of possible mechanisms, resulting in loss of heterozygosity (LOH) at
the affected locus. Hence, there is a need to include experimental observations
such as LOH in the SNP domain model to identify allelic imbalances.
Relevant class attribute (see Figure 1): The attribute heterozygousStatus is
added to class SNP.
o Signal values: There is a need to capture the signal/intensity value for SNP
elements on arrays to correlate with DNA copy number.
Relevant classes (see Figure 1): A new package called GeneExpression
is included, along with an interface Probe (with a class attribute called
signalValue).
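
The two kinds of experimental observation above can be sketched as follows. This is an illustration under assumptions that the document does not state: genotypes are encoded as two-letter calls such as "AB", LOH is called by comparing a tumor sample with a matched normal sample, and signal values are compared as a log2 ratio. None of these conventions is confirmed as the method used in Rembrandt.

```python
import math
from typing import Optional


def call_loh(normal_genotype: str, tumor_genotype: str) -> Optional[bool]:
    """Call LOH at one SNP from a matched normal/tumor pair.

    Genotypes are two-letter calls such as "AA", "AB" or "BB" (assumed encoding).
    Returns True for LOH, False for retained heterozygosity, and None when the
    SNP is homozygous in the normal sample and therefore uninformative.
    """
    normal_is_heterozygous = len(set(normal_genotype)) == 2
    if not normal_is_heterozygous:
        return None
    # Heterozygous in normal but only one distinct allele left in tumor: LOH.
    return len(set(tumor_genotype)) == 1


def log2_ratio(tumor_signal: float, normal_signal: float) -> float:
    """Relative signal on a log2 scale: 0 means equal signal, positive means gain."""
    if tumor_signal <= 0 or normal_signal <= 0:
        raise ValueError("signal values must be positive")
    return math.log2(tumor_signal / normal_signal)
```

For example, `call_loh("AB", "AA")` reports LOH, while `call_loh("AA", "AA")` returns `None` because a SNP that is homozygous in the normal sample cannot show a loss of heterozygosity.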
• Conclusions: SNP arrays are being widely used in cancer research to study
chromosomal amplification, deletion, and loss of heterozygosity (LOH).
Rembrandt Phase 1 is focusing on using the SNP chip data for
amplification/deletion studies and for LOH. Future uses would include patient
genotyping to find common susceptibility alleles. For Phase 1 needs, our
use-case-driven design goal was to model a space that allows researchers to
traverse from the observation values associated with a SNP marker that they wish
to interrogate to any annotations associated with that particular genomic region.
We plan to achieve this objective by extending our biomedical object model,
caBIO. We hope this kind of use-case-driven modeling will support easy
navigation between related biological data and also allow for easy extensibility
in the future to accommodate the growing needs of the cancer research
community.
Figure 1: Object Model for Rembrandt to capture SNP-related information
• Rembrandt biological and operational definitions for classes shown in the above
diagram:
Allele
One of two or more alternative forms of a gene, which can result in different gene products and thus different
phenotypes. A single allele for each gene locus is inherited separately from each parent (e.g., at a
locus for eye color the allele might result in blue or brown eyes). An organism is homozygous for
a gene if the alleles are identical, and heterozygous if they are different. In the SNP context, we
refer to the allele of a particular SNP, rather than of a gene.
An object that represents the alternative forms of a gene/SNP/chromosome location in a particular
sample. The typical case would be an allele pair in a sample, one from each parent. A third
allele could be present in a sample containing three copies of a chromosome, as in Down
syndrome.
Chromosome
A structure of compact intertwined molecules of DNA found in the nucleus of cells. Chromosomes
contain the cell's genetic information.
An object representing a specific chromosome for a specific taxon; provides access to all known
genes contained in the chromosome and to the taxon.
ClinicalPhenotype
Observable clinical characteristics of an organism produced by the organism's genotype
interacting with the environment.
An object that represents the clinical observations for a patient collected on the case report forms
during a patient visit.
Clone
A section of DNA that has been inserted into a vector molecule, such as a plasmid or a phage
chromosome, and then replicated to form many identical copies.
An object used to hold information pertaining to I.M.A.G.E/BAC clones; provides access to
sequence information, associated trace files, and the clone's library.
Gene
A gene is an ordered sequence of nucleotides located in a particular position (locus) on a
particular chromosome that encodes a specific functional product (the gene product, i.e. a protein
or RNA molecule). It includes regions involved in regulation of expression and regions that code
for a specific functional product.
Gene objects are the effective portal to most of the genomic information provided by the caBIO
data services; organs, diseases, chromosomes, pathways, sequence data, and expression
experiments are among the many objects accessible via a gene.
Genotype
The hereditary constitution of an individual, or of particular nuclei within its cells.
An object that represents the complete set of allelotypes for all the SNPs being studied for that
sample/patient.
Haplotype
A particular pattern of sequential SNPs (or alleles) found on a single chromosome.
An object that represents a subset of allelotypes (from the complete genotype) in a particular
sample/patient.
LengthPolymorphism
Differences in DNA sequence that involve more than a single nucleotide substitution. Includes insertions
and deletions of nucleotides and repetitive sequences.
An object representing insertions and deletions in a sequence.
MapLocation
The position of a gene or other marker on a chromosome.
An object associated with a Gene object that represents the physical map location of the gene.
Polymorphism
Difference in DNA sequence among individuals. Applied to many situations ranging from genetic
traits or disorders in a population to the variation in the sequence of DNA or proteins.
An abstract class from which SNP and LengthPolymorphism are derived.
SNP
A Single Nucleotide Polymorphism, or SNP (pronounced "snip"), is a small genetic change, or
variation, that can occur within a person's DNA sequence. The genetic code is specified by the
four nucleotide "letters" A (adenine), C (cytosine), T (thymine), and G (guanine). SNP variation
occurs when a single nucleotide, such as an A, replaces one of the other three nucleotide letters—
C, G, or T. Because SNPs occur frequently throughout the genome and tend to be relatively stable
genetically, they serve as excellent biological markers.
An object representing a Single Nucleotide Polymorphism; provides access to the clones and the
trace files from which it was identified, the two most common substitutions at that position, the
offset of the SNP in the parent sequence, and a confidence score.
SNPFrequency
Rate of occurrence of a particular SNP in a Population.
An association class between SNP and Population. Given the SNP object and
the Population, one can traverse this association to obtain the SNP frequency in that
particular population.
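
A few of the definitions above can be sketched in code to show how they relate. This is a minimal illustration, not the caBIO or Rembrandt implementation: the class names follow Figure 1, while the attribute names, the `SNPFrequencyTable` container and the example values are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Polymorphism:
    """Parent of SNP and LengthPolymorphism; abstract in the model."""
    identifier: str  # assumed attribute

    def __post_init__(self) -> None:
        # Enforce abstractness: only the derived classes may be instantiated.
        if type(self) is Polymorphism:
            raise TypeError("Polymorphism is abstract")


@dataclass(frozen=True)
class SNP(Polymorphism):
    """A single-nucleotide polymorphism."""


@dataclass(frozen=True)
class LengthPolymorphism(Polymorphism):
    """An insertion or deletion in a sequence."""


@dataclass(frozen=True)
class Population:
    name: str  # assumed attribute


class SNPFrequencyTable:
    """Association between SNP and Population, holding one frequency per pair."""

    def __init__(self) -> None:
        self._frequencies: Dict[Tuple[str, str], float] = {}

    def record(self, snp: SNP, population: Population, frequency: float) -> None:
        if not 0.0 <= frequency <= 1.0:
            raise ValueError("frequency must lie between 0 and 1")
        self._frequencies[(snp.identifier, population.name)] = frequency

    def lookup(self, snp: SNP, population: Population) -> Optional[float]:
        """Traverse the association: SNP plus Population gives the frequency."""
        return self._frequencies.get((snp.identifier, population.name))


# Placeholder identifiers and value, for illustration only.
snp = SNP("snp_example")
population = Population("population_A")
table = SNPFrequencyTable()
table.record(snp, population, 0.25)
frequency = table.lookup(snp, population)
```

Keeping the frequency on the association rather than on SNP itself reflects the definition above: the same SNP can have a different frequency in each population.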
• Acknowledgements: I thank Himanso Sahni and Smita Hastak for their ongoing
contributions to the Rembrandt modeling effort; Peter Covitz and Carl Schaefer for
their valuable comments on this document; and Jean-Claude Zenklusen and Howard Fine
for clearly articulating the informatics goals of this glioma molecular diagnostic
initiative.