1) National Center for Integrative Biomedical Informatics and the Center for Computational
Medicine and Biology
2) Department of Psychiatry, University of Michigan, Ann Arbor, Michigan 48109
3) Department of Human Genetics, University of Michigan, Ann Arbor, Michigan 48109
The Cell Line Ontology
Sirarat Sarntivijai, Alexander S. Ade, Brian D. Athey, and David J. States
Abstract
Cell lines are used extensively throughout biomedical research, but the nomenclature
describing cell lines has not been standardized, and many ambiguous or non-unique
names appear in the published literature. The Cell Line Ontology is a well-structured
collection of names for cell lines cultured in vitro with descriptive data. This ontology
contains a broad collection of cell line names compiled from ATCC, Hyper CLDB, MeSH
and primary literature scanning. In addition to names, the ontology specifies
relationships between cell lines including derivation and homology. The Cell Line
Ontology facilitates data exploration and comparison of different cell lines in support of
clinical and experimental research. We analyze the use of cell line names in biomedical
text. Issues encountered when interpreting such tokens in natural language processing
include ambiguous names, polymorphisms (variant forms) in the use of names, and the
fact that some cell line names are also common English words. We analyze linguistic
patterns associated with the occurrence of cell line names. Applying these patterns to
find additional cell line names in the literature identifies only a small number of
additional names. Annotation of microarray gene expression studies is used as a test
case. The objectives of this work are (a) to assist users in extracting useful information
from biomedical text and (b) to highlight the importance of standardizing cell line names
in biomedical research.
Background and Problem
Cells cultured in vitro are powerful and convenient model systems that are widely used in
biomedical research. Because cell lines are developed at lab benches, their names are
assigned by the laboratory of origin; sometimes a catalogue number is added when
they are submitted to a repository. In the published literature, cell line names are
inherited, reproduced, or reformatted with variations according to author preference.
Non-unique or ambiguous names also appear in the literature. Inconsistent, ambiguous,
misleading, or non-informative cell line names complicate data-mining and
knowledge-seeking activities. Here, we propose the Cell Line Ontology, which includes
structured knowledge about cell lines and their derivatives and homologues.
Methods
Large collections of cell line information were obtained from the American Type Culture
Collection (ATCC) catalogue (available online at
http://www.atcc.org/common/documents/pdf/CellCatalog/CellIndex.pdf as of November
2006) and the Hyper Cell line Data Base (CLDB) version 4.200201
(http://www.biotech.ist.unige.it/interlab/cldb.html). Cell line names in the NLM Medical
Subject Headings (MeSH) were linked to their corresponding cell lines. Names duplicated
across the data sources were merged, and the provenance of each duplicate was kept in
the cross-reference links for that cell line. After data cleaning, 8,740 cell line
instances remained. Each cell line record holds the attributes CellLineID, Organism,
Tissue, Pathology, Growth Mode, Repository Source, ATCC number, CLDB html, and
MeSHID. ATCC number is the ATCC catalogue number, and CLDB html is the
original HTML file name relative to the base URL
(http://www.biotech.ist.unige.it/cldb/*.html). The ontology was constructed using the
Stanford Protégé ontology editor. A Java script was written to automate data loading
into this ontology in the W3C Web Ontology Language (OWL-DL) format.
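The merging step described above can be illustrated with a minimal Python sketch. It is not the authors' loader: the record class, the tuple layout, the rule of keeping the first non-empty attribute value, and the sample identifiers are all assumptions made for illustration. It only shows the idea of collapsing entries that share a case-sensitive name while keeping one cross-reference per source.

```python
from dataclasses import dataclass, field


@dataclass
class CellLineRecord:
    """One merged entry; a hypothetical subset of the attributes listed above."""
    name: str
    organism: str = ""
    tissue: str = ""
    # Provenance: source name -> identifier in that source (assumed layout).
    cross_refs: dict = field(default_factory=dict)


def merge_sources(entries):
    """Collapse entries sharing a case-sensitive name, keeping every source link.

    `entries` is an iterable of (source, source_id, name, organism, tissue) tuples.
    """
    merged = {}
    for source, source_id, name, organism, tissue in entries:
        record = merged.setdefault(name, CellLineRecord(name=name))
        # Keep the first non-empty value seen for each descriptive attribute.
        record.organism = record.organism or organism
        record.tissue = record.tissue or tissue
        record.cross_refs[source] = source_id
    return merged


# Illustrative input only; the identifiers below are placeholders.
entries = [
    ("ATCC", "ATCC-0001", "HeLa", "Homo sapiens", "cervix"),
    ("CLDB", "cl0001.html", "HeLa", "", ""),
    ("CLDB", "cl0002.html", "hela", "Homo sapiens", ""),
]
merged = merge_sources(entries)
```

Because matching is case-sensitive in this sketch, "HeLa" and "hela" stay separate records; whether such pairs should be merged is a curation decision the sketch does not make.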
• Data (Preprocessing)
  – ATCC
    • .PDF catalogue
    • 3,489 distinct catalogue numbers
  – Hyper CLDB
    • .html web pages
    • 6,856 entries with duplicates -> 5,889 case-sensitive unique name entries
  – Total ATCC + CLDB
    • 8,861 non-duplicate, non-overlapping entries -> 8,740 entries after data cleaning
• Ontology Creation
  – Ontology structure constructed in the Protégé ontology editor
  – Java script for automating instance creation in the W3C Web Ontology Language
    (OWL-DL) format
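Automated instance creation could look roughly like the following Python sketch, which writes each cell line as an individual of a CellLine class in Turtle, one of the RDF syntaxes in which an OWL ontology can be serialized. The prefix, the namespace URL, the property names, and the choice of Turtle are assumptions for illustration; the serialization produced by the original Java script is not described here.

```python
def turtle_header(prefix="clo", namespace="http://example.org/cell-line-ontology#"):
    """Return the prefix declaration; the namespace is a placeholder."""
    return f"@prefix {prefix}: <{namespace}> .\n\n"


def to_turtle(cell_line_id, attributes, prefix="clo"):
    """Render one cell line as an individual of the (assumed) class CellLine."""
    lines = [f"{prefix}:{cell_line_id} a {prefix}:CellLine"]
    for prop, value in attributes.items():
        # Escape backslashes and double quotes so the literal stays valid Turtle.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'    {prefix}:{prop} "{escaped}"')
    return " ;\n".join(lines) + " .\n"


document = turtle_header() + to_turtle(
    "CL_0001", {"organism": "Homo sapiens", "tissue": "cervix"}
)
print(document)
```

Generating text directly keeps the sketch dependency-free; a production loader would more likely use an ontology library so that the output can be validated.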
Results
Most of the cell line entities in the Cell Line Ontology are stand-alone instances with a
link to their source of origin. However, a number of cell lines are derivatives of
common cell lines. These entities carry the isDerivedFrom relationship to their parents.
Cell lines that are parents of derivatives carry the hasDerivatives relationship to their
child cell lines. Cell lines that derive from a common parent cell line have a transitive
hasHomolog relationship to their siblings (other clones from the same parent). While
creating this ontology, we also observed that cell line names that are non-unique,
extremely common, or ambiguous are used throughout the biomedical text. We performed
an experiment using a Natural Language Processing (NLP) approach to evaluate these
cell line names. Noun phrases ending in ‘cells’ or ‘cell line’ are likely to refer to cell
lines and can help in tagging cell line names that are also common English words, but
requiring these constructs impairs recall in named entity recognition.
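The precision and recall trade-off of the ‘cells’ / ‘cell line’ construct can be seen in a small Python sketch. The regular expression and the sample sentence are assumptions for illustration and are much cruder than a real noun-phrase chunker: a known name is accepted only when it stands directly before "cells", "cell line", or "cell lines".

```python
import re

# A candidate name is a single token of letters, digits, and hyphens directly
# before "cells", "cell line", or "cell lines".
CONTEXT_PATTERN = re.compile(
    r"\b([A-Za-z0-9][A-Za-z0-9\-]*)\s+cell(?:s|\s+lines?)\b"
)


def find_in_context(text, known_names):
    """Return known names that occur directly before a cell-line head noun."""
    return [
        match.group(1)
        for match in CONTEXT_PATTERN.finditer(text)
        if match.group(1) in known_names
    ]


known = {"OK", "HeLa", "L"}
text = (
    "OK cells were grown overnight. The results looked OK. "
    "HeLa was also used, as were L cell lines."
)
hits = find_in_context(text, known)
print(hits)  # ['OK', 'L']
```

The ordinary English use of "OK" in the second sentence is correctly ignored, but the bare mention of HeLa is missed, which is the loss of recall noted above.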
Data Structure
UML Diagram describing the entities and relationships of the Cell Line Ontology
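As a rough illustration of how the three relationships fit together, the Python sketch below derives hasDerivatives and hasHomolog from isDerivedFrom pairs alone. The function name and the cell line names are hypothetical, and the sketch assumes each derivative has exactly one recorded parent.

```python
from collections import defaultdict


def build_relationships(derived_from):
    """Derive inverse and sibling links from child -> parent pairs.

    `derived_from` maps each derivative to its single parent (isDerivedFrom).
    """
    has_derivatives = defaultdict(set)
    for child, parent in derived_from.items():
        has_derivatives[parent].add(child)
    # Siblings are the other derivatives of the same parent (hasHomolog).
    has_homolog = {
        child: has_derivatives[parent] - {child}
        for child, parent in derived_from.items()
    }
    return dict(has_derivatives), has_homolog


# Hypothetical names used only for illustration.
derived_from = {
    "CloneA1": "ParentA",
    "CloneA2": "ParentA",
    "CloneA3": "ParentA",
    "CloneB1": "ParentB",
}
has_derivatives, has_homolog = build_relationships(derived_from)
```

Since all siblings of one parent end up in the same set, the resulting hasHomolog links are symmetric and transitive by construction.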
Information Access through the Cell Line Ontology
• Genotypic information
  – Next release: breast cancer genotypic information from Lawrence Berkeley National
    Laboratory
• Quick access to cell line information in PubMed without the hassles
  – NLP implementation that filters out false positives caused by short or
    common-English-word cell line names
Examples of Cell Line Synonyms
Hyper CLDB cell line names that appear in NCBI GEO Microarray sample description,
ranked by number of occurrences.
Top ten Hyper CLDB cell line names that appear in non-biomedical corpus ranked by
number of occurrences

Cell Line Name   Count
M4               48
35               44
Aa               11
EC               11
P1               11
380              9
BT               8
L                7
CAR              4
OK               4
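A ranking of this kind can be produced by counting how often each cell line name occurs as a token in general text. The Python sketch below assumes simple alphanumeric tokenization and case-sensitive matching; the sample sentence is invented, and the corpus and tokenizer used for the table above are not specified here.

```python
import re
from collections import Counter


def count_name_hits(names, corpus_text):
    """Count case-sensitive token matches of cell line names in plain text."""
    tokens = re.findall(r"[A-Za-z0-9]+", corpus_text)
    counts = Counter(token for token in tokens if token in names)
    # Highest count first; names that never occur are omitted.
    return counts.most_common()


names = {"OK", "CAR", "L", "M4"}
sample = "It is OK to park the CAR here, OK? Then take the M4 motorway."
ranking = count_name_hits(names, sample)
```

Names that rank high in such a count are the ones most likely to produce false positives when a dictionary of cell line names is matched against free text.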
Conclusion
The non-standardized nomenclature of cell lines raises several issues for NLP.
Ambiguity in the use of these names needs to be resolved wherever possible. The
issues include the use of special characters in Unicode encodings, inconsistent
formatting of punctuation marks, and the use of common words as names (e.g. HORSE,
MOUSE, OK, L and 35 for cell lines). The Cell Line Ontology is a first step toward aiding
cell line information mining while these annotation issues remain unresolved.
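One way to reduce the Unicode and punctuation variation mentioned above is to map each name to a comparison key before lookup. The Python sketch below is an assumption-laden illustration, not part of the ontology: it applies NFKC normalization, removes spaces and punctuation, and ignores case.

```python
import re
import unicodedata


def normalize_name(name):
    """Map spelling variants of a cell line name to one comparison key."""
    # NFKC folds compatibility characters (e.g. full-width letters) to plain forms.
    text = unicodedata.normalize("NFKC", name)
    # Remove whitespace, punctuation, and underscores, including Unicode dashes.
    text = re.sub(r"[\W_]+", "", text)
    # Ignore letter case for comparison purposes only.
    return text.casefold()


variants = ["HeLa S3", "HeLa-S3", "HeLa\u2011S3", "hela s3"]
keys = {normalize_name(v) for v in variants}
```

The key is deliberately lossy: distinct names that differ only in case or punctuation would collide, so it suits candidate lookup rather than identity, and any match should still be checked against the ontology entry.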
References
• Bard, J. Rhee, S.Y. Ashburner, M. (2005) An ontology for cell types Genome Biol., 6(2): R21
• Frank, E. Hall, M. Trigg, L. Holmes, G. Witten, I.H. (2004) Data mining in bioinformatics using
Weka Bioinformatics, 20(15), 2479-2481
• Lee, K.J. Hwang, Y.S. Kim, S. Rim, S.C. (2004) Biomedical named entity recognition using
two-phase model based on SVMs J. Biomed Inform., 37(6), 436-447
• Liu, H. Hu, Z.Z. Torii, M. Wu, C. Friedman, C. (2006) Quantitative assessment of
dictionary-based protein named entity tagging J. Am. Med. Inform. Assoc., 13(5), 497-507
• NLM (2007) Medical Subject Headings http://www.nlm.nih.gov/mesh/meshhome.html
• Neve, R.M., et al. (2006) A collection of breast cancer cell lines for the study of functionally
distinct cancer subtypes Cancer Cell, 10(6), 515-527
• Noy, N.F. Sintek, M. Decker, S. Crubezy, M. Fergerson, R.W. Musen, M.A. (2000) Creating
Semantic Web Contents with Protégé-2000 Intelligent Systems, IEEE Vol. 6, 2, 60-71.
• Noy, N.F. Crubezy, M. Fergerson, R.W. Knublauch, H. Tu, S.W. Vendetti, J. Musen, M.A. (2003)
Protégé-2000: an open source ontology-development and knowledge-acquisition environment
AMIA Annual Symposium Proceedings 2003, 2003: 953
• Rinaldi, F. Schneider, G. Kaljurand, K. Hess, M. Romacker, M. (2006) An environment for
relation mining over richly annotated corpora: the case of GENIA BMC Bioinformatics, 2006:
7(Suppl 3):S3
• Schulz, S. Beisswanger, E. Wermter, J. Hahn, U. (2006) Towards an upper level ontology for
molecular biology AMIA Annual Symposium Proceedings 2006, 694-698
• University of Illinois at Urbana Champaign (2004) Sentence segmentation tool
http://l2r.cs.uiuc.edu/~cogcomp/atool.php?tkey=SS
Acknowledgements
The authors thank Mark Musen and Natasha Noy at Stanford BMI for Protégé support, and
Chachrist Srisuwanrat for VBA technical support. This work was supported by NIH grants
U54 DA021519 (National Center for Integrative Biomedical Informatics) and
R01 LM008106. The web ontology file for this cell line collection can be downloaded at
http://www.stateslab.org/data/celllineOntology/cellline.zip