Curating Biomedical Literature using Text Mining

BACHELOR OF COMPUTER SCIENCE (HONOURS) - UNIVERSITY OF SOUTH AUSTRALIA
Research Proposal
Samuel O’Malley
110015053 – OYMSJ001
31st of May 2012
Supervisor: Professor Jiuyong Li
Associate Supervisor: Dr Jixue Liu
Abstract
Biomedical literature is increasing exponentially and manual curation processes are not
recording the facts fast enough. Advances in natural language processing and text mining
enable computers to assist in the curation process by categorising data into meaningful groups
so that curators only see the literature they are looking for. These tools can also be powerful
enough to curate the data automatically, without any human input. Currently a few
solutions exist for automatically discovering protein-protein interactions from biomedical
literature; however, there is a clear lack of tools for microRNA literature. MicroRNA research
is increasing as deep sequencing technology becomes cheaper and interest in microRNA
grows. MicroRNA recognition is challenging because of the large number of synonyms and
the large number of species referred to in the literature. The research proposed here will
provide a solution to microRNA recognition and attempt to automatically extract information
from biomedical literature abstracts and generate a structured database of facts.
Contents
1. Introduction
1.1 Background and Motivation
1.2 Research Question
1.2.1 microRNA Entity Recognition
1.2.2 microRNA Relationship Detection
1.3 Justification
2. Literature Review
2.1 Text Mining
2.1.1 Definition of Text Mining
2.1.2 Entity Recognition
2.1.3 Information Retrieval
2.1.4 Information Extraction
2.2 Mining Biomedical Literature
2.2.1 DRENDA Disease Related Enzyme information Database
2.2.2 Gene Name Normalisation
2.2.3 BioPPIExtractor: Protein - Protein Interaction Extractor
2.2.4 Biolexicon
2.2.5 miRCancer
3. Methodology
3.1 Data Acquisition
3.2 Pre-processing
3.3 Entity Recognition
3.4 Relationship Analysis
3.5 Results Analysis
3.6 Expected Results
4. Project Schedule
5. Summary
6. References
List of Figures
Figure 1: Schematic Illustration of DRENDA workflow (Sohngen, Chang & Schomburg 2011)
Figure 2: Process flow diagram
List of Tables
Table 1: Structured Database Output – randomly chosen examples
1. Introduction
1.1 Background and Motivation
MicroRNAs are short, single-stranded, non-coding RNA molecules which inhibit protein
production in our cells. They occur naturally in the body and may play a role in treating
diseases and other conditions. Current microRNA research aims to discover the links between
individual microRNAs and protein production. Researchers also aim to introduce microRNAs
into cells artificially to reduce the levels of problem proteins, potentially treating cancers
and other diseases (Xie 2010; Liu et al. 2012; Selth et al. 2012; Zhang et al. 2012).
MicroRNA research, measured by the number of published articles, is increasing considerably
as the technology becomes cheaper and discovering new microRNAs becomes easier. Although
microRNAs were discovered in 1993 by the American molecular biologist Victor Ambros, the
technology used to discover and sequence new microRNAs has been widely available for only
a few years (Roads 2010). Given this volume of new research, the data needs to be represented
in a structured format in order to be useful. Currently the databases used to store this
information are curated manually by teams of domain experts; however, these databases do not
adequately reflect the current state of research, and no single researcher can keep up with
all of the literature in their field (Jensen, Saric & Bork 2006). Literature mining tools are
becoming essential for researchers, enabling them to narrow the literature down to relevant
publications and potentially to discover new information.
Automatic curation methods using text mining have already been developed for other fields in
biology, such as protein-protein interactions; however, these methods cannot be applied
directly to microRNA because of limitations discussed in the literature review.
1.2 Research Question
The overall aim of this research is to determine an effective technique for extracting
information about microRNA interactions from biomedical literature. This research can be
split into two problems:
1. Recognising microRNA occurrences and removing ambiguity
2. Determining the relationship between co-mentioned microRNA and some other
biological entities.
1.2.1 microRNA Entity Recognition
This research will endeavour to accurately detect occurrences of microRNA in biomedical
literature. The main challenge is that each microRNA has many synonyms, and a given name
can be ambiguous.
1.2.2 microRNA Relationship Detection
MicroRNA can occur in the same sentence as many different types of biological terms.
Relationship detection will take a microRNA and a co-mentioned biological term and analyse
the relationship between them in order to classify it as meaningful or not. An example
relationship would be “MicroRNA (A) inhibits Gene (B) production”, where A and B are a
microRNA and a gene name respectively.
1.3 Justification
This research has similarities to current research in other fields of biomedical text mining,
such as protein-protein interaction detection and gene name normalisation (Crim, McDonald &
Pereira 2005; Sun et al. 2009; Gerold, Simon & Fabio 2011; Xia et al. 2011). However, because
microRNA is a relatively new field of research, there is a clear lack of tools for assisting
in curating microRNA information from biomedical literature. This research will adapt and
extend existing tools from similar fields of biology and apply them to microRNA, as discussed
in the literature review in Section 2.2.
2. Literature Review
2.1 Text Mining
This section provides an overview of Text Mining research and current applications.
2.1.1 Definition of Text Mining
Data mining is the endeavour of discovering previously unknown information from data. Text
mining is a subset of data mining with the ultimate aim of discovering new information from
free text literature. The three parts of text mining are Entity Recognition (ER), Information
Retrieval (IR) and Information Extraction (IE).
2.1.2 Entity Recognition
Entity Recognition (ER) is a subset of text mining aimed at recognising important entities in
free-text. For our research this includes recognising microRNA and gene names in biomedical
literature. Some challenges presented in ER research include disambiguating entity names and
normalisation.
2.1.3 Information Retrieval
Information Retrieval (IR) encompasses advanced queries which go beyond simple keyword
searches. IR includes entity recognition and clustering algorithms to provide better results to a
user’s query.
2.1.4 Information Extraction
Information Extraction (IE) goes one step beyond IR in that instead of providing results to a
query, it extracts facts from the literature and returns these instead of the full-text.
2.2 Mining Biomedical Literature
This section will provide an overview of the more specific field of text mining biomedical
literature.
2.2.1 DRENDA Disease Related Enzyme information Database
DRENDA is a system developed by Sohngen, Chang and Schomburg (2011) for detecting and
classifying disease-related enzyme information.
Figure 1: Schematic Illustration of DRENDA workflow (Sohngen, Chang & Schomburg 2011)
From the DRENDA workflow diagram in Figure 1 we see that the system uses the BRENDA
database (BRaunschweig ENzyme DAtabase) and the MeSH database (Medical Subject
Headings) as dictionaries for entity recognition. Literature is obtained by crawling PubMed
and extracting abstracts, after which initial pre-processing such as sentence splitting is
applied. A training corpus is used to train a Support Vector Machine (SVM), which produces a
classification model. Sentences with co-occurring disease and enzyme mentions are extracted
and the SVM classification model is applied to them. The result is a set of classified
sentences, which is evaluated against a test corpus. Correctly classified sentences are added
to the DRENDA database as facts.
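The co-occurrence step of this workflow can be illustrated with a small sketch. It is not
DRENDA's implementation: the sentence splitter is a naive regular expression, and the two
term sets are hypothetical stand-ins for dictionaries built from BRENDA and MeSH.

```python
import re

# Hypothetical stand-ins for dictionary terms drawn from BRENDA (enzymes)
# and MeSH (diseases); the real resources are far larger.
ENZYME_TERMS = {"hexokinase", "catalase"}
DISEASE_TERMS = {"diabetes", "anaemia"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def cooccurring_sentences(text):
    """Return sentences that mention at least one enzyme and one disease term."""
    hits = []
    for sentence in split_sentences(text):
        words = set(re.findall(r"[a-z0-9-]+", sentence.lower()))
        if words & ENZYME_TERMS and words & DISEASE_TERMS:
            hits.append(sentence)
    return hits


# Invented abstract text, used only to exercise the functions above.
abstract = (
    "Hexokinase activity was reduced in patients with diabetes. "
    "Catalase was purified from liver tissue. "
    "No enzyme was measured in the anaemia cohort."
)
candidates = cooccurring_sentences(abstract)
print(candidates)
```

Only the first sentence is kept, because it is the only one that contains a term from both
sets. In a DRENDA-style workflow these candidate sentences would then be passed to the
trained classifier.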
This system cannot be directly applied to microRNA literature; however, the workflow can be
followed closely. Before the system can be extended to microRNA literature, an appropriate
microRNA dictionary resource must be identified. The evaluation by Sohngen et al. is very
thorough and compares multiple pre-processing methods in order to determine the best ones.
2.2.2 Gene Name Normalisation
A problem with biomedical literature is that each entity has many different names, and there
are complex naming conventions which might not be faithfully followed. For example, naming
conventions use capitalisation to distinguish between species, but this convention might not
be followed if the context of the literature makes it clear which species is being discussed.
Sun et al. (2009) present a multi-level disambiguation framework for gene name
normalisation. The authors show that human genes have on average 5.5 synonyms for each
identifier. While a human reader would resolve these using contextual clues, a machine
has a much harder time.
Sun et al. introduce a context-aware algorithm to disambiguate species amongst the different
synonyms used in the literature. For example, if the majority of genes mentioned in a
document are human genes, then it is reasonable to assume that any ambiguous gene names in
the document are also human genes.
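The majority-species idea can be sketched as follows. This is a deliberate simplification
for illustration and not the framework of Sun et al.; the function name and the example
mentions are hypothetical.

```python
from collections import Counter


def resolve_species(unambiguous_species, ambiguous_mentions):
    """Assign the document's majority species to each ambiguous gene mention."""
    if not unambiguous_species:
        return {}  # no evidence in the document, so leave everything unresolved
    majority, _ = Counter(unambiguous_species).most_common(1)[0]
    return {mention: majority for mention in ambiguous_mentions}


# Species of the gene mentions that could be resolved directly
# (a hypothetical document with three human genes and one mouse gene).
resolved = ["human", "human", "mouse", "human"]
assignments = resolve_species(resolved, ["p53", "Akt1"])
print(assignments)
```

Here both ambiguous mentions are assigned to human, the species of most resolved genes in
the document.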
The authors use a maximum entropy model and binary classes of meaningful and not
meaningful to disambiguate gene names. This algorithm is similar to Crim, McDonald and
Pereira’s algorithm (2005) except it uses more contextual cues to disambiguate gene names.
2.2.3 BioPPIExtractor: Protein - Protein Interaction Extractor
This system extracts protein-protein interactions from biomedical literature, using syntactic
grammar parsers to better understand the relationship between two proteins (Yang, Lin &
Wu 2009). The system was manually evaluated for precision and recall, and was found to
perform better than two other leading systems, BioRAT (Corney et al. 2004) and
IntEx (Silberztein 2000).
2.2.4 Biolexicon
The Biolexicon is a large-scale lexical resource of biological terms (Thompson et al. 2011). It
combines multiple data sources into one large resource which can be used at multiple stages
of the text mining process. This system uses its vast knowledge of biological terms to
discover new textual variants which do not occur in the database resources.
Although this system is very useful, it has no knowledge of microRNA entities. It can still
assist our efforts in microRNA detection because it has knowledge of biology-specific verbs
such as “retro-regulate”, which do not occur in a standard dictionary (BOOTStrep Bio-Lexicon 2012).
2.2.5 miRCancer
MiRCancer is a comprehensive database for microRNA expression profiles in human cancers
based on experimental results (Xie 2010). Essentially this framework is specifically designed
to uncover relationships between microRNA and cancers in biomedical literature.
A limitation of this system is that the relationship between the microRNA and the cancer
is not detected or analysed. This could result in false positives or unimportant data being
recorded in the miRCancer database.
3. Methodology
The following diagram (Figure 2) represents the process flow that our program will follow.
The order of the stages reflects the standard text mining process and will closely match the
structure of the software.
1. Data Acquisition: crawl PubMed database; extract abstracts
2. Preprocessing: stop word removal; tokenization
3. Entity Recognition: miRBase dictionary; disambiguation
4. Relationship Analysis: classify relationship based on joining words
5. Results Analysis: precision; recall
Figure 2: Process flow diagram
3.1 Data Acquisition
Data will be acquired from the PubMed open access database and will only include titles and
abstracts. There are two reasons for restricting data acquisition to titles and abstracts.
Firstly, titles and abstracts are freely available and do not require any complex PDF
processing, which reduces the complexity and processing time of our algorithm. Secondly,
the work by Wei and Collier (2011) suggests that most of the important terms are mentioned
in the abstract and title, and repeated with more detail in the Introduction, Results and
Conclusion sections. This suggests that if there are no occurrences of microRNA in the title
or abstract, then the full paper is unlikely to be relevant. To future-proof our research, all
abstracts will be stored in a MySQL database and paired with their permanent URL to allow
full-text downloads at a later date.
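A minimal sketch of the storage step is given below. To keep it self-contained it uses
Python's built-in SQLite module as a stand-in for MySQL, and the table layout, the function
name and the URL pattern are assumptions made for illustration rather than a settled design.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for the MySQL database.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE abstracts (
        pmid     TEXT PRIMARY KEY,  -- PubMed identifier
        title    TEXT NOT NULL,
        abstract TEXT NOT NULL,
        url      TEXT NOT NULL      -- permanent URL for later full-text retrieval
    )
    """
)


def store_abstract(conn, pmid, title, abstract):
    """Insert one record, deriving the permanent URL from the PubMed identifier."""
    # Assumed URL pattern; check it against the current PubMed site before use.
    url = f"https://pubmed.ncbi.nlm.nih.gov/{pmid}/"
    conn.execute(
        "INSERT OR REPLACE INTO abstracts VALUES (?, ?, ?, ?)",
        (pmid, title, abstract, url),
    )
    conn.commit()
    return url


# Placeholder record; the identifier and text are invented for illustration.
store_abstract(conn, "00000000", "Example title", "Example abstract text.")
row = conn.execute("SELECT pmid, url FROM abstracts").fetchone()
print(row)
```

Keying the table on the PubMed identifier means that re-crawling the same article replaces
the existing row instead of creating a duplicate.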
3.2 Pre-processing
Common text mining pre-processing tasks will be applied to our data. Firstly, tokenisation
will be applied to separate each sentence into tokens (words without any punctuation). Then
commonly occurring English words, called stop words, will be removed. MicroRNA and other
medical entities will then be removed from the sentence to avoid confusing the
classification algorithm. Completely removing medical entities has been shown to perform
better in classification tasks than replacing them with a generic word
(Sohngen, Chang & Schomburg 2011).
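The three pre-processing steps can be sketched as below. The stop-word list is a tiny
illustrative sample, the example sentence is invented, and the sketch assumes that each
entity mention is a single token.

```python
import re

# Small illustrative stop-word list; a real list would be much longer.
STOP_WORDS = {"the", "of", "in", "a", "an", "and", "is", "by", "to"}


def preprocess(sentence, entities):
    """Tokenise, then drop stop words and recognised entity mentions."""
    # Tokens are runs of letters and digits joined by internal hyphens, so
    # "miR-150" stays whole while surrounding punctuation is discarded.
    tokens = re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", sentence)
    entity_set = {e.lower() for e in entities}
    return [
        t.lower()
        for t in tokens
        if t.lower() not in STOP_WORDS and t.lower() not in entity_set
    ]


sentence = "Expression of MYB is inhibited by miR-150 in the cells."
features = preprocess(sentence, entities=["miR-150", "MYB"])
print(features)
```

The remaining tokens are the words that a classifier would see once stop words and entity
names have been stripped out.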
3.3 Entity Recognition
miRBase will be used to facilitate microRNA entity recognition (Kozomara & Griffiths-Jones
2011). This database contains manually curated microRNA information, including
deep-sequencing data and the unique sequence of nucleotides which makes up each microRNA.
The most useful information contained in this database is the set of synonyms used to refer
to individual microRNA and a unique identifier which can be used to refer to each microRNA
without any ambiguity. Various microRNA databases were evaluated for biomedical
applications, and miRBase was shown to be an extensive resource valuable for annotation
tasks (Tan Gana, Victoriano & Okamoto 2012).
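Dictionary-based recognition with normalisation to a unique identifier could look like the
sketch below. The synonym table is invented for illustration and is not taken from miRBase;
the identifiers follow the style of the examples in Table 1, and the regular expression is
only one assumed way of spotting candidate mentions.

```python
import re

# Hypothetical synonym table: surface form (lower case) -> unique identifier.
SYNONYMS = {
    "hsa-mir-150": "hsa-mir-150",
    "mir-150": "hsa-mir-150",
    "microrna-150": "hsa-mir-150",
    "hsa-mir-7a-1": "hsa-mir-7a-1",
}

# Candidate mentions: an optional three-letter species prefix, then "mir" or
# "microrna", then a number with optional letter and numeric suffixes.
MENTION = re.compile(
    r"\b(?:[a-z]{3}-)?(?:mir|microrna)-\d+[a-z]?(?:-\d+)?\b", re.IGNORECASE
)


def recognise(text):
    """Return (mention, identifier) pairs for every mention found in the table."""
    found = []
    for match in MENTION.finditer(text):
        identifier = SYNONYMS.get(match.group(0).lower())
        if identifier is not None:
            found.append((match.group(0), identifier))
    return found


mentions = recognise("Levels of miR-150 and microRNA-150 were compared with hsa-mir-7a-1.")
print(mentions)
```

Two different surface forms are mapped to the same identifier, which is the normalisation
behaviour the dictionary is meant to provide. Mentions that match the pattern but are absent
from the table are skipped and would need a separate disambiguation step.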
3.4 Relationship Analysis
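Figure 2 describes this stage as classifying the relationship based on joining words. One
possible reading of that idea is sketched below, under the assumption that a relationship is
meaningful when a cue word appears between the two entities. The cue-word list, the function
name and the example sentences are hypothetical and do not describe a finalised method.

```python
# Hypothetical cue words that would mark a sentence as describing a relationship.
CUE_WORDS = {"inhibits", "inhibited", "regulates", "suppresses", "targets"}


def classify_relationship(tokens, mirna, entity):
    """Label a co-mention 'Meaningful' if a cue word lies between the two entities."""
    lowered = [t.lower() for t in tokens]
    try:
        i, j = lowered.index(mirna.lower()), lowered.index(entity.lower())
    except ValueError:
        return "Not meaningful"  # one of the two entities is absent
    start, end = sorted((i, j))
    joining = lowered[start + 1:end]  # the words joining the two entities
    return "Meaningful" if CUE_WORDS & set(joining) else "Not meaningful"


tokens_a = "miR-150 inhibits MYB production".split()
tokens_b = "miR-150 and MYB were both measured".split()
label_a = classify_relationship(tokens_a, "miR-150", "MYB")
label_b = classify_relationship(tokens_b, "miR-150", "MYB")
print(label_a, label_b)
```

The first sentence is labelled meaningful because "inhibits" joins the two entities; the
second is not, because the entities are merely listed together.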
3.5 Results Analysis
Precision and recall are the standard measures for evaluating text mining algorithms.
However, there is no gold standard available for microRNA literature, so a manual analysis
will need to be performed. A small test dataset will be compiled manually and used to
evaluate our algorithm. Precision and recall can be used to compare different algorithms
even across different fields, which means that our algorithm can be compared to existing
algorithms which do not relate to microRNA. This is useful because there is currently very
little research into automatic microRNA curation.
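Precision is the fraction of extracted facts that are correct, TP / (TP + FP), and recall is
the fraction of correct facts that were extracted, TP / (TP + FN), where TP, FP and FN are
the counts of true positives, false positives and false negatives. The sketch below computes
both against a manually compiled test set; the facts shown are invented placeholders.

```python
def precision_recall(predicted, gold):
    """Compute precision and recall for two sets of extracted facts."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Invented (microRNA, entity) pairs: the manual test set and the system output.
gold = {("mirA", "gene1"), ("mirA", "gene2"), ("mirB", "gene3"), ("mirC", "gene4")}
predicted = {("mirA", "gene1"), ("mirB", "gene3"), ("mirB", "gene9")}
precision, recall = precision_recall(predicted, gold)
print(precision, recall)
```

Two of the three predicted facts are in the test set, giving a precision of 2/3, and two of
the four test-set facts were found, giving a recall of 0.5.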
3.6 Expected Results
Table 1: Structured Database Output – randomly chosen examples
MicroRNA        Entity                                  Class
hsa-mir-150     alpha-1-B glycoprotein                  Meaningful
hsa-mir-7a-1    apoptosis-associated tyrosine kinase    Meaningful
If the research is successful, the outcome will be a structured database in which each record
contains a microRNA, another biological entity (initially gene names, later expanding to
include diseases and other entities) and a class label (see Table 1). At this stage the
classification is binary, either meaningful or not meaningful; however, after analysis of the
returned data we might need to introduce further classes.
4. Project Schedule
This section outlines the proposed high level schedule of the research project.
Date
Task
February
March
Literature Review
April
May
Research Proposal
June
July
Data Acquisition (Section 3.1)
Pre-processing (Section 3.2)
August
Entity Recognition (Section 3.3)
Relationship Analysis (Section 3.4)
September
October
Testing and Evaluation (Section 3.5)
Preparation of Thesis
November
5. Summary
This research project aims to combine computing power with biomedical domain knowledge to
assist in the process of curating microRNA literature. Although no algorithm will be
infallible or able to replace the manual curation process completely, the speed of computer
processing will greatly assist the curator’s task. A challenge addressed in this research is
recognising microRNA entities and their variations in biomedical literature.
6. References
BOOTStrep Bio-Lexicon 2012, The National Centre for Text Mining - University of
Manchester, <http://www.nactem.ac.uk/biolexicon/>.
Corney, DPA, Buxton, BF, Langdon, WB & Jones, DT 2004, 'BioRAT: extracting biological
information from full-length papers', Bioinformatics, vol. 20, no. 17, pp. 3206-3213.
Crim, J, McDonald, R & Pereira, F 2005, 'Automatically annotating documents with
normalized gene lists', BMC Bioinformatics, vol. 6, no. Suppl 1, p. S13.
Gerold, S, Simon, C & Fabio, R 2011, 'Detection of interaction articles and experimental
methods in biomedical literature', BMC Bioinformatics, vol. 12, no. Suppl 8, p. S13.
Jensen, LJ, Saric, J & Bork, P 2006, 'Literature mining for the biologist: from information
retrieval to biological discovery', Nat Rev Genet, vol. 7, no. 2, pp. 119-129.
Kozomara, A & Griffiths-Jones, S 2011, 'miRBase: integrating microRNA annotation and
deep-sequencing data', Nucleic Acids Research, vol. 39, no. suppl 1, pp. D152-D157.
Liu, J, Gao, J, Du, Y, Li, Z, Ren, Y, Gu, J, Wang, X, Gong, Y, Wang, W & Kong, X 2012,
'Combination of plasma microRNAs with serum CA19-9 for early detection of pancreatic
cancer', Int J Cancer, vol. 131, no. 3, Aug 1, pp. 683-691.
Roads, RE 2010, Progress in Molecular and Subcellular Biology, Springer, Shreveport LA.
Selth, LA, Townley, S, Gillis, JL, Ochnik, AM, Murti, K, Macfarlane, RJ, Chi, KN, Marshall,
VR, Tilley, WD & Butler, LM 2012, 'Discovery of circulating microRNAs associated with
human prostate cancer using a mouse model of disease', Int J Cancer, vol. 131, no. 3, Aug 1,
pp. 652-661.
Silberztein, M 2000, 'INTEX: an FST toolbox', Theoretical Computer Science, vol. 231, no. 1,
pp. 33-46.
Sohngen, C, Chang, A & Schomburg, D 2011, 'Development of a classification scheme for
disease-related enzyme information', BMC Bioinformatics, vol. 12, no. 1, p. 329.
Sun, C-J, Wang, X-L, Lin, L & Liu, Y-C 2009, 'A Multi-level Disambiguation Framework for
Gene Name Normalization', Acta Automatica Sinica, vol. 35, no. 2, pp. 193-197.
Tan Gana, NH, Victoriano, AFB & Okamoto, T 2012, 'Evaluation of online miRNA resources
for biomedical applications', Genes to cells : devoted to molecular & cellular mechanisms,
vol. 17, no. 1, pp. 11-27.
Thompson, P, McNaught, J, Montemagni, S, Calzolari, N, del Gratta, R, Lee, V, Marchi, S,
Monachini, M, Pezik, P, Quochi, V, Rupp, C, Sasaki, Y, Venturi, G, Rebholz-Schuhmann, D
& Ananiadou, S 2011, 'The BioLexicon: a large-scale terminological resource for biomedical
text mining', BMC Bioinformatics, vol. 12, no. 1, p. 397.
Wei, Q & Collier, N 2011, 'Towards classifying species in systems biology papers using text
mining', BMC Research Notes, vol. 4, no. 1, p. 32.
Xia, N, Lin, H, Yang, Z & Li, Y 2011, 'Combining multiple disambiguation methods for gene
mention normalization', Expert Systems With Applications, vol. 38, no. 7, pp. 7994-7999.
Xie, B 2010, 'miRCancer: a microRNA-Cancer Association Database and Toolkit Based on
Text Mining'.
Yang, Z, Lin, H & Wu, B 2009, 'BioPPIExtractor: A protein–protein interaction extraction
system for biomedical literature', Expert Systems With Applications, vol. 36, no. 2,
pp. 2228-2233.
Zhang, J, Zhao, H, Gao, Y & Zhang, W 2012, 'Secretory miRNAs as novel cancer
biomarkers', Biochim Biophys Acta, vol. 1826, no. 1, Aug, pp. 32-43.