BACHELOR OF COMPUTER SCIENCE (HONOURS) - UNIVERSITY OF SOUTH AUSTRALIA

Curating Biomedical Literature using Text Mining
Research Proposal

Samuel O’Malley
110015053 – OYMSJ001
31st of May 2012

Supervisor: Professor Jiuyong Li
Associate Supervisor: Dr Jixue Liu

Abstract

Biomedical literature is increasing exponentially and manual curation processes are not recording the facts fast enough. Advances in natural language processing and text mining enable computers to assist in the curation process by categorising data into meaningful groups, so that curators only see the literature they are looking for. These tools can also be powerful enough to curate the data automatically, without any human input. A few solutions currently exist for automatically discovering protein-protein interactions from biomedical literature; however, there is a clear lack of tools for microRNA literature. MicroRNA research is increasing as the technology for deep sequencing becomes cheaper and interest in microRNA grows. MicroRNA recognition is challenging because of the large number of synonyms and the large number of species referred to in the literature. The research proposed here will provide a solution to microRNA recognition and attempt to automatically extract information from biomedical literature abstracts and generate a structured database of facts.

Contents

1. Introduction
1.1 Background and Motivation
1.2 Research Question
1.2.1 microRNA Entity Recognition
1.2.2 microRNA Relationship Detection
1.3 Justification
2. Literature Review
2.1 Text Mining
2.1.1 Definition of Text Mining
2.1.2 Entity Recognition
2.1.3 Information Retrieval
2.1.4 Information Extraction
2.2 Mining Biomedical Literature
2.2.1 DRENDA Disease Related Enzyme information Database
2.2.2 Gene Name Normalisation
2.2.3 BioPPIExtractor: Protein - Protein Interaction Extractor
2.2.4 Biolexicon
2.2.5 miRCancer
3. Methodology
3.1 Data Acquisition
3.2 Pre-processing
3.3 Entity Recognition
3.4 Relationship Analysis
3.5 Results Analysis
3.6 Expected Results
4. Project Schedule
5. Summary
6. References

List of Figures

Figure 1: Schematic Illustration of DRENDA workflow (Sohngen, Chang & Schomburg 2011)
Figure 2: Process flow diagram

List of tables

Table 1: Structured Database Output – randomly chosen examples

1. Introduction

1.1 Background and Motivation

MicroRNA are tiny single-stranded lengths of non-coding RNA which inhibit protein production in our cells. They occur naturally in the body and have the potential to cure a disease or condition. Current microRNA research is aimed at discovering the links between different microRNA and protein production. Researchers also aim to artificially introduce microRNA into cells to reduce problem proteins and so potentially cure cancers or diseases (Xie 2010; Liu et al. 2012; Selth et al. 2012; Zhang et al. 2012).

MicroRNA research, measured by the number of published articles and journals, is increasing considerably as the technology becomes cheaper and it becomes relatively easier to discover new microRNA. Although their existence was discovered in 1993 by the American molecular biologist Victor Ambros, the technology used to discover and sequence new microRNA has only been widely available for a few years (Roads 2010).

Due to this volume of new research, the data needs to be represented in a structured format in order to be useful. Currently the databases used to store this information are curated manually by teams of domain experts; however, these databases do not adequately reflect the current state of research, and no single researcher can keep up with every publication in their field (Jensen, Saric & Bork 2006).
Literature mining tools are becoming essential for researchers, enabling them to narrow the literature down to only the relevant publications and potentially to discover new information. Automatic curation methods using text mining have already been developed for other fields in biology, such as protein-protein interactions; however, these methods cannot be directly applied to microRNA due to limitations discussed later in this proposal.

1.2 Research Question

The overall aim of this research is to determine an effective technique for extracting information about microRNA interactions from biomedical literature. This research can be split into two problems:

1. Recognising microRNA occurrences and removing ambiguity.
2. Determining the relationship between co-mentioned microRNA and other biological entities.

1.2.1 microRNA Entity Recognition

This research will endeavour to accurately detect occurrences of microRNA in biomedical literature. This is challenging because each microRNA has many synonyms and its names can be ambiguous.

1.2.2 microRNA Relationship Detection

MicroRNA can occur in the same sentence as many different types of biological terms. Relationship detection will take the microRNA and the other biological term and analyse the relationship between them in order to classify the information as meaningful or not. An example relationship would be “MicroRNA (A) inhibits Gene (B) Production”, where A and B are a microRNA and a gene name respectively.

1.3 Justification

This research has similarities to current research in other fields of biomedical text mining, such as protein-protein detection and gene name normalisation (Crim, McDonald & Pereira 2005; Sun et al. 2009; Gerold, Simon & Fabio 2011; Xia et al. 2011). However, because microRNA is a relatively new field of research, there is a clear lack of tools for assisting in curating microRNA information from biomedical literature.
This research will adapt and extend existing tools from similar fields of biology and apply them to microRNA, as discussed in the literature review in Section 2.2.

2. Literature Review

2.1 Text Mining

This section provides an overview of text mining research and current applications.

2.1.1 Definition of Text Mining

Data mining is the endeavour of discovering previously unknown information from data. Text mining is a subset of data mining with the ultimate aim of discovering new information from free-text literature. The three parts of text mining are Entity Recognition (ER), Information Retrieval (IR) and Information Extraction (IE).

2.1.2 Entity Recognition

Entity Recognition (ER) is the part of text mining aimed at recognising important entities in free text. For our research this includes recognising microRNA and gene names in biomedical literature. Challenges in ER research include disambiguating entity names and normalising them.

2.1.3 Information Retrieval

Information Retrieval (IR) encompasses advanced queries which go beyond simple keyword searches. IR uses entity recognition and clustering algorithms to provide better results to a user’s query.

2.1.4 Information Extraction

Information Extraction (IE) goes one step beyond IR: instead of providing results to a query, it extracts facts from the literature and returns these instead of the full text.

2.2 Mining Biomedical Literature

This section provides an overview of the more specific field of text mining biomedical literature.

2.2.1 DRENDA Disease Related Enzyme information Database

DRENDA is a system developed by Sohngen, Chang and Schomburg (2011) for detecting and classifying disease-related enzyme information.
Figure 1: Schematic Illustration of DRENDA workflow (Sohngen, Chang & Schomburg 2011)

From the DRENDA workflow diagram in Figure 1 we see that the system uses the BRENDA database (BRaunschweig ENzyme Database) and the MeSH database (MEdical Subject Headings) as dictionaries for entity recognition. Literature is obtained by crawling PubMed and extracting abstracts, and initial pre-processing such as sentence splitting is then applied. A training corpus is used to train the SVM (Support Vector Machine) algorithm, which generates a classification model. Sentences with co-occurring disease and enzyme mentions are extracted and this SVM classification model is applied to them. The result is a set of classified sentences which is evaluated using a test corpus. Correctly evaluated sentences are added to the DRENDA database as facts.

This system cannot be directly applied to microRNA literature; however, the workflow can be followed closely. Before this system can be extended to microRNA literature, an appropriate microRNA dictionary resource must be identified. The evaluation methods used by Sohngen et al. are very thorough and compare multiple pre-processing methods in order to determine the best ones.

2.2.2 Gene Name Normalisation

A problem with biomedical literature is that each entity has many different names, and there are complex naming conventions which might not be faithfully followed. Naming conventions include capitalisation to represent different species. This convention might not be followed if the context of the literature makes it clear what species is being discussed. Sun et al. (2009) present a multi-level disambiguation framework for gene name normalization. The authors show that human genes have on average 5.5 synonyms for each identifier. While a human reader would understand these using contextual clues, a machine has a much harder time doing so. Sun et al.
endeavour to introduce a context awareness algorithm to disambiguate species amongst the different synonyms used in the literature. For example, if the majority of genes mentioned in a document are human genes, then we can reasonably assume that any ambiguous gene names in the document are also human genes. The authors use a maximum entropy model and binary classes of meaningful and not meaningful to disambiguate gene names. This algorithm is similar to Crim, McDonald and Pereira’s algorithm (2005), except that it uses more contextual cues to disambiguate gene names.

2.2.3 BioPPIExtractor: Protein - Protein Interaction Extractor

This system extracts protein-protein interactions from biomedical literature, using syntactic grammar parsers to further understand the relationship between two proteins (Yang, Lin & Wu 2009). The system was manually evaluated for precision and recall, and was found to perform better than two other leading systems, BioRAT (Corney et al. 2004) and IntEx (Silberztein 2000).

2.2.4 Biolexicon

The Biolexicon is a large-scale lexical resource of biological terms (Thompson et al. 2011). It combines multiple data sources into one large resource which can be used at multiple stages of the text mining process. This system uses its vast knowledge of biological terms to discover new textual variants which do not occur in the database resources. Although this system is very useful, it has no knowledge of microRNA entities. It can still assist our efforts in microRNA detection because it has knowledge of biology-specific verbs such as “retro-regulate” which do not occur in a standard dictionary (BOOTStrep Bio-Lexicon 2012).

2.2.5 miRCancer

miRCancer is a comprehensive database of microRNA expression profiles in human cancers based on experimental results (Xie 2010). Essentially, this framework is specifically designed to uncover relationships between microRNA and cancers in biomedical literature.
A limitation of this system is that the relationship between the microRNA and the cancer is not detected or analysed. This could result in false positives or unimportant data in the miRCancer database.

3. Methodology

The following diagram (Figure 2) represents the process flow that our program will take. The order of the stages reflects the text mining process and will closely match the structure of the software.

1. Data Acquisition: crawl PubMed database; extract abstracts
2. Preprocessing: stop word removal; tokenization
3. Entity Recognition: miRBase dictionary; disambiguation
4. Relationship Analysis: classify relationship based on joining words
5. Results Analysis: precision; recall

Figure 2: Process flow diagram

3.1 Data Acquisition

Data will be acquired from the PubMed open access database and will only include titles and abstracts. There are two reasons for extracting only titles and abstracts. Firstly, titles and abstracts are freely available and do not require any complex PDF processing, which reduces the complexity and processing time of our algorithm. Secondly, the work by Wei and Collier (2011) suggests that most of the important terms are mentioned in the abstract and title, and repeated in more detail in the Introduction, Results and Conclusion sections. This suggests that if there are no occurrences of microRNA in the title or abstract, then the full paper is unlikely to be relevant. To future-proof our research, all abstracts will be stored in a MySQL database and paired with their permanent URL to allow full-text downloads at a later date.

3.2 Pre-processing

Common text mining pre-processing tasks will be applied to our data. Firstly, tokenisation will be applied to separate each sentence into tokens (words without any punctuation). Then commonly occurring English language words, called stop words, will be removed.
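The tokenisation and stop word removal steps described above can be sketched as follows. This is a minimal illustration in Python rather than the project's actual implementation: the stop word list, the made-up example sentence and the function names `tokenise` and `remove_stop_words` are all assumptions introduced for the example. Hyphens inside a token are kept so that a microRNA name such as miR-21 survives as a single token.

```python
import re

# Illustrative stop word list (an assumption); a real run would use a fuller list.
STOP_WORDS = frozenset({"a", "an", "and", "in", "is", "of", "the", "to", "was"})


def tokenise(sentence):
    """Split a sentence into lower-case word tokens, dropping punctuation."""
    # Internal hyphens are kept so that names such as "miR-21" stay whole.
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", sentence.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not in the stop word list."""
    return [token for token in tokens if token not in stop_words]


# Made-up example sentence, used only to show the two steps in order.
sentence = "The expression of miR-21 was increased in the tumour tissue."
tokens = remove_stop_words(tokenise(sentence))
print(tokens)
```

Lower-casing is a simplification here. Because capitalisation can carry species information in some naming conventions, a real pipeline might need to preserve case for the entity recognition step.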
MicroRNA and other medical entities will then be removed from the sentence in order to avoid confusing the classification algorithm. Completely removing medical entities has been shown to perform better in classification tasks than replacing them with a generic word (Sohngen, Chang & Schomburg 2011).

3.3 Entity Recognition

miRBase will be used to facilitate microRNA entity recognition (Kozomara & Griffiths-Jones 2011). This database contains manually curated microRNA information, including deep sequencing data and the unique sequence of nucleotides which makes up each microRNA. The most useful information contained in this database is the set of synonyms used to refer to individual microRNA, together with a unique identifier which can be used to refer to a microRNA without any ambiguity. Various microRNA databases have been evaluated for biomedical applications, and miRBase was shown to be an extensive resource valuable for annotation tasks (Tan Gana, Victoriano & Okamoto 2012).

3.4 Relationship Analysis

3.5 Results Analysis

Precision and recall are the standard measures for evaluating text mining algorithms. However, there is no gold standard available for microRNA literature, so a manual analysis will need to be performed. A small test dataset will be compiled manually and used to evaluate our algorithm. Precision and recall can be used to compare different algorithms even across different fields, which means that our algorithm can be compared to existing algorithms which do not relate to microRNA. This is useful because there is currently very little research into automatic microRNA curation.
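As a concrete illustration of the two measures: precision is the fraction of extracted facts that are correct, TP / (TP + FP), and recall is the fraction of the correct facts that were extracted, TP / (TP + FN). The sketch below computes both against a manually compiled test set. The function name `precision_recall` and the microRNA and gene identifiers are placeholders invented for the example, not real data or results.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) of predicted facts against a gold set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical (microRNA, gene) pairs; the names are placeholders.
gold = {
    ("mir-A", "gene-1"),
    ("mir-A", "gene-2"),
    ("mir-B", "gene-3"),
    ("mir-C", "gene-4"),
}
predicted = {
    ("mir-A", "gene-1"),  # correct
    ("mir-B", "gene-3"),  # correct
    ("mir-B", "gene-9"),  # false positive
}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints: precision=0.67 recall=0.50
```

Treating each extracted fact as a tuple in a set means that duplicate extractions of the same fact are counted once, which is one possible design choice for a structured database of facts.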
3.6 Expected Results

Table 1: Structured Database Output – randomly chosen examples

MicroRNA       Entity                                 Class
hsa-mir-150    alpha-1-B glycoprotein                 Meaningful
hsa-mir-7a-1   apoptosis-associated tyrosine kinase   Meaningful

If the research is successful, the outcome will be a structured database in which each record contains a microRNA, another biological entity (initially gene names, later expanding to include diseases and other entities) and a classification (see Table 1). At this stage the classification is binary, either meaningful or not meaningful; however, after analysis of the returned data we might need to introduce further classes.

4. Project Schedule

This section outlines the proposed high-level schedule of the research project.

Date                  Task
February - March      Literature Review
April - May           Research Proposal
June - July           Data Acquisition (Section 3.1); Pre-processing (Section 3.2)
August - September    Entity Recognition (Section 3.3); Relationship Analysis (Section 3.4)
October - November    Testing and Evaluation (Section 3.5); Preparation of Thesis

5. Summary

This research project is motivated by the aim of combining computing power with biomedical domain knowledge to assist in the process of curating microRNA literature. Even though no algorithm will be infallible or able to replace the manual curation process completely, the speed of computer processing will greatly assist the curator’s task. A challenge addressed in this research is recognising microRNA entities and their variations in biomedical literature.

6. References

BOOTStrep Bio-Lexicon 2012, The National Centre for Text Mining - University of Manchester, <http://www.nactem.ac.uk/biolexicon/>.

Corney, DPA, Buxton, BF, Langdon, WB & Jones, DT 2004, 'BioRAT: extracting biological information from full-length papers', Bioinformatics, vol. 20, no. 17, November 22, 2004, pp. 3206-3213.

Crim, J, McDonald, R & Pereira, F 2005, 'Automatically annotating documents with normalized gene lists', BMC Bioinformatics, vol. 6, no. Suppl 1, p. S13.

Gerold, S, Simon, C & Fabio, R 2011, 'Detection of interaction articles and experimental methods in biomedical literature', BMC Bioinformatics, vol. 12, no. Suppl 8, p. S13.

Jensen, LJ, Saric, J & Bork, P 2006, 'Literature mining for the biologist: from information retrieval to biological discovery', Nat Rev Genet, vol. 7, no. 2, pp. 119-129.

Kozomara, A & Griffiths-Jones, S 2011, 'miRBase: integrating microRNA annotation and deep-sequencing data', Nucleic Acids Research, vol. 39, no. suppl 1, pp. D152-D157.

Liu, J, Gao, J, Du, Y, Li, Z, Ren, Y, Gu, J, Wang, X, Gong, Y, Wang, W & Kong, X 2012, 'Combination of plasma microRNAs with serum CA19-9 for early detection of pancreatic cancer', Int J Cancer, vol. 131, no. 3, Aug 1, pp. 683-691.

Roads, RE 2010, Progress in Molecular and Subcellular Biology, Springer, Shreveport LA.

Selth, LA, Townley, S, Gillis, JL, Ochnik, AM, Murti, K, Macfarlane, RJ, Chi, KN, Marshall, VR, Tilley, WD & Butler, LM 2012, 'Discovery of circulating microRNAs associated with human prostate cancer using a mouse model of disease', Int J Cancer, vol. 131, no. 3, Aug 1, pp. 652-661.

Silberztein, M 2000, 'INTEX: an FST toolbox', Theoretical Computer Science, vol. 231, no. 1, pp. 33-46.

Sohngen, C, Chang, A & Schomburg, D 2011, 'Development of a classification scheme for disease-related enzyme information', BMC Bioinformatics, vol. 12, no. 1, p. 329.

Sun, C-J, Wang, X-L, Lin, L & Liu, Y-C 2009, 'A Multi-level Disambiguation Framework for Gene Name Normalization', Acta Automatica Sinica, vol. 35, no. 2, pp. 193-197.

Tan Gana, NH, Victoriano, AFB & Okamoto, T 2012, 'Evaluation of online miRNA resources for biomedical applications', Genes to Cells, vol. 17, no. 1, pp. 11-27.
Thompson, P, McNaught, J, Montemagni, S, Calzolari, N, del Gratta, R, Lee, V, Marchi, S, Monachini, M, Pezik, P, Quochi, V, Rupp, C, Sasaki, Y, Venturi, G, Rebholz-Schuhmann, D & Ananiadou, S 2011, 'The BioLexicon: a large-scale terminological resource for biomedical text mining', BMC Bioinformatics, vol. 12, no. 1, p. 397.

Wei, Q & Collier, N 2011, 'Towards classifying species in systems biology papers using text mining', BMC Research Notes, vol. 4, no. 1, p. 32.

Xia, N, Lin, H, Yang, Z & Li, Y 2011, 'Combining multiple disambiguation methods for gene mention normalization', Expert Systems With Applications, vol. 38, no. 7, pp. 7994-7999.

Xie, B 2010, 'miRCancer: a microRNA-Cancer Association Database and Toolkit Based on Text Mining'.

Yang, Z, Lin, H & Wu, B 2009, 'BioPPIExtractor: A protein–protein interaction extraction system for biomedical literature', Expert Systems With Applications, vol. 36, no. 2, pp. 2228-2233.

Zhang, J, Zhao, H, Gao, Y & Zhang, W 2012, 'Secretory miRNAs as novel cancer biomarkers', Biochim Biophys Acta, vol. 1826, no. 1, Aug, pp. 32-43.