Samuel O’Malley
oymsj001@mymail.unisa.edu.au
Supervisor: Prof. Jiuyong Li, jiuyong.li@unisa.edu.au
Associate Supervisor: Dr. Jixue Liu, jixue.liu@unisa.edu.au

Copyright Notice

Do not remove this notice.
COMMONWEALTH OF AUSTRALIA
Copyright Regulations 1969
WARNING
This material has been produced and communicated to you by or on behalf of the University of South Australia pursuant to Part VB of the Copyright Act 1968 (the Act). The material in this communication may be subject to copyright under the Act. Any further reproduction or communication of this material by you may be the subject of copyright protection under the Act.
Do not remove this notice.

Overview

• Motivation
• Background
• Research Question
• Contribution
• Implementation
• Examples
• References

DO NOT REMOVE THIS NOTICE. Reproduced and communicated on behalf of the University of South Australia pursuant to Part VB of the copyright Act 1968 (the Act) or with permission of the copyright owner on (DATE) Any further reproduction or communication of this material by you may be the subject of copyright protection under the Act. DO NOT REMOVE THIS NOTICE.

Motivation

• microRNA research is increasing exponentially
• Databases cannot be curated fast enough
• A researcher cannot be “current” in the field of microRNA
• Automatic curation tools exist for other areas of biomedical research
microRNA – What are they?

• microRNA are small non-coding lengths of RNA
• They inhibit the creation of proteins
• Video from rossettagenomics.com

miRBase

• A database of microRNA sequences and annotations.
• Human microRNA 150 is also called MIR150, hsa-mir-150, MIRN150 etc.
• miRBase provides the human-readable name as well as a machine-readable ID
• Example: hsa-mir-150 has an ID of MI0000479 and HGNC:MIR150

A. Kozomara and S. Griffiths-Jones, “miRBase: integrating microRNA annotation and deep-sequencing data”, Nucleic Acids Research, vol. 39, no. suppl 1, pp. D152-D157, 2011.

Disease Related Enzymes

• Finds occurrences of an enzyme and a disease mentioned in the same sentence
• Classifies their relationship using a Support Vector Machine
• Uses a training set of pre-classified sentences
• Example: “Chronic granulomatous disease (CGD) results from mutations of phagocyte NADPH oxidase.”
  ◦ Classified as “Causal Interaction”

C. Sohngen, A. Chang, and D. Schomburg, “Development of a classification scheme for disease-related enzyme information”, BMC Bioinformatics, vol. 12, no. 1, p. 329, 2011.

Gene Name Disambiguation

• Genes can have many different names or variations
• Humans can understand “context”; for machines this is a challenge
• Example: Five sentences in the paper refer to different genes. Four of these refer to a human gene; the fifth is ambiguous and could be either a human gene or a fly gene.

C.J. Sun, X.L. Wang, L. Lin, and Y.-C. Liu, “A multi-level disambiguation framework for gene name normalization”, Acta Automatica Sinica, vol. 35, no. 2, pp. 193-197, 2009.

LINNAEUS – Species Identification

• LINNAEUS uses a set of simple regular expressions to find indicators of what species a text is referring to.
• In my research I use a modified list to incorporate the specific microRNA domain knowledge.
• Example: these words can all be used when talking about humans (ID: 9606):
  [hH]umans? [pP]atients? [pP]articipants? [wW]oman [wW]omen [mM]en [gG]irls? [bB]oys? [pP]eoples? [Cc]hild(ren)? [Ii]nfants? [Pp]ersons?
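The indicator list above can be turned into a small matcher. The following is a minimal sketch in Python (the slides themselves only show Perl-style patterns), assuming one combined pattern per species ID; the names `HUMAN_PATTERNS`, `SPECIES_PATTERNS` and `find_species_mentions` are hypothetical and are not part of LINNAEUS.

```python
import re

# Indicator patterns copied from the slide; 9606 is the ID given there for humans.
HUMAN_PATTERNS = [
    r"[hH]umans?", r"[pP]atients?", r"[pP]articipants?", r"[wW]oman",
    r"[wW]omen", r"[mM]en", r"[gG]irls?", r"[bB]oys?", r"[pP]eoples?",
    r"[Cc]hild(ren)?", r"[Ii]nfants?", r"[Pp]ersons?",
]

# Hypothetical lookup table: species ID -> compiled alternation of its indicators.
SPECIES_PATTERNS = {9606: re.compile("|".join(HUMAN_PATTERNS))}


def find_species_mentions(text):
    """Return (species_id, word_index, word) for every indicator word in text."""
    mentions = []
    for index, word in enumerate(text.split()):
        token = word.strip(".,;:()")  # drop surrounding punctuation
        for species_id, pattern in SPECIES_PATTERNS.items():
            # fullmatch avoids hits inside longer words, e.g. "men" in "mentioned".
            if pattern.fullmatch(token):
                mentions.append((species_id, index, token))
    return mentions


if __name__ == "__main__":
    sample = "We recruited 40 patients and 12 healthy women. Mice were not used."
    print(find_species_mentions(sample))
```

Recording the word index alongside the species ID mirrors the idea of storing where in the text each indicator occurs, so that it can later be compared with the position of a microRNA mention.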
M. Gerner, G. Nenadic, and C. Bergman, “LINNAEUS: A species name identification system for biomedical literature”, BMC Bioinformatics, vol. 11, no. 1, p. 85, 2010.

Research Question

What is the most suitable technique for discovering and classifying microRNA-gene relationships from biomedical literature?

Contribution

1. A normalisation and disambiguation technique for gene names will be adapted to fit the unique microRNA ontology.
2. Automatic curation of microRNA and gene relationships in biomedical literature. (Not completed yet)
MySQL Database Backend

Table Name          Columns
Abstracts           ID, Abstract, Title
Stop_Abstracts      ID, Abstract, Title
Species             ID, Name
Micro_Prefix        Prefix, Species_ID
Species_Mentions    Abstract_ID, Species_ID, Sentence_Num, Word_Num
MicroRNA_Mentions   Abstract_ID, Micro_ID, Sentence_Num, Word_Num

Full Example – Original Abstract

microRNA profiling in Epstein-Barr virus-associated B-cell lymphoma.

The Epstein-Barr virus (EBV) is an oncogenic human Herpes virus found in ~15% of diffuse large B-cell lymphoma (DLBCL). EBV encodes miRNAs and induces changes in the cellular miRNA profile of infected cells. MiRNAs are small, non-coding RNAs of ~19-26 nt which suppress protein synthesis by inducing translational arrest or mRNA degradation. Here, we report a comprehensive miRNA-profiling study and show that hsa-miR-424, -223, 199a-3p, -199a-5p, -27b, -378, -26b, -23a, -23b were upregulated and hsa-miR-155, -20b, -221, -151-3p, -222, -29b/c, -106a were downregulated more than 2-fold due to EBV-infection of DLBCL. All known EBV miRNAs with the exception of the BHRF1 cluster as well as EBV-miR-BART15 and -20 were present. A computational analysis indicated potential targets such as c-MYB, LATS2, c-SKI and SIAH1. We show that c-MYB is targeted by miR-155 and miR-424, that the tumor suppressor SIAH1 is targeted by miR-424, and that c-SKI is potentially regulated by miR-155. Downregulation of SIAH1 protein in DLBCL was demonstrated by immunohistochemistry. The inhibition of SIAH1 is in line with the notion that EBV impedes various pro-apoptotic pathways during tumorigenesis. The down-modulation of the oncogenic c-MYB protein, although counterintuitive, might be explained by its tight regulation in developmental processes.
Full Example – Stopwords Removed

Epstein-Barr virus EBV oncogenic human Herpes virus found 15 diffuse large B-cell lymphoma DLBCL …

MiRNAs small non-coding RNAs 19-26 nt suppress protein synthesis inducing translational arrest mRNA degradation . we report comprehensive miRNA-profiling study show hsa-miR-424 223 199a-3p 199a-5p 27b 378 26b 23a 23b upregulated hsa-miR-155 20b 221 151-3p 222 29b c 106a downregulated 2-fold due EBV-infection DLBCL …

Full Example – Stopwords Removed

• First replace all full stops with “ . ” and remove the final full stop:
  ◦ $abstract =~ s/([^\s])\.\s+/$1 . /gm;
  ◦ $abstract =~ s/([^\s])\.\s*\Z/$1/gm;
  ◦ “Ph.D” will not be affected by this
• Then split the words into the following chunks:
  ◦ while ($abstract =~ /(([a-zA-Z0-9']+-)*[a-zA-Z0-9'\.]+)/g) { ... }   # $1 holds the current word
  ◦ And remove the word if it matches Lingua’s stopword list (James 2002).
  ◦ Essentially this algorithm splits the text into words but keeps hyphens, apostrophes and numbers.
  ◦ Most stopword algorithms remove numbers and hyphens, but they are essential for microRNA detection.
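The two steps above can be sketched end to end. This is a minimal Python translation of the Perl substitutions shown on the slide, not the original implementation; the tiny `STOPWORDS` set is a stand-in assumption for Lingua’s much longer list, and the function name `remove_stopwords` is hypothetical.

```python
import re

# Stand-in stopword set (assumption); the slides use Lingua's stopword list.
STOPWORDS = {"a", "an", "and", "are", "by", "in", "is", "of", "or",
             "that", "the", "to", "were", "which"}

# Same word pattern as on the slide: hyphenated chunks, apostrophes, digits, dots.
WORD_RE = re.compile(r"(?:[a-zA-Z0-9']+-)*[a-zA-Z0-9'.]+")


def remove_stopwords(abstract):
    """Tokenise an abstract, keeping hyphens and numbers, and drop stopwords."""
    # Sentence-ending full stops become separate " . " tokens.
    text = re.sub(r"(\S)\.\s+", r"\1 . ", abstract)
    # The final full stop of the abstract is removed entirely.
    text = re.sub(r"(\S)\.\s*\Z", r"\1", text)
    tokens = WORD_RE.findall(text)
    return [tok for tok in tokens if tok.lower() not in STOPWORDS]


if __name__ == "__main__":
    sample = ("MiRNAs are small, non-coding RNAs of ~19-26 nt. "
              "hsa-miR-424, -223 were upregulated.")
    print(" ".join(remove_stopwords(sample)))
```

Note that the shorthand “-223” comes out as the bare token “223”: the leading hyphen is dropped because the pattern only keeps hyphens that follow a letter or digit, which is why the later detection step has to re-attach such numbers to a prefix.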
Full Example – Analysis

• These two lines from the text specify 17 different microRNAs:
  ◦ hsa-miR-424 223 199a-3p 199a-5p 27b 378 26b 23a 23b
  ◦ hsa-miR-155 20b 221 151-3p 222 29b c 106a
• The “hsa-” prefix confirms that this is a human sequence.
• If competing species appear in the same document, we use a distance function to decide which one to use, and keep the others as backups.

Full Example – Detection

• This regular expression captures all microRNA written in the standard format:
  ◦ m/^((([a-zA-Z]+-)?(mir|let)-?)[\d][\d\-a-z]*$)/mi
• For example:
  ◦ hsa-miR-27b
  ◦ hsa-miR-29b-1
  ◦ let-7b
  ◦ MIR298A
• It does not capture the following string:
  ◦ hsa-miR-424 -223
  ◦ It would only see the first microRNA, but miss 223
  ◦ My algorithm appends each number to the last seen microRNA prefix if the number occurs immediately after a valid microRNA
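The prefix-appending rule can be sketched as follows. This is a Python translation under stated assumptions: `MICRO_RE` is the slide’s pattern carried over to Python’s `re`, while `SHORT_RE` (what counts as a shorthand number) and the function name `expand_mentions` are hypothetical choices made for illustration.

```python
import re

# The detection pattern from the slide; group 2 is the prefix, e.g. "hsa-miR-".
MICRO_RE = re.compile(r"^((([a-zA-Z]+-)?(mir|let)-?)[\d][\d\-a-z]*$)", re.IGNORECASE)

# Assumed shape of a shorthand token such as "223", "27b" or "199a-3p".
SHORT_RE = re.compile(r"^\d[\da-z\-]*$", re.IGNORECASE)


def expand_mentions(tokens):
    """Return full microRNA names, expanding shorthand that follows a valid one."""
    found = []
    prefix = None  # prefix of the last valid microRNA in the current run
    for tok in tokens:
        match = MICRO_RE.match(tok)
        if match:
            prefix = match.group(2)
            found.append(tok)
        elif prefix is not None and SHORT_RE.match(tok):
            # Shorthand immediately after a microRNA (or after other shorthand).
            found.append(prefix + tok)
        else:
            # Any other word ends the run, so later bare numbers are ignored.
            prefix = None
    return found


if __name__ == "__main__":
    words = "show hsa-miR-424 223 199a-3p upregulated 15 hsa-miR-155 20b".split()
    print(expand_mentions(words))
```

Letter-only shorthand such as the “c” left over from “29b/c” is not handled by this sketch; it would need an extra rule that swaps the trailing letter of the previous name.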
Full Example – Real Detection

Abstract_ID  Micro_ID   Sentence  Word  Micro_Name
21062812     MI0000079  3         13    hsa-mir-23a
21062812     MI0000084  3         12    hsa-mir-26b
21062812     MI0000298  3         18    hsa-mir-221
21062812     MI0000299  3         20    hsa-mir-222
21062812     MI0000300  3         7     hsa-mir-223
21062812     MI0000439  3         14    hsa-mir-23b
21062812     MI0000440  3         10    hsa-mir-27b
21062812     MI0000113  3         11    hsa-mir-106a
21062812     MI0000681  3         16    hsa-mir-155
21062812     MI0001446  3         6     hsa-mir-424
21062812     MI0000105  3         8     hsa-mir-29b-1
21062812     MI0000105  3         8     hsa-mir-29b-2
21062812     MI0000735  3         9     hsa-mir-29c
21062812     MI0001519  3         17    hsa-mir-20b

Missing Entries:

mir-199a-3p   New Terminology
mir-199a-5p   New Terminology
mir-378       Ambiguous Entries
mir-151-3p    New Terminology

Full Example – Review

To review the effectiveness of this algorithm:

1. We will manually annotate a random selection of abstracts with correct microRNA information.
   ◦ Pros: Accurate, wide selection of different types of writing
   ◦ Cons: Slow and laborious
2. We will do a reverse lookup from miRBase (which references PubMed IDs) and assume that the referenced abstracts contain the microRNA from miRBase.
   ◦ Pros: Fast and automated
   ◦ Cons: The microRNA might not be mentioned at all in the abstract (False Negatives)
   ◦ Cons: The microRNA are likely to be specified with their fully qualified names and may not fully represent the target population.
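Either review method ends with a comparison between what the algorithm detected and a reference set for the same abstract. The slides do not say which measure will be used; the sketch below assumes precision and recall, and the function name `score_detections` and the toy names in the usage example are hypothetical, not results from this work.

```python
def score_detections(detected, annotated):
    """Compare detected microRNA names with a reference set for one abstract."""
    detected, annotated = set(detected), set(annotated)
    correct = detected & annotated
    # Guard against empty sets so an abstract with no mentions does not divide by zero.
    precision = len(correct) / len(detected) if detected else 0.0
    recall = len(correct) / len(annotated) if annotated else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "missed": sorted(annotated - detected),    # in the reference, not detected
        "spurious": sorted(detected - annotated),  # detected, not in the reference
    }


if __name__ == "__main__":
    # Toy data for illustration only.
    detected = ["hsa-mir-155", "hsa-mir-424", "hsa-mir-223"]
    annotated = ["hsa-mir-155", "hsa-mir-424", "hsa-mir-223", "hsa-mir-378"]
    print(score_detections(detected, annotated))
```

With the reverse-lookup method, the “missed” list would also include microRNA that are simply absent from the abstract text, which is the false-negative risk noted in the cons above.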
Some Statistics

• There are 18,314 entries in my Abstracts table
  ◦ Of those, there are 17,231 with usable abstracts
• 48% of these abstracts contain species indicators.
• When the abstracts finished downloading (after 2 hours) there were already 16 new abstracts available.
• My database has 21,222 unique microRNA listed from miRBase.
• There are 62,036 microRNA with no ambiguity in the abstracts.
• 53% of total detections were improved by the species detection.

References

J. Imig, N. Motsch, J. Y. Zhu, S. Barth, M. Okoniewski, T. Reineke, M. Tinguely, A. Faggioni, P. Trivedi, G. Meister, C. Renner, and F. A. Grasser, “microRNA profiling in Epstein-Barr virus-associated B-cell lymphoma”, Nucleic Acids Research, vol. 39, no. 5, pp. 1880-1893, 2011.

M. Gerner, G. Nenadic, and C. Bergman, “LINNAEUS: A species name identification system for biomedical literature”, BMC Bioinformatics, vol. 11, no. 1, p. 85, 2010.

L. J. Jensen, J. Saric, and P. Bork, “Literature mining for the biologist: from information retrieval to biological discovery”, Nat Rev Genet, vol. 7, no. 2, pp. 119-129, 2006.

A. Kozomara and S. Griffiths-Jones, “miRBase: integrating microRNA annotation and deep-sequencing data”, Nucleic Acids Research, vol. 39, no. suppl 1, pp. D152-D157, 2011.

C. Sohngen, A. Chang, and D. Schomburg, “Development of a classification scheme for disease-related enzyme information”, BMC Bioinformatics, vol. 12, no. 1, p. 329, 2011.

C.J. Sun, X.L. Wang, L. Lin, and Y.-C. Liu, “A multi-level disambiguation framework for gene name normalization”, Acta Automatica Sinica, vol. 35, no. 2, pp. 193-197, 2009.

H. C. Wang, Y. H. Chen, H. Y. Kao, and S. J. Tsai, “Inference of transcriptional regulatory network by bootstrapping patterns”, Bioinformatics (Oxford, England), vol. 27, no. 10, pp. 1422-1428, 2011.

Questions

Any Questions?