PPTX Version of Presentation Slides

Samuel O’Malley
oymsj001@mymail.unisa.edu.au
Supervisor: Prof. Jiuyong Li
jiuyong.li@unisa.edu.au
Associate Supervisor: Dr. Jixue Liu
jixue.liu@unisa.edu.au
Motivation Background Research Question Contribution Implementation References
Copyright Notice
Do not remove this notice.
COMMONWEALTH OF AUSTRALIA
Copyright Regulations 1969
WARNING
This material has been produced and communicated to you by or on
behalf of the University of South Australia pursuant to Part VB of the
Copyright Act 1968 (the Act).
The material in this communication may be subject to copyright under the
Act. Any further reproduction or communication of this material by you
may be the subject of copyright protection under the Act.
Do not remove this notice.
Overview

Motivation
Background
Research Question
Contribution
Implementation
Examples
References
DO NOT REMOVE THIS NOTICE. Reproduced and communicated on behalf of the University of South Australia pursuant to Part VB of the copyright Act 1968 (the Act) or with permission of the
copyright owner on (DATE) Any further reproduction or communication of this material by you may be the subject of copyright protection under the Act. DO NOT REMOVE THIS NOTICE.
Motivation




microRNA research is increasing exponentially
Databases cannot be curated fast enough
A researcher cannot stay “current” in the field of microRNA
Automatic curation tools exist for other areas of biomedical research
microRNA – What are they?


microRNA are small, non-coding lengths of RNA
They inhibit the creation of proteins
Video from rosettagenomics.com
miRBase




A database of microRNA sequences and annotations.
Human microRNA 150 is also called MIR150, hsa-mir-150, MIRN150 etc.
miRBase provides the human-readable name as well as a machine-readable ID
Example:
hsa-mir-150 has an ID of MI0000479 and HGNC:MIR150
A. Kozomara and S. Griffiths-Jones, “miRBase: integrating microRNA annotation and deep-sequencing data”, Nucleic Acids Research, vol. 39, no. suppl 1, pp. D152-D157, 2011.
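The name-to-ID mapping described above can be sketched as a simple lookup. This is only an illustration in Python: the `ALIASES` table and the `to_mirbase_id` helper are hypothetical names, and the table holds only the aliases shown on this slide rather than anything loaded from miRBase.

```python
# Toy alias table built only from the names on this slide (assumption);
# a real table would be populated from miRBase itself.
ALIASES = {
    "hsa-mir-150": "MI0000479",
    "mir150": "MI0000479",
    "mirn150": "MI0000479",
}


def to_mirbase_id(name):
    """Return the miRBase ID for a known alias, or None if it is unknown."""
    # Lower-casing makes "MIR150" and "hsa-miR-150" hit the same entries.
    return ALIASES.get(name.strip().lower())
```

For instance, `to_mirbase_id("MIRN150")` and `to_mirbase_id("hsa-miR-150")` resolve to the same ID, while an unlisted name returns `None`.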
Disease Related Enzymes

Finds occurrences of an enzyme and a disease mentioned in the same sentence
Classifies their relationship using a Support Vector Machine
Uses a training set of pre-classified sentences.

Example:




“Chronic granulomatous disease (CGD) results from mutations of phagocyte NADPH oxidase.”
Classified as “Causal Interaction”
C. Sohngen, A. Chang, and D. Schomburg, “Development of a classification scheme for disease-related enzyme information”, BMC Bioinformatics, vol. 12, no. 1, p. 329, 2011.
Gene Name Disambiguation

Genes can have many different names or variations
Humans can understand “context”; for machines this is a challenge

Example:



Five sentences in the paper refer to different genes.
Four of these refer to a human gene; the fifth is ambiguous and could be either a human gene or a fly gene.
C.J. Sun, X.L. Wang, L. Lin, and Y.-C. Liu, “A multi-level disambiguation framework for gene name normalization”, Acta Automatica Sinica, vol. 35, no. 2, pp. 193-197, 2009.
LINNAEUS – Species Identification



LINNAEUS uses a set of simple regular expressions to find indicators of which species a text is referring to.
In my research I use a modified list to incorporate the specific microRNA domain knowledge.
Example: these words can all be used when talking about humans (ID: 9606):

[hH]umans? [pP]atients? [pP]articipants? [wW]oman
[wW]omen [mM]en [gG]irls? [bB]oys? [pP]eoples?
[Cc]hild(ren)? [Ii]nfants? [Pp]ersons?
Gerner, M, Nenadic, G & Bergman, C 2010, 'LINNAEUS: A species name identification system
for biomedical literature', BMC Bioinformatics, vol. 11, no. 1, p. 85.
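A minimal sketch of how the human indicator patterns above might be applied, written in Python. The patterns are copied from the slide; the function name `find_species_mentions`, the punctuation stripping, and the 1-based sentence and word numbering are assumptions made for illustration, not part of LINNAEUS.

```python
import re

HUMAN_TAXON_ID = 9606

# Indicator patterns copied from the slide.
HUMAN_PATTERNS = [
    r"[hH]umans?", r"[pP]atients?", r"[pP]articipants?", r"[wW]oman",
    r"[wW]omen", r"[mM]en", r"[gG]irls?", r"[bB]oys?", r"[pP]eoples?",
    r"[Cc]hild(ren)?", r"[Ii]nfants?", r"[Pp]ersons?",
]
HUMAN_RE = re.compile("|".join(HUMAN_PATTERNS))


def find_species_mentions(sentences):
    """Return (species_id, sentence_num, word_num) for each human indicator."""
    mentions = []
    for s_num, sentence in enumerate(sentences, start=1):
        for w_num, word in enumerate(sentence.split(), start=1):
            # Strip surrounding punctuation so "children." still matches.
            if HUMAN_RE.fullmatch(word.strip(".,;:()")):
                mentions.append((HUMAN_TAXON_ID, s_num, w_num))
    return mentions
```

Each returned tuple has the same shape as a Species_Mentions row (species, sentence number, word number) once an abstract ID is attached.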
Research Question
What is the most suitable technique for discovering and classifying microRNA-gene relationships from biomedical literature?
Contribution
1. A normalisation and disambiguation technique for gene names will be adapted to fit the unique microRNA ontology.
2. Automatic curation of microRNA and gene relationships in biomedical literature. (Not completed yet)
MySQL Database Backend
Table Name          Columns
Abstracts           ID, Abstract, Title
Stop_Abstracts      ID, Abstract, Title
Species             ID, Name
Micro_Prefix        Prefix, Species_ID
Species_Mentions    Abstract_ID, Species_ID, Sentence_Num, Word_Num
MicroRNA_Mentions   Abstract_ID, Micro_ID, Sentence_Num, Word_Num
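The tables above could be created roughly as follows. This is a sketch under stated assumptions: it uses Python's built-in `sqlite3` with an in-memory database in place of MySQL so that it runs without a server, and every column type and key is a guess, since the slide lists only table and column names.

```python
import sqlite3

# Table and column names follow the slide; the types are assumptions.
SCHEMA = """
CREATE TABLE Abstracts (ID INTEGER PRIMARY KEY, Abstract TEXT, Title TEXT);
CREATE TABLE Stop_Abstracts (ID INTEGER PRIMARY KEY, Abstract TEXT, Title TEXT);
CREATE TABLE Species (ID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Micro_Prefix (Prefix TEXT, Species_ID INTEGER);
CREATE TABLE Species_Mentions (
    Abstract_ID INTEGER, Species_ID INTEGER,
    Sentence_Num INTEGER, Word_Num INTEGER);
CREATE TABLE MicroRNA_Mentions (
    Abstract_ID INTEGER, Micro_ID TEXT,
    Sentence_Num INTEGER, Word_Num INTEGER);
"""


def build_database():
    """Create an in-memory database containing the six tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


conn = build_database()
# One detection row, using values that appear later in the slides.
conn.execute(
    "INSERT INTO MicroRNA_Mentions VALUES (?, ?, ?, ?)",
    (21062812, "MI0001446", 3, 6),
)
rows = conn.execute(
    "SELECT Micro_ID, Word_Num FROM MicroRNA_Mentions WHERE Abstract_ID = ?",
    (21062812,),
).fetchall()
```

Storing sentence and word positions for both species and microRNA mentions is what allows the two kinds of mention to be compared by position later on.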
Full Example – Original Abstract


microRNA profiling in Epstein-Barr virus-associated B-cell lymphoma.
The Epstein-Barr virus (EBV) is an oncogenic human Herpes virus found in
~15% of diffuse large B-cell lymphoma (DLBCL). EBV encodes miRNAs and
induces changes in the cellular miRNA profile of infected cells. MiRNAs are
small, non-coding RNAs of ~19-26 nt which suppress protein synthesis by
inducing translational arrest or mRNA degradation. Here, we report a
comprehensive miRNA-profiling study and show that hsa-miR-424, -223,
-199a-3p, -199a-5p, -27b, -378, -26b, -23a, -23b were upregulated and
hsa-miR-155, -20b, -221, -151-3p, -222, -29b/c, -106a were
downregulated more than 2-fold due to EBV-infection of DLBCL. All known
EBV miRNAs with the exception of the BHRF1 cluster as well as
EBV-miR-BART15 and -20 were present. A computational analysis indicated
potential targets such as c-MYB, LATS2, c-SKI and SIAH1. We show that c-MYB is
targeted by miR-155 and miR-424, that the tumor suppressor SIAH1 is
targeted by miR-424, and that c-SKI is potentially regulated by miR-155.
Downregulation of SIAH1 protein in DLBCL was demonstrated by
immunohistochemistry. The inhibition of SIAH1 is in line with the notion that
EBV impedes various pro-apoptotic pathways during tumorigenesis. The
down-modulation of the oncogenic c-MYB protein, although counterintuitive,
might be explained by its tight regulation in developmental processes.
Full Example – Stopwords Removed




Epstein-Barr virus EBV oncogenic human Herpes virus found 15 diffuse large B-cell lymphoma DLBCL
…
MiRNAs small non-coding RNAs 19-26 nt suppress protein synthesis inducing translational arrest mRNA
degradation . we report comprehensive miRNA-profiling study show hsa-miR-424 223 199a-3p 199a-5p
27b 378 26b 23a 23b upregulated hsa-miR-155 20b 221 151-3p 222 29b c 106a downregulated 2-fold
due EBV-infection DLBCL
…
Full Example – Stopwords Removed

First replace all full stops with “ . ” and remove the final full stop:
◦ $abstract =~ s/([^\s])\.\s+/$1 . /gm;
◦ $abstract =~ s/([^\s])\.\s*\Z/$1/gm;
◦ “Ph.D” will not be affected by this, because its full stop is not followed by whitespace

Then split the text into words using the following pattern:
◦ my @words = $abstract =~ /((?:[a-zA-Z0-9']+-)*[a-zA-Z0-9'\.]+)/g;
◦ Remove each word that matches Lingua’s stopword list (James 2002).
◦ Essentially this algorithm splits the text into words but still keeps hyphens, apostrophes and numbers.
◦ Most stopword algorithms remove numbers and hyphens, but they are essential for microRNA detection.
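The two steps above can be sketched end to end in Python. The regular expressions are ports of the Perl ones on this slide; the tiny `STOPWORDS` set is a stand-in chosen for the example (the slides use Lingua's stopword list), and `strip_stopwords` is a hypothetical name.

```python
import re

# Stand-in stopword set for illustration only; the slides use Lingua's list.
STOPWORDS = {"the", "is", "an", "in", "of", "and", "are", "which", "by",
             "or", "here", "a", "that", "were", "more", "than", "to"}

# Port of the slide's word pattern: keeps hyphens, apostrophes and digits.
WORD_RE = re.compile(r"(?:[a-zA-Z0-9']+-)*[a-zA-Z0-9'.]+")


def strip_stopwords(abstract):
    """Tokenise an abstract and drop stopwords, keeping " . " as a token."""
    # Space out sentence-ending full stops so each one becomes its own token.
    text = re.sub(r"(\S)\.\s+", r"\1 . ", abstract)
    # Remove the final full stop of the abstract.
    text = re.sub(r"(\S)\.\s*\Z", r"\1", text)
    words = WORD_RE.findall(text)
    return [w for w in words if w.lower() not in STOPWORDS]
```

Because the word pattern allows hyphens and digits, a token such as `hsa-miR-424` survives intact, and a bare continuation such as `-223` is reduced to `223` rather than being discarded.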
Full Example – Analysis





These two lines from the text specify 17 different microRNAs:
hsa-miR-424 223 199a-3p 199a-5p 27b 378 26b 23a 23b
hsa-miR-155 20b 221 151-3p 222 29b c 106a
The “hsa-” prefix confirms to us that this is a human sequence.
If there are competing species in the same document we use a distance function to calculate which one to use, and the others we use as backups.
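The slides do not define the distance function, so the following Python sketch assumes one: species are ranked by how close their nearest mention is to the microRNA mention, comparing sentence distance first and word distance second. The function name `rank_species` and the tuple layouts are hypothetical.

```python
def rank_species(micro_pos, species_mentions):
    """Rank species IDs by the distance of their nearest mention.

    micro_pos is (sentence_num, word_num) of the microRNA mention;
    species_mentions is a list of (species_id, sentence_num, word_num).
    The first ID returned is the preferred species, the rest are backups.
    """
    best = {}
    for species_id, s_num, w_num in species_mentions:
        # Assumed distance: sentence gap first, word gap as a tie-breaker.
        d = (abs(s_num - micro_pos[0]), abs(w_num - micro_pos[1]))
        if species_id not in best or d < best[species_id]:
            best[species_id] = d
    return sorted(best, key=best.get)
```

With a microRNA at sentence 3, word 6, a mouse mention (ID 10090) in the same sentence would outrank a human mention (ID 9606) one sentence away, and human would be kept as the backup.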
Full Example – Detection


This regular expression captures all microRNA written in the standard format:
◦ m/^((([a-zA-Z]+-)?(mir|let)-?)[\d][\d\-a-z]*$)/mi
For example:
◦ hsa-miR-27b
◦ hsa-miR-29b-1
◦ let-7b
◦ MIR298A
It does not capture the following string:
◦ hsa-miR-424 -223
◦ It would only see the first microRNA, but miss 223
◦ My algorithm appends each number to the last seen microRNA prefix if the number occurs immediately after a valid microRNA.
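The prefix-carrying idea can be sketched in Python as below. `MICRO_RE` is a port of the slide's pattern; `SUFFIX_RE`, the function name `detect_micrornas`, and the handling of a lone letter such as the "c" left over from "-29b/c" are assumptions added for illustration and may differ from the actual algorithm.

```python
import re

# Port of the slide's pattern; group 2 is the reusable prefix, e.g. "hsa-miR-".
MICRO_RE = re.compile(r"^((([a-zA-Z]+-)?(mir|let)-?)[\d][\d\-a-z]*$)", re.IGNORECASE)
# Assumed shape of a bare continuation such as "223" or "199a-3p".
SUFFIX_RE = re.compile(r"^\d[\d\-a-z]*$", re.IGNORECASE)


def detect_micrornas(tokens):
    """Expand tokens like 'hsa-miR-424 223' into full microRNA names."""
    found = []
    prefix = None
    for token in tokens:
        match = MICRO_RE.match(token)
        if match:
            prefix = match.group(2)
            found.append(token)
        elif prefix is not None and SUFFIX_RE.match(token):
            # A number directly after a microRNA inherits the last prefix.
            found.append(prefix + token)
        elif (prefix is not None and len(token) == 1 and token.isalpha()
              and found[-1][-1].isalpha()):
            # Assumption: a lone letter swaps the previous variant letter,
            # so "29b" followed by "c" yields "...29c".
            found.append(found[-1][:-1] + token)
        else:
            # Any other word ends the run, so "2-fold" is not picked up.
            prefix = None
    return found
```

Applied to the stopword-stripped sentence from the full example, this sketch yields the 17 names counted on the analysis slide.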
Full Example – Real Detection
Abstract_ID   Micro_ID    Sentence   Word   Micro_Name
21062812      MI0000079   3          13     hsa-mir-23a
21062812      MI0000084   3          12     hsa-mir-26b
21062812      MI0000298   3          18     hsa-mir-221
21062812      MI0000299   3          20     hsa-mir-222
21062812      MI0000300   3          7      hsa-mir-223
21062812      MI0000439   3          14     hsa-mir-23b
21062812      MI0000440   3          10     hsa-mir-27b
21062812      MI0000113   3          11     hsa-mir-106a
21062812      MI0000681   3          16     hsa-mir-155
21062812      MI0001446   3          6      hsa-mir-424
21062812      MI0000105   3          8      hsa-mir-29b-1
21062812      MI0000105   3          8      hsa-mir-29b-2
21062812      MI0000735   3          9      hsa-mir-29c
21062812      MI0001519   3          17     hsa-mir-20b

Missing Entries:
mir-199a-3p   New Terminology
mir-199a-5p   New Terminology
mir-378       Ambiguous Entries
mir-151-3p    New Terminology
Full Example – Review

To review the effectiveness of this algorithm:
1. We will manually annotate a random selection of abstracts with correct microRNA information.
◦ Pros: accurate; covers a wide selection of different types of writing
◦ Cons: slow and laborious
2. We will do a reverse lookup from miRBase (which references PubMed IDs) and assume that the referenced abstracts contain the microRNA from miRBase.
◦ Pros: fast and automated
◦ Cons: the microRNA might not be mentioned at all in the abstract (false negatives)
◦ Cons: the microRNA are likely to be specified with their fully qualified names and so may not represent the target population fully
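Either review strategy produces a reference set of microRNA IDs per abstract that the detections can be scored against. The slides do not name a metric, so the Python sketch below assumes precision and recall; the function name and the example ID sets are hypothetical.

```python
def precision_recall(detected, annotated):
    """Compare detected microRNA IDs with a reference set for one abstract."""
    detected, annotated = set(detected), set(annotated)
    true_positives = len(detected & annotated)
    # Guard against empty sets so an empty abstract does not divide by zero.
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(annotated) if annotated else 0.0
    return precision, recall


# Made-up example: three detections, four reference annotations, two shared.
example_detected = {"MI0000079", "MI0000084", "MI0000105"}
example_annotated = {"MI0000079", "MI0000084", "MI0000299", "MI0000300"}
example_scores = precision_recall(example_detected, example_annotated)
```

Under the reverse-lookup strategy, a microRNA that miRBase links to a paper but that never appears in the abstract would lower recall even though the detector made no mistake, which is the false-negative concern listed above.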
Some Statistics

There are 18,314 entries in my Abstracts table
◦ Of those, there are 17,231 with usable abstracts
48% of these abstracts contain species indicators.
When the abstracts finished downloading (after 2 hours) there were already 16 new abstracts available.
My database has 21,222 unique microRNA listed from miRBase.
There are 62,036 microRNA with no ambiguity in the abstracts. 53% of total detections were improved by the species detection.
References







J. Imig, N. Motsch, J. Y. Zhu, S. Barth, M. Okoniewski, T. Reineke, M. Tinguely, A. Faggioni, P. Trivedi, G. Meister, C. Renner, and F. A. Grasser, “microRNA profiling in Epstein-Barr virus-associated B-cell lymphoma”, Nucleic Acids Research, vol. 39, no. 5, pp. 1880-1893, 2011.
M. Gerner, G. Nenadic, and C. Bergman, “LINNAEUS: A species name identification system for biomedical literature”, BMC Bioinformatics, vol. 11, no. 1, p. 85, 2010.
L. J. Jensen, J. Saric, and P. Bork, “Literature mining for the biologist: from information retrieval to biological discovery”, Nat Rev Genet, vol. 7, no. 2, pp. 119-129, 2006.
A. Kozomara and S. Griffiths-Jones, “miRBase: integrating microRNA annotation and deep-sequencing data”, Nucleic Acids Research, vol. 39, no. suppl 1, pp. D152-D157, 2011.
C. Sohngen, A. Chang, and D. Schomburg, “Development of a classification scheme for disease-related enzyme information”, BMC Bioinformatics, vol. 12, no. 1, p. 329, 2011.
C.J. Sun, X.L. Wang, L. Lin, and Y.-C. Liu, “A multi-level disambiguation framework for gene name normalization”, Acta Automatica Sinica, vol. 35, no. 2, pp. 193-197, 2009.
H. C. Wang, Y. H. Chen, H. Y. Kao, and S. J. Tsai, “Inference of transcriptional regulatory network by bootstrapping patterns”, Bioinformatics (Oxford, England), vol. 27, no. 10, pp. 1422-1428, 2011.
Questions
Any Questions?