Learning to Extract Proteins and their Interactions from Medline Abstracts

Machine Learning Group, University of Texas at Austin

Raymond J. Mooney, Razvan Bunescu, Ruifang Ge, Rohit J. Kate, Yuk Wah Wong
Department of Computer Sciences, University of Texas at Austin

Edward M. Marcotte, Arun Ramani
Institute for Cellular and Molecular Biology, University of Texas at Austin
Biological Motivation
• The Human Genome Project has produced huge amounts of genetic data.
• The next step is analyzing and interpreting this data.
Starting at the tip of chromosome 1...
[Slide shows several thousand bases of raw sequence, sixty per line: "taaccctaac taaccctaac aaccctaacc aaccctaacc taaccctaaa cccaacccca ..." — and 3×10⁹ more...]
Proteomics 101
• Genes code for proteins.
• Proteins are the basic components of biological
machinery.
• Proteins accomplish their functions by interacting
with other proteins.
• Knowledge of protein interactions is fundamental to
understanding gene function.
• Chains of interactions compose large, complex gene
networks.
Sample Gene Network
Yeast Gene Network
Yeast: ~5,800 genes
~5,800 proteins × 2-10 interactions/protein
⇒ ~12,000-60,000 interactions
~10,000-20,000 known ⇒ ~1/3 of the way to a complete map!
Human Gene Network
~40,000 genes
>>40,000 proteins × 2-10 interactions/protein
⇒ >>80,000-400,000 interactions
<5,000 known ⇒ approx. 1% of the complete map!
⇒ We're a long way from the complete map.
Relevant Sources of Data
Biological literature                  ~14 million documents
DNA sequence data                      ~10^10 nucleotides
Gene expression data                   ~10^8 measurements, but...
DNA polymorphisms                      ~10^7 known
Gene inactivation (knockout) studies   ~10^5
Protein structure data                 ~10^4 structures
Protein interaction data               ~10^4 interactions, but...
Protein expression data                ~10^4 measurements, but...
Protein location data                  ~10^4 measurements
Extraction from Biomedical Literature
• An ever-increasing wealth of biological information is present in millions of
published articles, but retrieving it in structured form is difficult.
• Much of this literature is available through the NIH NLM’s Medline repository.
• 11 million abstracts in electronic form are available
through Medline.
• Excellent source of information on protein interactions.
• Need automated information extraction to easily locate
and structure this information.
TI - Two potentially oncogenic cyclins, cyclin A and cyclin D1, share common properties of subunit configuration, tyrosine
phosphorylation and physical association with the Rb protein
AB - Originally identified as a ‘mitotic cyclin’, cyclin A exhibits properties of growth factor sensitivity, susceptibility to viral
subversion and association with a tumor-suppressor protein, properties which are indicative of an S-phase-promoting factor
(SPF) as well as a candidate proto-oncogene.
Other recent studies have identified human cyclin D1 (PRAD1) as a putative G1 cyclin and candidate proto-oncogene.
However, the specific enzymatic activities and, hence, the precise biochemical mechanisms through which cyclins function to
govern cell cycle progression remain unresolved.
In the present study we have investigated the coordinate interactions between these two potentially oncogenic cyclins, cyclin-dependent protein kinase subunits (cdks) and the Rb tumor-suppressor protein.
The distribution of cyclin D isoforms was modulated by serum factors in primary fetal rat lung epithelial cells.
Moreover, cyclin D1 was found to be phosphorylated on tyrosine residues in vivo and, like cyclin A, was readily
phosphorylated by pp60c-src in vitro.
In synchronized human osteosarcoma cells, cyclin D1 is induced in early G1 and becomes associated with p9Ckshs1, a Cdk-binding subunit.
Immunoprecipitation experiments with human osteosarcoma cells and Ewing’s sarcoma cells demonstrated that cyclin D1 is
associated with both p34cdc2 and p33cdk2, and that cyclin D1 immune complexes exhibit appreciable histone H1 kinase
activity.
Immobilized, recombinant cyclins A and D1 were found to associate with cellular proteins in complexes that contain the
p105Rb protein.
This study identifies several common aspects of cyclin biochemistry, including tyrosine phosphorylation and the potential to
interact directly or indirectly with the Rb protein, that may ultimately relate membrane-mediated signaling events to the
regulation of gene expression.
Manually Developed IE Systems for Medline
• A number of projects have focused on the manual
development of information extraction (IE) systems for
biomedical literature.
• KeX for extracting protein names (Fukuda et al., 1998):
Extract words with special symbols excluding those with more than half of the
characters being special symbols, hence eliminating strings such as “+/−”.
• Suiseki for extracting protein interactions (Blaschke et al.,
2001):
PROT (0-2) PROT (0-2) complex
NOUN between (0-3) PROT (0-3) and (0-3) PROT
Learning Information Extractors
• Manually developing IE systems is tedious and time-consuming, and hand-built
systems do not capture all possible formats and contexts for the desired
information.
• Machine learning from supervised corpora is becoming the standard
approach to building information extractors.
• Recently, several learning approaches have been applied to Medline
extraction (Craven & Kumlein, 1999; Tanabe & Wilbur, 2002;
Raychaudhuri et al., 2002).
• We have explored the use of a variety of machine learning techniques
to develop IE systems for extracting human protein names and
interactions, presenting uniform results on a single, reasonably
large, human-annotated corpus.
Non-Learning Protein Extractors
• Dictionary-based extraction
• KeX (Fukuda et al., 1998)
Learning Methods for Protein Extraction
• Rule-based pattern induction
– Rapier (Califf & Mooney, 1999)
– BWI (Freitag & Kushmerick, 2000)
• Token classification (chunking approach):
– K-nearest neighbor
– Transformation-Based Learning (Abgene; Tanabe & Wilbur, 2002)
– Support Vector Machine
– Maximum entropy
• Hidden Markov Models
• Conditional Random Fields (Lafferty, McCallum, and Pereira, 2001)
• Relational Markov Networks (Taskar, Abbeel, and Koller, 2002)
Our Biomedical Corpora
• 750 abstracts that contain the word human were randomly chosen
from Medline for testing protein name extraction. They contain a total
of 5,206 protein references.
• 200 abstracts previously known to contain protein interactions
were obtained from the Database of Interacting Proteins. They contain
1,101 interactions and 4,141 protein names.
• As negative examples for interaction extraction are rare, an extra set
of 30 abstracts containing sentences with non-interacting proteins
is included.
• The resulting 230 abstracts are used for testing protein interaction
extraction.
The Yapex Corpus
• 200 abstracts from Medline, manually tagged
for protein names.
• 147 randomly chosen such that they contain the
MeSH terms "protein binding", "interaction", and
"molecular".
• 53 randomly chosen from the GENIA corpus
http://www.sics.se/humle/projects/prothalt/
Evaluation Metrics for Information Extraction
• Precision is the percentage of extracted items that are
correct.
• Recall is the percentage of correct items that are extracted.
• Extracted protein names are considered correct if the same
character sequences have been human-tagged as protein
names in the exact positions.
• Extracted protein interactions from an abstract are
considered correct if both proteins have been human-tagged
as interacting in that abstract. Positions are not taken into
account.
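A minimal sketch of these metrics (with F-measure, the harmonic mean used in the results below); the (abstract_id, start, end) span encoding is an illustrative assumption that makes the exact-position matching criterion explicit:

```python
# Sketch: precision/recall/F-measure over position-tagged extractions.
def precision_recall_f1(extracted, gold):
    extracted, gold = set(extracted), set(gold)
    tp = len(extracted & gold)                     # exact-span matches
    p = tp / len(extracted) if extracted else 0.0  # correct among extracted
    r = tp / len(gold) if gold else 0.0            # extracted among correct
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

gold = {("ab1", 10, 18), ("ab1", 40, 52), ("ab2", 5, 11)}
extracted = {("ab1", 10, 18), ("ab1", 33, 37)}
print(precision_recall_f1(extracted, gold))  # (0.5, 0.333..., 0.4)
```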
Dictionary as Source of Domain Knowledge
• Before applying machine learning, abstracts are tagged by matching n-grams
against entries from a dictionary. Tagged abstracts are used as input for
subsequent methods.
• A dictionary of 42,000 protein names is used (synonyms included).
• Generalization of protein names leads to increased coverage (see the sketch
after the table):
Original Protein Name    Generalized Name
Interleukin-1 beta       Interleukin num greek
Interferon alpha-D       Interferon greek roman
NF-IL6-beta              NF IL num greek
TR2                      TR num
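A small sketch of this generalization, assuming the token classes suggested by the table (digits => num, Greek-letter words => greek, single Roman capitals => roman); the actual dictionary normalizer may differ:

```python
import re

GREEK = {"alpha", "beta", "gamma", "delta", "epsilon", "kappa", "sigma"}

def generalize(name: str) -> str:
    # Split on whitespace/hyphens so "NF-IL6-beta" -> NF, IL, 6, beta.
    tokens = re.findall(r"[A-Za-z]+|\d+", name)
    out = []
    for tok in tokens:
        if tok.isdigit():
            out.append("num")                     # digits -> num
        elif tok.lower() in GREEK:
            out.append("greek")                   # Greek letters -> greek
        elif len(tok) == 1 and tok.isupper():
            out.append("roman")                   # single capitals -> roman
        else:
            out.append(tok)
    return " ".join(out)

print(generalize("Interleukin-1 beta"))   # Interleukin num greek
print(generalize("Interferon alpha-D"))   # Interferon greek roman
print(generalize("NF-IL6-beta"))          # NF IL num greek
print(generalize("TR2"))                  # TR num
```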
Rule-based Learning Algorithms:
Rapier and BWI
• Rule-based learning algorithms are used for inducing patterns for
extracting protein names.
• For Rapier (Califf & Mooney, 1999), each rule consists of a pre-filler
pattern, a filler pattern and a post-filler pattern.
[ human ] [ (2) transcriptase ] [ ( ]
• For BWI (Freitag & Kushmerick, 2000), rules are composed of
contextual patterns called wrappers, recognizing the start or end of a
protein name.
[ human ] []
[ transcriptase ] [ ( ]
• High precision (> 70%) but low recall (< 25%).
Hidden Markov Models
• We use part-of-speech information in HMMs as described in (Ray &
Craven, 2001).
• We train a positive model that generates sentences containing proteins,
and a null model that generates sentences containing no proteins.
• Select the model which gives the highest likelihood of generating a
particular sentence, and tag the sentence using the Viterbi path in that
model.
[Diagram: two HMMs, each running from START to END over POS-labeled states; the positive model includes protein-emitting states such as NN:PROT alongside plain states such as NN, while the null model contains only plain POS states.]
• Moderate precision (~60%) and moderate recall (~40%).
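A toy sketch of the model-selection step: compute each sentence's likelihood under both HMMs with the forward algorithm and keep the more likely model. All parameters below are illustrative stand-ins, not the Ray & Craven models:

```python
import numpy as np

# pi: initial state probs, A: transition matrix, B[state, symbol]: emissions.
def forward_likelihood(obs, pi, A, B):
    alpha = pi * B[:, obs[0]]              # P(first symbol, state)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]      # propagate, then emit
    return alpha.sum()                     # P(observation sequence)

# Two-state toy models over a 3-symbol alphabet (made-up numbers).
pi = np.array([0.6, 0.4])
A_pos = np.array([[0.7, 0.3], [0.4, 0.6]])
B_pos = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
A_null = np.array([[0.9, 0.1], [0.2, 0.8]])
B_null = np.array([[0.6, 0.3, 0.1], [0.3, 0.4, 0.3]])

sentence = [0, 2, 1]  # encoded token/POS symbols
lp = forward_likelihood(sentence, pi, A_pos, B_pos)
ln = forward_likelihood(sentence, pi, A_null, B_null)
print("tag with positive model" if lp > ln else "tag with null model")
```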
Name Extraction by Token Classification
(“Chunking” Approach)
• Since in our data no protein names directly abut each other,
we can reduce the extraction problem to classification of
individual words as being part of a protein name or not.
• Protein names are extracted by identifying the longest
sequences of words classified as being part of a protein
name.
Two potentially oncogenic cyclins , cyclin A and cyclin D1 ,
share common properties of subunit configuration , tyrosine
phosphorylation and physical association with the Rb protein
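A minimal sketch of this reduction, assuming some classifier has already labeled each token (1 = part of a protein name): names are recovered as maximal runs of positively labeled tokens.

```python
def extract_names(tokens, labels):
    names, current = [], []
    for tok, lab in zip(tokens, labels):
        if lab == 1:
            current.append(tok)        # extend the current run
        elif current:
            names.append(" ".join(current))
            current = []
    if current:                        # flush a run ending the sentence
        names.append(" ".join(current))
    return names

tokens = ["cyclin", "A", "and", "cyclin", "D1", ","]
labels = [1, 1, 0, 1, 1, 0]
print(extract_names(tokens, labels))   # ['cyclin A', 'cyclin D1']
```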
Constructing Feature Vectors for
Classification
• For each token, we take the following as features:
  – Current token
  – Last 2 tokens and next 2 tokens
  – Output of dictionary-based tagger for these 5 tokens
  – Suffix for each of the 5 tokens (last 1, 2, and 3 characters)
  – Class labels for last 2 tokens
Two potentially oncogenic cyclins , cyclin A and cyclin D1 ,
share common properties of subunit configuration , tyrosine
phosphorylation and physical association with the Rb protein
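A hedged sketch of this feature construction; `dict_tags` stands in for the dictionary-based tagger's output and `prev_labels` for the classifier's earlier decisions, both hypothetical names:

```python
def token_features(tokens, dict_tags, prev_labels, i):
    feats = {}
    for j in range(i - 2, i + 3):                # the 5-token window
        tok = tokens[j] if 0 <= j < len(tokens) else "<PAD>"
        tag = dict_tags[j] if 0 <= j < len(tokens) else "<PAD>"
        pos = j - i
        feats[f"tok[{pos}]"] = tok               # current/context tokens
        feats[f"dict[{pos}]"] = tag              # dictionary tagger output
        for k in (1, 2, 3):                      # character suffixes
            feats[f"suf{k}[{pos}]"] = tok[-k:]
    for k in (1, 2):                             # labels of last 2 tokens
        feats[f"label[-{k}]"] = prev_labels[i - k] if i - k >= 0 else "<PAD>"
    return feats

tokens = ["cyclin", "D1", "binds", "Rb"]
dict_tags = ["PROT", "PROT", "O", "PROT"]        # hypothetical tagger output
print(token_features(tokens, dict_tags, ["S", "E"], 2))
```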
Maximum-Entropy Token Classifier
• Distinguish among 5 types of tags:
  S(-tart), C(-ontinue), E(-nd), U(-nique), O(-ther)
• Feature templates:
  – current, previous, and next word, and previous tag
  – part-of-speech for current, previous, and next word
  – word class (full), e.g. FGF1 => AAA0
  – word class (brief), e.g. FGF1 => A0 (Collins, ACL 2002)
• An extraction's confidence is the minimum of its transition probabilities.
[Diagram of a 4-token example omitted; α_t(y) is the forward probability of getting to state y at time step t.]
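A sketch of such a token classifier using scikit-learn's LogisticRegression (a maximum-entropy model) over the S/C/E/U/O tags; the training examples below are toy stand-ins for the feature vectors described above:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy feature dicts and tags; real input would come from token_features().
train_feats = [{"tok[0]": "cyclin", "suf3[0]": "lin"},
               {"tok[0]": "D1", "suf3[0]": "D1"},
               {"tok[0]": "and", "suf3[0]": "and"}]
train_tags = ["S", "E", "O"]

# DictVectorizer turns feature dicts into sparse vectors; LogisticRegression
# is a multinomial maximum-entropy classifier.
maxent = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
maxent.fit(train_feats, train_tags)
print(maxent.predict_proba([{"tok[0]": "cyclin", "suf3[0]": "lin"}]))
```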
MaxEnt: Greedy Extraction
• Use a Viterbi-like algorithm to find the most likely
complete sequence of tags.
• Drawback: many low confidence extractions are
missed.
• Want to be able to increase recall beyond Viterbi
results to control the precision-recall trade-off.
• Solution: use a greedy extraction algorithm on all
token sequences between any two consecutive Viterbi
extractions.
Experimental Method
• 10-fold cross-validation: Average results over 10 trials with
different training and (independent) test data.
• For methods which produce confidence in extractions, vary
threshold for extraction in order to explore recall-precision
trade-off.
• Use standard methods from information retrieval to
generate a complete precision-recall curve.
• Maximizing F-measure assumes a particular cost-benefit
trade-off between incorrect and missed extractions.
Protein Name Extraction Results
(Bunescu et al., 2004)
Graphical Models
An intuitive representation of conditional independence between domain
variables.
• Directed models => well suited to representing temporal and causal
relationships (Bayesian networks, HMMs)
• Undirected models => appropriate for representing statistical correlations
between variables (Markov networks)
• Generative models => define a joint probability over observations and labels
(HMMs)
• Discriminative models => define a probability over labels given a set of
observations (Conditional Random Fields [Lafferty et al. 2001])
  – Allow for arbitrary, overlapping features over the observation sequence.
Discriminative Markov Networks
G = (V, E) – an undirected graph
V = X ∪ Y – a set of discrete random variables
X – observed variables
Y – hidden variables (labels)
C(G) – the cliques of G
V_c = X_c ∪ Y_c – the set of vertices in a clique c ∈ C(G)
Φ = {φ_c | φ_c : V_c → R, c ∈ C(G)} – the set of clique potentials

A clique potential φ_c specifies the compatibility of any possible assignment
of values over the nodes in the associated clique c.

    P(Y | X) = (1 / Z(X)) ∏_{c ∈ C(G)} φ_c(X_c, Y_c)
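A toy illustration of this conditional distribution: two binary labels tied by a single pairwise clique, with a made-up potential table; enumerating assignments gives Z(X) and P(Y | X):

```python
from itertools import product

def phi(y1, y2):
    # Compatibility of the two labels; e.g. disfavor y1 = y2 = 1.
    table = {(0, 0): 1.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 0.2}
    return table[(y1, y2)]

scores = {y: phi(*y) for y in product([0, 1], repeat=2)}
Z = sum(scores.values())                       # the partition function Z(X)
posterior = {y: s / Z for y, s in scores.items()}
print(posterior)   # P(Y | X) for each joint label assignment
```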
Conditional Random Fields
[Lafferty et al. 2001]
• CRFs are a type of discriminative Markov network used for
tagging sequences.
• CRFs have shown superior or competitive performance in various
tasks such as:
  – Shallow Parsing [Sha & Pereira 2003]
  – Entity Recognition [McCallum & Li 2003]
  – Table Extraction [Pinto et al. 2003]
Conditional Random Fields (CRFs)
Lafferty, McCallum & Pereira 2001
• Undirected graphical model for sequence segmentation.
• Log-linear model; differs from the MaxEnt model because of its
"global normalization".

[Diagram: a linear chain of tag nodes Start, T1.tag, T2.tag, …, Tn.tag, End, with each position j also linked to observation nodes such as Tj.w and Tj.cap.]
• Tj.tag – the tag (one of S, C, E, U, O) at position j
• Tj.w – true if word w occurs at position j
• Tj.cap – true if word at position j begins with capital letter, …
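A sketch using sklearn-crfsuite, one open-source linear-chain CRF implementation (not the system evaluated here); the sentence, tags, and features are toy examples:

```python
import sklearn_crfsuite

def feats(tokens, j):
    # Word, capitalization, and neighboring-word features, as above.
    return {"w": tokens[j], "cap": str(tokens[j][0].isupper()),
            "prev": tokens[j - 1] if j > 0 else "<S>",
            "next": tokens[j + 1] if j < len(tokens) - 1 else "</S>"}

sent = ["cyclin", "D1", "binds", "Rb", "."]
tags = ["S", "E", "O", "U", "O"]          # S/C/E/U/O tagging scheme
X = [[feats(sent, j) for j in range(len(sent))]]
y = [tags]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict(X))
```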
Protein Name Extraction Results (Yapex)
Collective Classification of Web Pages
[Taskar, Abbeel & Koller 2002]
Collective Information Extraction
Task:
• Extracting protein/gene names from Medline abstracts.
Approach:
• Collectively classify all candidate phrases from the same abstract.
• Binary classification:
  – e.label = 0 => e is not a protein name
  – e.label = 1 => e is a protein name
Use two types of label correlations:
• Acronyms and their long forms.
• Repetitions of the same phrase.
Collective Information Extraction
The control of human ribosomal protein L22 ( rpL22 ) to enter into the nucleolus
and its ability to be assembled into the ribosome is regulated by its sequence .
The nuclear import of rpL22 depends on a classical nuclear localization signal of
four lysines at positions 13 – 16 … Once it reaches the nucleolus , the question of
whether rpL22 is assembled into the ribosome depends upon the presence of the
N - domain .
[Diagram: candidate phrases from the passage — e1 = "ribosomal protein L22", e2 = "( rpL22 )", e3 = "of rpL22 depends", e4 = "whether rpL22 is", e5 = "L22" — with an overlap edge between e1 and e5, an acronym edge between e1 and e2, and repetition edges among the rpL22 mentions.]
Relational Markov Networks
[Taskar, Abbeel & Koller 2002]
Discriminative Markov networks, augmented with clique templates:
• Overlap Template (OT)
• Acronym Template (AT)
• Repeat Template (RT)
[Diagram: the same candidates, now linked by template-instantiated cliques — a φ_OT clique between the overlapping e1 and e5 ("L22"), a φ_AT clique between e1 and e2, and φ_RT cliques among the repeated rpL22 mentions e2, e3, and e4.]
Candidate Entities: Definition
Candidate entities:
• The set of candidate entities usually depends on the type of named entity.
• In general, one could consider as candidates all phrases of length < L, where L
may be task dependent.
Two examples:
• [Genes, Proteins] Most entity names are base noun phrases or parts of them.
Thus a candidate extraction is any contiguous sequence of tokens whose POS
tags are from {"JJ", "VBN", "VBG", "POS", "NN", "NNS", "NNP",
"NNPS", "CD", "–"}, and whose head is either a noun or a number.
• [People, Organizations, Locations] Most entity names are sequences of proper
names, potentially interspersed with definite articles and prepositions.
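A sketch of candidate generation for the gene/protein case, assuming a maximum phrase length and treating the last token as the head (a simplification of real head finding); the tagged input is toy data:

```python
ALLOWED = {"JJ", "VBN", "VBG", "POS", "NN", "NNS", "NNP", "NNPS", "CD"}
HEADS = {"NN", "NNS", "NNP", "NNPS", "CD"}   # nouns or numbers

def candidates(tagged, max_len=4):
    spans = []
    for i in range(len(tagged)):
        for j in range(i, min(i + max_len, len(tagged))):
            window = tagged[i:j + 1]
            # Every tag allowed, and the head (last token) a noun/number.
            if all(t in ALLOWED for _, t in window) and window[-1][1] in HEADS:
                spans.append(" ".join(w for w, _ in window))
    return spans

tagged = [("ribosomal", "JJ"), ("protein", "NN"), ("L22", "NN")]
print(candidates(tagged))
# ['ribosomal protein', 'ribosomal protein L22', 'protein',
#  'protein L22', 'L22']
```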
Candidate Entities: Local Features
"… to the antioxidant superoxide dismutase – 1 ( SOD1 ) enzyme and …"

Entity features (based on features introduced in [Collins '02]):
• head word, with a generic placeholder for numbers => "HD = 0"
• entity text => "TXT = superoxide dismutase – 1"
• entity type, e.g. the concatenation of its word types => "TYPE = a a – 0"
• bigrams/trigrams at the entity's left/right boundaries, based on combinations
of lexical tokens and word types:
  – Bigrams left => "BL = antioxidant superoxide", "BL = antioxidant a", …
  – Bigrams right => "BR = 0 (", …
  – Trigrams left => "TL = the antioxidant superoxide", "TL = the antioxidant a", …
  – Trigrams right => "TR = 0 ( SOD1", "TR = 0 ( A0", …
• suffix/prefix lists of words and word types:
  – Prefixes => "PF = superoxide", "PF = superoxide dismutase", …
  – Suffixes => "SF = 0", "SF = – 0", "SF = dismutase – 0", …
Overlap Template
• Entity names should not overlap => hardwired overlap potential φ_OT.

"… to the antioxidant superoxide dismutase – 1 ( SOD1 ) enzyme and …"
(e1 and e2 are two overlapping candidate phrases in this sentence)

φ_OT(e1, e2):
                  e2.label = 0    e2.label = 1
  e1.label = 0         1               1
  e1.label = 1         1               0

d.OT = {(e1, e2), …}
Repeat Template
Production of nitric oxide ( NO ) in endothelial cells is regulated by direct interactions of
endothelial nitric oxide synthase ( eNOS ) …
Here we have used the yeast two - hybrid system and identified a novel 34 kDa protein ,
termed NOSIP ( eNOS interaction protein ) , which avidly binds to the carboxyl – terminal
region of the eNOS oxygenase domain .
[Diagram: a repeat clique φ_RT connects u = "eNOS" and v = "eNOS" through auxiliary variables u_OR and v_OR; here v overlaps the candidates v1 = "eNOS interaction" and v2 = "eNOS interaction protein", so v_OR = v1 ∨ v2, with each OR variable computed by an OR gate φ_OR over the overlapping candidates u1, …, um and v1, …, vn. d.RT = {(u, v), …}]
Acronym Template
"to the antioxidant superoxide dismutase – 1 ( SOD1 ) enzyme and"

[Diagram: v is the acronym candidate and v1, v2, v3 are candidate long forms ending just before the parenthesis; v_OR = v1 ∨ v2 ∨ v3, computed by an OR gate φ_OR over v1, …, vn, and the acronym clique φ_AT connects v with v_OR. d.AT = {v, …}]
Experimental Results
Datasets:
• Yapex – a dataset of 200 Medline abstracts, manually tagged for protein names.
• Aimed – a dataset of 225 Medline abstracts, of which 200 are known to mention
protein interactions.
• CoNLL – the CoNLL 2003 English dataset.
Compared three approaches:
• LT-RMN => RMN extraction using local templates + the Overlap Template.
• GLT-RMN => RMN extraction using both local and global templates.
• CRF => extraction as token classification using Conditional Random Fields
[Lafferty et al. 2001], with features based on the current word, previous/next
words, short/long word types, and POS tags [Bunescu et al. 2004].
Experimental Results – Yapex
[Bar chart: precision, recall, and F-measure for LT-RMN, GLT-RMN, and CRF on Yapex; axis range 50-75%.]
Experimental Results – Aimed
[Bar chart: precision, recall, and F-measure for LT-RMN, GLT-RMN, and CRF on Aimed; axis range 60-90%.]
Experimental Results – CoNLL
[Bar chart: precision, recall, and F-measure for LT-RMN, GLT-RMN, and CRF on CoNLL 2003; axis range 60-85%.]
Protein Interaction Extraction
• Most IE methods focus on extracting individual
entities.
• Protein interaction extraction requires extracting
relations between entities.
• Our current results on relation extraction have
focused on rule-based learning approaches.
Rapier and BWI Revisited:
the Inter-filler Approach
• Existing rule-based learning algorithms are used for
inducing patterns for identifying protein interactions.
• Rules are learned for extracting inter-fillers.
SHPTPW interacts with another signaling protein, Grb7.
• Inter-fillers are sometimes very long (~9 tokens on
average; 215 tokens maximum!). For some rule-based
learning algorithms (e.g. Rapier), the time complexity can
grow exponentially in the length of inter-fillers.
Rapier and BWI Revisited:
the Role-filler Approach
• In the role-filler approach, we extract two interacting proteins into
different slots, which we call the interactor and the interactee.
• A sentence is divided into segments. Interactors are associated with
interactees in the same segment using simple heuristics.
We show that the S252W mutation allows the mesenchymal splice form of
FGFR2 (FGFR2c) to bind and to be activated by the mesenchymally
expressed ligands FGF7 or FGF10 and the epithelial splice form of FGFR2
(FGFR2b) to be activated by FGF2, FGF6, and FGF9.
• Moderately high precision (> 60%) but low recall (< 40%).
ELCS
(Extraction using Longest Common Subsequences)
• A new method for inducing rules that extract interactions between
previously tagged proteins.
• Each rule consists of a sequence of words with allowable word gaps
between them (similar to Blaschke & Valencia, 2001, 2002).
- (7) interactions (0) between (5) PROT (9) PROT (17) .
• Any pair of proteins in a sentence, if tagged as interacting, forms a
positive example; otherwise it forms a negative example.
• Positive examples are repeatedly generalized to form rules until the
rules become overly general and start matching negative examples.
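A sketch of matching such a rule against a tokenized sentence, reading "(n)" as "at most n intervening words" (my assumption about the gap semantics):

```python
# A rule alternates maximum word gaps and required tokens, e.g.
# [7, "interactions", 0, "between", 5, "PROT", 9, "PROT", 17, "."].
def matches(rule, tokens):
    def rec(r, pos):
        if not r:
            return True
        gap, word, rest = r[0], r[1], r[2:]
        # Try every allowed position for the next required word.
        for i in range(pos, min(pos + gap + 1, len(tokens))):
            if tokens[i] == word and rec(rest, i + 1):
                return True
        return False
    return rec(rule, 0)

rule = [7, "interactions", 0, "between", 5, "PROT", 9, "PROT", 17, "."]
sent = ("Physical and functional interactions between the transcriptional "
        "inhibitors PROT and PROT .").split()
print(matches(rule, sent))  # True
```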
Generalizing Rules using
Longest Common Subsequence
The self - association site appears to be formed by interactions between
helices 1 and 2 of beta spectrin repeat 17 of one dimer with helix 3 of
alpha spectrin repeat 1 of the other dimer to form two combined alpha beta triple - helical segments .
Title – Physical and functional interactions between the transcriptional
inhibitors Id3 and ITF-2b .
- (7) interactions (0) between (5) PROT (9) PROT (17) .
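A sketch of the generalization step: a word-level longest common subsequence of the two sentences, with gap counts recorded as "(n)" markers (a simplified reading of the rule format above):

```python
def lcs_rule(a, b):
    m, n = len(a), len(b)
    # Standard LCS dynamic program, filled from the back.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            dp[i][j] = (dp[i + 1][j + 1] + 1 if a[i] == b[j]
                        else max(dp[i + 1][j], dp[i][j + 1]))
    # Walk the table, emitting "(gap) word" for each matched word.
    rule, i, j, gi, gj = [], 0, 0, 0, 0
    while i < m and j < n:
        if a[i] == b[j]:
            rule += [f"({max(i - gi, j - gj)})", a[i]]
            gi, gj = i + 1, j + 1
            i, j = i + 1, j + 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return " ".join(rule)

s1 = "interactions between helices 1 and 2 of PROT with PROT .".split()
s2 = "functional interactions between the inhibitors PROT and PROT .".split()
print(lcs_rule(s1, s2))
```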
The ELCS Framework
• A greedy-covering, bottom-up rule induction method is
used to cover all the positive examples without covering
many negative examples.
• We use an algorithm similar to beam search that considers
only the n = 25 best rules for generalization at any time.
• The confidence level of a rule is based on the number of
positive and negative examples the rule covers while
allowing some margin for noise (Cestnik, 1990).
Protein Interaction Extraction Results
Protein Interaction Extraction Results (full)
Ongoing and Future Work
• Extracted proteins and their interactions from 753,459 Medline abstracts on
human biology. Evaluation of results in progress.
• Improve the RMN approach with better local and global templates, better
candidate entity generation, and better algorithms for probabilistic inference.
• Extend the RMN approach to handle extracting relations between entities.
• Evaluate the RMN approach on other biological entities and relations and on
other non-biological corpora.
• Reduce human effort by actively selecting the best training examples for
human labeling.
• Combine evidence from text with other biological data sources to derive
accurate, comprehensive gene networks.
Conclusions
• We have compared a wide variety of existing machine-learning
methods for extracting human protein names and interactions.
• The CRF approach performs best among existing methods.
• We developed a new, more general approach based on RMNs that
allows collective extraction, integrating information across all
potential extractions.
• For extracting protein interactions, we found that several methods for
learning extraction rules outperform hand-written rules with respect to
precision, even with noisy protein tags.
The End