
Towards Automatic Extraction of Epistemic Items from
Scientific Publications
Tudor Groza
Siegfried Handschuh
Georgeta Bordea
Digital Enterprise Research Institute
National University of Ireland, Galway
IDA Business Park, Lower Dangan
Galway, Ireland
{tudor.groza, siegfried.handschuh, georgeta.bordea}@deri.org
ABSTRACT
The exponential growth of the World Wide Web over the last decade has brought with it an explosion of the information space. One heavily affected area is the scientific literature, where finding relevant work in a particular field and exploring the links between relevant publications is a cumbersome task. In this paper we take the initial steps towards the automatic extraction of epistemic items (i.e. claims, positions, arguments) from scientific publications. Our approach provides the foundation for a comprehensive solution that will partly alleviate the information overload problem. We detail the actual extraction process, the evaluation we have performed and relevant use-cases for our work.
Categories and Subject Descriptors
I.2.7 [Artificial Intelligence]: Natural Language Processing; H.3.1
[Information Storage and Retrieval]: Content Analysis and Indexing—linguistic processing; I.7.5 [Document and Text Processing]: Document Capture
General Terms
Algorithms, Experimentation
Keywords
Rhetorical structure of text, Information extraction
1. INTRODUCTION
The exponential growth of the World Wide Web over the last decade has brought with it an explosion of the information space. The same phenomenon can be observed in the area of scientific literature, where the number of publishing venues (journals, conferences, workshops, etc.) has increased substantially. As an example, in the biomedical domain, the well-known MedLine (http://medline.cos.com/) now hosts over 18 million articles, with a growth rate of 0.5 million articles per year,
which represents around 1300 articles per day [14]. This makes the process of finding relevant work in a particular field a cumbersome task.
One of the main reasons is that indexing publications based only on syntactic resources is no longer sufficient. The typical dissemination process consists of authors stating claims, positions or arguments in relation to their own achievements or the results achieved by other researchers. These epistemic items represent the key to decoding the semantics hidden in the publications' texts, and could thus represent a possible solution to the information overload problem. Moreover, creating links between them in inverse chronological order, starting from a particular author, allows us to explore the tacit virtual argumentative discourse networks spanned across multiple publications.
Externalization [9] represents the process of articulating tacit knowledge into explicit concepts. As such, it holds the key to knowledge creation. Consequently, the knowledge becomes crystallized, thus allowing it to be shared with and by others. Although made explicit, the externalized knowledge is dependent on the degree of formalization. In the case of the argumentation discourse based on claims (or epistemic items in general), it can be a couple of keywords or a weakly structured text, both possibly including direct references to the publications stating the actual claims. Previous work focused on defining the appropriate models for the externalization of epistemic items, both at the local (intra-publication) [3] and at the global (inter-publication) [4] levels. The foundation of this research relies on the Rhetorical Structure of Text Theory (RST) [6], which introduces a coherence structure of the discourse based on elementary text chunks and the rhetorical relations between them. The main drawback of our work so far is the manual approach that was followed.
In this paper we lay the foundation necessary for the automatic extraction of epistemic items, and thus advance closer towards our ultimate goal of building the virtual argumentative discourse networks automatically. The entire extraction process, as we have envisioned it, consists of multiple complex stages. Therefore, here, we focus only on the first stage, i.e. the general extraction of epistemic items, without differentiating between their several possible types (claims, arguments, or positions). This process comprises an empirical analysis of a publication corpus, the development of a knowledge acquisition module and a preliminary assignment of probabilities. At the end of the process we have two main results: (i) a list of epistemic items (with associated initial probabilities) which offers a quick (but fairly comprehensive) view of a publication, and (ii) an ontological model of the publication in terms of epistemic items and the rhetorical relations between them. Subsequent stages of the overall extraction include a proper type-based differentiation among the extracted items, capturing their temporal aspects, and the development of a learning mechanism that improves the assessed initial probabilities based on a pool of established features.
The remainder of the paper is structured as follows: in Sect. 2 we briefly introduce the foundation of our work, i.e. RST and SALT (Semantically Annotated LaTeX). Then, Sect. 3 details the proposed approach for the extraction of epistemic items and the preliminary evaluation we have performed. In Sect. 4 we present a possible application, and before concluding in Sect. 6, we analyze the research related to our work in Sect. 5.
2. BACKGROUND
2.1 Rhetorical Structure of Text
Figure 1: Example of rhetorical relations
The Rhetorical Structure of Text Theory (RST) was first introduced by Mann and Thompson in 1987 [6], with the goal of providing a descriptive theory for the organization and coherence of natural text. The application domains of RST vary from computational linguistics, cross-linguistic studies and dialogue to multimedia presentations and natural language generation. The theory comprises a series of elements, as follows: (i) Text spans as
uninterrupted linear intervals of text, that can have the roles of
Nucleus or Satellite. A Nucleus represents the core (main part /
idea) of a sentence or phrase, while the Satellite represents a text
span that complements the Nucleus with additional information;
(ii) Schemas that define the structural constituency arrangements
of text. They mainly provide a set of conventions that are either
independent or inclusive of particular rhetorical relations that connect different text spans. The theory proposes a set of 23 rhetorical relations with an almost flat structure (e.g. Circumstance, Elaboration, Antithesis, etc.). Figure 1 depicts a simple example of a chain of nuclei and satellites connected via the Justify and Antithesis rhetorical relations; (iii) Schema applications that introduce higher level conventions to determine possible applications of
a schema; (iv) Structures representing a set of schema applications
satisfying particular constraints.
2.2 Semantically Annotated LaTeX
SALT represents a semantic authoring framework targeting the enrichment of scientific publications with semantic metadata. SALT currently adopts two elements from RST: the text spans and the schemas. In [3] we introduced an approach for externalizing the epistemic items captured within scientific publications, having a local scope, i.e. considering the publication on its own, without the external references. The model has two layers: a syntactic layer and a semantic layer. The semantic layer comprises three ontologies: (i) the Document Ontology, modeling the linear structure of a document, (ii) the Rhetorical Ontology, capturing the rhetorical and argumentation structure of the publication, and (iii) the Annotation Ontology, which connects the rhetoric hidden in the document's content with the physical document itself. The Rhetorical Ontology has its roots in RST, but it limits the modeling to a set of 11 rhetorical relations that have a higher chance of being present in scientific discourse.
3. EXTRACTION OF EPISTEMIC ITEMS
3.1 Research Rationale
Before describing the actual extraction process, we will introduce the terminology used throughout this section and explain the
rationale behind the method we have followed. We believe that the semantics provided by the rhetorical relations (according to RST) can be used to externalize the epistemic items hidden in scientific publications. In turn, having these items externalized, one can improve, for example, the retrieval of scientific publications and increase the overall value a user receives.
Our main goal for the current stage of extraction is to detect those text spans in the discourse that have a certain probability of acting as epistemic items for the publication. From an extraction perspective, this amounts to computing the afore-mentioned probability based on a series of factors. Currently, we consider three main factors: (i) rhetorical relations, (ii) power items, and (iii) the block placement.
Rhetorical relations model the coherence of the discourse and, therefore, can be used to extract epistemic items. The rhetorical relations taken into account are the ones present in the SALT Rhetorical Ontology, and subscribe to the definitions given in [6]: antithesis, cause, circumstance, concession, condition, consequence, elaboration, evidence, means, preparation, purpose, restatement. Since we use only a limited set of rhetorical relations, we accept from the start a compromise inclined more towards the soundness than the completeness of our solution. Nevertheless, when dealing with natural language processing, such a compromise should be acceptable.
To re-balance soundness and completeness, we also considered a special type of text span that has a high chance of not being part of a rhetorical relation, yet acting as an epistemic item for the publication. We called this type power item. As we shall see, such power items are detected based on a series of particular discourse markers. For example, the text "We present an experimental comparison of existing storage ..." represents a power item.
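As a purely illustrative sketch of how such markers can be exploited, the following Python fragment detects power items based on a handful of hand-picked markers; the marker list and the function name are our own assumptions, not the exact resources used by the acquisition module.

import re

# A few illustrative power-item markers; the actual list used by our
# acquisition module is larger and was compiled empirically (Sect. 3.2).
POWER_ITEM_MARKERS = [
    r"in this paper\b",
    r"we present\b",
    r"we propose\b",
    r"our (?:main )?contribution\b",
]

def find_power_items(sentences):
    """Return the sentences that contain a power-item marker."""
    pattern = re.compile("|".join(POWER_ITEM_MARKERS), re.IGNORECASE)
    return [s for s in sentences if pattern.search(s)]

if __name__ == "__main__":
    text = [
        "We present an experimental comparison of existing storage systems.",
        "The corpus was tokenized with a standard pipeline.",
    ]
    print(find_power_items(text))  # -> only the first sentence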
In addition to the two linguistic features above, we believe that the role of a particular text span is also influenced by its placement in the linear discourse structure (i.e. the block placement). There are multiple ways of structuring a publication into blocks, either based directly on the linear flow or based on the rhetorical roles a block of text can have. We followed a mixed approach, by splitting the publication into five blocks: abstract, introduction, body, related work and conclusion. Three of them, i.e. abstract, related work and conclusion, have a rhetorical role (see also the rhetorical blocks in SALT), while the other two are part of the usual linear discourse structure. If analyzed in the order imposed by the typical linear discourse structure, the length of text corresponding to each block follows a standard normal distribution with a mean of 0 and a variance between 0.3 and 1, depending on the overall publication length. Consequently, we believe that, in terms of epistemic value, the two tails of the distribution will create local peaks, or focal pools, of epistemic items.
Figure 2: Cue-phrases to relation mappings
Given these three factors, from an engineering perspective we needed: (i) to develop the appropriate acquisition module, following a computational linguistics approach, which takes as input the content of a publication, the mapping between discourse markers and rhetorical relations (and power items), and the block structure of the publication, and outputs epistemic item candidates, and then (ii) to compute a set of initial probabilities for these candidates to actually represent epistemic items. In the following, we detail each of these steps.
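As a minimal sketch of how the three factors could be wired together (all names and, except for the 0.84 value discussed in Sect. 3.5, all numbers are illustrative assumptions rather than our actual implementation):

from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    relation: str   # a rhetorical relation from our set, or "power_item"
    block: str      # abstract, introduction, body, related work, conclusion

# Illustrative initial probabilities per (relation, block) pair; apart from the
# power item / abstract value of 0.84 (Sect. 3.5), the numbers are invented.
INITIAL_PROBABILITY = {
    ("power_item", "abstract"): 0.84,
    ("antithesis", "abstract"): 0.70,
    ("elaboration", "body"): 0.35,
}

def score(candidate, default=0.1):
    """Assign an initial probability to a candidate epistemic item."""
    return INITIAL_PROBABILITY.get((candidate.relation, candidate.block), default)

candidates = [
    Candidate("We present an experimental comparison of existing storage ...",
              "power_item", "abstract"),
    Candidate("the Lexicon Builder automatically extracts entities ...",
              "circumstance", "body"),
]
ranked = sorted(candidates, key=score, reverse=True)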
3.2 Empirical Analysis
To automate the process of identifying the elementary text spans
and the rhetorical relations that hold among them, we rely mostly
on the discourse function of cue phrases, i.e. words such as however, although and but. An exploratory study of such cue phrases
provided us with an empirical grounding for the development of
an extraction algorithm. The primary function of cue phrases or
connectives is to structure the discourse. They also have highly elaborate pragmatic functions, such as signaling shifts in the subjective perspective or presupposing various states of belief. At
the current stage, we focused our attention only on discourse connectives and lexico-grammatical constructs that can be detected by
means of a shallow analysis of natural language texts.
The number of discourse markers in a typical text, approximately
one marker for every two clauses [11], is sufficiently large to enable the derivation of rich rhetorical structures of texts. More importantly, the absence of markers correlates with a preference of
readers to interpret the unmarked textual units as continuations of
the topics of the units that precede them. We assume that the texts
we process are well-formed from a discourse perspective, much as
researchers in sentence parsing assume that they are well-formed
from a syntactic perspective.
Taking inspiration from the work performed by Marcu [7], we analyzed a corpus of around 130 publications from the Computer Science field and identified 75 cue phrases that signal the rhetorical relations mentioned above. Fig. 2 summarizes some of the cue-phrase to relation mappings (including power items). For example, the cue phrase when signals a circumstance relation in the following text:
the cue phrase when signals a circumstance relation in the following text:
When [an ontology is selected as the underlying knowledge
base,]1 [the Lexicon Builder automatically extracts entities out of
the ontology to build the Lexicon.]2
with the first text span being a satellite and the second a nucleus.
Similarly, the cue phrase in this paper signals a power item in the
following snippet:
In this paper [we lay the foundation necessary for the automatic
extraction of epistemic items, and thus advancing . . . ]1
For each cue phrase we extracted a number of text fragments,
in order to identify two types of information: (i) discourse related
information, and (ii) algorithmic information. The discourse related information is concerned with the cue phrases under scrutiny,
the rhetorical relations that are marked by the cue phrases and the
roles of the related text spans (nuclei or satellites). In contrast to
the discourse related information, which has a general linguistic interpretation, the algorithmic information is specific to determining the elementary text units of a text in a particular context. Basically, for each cue phrase we collected its position in the sentence, its position relative to the neighboring text spans and the surrounding
punctuation. This information constitutes the empirical foundation
of our algorithm that identifies the elementary unit boundaries and
discourse usages of the cue phrases. It helps us in the disambiguation process (as mentioned above) and hypothesizes rhetorical relations that hold among text spans.
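To make the kind of algorithmic information we collected more concrete, the following simplified sketch (our own illustration, not the tool we used for the corpus study) records, for each occurrence of a cue phrase, its position within the sentence and the punctuation immediately around it:

import re

CUE_PHRASES = ["however", "although", "but", "while", "since", "in this paper"]

def cue_phrase_contexts(sentence):
    """Collect simple positional and punctuation information for each cue phrase occurrence."""
    contexts = []
    lowered = sentence.lower()
    for cue in CUE_PHRASES:
        for match in re.finditer(r"\b" + re.escape(cue) + r"\b", lowered):
            start, end = match.span()
            rest = lowered[end:].strip(" .,;")
            if start == 0:
                position = "beginning"
            elif not rest:
                position = "end"
            else:
                position = "middle"
            contexts.append({
                "cue": cue,
                "position": position,
                "preceded_by_comma": lowered[:start].rstrip().endswith(","),
                "followed_by_comma": lowered[end:].lstrip().startswith(","),
            })
    return contexts

print(cue_phrase_contexts(
    "Incremental methods should be developed, since it would be inefficient."))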
3.3 Knowledge Acquisition
The actual knowledge acquisition was implemented as a GATE (http://gate.ac.uk/) plugin. The algorithmic information gathered previously was encoded in the form of annotations, which we called CUE_PHRASE annotations, as part of the JAPE grammars executed by the plugin. For each cue phrase we identified as signaling a rhetorical relation from our set, we created an associated CUE_PHRASE annotation, based on JAPE rules. An example of such a rule is shown below:
Rule: CuePhRule1
({Token.string == ","})
({Token.string == "while"}):cuePhrase1
-->
:cuePhrase1.CUE_PHRASE = {kind = "cuePh",
    relation = "antithesis", place = B,
    breakAction = NORMAL_THEN_COMMA,
    statuses = NN, whereToLink = B,
    rule = "CuePhRule1"}
On the left hand side of the rule we have the discourse marker ", while", and on the right hand side of the rule we create the proper
CUE_PHRASE annotation, by assigning the corresponding discourse
related and algorithmic information. The orthographic environment of the cue phrase is encoded in the left side of the rule. It
contains the marker under consideration and all the punctuation that
precedes or follows it. Using the information derived from the empirical analysis, we have identified a series of fields to be important
for the CUE_PHRASE annotation, detailed in the following. Also,
a complete example of discourse related and algorithmic information, coded in fields for the Concession rhetorical relation, is shown
in Fig. 3.
3.3.1 relation field
The relation field specifies the rhetorical relation type that is signaled by the cue phrase under scrutiny. The list of relations used
was the one presented above.
3.3.2 whereToLink field
This field describes whether the textual unit that contains the
discourse marker under scrutiny is related to a textual unit found
Before (B) or After (A) it. For example, in the following
text, the textual unit that contains the marker since, is rhetorically
related to the textual unit that is placed before it (B).
E.g. 1: [Incremental methods should be developed][, since it
would be inefficient to compute a materialized repair of the
database to a query from scratch after every update.]
Figure 3: Example of discourse related and algorithmic information for the Concession rhetorical relation
In contrast to this example, the clause that contains the discourse
marker While in the following snippet, is rhetorically related to the
clause that comes immediately after it (A).
E.g. 2: [While we think that the right repair semantics may be
application dependent,][being able to compare the possible
semantics in terms of complexity may also shed some light on
what may be the repair semantics of choice.]
3.3.3 statuses field
The statuses field describes the rhetorical statuses of the textual
units that are connected through a rhetorical relation that is signaled by the marker under scrutiny. The status of a textual unit can
be Nucleus (N) or Satellite (S). The field contains two
rhetorical statuses, showing also the coherence order. For example, the statuses field for the marker since in E.g. 1 is (NS) because
the clause-like unit [Incremental methods should be developed] is
the Nucleus and the clause-like unit [, since it would be inefficient to compute a materialized repair of the database to a query
from scratch after every update.] is the Satellite of a rhetorical relation of Evidence. Similarly, the statuses field for the
marker While in E.g. 2 is (NN) because both clauses represent a
Nucleus.
3.3.4 breakAction field
The breakAction field contains one member of a set of instructions for a shallow analyzer that determines the elementary units of
text. The shallow analyzer assumes that text is processed in a leftto-right fashion. Whenever a cue phrase is encountered, the shallow analyzer executes an action from the set {NONE, NORMAL,
NORMAL_THEN_COMMA}. The effect of these actions can be to create an elementary textual unit boundary in the input text. Such a
boundary corresponds to the square brackets used in the examples
that were discussed so far.
3.3.5 place field
The place field specifies the position of the discourse marker under scrutiny in the textual units to which it belongs. The possible
values taken by this field are: Beginning (B), when the cue
phrase occurs at the beginning of the textual units to which it belongs, Middle (M), when it is in the middle of the unit, and End
(E), when it is at the end. As an example, the content of the field
place in E.g. 2 is B.
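To illustrate how these fields drive the shallow analyzer, the sketch below segments a sentence using a toy rule table; it is a simplification under our own assumptions and not the actual GATE plugin.

# Minimal illustration of how the CUE_PHRASE fields could drive a shallow
# analyzer; the two rules below are a toy subset, not our full JAPE grammar.
RULES = {
    ", since": {"relation": "evidence", "whereToLink": "B", "statuses": "NS",
                "breakAction": "NORMAL", "place": "B"},
    ", while": {"relation": "antithesis", "whereToLink": "B", "statuses": "NN",
                "breakAction": "NORMAL_THEN_COMMA", "place": "B"},
}

def segment(sentence):
    """Split a sentence into elementary units at known cue phrases (simplified)."""
    units, links, current = [], [], sentence
    for marker, info in RULES.items():
        idx = current.lower().find(marker)
        if idx > 0 and info["breakAction"] != "NONE":
            units.append(current[:idx].strip())   # unit before the boundary
            current = current[idx:].strip()       # unit starting with the marker
            links.append((info["relation"], info["statuses"], info["whereToLink"]))
    units.append(current)
    return units, links

units, links = segment(
    "Incremental methods should be developed, since it would be inefficient "
    "to compute a materialized repair from scratch after every update.")
# units -> ['Incremental methods should be developed',
#           ', since it would be inefficient to compute ...']
# links -> [('evidence', 'NS', 'B')]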
3.4 Experiment
The annotation of epistemic items in a document is a highly
subjective task. Different people have diverse mental representations of a given document, depending on their domain, the depth
of knowledge of the document in question, and their attitude towards its content. Therefore, probably the most reliable annotator
for a scientific publication would be its author. In order to capture
the way in which people find (and maybe interpret) the epistemic
items, we ran an experiment. The goal of the experiment was to
allow us to compute initial values for the probabilities of text spans
to be epistemic items.
The setup of the experiment involved ten researchers (authors of scientific publications), two corpora and two tasks. The two tasks were essentially the same task, each time performed on a different corpus. The first corpus comprised a set of ten publications chosen by us, while the second corpus had 20 publications provided by the annotators. Each annotator provided us with two of her own publications. For each publication, we extracted a list of text spans (similar to the previous examples) using the knowledge acquisition module, and presented this list to the annotators. On average, each list had around 110 items.
The assignment of the publications was done as follows: each annotator received, on the one hand, her own two publications (more exactly, the lists of text spans extracted from these publications), and on the other hand, four publications from the corpus compiled by us. In this way, each publication in our corpus was assigned to four different annotators. The reason for choosing this assignment algorithm was that we wanted a moderate diversity of opinions on a larger number of documents, rather than a very high diversity of opinions on a small number of documents.
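A hypothetical reconstruction of such an assignment (not the exact procedure we used) could look as follows:

from itertools import cycle

def assign_publications(annotators, publications, per_annotator=4):
    """Round-robin assignment: every publication ends up with the same number of annotators."""
    pubs = cycle(publications)
    return {a: [next(pubs) for _ in range(per_annotator)] for a in annotators}

annotators = ["A%d" % i for i in range(1, 11)]      # ten annotators
publications = ["P%d" % i for i in range(1, 11)]    # ten publications provided by us
assignment = assign_publications(annotators, publications)
# With 10 annotators and 4 publications each, every publication is judged by exactly 4 annotators.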
The annotators' task was to detect and mark (is vs. is not), on the given lists, the text spans that they considered to be epistemic items, i.e. claims, positions or arguments. There are a series of remarks to be noted at this point. Although the complexity of the task was high enough by its nature, we chose not to run the experiment in a controlled environment. Unlike Teufel et al. [13], who performed a similar experiment, including an initial training of around 20 hours and proper definitions of what they would expect as a result, we did not provide a formal definition of what an epistemic item represents, or any linguistic constructs which the annotators should search for. Nevertheless, we did provide some indications intended to speed up the task. These indications amounted to mentioning that we refer to claims, positions and arguments when talking about epistemic items, and to a list of questions that the annotators should ask themselves when in doubt about one particular item. The questions were the following: (i) Is this text span reflecting or referring to part of the publication's contributions? (ii) Is this text span reflecting the author's position on a particular topic relevant for the publication? (iii) Is this text span comparing the publication's contribution to similar approaches? (iv) Would I use this text span for searching on the web for similar publications?
By following this approach, we wanted the annotators to react based on their own knowledge and perception, rather than following a given schema. We are aware that training would have boosted the inter-annotator agreements presented later in the section. Still, training usually has a short-term memory span, and the rules we would have imposed might not have overlapped completely with the annotators' psychological background, i.e. with the way in which they think about and observe the rhetoric and argumentation within publications.
Another remark is related to the lists of candidates presented to the annotators. These lists were compiled based only on the presence of rhetorical relations and power items. Thus, as already mentioned, they did not ensure the presence of all possible epistemic items in the publication. Nevertheless, at this stage, we wanted to measure only the relation between the presence of rhetorical relations within the given text spans (including power items) and the chance of them being epistemic items. Consequently, the annotators had only the YES/NO options, applicable on the list. At a later stage, in the preliminary evaluation, the annotators (though a different set of people) were able to annotate any text span, possibly including also the ones that might have been missed by the knowledge acquisition module. Last but not least, the annotators were not formally aware of the presence of the rhetorical relations, nor did they have knowledge of RST. The list of candidates contained plain text items with no additional information attached, i.e. they were not tagged according to the particular relations present in them.
Figure 4: Example of inter-annotator agreement results for a particular publication
On the resulting marked lists, i.e. the participants' votes on the lists of candidates, we computed two metrics. The candidates extracted from the publications in the corpus we provided served as input for computing the inter-annotator agreement, while the rest of the candidates, the ones extracted from the publications provided by the annotators, were used as input for computing a specific recall.
The first metric we computed was the proportional specific raw
inter-annotator agreement per publication, defined as follows:
$$p_s(j) = \frac{S(j)}{S_{poss}(j)} = \frac{\sum_{k=1}^{K} n_{jk}(n_{jk} - 1)}{\sum_{k=1}^{K} n_{jk}(n_k - 1)}$$
where $p_s(j)$ is the proportion of agreement specific to category $j$, $K$ is the total number of cases (for us, annotators), $n_{jk}$ is the number of times category $j$ is applied to a case, and
$$n_k = \sum_{j=1}^{C} n_{jk}$$
with $C = 2$ in our case, as the categories we had were $j = 1$ (positive agreement) and $j = 0$ (negative agreement). The overall agreement is then computed as:
$$p_o = \frac{O}{O_{poss}} = \frac{\sum_{j=1}^{C} S(j)}{\sum_{k=1}^{K} n_k(n_k - 1)}$$
For each publication we calculated the proportions of specific
(positive, negative and overall) raw inter-annotator agreements for
all combinations of three out of four assigned annotators, and for
the combination considering all four annotators at once. In addition, when creating the result tables per publication, we split the text spans according to their involvement in a rhetorical relation and based on the block placement. Figure 4 depicts an example results table for a particular publication. Considering, for example, the group formed by annotators 1-2-4, we can observe a positive agreement of 1.0 on text spans containing antithesis in the abstract or of 0.75 on power items in the conclusion. At the same time, the group
reached a 0.5 negative agreement on the same text spans labeled as
power items in the conclusion. To make this more explicit, these agreements translate into the following: (i) on all the text spans located in the abstract block of the publication under scrutiny and involved in an antithesis rhetorical relation (without knowing these details), all the annotators in group 1-2-4 agreed that the text spans represent epistemic items; (ii) on all the text spans located in the conclusion block of the publication and labeled as power items, the annotators in group 1-2-4 agreed that they do represent epistemic items with a 0.75 specific raw inter-annotator agreement and that they do not represent epistemic items with a 0.5 specific raw inter-annotator agreement.
We chose to compute this metric on different groups, in addition to the overall one, because: (i) we wanted to eliminate possible "malicious" annotators, i.e. annotators that either marked items very selectively or marked almost everything, and (ii) we wanted to find the groups with the highest homogeneity in annotation, and thus the highest overall agreement. For the computation of the initial probabilities we considered these latter groups and their positive agreement values.
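The two agreement measures above translate directly into code; the sketch below is our own reading of the formulas, with cases represented as the lists of labels assigned by the annotators who judged each text span.

def specific_agreement(votes, category):
    """Proportional specific raw agreement p_s(j) for one category.

    `votes` is a list of cases; each case is the list of category labels
    (here 0 or 1) assigned by the annotators who judged that case.
    """
    s = sum(case.count(category) * (case.count(category) - 1) for case in votes)
    s_poss = sum(case.count(category) * (len(case) - 1) for case in votes)
    return s / s_poss if s_poss else 0.0

def overall_agreement(votes, categories=(0, 1)):
    """Overall raw agreement p_o over all categories."""
    o = sum(sum(case.count(j) * (case.count(j) - 1) for case in votes)
            for j in categories)
    o_poss = sum(len(case) * (len(case) - 1) for case in votes)
    return o / o_poss if o_poss else 0.0

# Three annotators judging four text spans (1 = epistemic item, 0 = not).
votes = [[1, 1, 1], [1, 1, 0], [0, 0, 0], [1, 0, 0]]
print(specific_agreement(votes, 1), specific_agreement(votes, 0), overall_agreement(votes))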
For testing the statistical significance of our results, we performed Pearson's $\chi^2$ test with Yates' correction, considering the null hypothesis that the annotators are independent, against the alternative that the observed agreement proportions are higher than the proportions expected if the agreement occurred by chance.
$$\chi^2 = \sum_{i=1}^{N}\sum_{j=1}^{n_a} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}}$$
where $N$ is the number of degrees of freedom, $n_a$ is the number of annotators, $O_{i,j}$ is the observed agreement and $E_{i,j}$ is the expected agreement. In our case $N$ is 1, as we calculated the agreement based on a 2 x 2 contingency table, and $n_a = 3$, because in all cases the group we chose with the highest homogeneity comprised three annotators. Fig. 5 summarizes the values for $\chi^2$ and the $p$-value, showing that our results are statistically significant.
Figure 5: Chi-square statistics per publication
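As a rough illustration of the test, the sketch below runs Pearson's chi-square with Yates' correction on a hypothetical 2 x 2 table using scipy; the counts are invented and do not correspond to the values behind Fig. 5.

from scipy.stats import chi2_contingency  # applies Yates' correction by default for 2 x 2 tables

# Hypothetical 2 x 2 table of observed (dis)agreements; illustrative counts only.
observed = [[40, 5],
            [7, 58]]

chi2, p_value, dof, expected = chi2_contingency(observed)
print("chi2 = %.2f, p = %.4f, dof = %d" % (chi2, p_value, dof))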
The second metric we computed, based on the items extracted from the annotators' own publications, was a specific recall for each block placement, i.e. the proportion of times a text span was chosen, considering that it was involved in a particular rhetorical relation, out of the total number of text spans involved in the same relation in the given block. For example, if the introduction of the publication had 10 text spans, each involved in an antithesis relation, and the annotator marked 4 of them, the recall is 0.4. Fig. 6 presents all the computed results, with the IAA column representing the inter-annotator agreement and the OWN column representing the recall mentioned above.
Figure 6: Table of computed weights for the items based on the relation and block placement
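A small sketch of how this specific recall can be computed per (relation, block) pair (the representation of the spans is our own assumption):

from collections import Counter

def specific_recall(spans, marked):
    """Per (relation, block) recall: fraction of extracted spans the author marked.

    `spans` maps a span id to its (relation, block) pair; `marked` is the set
    of span ids the author selected as epistemic items.
    """
    totals, hits = Counter(), Counter()
    for span_id, key in spans.items():
        totals[key] += 1
        if span_id in marked:
            hits[key] += 1
    return {key: hits[key] / totals[key] for key in totals}

spans = {1: ("antithesis", "introduction"), 2: ("antithesis", "introduction"),
         3: ("antithesis", "introduction"), 4: ("power_item", "conclusion")}
print(specific_recall(spans, marked={1, 4}))
# antithesis/introduction -> 1 of 3 marked; power_item/conclusion -> 1 of 1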
3.5 Preliminary Evaluation
To test our assumptions at this initial stage, we performed a preliminary evaluation. The setup of the evaluation was similar to that of the experiment described above. We used two corpora (with a total of 20 publications), one with the evaluators' own papers (15 papers, as we had 15 evaluators) and one containing a set of papers we provided (another 5 papers). The evaluators were different from the ones involved in the experiment. Each evaluator was asked to read her own paper and one paper we assigned to her, and to freely mark on them those text spans that she considered to have the role of epistemic items. Similar to the experiment, there was no training involved before the evaluation (for the same reasons), although we did provide the same four questions as hints.
On our side, we developed a small information extraction module that took as input, for each type of rhetorical relation, the initial probabilities (conditioned by the placement in the linear discourse structure) previously found. The actual final probability (column P_final in Fig. 6) was computed following a naïve approach that gives more weight to the specific recall than to the positive inter-annotator agreement. As an example, if by analyzing the text we find a power item in the publication's abstract, we assign a probability of 0.84 for the corresponding text span to act as an epistemic item (i.e. a claim, position or argument).
Regarding the final probability, in a real setting the actual balance between the raw positive inter-annotator agreement and the specific recall depends on several factors, such as other parameters to be considered or the way in which the results will be used. We will discuss this further in the following section. In this particular case, we chose to give more weight to the specific recall, with the presumption that this would be reflected correspondingly in the final performance measures we calculate. Our expectation was that, by using this balance, the module would perform better on the corpus provided by the authors than on the corpus we provided. For the actual information extraction, we set a linear threshold given by chance (i.e. 0.5), thus all the text spans involved in rhetorical relations that had a probability greater than this threshold were considered proper epistemic items. These were then returned as a list of candidates for the processed paper.
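A sketch of this weighting and thresholding step is given below; the 0.7/0.3 weights are an assumption chosen for illustration, as the text above only states that the specific recall receives more weight than the positive agreement.

def final_probability(iaa, own_recall, w_own=0.7, w_iaa=0.3):
    """Naive weighted combination of positive agreement and specific recall (illustrative weights)."""
    return w_own * own_recall + w_iaa * iaa

def extract(candidates, threshold=0.5):
    """Keep the spans whose (relation, block) probability exceeds the chance threshold."""
    return [span for span, p in candidates if final_probability(*p) > threshold]

# Each candidate carries the (positive IAA, specific recall) pair of its
# (relation, block) combination, e.g. taken from a table such as Fig. 6.
candidates = [
    ("We present an experimental comparison ...", (0.75, 0.90)),
    ("the Lexicon Builder automatically extracts ...", (0.20, 0.30)),
]
print(extract(candidates))  # only the first span passes the 0.5 threshold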
Returning to the evaluation setup, in parallel to the manual annotation done by the evaluators, we ran our module on the same set of publications and compiled the predicted list of candidate epistemic items. At the end, we considered the items extracted manually by the evaluators as the ground truth (or gold standard) and compared them with our candidates, by computing the usual performance measures according to the following formulæ:
$$Precision = \frac{|\{\text{relevant items}\} \cap \{\text{retrieved items}\}|}{|\{\text{retrieved items}\}|}$$
$$Recall = \frac{|\{\text{relevant items}\} \cap \{\text{retrieved items}\}|}{|\{\text{relevant items}\}|}$$
$$F\text{-}measure = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$$
The evaluation results are summarized in Table 1.
Table 1: Evaluation results
Corpus          Prec.   Recall   F-Measure
I (own)         0.5     0.3      0.18
II (provided)   0.43    0.31     0.19
3.6 Discussion
There are a series of interesting issues that could be discussed
(and challenged) both regarding the way in which we performed the
evaluation and regarding the results. We shall take them stepwise.
Firstly, although it might raise some challenges, we believe that our evaluation setup, including the way in which we considered (or created) the "golden" standard, is valid. The entire extraction process we described in this section is driven by human perception, and thus the evaluation should reflect this. We are aware of the fact that the annotation of epistemic items is a highly subjective task, and that attempting to capture an agreement on it can lead to no results. And although the numbers we presented (especially the first two columns in Fig. 6) reflect this reality, we did manage to find a proper balance that can lead to promising results for the following extraction steps.
Secondly, the formula we used for computing the final probabilities for the preliminary evaluation has a very important influence on the extraction results. As already mentioned, within this evaluation we opted for a simple formula that gives more weight to the probabilities that emerged from the annotation of the annotators' own papers. Such an approach should be used when the automatic extraction is performed by an author on her own papers, for example in real time while authoring them. This is clearly reflected in the positive difference in precision between the own corpus and the provided one. On the other hand, if used for information retrieval purposes, by readers and not by authors, the computation formula should be changed so that it gives more weight to the probabilities that emerged from the annotation of the given papers. Consequently, this translates into shaping the extraction results into a form closer to what a reader would expect.
Last but not least, one could interpret the performance measures of the extraction results in different ways. On the one hand, we see them as satisfactory, because they represent the effect of merely the first step of a more complex extraction mechanism we have envisioned. At the same time, if we compare them, for example, with the best precision reached by Teufel's approach [12], i.e. 70.5, we find our 0.5 precision to be encouraging. And this is mainly because in our case there was no training involved, and we considered only two parameters in the extraction process, i.e. the rhetorical relations and the block placement, while Teufel employed a very complex naïve Bayes classifier with a pool of 16 parameters and 20 hours of training for the empirical experiment. On the other hand, these results clearly show that we also need to consider other parameters, such as the presence of anaphora, a proper distinction between the different types of epistemic items, or the verb tense used, parameters which are already part of our future plans.
4. APPLICATIONS
One can envision a multitude of applications that would gain
from the externalization of epistemic items. And, as we will mention in Sect. 5, they depend to a large extent on the different directions approached by the research. Here, we focus on one particular application, in which we have integrated the extraction module
mentioned in the previous section, i.e. a personal scientific publication assistant.
The application was designed to be especially suited for early-stage researchers who are in the phase of researching the state of the art of a particular field. Its main goal is to enrich the information space around a publication by using the extracted shallow and deep metadata to query known linked data repositories of scientific publications, like Faceted DBLP (http://dblp.l3s.de/) or the Semantic Web Dog Food Server (http://data.semanticweb.org/). The information extraction and expansion is currently done in multiple directions, based on the title of the publication, the authors and the references, from scientific publications encoded as PDF documents. We believe that this approach will help students (and not only students) to learn the most relevant authors and publications in a certain area.
Embedded in the application's functionalities is also the automatic extraction of epistemic items. This gives the user the chance to get a quick overview of the publications' contributions. Because we targeted high precision rather than recall, the length of this list usually varies from 10 to 15 items, depending on the total length of the publication. As an additional feature, the user has the possibility of embedding visual annotations of these contributions in the original (PDF) document by pressing a single button. Thus, she is able to visualize the epistemic items (i.e. claims, positions, arguments) in any PDF viewer, without needing to use our application. For the future, we plan to use the list of extracted epistemic items for information expansion purposes, in the same way we currently use the shallow metadata. The user will be able to browse the argumentative discourse networks spanned across multiple publications without any manual intervention. A full demo of the application can be found at:
http://sclippy.semanticauthoring.org/movie/sclippy.htm
5. RELATED WORK
The research presented in this paper (or its foundation) can be
divided into several directions, each direction having a rich sphere
of related work. One could analyze similar models for discourse
structuring, based on the same foundational theory (i.e. RST) or
on others, or consider previous work on automatic extraction of
epistemic items with different goals. In the following, we will deal
with a mixture of both, and put an emphasis on approaches that
started from a discourse representation model and evolved from
manual annotation to automatic extraction.
In the last decade, several models for describing and structuring
the scientific discourse were developed. Most of the models partly
overlap on two aspects: (i) they have the same goal of modeling the
argumentation semantics captured in the publications’ texts, and
(ii) they all share, at the abstract level, a core group of concepts
representing the epistemic items, but use different terminologies.
The main difference is given by the types of relations connecting
the epistemic items and the use (or not) of the items’ placement in
the linear structure of the publication.
One of the first models was introduced by Teufel et al. [12] and attempted to categorize phrases from scientific publications into seven types of zones based on their rhetorical role. The zones represented a block structure of the publication, similar to the Rhetorical Blocks in the SALT Rhetorical Ontology. Later, the authors developed an automatic extraction approach, following a similar method to ours, starting from a corpus of manually annotated documents and a set of probabilities that emerged from inter-annotator agreement studies [13]. The automatic extraction used particular cue-phrases compiled empirically (e.g. we employ ..., we will adopt ...) and considered the placement of the phrase in the categories defined by the authors. A very similar approach, inspired by Teufel et al. and focused on biology articles, was developed by Mizuka et al. [8].
There are several similarities and distinctions between Teufel’s
approach and ours. In terms of similarities, both approaches use
discourse markers for the extraction of epistemic items. The main
difference is given by the actual goal, i.e. we target the general extraction of epistemic items, while Teufel targets the classification
of different text spans in the seven rhetorical categories they propose. In addition, we use the coherence structure provided by the
rhetorical relations as main features for the extraction.
Buckingham Shum et al. [10] were the first to describe one of the most comprehensive models for argumentation in scientific publications, using Cognitive Coherence Relations as links between the epistemic items. They developed a series of tools for the annotation, search and visualization of scientific publications based on this model [15], which represent our main inspiration. The automatic extraction approach they followed was simpler than the one developed by Teufel et al.: they relied only on compiling and using a list of particular cue-phrases (e.g. this paper presents ...). Although their model is richer than the previous one, due to the presence of relations, they do not make actual use of them. Consequently, their approach is reflected in our extraction of power items.
Lisacek et al. [5] use similar techniques (based on cue-phrases) to detect paradigm shifts in biological articles. Their approach is particularly interesting because they try to mine the presence of 'time' in the publications' texts, in order to distinguish past solutions and predict future ones. Later, de Waard et al. [1] partly adopted this trend and focused on mining 'epistemic segments', with the goal of detecting paradigm shifts. In [2], the authors describe the relation structure that they propose for linking epistemic segments, while in [1] they make the first steps toward automatic extraction, based on particular verbs and verb tenses. Compared to the model proposed by de Waard et al., our model captures a more complex structure of relations between the discourse segments. At the same time, from the extraction perspective, we started by exploiting the semantics provided by the relations, and will consider verbs and verb tenses at a later stage.
6. CONCLUSION
In this paper we made an initial step towards the automatic extraction of epistemic items from scientific publications. Following our previous work, in which we introduced the models for achieving externalization, and adopting RST as a foundational block, we developed a process that provides, as its result, an initial assignment of probabilities to text spans that act as epistemic items, considering the presence of rhetorical relations and their placement in the linear discourse structure. The preliminary evaluation encourages us to continue our pursuit of the ultimate goal of building argumentative discourse networks automatically.
Future work will focus on improving the extraction by considering word co-occurrence, anaphora resolution, semantic chains, verb tense analysis and possibly also other rhetorical relations. These improvements will also be reflected in a new iteration over the initial weights (probabilities) assigned to the epistemic items resulting from this paper. This iteration will probably feature a double round of empirical evaluation, with crossed results, to ensure a more objective view on the initial probabilities. A further natural development will be the definition of a formal Bayesian model for expressing the probability of an epistemic item, conditioned by its participation in a rhetorical relation, its position in the linear structure of the discourse, or the presence of anaphora. Here we will introduce the distinction between the different types of epistemic items, and branch the research into two directions, dealing with the development of learning algorithms for: (i) claim clustering intra- and inter-publication, and (ii) the detection of positions and arguments and their relations to the original claims.
7. ACKNOWLEDGMENTS
The work presented in this paper has been funded by Science
Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2).
8. REFERENCES
[1] A. de Waard, P. Buitelaar, and T. Eigner. Identifying the
Epistemic Value of Discourse Segments in Biology Texts. In
Proc. of the 8th Int. Conf. on Computational Semantics
(IWCS-8 2009), January 2009.
[2] A. de Waard and J. Kircz. Modeling Scientific Research
Articles – Shifting Perspectives and Persistent Issues. In
Proc. of the 12th Int. Conf. on Electronic Publishing (ElPub
2008), June 2008.
[3] T. Groza, S. Handschuh, K. Möller, and S. Decker. SALT - Semantically Annotated LaTeX for Scientific Publications. In
ESWC 2007, Innsbruck, Austria, 2007.
[4] T. Groza, K. Möller, S. Handschuh, D. Trif, and S. Decker.
SALT: Weaving the claim web. In ISWC 2007, Busan, Korea.
[5] F. Lisacek, C. Chichestera, A. Kaplan, and A. Sandor.
Discovering Paradigm Shift Patterns in Biomedical
Abstracts: Application to Neurodegenerative Diseases. In
Proc. of the 1st Int. Symp. on Semantic Mining in
Biomedicine (SMBM), 2005.
[6] W. C. Mann and S. A. Thompson. Rhetorical structure
theory: A theory of text organization. Technical Report
RS-87-190, Information Science Institute, 1987.
[7] D. Marcu. The Rhetorical Parsing, Summarization, and
Generation on Natural Language Texts. PhD thesis,
University of Toronto, 1997.
[8] Y. Mizuka, A. Korhonen, T. Mullen, and N. Collier. Zone
analysis in biology articles as a basis for information
extraction. Int. J. of Medical Informatics, 75:468–487, 2006.
[9] I. Nonaka and H. Takeuchi. The Knowledge-Creating
Company: How Japanese Companies Create the Dynamics
of Innovation. Oxford University Press, 1995.
[10] S. J. B. Shum, V. Uren, G. Li, B. Sereno, and C. Mancini.
Modeling naturalistic argumentation in research literatures:
Representation and interaction design issues. Int. J. of
Intelligent Systems, 22(1):17–47, 2006.
[11] M. Taboada and W. C. Mann. Rhetorical structure theory:
looking back and moving ahead. Discourse Studies, 8, No.
3:423–459, 2006.
[12] S. Teufel, J. Carletta, and M. Moens. An annotation scheme
for discourse-level argumentation in research articles. In
Proc. of the 9th Conf. on European Chapter of the ACL,
pages 110–117, Morristown, NJ, USA, 1999. ACL.
[13] S. Teufel and M. Moens. Summarizing scientific articles:
Experiments with relevance and rhetorical status.
Computational Linguistics, 28, 2002.
[14] J. Tsujii. Refine and pathtext, which combines text mining
with pathways. Keynote at Semantic Enrichment of the
Scientific Literature 2009 (SESL 2009), March 2009.
[15] V. Uren, S. B. Shum, G. Li, and M. Bachler. Sensemaking
tools for understanding research literatures: Design,
implementation and user evaluation. Int. Jnl. Human
Computer Studies, 64, No.5:420–445, 2006.