Towards Automatic Extraction of Epistemic Items from Scientific Publications

Tudor Groza, Siegfried Handschuh, Georgeta Bordea
Digital Enterprise Research Institute, National University of Ireland, Galway
IDA Business Park, Lower Dangan, Galway, Ireland
{tudor.groza, siegfried.handschuh, georgeta.bordea}@deri.org

ABSTRACT
The exponential growth of the World Wide Web in the last decade brought with it an explosion of the information space. One heavily affected area is the scientific literature, where finding relevant work in a particular field and exploring the links between relevant publications is a cumbersome task. In this paper we take the initial steps towards the automatic extraction of epistemic items (i.e. claims, positions, arguments) from scientific publications. Our approach will provide the foundation for a comprehensive solution that will partly alleviate the information overload problem. We detail the actual extraction process, the evaluation we have performed and relevant use-cases for our work.

Categories and Subject Descriptors
I.2.7 [Artificial Intelligence]: Natural Language Processing; H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing—linguistic processing; I.7.5 [Document and Text Processing]: Document Capture

General Terms
Algorithms, Experimentation

Keywords
Rhetorical structure of text, Information extraction

1. INTRODUCTION
The exponential growth of the World Wide Web in the last decade brought with it an explosion of the information space. The same phenomenon can be observed in the area of scientific literature, where the number of publishing venues (journals, conferences, workshops, etc.) has increased substantially. As an example, in the biomedical domain, the well-known MedLine (http://medline.cos.com/) now hosts over 18 million articles and has a growth rate of 0.5 million articles per year, which represents around 1,300 articles per day [14]. This makes finding relevant work in a particular field a cumbersome task. One of the main reasons is that indexing publications based only on syntactic resources is no longer sufficient. The typical dissemination process consists of authors stating claims, positions or arguments in relation to their own achievements or to the results achieved by other researchers. These epistemic items represent the key to decoding the semantics hidden in the publications' texts, and thus could represent a possible solution to the information overload problem. Moreover, creating links between them in inverse chronological order, starting from a particular author, allows us to explore the tacit virtual argumentative discourse networks spanned across multiple publications. Externalization [9] represents the process of articulating tacit knowledge into explicit concepts. As such, it holds the key to knowledge creation: the knowledge becomes crystallized, thus allowing it to be shared with and by others.
Although made explicit, the externalized knowledge is dependent on the degree of formalization. In the case of the argumentation discourse based on claims (or epistemic items in general), it can be a couple of keywords or a weakly structured text, both possibly including direct references to the publications stating the actual claims. Our previous work focused on defining appropriate models for the externalization of epistemic items, both at the local (intra-publication) [3] and at the global (inter-publication) [4] level. The foundation of this research relies on the Rhetorical Structure of Text Theory (RST) [6], which introduces a coherence structure of the discourse based on elementary text chunks and the rhetorical relations between them. The main drawback of our work so far is the manual approach that was followed. In this paper we lay the foundation necessary for the automatic extraction of epistemic items, thus advancing closer to our ultimate goal of building the virtual argumentative discourse networks automatically. The entire extraction process, as we have envisioned it, consists of multiple complex stages. Therefore, here we focus only on the first stage, i.e. the general extraction of epistemic items, without differentiating between their several possible types (claims, arguments, or positions). This process comprises an empirical analysis of a publication corpus, the development of a knowledge acquisition module and a preliminary assignment of probabilities. At the end of the process we have two main results: (i) a list of epistemic items (with associated initial probabilities), which gives the chance of a quick (but quite comprehensive) view on a publication, and (ii) an ontological model of the publication in terms of epistemic items and the rhetorical relations between them. Subsequent stages of the overall extraction include a proper type-based differentiation among the extracted items, capturing their temporal aspects and the development of a learning mechanism that improves the assessed initial probabilities based on a pool of established features. The remainder of the paper is structured as follows: in Sect. 2 we briefly introduce the foundation of our work, i.e. RST and SALT (Semantically Annotated LaTeX). Then, Sect. 3 details the proposed approach for the extraction of epistemic items and the preliminary evaluation we have performed. In Sect. 4 we present a possible application, and before concluding in Sect. 6, we analyze the research related to our work in Sect. 5.

2. BACKGROUND

2.1 Rhetorical Structure of Text
The Rhetorical Structure of Text Theory (RST) was first introduced by Mann and Thompson in 1987 [6], with the goal of providing a descriptive theory for the organization and coherence of natural text. The application domains of RST vary from computational linguistics, cross-linguistic studies and dialogue to multimedia presentations and natural language generation. The theory comprises a series of elements, as follows: (i) Text spans, uninterrupted linear intervals of text, which can have the role of Nucleus or Satellite. A Nucleus represents the core (main part / idea) of a sentence or phrase, while a Satellite represents a text span that complements the Nucleus with additional information; (ii) Schemas, which define the structural constituency arrangements of text. They mainly provide a set of conventions that are either independent or inclusive of particular rhetorical relations that connect different text spans. The theory proposes a set of 23 rhetorical relations, having an almost flat structure (e.g. Circumstance, Elaboration, Antithesis, etc.). Figure 1 depicts a simple example of a chain of nuclei and satellites connected via the Justify and Antithesis rhetorical relations; (iii) Schema applications, which introduce higher level conventions to determine possible applications of a schema; (iv) Structures, representing a set of schema applications satisfying particular constraints.

Figure 1: Example of rhetorical relations

2.2 Semantically Annotated LaTeX
SALT represents a semantic authoring framework targeting the enrichment of scientific publications with semantic metadata. SALT currently adopts two elements from RST: the text spans and the schemas. In [3] we introduce an approach for externalizing epistemic items captured within scientific publications, having a local scope, i.e. considering the publication on its own, without the external references. The model has two layers: a syntactic layer and a semantic layer. The semantic layer comprises three ontologies: (i) the Document Ontology, modeling the linear structure of a document, (ii) the Rhetorical Ontology, capturing the rhetorical and argumentation structure of the publication, and (iii) the Annotation Ontology, which connects the rhetorical structure hidden in the document's content with the physical document itself. The Rhetorical Ontology has its roots in RST, but it limits the modeling to a set of 11 rhetorical relations that have a higher chance of being present in scientific discourse.

3. EXTRACTION OF EPISTEMIC ITEMS

3.1 Research Rationale
Before describing the actual extraction process, we introduce the terminology used throughout this section and explain the rationale behind the method we have followed. We believe that the semantics provided by the rhetorical relations (according to RST) can be used to externalize the epistemic items hidden in scientific publications. In turn, having these items externalized, one can improve, for example, the retrieval of scientific publications and increase the overall value a user receives. Our main goal for the current stage of extraction is to detect those text spans in the discourse that have a certain probability of acting as epistemic items for the publication. From an extraction perspective, this amounts to computing the afore-mentioned probability based on a series of factors. Currently, we consider three main factors: (i) rhetorical relations, (ii) power items, and (iii) the block placement. Rhetorical relations model the coherence of the discourse and, therefore, can be used to extract epistemic items. The rhetorical relations taken into account are the ones present in the SALT Rhetorical Ontology, and follow the definitions given in [6]: antithesis, cause, circumstance, concession, condition, consequence, elaboration, evidence, means, preparation, purpose, restatement. Since we use only a limited set of rhetorical relations, we accept from the start a compromise inclined more towards soundness than towards completeness of our solution. Nevertheless, when dealing with natural language processing, such a compromise should be acceptable.
To improve this balance between soundness and completeness, we also considered a special type of text span that has a high chance of not being part of a rhetorical relation, while still acting as an epistemic item for the publication. We called this type a power item. As we shall see, such power items are detected based on a series of particular discourse markers. For example, the text "We present an experimental comparison of existing storage ..." represents a power item. In addition to the two linguistic features above, we believe that the role of a particular text span is also influenced by its placement in the linear discourse structure (i.e. the block placement). There are multiple ways of structuring a publication into blocks, either based directly on the linear flow or based on the rhetorical roles a block of text can have. We followed a mixed approach, by splitting the publication into five blocks: abstract, introduction, body, related work and conclusion. Three of them, i.e. abstract, related work and conclusion, have a rhetorical role (see also the rhetorical blocks in SALT), while the other two are part of the usual linear discourse structure. If analyzed in the order imposed by the typical linear discourse structure, the length of text corresponding to each block follows a normal distribution with a mean of 0 and a variance between 0.3 and 1, depending on the overall publication length. Consequently, we believe that, in terms of epistemic value, the two tails of the distribution will create local peaks, or focal pools of epistemic items.

Given these three factors, from an engineering perspective we needed: (i) to develop an appropriate acquisition module, following a computational linguistics approach, which takes as input the content of a publication, the mapping between discourse markers and rhetorical relations (and power items), and the block structure of the publication, and outputs epistemic item candidates; and then (ii) to compute a set of initial probabilities for these candidates to actually represent epistemic items. In the following, we detail each of these steps.

3.2 Empirical Analysis
To automate the process of identifying the elementary text spans and the rhetorical relations that hold among them, we rely mostly on the discourse function of cue phrases, i.e. words such as however, although and but. An exploratory study of such cue phrases provided us with an empirical grounding for the development of an extraction algorithm. The primary function of cue phrases, or connectives, is to structure the discourse. They also have highly elaborate pragmatic functions, such as signaling shifts in the subjective perspective or presupposing various states of belief. At the current stage, we focused our attention only on discourse connectives and lexico-grammatical constructs that can be detected by means of a shallow analysis of natural language texts. The number of discourse markers in a typical text, approximately one marker for every two clauses [11], is sufficiently large to enable the derivation of rich rhetorical structures of texts. More importantly, the absence of markers correlates with a preference of readers to interpret the unmarked textual units as continuations of the topics of the units that precede them. We assume that the texts we process are well-formed from a discourse perspective, much as researchers in sentence parsing assume that they are well-formed from a syntactic perspective.
Taking inspiration from the work performed by Marcu [7], we analyzed a corpus of around 130 publications from the Computer Science field and identified 75 cue phrases that signal the rhetorical relations mentioned above. Fig. 2 summarizes some of the cue-phrase to relation mappings (including power items).

Figure 2: Cue-phrases to relation mappings

For example, the cue phrase when signals a circumstance relation in the following text:

When [an ontology is selected as the underlying knowledge base,]1 [the Lexicon Builder automatically extracts entities out of the ontology to build the Lexicon.]2

with the first text span being a satellite and the second a nucleus. Similarly, the cue phrase in this paper signals a power item in the following snippet:

In this paper [we lay the foundation necessary for the automatic extraction of epistemic items, and thus advancing ...]1

For each cue phrase we extracted a number of text fragments, in order to identify two types of information: (i) discourse related information, and (ii) algorithmic information. The discourse related information is concerned with the cue phrases under scrutiny, the rhetorical relations that are marked by the cue phrases and the roles of the related text spans (nuclei or satellites). In contrast to the discourse related information, which has a general linguistic interpretation, the algorithmic information is specific to determining the elementary text units of a text in a particular context. Basically, for each cue phrase we collected its position in the sentence, its position according to the neighboring text spans and the surrounding punctuation. This information constitutes the empirical foundation of our algorithm that identifies the elementary unit boundaries and the discourse usages of the cue phrases. It helps us in the disambiguation process (as mentioned above) and hypothesizes rhetorical relations that hold among text spans.

3.3 Knowledge Acquisition
The actual knowledge acquisition was implemented as a GATE (http://gate.ac.uk/) plugin. The algorithmic information gathered previously was encoded in the form of annotations, which we called CUE_PHRASE annotations, created by the JAPE grammars executed by the plugin. For each cue phrase we identified as signaling a rhetorical relation from our set, we created an associated CUE_PHRASE annotation based on JAPE rules. An example of such a rule is shown below:

Rule: CuePhRule1
(
  ({Token.string == ","})
  ({Token.string == "while"})
):cuePhrase1
-->
:cuePhrase1.CUE_PHRASE = {kind = "cuePh", relation = "antithesis",
                          place = "B", breakAction = "NORMAL_THEN_COMMA",
                          statuses = "NN", whereToLink = "B",
                          rule = "CuePhRule1"}

On the left-hand side of the rule we have the discourse marker ", while", and on the right-hand side we create the proper CUE_PHRASE annotation by assigning the corresponding discourse related and algorithmic information. The orthographic environment of the cue phrase is encoded in the left side of the rule: it contains the marker under consideration and all the punctuation that precedes or follows it. Using the information derived from the empirical analysis, we have identified a series of fields that are important for the CUE_PHRASE annotation, detailed in the following. Also, a complete example of discourse related and algorithmic information, coded in fields for the Concession rhetorical relation, is shown in Fig. 3.

Figure 3: Example of discourse related and algorithmic information for the Concession rhetorical relation
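To make the information carried by such a rule more tangible, the following minimal sketch (in Python rather than JAPE, and not part of the actual GATE plugin) shows the kind of record a rule like CuePhRule1 produces. The field names and values mirror the rule above; the dataclass itself and its typing are our own illustrative assumptions.

from dataclasses import dataclass

@dataclass
class CuePhraseAnnotation:
    cue: str            # the discourse marker, e.g. ", while"
    relation: str       # rhetorical relation signaled, or a power item marker
    place: str          # position of the marker in its unit: "B", "M" or "E"
    break_action: str   # "NONE", "NORMAL" or "NORMAL_THEN_COMMA"
    statuses: str       # rhetorical statuses of the related units, e.g. "NN", "NS"
    where_to_link: str  # related unit lies Before ("B") or After ("A")
    rule: str           # name of the JAPE rule that created the annotation

# The annotation produced by CuePhRule1 for the marker ", while":
comma_while = CuePhraseAnnotation(
    cue=", while", relation="antithesis", place="B",
    break_action="NORMAL_THEN_COMMA", statuses="NN",
    where_to_link="B", rule="CuePhRule1")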
3.3.1 relation field
The relation field specifies the rhetorical relation type that is signaled by the cue phrase under scrutiny. The list of relations used is the one presented above.

3.3.2 whereToLink field
This field describes whether the textual unit that contains the discourse marker under scrutiny is related to a textual unit found Before (B) or After (A) it. For example, in the following text, the textual unit that contains the marker since is rhetorically related to the textual unit that is placed before it (B).

E.g. 1: [Incremental methods should be developed][, since it would be inefficient to compute a materialized repair of the database to a query from scratch after every update.]

In contrast to this example, the clause that contains the discourse marker While in the following snippet is rhetorically related to the clause that comes immediately after it (A).

E.g. 2: [While we think that the right repair semantics may be application dependent,][being able to compare the possible semantics in terms of complexity may also shed some light on what may be the repair semantics of choice.]

3.3.3 statuses field
The statuses field describes the rhetorical statuses of the textual units that are connected through a rhetorical relation signaled by the marker under scrutiny. The status of a textual unit can be Nucleus (N) or Satellite (S). The field contains two rhetorical statuses, showing also the coherence order. For example, the statuses field for the marker since in E.g. 1 is (NS), because the clause-like unit [Incremental methods should be developed] is the Nucleus and the clause-like unit [, since it would be inefficient to compute a materialized repair of the database to a query from scratch after every update.] is the Satellite of a rhetorical relation of Evidence. Similarly, the statuses field for the marker While in E.g. 2 is (NN), because both clauses represent a Nucleus.

3.3.4 breakAction field
The breakAction field contains one member of a set of instructions for a shallow analyzer that determines the elementary units of text. The shallow analyzer assumes that text is processed in a left-to-right fashion. Whenever a cue phrase is encountered, the shallow analyzer executes an action from the set {NONE, NORMAL, NORMAL_THEN_COMMA}. The effect of these actions can be to create an elementary textual unit boundary in the input text. Such a boundary corresponds to the square brackets used in the examples discussed so far.

3.3.5 place field
The place field specifies the position of the discourse marker under scrutiny in the textual unit to which it belongs. The possible values taken by this field are: Beginning (B), when the cue phrase occurs at the beginning of the textual unit to which it belongs, Middle (M), when it is in the middle of the unit, and End (E), when it is at the end. As an example, the content of the field place in E.g. 2 is B.
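To illustrate how the breakAction instructions could drive the shallow analyzer, the sketch below splits a sentence into elementary textual units at cue-phrase positions. It is a simplification made for illustration only: the cue table contains just markers used in the examples above, and the way NORMAL and NORMAL_THEN_COMMA are interpreted here is our own assumption, not the documented behavior of the GATE plugin.

# Cue phrases and their break actions; only markers from the examples above.
CUES = {", since": "NORMAL", ", while": "NORMAL_THEN_COMMA"}

def segment(sentence):
    """Split a sentence into elementary textual units at cue-phrase boundaries."""
    cuts = {0, len(sentence)}
    low = sentence.lower()
    for cue, action in CUES.items():
        i = low.find(cue)
        if i != -1 and action != "NONE":
            cuts.add(i)  # open a new unit where the cue (with its comma) starts
    cuts = sorted(cuts)
    return [sentence[a:b] for a, b in zip(cuts, cuts[1:]) if sentence[a:b].strip()]

units = segment("Incremental methods should be developed, since it would be "
                "inefficient to compute a materialized repair after every update.")
# units == ["Incremental methods should be developed",
#           ", since it would be inefficient to compute a materialized repair after every update."]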
3.4 Experiment
The annotation of epistemic items in a document is a highly subjective task. Different people have diverse mental representations of a given document, depending on their domain, the depth of their knowledge of the document in question, and their attitude towards its content. Therefore, probably the most reliable annotator for a scientific publication would be its author. In order to capture the way in which people find (and maybe interpret) the epistemic items, we ran an experiment. The goal of the experiment was to allow us to compute initial values for the probabilities of text spans being epistemic items.

The setup of the experiment involved ten researchers (authors of scientific publications), two corpora and two tasks. The two tasks were essentially the same task, performed each time on a different corpus. The first corpus comprised a set of ten publications chosen by us, while the second corpus had 20 publications provided by the annotators: each annotator provided us with two of her own publications. For each publication, we extracted a list of text spans (similar to the previous examples), using the knowledge acquisition module, and presented this list to the annotators. On average, each list had around 110 items. The assignment of the publications was done as follows: each annotator received, on one side, her own two publications (more exactly, the lists of text spans extracted from them), and on the other side, four publications from the corpus compiled by us. In this way, each publication in our corpus was assigned to four different annotators. The reason for choosing this assignment algorithm was that we wanted a moderate diversity in opinions on a larger number of documents, rather than a very high diversity in opinions on a small number of documents. The annotators' task was to detect and mark (is vs. is not) on the given lists the text spans that they considered to be epistemic items, i.e. claims, positions or arguments.

There are a series of remarks to be noted at this point. Although the complexity of the task was high enough by its nature, we chose not to run the experiment in a controlled environment. Unlike Teufel et al. [13], who performed a similar experiment, including an initial training of around 20 hours and proper definitions of what they expected as a result, we did not provide a formal definition of what an epistemic item represents, or any linguistic constructs that the annotators should search for. Nevertheless, we did provide some indications that would speed up the task. These indications amounted to mentioning that we refer to claims, positions and arguments when talking about epistemic items, together with a list of questions that the annotators should ask themselves when in doubt about a particular item. The questions were the following: (i) Is this text span reflecting or referring to part of the publication's contributions? (ii) Is this text span reflecting the author's position on a particular topic, relevant for the publication? (iii) Is this text span comparing the publication's contribution to similar approaches? (iv) Would I use this text span for searching on the web for similar publications? By following this approach, we wanted the annotators to react based on their own knowledge and perception, rather than following a given schema. We are aware that training would have boosted the inter-annotator agreements presented later in this section. Still, training usually has a short-term memory span, and the rules we would have imposed might not have overlapped completely with the annotators' psychological background, i.e. with the way in which they think about and observe the rhetorical and argumentation structure within publications. Another remark is related to the lists of candidates presented to the annotators. These lists were compiled based only on the presence of rhetorical relations and power items. Thus, as already mentioned, they did not ensure the presence of all possible epistemic items in the publication.
Nevertheless, at this stage, we wanted to measure only the relation between the presence of rhetorical relations within the given text spans (including power items) and the chance of them being epistemic items. Consequently, the annotators had only the YES/NO options, applicable on the list. At a later stage, in the preliminary evaluation, the annotators (though a different set of people) were able to annotate any text span, possibly including also the ones that might have been missed by the knowledge acquisition module. Last but not least, the annotators were not formally aware of the presence of the rhetorical relations, nor did they have knowledge of RST. The list of candidates contained plain text items with no additional information attached, i.e. they were not tagged according to the particular relations present in them.

On the resulting marked lists, i.e. the participants' votes on the lists of candidates, we computed two metrics. The candidates extracted from the publications in the corpus we provided served as input for computing the inter-annotator agreement, while the rest of the candidates, the ones extracted from the publications provided by the annotators, were used as input for computing a specific recall. The first metric we computed was the proportional specific raw inter-annotator agreement per publication, defined as follows:

p_s(j) = \frac{S(j)}{S_{poss}(j)} = \frac{\sum_{k=1}^{K} n_{jk}(n_{jk} - 1)}{\sum_{k=1}^{K} n_{jk}(n_k - 1)}

where p_s(j) is the proportion of agreement specific to category j, K is the total number of cases (for us, annotators), n_{jk} is the number of times category j is applied to case k, and

n_k = \sum_{j=1}^{C} n_{jk}

with C = 2 in our case, as the categories we had were j = 1 (positive agreement) and j = 0 (negative agreement). The overall agreement is then computed as:

p_o = \frac{O}{O_{poss}} = \frac{\sum_{j=1}^{C} S(j)}{\sum_{k=1}^{K} n_k(n_k - 1)}

For each publication we calculated the proportions of specific (positive, negative and overall) raw inter-annotator agreement for all combinations of three out of the four assigned annotators, and for the one considering all four annotators at once. In addition, when creating the result tables per publication, we split the text spans according to their involvement in a rhetorical relation and based on the block placement.

Figure 4: Example of inter-annotator agreement results for a particular publication

Figure 4 depicts an example of a results table for a particular publication. Considering, for example, the group formed by annotators 1-2-4, we can observe a positive agreement of 1.0 on text spans containing antithesis in the abstract, or of 0.75 on power items in the conclusion. At the same time, the group reached a 0.5 negative agreement on the same text spans labeled as power items in the conclusion. To make it more explicit, these agreements translate into the following: (i) on all the text spans located in the abstract block of the publication under scrutiny and involved in an antithesis rhetorical relation (without knowing these details), all the annotators in group 1-2-4 agreed that the text spans represent epistemic items; (ii) on all the text spans located in the conclusion block of the publication and labeled as power items, the annotators in group 1-2-4 agreed that they do represent epistemic items with a 0.75 specific raw inter-annotator agreement, and that they do not represent epistemic items with a 0.5 specific raw inter-annotator agreement.
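As an illustration of the agreement measures, the sketch below follows the usual reading of the formulas above, in which the cases k are the rated text spans and n_{jk} is the number of annotators who applied category j to span k; this reading, and the input format of one YES-vote count per span, are our own assumptions.

def specific_agreement(yes_counts, n_annotators):
    """Proportional specific raw agreement.

    yes_counts[k] = number of YES votes on text span k.
    Returns (positive agreement p_s(1), negative agreement p_s(0), overall p_o).
    """
    S = {1: 0, 0: 0}       # numerators S(j)
    S_poss = {1: 0, 0: 0}  # denominators S_poss(j)
    O_poss = 0
    for yes in yes_counts:
        counts = {1: yes, 0: n_annotators - yes}   # n_jk for j = 1 and j = 0
        for j, n_jk in counts.items():
            S[j] += n_jk * (n_jk - 1)
            S_poss[j] += n_jk * (n_annotators - 1)
        O_poss += n_annotators * (n_annotators - 1)
    p_pos = S[1] / S_poss[1] if S_poss[1] else 0.0
    p_neg = S[0] / S_poss[0] if S_poss[0] else 0.0
    p_o = (S[1] + S[0]) / O_poss if O_poss else 0.0
    return p_pos, p_neg, p_o

# Three annotators, four spans: two unanimous YES, one 2-1 split, one unanimous NO.
print(specific_agreement([3, 3, 2, 0], n_annotators=3))   # -> (0.875, 0.75, 0.833...)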
We chose to compute this metric on different groups, in addition to the overall one, because: (i) we wanted to eliminate possible "malicious" annotators, i.e. annotators that either marked items very selectively or marked almost everything, and (ii) we wanted to find the groups that have the highest homogeneity in annotation, and thus the highest overall agreement. For the computation of the initial probabilities we considered these last groups and their positive agreement values. For testing the statistical significance of our results, we performed Pearson's χ² test with Yates' correction, considering the null hypothesis that the annotators are independent, against the alternative that the observed agreement proportions are higher than the proportions expected if the agreement occurred by chance.

\chi^2 = \sum_{i=1}^{n_a} \sum_{j=1}^{N} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}}

where N is the number of degrees of freedom, n_a is the number of annotators, O_{i,j} is the observed agreement and E_{i,j} is the expected agreement. In our case N is 1, as we calculated the agreement based on a 2 x 2 contingency table, and n_a = 3, because in all cases the group that we chose with the highest homogeneity comprised three annotators. Fig. 5 summarizes the values for χ² and the p-value, thus showing that our results are statistically significant.

Figure 5: Chi-square statistics per publication

The second metric we computed, based on the items extracted from the annotators' own publications, was a specific recall for each block placement, i.e. the proportion of times a text span involved in a particular rhetorical relation was chosen, out of the total number of text spans involved in the same relation in the given block. For example, if the introduction of the publication had 10 text spans, each involved in an antithesis relation, and the annotator marked 4 of them, the recall is 0.4. Fig. 6 presents all the overall computed results, with the IAA column representing the inter-annotator agreement and the OWN column representing the recall mentioned above.

Figure 6: Table of computed weights for the items based on the relation and block placement

3.5 Preliminary Evaluation
To test our assumptions at this initial stage, we performed a preliminary evaluation. The setup of the evaluation was similar to that of the experiment described above. We used two corpora (with a total of 20 publications), one with the evaluators' own papers (15 papers, as we had 15 evaluators) and one containing a set of papers we provided (another 5 papers). The evaluators were different from the ones involved in the experiment. Each evaluator was asked to read her own paper and one paper we assigned to her, and to freely mark on them those text spans that she considered to have the role of epistemic items. Similar to the experiment, there was no training involved before the evaluation (for the same reasons), although we did provide the same four questions as hints. On our side, we developed a small information extraction module that took as input, for each type of rhetorical relation, the initial probabilities (conditioned by the placement in the linear discourse structure) previously found. The actual final probability (column P_final in Fig. 6) was computed following a naïve approach that gives more weight to the specific recall than to the positive inter-annotator agreement. As an example, if by analyzing the text we find a power item in the publication's abstract, we assign a probability of 0.84 for the corresponding text span to act as an epistemic item (i.e. a claim, position or argument).
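The exact formula behind P_final is not given above; the text only states that the specific recall (OWN) receives more weight than the positive inter-annotator agreement (IAA). The sketch below therefore assumes a simple convex combination with hypothetical weights; only the example value of 0.84 for a power item found in the abstract is taken from the text.

# Hypothetical weights: more weight on the specific recall, as stated above.
W_OWN, W_IAA = 0.7, 0.3

def final_probability(iaa, own):
    """Combine the two empirical estimates into an initial probability."""
    return W_OWN * own + W_IAA * iaa

# P_final is looked up per (relation or power item, block) pair; the only value
# quoted in the text is the one for a power item located in the abstract.
P_FINAL = {("power_item", "abstract"): 0.84}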
Regarding the final probability, in a real setting, the actual balance between the raw positive inter-annotator agreement and the specific recall depends on several factors, such as other parameters to be considered, or the way in which the results will be used. We discuss this in more detail below. In this particular case, we chose to give more weight to the specific recall, with the presumption that this would be reflected correspondingly in the final performance measures we calculate. Our expectation was that, by using this balance, the module would perform better on the corpus provided by the authors than on the corpus we provided. For the actual information extraction, we set a threshold given by chance (i.e. 0.5): all the text spans involved in rhetorical relations whose probability was greater than this threshold were considered proper epistemic items. These were then returned as a list of candidates for the processed paper. Returning to the evaluation setup, in parallel to the manual annotation done by the evaluators, we ran our module on the same set of publications and compiled the predicted list of candidate epistemic items. At the end, we considered the items extracted manually by the evaluators as the ground truth (or gold standard) and compared them with our candidates, by computing the usual performance measures according to the following formulæ:

Precision = \frac{|\{relevant\ items\} \cap \{retrieved\ items\}|}{|\{retrieved\ items\}|}

Recall = \frac{|\{relevant\ items\} \cap \{retrieved\ items\}|}{|\{relevant\ items\}|}

F\mbox{-}measure = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}

The evaluation results are summarized in Table 1.

Table 1: Evaluation results
Corpus          Prec.   Recall   F-Measure
I (own)         0.5     0.3      0.18
II (provided)   0.43    0.31     0.19
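For completeness, the following sketch shows how the chance-level threshold and the measures reported in Table 1 can be computed; the representation of the candidates (text span plus probability) and of the manually marked gold standard as plain sets is our own assumption.

THRESHOLD = 0.5   # chance-level threshold used in the preliminary evaluation

def extract(candidates):
    """Keep only the candidate spans whose probability exceeds the threshold."""
    return {span for span, p in candidates if p > THRESHOLD}

def evaluate(retrieved, relevant):
    """Precision, recall and F-measure of retrieved spans against the gold standard."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f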
3.6 Discussion
There are a series of interesting issues that could be discussed (and challenged), both regarding the way in which we performed the evaluation and regarding the results. We shall take them stepwise. Firstly, although it might raise some challenges, we believe that our evaluation setup, including the way in which we considered (or created) the gold standard, is valid. The entire extraction process we described in this section is driven by human perception, and thus the evaluation should reflect this. We are aware of the fact that the annotation of epistemic items is a highly subjective task, and that the intention of capturing an agreement on it can lead to no results. And although the numbers we presented (especially in the first two columns of Fig. 6) reflect this reality, we did manage to find a proper balance that can lead to promising results for the following extraction steps. Secondly, the formula we used for computing the final probabilities for the preliminary evaluation has a very important influence on the extraction results. As already mentioned, within this evaluation we opted for a simple formula that gives more weight to the probabilities that emerged from the annotation of own papers. Such an approach should be used when the automatic extraction is performed by an author on her own papers, for example in real time while authoring them. This is clearly reflected in the positive difference in precision between the own corpus and the provided one. On the other hand, if used for information retrieval purposes, by readers and not by authors, the computation formula should be changed so that it gives more weight to the probabilities that emerged from the annotation of the given papers. Consequently, this translates into shaping the extraction results in a form closer to what a reader would expect.

Last but not least, one could interpret the performance measures of the extraction results in different ways. On one side, we see them as satisfactory, because they represent the effect of merely the first step of a more complex extraction mechanism we have envisioned. At the same time, if we compare them, for example, with the best precision reached by Teufel's approach [12], i.e. 70.5%, we find our 0.5 (50%) precision encouraging. And this is mainly because in our case there was no training involved, and we considered only two parameters in the extraction process, i.e. the rhetorical relations and the block placement, while Teufel employed a very complex naïve Bayes classifier with a pool of 16 parameters, and 20 hours of training for the empirical experiment. On the other hand, these results clearly show that we need to consider other parameters as well, such as the presence of anaphora, a proper distinction between the different types of epistemic items, or the verb tense used; these parameters are already part of our future plans.

4. APPLICATIONS
One can envision a multitude of applications that would gain from the externalization of epistemic items. And, as we will mention in Sect. 5, they depend to a large extent on the different directions approached by the research. Here, we focus on one particular application, in which we have integrated the extraction module mentioned in the previous section, i.e. a personal scientific publication assistant. The application was designed to be especially suited for early-stage researchers who are in the phase of researching the state of the art of a particular field. Its main goal is to enrich the information space around a publication by using the extracted shallow and deep metadata to query known linked data repositories of scientific publications, like Faceted DBLP (http://dblp.l3s.de/) or the Semantic Web Dog Food server (http://data.semanticweb.org/). The information extraction and expansion is currently done in multiple directions, based on the title of the publication, its authors and its references, starting from scientific publications encoded as PDF documents. We believe that this approach will help students (and not only students) to discover the most relevant authors and publications in a certain area. Embedded in the application's functionality is also the automatic extraction of epistemic items. This gives the user the chance to have a quick overview of the publication's contributions. Because we targeted high precision rather than high recall, the length of this list usually varies from 10 to 15 items, depending on the total length of the publication. As an additional feature, the user has the possibility of embedding visual annotations of these contributions in the original (PDF) document by pressing a single button. Thus, she is able to visualize the epistemic items (i.e. claims, positions, arguments) in any PDF viewer, without needing our application. For the future, we plan to use the list of extracted epistemic items for information expansion purposes, in the same way we currently use the shallow metadata. The user will then be able to browse the argumentative discourse networks spanned across multiple publications without any manual intervention. A full demo of the application can be found at: http://sclippy.semanticauthoring.org/movie/sclippy.htm
5. RELATED WORK
The research presented in this paper (or its foundation) can be divided into several directions, each direction having a rich sphere of related work. One could analyze similar models for discourse structuring, based on the same foundational theory (i.e. RST) or on others, or consider previous work on the automatic extraction of epistemic items with different goals. In the following, we deal with a mixture of both, and put an emphasis on approaches that started from a discourse representation model and evolved from manual annotation to automatic extraction. In the last decade, several models for describing and structuring the scientific discourse were developed. Most of the models partly overlap on two aspects: (i) they have the same goal of modeling the argumentation semantics captured in the publications' texts, and (ii) they all share, at the abstract level, a core group of concepts representing the epistemic items, but use different terminologies. The main difference is given by the types of relations connecting the epistemic items and by the use (or not) of the items' placement in the linear structure of the publication.

One of the first models was introduced by Teufel et al. [12], who attempted to categorize phrases from scientific publications into seven types of zones based on their rhetorical role. The zones represented a block structure of the publication, similar to the Rhetorical Blocks in the SALT Rhetorical Ontology. Later, the authors developed an automatic extraction approach, following a method similar to ours, starting from a corpus of manually annotated documents and a set of probabilities that emerged from inter-annotator agreement studies [13]. The automatic extraction used particular cue-phrases compiled empirically (e.g. we employ ..., we will adopt ...) and considered the placement of the phrase in the categories defined by the authors. A very similar approach, inspired by Teufel et al. and focused on biology articles, was developed by Mizuta et al. [8]. There are several similarities and distinctions between Teufel's approach and ours. In terms of similarities, both approaches use discourse markers for the extraction of epistemic items. The main difference is given by the actual goal, i.e. we target the general extraction of epistemic items, while Teufel targets the classification of different text spans into the seven rhetorical categories they propose. In addition, we use the coherence structure provided by the rhetorical relations as the main feature for the extraction.

Buckingham Shum et al. [10] were the first to describe one of the most comprehensive models for argumentation in scientific publications, using Cognitive Coherence Relations as links between the epistemic items. They developed a series of tools for the annotation, search and visualization of scientific publications based on this model [15], which represent our main inspiration. The automatic extraction approach they followed was simpler than the one developed by Teufel et al.: they relied only on compiling and using a list of particular cue-phrases (e.g. this paper presents ...). Although their model is richer than the previous one, due to the presence of relations, they do not make actual use of them. Consequently, their approach is reflected in our extraction of power items. Lisacek et al. [5] use similar techniques (based on cue-phrases) to detect paradigm shifts in biological articles.
Their approach is particularly interesting because they try to mine the presence of 'time' in the publications' texts, in order to distinguish past solutions and predict future ones. Later, de Waard et al. [1] partly adopted this trend and focused on mining 'epistemic segments', with the goal of detecting paradigm shifts. In [2], the authors describe the relation structure that they propose for linking epistemic segments, while in [1] they make the first steps toward automatic extraction, based on particular verbs and verb tenses. Compared to the model proposed by de Waard et al., our model captures a more complex structure of relations between the discourse segments. At the same time, from the extraction perspective, we started by exploiting the semantics provided by the relations, and will consider verbs and verb tenses at a later stage.

6. CONCLUSION
In this paper we made an initial step towards the automatic extraction of epistemic items from scientific publications. Following our previous work, in which we introduced the models for achieving externalization, and by adopting RST as a foundational block, we developed a process that provides an initial assignment of probabilities to text spans that act as epistemic items, considering the presence of rhetorical relations and their placement in the linear discourse structure. The preliminary evaluation encourages us to continue our pursuit of the ultimate goal of building argumentative discourse networks automatically. Future work will focus on improving the extraction by considering word co-occurrence, anaphora resolution, semantic chains, verb tense analysis and possibly other rhetorical relations. These improvements will also be reflected in a new iteration over the initial weights (probabilities) assigned to the epistemic items that resulted from this paper. This iteration will probably feature a double round of empirical evaluation, with crossed results, to ensure a more objective view on the initial probabilities. A further natural development will be the definition of a formal Bayesian model for expressing the probability of an epistemic item, conditioned on its participation in a rhetorical relation, its position in the linear structure of the discourse, or the presence of anaphora. Here we will introduce the distinction between the different types of epistemic items, and branch the research into two directions dealing with the development of learning algorithms for: (i) intra- and inter-publication claim clustering, and (ii) the detection of positions and arguments and their relations to the original claims.

7. ACKNOWLEDGMENTS
The work presented in this paper has been funded by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2).

8. REFERENCES
[1] A. de Waard, P. Buitelaar, and T. Eigner. Identifying the Epistemic Value of Discourse Segments in Biology Texts. In Proc. of the 8th Int. Conf. on Computational Semantics (IWCS-8 2009), January 2009.
[2] A. de Waard and J. Kircz. Modeling Scientific Research Articles – Shifting Perspectives and Persistent Issues. In Proc. of the 12th Int. Conf. on Electronic Publishing (ElPub 2008), June 2008.
[3] T. Groza, S. Handschuh, K. Möller, and S. Decker. SALT – Semantically Annotated LaTeX for Scientific Publications. In ESWC 2007, Innsbruck, Austria, 2007.
[4] T. Groza, K. Möller, S. Handschuh, D. Trif, and S. Decker. SALT: Weaving the claim web. In ISWC 2007, Busan, Korea, 2007.
[5] F. Lisacek, C. Chichester, A. Kaplan, and A. Sandor. Discovering Paradigm Shift Patterns in Biomedical Abstracts: Application to Neurodegenerative Diseases. In Proc. of the 1st Int. Symp. on Semantic Mining in Biomedicine (SMBM), 2005.
[6] W. C. Mann and S. A. Thompson. Rhetorical structure theory: A theory of text organization. Technical Report RS-87-190, Information Sciences Institute, 1987.
[7] D. Marcu. The Rhetorical Parsing, Summarization, and Generation of Natural Language Texts. PhD thesis, University of Toronto, 1997.
[8] Y. Mizuta, A. Korhonen, T. Mullen, and N. Collier. Zone analysis in biology articles as a basis for information extraction. Int. J. of Medical Informatics, 75:468–487, 2006.
[9] I. Nonaka and H. Takeuchi. The Knowledge-Creating Company: How Japanese Companies Create the Dynamics of Innovation. Oxford University Press, 1995.
[10] S. J. B. Shum, V. Uren, G. Li, B. Sereno, and C. Mancini. Modeling naturalistic argumentation in research literatures: Representation and interaction design issues. Int. J. of Intelligent Systems, 22(1):17–47, 2006.
[11] M. Taboada and W. C. Mann. Rhetorical structure theory: looking back and moving ahead. Discourse Studies, 8(3):423–459, 2006.
[12] S. Teufel, J. Carletta, and M. Moens. An annotation scheme for discourse-level argumentation in research articles. In Proc. of the 9th Conf. of the European Chapter of the ACL, pages 110–117, Morristown, NJ, USA, 1999. ACL.
[13] S. Teufel and M. Moens. Summarizing scientific articles: Experiments with relevance and rhetorical status. Computational Linguistics, 28, 2002.
[14] J. Tsujii. Refine and pathtext, which combines text mining with pathways. Keynote at Semantic Enrichment of the Scientific Literature 2009 (SESL 2009), March 2009.
[15] V. Uren, S. B. Shum, G. Li, and M. Bachler. Sensemaking tools for understanding research literatures: Design, implementation and user evaluation. Int. J. of Human-Computer Studies, 64(5):420–445, 2006.