Using Part-of-Speech N-grams for Sensitive-Text Classification

Graham McDonald, University of Glasgow, Scotland, UK (g.mcdonald.1@research.gla.ac.uk)
Craig Macdonald, University of Glasgow, Scotland, UK (craig.macdonald@glasgow.ac.uk)
Iadh Ounis, University of Glasgow, Scotland, UK (iadh.ounis@glasgow.ac.uk)

ABSTRACT

Freedom of Information legislation in many western democracies, including the United Kingdom (UK) and the United States of America (USA), typically states that citizens have the right to access government documents. However, certain sensitive information is exempt from release into the public domain. For example, in the UK, FOIA Exemption 27 (International Relations) exempts from release information that might damage the interests of the UK abroad. Therefore, the process of reviewing government documents for sensitivity is essential to determine whether a document must be redacted before it is archived, or closed until the information is no longer sensitive. With the increased volume of digital government documents in recent years, there is a need for new tools to assist the digital sensitivity review process. Therefore, in this paper we propose an automatic approach for identifying sensitive text in documents by measuring the amount of sensitivity in sequences of text. Using government documents reviewed by trained sensitivity reviewers, we focus on an aspect of FOIA Exemption 27 that can have a major impact on international relations, namely information supplied in confidence. We show that our approach leads to markedly increased recall of sensitive text, while achieving a very high level of precision, when compared to a baseline that has been shown to be effective at identifying sensitive text in other domains.

1. INTRODUCTION

Freedom of Information (FOI) laws exist in many countries around the world, including the United Kingdom (UK, http://www.legislation.gov.uk/ukpga/2000/36/contents) and the United States of America (USA, http://www.foia.gov). FOI states that government documents should be open to the public. However, many government documents contain sensitive information, such as personal or confidential information. Therefore, FOI laws make provisions that exempt sensitive information from being open. To avoid the accidental release of sensitive information, it is essential that all government documents are sensitivity reviewed prior to release. Sensitivity reviewers are required to identify any sequences of sensitive text in a document, for example individual terms, sentences, paragraphs or the full content of the document, so that the sensitive text can be redacted or the document can be closed.
However, the recent increase in the volume of digital government documents means that the traditional manual review process is no longer feasible. Therefore, the UK and USA governments have recently recognised the need for automatic tools to assist the sensitivity review process [2, 3].

In this work, we address the problem of automatically identifying sensitive information in government documents by directly classifying the text within documents. We present an approach that uses POS n-grams to measure the amount of sensitivity contained within a sequence of text, i.e. its sensitivity load. Moreover, we show how our approach can be used to deploy an effective sensitivity classifier that classifies sensitivity at the term level. We propose to perform classification at the term level because correctly identifying partial sensitive sequences will be beneficial to a sensitivity reviewer. For example, drawing a reviewer's attention to the positions of sensitivities within a document will likely reduce the time taken to review the document.

We initially focus on an aspect of FOIA Exemption 27 International Relations, namely "information supplied in confidence", since this sensitivity has a clear potential to cause damage to international relations if inadvertently released into the public domain. We show that our approach can markedly improve recall of in confidence sensitivities compared to a baseline approach that has been shown to achieve high levels of recall for sensitive text in other domains [11].

The remainder of this paper is organised as follows. We present related work in Section 2. We present our approach in Section 3, before presenting two classification methods, in Section 4 and Section 5, that implement our approach. In Section 6 we present a baseline approach that we compare our results against. We present our experimental setup in Section 7 and our results in Section 8. Finally, we present our conclusions and future work in Section 9.

2. RELATED WORK

Gollins et al. [5] provided an overview of the challenges presented by digital sensitivity review and how these challenges can be addressed by Information Retrieval (IR) techniques. In that work, they noted that although the type of sensitivity can be identified, for example personal information, the existence of many sensitivities relies not only on the terms or entities in the document but also on the context of the information, i.e. what is said about an entity, how it is said and who said it. In [9], we presented an initial approach to address the issue of context-dependent sensitivity, highlighted by Gollins et al., by deploying a text classification approach with additional features, such as the entities in the document, a country risk score and a subjective sentences count, to identify documents that contained personal information and international relations sensitivities. Differently from that work, in this paper we investigate automatically identifying sequences of sensitive text, within documents, that relate to information supplied in confidence.

Outwith the field of sensitivity review, most work on automatically identifying sensitive text in documents is in the field of document sanitization [1, 4]. Document sanitization tries to automatically mask personal information, or information that could reveal the identity of the person the document is about. A popular approach for this is Named Entity Recognition (NER). NER typically identifies person, location and organisation entities in text, and NER approaches to document sanitization typically assume that all named entities in a document are sensitive, as sketched below.
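The following is a minimal sketch of this masking assumption, using spaCy's off-the-shelf NER model as an assumed stand-in for the NER components of the cited sanitization systems [1, 4], which are considerably more sophisticated:

```python
# Minimal sketch of NER-based sanitization: treat every named entity as
# sensitive and mask it. Illustrative only; assumes spaCy is installed:
#   pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def sanitize(text: str) -> str:
    """Mask every named entity recognised in the text."""
    doc = nlp(text)
    redacted = text
    # Replace entities right to left so earlier character offsets stay valid.
    for ent in reversed(doc.ents):
        redacted = redacted[:ent.start_char] + "[REDACTED]" + redacted[ent.end_char:]
    return redacted

print(sanitize("The envoy met President Smith in Geneva on Tuesday."))
```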
However, in this work we try to automatically identify information that has been supplied in confidence and, as such, we need a more general solution that can identify sensitivities in what is said about, or by, an entity.

Sánchez et al. [11] presented an approach to sensitive-text identification that is more general than NER. They assumed that sensitive text is likely to be more specific than non-sensitive text, and used the Information Content (IC) of noun phrases as a measure of how sensitive a phrase is. Sánchez et al. focused on identifying personal information sensitivities. However, they also identified textual phrases that are potentially confidential. Therefore, their work is more closely aligned to identifying FOIA Exemption 27 sensitivities than NER approaches. Moreover, identifying all potential sensitivities in government documents is the first stage of the sensitivity review process, and Sánchez et al. found that their approach achieved higher recall for sensitive information when compared to NER. For these reasons, we use their approach as the baseline against which we compare our work, and we describe our implementation of it in Section 6.

3. IDENTIFYING SENSITIVITY LOADED PART-OF-SPEECH N-GRAMS

The approach we present in this paper uses the sensitivity load of POS n-grams to identify sequences of sensitive text in documents. The approach is inspired by Lioma and Ounis [8], who showed that the distribution of POS n-grams in a corpus can indicate the amount of information they contain. More specifically, Lioma and Ounis showed that high frequency POS n-grams are typically content rich, and that removing content poor POS n-grams from search engine queries results in an improved overall retrieval performance. Differently from their work, we use the content load of POS n-grams to try to measure the sensitivity load of specific sequences of text. Our intuition is that certain POS sequences might be more frequent in specific sensitivities. For example, sensitivities relating to information supplied in confidence would likely contain variations of the sequence noun - verb - pronoun, indicating that someone has supplied information to someone else. Our approach uses the distribution of POS n-grams to try to identify sequences that are specific to this sensitivity.

To do this, we first represent documents by the POS n-grams they contain. For example, the sequence "The envoy will report on Tuesday" results in the POS tags "DT NN MD VB IN NN". Representing this sequence as POS 3-grams results in "DTNNMD NNMDVB MDVBIN VBINNN".
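As a concrete illustration, the following minimal sketch implements this representation step; NLTK's Penn Treebank tagger is an assumed stand-in for the TreeTagger and the reduced 15-tag set used in our experiments (Section 7):

```python
# Minimal sketch of the POS n-gram representation step. Requires:
#   pip install nltk
#   nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
import nltk

def pos_ngrams(sentence: str, n: int = 3) -> list[str]:
    """Tag a sentence and return its concatenated POS n-grams."""
    tokens = nltk.word_tokenize(sentence)
    tags = [tag for _, tag in nltk.pos_tag(tokens)]
    return ["".join(tags[i:i + n]) for i in range(len(tags) - n + 1)]

print(pos_ngrams("The envoy will report on Tuesday"))
# With the reduced tagset of Section 7 this yields:
# ['DTNNMD', 'NNMDVB', 'MDVBIN', 'VBINNN']
```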
Having represented the documents by their POS n-grams, we then use a probabilistic method to measure the sensitivity load of a POS n-gram. More specifically, following the work of Li et al. [7], we first construct a 2-way contingency table, as shown in Table 1, where (pos, s) is the number of documents in which the POS n-gram appears in sensitive text, (pos, ¬s) is the number of documents in which the POS n-gram appears in non-sensitive text, (¬pos, s) is the number of documents that do not contain the POS n-gram in sensitive text and (¬pos, ¬s) is the number of documents that do not contain the POS n-gram in non-sensitive text.

Table 1: 2-way contingency table used to calculate the Chi-square statistic of a part-of-speech n-gram.

                     | sensitive | non-sensitive
Containing POS       | pos, s    | pos, ¬s
Not Containing POS   | ¬pos, s   | ¬pos, ¬s

Having constructed the contingency table, we use the Chi-square test of independence to measure the degree of dependency between a POS n-gram and sensitive text. The Chi-square score for a POS n-gram, $X^2_{pos}$, is calculated as follows:

$$X^2_{pos} = \frac{N_d \left( p(pos,s)\,p(\neg pos,\neg s) - p(pos,\neg s)\,p(\neg pos,s) \right)^2}{p(pos)\,p(\neg pos)\,p(\neg s)\,p(s)} \quad (1)$$

where $p(pos,s)$ is the probability that the sensitive text of a document contains the POS n-gram, $p(pos,\neg s)$ is the probability that the non-sensitive text of a document contains the POS n-gram, $p(\neg pos,s)$ is the probability that the sensitive text of a document does not contain the POS n-gram, $p(\neg pos,\neg s)$ is the probability that the non-sensitive text of a document does not contain the POS n-gram, $p(pos)$ is the probability that a document contains the POS n-gram, $p(\neg pos)$ is the probability that a document does not contain the POS n-gram, $p(s)$ is the probability that text in the collection is sensitive, $p(\neg s)$ is the probability that text in the collection is not sensitive, and $N_d$ is the total number of documents in the collection.

The Chi-square test of independence measures how much the observed frequency of a POS n-gram diverges from its expected frequency within a corpus. If a POS n-gram's Chi-square score is greater than the Chi-square distribution's critical value for a 95% confidence level with one degree of freedom (3.84), then we assume that the distribution of the POS n-gram in the corpus is related to sensitive text. We refer to these n-grams as being sensitivity loaded. The identified sensitivity loaded POS n-grams can then be used as features in any method for automatic sensitive-text classification. In Section 4 and Section 5, we present two such methods that integrate our approach.

4. SENSITIVITY LOAD FILTERING

Confidential information in documents is rare. Indeed, 95% of the terms in our collection are part of sequences that the reviewers did not judge to be sensitive. Moreover, many documents adhere to template structures in which certain sections of a document are particularly unlikely to contain confidential information, for example the header section of an email stating the sender, recipients and a subject title. Therefore, we can use the sensitivity loaded POS n-grams identified by our approach, presented in Section 3, to filter out sequences of text that are most likely to contain non-sensitive information. To do this, we count the number of times a sensitivity loaded POS n-gram appears in sensitive and non-sensitive text, and use the ratio of sensitive and non-sensitive occurrences to identify POS n-grams that are representative of non-sensitive sequences. Equation (2) shows how we calculate this ratio, $lRatio_{pos,s}$, where Sensitive corresponds to $lRatio_{pos,s} > 1$ and Non-Sensitive corresponds to $lRatio_{pos,s} < 1$:

$$lRatio_{pos,s} = \frac{p(pos,s)\,p(\neg pos,\neg s) - p(pos,\neg s)\,p(\neg pos,s)}{p(pos)\,p(s)} + 1 \quad (2)$$

To classify the terms in a given sequence of text k, we represent k by its POS n-grams, $k_{pos}$. Then, for any $k_{pos}$ in the set of previously identified non-sensitive POS n-grams, each term t in $k_{pos}$ is classified as non-sensitive; all other terms are classified as sensitive. We refer to this method as the Filtering method.
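The following sketch illustrates Equations (1) and (2). It makes the simplifying assumption that the four cells of Table 1 partition the document collection (in practice a document can contribute to more than one cell); it is an illustration, not the implementation used in our experiments:

```python
# Minimal sketch of Equations (1) and (2). Variable names mirror Table 1
# (ns = "non-sensitive", npos = "not containing the POS n-gram").
CRITICAL_VALUE = 3.84  # Chi-square, 95% confidence, one degree of freedom

def chi_square(pos_s: int, pos_ns: int, npos_s: int, npos_ns: int) -> float:
    """Chi-square score of a POS n-gram (Equation 1)."""
    n_d = pos_s + pos_ns + npos_s + npos_ns  # N_d: total documents
    p = lambda count: count / n_d
    numerator = n_d * (p(pos_s) * p(npos_ns) - p(pos_ns) * p(npos_s)) ** 2
    denominator = (p(pos_s + pos_ns) * p(npos_s + npos_ns)   # p(pos), p(!pos)
                   * p(pos_s + npos_s) * p(pos_ns + npos_ns))  # p(s), p(!s)
    return numerator / denominator if denominator else 0.0

def l_ratio(pos_s: int, pos_ns: int, npos_s: int, npos_ns: int) -> float:
    """Sensitivity ratio of a sensitivity loaded POS n-gram (Equation 2)."""
    n_d = pos_s + pos_ns + npos_s + npos_ns
    p = lambda count: count / n_d
    numerator = p(pos_s) * p(npos_ns) - p(pos_ns) * p(npos_s)
    return numerator / (p(pos_s + pos_ns) * p(pos_s + npos_s)) + 1

# A POS n-gram is sensitivity loaded if chi_square(...) > CRITICAL_VALUE;
# a loaded n-gram with l_ratio(...) < 1 marks its terms as non-sensitive
# under the Filtering method.
```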
5. SENSITIVITY LOAD SEQUENCES

The identified sensitivity loaded POS n-grams can also be used to generate features for a Conditional Random Fields (CRF) sequence tagger [6]. A CRF is a probabilistic framework that models the conditional distribution p(y|x) in sequential data, where x is a sequence of observations, an observation is a term with a set of features that describe the term, and y is a sequence of class labels. To deploy the CRF method, we use two term features. The first feature we use is the term's POS tag. Additionally, we use the sensitivity loaded POS n-grams identified by our approach to generate a feature tag that indicates whether the term is part of a sequence that maps to a sensitivity loaded POS n-gram. To generate the feature tag, we look at a sequence of n (POS tagged) terms at a time and check whether the sequence maps to a sensitivity loaded POS n-gram. If a mapping exists, we tag each term in the sequence. We then move the sliding window by one term to the next sequence and repeat the process. To illustrate this, we return to our example from Section 3. Recall that the sequence "The envoy will report on Tuesday" is represented by the POS 3-grams "DTNNMD NNMDVB MDVBIN VBINNN". If our approach identifies "NNMDVB" as the only sensitivity loaded n-gram, this would result in the sequence being tagged as shown in Figure 1.

[Figure 1: Illustration showing the POS tag and generated sensitivity loaded feature tags for a sequence.]

When the learned CRF model is deployed, it predicts class labels for each term in a sequence based on its previous observations. We refer to this method as CRF+POS+TAG. We also present the results for the CRF method without the sensitivity loaded feature tag, CRF+POS, and the simple CRF (i.e. using just the term) as CRF.

6. INFORMATION CONTENT

As previously mentioned in Section 2, we compare our approach against a baseline from the literature that uses the Information Content (IC) of a noun phrase as a measure of the phrase's sensitivity [11]. A noun phrase is a sequence of terms that has a noun or indefinite pronoun at the head of the phrase. IC measures the amount of information provided by a sequence of terms within the context of a background corpus; more specific term sequences are considered as having a higher likelihood of being sensitive. To calculate the IC of the noun phrases in a document, the document is first parsed to extract its syntactic structure. Noun phrases are then extracted from the resulting syntax tree and submitted to a Web search engine as a query. The IC of a noun phrase is calculated using the number of returned results as an indication of the phrase's specificity. Formally, the IC of a noun phrase, np, is computed as:

$$IC(np) = -\log_2 p(np) = -\log_2 \frac{res(np)}{totalpages}$$

where $res(np)$ is the number of returned search results and $totalpages$ is the number of sites indexed by the search engine. For our experiments, each term within a noun phrase with an IC score greater than an empirically defined threshold, β, is classified as sensitive. All other terms are classified as non-sensitive. This baseline is referred to as InfContent.
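A minimal sketch of this baseline follows; the hit-count function is a placeholder for the Web search engine query (Bing in our experiments, Section 7), so the sketch shows the scoring and thresholding logic only:

```python
# Minimal sketch of the InfContent baseline. result_count is a placeholder
# for a Web search API call; TOTAL_PAGES matches the 3.5 billion indexed
# pages assumed in Section 7.
import math

TOTAL_PAGES = 3.5e9

def result_count(noun_phrase: str) -> int:
    """Placeholder: return the search engine's hit count for the phrase."""
    raise NotImplementedError("query a Web search API here")

def information_content(noun_phrase: str) -> float:
    """IC(np) = -log2(res(np) / totalpages)."""
    return -math.log2(result_count(noun_phrase) / TOTAL_PAGES)

def classify_phrase(noun_phrase: str, beta: float = 7.0) -> str:
    """Label every term of the phrase sensitive if its IC exceeds beta."""
    return "sensitive" if information_content(noun_phrase) > beta else "non-sensitive"
```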
7. EXPERIMENTAL SETUP

Collection: The collection consists of government documents with "information supplied in confidence" sensitivities. The documents have been sensitivity reviewed by trained sensitivity reviewers. Reviewers were asked to annotate the sensitive sequences within the documents; therefore, the documents have term-level class labels, i.e. each term within a sensitive sequence is labelled sensitive and all terms that were not annotated are labelled non-sensitive. There are a total of 231,893 terms in the collection, with 10,838 sensitive terms and 221,055 non-sensitive terms, in a set of 143 documents. For our experiments, we split the collection at the document level, to retain the context of the terms, and perform a 5-fold cross validation.

Baselines: We compare our approach against the IC baseline presented in Section 6. We use Open NLP (https://opennlp.apache.org) to extract noun phrases from documents and, for calculating the IC score of noun phrases, we use the Bing search engine (http://www.bing.com) and set the total number of sites indexed, totalpages, to 3.5 billion. For our experiments, we test IC threshold values of β = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 75, 100}. In future work, we intend to investigate learning the β parameter. We also report the results of a simple classifier that classifies all terms as the majority class, referred to as All Non-Sensitive. Conversely, we also report a classifier that classifies all terms as sensitive, referred to as All Sensitive.

Sensitivity Load: For identifying sensitivity loaded POS n-grams (Section 3), we use the TreeTagger (http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger) for POS tagging and, following Lioma and Ounis [8], we use a reduced set of 15 POS tags. We calculate the Chi-square statistic on the training data for n = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}. In future work, we intend to learn the optimal value of n.

Classification: For Sensitivity Load Filtering (Section 4) and Sensitivity Load Sequences (Section 5), we test the methods for each value of n = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}. For the Sensitivity Load Sequences classification method, we use the Mallet (http://mallet.cs.umass.edu) linear chain CRF tagger.

Metrics: We report balanced accuracy (BAC) due to the imbalanced nature of our collection. Moreover, sensitivity review is a recall-oriented task, since reviewers must identify all sensitivities to avoid inadvertent release. Therefore, we report the F2 measure, a weighted harmonic mean of precision and recall in which recall is attributed more importance and, therefore, a greater weight. We also report the standard accuracy, precision and recall metrics.
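For clarity, the following sketch shows how such metrics can be computed from term-level confusion counts, using the textbook definitions of balanced accuracy and the F-beta measure; the counts shown are hypothetical and this is not the evaluation code used in our experiments:

```python
# Worked sketch of the reported metrics from term-level confusion counts.
def evaluation_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0       # sensitive-class recall
    specificity = tn / (tn + fp) if tn + fp else 0.0  # non-sensitive recall
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    bac = (recall + specificity) / 2                  # balanced accuracy
    beta = 2.0                                        # F2: recall weighted higher
    f2 = ((1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "BAC": bac,
            "precision": precision, "recall": recall, "F2": f2}

print(evaluation_metrics(tp=46, fp=2, tn=930, fn=54))  # hypothetical counts
```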
8. RESULTS

Table 2 shows the best achieved performance for the Filtering method (Filtering), the CRF method using the POS tag and sensitivity load as features (CRF+POS+TAG), the CRF method with the POS tag only (CRF+POS) and the simple CRF (CRF). Table 2 also shows the best performance for the Information Content baseline (InfContent) and the scores of the All Sensitive and All Non-Sensitive classifiers.

Table 2: Results for the Filtering and CRF methods. The table also shows the performance of the All Sensitive, All Non-Sensitive and Information Content baselines.

Method               | accuracy | balanced acc. | precision | recall | F2
All Non-Sensitive    | 0.9533   | 0.0000        | 0.0000    | 0.0000 | 0.0000
All Sensitive        | 0.0467   | 0.5234        | 0.0467    | 1.0000 | 0.1969
InfContent (ic>7)    | 0.9457   | 0.1340        | 0.2564    | 0.0116 | 0.0143
InfContent (ic>8)    | 0.9457   | 0.1245        | 0.2388    | 0.0103 | 0.0127
InfContent (ic>9)    | 0.9457   | 0.1188        | 0.2281    | 0.0094 | 0.0116
InfContent (ic>10)   | 0.9458   | 0.1279        | 0.2466    | 0.0093 | 0.0115
InfContent (ic>20)   | 0.9459   | 0.0983        | 0.1911    | 0.0055 | 0.0069
Filtering (n=7)      | 0.3763   | 0.3805        | 0.0579    | 0.7032 | 0.2178
Filtering (n=8)      | 0.3360   | 0.4007        | 0.0573    | 0.7441 | 0.2191
Filtering (n=9)      | 0.2616   | 0.4434        | 0.0570    | 0.8298 | 0.2236
Filtering (n=10)     | 0.1815   | 0.4846        | 0.0563    | 0.9129 | 0.2257
CRF                  | 0.9226   | 0.0390        | 0.0179    | 0.0283 | 0.0201
CRF+POS              | 0.9221   | 0.0647        | 0.0856    | 0.0446 | 0.0494
CRF+POS+TAG (n=7)    | 0.9600   | 0.5702        | 0.8310    | 0.3094 | 0.3539
CRF+POS+TAG (n=8)    | 0.9637   | 0.6264        | 0.8932    | 0.3595 | 0.4083
CRF+POS+TAG (n=9)    | 0.9657   | 0.6608        | 0.9454    | 0.3762 | 0.4277
CRF+POS+TAG (n=10)   | 0.9712   | 0.7282        | 0.9992    | 0.4573 | 0.5129

The first thing we note from Table 2 is that the CRF classification method using the sensitivity loaded POS n-grams identified by our approach outperforms the IC baseline on all metrics. The CRF method using sensitivity loaded 10-grams (CRF+POS+TAG, n=10) achieves 0.4573 recall. Importantly, the method also achieves 0.9992 precision and, therefore, achieves a balanced accuracy of 0.7282. On closer inspection of the results, we found that the CRF method with this setting identified 99% of the sensitive text in 67% of the documents and, therefore, we believe this method would be useful in assisting the sensitivity review process. Moreover, we note that the CRF method without the sensitivity loaded POS n-gram feature (CRF+POS) correctly identified less than 5% of the sensitive text. This further demonstrates the effectiveness of our approach.

The next point we note from Table 2 is that the Filtering approach also achieves its highest recall (0.9129) when n is set to 10. However, the approach achieves a low precision score (0.0563), resulting in an F2 score of 0.2257. This low level of precision shows that, on this collection, this method markedly over-predicts sensitivity. We also see from Table 2 that the CRF and Filtering methods achieve their best recall scores using POS 10-grams.

Finally, we note that the IC baseline, InfContent (ic>7), achieved 0.0116 recall and 0.2564 precision. The IC baseline identifies syntactically complex phrases and specific terms as being sensitive. However, not all complex or specific phrases are confidential and, moreover, not all sensitivities are noun phrases. We also note that the classification methods that use our approach for identifying sensitive sequences of text achieve markedly better recall, and notably better balanced accuracy, than the IC baseline.

9. CONCLUSIONS AND FUTURE WORK

In this work, we proposed an approach for automatically detecting information supplied in confidence in government documents. Our approach uses part-of-speech n-grams to measure the sensitivity load of sequences of text. On a collection of sensitivity-reviewed government documents, we showed that the sensitivity load of these sequences can be used to accurately classify in confidence sensitivities within government documents. In particular, using a CRF sequence tagger with the sensitivity load of POS n-grams as features, this approach achieved a recall of over 0.45 and a precision of over 0.99, markedly outperforming a baseline approach from the literature that has been shown to achieve high levels of recall for sensitive text in other domains. As future work, we intend to conduct a user study to quantify the benefits of our approach for assisting in the sensitivity review of digital government documents.

10. ACKNOWLEDGMENTS

The authors would like to thank the sensitivity reviewers for their help in constructing the test collection.

11. REFERENCES

[1] D. Abril, G. Navarro-Arribas, and V. Torra. On the declassification of confidential documents. In Modeling Decisions for Artificial Intelligence, 6820:235–246, 2011.
[2] Defense Advanced Research Projects Agency (DARPA). New technologies to support declassification. Request for Information, DARPA-SN-10-73, 2010.
[3] A. Allan. Records Review. UK Government, 2014.
[4] J. R. Finkel, T. Grenager, and C. Manning. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proc. of ACL, 2005.
[5] T. Gollins, G. McDonald, C. Macdonald, and I. Ounis. On using information retrieval for the selection and sensitivity review of digital public records. In Proc. of PIR at CIKM, 2014.
[6] J. D. Lafferty, A. McCallum, and F. C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. of ICML, 2001.
[7] Y. Li, C. Luo, and S. Chung. Text clustering with feature selection by using statistical data. IEEE Trans. on Knowledge and Data Engineering, 20(5):641–652, 2008.
[8] C. Lioma and I. Ounis. Examining the content load of part-of-speech blocks for information retrieval. In Proc. of COLING/ACL, 2006.
[9] G. McDonald, C. Macdonald, I. Ounis, and T. Gollins. Towards a classifier for digital sensitivity review. In Proc. of ECIR, 2014.
[10] T. Mendel. Freedom of Information: A Comparative Legal Survey. UNESCO, Paris, 2008.
[11] D. Sánchez, M. Batet, and A. Viejo. Detecting sensitive information from textual documents: An information-theoretic approach. In Modeling Decisions for Artificial Intelligence, 7647:173–184, 2012.