Predicting Text Quality for Scientific Articles

Annie Louis
University of Pennsylvania
Department of Computer and Information Science
Philadelphia, PA 19104
lannie@seas.upenn.edu

Abstract

My work aims to build a system to automatically predict the writing quality of scientific articles from two genres: academic publications and science journalism. Our goal is to employ these predictions in article recommendation systems and to provide feedback during writing.

Introduction

For people in academia and research institutions, writing is an important and regular activity, producing journal articles, grants, patent applications, and reports to funding agencies. However, scientists and students receive very little instruction in science writing. Tools that provide automated feedback on writing can greatly aid academic writers by giving quick and ready comments. Another mode of science writing is news, where journalists recount research findings and their impact. Here, identifying well-written and interesting articles could help in developing article recommendation systems. Such applications depend on identifying linguistic properties of texts that correlate with how readers perceive their quality. But a comprehensive computational study of science writing quality has not been done so far.

In my thesis, I aim to identify indicators of writing quality for scientific articles. Specifically, I consider two forms of science reporting: academic writing in journal publications, and science journalism, where findings from academia and industry are explained in lay terms. Using a corpus of quality ratings obtained from the respective target audiences, I will study which linguistic properties of the texts best predict the ratings. I will investigate the use of these predictions in two applications: authoring tools for academic writing and an article recommendation service for science news.

My research will have a wider impact than the specific tasks I explore in my thesis. Typically, information retrieval for scientific articles is guided by citation analysis and the impact factor of journals; for news writing, only relevance and recency are considered. Article rankings that also incorporate writing quality will significantly improve the user's browsing experience. Further, we can also apply our predictions to automatically evaluate the linguistic quality of systems that produce summaries of science articles.

Corpus of Good and Poor Science Writing

For journal articles, citations and publication venue are good indicators of the quality of an article's content, but they are not directly associated with writing quality. In my prior work, I have analyzed factors related to content quality and factors affecting linguistic quality for summaries of generic news articles (Louis and Nenkova 2009; Pitler, Louis, and Nenkova 2010). But for science writing, the correlates of linguistic quality have not been explored so far. To start this line of research, I will build the first large-scale corpus of science articles with text quality ratings.

We plan to use journal articles from different domains within biology for our corpus of academic writing. I will ask graduate students in these areas to annotate the writing quality of the articles. As a pilot, I have annotated 17 articles on Machine Translation from the ACL conference proceedings. Each article was rated on a scale of 1 to 5 for individual sections: abstract, introduction, related work, and conclusion. Around 50% of the abstracts were rated poor, and 30% of the other sections also received low ratings, so the quality of writing does vary considerably.

For the news domain, I have collected a corpus of good science writing: New York Times articles selected to appear in the 'Best American Science Writing' series. These books, published yearly since 1999, compile newspaper articles with excellent science writing. This set of 65 well-written articles forms our development data for analysis. Later, we plan to collect ratings on more articles to obtain examples of both good and poor writing. I will also ask annotators for finer-grained ratings of the interest level of leads, the background knowledge required, the explanation of the research, and the storyline.

Automatic Prediction of Text Quality

Good academic writing involves a number of skills: the ability to compare ideas with other work, motivate the research, substantiate and defend claims, and clearly detail the experiments. Further, the writing must be direct and non-verbose. In news writing, several other factors also come into play. The research should be reported in a plain manner and made interesting for lay readers. Writers should also provide adequate background knowledge and employ comparisons with known items to explain the research. I aim to identify measurable indicators of such properties of good writing and use them to build an automatic predictor of text quality. The specific tasks I plan to address include the identification of general versus specific sentences, identification of rhetorical zones, prediction of verbosity level, detection of visual and concrete words, and analysis of article topic.

The idea of general and specific sentences in writing relates to the process of presenting a claim and then substantiating it with more details. In fact, it has been observed that academic articles take the shape of an hourglass, with the introduction and conclusion presenting general material and the experiment and methods sections containing most of the details (Swales and Feak 1994). A measure of the distribution of these two types of sentences in an article could be indicative of quality. But no annotations for such sentences are available, so I have built a classifier by making use of discourse relations (Louis and Nenkova 2011). The Penn Discourse Treebank corpus contains annotations for Instantiation-type discourse relations, which relate one general and one specific sentence. Training on these sentences with features related to words, polarity, language models, and specificity, our accuracy has reached 75% for predicting whether a sentence is general or specific.
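To make this concrete, the sketch below trains a toy version of such a classifier with scikit-learn. The tiny labeled set and bag-of-words features are hypothetical stand-ins for the PDTB Instantiation pairs and the richer word, polarity, language model, and specificity features used in the actual system.

```python
# Minimal sketch of a general-vs-specific sentence classifier.
# The toy examples stand in for PDTB Instantiation pairs, where one
# argument is a general claim and the other a specific elaboration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data: 1 = general, 0 = specific.
sentences = [
    "The results show a clear improvement.",                       # general
    "Accuracy rose from 68.2% to 75.1% on 900 test sentences.",    # specific
    "Several factors influence translation quality.",              # general
    "The BLEU score dropped by 2.3 points on the news test set.",  # specific
]
labels = [1, 0, 1, 0]

# Unigram and bigram counts as a simplified feature set.
clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(sentences, labels)

print(clf.predict(["Error rates fell by 4.7% after retraining."]))
```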
Another approach is to analyze the different rhetorical zones of an article: problem definition, comparisons with other approaches, and motivation for the proposed approach. In prior work, I have built discourse parsers based on semantic features to identify rhetorical relations in general news articles (Pitler, Louis, and Nenkova 2009). I plan to explore similar methods for zone identification in science articles and will then analyze how the size and location of these zones correlate with quality. Further, we can build a Markov chain to record the succession of zones in well-written articles and use it to predict zone shifts that lead to poor quality.
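The sketch below illustrates the Markov chain idea: zone-to-zone transition probabilities are estimated from the zone sequences of well-written articles, and a new article's sequence is scored by its average transition log-probability, so unusual zone shifts receive low scores. The zone labels and training sequences are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical zone sequences from well-written articles.
good_articles = [
    ["problem", "motivation", "comparison", "approach"],
    ["problem", "comparison", "motivation", "approach"],
    ["problem", "motivation", "approach"],
]

# Count zone-to-zone transitions observed in the good articles.
counts = defaultdict(Counter)
for zones in good_articles:
    for prev, nxt in zip(zones, zones[1:]):
        counts[prev][nxt] += 1

ZONES = 4  # number of distinct zone labels

def transition_logprob(prev, nxt, smoothing=0.5):
    """Smoothed log-probability of moving from zone `prev` to `nxt`."""
    total = sum(counts[prev].values())
    return math.log((counts[prev][nxt] + smoothing) / (total + smoothing * ZONES))

def score(zones):
    """Average transition log-probability; low values flag odd zone shifts."""
    pairs = list(zip(zones, zones[1:]))
    return sum(transition_logprob(p, n) for p, n in pairs) / len(pairs)

print(score(["problem", "motivation", "approach"]))   # typical ordering, higher score
print(score(["approach", "problem", "comparison"]))   # unusual ordering, lower score
```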
I am also building a method to identify verbose sentences. Using aligned uncompressed and compressed sentences, I have identified which syntactic productions have a high probability of deletion. Using this information, I will score each sentence by the deletion probabilities of the productions it comprises; sentences with a high probability under this model can be called more verbose.
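The scoring scheme can be sketched as follows; the production deletion probabilities and the parse are hypothetical stand-ins for the estimates learned from the aligned sentence pairs.

```python
# Hypothetical deletion probabilities for syntactic productions,
# of the kind estimated from aligned compressed/uncompressed sentences.
deletion_prob = {
    ("NP", ("NP", "PP")): 0.60,    # trailing PP modifiers are often cut
    ("VP", ("ADVP", "VP")): 0.55,  # leading adverbials are often cut
    ("S", ("NP", "VP")): 0.05,     # core clause structure is rarely deleted
}

def verbosity(productions):
    """Mean deletion probability of a sentence's productions;
    higher values suggest a more verbose sentence."""
    probs = [deletion_prob.get(p, 0.1) for p in productions]  # 0.1 for unseen
    return sum(probs) / len(probs)

# Productions from a (hypothetical) parse of one sentence.
sent = [("S", ("NP", "VP")), ("NP", ("NP", "PP")), ("VP", ("ADVP", "VP"))]
print(round(verbosity(sent), 2))  # 0.4
```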
Apart from these generic indicators of quality, my preliminary studies have pointed to a few properties that are unique to science writing in the news. Better articles employ visual words, which evoke images in the minds of readers. Similarly, words with an 'audible' characteristic, for example "swooshes" and "rattle", are appealing when used in articles. Further, concrete words are more desirable than abstract ones such as "important" and "significant". For these three dimensions, I will learn lexicons over a large corpus of news using a few seed words and a label propagation mechanism to find similar words. Certain types of sentence constructions also appear well suited to invoking surprise and creating interest; I have noticed such a trend particularly in lead paragraphs. By examining how the syntax of lead sentences differs from that of average sentences elsewhere in the article, I will obtain syntactic correlates of quality writing. In addition, certain topics tend to generate more interest when presented in the news. For example, articles on medicine and health make up the majority of well-written articles in our corpus, so I will analyze the correlation of topic with text quality.

Authoring Tools and Article Recommendation

A straightforward application of our predictions would be in news article recommendation. To evaluate this, I plan to conduct a user study in a browsing scenario. Users will be asked to identify topics they are interested in. We can then provide articles on those topics ordered by our writing quality predictions and compare the reported user experience to an interface where articles are chosen based only on relevance to the topic.

I will also develop a tool to provide authoring support for academic writing. Simply providing ratings for sections would be of limited use in this setting. Rather, the tool will annotate drafts based on the different factors that are correlated with writing quality. For example, we can highlight general sentences that are not followed by proper substantiation, offer suggestions for reorganization by tracking the layout of the rhetorical zones, and identify inadequate analyses by recording the sizes of those zones. Such annotations would be a useful source of comments that are quick, readily available, and consistent.

Current and Future Work

I have collected news articles and publications for our experiments and am preparing to conduct the annotations. I have built the general versus specific classifier and have experimentally validated some indicators of quality for science journalism. The studies on academic writing and authoring support are mostly future work.

References

Louis, A., and Nenkova, A. 2009. Automatically evaluating content selection in summarization without human models. In Proceedings of EMNLP, 306–314.

Louis, A., and Nenkova, A. 2011. General versus specific sentences: automatic identification and application to analysis of news summaries. Technical Report MS-CIS-11-07, University of Pennsylvania.

Pitler, E.; Louis, A.; and Nenkova, A. 2009. Automatic sense prediction for implicit discourse relations in text. In Proceedings of ACL-IJCNLP, 683–691.

Pitler, E.; Louis, A.; and Nenkova, A. 2010. Automatic evaluation of linguistic quality in multi-document summarization. In Proceedings of ACL, 544–554.

Swales, J. M., and Feak, C. 1994. Academic Writing for Graduate Students: A Course for Non-Native Speakers of English. Ann Arbor: University of Michigan Press.