Automatic News Summarization with a Dependency Structure
Aaron Fetterman
Advisor: Janice T. Searleman
Software Engineering

The Problem

Compared with fifty or a hundred years ago, we have access to far more information today. Our bookshelves are bigger, we subscribe to more magazines, and our newspapers are much thicker. With this luxury of access comes a problem of accessibility: no one person can read all of these sources. We can, however, use summaries, abstracts, or excerpts to maintain broad coverage of current events despite the wealth of information on them. My thesis will deal with the creation of an automatic text summarization method for news-style articles based on the dependency structure.

Text Summarization

Summarization, a subset of Natural Language Processing (NLP) and Natural Language Understanding (NLU), blossomed after the creation of large online corpora and the widespread adoption of the internet. Automatic summarization, performed by computers without any human aid, must overcome a problem shared with many other areas of natural language processing: there is no established methodology for creating a good summary. Doing it well requires an understanding of the topic and a thorough read of the material.

The majority of proposed summarization systems use extraction to generate a summary. Through a variety of methods, they rank the sentences in the document and then select as many of the highest-ranked sentences as requested. While this leads to good coherence within each sentence, it inherently leaves a noticeable disconnect between sentences. It ensures that each statistically significant topic is represented, but it cannot guarantee the best, or even a generally coherent, summary. Another method is to use information extraction and discourse analysis to generate sentences from scratch. This gives the summarizer more control over the summary, and therefore the ability to create a better one.
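As a toy illustration of the extractive approach described above (not the method this thesis proposes), sentences can be ranked by a simple word-frequency score and the top few re-emitted in document order. The function names and the scoring scheme here are invented for this sketch:

```python
# Minimal sketch of frequency-based extractive summarization.
# Assumption: sentence importance is approximated by the average
# document-wide frequency of the sentence's words.
from collections import Counter
import re


def split_sentences(text):
    """Naive sentence splitter on terminal punctuation."""
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]


def tokenize(sentence):
    """Lowercase word tokens only; punctuation is dropped."""
    return re.findall(r'[a-z]+', sentence.lower())


def extractive_summary(text, max_sentences=2):
    sentences = split_sentences(text)
    # Word frequencies over the whole document.
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        # Average frequency, so long sentences are not favored.
        words = tokenize(sentence)
        return sum(freq[w] for w in words) / max(len(words), 1)

    top = set(sorted(sentences, key=score, reverse=True)[:max_sentences])
    # Re-emit the selected sentences in their original order.
    return ' '.join(s for s in sentences if s in top)
```

Even this crude ranker shows the weakness noted above: each selected sentence is fluent on its own, but nothing ties consecutive selections together.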
While the problem of generation might seem to add a layer of complexity to the summarizer, it can be simplified by limiting the complexity of the sentences being generated. Using simple sentences also has the advantage of producing a more readable summary.

A dependency structure builds on part-of-speech tagging to describe the relationships between words. The idea behind it is to connect each word to a parent, such that each sentence has a single root. From a summarization perspective, this network of related words describes objects and how they relate to each other.

This Thesis

For my thesis, I will design an algorithm incorporating ideas from natural language processing and linguistics, then design and write an implementation of it. It will use a web of the dependencies from the original article to generate a summary of around 100 words. Because this is only a year-long project, the summarizer must be limited in scope, but ideally it will be designed modularly so that future refinements can be incorporated. The complexity of the summarizer as a whole will be limited by restricting the intended source and audience: it will focus on news-style articles and produce general summaries, as opposed to query-specific ones. It will also attempt to create a context-free summarization, without outside knowledge of the subject or the world.

There will be a significant amount of research required for this thesis, because little of this fascinating subject is taught at Clarkson. I will need to research and decide on a model for the part-of-speech tagger that I will implement and combine with the summarizer. I will need to research linguistics to learn how tenses and forms translate into relational meaning, and psychology for a better idea of the organization of memory and how that translates into the creation of sentences.
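The single-rooted parent structure described above can be sketched concretely. In this toy example (the sentence, field names, and helper are all invented for illustration, not taken from the proposed system), each word records the index of its head, and exactly one word, the root, has no parent:

```python
# Toy dependency structure: every word points to one parent (its
# head), and a single word per sentence is the root (head = -1).
from collections import namedtuple

Word = namedtuple('Word', ['index', 'form', 'head'])


def build_tree(words):
    """Group words under their heads, returning (root, parent -> children)."""
    children = {w.index: [] for w in words}
    root = None
    for w in words:
        if w.head == -1:
            root = w
        else:
            children[w.head].append(w)
    return root, children


# "The dog chased the cat": 'chased' is the single root;
# 'dog' and 'cat' attach to it, and each 'the' attaches to its noun.
sentence = [
    Word(0, 'The', 1),
    Word(1, 'dog', 2),
    Word(2, 'chased', -1),
    Word(3, 'the', 4),
    Word(4, 'cat', 2),
]
root, children = build_tree(sentence)
```

Walking outward from the root yields the objects of the sentence and how they relate, which is exactly the "network of related words" a dependency-based summarizer would draw on.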
I will need a solid understanding of both Natural Language Processing and its statistical techniques, along with reference implementations of several other summarizers for comparison. In the end, I will evaluate the summarized documents with both a statistical analysis and a survey, comparing them against a number of other selected automatic summarization methods.

Timeline

March
o Delve heavily into basic Natural Language Processing
o Continue the literature search

April
o Move towards the statistical techniques in NLP
o Start searching for a part-of-speech tagger

May
o Research more in depth into current summarization techniques
o Find implementations for comparisons
o Find a number of corpora to test on
o REQUIREMENTS

June
o Decide on the algorithm for this summarizer
o Implement the part-of-speech tagger
o DESIGN

July
o Implement the summarizer
o Refine the algorithm as necessary
o IMPLEMENT -> DESIGN

August
o Gather preliminary data
o Present at the SURE conference
o TEST

September
o Analyze initial data
o Suggest improvements

October
o Implement improvements
o Gather more data; produce the survey and send it out

November
o Analyze the data
o See whether the improvements helped

December
o Try the summarizer on non-news articles
o Write the first draft of the thesis

January
o Continue revising the thesis
o Create a presentation