Automatic News Summarization with a Dependency Structure

Aaron Fetterman
Advisor: Janice T. Searleman
Software Engineering
The Problem
Compare today to fifty or a hundred years ago, and you will find that we have
access to far more information. Our bookshelves are bigger, we subscribe to more
magazines, and our newspapers are much thicker. With this luxury of access comes a
problem of accessibility: it is impossible for one person to read all of these sources. We can,
however, use summaries, abstracts, or excerpts to maintain broad coverage of current
events despite the wealth of information on them. My thesis will deal with the creation of
an automatic text summarization method for news-type articles based on dependency
structure.
Text Summarization
Summarization, a subset of Natural Language Processing (NLP) and Natural
Language Understanding (NLU), blossomed after the creation of larger online
corpora and the widespread adoption of the internet. Automatic summarization (performed
entirely by computer, without any human aid) must overcome a problem shared with most
other natural language processing areas: there is no established methodology for creating a good
summary. Even for a person, doing it well requires an understanding of the topic and a
thorough reading of the material.
The majority of proposed summarization systems use extraction to generate a
summary. Through a variety of methods, they rank the sentences in the document and
then select as many of the highest-ranked sentences as requested. While this produces
coherent individual sentences, it inherently leaves a noticeable disconnect between them.
Extraction ensures that each statistically significant topic is represented, but it cannot
guarantee the best, or even a generally coherent, summary.
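To make the extraction approach concrete, the sketch below (not any particular published
system; the function name, stopword list, and scoring scheme are illustrative assumptions)
ranks sentences by the document-wide frequency of their content words and keeps the top few
in their original order:

    import re
    from collections import Counter

    # A tiny illustrative stopword list; real systems use much larger ones.
    STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "was", "for", "on", "that"}

    def extractive_summary(text, num_sentences=3):
        # Very rough sentence splitting; a real system would use a trained segmenter.
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())

        # Frequency of content words across the whole document.
        freq = Counter(w for w in re.findall(r"[a-z']+", text.lower())
                       if w not in STOPWORDS)

        # Score each sentence by the summed frequency of its content words.
        def score(sentence):
            return sum(freq[w] for w in re.findall(r"[a-z']+", sentence.lower())
                       if w not in STOPWORDS)

        # Rank sentences, keep the top few, and restore their original order.
        ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]),
                        reverse=True)
        chosen = sorted(ranked[:num_sentences])
        return " ".join(sentences[i] for i in chosen)

Because the selected sentences were written independently in the source article, nothing in
this procedure guarantees that they read smoothly when placed next to each other, which is
exactly the disconnect described above.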
Another method is to use information extraction and discourse analysis to
generate sentences from scratch. This gives the summarizer more control over the
summary, and therefore the ability to create a better one. While generation
might seem to add a layer of complexity to the summarizer, it can be simplified by
limiting the complexity of the sentences being generated. Using simple sentences also
has the advantage of producing a more readable summary.
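As a rough illustration of why restricting generation to simple sentences keeps the problem
tractable, the following sketch assumes that facts have already been extracted as
subject-verb-object triples; the example triples and the realize helper are purely
hypothetical, not part of this thesis:

    # Illustrative only: assume facts were already extracted upstream as
    # (subject, verb, object) triples.
    facts = [
        ("the city council", "approved", "the new budget"),
        ("the budget", "includes", "funding for two schools"),
    ]

    def realize(subject, verb, obj):
        # Restricting output to one simple clause per fact keeps generation
        # trivial and the resulting summary easy to read.
        return f"{subject.capitalize()} {verb} {obj}."

    summary = " ".join(realize(*fact) for fact in facts)
    print(summary)
    # The city council approved the new budget. The budget includes funding for two schools.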
Dependency structure builds on part-of-speech tagging to describe
relationships between words. The idea behind it is to connect each word to a parent, such
that each sentence has a single root. From a summarization perspective, this network of
related words describes objects and how they relate to each other.
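A minimal sketch of this structure, with the heads assigned by hand rather than by a real
parser, might look like the following:

    # Hand-built dependency parse of "The dog chased the cat".
    # Each word records the index of its head (parent); the root has head = None.
    words = ["The", "dog", "chased", "the", "cat"]
    heads = [1, 2, None, 4, 2]   # "The"->"dog", "dog"->"chased", "chased" is the root, ...

    root = heads.index(None)
    print("root:", words[root])          # root: chased

    # The children of each word follow directly from the parent pointers.
    for i, word in enumerate(words):
        children = [words[j] for j, h in enumerate(heads) if h == i]
        if children:
            print(word, "->", children)
    # dog -> ['The']
    # chased -> ['dog', 'cat']
    # cat -> ['the']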
This Thesis
For my thesis, I will design an algorithm incorporating the ideas of natural
language processing and linguistics, then design and write an implementation of it. It will
use a web of the dependencies from the original article to generate a summary of
around 100 words. Because this is only a year-long project, the complexity of the
summarizer must be limited in scope, but ideally it will be designed modularly so
that future refinements can be incorporated. The complexity of the summarizer as a whole
will be constrained by narrowing the intended source and audience. In this case, it will
focus on news-style articles and produce general summaries, as opposed to query-specific
summaries. It will also attempt to create a context-free summarization, without
outside knowledge of the subject or the world.
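As an illustrative sketch only (not the algorithm this thesis will develop), dependency edges
from each parsed sentence could be merged into a single web keyed by word, so that heavily
connected words stand out as candidates around which summary sentences can be built; the
parsed_sentences input here is assumed to come from an upstream parser:

    from collections import defaultdict

    # Illustrative input: each sentence already parsed into (head, dependent) pairs.
    parsed_sentences = [
        [("approved", "council"), ("approved", "budget")],
        [("includes", "budget"), ("includes", "funding")],
    ]

    # Merge the per-sentence dependencies into one web keyed by word.
    web = defaultdict(set)
    for edges in parsed_sentences:
        for head, dep in edges:
            web[head].add(dep)
            web[dep].add(head)

    # Heavily connected words are natural anchors for summary sentences.
    for word, neighbors in sorted(web.items(), key=lambda kv: -len(kv[1])):
        print(word, sorted(neighbors))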
There will be a significant amount of research required for this thesis, because
little of this fascinating subject is taught at Clarkson. I will need to research and decide
on a model for the part-of-speech tagger that I will implement and combine with the
summarizer. I will need to look to linguistics for how to translate tenses and forms
into relational meaning, and to psychology for a better idea of how memory is organized
and how that translates to the creation of sentences. I will need a solid
understanding of both Natural Language Processing and its statistical techniques, and
reference implementations of several other summarizers for comparison. In the end, I will
evaluate the summarized documents with both a statistical analysis and a survey,
compared against a number of other selected automatic summarization methods.
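For reference, this is the kind of output a part-of-speech tagger produces; the snippet below
uses NLTK's off-the-shelf tagger purely as an example, not as the tagger this thesis will
implement or select:

    import nltk

    # One-time model downloads (resource names may differ in newer NLTK releases).
    nltk.download("punkt", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)

    tokens = nltk.word_tokenize("The senator introduced a new bill on Tuesday.")
    print(nltk.pos_tag(tokens))
    # [('The', 'DT'), ('senator', 'NN'), ('introduced', 'VBD'), ('a', 'DT'), ...]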
Timeline
March
  - Delve heavily into basic Natural Language Processing
  - Continue the literature search
April
  - Move towards the statistical techniques in NLP
  - Start searching for a part-of-speech tagger
May
  - Research more in depth into current summarization techniques
  - Find implementations for comparisons
  - Find a number of corpora to test on
  - REQUIREMENTS
June
  - Decide on the algorithm for this summarizer
  - Implement the part-of-speech tagger
  - DESIGN
July
  - Implement the summarizer
  - Refine the algorithm as necessary
  - IMPLEMENT -> DESIGN
August
  - Gather preliminary data
  - Present at the SURE conference
  - TEST
September
  - Analyze initial data
  - Suggest improvements
October
  - Implement improvements
  - Gather more data; produce the survey and send it out
November
  - Analyze the data
  - See if the improvements helped
December
  - Try the summarizer on non-news articles
  - Write the first draft of the thesis
January
  - Continue revising the thesis
  - Create a presentation