
Department of Computer Engineering
System Requirements Document
By:
Bar Dromi
Yevgeni Krapivin
Table of Contents

Revision table
Distribution table
Introduction
    Description of problem
    Importance of solving the problem
    Related work
Multilingual Sentence Extractor
    Proposed updates
    Workflow
        Text input
        Text Pre-processing
        Feature Extraction
        Summarization
        Training
        Representation
Functional Requirements
    Input consumption
    Text pre-processing
    Statistics gathering
    Model construction
        Structure model
        Vector-Space model
        Graph model
    Feature extraction
    Linear-combination using a weighting vector
    Output representation
    Training
        Quality Assurance
Non-Functional Requirements
    Portability
    Modularity
    Extensibility
    Performance
System Requirements
    Operating system
    Java Virtual Machine (JRE)
Hardware Requirements
Development Requirements
    Environment
        Operating System
        Editor
        Java Development Kit
        Source Control
        Maven
        Testing
Diagrams
    Class diagrams
    Sequence
    Use case
Figure Table
References
Revision table

| Revision # | Revision Date | Description of Change | Author |
|------------|---------------|-----------------------|--------|
| 1 | 28/11/2013 | Original document conception | Yevgeni Krapivin, Bar Dromi |
| 2 | 30/11/2013 | Added revision and distribution tables; added performance section to non-functional requirements | Yevgeni Krapivin |
| 3 | 2/12/2013 | Added proposed features and optimizations; diagrams | Bar Dromi, Eugene Krapivin |
Distribution table

| Recipient Name | Recipient Organization | Distribution Method |
|----------------|------------------------|---------------------|
| Dr. Litvak Marina | Project consultant | e-mail |
Introduction
Description of problem
In a world of ever-growing data, finding the relevant information becomes a genuinely time-consuming problem. Every year researchers submit new articles that sometimes contain duplicate data and almost always contain irrelevant data.
Summarizing texts in exotic languages is a major problem. Most summarization algorithms need golden standards to "learn" how to summarize a language, or use language-specific features, making those algorithms language-dependent.
Importance of solving the problem
Solving this problem is important because of the unstoppable growth of data in the coming years and decades. Moreover, the ability to summarize information in exotic languages is very important to the intelligence community.
Related work
Extractive summarization aims to select a subset of the most relevant fragments, which can be paragraphs, sentences, key phrases, or keywords, from a given source text. The extractive summarization process usually involves ranking, in which each fragment of the summarized text gets a relevance score, and extraction, during which the top-ranked fragments are extracted and arranged in a summary in the same order they appeared in the original text. Statistical methods for calculating the relevance score of each fragment can rely on information such as the fragment's position inside the document, its length, and whether it contains keywords or title words. [1]
In research conducted by Dr. Litvak [1] we see the use of 31 different linguistic features to rank sentences. The 31 scores are combined using a weighting function whose weights come from a genetic algorithm [2] that finds the optimal weight ratios providing the best summarization (according to evaluations).
Figure 1: Original MUSE features taxonomy
Most current algorithms for sentence extraction use language-specific features, rendering them useless on a large variety of languages unless they are retrained specifically for the needed language.
Retraining the algorithms may not even be possible because of a tight dependency on the language. Retraining some algorithms requires preparing a special corpus of texts, called a "golden standard", with a corresponding set of human-made summaries. The creation of such a golden standard is very expensive and time-consuming, and in cases of rare and exotic languages may not even be possible due to insufficient manpower.
Multilingual Sentence Extractor
The MUltilingual SEntence extractor algorithm (MUSE) tries to deal with some of the problems presented previously.
MUSE is a language-independent algorithm (hence "multilingual") that can be trained once and then used successfully on a variety of languages, providing high-standard results.
The original MUSE algorithm [3] described in Dr. Litvak's paper contains 31 features [4]. Those features are divided into three primary groups:
- Structural features – concerned with the placement of sentences in the text and the length of those sentences.
- Vector-space features – concerned with the term frequencies in the sentence.
- Graph features – using a graph representation of the vector space to apply graph-specific algorithms.
Using a genetic algorithm [2] to create a linear combination of the rankings given by each of the features, MUSE selects the top-ranked sentences and restores their original order, creating a summary composed entirely of the original sentences of the text.
Figure 2: Original MUSE workflow
The same linear combination can potentially be used on other languages without retraining the algorithm, giving MUSE an advantage over other summarization algorithms.
The main problem with the MUSE algorithm is its runtime, due to a non-parallelized implementation.
The aim of our project and research is to find ways to parallelize the algorithm, giving it a better implementation, decreasing runtime and increasing the amount of work done in a single run. The research part of the project aims to improve the quality of the summaries in popular languages such as English by adding language-specific features to the summarization of English texts.
Proposed updates
Introduction of new features
We propose adding new language-dependent features to MUSE. We believe that language-dependent features will improve the quality of the summary.
Features depending upon lemmatization

| n | Feature | Source | Taxonomy |
|---|---------|--------|----------|
| 1 | KEY | Edmundson (1969) | Vector-based |
| 2 | COV | Liu et al. (2006a) | Vector-based |
| 3 | TF | Vanderwende et al. (2007) | Vector-based |
| 4 | TFISF | Neto et al. (2000) | Vector-based |
| 5 | SVD | Steinberger and Jezek (2004) | Vector-based |
| 6 | TITLE_O | Edmundson (1969) | Vector-based |
| 7 | TITLE_J | Edmundson (1969) | Vector-based |
| 8 | TITLE_C | Edmundson (1969) | Vector-based |
| 9 | D_COV_O | Litvak et al. (2010b) | Vector-based |
| 10 | D_COV_J | Litvak et al. (2010b) | Vector-based |
| 11 | D_COV_C | Litvak et al. (2010b) | Vector-based |
| 12 | LUHN_DEG | Litvak et al. (2010b) | Graph-based |
| 13 | KEY_DEG | Litvak et al. (2010b) | Graph-based |
| 14 | COV_DEG | Litvak et al. (2010b) | Graph-based |
| 15 | DEG | Litvak et al. (2010b) | Graph-based |
| 16 | TITLE_E_O | Litvak et al. (2010b) | Graph-based |
| 17 | TITLE_E_J | Litvak et al. (2010b) | Graph-based |
| 18 | D_COV_E_O | Litvak et al. (2010b) | Graph-based |
| 19 | D_COV_E_J | Litvak et al. (2010b) | Graph-based |

Figure 3: Proposed new lemmatization features
Features depending upon Named Entity

| Feature | Description | Formula | Based on |
|---------|-------------|---------|----------|
| NE_LOC | Number of Location NEs in the sentence | | |
| NE_PER | Number of Person NEs in the sentence | | |
| NE_DATE | Number of Date-Time NEs in the sentence | | |
| NE_QU | Number of Quantitative NEs in the sentence | | |
| NE | Total number of NEs in the sentence | \|Named_Ent(S)\| | |
| NE_LOC_RAT | Ratio of Location NEs to all words in the sentence | | |
| NE_PER_RAT | Ratio of Person NEs to all words in the sentence | | |
| NE_DATE_RAT | Ratio of Date-Time NEs to all words in the sentence | | |
| NE_QU_RAT | Ratio of Quantitative NEs to all words in the sentence | | |
| NE_RAT | Ratio of the total number of NEs to all words in the sentence | \|Named_Ent(S)\| / N | |
| NE_COV | Ratio of NEs in the sentence to NEs in the document | \|Named_Ent(S)\| / \|Named_Ent(D)\| | COV |
| NE_TFISF | Sum of tf·isf over the NEs, where isf(NE) = 1 − log(n(NE)) / log(n) and n(NE) is the number of sentences containing NE | Σ_{NE∈S} tf(NE) · isf(NE) | TFISF |
| NE_FRQ_SUM | Sum of the NE frequencies | Σ_{NE∈S} tf(NE) | KEY |
| NE_TITLE | Number of NEs mentioned in the title | | |
| NE_DEG_SUM | Sum of the NE degrees | Σ_{NE∈S} Deg(NE) | KEY_DEG |

Figure 4: Proposed new NE features
Notations description:

| Notation | Description |
|----------|-------------|
| NE | Named entity |
| S | Sentence |
| D | Document |
| i | Sentence index |
| n | Number of sentences in the document |
| N | Number of words in the sentence |
| tf(NE) | In-document frequency of the named entity |
| Deg(NE) | Degree of the named-entity node |

Figure 5: Notations
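For concreteness, the NE_TFISF feature from the table above can be sketched in Java. The class and method names are ours, not part of the MUSE codebase; the formulas follow the table directly.

```java
import java.util.*;

// Sketch of the NE_TFISF feature: isf(NE) = 1 - log(n(NE)) / log(n), where
// n(NE) is the number of sentences containing NE and n is the number of
// sentences in the document. The sentence score sums tf(NE) * isf(NE) over
// the named entities appearing in the sentence.
class NeTfisf {

    // Inverse sentence frequency of a named entity.
    static double isf(int sentencesContainingNe, int totalSentences) {
        return 1.0 - Math.log(sentencesContainingNe) / Math.log(totalSentences);
    }

    // tf maps each NE to its in-document frequency; neSentences maps each NE
    // to the set of sentence indices in which it occurs.
    static double score(Set<String> nesInSentence,
                        Map<String, Integer> tf,
                        Map<String, Set<Integer>> neSentences,
                        int totalSentences) {
        double sum = 0.0;
        for (String ne : nesInSentence) {
            sum += tf.get(ne) * isf(neSentences.get(ne).size(), totalSentences);
        }
        return sum;
    }
}
```

Note that an NE occurring in every sentence gets isf = 0 and contributes nothing, mirroring the inverse-frequency intuition of TFISF.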
Optimizations
Lemmatization
Lemma (headword) – in morphology and lexicography, a lemma (plural lemmas) is the canonical form, dictionary form, or citation form of a set of words. [5]
Stem – in linguistics, a stem is a part of a word. The term is used with slightly different meanings; in one usage, a stem is a form to which affixes can be attached. [6]
One of the optimizations this project researches and introduces to MUSE is the use of lemmas instead of stems (or in cooperation with stems). This method should find better keywords in the text for all the keyword-based metrics. Using stems loses the connection between keywords that hold the same lexical meaning but differ morphologically. By finding all the lemmas in the text, the keyword-based metrics will be more accurate in rating the sentences.
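The difference can be illustrated with a toy example. This is not the project's implementation (which uses the Stanford NLP toolkit); the tiny hard-coded lemma table and the naive suffix-stripping stemmer below exist purely to show how stemming can split morphological variants of one word while lemmatization merges them.

```java
import java.util.*;

// Toy illustration: "ran" and "running" share the lemma "run", so a
// lemma-based keyword count merges them; a crude stemmer leaves "ran"
// untouched and thus counts them as different keywords.
class LemmaVsStem {

    // Hard-coded lemma table, for demonstration only.
    static final Map<String, String> LEMMAS = new HashMap<>();
    static {
        LEMMAS.put("ran", "run");
        LEMMAS.put("running", "run");
        LEMMAS.put("runs", "run");
    }

    // Naive stemmer: strip a couple of common suffixes.
    static String stem(String w) {
        if (w.endsWith("ning")) return w.substring(0, w.length() - 4);
        if (w.endsWith("s")) return w.substring(0, w.length() - 1);
        return w;
    }

    static String lemma(String w) {
        String l = LEMMAS.get(w);
        return l == null ? w : l;
    }
}
```

Here stem("ran") and stem("running") differ, while lemma("ran") and lemma("running") both yield "run", so a lemma-based metric would credit both occurrences to the same keyword.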
Reduction of IO usage
This version of MUSE uses less IO than the previous version: it reads from the HDD (hard disk drive) only once. All the configuration files and input texts are loaded into memory and stay there for the duration of the run. This means fewer IO interrupts and less time spent waiting on information requests. The downside of this method is that the amount of RAM (random-access memory) needed by the program is increased.
Workflow
The new MUSE workflow resembles the original workflow; the difference is the parallelization of each step.
Figure 6: New MUSE workflow
Text input
In this stage we read/receive the raw texts (no XML or other annotations) and store them in RAM (memory) for fast access and consumption.
Text Pre-processing
In this stage the pre-processor splits each text into sentences and tokenizes the sentences into separate words. Those words are stemmed, lemmatized and tagged. All these pre-processing steps are done by the Stanford NLP toolkit.
In this stage, data about the text, the sentences and the tokens is collected and saved for later use in the feature extraction stage.
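As a dependency-free illustration of the splitting step only (the project itself delegates this, together with tokenization, stemming, lemmatization and tagging, to the Stanford NLP toolkit), the JDK's own `java.text.BreakIterator` can split text into sentences:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Minimal sentence-splitting sketch using only the JDK. Shown to illustrate
// the pre-processing stage; the real pipeline uses Stanford NLP instead.
class SentenceSplitter {

    static List<String> split(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String s = text.substring(start, end).trim();
            if (!s.isEmpty()) sentences.add(s);
        }
        return sentences;
    }
}
```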
Feature Extraction
In this stage of the workflow each sentence in each text gets its rankings according to the data collected in the pre-processing stage. At the end of the stage each sentence has as many rankings as there are features involved in the feature extraction process.
Summarization
In this stage the algorithm takes the rankings of each sentence and creates a single rank
using the linear combination created in the training process.
Using the ranking of each sentence the algorithm then selects the top ranked sentences and
uses them as the summarization of the text.
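The selection step described above (pick the top-ranked sentences, then restore their original order) can be sketched as follows; the names are ours, not MUSE's:

```java
import java.util.*;

// Sketch of sentence selection: given one combined rank per sentence, pick
// the k top-ranked sentence indices and emit them in their original order,
// so the summary reads in document order.
class Selector {

    static List<Integer> topKInOriginalOrder(double[] ranks, int k) {
        Integer[] idx = new Integer[ranks.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        // Sort indices by descending rank and keep the first k ...
        Arrays.sort(idx, (a, b) -> Double.compare(ranks[b], ranks[a]));
        List<Integer> chosen = new ArrayList<>(Arrays.asList(idx).subList(0, k));
        // ... then restore the original sentence order.
        Collections.sort(chosen);
        return chosen;
    }
}
```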
Training
In this stage, the trainer uses the output of the feature extraction stage explained earlier. Using a genetic algorithm and the quality-evaluation packages ROUGE-1 and ROUGE-2, the trainer creates the linear combination of the given features that yields the highest-quality summaries. In this stage the trainer also needs a golden standard against which to check the quality of the summaries created during training and to evaluate the quality of the linear-combination weights.
Figure 7 : Genetic algorithm workflow
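The shape of that loop can be sketched with a deliberately tiny genetic algorithm. In MUSE the fitness is a ROUGE score against the golden standard; here a stand-in fitness (distance to a fixed target vector) is used only to show the select/crossover/mutate cycle. All names are ours.

```java
import java.util.*;

// Toy GA over weight vectors: rank the population by fitness, keep the top
// half (elitism), refill with mutated crossovers of surviving parents.
class ToyGa {
    static final Random RND = new Random(42); // fixed seed for repeatability

    // Stand-in fitness: negative squared distance to a target vector.
    // In MUSE this would be a ROUGE evaluation of the resulting summaries.
    static double fitness(double[] w, double[] target) {
        double d = 0;
        for (int i = 0; i < w.length; i++) d += (w[i] - target[i]) * (w[i] - target[i]);
        return -d; // higher is better
    }

    static double[] evolve(double[] target, int popSize, int generations) {
        int n = target.length;
        double[][] pop = new double[popSize][n];
        for (double[] w : pop)
            for (int i = 0; i < n; i++) w[i] = RND.nextDouble();
        for (int g = 0; g < generations; g++) {
            // Best first.
            Arrays.sort(pop, (a, b) -> Double.compare(fitness(b, target), fitness(a, target)));
            // Replace the bottom half with children of top-half parents.
            for (int i = popSize / 2; i < popSize; i++) {
                double[] p1 = pop[RND.nextInt(popSize / 2)];
                double[] p2 = pop[RND.nextInt(popSize / 2)];
                for (int j = 0; j < n; j++) {
                    pop[i][j] = (RND.nextBoolean() ? p1[j] : p2[j])   // crossover
                              + 0.05 * (RND.nextDouble() - 0.5);       // mutation
                }
            }
        }
        return pop[0]; // best survivor
    }
}
```

Swapping the stand-in fitness for a ROUGE evaluation of the summaries produced by each candidate weight vector recovers the training loop the figure depicts.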
Representation
The representation of the final outcome of the summarization process can be suited to the needs of the user. The output can be annotated (as needed by the ROUGE evaluation kits), exported as raw *.txt files, uploaded to a server, etc.
The representation of a summary varies with the user's needs: it can be a raw text containing only the sentences extracted from the original text, or a highlighting of the selected sentences within the original text.
Functional Requirements
The MUSE implementation is a pipeline-style framework. Pipelines are constructed using the Parallel-Processing-Pipeline library (GPL license).
Input consumption
The input texts should be raw *.txt files. The presence of annotations of any kind (XML, HTML, JSON, etc.) may make the program behave in unpredictable ways, yield summaries of low quality, or even crash.
Text pre-processing
The text pre-processing is done by using the Stanford NLP toolkit.
Statistics gathering
The pre-processor should gather the following parameters of the text:
- Sentence splitting
- Tokenization
- Stem frequency
- Lemma frequency
- Part-Of-Speech (POS) tagging
- Named-Entity (NE) recognition
- NE frequency
- Inverse term-sentence mapping
- Inverse lemma-sentence mapping
- Inverse NE-sentence mapping
- Sentence vector normal calculation
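The inverse mappings in the list above all share one shape: for each term (or lemma, or NE), the set of sentence indices in which it occurs. A minimal sketch, with names of our choosing:

```java
import java.util.*;

// Inverse term-to-sentence index: term -> sorted set of sentence indices.
// The same structure serves lemmas and named entities.
class InverseIndex {

    static Map<String, Set<Integer>> build(List<List<String>> tokenizedSentences) {
        Map<String, Set<Integer>> index = new HashMap<>();
        for (int i = 0; i < tokenizedSentences.size(); i++) {
            for (String term : tokenizedSentences.get(i)) {
                Set<Integer> s = index.get(term);
                if (s == null) index.put(term, s = new TreeSet<>());
                s.add(i);
            }
        }
        return index;
    }
}
```

Features such as NE_TFISF read n(NE) directly off this index as the size of the entity's sentence set.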
Model construction
Structure model
The structure model depicts the sentence position in the text, the sentence character length
and word count. This is a trivial model, and is created during the statistics gathering.
Vector-Space model
The vector-space model depicts the term-to-sentence relations. Each sentence has a vector built from the terms used in the sentence itself; the terms are taken after stemming or lemmatization.
Graph model
The graph model depicts the term-to-term relations in the whole document and the sentence-to-sentence relations. Two primary graphs are constructed for this model:
- Word graph – essentially a directed bi-gram graph with labeled edges; each label denotes the sentence in which the word progression occurred.
- Sentence graph – a weighted, undirected graph that shows similarity relations between sentences. Each edge in the graph has a weight equal to the cosine similarity between the two sentences. A threshold filters the edges, so non-similar sentences are not connected by an edge at all.
Feature extraction
The original features used in MUSE are denoted and explained in the papers about MUSE by Dr. Litvak et al. [4] [7] [8]
New linguistic features are proposed in this research.
Linear-combination using a weighting vector
The feature extraction provides each sentence with a rank per feature. To find the best-ranked sentences in the text, the MUSE algorithm uses a linear combination of those feature ranks, with a weighting vector that holds a calculated weight for each feature in the combination. The weighting vector will be explained below.
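The combination itself is a dot product of the per-feature scores with the weight vector; a minimal sketch (names are ours):

```java
// Sketch of the weighted linear combination: a sentence's final rank is the
// dot product of its per-feature scores with the trained weight vector.
class Combiner {

    static double combine(double[] featureScores, double[] weights) {
        if (featureScores.length != weights.length)
            throw new IllegalArgumentException("score/weight length mismatch");
        double rank = 0;
        for (int i = 0; i < featureScores.length; i++)
            rank += weights[i] * featureScores[i];
        return rank;
    }
}
```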
Output representation
The algorithm has two different output representations:
- Summary – holds only the selected sentences in their original order.
- Annotated – holds an annotated representation of the summary for quality assurance.
Training
The training of the algorithm is done by a genetic algorithm [2].
The training algorithm is interchangeable with other types of training mechanisms, for example an Artificial Neural Network (ANN).
Quality Assurance
The quality of the summaries is evaluated by the ROUGE-1 and ROUGE-2 suites.
Those suites are used during the training sessions of the algorithm: the genetic algorithm tries different variations of linear combinations, and the ROUGE suites evaluate the resulting summaries against a "golden standard".
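To make the evaluation concrete, a stripped-down ROUGE-1 recall can be sketched as the fraction of reference unigrams (with multiplicity) also found in the candidate summary. The real ROUGE suites add stemming and stop-word options, and ROUGE-2 works on bigrams; this is only an illustration with our own naming.

```java
import java.util.*;

// Simplified ROUGE-1 recall: overlapping unigrams (clipped by candidate
// counts) divided by the number of reference unigrams.
class Rouge1 {

    static double recall(List<String> candidate, List<String> reference) {
        Map<String, Integer> counts = new HashMap<>();
        for (String t : candidate) {
            Integer c = counts.get(t);
            counts.put(t, c == null ? 1 : c + 1);
        }
        int overlap = 0;
        for (String t : reference) {
            Integer c = counts.get(t);
            if (c != null && c > 0) { overlap++; counts.put(t, c - 1); }
        }
        return reference.isEmpty() ? 0 : (double) overlap / reference.size();
    }
}
```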
Non-Functional Requirements
Portability
The MUSE implementation should be portable to different operating systems and machines that meet the system and hardware requirements stated below.
The whole project is written in Java (JDK 7), which is portable to any machine that can run the JVM (Java Virtual Machine). As a data-storage format the project uses JSON strings and structures. This format is widely used on the web for communication, which makes it possible to use MUSE in an internet environment.
Modularity
The stages of the algorithm should be modular and changeable, so that a future user
(developer) could potentially change/add stages (the pre-processing for example) with ease.
Extensibility
The framework should be extensible enough to work with different summarization algorithms, training techniques and output representations.
The MUSE algorithm itself is also extensible and should be able to receive newly written feature calculators without the need to recompile/rebuild the whole project.
Performance
Figure 8 : New MUSE parallelized workflow
The implementation should utilize the multi-core architecture of modern CPUs. Using multiple cores and multiple threads will enable the MUSE algorithm to work on multiple texts at once, or to utilize a pipeline-like processing structure that increases the throughput over the original MUSE implementation.
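The "multiple texts at once" option can be sketched with the JDK's executor framework. The `summarize` stub below is a placeholder, not the real pipeline; the point is that independent documents run as independent tasks on a fixed thread pool.

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of document-level parallelism: each text is summarized in its own
// task on a fixed-size thread pool; results come back in input order.
class ParallelRunner {

    // Placeholder for the real pipeline (pre-process, extract, rank, select):
    // here it just returns the first sentence.
    static String summarize(String text) {
        return text.split("\\.")[0] + ".";
    }

    static List<String> runAll(List<String> texts, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (final String t : texts)
                futures.add(pool.submit(new Callable<String>() {
                    public String call() { return summarize(t); }
                }));
            List<String> out = new ArrayList<>();
            for (Future<String> f : futures) out.add(f.get()); // preserves order
            return out;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

A pipeline-style alternative would instead dedicate stages (pre-processing, feature extraction, selection) to separate threads connected by queues, which is what the Parallel-Processing-Pipeline library mentioned above provides.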
System Requirements
Operating system
The operating system should be 64-bit to support the RAM requirements noted below.
Any OS that has a compatible JRE implementation should suffice. Note that this project will be tested on Windows 7 64-bit Home Premium edition.
Java Virtual Machine (JRE)
JRE 7 64-bit should be installed on the computer.
Hardware Requirements
The hardware on which the project was developed should be treated as the minimal hardware requirement:
CPU: Intel i7 2670QM.
Memory: 8 GB DDR3 1600 MHz.
HDD: at least 350 MB of free space, not including the space needed for input and output.
Development Requirements
Environment
Operating System
The OS should meet the minimal requirements described above.
Editor
Eclipse 4.3 (Kepler) is used in the development of this project. Using the Eclipse editor, the developer can work on either Windows or Linux.
Java Development Kit
JDK 7u45 is used to build this project. Any newer version of JDK 7 should suffice. Note that using JDK 8 could raise warnings due to deprecation.
Source Control
The source control software used in this project is Mercurial, using a BitBucket server as the
repository host.
Maven
Maven is an open-source tool (under the Apache License, version 2) that manages builds and dependencies. This project leverages Maven in multiple ways:
- Dependency resolution
- Building the project
Testing
This project uses the JUnit 4 testing library, an open-source tool under the Eclipse Public License.
Diagrams
This set of diagrams shows a preliminary structure of the project's software implementation.
Class diagrams
Figure 9 : Calculator and Feature base hierarchy
Figure 10 : Document and Summary hierarchy
Sequence
Figure 11: MUSE initiation sequence
Figure 12: MUSE running
Figure 13 : Random stage initiation
Use case
Figure Table
Figure 1: Original MUSE features taxonomy
Figure 2: Original MUSE workflow
Figure 3: Proposed new lemmatization features
Figure 4: Proposed new NE features
Figure 5: Notations
Figure 6: New MUSE workflow
Figure 7: Genetic algorithm workflow
Figure 8: New MUSE parallelized workflow
Figure 9: Calculator and Feature base hierarchy
Figure 10: Document and Summary hierarchy
Figure 11: MUSE initiation sequence
Figure 12: MUSE running
Figure 13: Random stage initiation
References
[1] M. Litvak, H. Lipman, A. Ben Gur, M. Last, S. Kisilevich and D. Keim, "Towards multi-lingual summarization: A comparative analysis of sentence extraction methods on English and Hebrew corpora," in Proceedings of CLIA/COLING 2010, Beijing, 2010.
[2] M. Litvak, M. Last and M. Friedman, "A New Approach to Improving Multilingual Summarization using a Genetic Algorithm," in Proceedings of the Association for Computational Linguistics (ACL), Uppsala, 2010.
[3] M. Litvak and M. Last, "MUSE - A Multilingual Sentence Extractor," in Proceedings of Computational Linguistics-Applications (CL-A'11), Jachranka, 2011.
[4] M. Litvak and M. Last, "Cross-lingual training of summarization systems using annotated corpora in a foreign language," Information Retrieval, vol. 16, no. 5, pp. 629-656, 2013.
[5] Wikipedia, "Lemma (morphology)," 1 July 2013. [Online]. Available: http://en.wikipedia.org/wiki/Lemma_(morphology). [Accessed 2 December 2013].
[6] Wikipedia, "Word stem," 28 October 2013. [Online]. Available: http://en.wikipedia.org/wiki/Word_stem. [Accessed 2 December 2013].
[7] M. Litvak and M. Last, "Multilingual Single-Document Summarization with MUSE," in Proceedings of ACL/MultiLing, Sofia, 2013.
[8] D. Hakkani-Tür, G. Tur, M. Levit, D. Gillick, A. Singla and S. Yaman, "Statistical Sentence Extraction for Multilingual Information Distillation," 2009.
[9] L. Wang, H. Raghavan, V. Castelli, R. Florian and C. Cardie, "A Sentence Compression Based Framework to Query-focused Multi-Document Summarization".
[10] J. Mayfield, P. McNamee and C. Piatko, "Named Entity Recognition using Hundreds of Thousands of Features".
[11] M. Litvak and M. Last, "Language-independent Techniques for Automated Text Summarization," in Web Intelligence and Security, IOS Press, NATO Science for Peace and Security Series, pp. 209-240, 2010.
[12] M. Litvak and M. Last, "Graph-Based Keyword Extraction for Single-Document Summarization," in Proceedings of the 2nd Workshop on Multi-source, Multilingual Information Extraction and Summarization, Manchester, 2008.