Department of Computer Engineering

System Requirements Document

By: Bar Dromi, Yevgeni Krapivin

Table of Contents

- Revision table
- Distribution table
- Introduction
  - Description of the problem
  - Importance of solving the problem
  - Related work
  - Multilingual Sentence Extractor
    - Proposed updates
  - Workflow
    - Text input
    - Text pre-processing
    - Feature extraction
    - Summarization
    - Training
    - Representation
- Functional Requirements
  - Input consumption
  - Text pre-processing
  - Statistics gathering
  - Model construction
    - Structure model
    - Vector-space model
    - Graph model
  - Feature extraction
  - Linear combination using a weighting vector
  - Output representation
  - Training
    - Quality assurance
- Non-Functional Requirements
  - Portability
  - Modularity
  - Extensibility
  - Performance
- System Requirements
  - Operating system
  - Java Virtual Machine (JRE)
- Hardware Requirements
- Development Requirements
  - Environment
    - Operating system
    - Editor
    - Java Development Kit
    - Source control
    - Maven
    - Testing
- Diagrams
  - Class diagrams
  - Sequence
  - Use case
- Figure Table
- References

Revision table

Revision # | Revision Date | Description of Change | Author
1 | 28/11/2013 | Original document conception | Yevgeni Krapivin, Bar Dromi
2 | 30/11/2013 | Added revision and distribution tables; added performance section to non-functional requirements | Yevgeni Krapivin
3 | 2/12/2013 | Added proposed features, optimizations and diagrams | Bar Dromi, Eugene Krapivin

Distribution table

Recipient Name | Recipient Organization | Distribution Method
Dr. Litvak Marina | Project consultant | e-mail

Introduction

Description of the problem

In a world of ever-growing data, finding the relevant information becomes a real, time-consuming problem. Every year researchers submit new articles that sometimes contain duplicate data and almost always contain irrelevant data. Summarizing texts in exotic languages is a major problem: most summarization algorithms need golden standards to "learn" how to summarize a language, or use language-specific features, making those algorithms language dependent.
Importance of solving the problem

Solving this problem is important because of the unstoppable growth of data in the coming years and decades. Moreover, the ability to summarize information in exotic languages is of great value to the intelligence community.

Related work

Extractive summarization is aimed at the selection of a subset of the most relevant fragments, which can be paragraphs, sentences, key phrases, or keywords from a given source text. The extractive summarization process usually involves ranking, in which each fragment of a summarized text gets a relevance score, and extraction, during which the top-ranked fragments are extracted and arranged in a summary in the same order they appeared in the original text. Statistical methods for calculating the relevance score of each fragment can rely on such information as the fragment's position inside the document, its length, and whether it contains keywords or title words. [1]

In research conducted by Dr. Litvak [1] we see the use of 31 different linguistic features to rank sentences. The 31 scores are combined using a weighting function that receives its weights from a genetic algorithm [2], which finds the optimal weight ratio that provides the best summarization (according to evaluations).

Figure 1: Original MUSE features taxonomy

Most of the current algorithms for sentence extraction use language-specific features, rendering them useless on a large variety of languages without being retrained specifically for the needed language. The process of retraining may not even be possible because of a tight dependency on the language. Retraining some algorithms requires preparing a special corpus of texts called a "golden standard" with a corresponding set of human-made summarizations. The creation of such a "golden standard" is very expensive and time-demanding.
In the case of rare and exotic languages it may not even be possible due to insufficient manpower.

Multilingual Sentence Extractor

The MUltilingual Sentence Extractor (MUSE) algorithm tries to deal with some of the problems presented previously. MUSE is a language-independent algorithm (hence "multilingual") that can be trained once and used successfully on a variety of languages, providing high-standard results.

The original MUSE algorithm [3] described in Dr. Litvak's paper contains 31 features [4], divided into three primary groups:

- Structural features – concerned with the placement of sentences in the text and the length of those sentences.
- Vector-space features – concerned with term frequencies in the sentence.
- Graph-space features – using a graph representation of the vector space in order to apply graph-specific algorithms.

Using a genetic algorithm [2] to create a linear combination of the rankings given by each of the features, MUSE selects the top-ranked sentences and reorders them into their original order, creating a summarization comprised entirely of the original sentences of the text.

Figure 2: Original MUSE workflow

The same linear combination could potentially be used on other languages without retraining the algorithm, giving MUSE an advantage over other summarization algorithms.

The problem with the MUSE algorithm is its runtime, due to a non-parallelized implementation. The aim of our project and research is to find ways to parallelize the algorithm, giving it a better implementation, decreasing runtime and increasing the capacity of work done in a single run. The research part of the project is aimed at improving the quality of the summarizations in popular languages such as English by adding language-specific features to the summarization of English texts.
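The rank-combination idea described above can be sketched in Java. This is a minimal illustration of a weighted linear combination of per-feature sentence scores; the class and method names are ours, not taken from the MUSE codebase.

```java
// Sketch of MUSE-style rank combination: each sentence has one score per
// feature, and a trained weighting vector collapses them into a single rank.
public class RankCombiner {

    // scores[i][j] = score of sentence i under feature j.
    // weights[j]   = weight found by the genetic algorithm for feature j.
    public static double[] combine(double[][] scores, double[] weights) {
        double[] ranks = new double[scores.length];
        for (int i = 0; i < scores.length; i++) {
            double sum = 0.0;
            for (int j = 0; j < weights.length; j++) {
                sum += weights[j] * scores[i][j];
            }
            ranks[i] = sum;
        }
        return ranks;
    }

    public static void main(String[] args) {
        double[][] scores = { {1.0, 0.2}, {0.5, 0.9} };
        double[] weights = { 0.7, 0.3 };
        double[] ranks = combine(scores, weights);
        System.out.printf("%.2f %.2f%n", ranks[0], ranks[1]); // 0.76 0.62
    }
}
```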
Proposed updates

Introduction of new features

We propose adding new language-dependent features to MUSE. We believe that language-dependent features will improve the quality of the summary.

Features depending upon lemmatization:

n  | Feature   | Source                       | Taxonomy
1  | KEY       | Edmundson (1969)             | Vector-based
2  | COV       | Liu et al. (2006a)           | Vector-based
3  | TF        | Vanderwende et al. (2007)    | Vector-based
4  | TFISF     | Neto et al. (2000)           | Vector-based
5  | SVD       | Steinberger and Jezek (2004) | Vector-based
6  | TITLE_O   | Edmundson (1969)             | Vector-based
7  | TITLE_J   | Edmundson (1969)             | Vector-based
8  | TITLE_C   | Edmundson (1969)             | Vector-based
9  | D_COV_O   | Litvak et al. (2010b)        | Vector-based
10 | D_COV_J   | Litvak et al. (2010b)        | Vector-based
11 | D_COV_C   | Litvak et al. (2010b)        | Vector-based
12 | LUHN_DEG  | Litvak et al. (2010b)        | Graph-based
13 | KEY_DEG   | Litvak et al. (2010b)        | Graph-based
14 | COV_DEG   | Litvak et al. (2010b)        | Graph-based
15 | DEG       | Litvak et al. (2010b)        | Graph-based
16 | TITLE_E_O | Litvak et al. (2010b)        | Graph-based
17 | TITLE_E_J | Litvak et al. (2010b)        | Graph-based
18 | D_COV_E_O | Litvak et al. (2010b)        | Graph-based
19 | D_COV_E_J | Litvak et al. (2010b)        | Graph-based

Figure 3: Proposed new lemmatization features

Features depending upon named entities:

Feature     | Description                                              | Formula                         | Based on
NE_LOC      | Number of location NE in the sentence                    |                                 |
NE_PER      | Number of person NE in the sentence                      |                                 |
NE_DATE     | Number of date-time NE in the sentence                   |                                 |
NE_QU       | Number of quantitative NE in the sentence                |                                 |
NE          | Total number of NE in the sentence                       | |Named_Ent(S)|                  |
NE_LOC_RAT  | Ratio of location NE to all words in the sentence        |                                 |
NE_PER_RAT  | Ratio of person NE to all words in the sentence          |                                 |
NE_DATE_RAT | Ratio of date-time NE to all words in the sentence       |                                 |
NE_QU_RAT   | Ratio of quantitative NE to all words in the sentence    |                                 |
NE_RAT      | Ratio of total number of NE to all words in the sentence | |Named_Ent(S)| / N              |
NE_COV      | Ratio of NE in the sentence to NE in the document        | |Named_Ent(S)| / |Named_Ent(D)| | COV
NE_TFISF    | NE-based TF-ISF                                          | Σ_{NE∈S} tf(NE) · isf(NE)       | TFISF
NE_FRQ_SUM  | Sum of the NE frequencies                                | Σ_{NE∈S} tf(NE)                 | KEY
NE_TITLE    | Number of NE mentioned in the title                      |                                 |
NE_DEG_SUM  | Sum of NE degrees                                        | Σ_{NE∈S} Deg(NE)                | KEY_DEG

where isf(NE) = 1 − log(n(NE)) / log(n), and n(NE) is the number of sentences containing NE.

Figure 4: Proposed new NE features

Notation description:

Notation | Description
NE       | Named entity
S        | Sentence
D        | Document
i        | Sentence index
n        | Number of sentences in the document
N        | Number of words in the sentence
tf(NE)   | In-document named entity frequency
Deg(NE)  | Degree of the named entity node

Figure 5: Notations

Optimizations

Lemmatization

Lemma (headword) – in morphology and lexicography, a lemma (plural lemmas) is the canonical form, dictionary form, or citation form of a set of words. [5]

Stem – in linguistics, a stem is a part of a word. The term is used with slightly different meanings; in one usage, a stem is a form to which affixes can be attached. [6]

One of the optimizations this project researches and introduces to MUSE is the usage of lemmas instead of stems (or in cooperation with stems). This method should find better keywords in the text for all the keyword-based metrics. Using stems loses the connection between keywords that hold the same lexical meaning but differ morphologically. By finding all the lemmas in the text, the keyword-based metrics will be more accurate in rating the sentences.

Reduction of IO usage

This version of MUSE uses less IO than the previous version: it reads from the HDD (hard disk drive) only once. All the configuration files and text input are loaded into memory and stay there during the runtime. This method uses less IO, hence fewer IO interrupts and less time spent waiting on information requests. The downside of this method is that the amount of RAM needed by the program is increased.

Workflow

The new workflow of MUSE resembles the original workflow; the difference is the parallelization of each step.
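The inverse sentence frequency behind the proposed NE_TFISF feature, isf(NE) = 1 − log(n(NE))/log(n), can be sketched in Java. The data layout (maps of counts) is an assumption made for illustration only.

```java
import java.util.List;
import java.util.Map;

// Sketch of the isf formula from the NE feature table: n is the number of
// sentences in the document and n(NE) the number of sentences containing
// the named entity.
public class NeIsf {

    public static double isf(int sentencesWithNe, int totalSentences) {
        return 1.0 - Math.log(sentencesWithNe) / Math.log(totalSentences);
    }

    // NE_TFISF for one sentence: sum of tf(NE) * isf(NE) over its entities.
    public static double neTfIsf(Map<String, Integer> tfInDoc,
                                 Map<String, Integer> sentencesWithNe,
                                 List<String> entitiesInSentence,
                                 int totalSentences) {
        double sum = 0.0;
        for (String ne : entitiesInSentence) {
            sum += tfInDoc.get(ne) * isf(sentencesWithNe.get(ne), totalSentences);
        }
        return sum;
    }

    public static void main(String[] args) {
        // An entity appearing in 2 of 8 sentences:
        System.out.printf("%.3f%n", isf(2, 8)); // 0.667
    }
}
```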
Figure 6: New MUSE workflow

Text input

In this stage we read/receive the raw texts (no XML or other annotations) and store them in RAM for fast access and consumption.

Text pre-processing

In this stage the pre-processor splits each text into sentences and tokenizes the sentences into separate words. Those words are stemmed, lemmatized and tagged. All of these pre-processing steps are done by the Stanford NLP toolkit. In this stage the data about the text, sentences and tokens is collected and saved for later use in the feature extraction stage.

Feature extraction

In this stage each sentence in each text gets its rankings according to the data collected in the pre-processing stage. At the end of the stage each sentence has as many rankings as there are features involved in the feature extraction process.

Summarization

In this stage the algorithm takes the rankings of each sentence and creates a single rank using the linear combination created in the training process. Using the ranking of each sentence, the algorithm then selects the top-ranked sentences and uses them as the summarization of the text.

Training

In this stage the trainer uses the output of the feature extraction stage explained earlier. Using a genetic algorithm and the quality-evaluation packages ROUGE-1 and ROUGE-2, the trainer creates the linear combination of the given features that yields the highest-quality summarizations. In this stage the trainer also needs a golden standard against which to check the quality of the summarizations created during training and to evaluate the quality of the linear-combination weights.

Figure 7: Genetic algorithm workflow

Representation

The representation of the final outcome of the summarization process can be suited to the needs of the user.
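The summarization stage described above, selecting the top-ranked sentences and restoring their original order, might look like the following sketch. Class and method names are illustrative, not taken from the MUSE implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of the summarization step: pick the k top-ranked sentence indices,
// then emit them in their original document order.
public class TopKSelector {

    public static List<Integer> select(double[] ranks, int k) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < ranks.length; i++) indices.add(i);
        // Highest combined rank first...
        indices.sort(Comparator.comparingDouble((Integer i) -> ranks[i]).reversed());
        List<Integer> top = new ArrayList<>(indices.subList(0, Math.min(k, indices.size())));
        // ...then restore the original sentence order for the summary.
        top.sort(Comparator.naturalOrder());
        return top;
    }

    public static void main(String[] args) {
        double[] ranks = {0.1, 0.9, 0.4, 0.8};
        System.out.println(select(ranks, 2)); // [1, 3]
    }
}
```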
The output can be annotated (as needed by the ROUGE evaluation kits), exported as raw *.txt files, uploaded to a server, etc. The representations of the summarizations vary with the user's needs: summarizations can be raw texts that contain only the extracted sentences from the original text, or a highlighting of the selected sentences in the original text.

Functional Requirements

The MUSE implementation is a pipeline-style framework. Construction of pipelines is done by exploiting the Parallel-Processing-Pipeline library (GPL license).

Input consumption

The input texts should be raw *.txt files. The presence of annotations of any kind (XML, HTML, JSON, etc.) will make the program behave in unpredictable ways, yield summarizations of low quality, and possibly even crash.

Text pre-processing

The text pre-processing is done using the Stanford NLP toolkit.

Statistics gathering

The pre-processor should gather the following statistics for the text:

- Sentence splitting
- Tokenization
- Stem frequency
- Lemma frequency
- Part-of-speech (POS) tagging
- Named-entity (NE) recognition
- NE frequency
- Inverse term-sentence mapping
- Inverse lemma-sentence mapping
- Inverse NE-sentence mapping
- Sentence vector norm calculation

Model construction

Structure model

The structure model depicts the sentence position in the text, the sentence character length and word count. This is a trivial model and is created during the statistics gathering.

Vector-space model

The vector-space model depicts the term-to-sentence relations. Each sentence has a vector built of the terms used in the sentence itself. The terms should be taken after stemming or lemmatization.

Graph model

The graph model depicts the term-to-term relations in the whole document and the sentence-to-sentence relations.
Two primary graphs are constructed for this model:

- Word graph – essentially a directed bi-gram graph with labeled edges; each label denotes the sentence in which the word progression was made.
- Sentence graph – a weighted, undirected graph that shows similarity relations between sentences. Each edge carries the cosine similarity between the two sentences as its weight. A threshold filters the edges, so non-similar sentences are not connected by an edge at all.

Feature extraction

The original features used in MUSE are denoted and explained in Dr. Litvak et al.'s papers about MUSE. [4] [7] [8] New lingual features are introduced in this research.

Linear combination using a weighting vector

The feature extraction provides each sentence with a rank per feature. To find the best-ranked sentences in the text, the MUSE algorithm uses a linear combination of those ranks, with a weighting vector that holds a calculated weight for each feature in the linear combination. The weighting vector is explained below.

Output representation

The algorithm has two different output representations:

- Summary – holds only the selected sentences in their original order.
- Annotated – holds an annotated representation of the summary for quality assurance.

Training

The training of the algorithm is done by a genetic algorithm [2]. The training algorithm is interchangeable with other types of training mechanisms, for example an artificial neural network (ANN).

Quality assurance

The quality of the summarizations is evaluated by the ROUGE-1 and ROUGE-2 suites, which are used during the training sessions of the algorithm. The genetic algorithm tries different variations of linear combinations; the ROUGE suites test the results of those combinations and evaluate them with the use of a golden standard.
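The training loop above can be illustrated with a toy genetic algorithm that evolves the weighting vector. The fitness function here is a pluggable stand-in for the ROUGE evaluation against the golden standard; none of the names, operators or parameters are taken from the real trainer.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.function.ToDoubleFunction;

// Toy sketch of GA-based weight training: keep the fitter half of the
// population, breed and mutate it to replace the weaker half, repeat.
public class WeightTrainer {

    public static double[] evolve(int dim, int population, int generations,
                                  long seed, ToDoubleFunction<double[]> fitness) {
        Random rng = new Random(seed);
        double[][] pool = new double[population][dim];
        for (double[] w : pool)
            for (int j = 0; j < dim; j++) w[j] = rng.nextDouble();

        for (int g = 0; g < generations; g++) {
            // Rank the population by fitness, best first.
            Arrays.sort(pool, (a, b) ->
                Double.compare(fitness.applyAsDouble(b), fitness.applyAsDouble(a)));
            // Replace the worst half with mutated crossovers of the best half.
            for (int i = population / 2; i < population; i++) {
                double[] p1 = pool[rng.nextInt(population / 2)];
                double[] p2 = pool[rng.nextInt(population / 2)];
                for (int j = 0; j < dim; j++) {
                    pool[i][j] = (rng.nextBoolean() ? p1[j] : p2[j])
                               + rng.nextGaussian() * 0.05; // mutation
                }
            }
        }
        Arrays.sort(pool, (a, b) ->
            Double.compare(fitness.applyAsDouble(b), fitness.applyAsDouble(a)));
        return pool[0];
    }

    public static void main(String[] args) {
        // Stand-in fitness: prefer weights close to (0.8, 0.2). In the real
        // system this would be a ROUGE score of the resulting summaries.
        ToDoubleFunction<double[]> f =
            w -> -((w[0] - 0.8) * (w[0] - 0.8) + (w[1] - 0.2) * (w[1] - 0.2));
        double[] best = evolve(2, 20, 50, 42L, f);
        System.out.printf("%.2f %.2f%n", best[0], best[1]);
    }
}
```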
Non-Functional Requirements

Portability

The MUSE implementation should be portable to different operating systems and machines that meet the system and hardware requirements stated below. The whole project is written in Java (JDK 7), which is portable to any machine that can run the JVM (Java Virtual Machine). As a data-storage format the project uses JSON strings and structures. This format is widely used on the web for communication, giving the option to use MUSE in an internet environment.

Modularity

The stages of the algorithm should be modular and changeable, so that a future user (developer) could change or add stages (the pre-processing, for example) with ease.

Extensibility

The framework should be extensible enough to work with different summarization algorithms, training techniques and output representations. The MUSE algorithm itself is extensible and should be ready to receive newly written feature calculators without the need to recompile/rebuild the whole project.

Performance

Figure 8: New MUSE parallelized workflow

The implementation should utilize the multi-core architecture of modern CPUs. Using multiple cores and multiple threads will enable the MUSE algorithm to work on multiple texts at once, or to utilize a pipeline-like processing structure, increasing the throughput over the original MUSE implementation.

System Requirements

Operating system

The operating system should be 64-bit in order to address the RAM requirements noted below. Any OS that has a compatible JRE implementation should suffice. Note that this project will be tested on Windows 7 64-bit Home Premium edition.

Java Virtual Machine (JRE)

JRE 7 64-bit should be installed on the computer.
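The multi-core usage required by the Performance section above could be sketched with a standard thread pool, where each task processes one text. The summarize stub and all names here are placeholders for the real pipeline stages, not the actual MUSE code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: a fixed thread pool processes many texts at once.
public class ParallelRunner {

    static String summarize(String text) {
        return text.toUpperCase(); // placeholder for the real pipeline stages
    }

    public static List<String> runAll(List<String> texts, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String t : texts) futures.add(pool.submit(() -> summarize(t)));
            List<String> out = new ArrayList<>();
            for (Future<String> f : futures) out.add(f.get());
            return out;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(runAll(Arrays.asList("one text", "another text"), 4));
    }
}
```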
Hardware Requirements

The hardware on which the project was implemented should be used as the minimal hardware requirements:

- CPU: Intel i7 2670QM.
- Memory: 8 GB DDR3 1600 MHz.
- HDD: at least 350 MB of free space, not including the space needed for input and output.

Development Requirements

Environment

Operating System

The minimal needed version of the OS is as described above.

Editor

Eclipse 4.3 (Kepler) is used in the development of this project. Using the Eclipse editor the developer can work on either Windows or Linux.

Java Development Kit

JDK 7u45 is used to build this project. Any newer version of JDK 7 should suffice. Note that using JDK 8 could potentially raise warnings due to deprecation.

Source Control

The source control software used in this project is Mercurial, using a BitBucket server as the repository host.

Maven

Maven is an open-source build tool (under the Apache license, version 2). This project leverages Maven in multiple ways:

- Dependency resolution
- Build tool

Testing

This project uses the jUnit4 testing library, an open-source tool under the Apache v2 license.

Diagrams

This set of diagrams shows a preliminary structure of the project's software implementation.

Class diagrams

Figure 9: Calculator and Feature base hierarchy

Figure 10: Document and Summary hierarchy

Sequence

Figure 11: MUSE initiation sequence

Figure 12: MUSE running

Figure 13: Random stage initiation

Use case

Figure Table

- Figure 1: Original MUSE features taxonomy
- Figure 2: Original MUSE workflow
- Figure 3: Proposed new lemmatization features
- Figure 4: Proposed new NE features
- Figure 5: Notations
- Figure 6: New MUSE workflow
- Figure 7: Genetic algorithm workflow
- Figure 8: New MUSE parallelized workflow
- Figure 9: Calculator and Feature base hierarchy
- Figure 10: Document and Summary hierarchy
- Figure 11: MUSE initiation sequence
- Figure 12: MUSE running
- Figure 13: Random stage initiation

References

[1] M. Litvak, H. Lipman, A. Ben Gur, M. Last, S. Kisilevich and D.
Keim, "Towards multi-lingual summarization: A comparative analysis of sentence extraction methods on English and Hebrew corpora," in Proceedings of CLIA/COLING 2010, Beijing, 2010.

[2] M. Litvak, M. Last and M. Friedman, "A New Approach to Improving Multilingual Summarization Using a Genetic Algorithm," in Proceedings of the Association for Computational Linguistics (ACL), Uppsala, 2010.

[3] M. Litvak and M. Last, "MUSE – A Multilingual Sentence Extractor," in Proceedings of Computational Linguistics-Applications (CL-A'11), Jachranka, 2011.

[4] M. Litvak and M. Last, "Cross-lingual training of summarization systems using annotated corpora in a foreign language," Information Retrieval, vol. 16, no. 5, pp. 629-656, 2013.

[5] Wikipedia, "Lemma (morphology)," 1 July 2013. [Online]. Available: http://en.wikipedia.org/wiki/Lemma_(morphology). [Accessed 2 December 2013].

[6] Wikipedia, "Word stem," 28 October 2013. [Online]. Available: http://en.wikipedia.org/wiki/Word_stem. [Accessed 2 December 2013].

[7] M. Litvak and M. Last, "Multilingual Single-Document Summarization with MUSE," in Proceedings of ACL/MultiLing, Sofia, 2013.

[8] D. Hakkani-Tür, G. Tur, M. Levit, D. Gillick, A. Singla and S. Yaman, "Statistical Sentence Extraction for Multilingual Information Distillation," 2009.

[9] L. Wang, H. Raghavan, V. Castelli, R. Florian and C. Cardie, "A Sentence Compression Based Framework to Query-Focused Multi-Document Summarization."

[10] J. Mayfield, P. McNamee and C. Piatko, "Named Entity Recognition Using Hundreds of Thousands of Features."

[11] M. Litvak and M. Last, "Language-independent Techniques for Automated Text Summarization," Web Intelligence and Security, IOS Press, NATO Science for Peace and Security Series, pp. 209-240, 2010.

[12] M. Litvak and M.
Last, "Graph-Based Keyword Extraction for Single-Document Summarization," in Proceedings of the 2nd Workshop on Multi-source, Multilingual Information Extraction and Summarization, Manchester, 2008.