Brittany Johnson - Old Dominion University

Running head: Lab 2 – LASI Prototype Product Specification
Lab 2 – LASI Prototype Product Specification
Red Team
Brittany Johnson
CS411W
Janet Brunelle
April 8, 2013
Version 1
Table of Contents
1 Introduction
1.1 Purpose
1.2 Scope
1.3 Definitions, Acronyms, and Abbreviations
1.4 References
1.5 Overview
2 General Description
2.1 Prototype Architecture Description
2.2 Prototype Functional Description
List of Figures
Figure 1. Prototype Major Functional Component Diagram
Figure 2. GUI Site Map
Figure 3. Word
Figure 4. Phrase
List of Tables
Table 1. Feature comparison between full product and prototype
1 Introduction
Linguistic analysis is the contextual study of written works and of how words combine
to form an overall meaning. Themes are the subject-object-verb relationships that help the
reader comprehend and summarize what has been read. The complexity of a topic and the
reader’s familiarity with it play an important role in comprehension, and the reader’s
comprehension, along with the ability to summarize the material, is important in being able to
communicate the content of a document. Reaching a conclusion becomes even more difficult as
the number of documents increases, because the theme across all of the documents may not be
the theme of each individual document. Thus, it is often difficult for people to identify a
common theme over a large set of documents in a timely, consistent, and objective manner.
LASI will be a decision support tool to assist users in determining common themes across
multiple documents.
1.1 Purpose
LASI stands for Linguistic Analysis for Subject Identification. It is a stand-alone
theme-finding application conceived by the Old Dominion University CS410 Red Group. It is
designed to be a decision support tool for large, multi-document linguistic analysis and to
allow for more accurate and consistent results. LASI will be able to detect themes across many
documents and can provide both individual and cross-document analysis to determine a single
theme.
LASI’s ability to analyze multiple documents to find a common theme makes it a useful
decision support tool for teachers, students, research analysts, and others who need to read
through large sets of documents on a frequent basis. Teachers, for example, would be able to use
LASI for an initial analysis of student papers to check whether each paper is consistent with
its assigned topic. Both students and research analysts could use LASI to quickly assess the
usefulness of scientific and literary publications for the topic that they are researching.
1.2 Scope
Prototype features will differ from those of the real-world product in scale. Some features
will be eliminated from the prototype due to limited development time. A complete list of
features is available in Table 1.
Table 1. Feature comparison between full product and prototype
The types of documents that the LASI prototype accepts have been limited to DOC and
DOCX. Scanned text recognition has been removed from the prototype because there is not enough
time to get the OCR software fully functioning. The prototype will limit the number of
documents that can be added to one project to three to five, and each of those documents is
limited to 10 pages to ensure that the algorithm can function in a timely manner.
Rather than focusing on every part of speech, the LASI prototype will focus on noun-verb
binding. A few of the more complex features, such as user-defined dictionaries, synonym
identification, and content assumption, also did not make it into the prototype.
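The scope limits above can be read as a small set of checks made when a project is created. The following is a minimal sketch in Python, not the prototype's C# code. The names `Document`, `validate_project`, and the constants are hypothetical, and the document limit uses the upper bound of the "three to five" range as an assumption.

```python
from dataclasses import dataclass
from pathlib import Path

# Limits taken from the scope paragraph; the names are hypothetical.
ALLOWED_EXTENSIONS = {".doc", ".docx"}
MAX_DOCUMENTS = 5   # assumption: upper bound of the "three to five" range
MAX_PAGES = 10


@dataclass
class Document:
    path: str
    page_count: int


def validate_project(documents):
    """Return a list of problems; an empty list means the project is within scope."""
    problems = []
    if len(documents) > MAX_DOCUMENTS:
        problems.append(f"too many documents: {len(documents)} > {MAX_DOCUMENTS}")
    for doc in documents:
        # Compare extensions case-insensitively so "B.DOC" is accepted.
        if Path(doc.path).suffix.lower() not in ALLOWED_EXTENSIONS:
            problems.append(f"{doc.path}: unsupported file type")
        if doc.page_count > MAX_PAGES:
            problems.append(f"{doc.path}: {doc.page_count} pages exceeds {MAX_PAGES}")
    return problems
```

Returning a list of problems rather than raising on the first one would let a GUI report every issue with a project at once.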
1.3 Definitions, Acronyms, and Abbreviations
A.I.D.: Assessment Improvement Design
A.I.D. Process: A process that provides a quantitative and qualitative basis to identify problems
and determine the feasibility of solutions.
Analysis: Detailed examination of the elements or structure of something, typically as a basis for
interpretation.
Document: A document herein refers to a formally written, expository paper which expounds,
via a declarative approach, on a relatively quantifiable issue, goal, or area of research.
Head word: A locally distinct word within a phrase which, by its syntactic associations,
determines the category of the phrase itself.
LASI: Linguistic Analysis for Subject Identification
Linguistic Analysis: The scientific analysis of a language.
Parser: Takes in DOC and DOCX files and converts them to TXT files.
Part of Speech Tagger: Software utility that associates words with the parts-of-speech in a
sentence.
Phrase: An instance of the Phrase class.
Phrase: (Linguistically) A group of words standing together as a conceptual unit.
Phrase Class: The root of the taxonomy of class types which correspond to syntactic roles at the
phrase level and whose instances contain a collection of Words which together represent
a linguistic phrase.
Semantic Analysis: Relating the syntactical structure of words to their language-independent
meanings.
SharpNLP: A natural language processing tool, written in C#, used to parse and tag parts-of-speech.
Strategic Document: Document produced by a client that defines their Goals, Visions and
Missions.
Subject Identification: The process by which the subject matter and thematic content of
documents is determined.
Syntactic Analysis: Identifies key words based on their location in the sentence, rather than their
overall meaning throughout the document.
.TAGGED: The type of file that stores the output of the part-of-speech tagger, containing all
of the text of the document with embedded syntactic annotations.
Theme: Subject-object-verb relationships that LASI is attempting to generate from the input set.
Tag: A label, or the act of attaching a label, that specifies the syntactic role of a selected element
in a document.
Tagged Set: A group of words, whose part of speech and location in a sentence have been
identified by the parser.
WordNet: Compiler and provider of the data files which form the basis for the LASI thesaurus.
Word Class: The root of the taxonomy of class types which correspond to parts-of-speech at the
word level and whose instances encapsulate each occurrence of a textually identified
word.
Word Weight: A numeric value, associated with each syntactically and lexically unique word in
a written work, indicating its significance.
1.4 References
Johnson, Brittany. (2013). Lab 1 – LASI Product Description.
SharpNLP. (n.d.). Retrieved from http://sharpnlp.codeplex.com/
Office binary to open xml. (n.d.). Retrieved from http://b2xtranslator.sourceforge.net/
1.5 Overview
This product specification provides the hardware and software configuration, external
interfaces, capabilities and features of the LASI prototype. The information provided in the
remaining sections of this document includes a detailed description of the hardware, software
and interface of the LASI prototype as well as the key features of the prototype.
2 General Description
The following sections describe the prototype in more detail. Section 2.1 identifies and
describes each architectural component of the prototype. Section 2.2 explains the prototype’s
functional requirements. Lastly, Section 2.3 describes the external interfaces of the prototype.
2.1 Prototype Architecture Description
The architecture for the LASI prototype consists of three major components: a graphical
user interface (GUI), an algorithm, and a file management system. Figure 1 shows a major
functional component diagram of the prototype.
[Figure 1 diagram. Graphical User Interface: Start-up Screen, Create Project View, Project
Preview, In Progress View, Results View. File Management: File Converter, Part-of-Speech
Tagger. Algorithm: Tagged File Parser, Word & Phrase Binder (Subject, Object, Attributive),
Results Aggregator (Individual Documents, All Documents). Flow labels: Create Project,
Documents Returned, Begin Analysis.]
Figure 1. Prototype Major Functional Component Diagram
The first major component is the graphical user interface. The LASI user interface is a
Windows Presentation Foundation (WPF) project that uses XAML to define the structure of the
views and C# to provide the interactivity. The LASI prototype GUI contains a Start-up Screen, a
Create Project View, a Project Preview, an In Progress View, and a Results View.
Figure 2. GUI Site Map
As shown in Figure 2, results can be viewed in three formats: Top Results, Word
Relationships, and Word Count and Weighting. The top results will be represented graphically
based on the user’s preferred chart type. The charting engine used for this feature is part of
the WPF Toolkit. The word relationships will also be displayed for each document, with each
word colorized based on its part-of-speech. This will allow the user to see the
relationships between all of the words and phrases in a document. Results will also be displayed
based on the individual word count and weight, where the displayed weight is produced by the
weighting algorithm. Results can be either printed or exported as PDF, JPG, or PNG.
The second major component is the file management system, which converts files and
invokes the tagger. It contains the file converter and the parts-of-speech tagger. The file
converter that the LASI prototype uses is B2XTranslator, third-party open source software that
can convert DOC and DOCX files into an XML file. The parts-of-speech tagger is SharpNLP, an
open source C# natural language processing tool. The SharpNLP POS tagger tags words and phrases
with their respective parts-of-speech for use by the LASI algorithm. SharpNLP uses the Penn
Treebank parts-of-speech tags to define the parts of speech.
The last major component is the algorithm, which is written in C#. The algorithm, as
shown in Figure 1, contains a Tag Parser, which converts the text into word and phrase types
representative of their parts-of-speech. A Word, in reference to word types, is the root of the
classification of class types which correspond to parts-of-speech at the word level and whose
instances encapsulate each occurrence of a textually identified word. Figure 3 shows all of the
Word types in the LASI prototype. Every word that is tagged by the part-of-speech tagger has a
corresponding Word type.
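To illustrate how tagged text could be turned into Word types, the sketch below is written in Python rather than the prototype's C#. It assumes the tagged text uses a `token/TAG` layout, which the source does not specify. The class names `Noun`, `Verb`, `Adjective`, and `Adverb` and the function `parse_tagged` are hypothetical. Only the tag prefixes (`NN`, `VB`, `JJ`, `RB`) come from the Penn Treebank tag set.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Root of the word-level types; one instance per occurrence in the text."""
    text: str
    tag: str


class Noun(Word):
    pass


class Verb(Word):
    pass


class Adjective(Word):
    pass


class Adverb(Word):
    pass


# Penn Treebank tag prefixes mapped to illustrative (hypothetical) Word types.
PREFIX_TO_TYPE = {"NN": Noun, "VB": Verb, "JJ": Adjective, "RB": Adverb}


def parse_tagged(line):
    """Convert 'token/TAG' items (an assumed layout) into Word instances."""
    words = []
    for token in line.split():
        text, sep, tag = token.rpartition("/")
        if not sep:
            # No tag present: keep the token and leave the tag empty.
            text, tag = token, ""
        word_type = PREFIX_TO_TYPE.get(tag[:2], Word)
        words.append(word_type(text, tag))
    return words
```

Tags outside the four mapped families fall back to the root `Word` type, so every tagged token still gets an instance.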
Figure 3. Word
A Phrase, as shown in Figure 4, is the root of the classification of class types which correspond
to syntactic roles at the phrase level and whose instances contain a collection of Words which
together represent a linguistic phrase. Just as with Word types, every type of phrase that can be
tagged by our part-of-speech tagger is represented.
Figure 4. Phrase
The LASI prototype algorithm binds word and phrase types together based on their
syntactic relationship via a logic flow derived from a state machine. Words and phrases will be
bound together based on the Word or Phrase types described above and how they relate to one
another within phrases, paragraphs, and the document. The weighting algorithm will assign each
word a weight based on its part-of-speech, frequency count, and the number of times and ways it
is referenced. The LASI prototype will focus on subject, object, and attributive binding.
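The source does not give the states or transitions of the binding logic, so the following Python sketch only illustrates the general idea of a state machine that binds subjects and objects to verbs. The three states, the `NP`/`VP` phrase labels, and the function `bind_phrases` are all assumptions made for illustration. Attributive binding is left out.

```python
from enum import Enum, auto


class State(Enum):
    SEEK_SUBJECT = auto()
    SEEK_VERB = auto()
    SEEK_OBJECT = auto()


def bind_phrases(phrases):
    """Bind (kind, text) phrases into (subject, verb, object) triples.

    kind is 'NP' (noun phrase) or 'VP' (verb phrase); other kinds are ignored.
    """
    state = State.SEEK_SUBJECT
    subject = verb = None
    bindings = []
    for kind, text in phrases:
        if state is State.SEEK_SUBJECT and kind == "NP":
            subject, state = text, State.SEEK_VERB
        elif state is State.SEEK_VERB and kind == "NP":
            # A later noun phrase replaces the candidate subject.
            subject = text
        elif state is State.SEEK_VERB and kind == "VP":
            verb, state = text, State.SEEK_OBJECT
        elif state is State.SEEK_OBJECT and kind == "NP":
            bindings.append((subject, verb, text))
            state = State.SEEK_SUBJECT
        elif state is State.SEEK_OBJECT and kind == "VP":
            # The previous verb had no object; the new verb keeps the same subject.
            bindings.append((subject, verb, None))
            verb = text
    if state is State.SEEK_OBJECT:
        # Input ended after a verb with no object.
        bindings.append((subject, verb, None))
    return bindings
```

For example, the phrase sequence "the committee" (NP), "approved" (VP), "the budget" (NP) yields one triple binding the committee as subject and the budget as object of "approved".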
2.2 Prototype Functional Description
The major functional components are shown in Figure 1. A user will interact with the
LASI GUI and create a new project using documents that are of the correct file type and stripped
of all graphics. The user will need to fill out all information required to create a new
project. These actions will result in a new project being created and the document converter
being called. When documents are added to a project in the GUI, the document converter takes
DOC and DOCX files and converts each to an XML file. Once a document is in XML, it is converted
to raw text that can be used by the parts-of-speech tagger.
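The XML-to-raw-text step can be sketched as follows. This Python example assumes the intermediate XML is Office Open XML WordprocessingML, in which text lives in `w:t` elements inside `w:p` paragraphs. The source only says "an XML file", so that layout and the function name `xml_to_raw_text` are assumptions.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace (assumed format of the converter's XML output).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def xml_to_raw_text(xml_string):
    """Join the text runs of each paragraph and return one paragraph per line."""
    root = ET.fromstring(xml_string)
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        runs = [t.text or "" for t in para.iter(f"{{{W_NS}}}t")]
        text = "".join(runs).strip()
        if text:  # skip empty paragraphs
            paragraphs.append(text)
    return "\n".join(paragraphs)
```

Because only text runs are collected, drawings and other non-text elements are dropped. That fits the requirement that documents be stripped of graphics before tagging.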
The user is then taken to the Project Preview, where they can either remove or add
documents. Once analysis has begun, SharpNLP will embed part-of-speech tags into the text
from each document. The tagged file is then passed on to the tagged file reader, which assigns
each word and phrase a Word or Phrase type corresponding to the part of speech given by the
tagger.
Once the word and phrase binding is finished, the algorithm will begin weighting the words
based on their frequency as well as the number of times each is referenced. The weighting
metrics for each word will be based on a raw frequency and a relative frequency. The raw
frequency of a word is based on a simple word count, the number of times that the word was used
in a particular manner, and a frequency count for synonyms of that word. The relative frequency
will be based on the subject, verb, and object relationships between words as well as where a
word is located in a document. Each of the word weights will then be passed on to the GUI
Results View for the user to view.
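The source gives no formula for the weights, so the sketch below shows only one possible way to combine a raw count with a role-based adjustment. The bonus values in `ROLE_BONUS`, the normalization, and the function `weigh_words` are assumptions for illustration. Synonym counts and document position are omitted.

```python
from collections import Counter

# Hypothetical role bonuses; the source does not give numeric values.
ROLE_BONUS = {"subject": 2.0, "object": 1.5, "verb": 1.0}


def weigh_words(occurrences):
    """Map each word to (raw_frequency, weight).

    occurrences: list of (word, role) pairs, where role is a key of
    ROLE_BONUS or None. Weights are normalized so that they sum to 1.
    """
    raw = Counter(word.lower() for word, _ in occurrences)
    bonus = Counter()
    for word, role in occurrences:
        bonus[word.lower()] += ROLE_BONUS.get(role, 0.0)
    total = sum(raw[w] + bonus[w] for w in raw)
    return {w: (raw[w], (raw[w] + bonus[w]) / total) for w in raw}
```

Under these assumptions, a word that appears often as a subject or object outranks a word that appears equally often with no bound role.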