LAB II – Product Specification Outline 1 CS 411W Lab II

advertisement
LAB II – Product Specification Outline
1
CS 411W Lab II
Prototype Product Specification
For
LASI
Prepared by: Erik Rogers
Date: 03.April.2013
Version 1.0
[Type text]
[Type text]
[Type text]
LAB II – Product Specification Outline
2
Table of Contents
(not sure why I can’t change the numerals)
1 Introduction ................................................................................................................... v
1.1 Purpose ................................................................................................................... v
1.2 Scope ..................................................................................................................... vi
1.3 Definitions, Acronyms, and Abbreviations ........................................................... vi
1.4 References ............................................................................................................. ix
1.5 Overview ............................................................................................................... ix
2 General Description ...................................................................................................... x
2.1 Prototype Architecture Description ........................................................................ x
2.2 Prototype Functional Description.......................................................................... xi
2.3 External Interfaces .............................................................................................. xvii
2.3.1 Hardware Interfaces .................................................................................... xviii
2.3.2 Software Interfaces ..................................................................................... xviii
2.3.3 User Interfaces ............................................................................................ xviii
2.3.4 Communications Protocols and Interfaces .................................................. xxii
3 Specific Requirements .............................................................................................. xxii
3.1 Functional Requirements.................................................................................... xxii
3.1.1 User Interface .............................................................................................. xxii
3.1.1.1 Start-up Screen ...................................................................................... xxii
3.1.1.2 Create New Project Screen ................................................................... xxiii
3.1.1.3 Project Preview Screen ......................................................................... xxiii
[Type text]
[Type text]
[Type text]
LAB II – Product Specification Outline
3
Table of Contents (continued)
3.1.1.4 Processing Screen ................................................................................. xxiv
3.1.1.5 Results Screen ...................................................................................... xxiv
3.1.2 File Manager ................................................................................................ xxv
3.1.3 Tagged File Parser ...................................................................................... xxvi
3.1.4 Word Association ....................................................................................... xxvi
3.1.5 Weighting Algorithm ................................................................................. xxvii
3.2 Assumptions and Constraints ............................................................................ xxix
3.2.1 Assumptions .................................................................................................... xxx
3.2.2 Constraints ....................................................................................................... xxx
3.2.3 Dependencies .................................................................................................. xxxi
3.3 Non-Functional Requirements .......................................................................... xxxi
3.3.1 Security ....................................................................................................... xxxi
3.3.2 Maintainability............................................................................................ xxxi
[Type text]
[Type text]
[Type text]
LAB II – Product Specification Outline
iv
List of Figures
Figure 1 - Major Component Model .................................................................................. xi
Figure 2 - LASI's 3-Sector Algorithm Overview .............................................................. xii
Figure 3 - Document Traversal ........................................................................................ xiii
Figure 4 - Word Type Class Diagram .............................................................................. xiv
Figure 5 - Phrase Type Diagram ....................................................................................... xv
Figure 6 - Interface Types ................................................................................................ xvi
Figure 7 - UI - Top Results Tab ....................................................................................... xix
Figure 8 - UI - Word Relationships Tab ........................................................................... xx
Figure 9 - UI - Word Count and Weighting Tab ............................................................. xxi
List of Tables
Table 1. Effects of Assumptions, Dependencies, and Constraints on Requirements ..... xxx
[Type text]
[Type text]
[Type text]
1
1.1
Introduction
Purpose
Linguistic Analysis for Subject Identification (LASI) is a computer application
currently being developed by the CS411 Red Team. Linguistic analysis is the
examination of language form, language meaning, and the ways in which these two
entities synergize to form language context. LASI, a linguistic analyzer that aids the user
as a decision support tool, extracts themes, or specific qualities and characteristics, of a
document or range of documents. Locating the themes of documents is necessary as it
allows the reader to comprehend what has been read; in comprehending what has been
read, the reader can summarize and share the material. LASI will take various document
types as input and return a weighted list of themes in each document individually and any
common themes found over the group of input documents. LASI will not make decisions
for the user directly; rather, it will allow the user to infer information based on the results.
The CS411 Red Team imagines that LASI will not only aid the persons who
presented us with the problem, Dr. Hester and Dr. Meyers (introduced in Lab 1, Section
1), but will also be beneficial to numerous professions. Students will be able to utilize
LASI in order to determine whether publications across the Internet relate to their areas
of study or research paper topics. Teachers could implement LASI in the classroom by
using it to grade student papers or provide examples as to how language is used and
interpreted. Research Analysts and Statisticians should be able to parse numerous
documents in order to quickly locate the topics to crunch data values. Contractors,
Consultants, and other related professions would be able to implement all of the previous
uses to suit their individual or clients’ needs.
1.2
Scope
LASI will implement a handful of features detailed later in this document—a few
of which are advanced in this field of linguistic analysis. LASI will be able to input
multiple documents, of any text file type, and deduce the individual themes of each
document as well as the themes and commonalities between all documents included. The
prototype will allow users to infer important themes without having to manually parse the
document.
1.3
Definitions, Acronyms, and Abbreviations
A.I.D. Process: A process that provides quantitative and qualitative basis to identify
problems and determine the feasibility of solutions.
Analysis: Detailed examination of the elements or structure of something, typically as a
basis for interpretation.
Document: A document herein refers to a formally written, expository paper which
expounds, via a declarative approach, on a relatively quantifiable issue, goal, or
area of research.
Head word: A locally distinct word within a phrase which, by its syntactic associations,
determines the category of the phrase itself.
LASI: Linguistic Analysis for Subject Identification
Linguistic Analysis: The scientific analysis of a language.
Parser: Takes in DOC and DOCX files and converts them to TXT files.
Part of Speech Tagger: Software utility that associates words with the parts-of-speech in
a sentence.
Phrase: An instance of the Phrase class.
Phrase: (Linguistically) A group of words standing together as a conceptual unit.
Phrase Class: The root of the taxonomy of class types which correspond to syntactic roles
at the phrase level and whose instances contain a collection of Words which
together represent a linguistic phrase.
Semantic Analysis: Relating the syntactical structure of words to their language
independent meanings.
Sharp NLP: Written in C#, natural language processing tool used to parse and tag partsof-speech.
Strategic Document: Document produced by a client that defines their Goals, Visions
and Missions.
Subject Identification: The process by which the subject matter and thematic content of
documents is determined.
Syntactic Analysis: Identifies key words based on their location in the sentence, rather
than their overall meaning throughout the document.
.TAGGED: The type of file that stores the output of the part-of-speech tagger containing
the all of the text of the document with embedded syntactic annotations.
Theme: Subject-object-verb relationships that LASI is attempting to generate from the
input set.
Tag: A label, or the act of attaching a label, that specifies the syntactic role of a selected
element in a document.
Tagged Set: A group of words, whose part of speech and location in a sentence have
been identified by the parser.
WordNet: Compiler and provider of the data files which forms the basis for the LASI
thesaurus.
Word Class: The root of the taxonomy of class types which correspond to parts-ofspeech at the word level and whose instances encapsulate each occurrence of a
textually identified word.
Word Weight: A numeric value, associated with each syntactically and lexically unique
word in a written work, indicating its significance.
1.4
References
"WordNet." About - . Princeton University, 27 Dec. 2012. Web. Fall 2012.
<http://wordnet.princeton.edu/>.
"The Stanford NLP (Natural Language Processing) Group." The Stanford NLP (Natural
Language Processing) Group. Stanford University, n.d. Web. Fall 2012.
<http://www-nlp.stanford.edu/>.
Rogers, Erik P. CS411 Red Team - LASI - Lab 1. Paper. Old Dominion University, 2013.
Print.
1.5
Overview
This product specification provides the hardware and software configuration,
external interfaces, capabilities, and features of the LASI prototype. The information in
the remaining sections of this document includes a detailed description of the hardware,
software, and external interface architecture of the LASI prototype; the key features of
the prototype; the parameters that will be used to control, manage, and establish each
feature; and the performance characteristics of each feature in terms of outputs, displays,
and user interaction.
2
2.1
General Description
Prototype Architecture Description
The LASI prototype is composed of two main components—hardware and
software. It will not implement any hardware, rather it will require the use of preexisting hardware, with required minimum technical specifications, in order to run
successfully and within optimal constraints. LASI’s software is branched into two
subcomponents—the algorithm and user interface. The algorithm consists of three
sectors (a primary analysis, secondary analysis, and tertiary analysis). The user interface
will allow for the users to interact and utilize the algorithm while keeping information
that is irrelevant or able to cause confusion within the user encapsulated.
[This space intentionally left blank]
Figure 1 - Major Component Model
The Major Component Model diagram in Figure 1 displays the minimal technical
requirements for hardware (whether it be a physical machine or virtual machine) to run
the software package as well as the separation of the two subcomponents provided within
the software package. LASI’s prototype will require the user to obtain a system with a
quad-core processor, 8 gigabytes of random access memory, and a solid-state storage
drive or better. LASI’s internal calculations and passes will be hidden from the user
through the use of a user interface that will display only the relevant, final results.
2.2
Prototype Functional Description
The entirety of LASI’s algorithm is conceptualized in three phrases for effective
programming. Each of these phases is essential to a proper, expected output. The phaseby-phase breakdown follows.
Figure 2 - LASI's 3-Sector Algorithm Overview
Figure 2 depicts LASI’s 3-sector algorithm overview. Each stage is executed
linearly. The “Primary Analysis” phase will parse input documents word-by-word, tag
each part of speech, and count the frequency of each word and part of speech found. The
“Secondary Analysis” phase will bind pronouns and adjectives to nouns and tag the
respective relationships as phrase types. The “Tertiary Analysis” phase will locate and
relate synonyms based on a set taxonomy, access subject-object-verb relationships and
assign them of a phrase type, parse through each word/phrase relationship and assign
weights, and finally output the final weights to the user interface.
[This space intentionally left blank]
Figure 3 - Document Traversal
The first function that the prototype will undergo is that of parsing the document.
The Document Traversal, presented in Figure X, shows the structure of a document. This
organization scheme allows LASI to divide and conquer documents by splitting them into
sub types. When LASI first parses a document, it will separate the document into
paragraphs, sentences, and words. The CS411 Red Team has included both clause and
phrase types in this class structure for use in future phases.
[This space intentionally left blank]
Figure 4 - Word Type Class Diagram
The CS411 Red Team has implemented SharpNLP’s part of speech tagger to
accomplish part of the parsing phase. SharpNLP parses the documents and tags the parts
of speech in a language that LASI’s algorithm can import and make use of. The Word
type construct, built by the CS411 Red Team upon the class diagram in Figure 5, will
wrap textual tokens, imported from SharpNP’s results, word-by-word and link them by
role. This functionality will allow the prototype’s algorithm to access and manipulate
tokens based on their part of speech and relative use in sentences. Once LASI has linked
each token by role, it will be able to begin the second stage of binding and relating parts
of speech to one another to construct phrases.
Figure 5 - Phrase Type Diagram
Figure X presents the class structure for determining phrase types. Once a
document has passed through the “Primary Phase”, it is eligible for use in the “Secondary
Phrase”. Here, each token in the document will be re-analyzed, syntactically, with the
tokens neighboring it. If there is any relevant relationship between tokens within the
same sentence or paragraph, the relationship will be stored and a phrase type will be
assigned.
[This space intentionally left blank]
Figure 6 - Interface Types
In order to aid the construction of phrases, LASI implements an interface
hierarchy, pictured in Figure X. This interface allows for each independent entity that is
contained within a phrase to assume certain roles based on its personality. Said roles can
be used in the “Tertiary Phase” in hopes to strengthen the syntactic relationships between
phrases and infer levels of semantic relationships.
Once a document enters the “Tertiary Phase”, it will be parsed, again, token-bytoken in search for synonyms. A collection of database files, provided by WordNet, and
arranged by part of speech, are parsed and each token in the database file is compared to
each token in the document. When a match in the same type database file is found, the
token and its synonym will be inspected by their philosophical category--provided by
WordNet. If the token in the document and it’s matched synonym are within the same
category, they will be marked as synonyms.
LASI will then assign each token in the document a weight determined by its part
of speech. The weighting of each token individually will provide an objective standard
for further weighting modifiers. After each token possesses a weight, LASI will search
for token-to-token relationships, defined in the “Secondary Phase”, and phrase
relationships and assign weight modifiers determined by both syntactic and semantic
relationships. Weights of synonyms will be handled separately to ensure a stable
distribution of importance.
The final weights of each phrase will be leveled by the token standardization. Once
the weights are finalized, they will be sorted. The final component of the “Tertiary
Phase” will allow for the user interface to retrieve the results for display.
2.3
External Interfaces
LASI has been designed as a stand-alone and open-source software application that
is to be run on a local (or virtual) machine such as a laptop or desktop computer. In this
regard, external interfaces are limited to standard hardware, provided by the user, and all
third-party software has been incorporated internally. Users will not be required to
purchase hardware specific for the prototype, nor will they need to make use of outside
software.
[This space intentionally left blank]
2.3.1 Hardware Interfaces
No hardware interfaces will be constructed for this prototype. A virtual machine,
hosted by Old Dominion University, will be used in order to demonstrate LASI. LASI
can also be run on a physical machine within the lab.
2.3.2 Software Interfaces
No software interfaces will be required for this prototype to run. LASI contains
all of functions for, and calls to, third-party software internally. Output results can be
exported to Microsoft Applications if the user so chooses, but are not required for LASI
to function.
2.3.3 User Interfaces
One of the key features of LASI is the ability to graphically display the parsed
results in a manageable format. The CS411 Red Team will provide three tabs: Top
Results, Word Relationships, and Word Count and Weighting. These three tabs allow the
user to view the results in three separate modes and provide sub tabs to view each
document individually in addition to viewing the collective results of all documents.
[This space intentionally left blank]
Figure 7 - UI - Top Results Tab
Figure 1 illustrates the Top Results tab. This tab allows the user to view the top
results, in chart form, that are the most likely themes for the document or documents
analyzed. Results are available for each document individually as well as the collection
of documents as a single entity. These results can be exported to Microsoft Office
applications.
[This space intentionally left blank]
Figure 8 - UI - Word Relationships Tab
The Word Relationships, displayed in Figure 2, tab gives users the option to view
the word relationships, including parts of speech, in each document individually. In this
view, users will be able to mouse over each word in order to see the relationships,
individual word statistics, and phrase statistics. A key will be added to this view to match
the color-coding.
[This space intentionally left blank]
Figure 9 - UI - Word Count and Weighting Tab
Figure 3 displays the Word Count and Weighting tab, which will show each
word’s count and weight for each the individual and collective documents. The word
count variables for each word will be represented by a whole number as well as a
percentage of frequency in comparison to other words. Weights will be represented as a
decimal.
LASI enables users the option to print or export their results. Said results will be
exportable to Microsoft Office applications (such as Excel, Word, and PowerPoint) as
well as graphical images (with .JPG, .PNG, etc. extensions) and in PDF format. The
CS411 Red Team hopes that this feature will allow for extended use of the results.
[This space intentionally left blank]
2.3.4 Communications Protocols and Interfaces
No communication protocols or interfaces will be required for the prototype to run.
LASI was built to run without connection to sources external to the machine it is being
executed on. In order to demonstrate LASI’s functionality, Transmission Control
Protocol/Internet Protocol (TCP/IP) over a standard Ethernet connection will be utilized
to access the designated virtual machine.
3
Specific Requirements
The following section describes the specific functional and non-functional
requirements along with the assumptions and constraints of the LASI prototype.
3.1
Functional Requirements
The functional requirements describe the capabilities of the LASI prototype. They
describe what the product must do in order to meet the previously discussed goals and
objectives of the project.
3.1.1 User Interface
The LASI GUI is the way the user interacts with LASI and views the results.
3.1.1.1 Start-up Screen
The Start-up Screen will provide two distinct paths that provide access to LASI’s
functionality. It will allow the user to create or load a project.
1.
The user shall be able to create new projects
2.
The user shall be able to load new projects
3.1.1.2 Create New Project Screen
The Create New Project Screen guides the user through the new project creation
process. It will prompt the user to enter the necessary information. During this process
the screen will display a running list of the documents selected for analysis. The
following functional capabilities shall be provided:
1.
The user shall be able to add files to project
a.
DOC
b.
DOCX
c.
TXT
2.
The user shall not be able to load any other file types
3.
The program shall only allow a maximum of five documents to be loaded into a
single project.
4.
Documents added to the project will be displayed in a document queue.
5.
Documents can be removed from the document queue.
6.
All fields must be correctly filled out to create a new project.
3.1.1.3 Project Preview Screen
The Project Preview Screen will provide a preview of the documents selected
during the Create New Project Screen. It allows the user to verify that the correct
documents have been selected, and remove or add additional documents. This screen
will allow the user to start the analysis process. The following functional capabilities
shall be provided:
1.
LASI shall provide a preview of uploaded documents.
a.
Each tab in the document preview shall display the title of its document
b.
2.
Each tab shall display the text of the document.
The user shall be able to add accepted files to a project.
a.
The Program shall only allow a maximum of 5 files.
3.
The user shall be able to remove documents from a project.
4.
LASI must have at least 1 document to start analyzing.
3.1.1.4 Processing Screen
The Processing Screen will display a moving graphic to show that the system has
not frozen. The user will also be able to interrupt analysis. The following functional
capabilities shall be provided:
1.
2.
The user shall be able to interrupt analysis.
a.
The user shall be returned to the Project Preview Screen.
b.
All temporary data will be discarded.
LASI shall display a visual indication that analysis is still ongoing.
3.1.1.5 Results Screen
The Results Screen will allow the user to toggle between different scopes,
perspectives, and levels of detail. The following functional capabilities shall be provided:
1.
LASI shall render results in multiple views.
a.
The Top Results View shall provide a visualized summary of the analysis.
a.1. The user shall be able to toggle between different graphical views
of the top results.
a.2. This view shall allow the user to toggle between individual and
collective document scopes.
b.
The Word Relationships View shall provide visuals to show all
relationships and bindings of words throughout each document.
c.
Word Count & Weighting View
c.1. The user shall be presented with the quantitative data used in the
analysis.
c.2. This view shall allow the user to toggle between individual and
collective document scopes.
2.
LASI shall be able to export results.
3.1.2 File Manager
The File Manager verifies that the documents loaded into a project are of the
types allowed. It provides file conversion routines to format documents into plain text. It
also manages the tagging process and the resulting TAGGED files. The following
functional capabilities shall be provided:
1.
The file manager shall accept a path to a document.
2.
The file manager shall verify that the document is in one of the following file
formats:
a.
DOC
b.
DOCX
c.
TXT
3.
The file manager must be able to convert a DOC file to DOCX.
4.
The file manager must be able to convert a DOCX file to TXT.
5.
The File Manager shall invoke the SharpNLP tagger to process each TXT file into
a new TAGGED file containing:
6.
a.
The original text of the document.
b.
The part of speech of each word.
c.
The type of every phrase.
The file manager shall provide functionality to backup up the entire project
directory.
3.1.3 Tagged File Parser
The Tagged File Parser loads the TAGGED files and creates a data structure inmemory of the documents. The following functional capabilities shall be provided:
1.
The Tagged File parser shall only accept a TAGGED file.
2.
The Tagged File parser shall create an instance of the Word subclass
corresponding to the annotation imbedded for that word in the TAGGED file.
3.
The Tagged File parser shall create an instance of the Phrase subclass
corresponding to the annotation imbedded for that phrase in the TAGGED file.
3.1.4 Word Association
The Word Association algorithm will associate words and phrases to one another
based on their POS and their syntax within the document. The following functional
capabilities shall be provided:
1.
The Subject binder determines which noun phrases are the subjects of verb
phrases.
2.
The Object binder determines which noun phrases are the direct objects of verb
phrases.
3.
The Object binder determines which noun phrases are the indirect objects of verb
phrases.
4.
The Thesaurus correctly identifies synonyms.
5.
An adjective or adjective phrase describing a noun or noun phrase is bound as a
describer to that noun or noun phrase.
6.
A noun or noun phrase that is the subject of a verb or verb phrase is bound as a
subject to that verb or verb phrase.
7.
A noun or noun phrase that is the direct object of a verb or verb phrase is bound as
a direct object to that verb or verb phrase.
8.
A noun or noun phrase that is the indirect object of a verb or verb phrase is bound
as an indirect object to that verb or verb phrase.
9.
Adverb phrases are associated with the adjective phrases or verb phrases that they
modify.
3.1.5 Weighting Algorithm
The Weighting Algorithm will calculate numeric weights for each Word and
Phrase based on their syntactic associations. Based on this analysis themes are
assembled. The following functional capabilities shall be provided:
1.
It shall each Word instance shall start with an equivalent initial weight.
2.
It shall each Phrase instance shall start with an equivalent initial weight.
3.
It shall update previous weight of each word when encountered again.
4.
It shall update previous weight of each word when the association count is
incremented.
5.
The weighting algorithm shall increase the weight of a Word if a synonym is
encountered.
6.
7.
8.
The weighting algorithm shall increase the weight of a Word if:
a.
A Word is associated with it.
b.
A Phrase is associated with it.
The weighting algorithm shall increase the weight of a Phrase if:
a.
A Word is associated with it.
b.
A Phrase is associated with it.
The algorithm shall exclude weights of commonly used words such as 'the', 'to', a,
etc. on an individual word-by-word basis.
9.
The distance between words and phrases shall be used as a weight modifier.
10. Each Word instance shall store its weight with respect to:
a.
The individual document containing it.
b.
All of the documents in the project.
11. Each Phrase instance shall store its weight with respect to:
a.
The individual document containing it.
b.
All of the documents in the project.
12. There shall be an aggregate result computed across all documents.
[This space intentionally left blank]
3.2
Assumptions and Constraints
The LASI prototype will operate on a set of assumptions and constraints that will
act as boundaries for the prototype functionality. Table 1. contains the full list of
assumptions, constraints, and dependencies for the prototype.
Condition
Document types are
Type
Constraint
limited to DOC, DOCX,
Effect on Requirements
If a document cannot be converted into raw
text, it cannot be accepted.
and TXT.
A project is limited to 5
Constraint
documents.
Documents submitted
This restricts the number of documents to a
testable set.
Assumption
Allows for minimal error checking for the
shall consist of entirely
purposes to developing and demonstrating the
grammatically correct
prototype.
statements.
The host machine will
have a sufficient amount
Assumption
Allows for minimal error checking for the
purposes of demonstrating the prototype.
of RAM and at least one
multicore processor.
The host machine must
have .NET version 4.5
or greater.
Dependency The prototype cannot be demonstrated without
this.
The host machine must
Dependency The prototype is not designed to work on other
be using a 64-bit
operating systems. It lacks a GUI for other
Windows operating
operating systems and is untested.
system.
Table 1. Effects of Assumptions, Dependencies, and Constraints on Requirements
3.2.1 Assumptions
Assumptions with respect to the LASI prototype are being made. First, documents
that are used for analysis are out of LASI’s control. It is expected that the documents
submitted shall consist of entirely grammatically correct statements. This will allow for
minimal error checking for the purposes of developing and demonstrating the prototype.
Second, the host machine is assumed to have sufficient specifications to be able to run the
LAASI prototype.
3.2.2 Constraints
A number of constraints will be used to limit the scope of the prototype to
simplify the development process. First the type of documents that the LASI prototype
can accept has been limited to DOC, DOCX, and TXT. Images cannot be analyzed. This
also means that if a document cannot be converted into raw text, it cannot be accepted.
Second, the prototype will limit the number of document that can be added to one project
to five. This is to insure that the algorithm can function in a timely manner.
3.2.3 Dependencies
There are two dependencies that have been identified for the LASI prototype. The
hardware used for the prototype demonstration must have .NET version 4.5 or greater
installed. The host machine must all be using a 64-bit Windows operating system since
the prototype is not designed to work on other operating systems. It lacks a GUI designed
for other operating systems and is untested in such an environment. ODU servers are
expected to be available to host the LASI components. If the ODU servers are
unavailable, personal hardware would need to be used.
3.3
Non-Functional Requirements
3.3.1 Security
No security requirements are required for the prototype. It is advised that users
keep sensitive documents protected within their personal storage drives. Sensitive
documents that are handled by LASI are only subject to security risks if the user leaves
their personal machine unprotected.
3.3.2 Maintainability
The prototype’s entire functionality can be maintained. The CS411 Red Team
plans to release the software as open-source so that the community at large can continue
to approve upon our intended output or manipulate the current infrastructure in order to
accomplish tasks that LASI was not initially intended for.
Download