Scott Minter - Old Dominion University

Running Head: Lab 2 – LASI Prototype Product Specification
Lab 2 – Prototype Product Specification For LASI
Red Team
Scott Minter
April 2, 2013
Version 1
Table Of Contents
1 Introduction
1.1 Purpose
1.2 Scope
1.3 Definitions, Acronyms and Abbreviations
1.4 References
1.5 Overview
2 General Description
2.1 Prototype Architecture Description
2.2 Prototype Functional Description
2.3 External Interfaces
3 Specific Requirements (See Group Printout)
List of Figures
Example 1: Top Results Screen
Example 2: Word Relationships
Example 3: Word Count and Weighting
Figure 1: Prototype Major Functional Components
1 Introduction
Linguistic Analysis for Subject Identification (LASI) is the name of the Red
Group’s project. LASI is being developed as a requirement in the Professional
Workforce Development I & II courses at Old Dominion University. Linguistic
Analysis, in the scope of LASI, is the contextual study of written works and how the
words combine to form an overall meaning. LASI will be a decision support tool to
assist users in determining common themes across multiple documents. The themes that
LASI produces will be subject-object-verb relationships. Themes are
important because they help the reader to comprehend what has just been read. A reader
who has comprehended the material can then summarize it.
Comprehension and summarization are important because they assist the reader in
communicating the content of the material with other people.
The process of finding common themes across multiple documents may be
lengthy and repetitive. This is due to the depth of understanding needed to identify
themes across all the documents, which may not be the theme of any individual
document. Therefore, it is difficult for people to identify a common theme over a large
set of documents in a timely, consistent and objective manner.
LASI will assist in this area by providing a weighted list of potential themes from
which the user can choose the best fit for their understanding of the material. For LASI
to effectively resolve this societal problem, it will need to find themes accurately, use
system resources efficiently, and provide consistent results.
1.1 Purpose
LASI will be a self-contained, stand-alone piece of software. It will not require a
connection to the Internet to produce accurate results. LASI will be designed to run on a
consumer-level laptop or desktop. LASI will also be designed to be an open source
back-end engine for other projects. The data collected from the analysis performed by
LASI can be used to drive other projects and their respective GUIs.
1.2 Scope
LASI’s ability to identify common themes from multiple documents makes it
useful to anyone who reads over large sets of documents looking for commonality.
Students could use LASI to verify how advantageous certain scientific publications may
be to the topic they are researching. Teachers could use it as an initial analysis tool to
verify that student papers are staying on topic. Like students, research analysts
could use LASI to verify whether different papers and articles address the
specific areas they are researching.
The LASI prototype will demonstrate an ability to analyze documents
syntactically and semantically in order to extract themes from multiple documents. The
analysis of the prototype will be more rudimentary than that of a full working model.
The difference will lie mainly in the depth of its ability to understand and recognize the
relationships of words to each other in the document as a whole.
1.3 Definitions, Acronyms and Abbreviations
A.I.D.: Assessment Improvement Design
A.I.D. Process: A process that provides quantitative and qualitative basis to identify
problems and determine the feasibility of solutions.
Analysis: Detailed examination of the elements or structure of something, typically as a
basis for interpretation.
Document: A document herein refers to a formally written, expository paper which
expounds, via a declarative approach, on a relatively quantifiable issue, goal, or
area of research.
Head word: A locally distinct word within a phrase which, by its syntactic associations,
determines the category of the phrase itself.
LASI: Linguistic Analysis for Subject Identification
Linguistic Analysis: The scientific analysis of a language.
Parser: Takes in DOC and DOCX files and converts them to TXT files.
Part of Speech Tagger: Software utility that associates words with the parts-of-speech in
a sentence.
Phrase: An instance of the Phrase class.
Phrase: (Linguistically) A group of words standing together as a conceptual unit.
Phrase Class: The root of the taxonomy of class types which correspond to syntactic roles
at the phrase level and whose instances contain a collection of Words which
together represent a linguistic phrase.
Semantic Analysis: Relating the syntactical structure of words to their language
independent meanings.
SharpNLP: Natural language processing tool, written in C#, used to parse and tag
parts-of-speech.
Strategic Document: Document produced by a client that defines their Goals, Visions
and Missions.
Subject Identification: The process by which the subject matter and thematic content of
documents is determined.
Syntactic Analysis: Identifies key words based on their location in the sentence, rather
than their overall meaning throughout the document.
Tagged: The type of file that stores the output of the part-of-speech tagger, containing
all of the text of the document with embedded syntactic annotations.
Theme: Subject-object-verb relationships that LASI is attempting to generate from the
input set.
Tag: A label, or the act of attaching a label, that specifies the syntactic role of a selected
element in a document.
Tagged Set: A group of words, whose part of speech and location in a sentence have
been identified by the parser.
WordNet: Compiler and provider of the data files which form the basis for the LASI
thesaurus.
Word Class: The root of the taxonomy of class types which correspond to
parts-of-speech at the word level and whose instances encapsulate each occurrence of a
textually identified word.
Word Weight: A numeric value, associated with each syntactically and lexically unique
word in a written work, indicating its significance.
1.4 References
Hester, P.T., Meyers, T. (2012). Enterprise AID: A performance measurement system
for enterprise assessment, improvement, and design (NCSOSE-TR-12-001).
Norfolk, VA: National Centers for System of Systems Engineering.
1.5 Overview
The product prototype specification provides the hardware and software needs,
algorithm data types, graphical user interfaces, and features of the LASI prototype. The
rest of this document contains an architecture description, a
functionality description, external interface descriptions, functional requirements,
performance requirements, assumptions and constraints, and non-functional
requirements.
2 General Description
The LASI prototype will be able to identify themes. However, it will be doing so
without fully associating all words inside a document. The LASI prototype will attempt
to use lower-level associations such as word counting, POS tagging, adjective-to-noun
binding, and adverb-to-verb binding in order to correctly identify themes. It will leave out more
difficult associations such as pronoun to noun.
2.1 Prototype Architecture Description
The LASI prototype consists of several COTS components, two algorithms and a user
interface. The COTS components are the SharpNLP Part of Speech Tagger, WordNet
Thesaurus data, the WPF Toolkit, and document converters. The two main algorithms are
binding and weighting. The user interface will be discussed in a later section.
The SharpNLP Part of Speech Tagger is software that is used to take in TXT
files, analyze them, and return them with words and phrases tagged with the
corresponding POS (e.g., dog->NOUN). WordNet Thesaurus data allows LASI to identify
synonyms for words. The WPF Toolkit library enables LASI’s charting capabilities for
the GUI. The document converter for LASI is the B2XTranslator, which allows LASI to
take in DOC and DOCX files and then convert them to TXT files.
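To make the tagger's role concrete, the following minimal Python sketch shows the shape of the data exchanged: plain text goes in, and a list of (word, tag) pairs comes out. It does not call SharpNLP. The lookup table TOY_LEXICON, the tag names, and the function tag_text are hypothetical stand-ins used only for illustration; a real tagger relies on a trained model rather than a dictionary.

```python
import re

# Hypothetical toy lexicon; a real tagger such as SharpNLP uses a trained model.
TOY_LEXICON = {
    "the": "DET",
    "big": "ADJ",
    "blue": "ADJ",
    "dog": "NOUN",
    "ran": "VERB",
    "up": "PREP",
    "hill": "NOUN",
}


def tag_text(text):
    """Return a list of (word, tag) pairs; unknown words are tagged 'UNK'."""
    words = re.findall(r"[A-Za-z']+", text)
    return [(w, TOY_LEXICON.get(w.lower(), "UNK")) for w in words]


if __name__ == "__main__":
    # Print each word in the "word->TAG" form used in the text above.
    for word, tag in tag_text("The big blue dog ran up the hill."):
        print(f"{word}->{tag}")
```

The list-of-pairs output is one plausible in-memory form for a tagged document; the actual LASI tagged file format is not specified here.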
The binding algorithm allows LASI to understand how words and phrases relate
to one another. It binds words and phrases that go together in meaningful ways. Consider
the statement “The big blue dog ran up the hill.” In this statement LASI would
bind big and blue to dog because they both describe an aspect of dog. This gives LASI
a more complete understanding of how dog is being used in this instance. The
weighting algorithm will look at various metrics to weight words and phrases by their
relative importance in the document.
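The binding step described above can be sketched as follows, assuming the input is already a list of (word, tag) pairs. The rule used here (attach each run of adjectives to the next noun), the tag names ADJ and NOUN, and the function bind_adjectives are illustrative assumptions, not LASI's actual algorithm.

```python
def bind_adjectives(tagged):
    """Attach each run of adjectives to the noun that follows it.

    `tagged` is a list of (word, tag) pairs. Returns a dict mapping each
    noun to the list of adjectives bound to it. Simplification: repeated
    occurrences of the same noun share one entry.
    """
    bindings = {}
    pending = []  # adjectives seen since the last non-adjective word
    for word, tag in tagged:
        if tag == "ADJ":
            pending.append(word)
        elif tag == "NOUN":
            bindings.setdefault(word, []).extend(pending)
            pending = []
        else:
            pending = []  # adjectives not followed by a noun are dropped
    return bindings


if __name__ == "__main__":
    sentence = [
        ("The", "DET"), ("big", "ADJ"), ("blue", "ADJ"), ("dog", "NOUN"),
        ("ran", "VERB"), ("up", "PREP"), ("the", "DET"), ("hill", "NOUN"),
    ]
    print(bind_adjectives(sentence))
```

For the example sentence, this binds big and blue to dog and leaves hill with no adjectives. Adverb-to-verb binding could follow the same pattern with different tags.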
2.2 Prototype Functional Description
A user will interact with the LASI prototype through a GUI. The GUI will allow
a user to start a new project or open an existing one. The user will then be able to add or
remove the documents involved in the project and to preview
the documents added to the project before starting analysis. After analysis, the results
will be displayed in three tabs: Top Results, Word Relationships, and Word Count and
Weighting. The
user will also be able to export the results into PDF format. Example 1 shows a prototype
of the Top Results tab displaying the likeliest themes based on analysis.
Example 1: Top Results
Example 2 is a prototype of the Word Relationships tab. It also shows that the user
will be able to see these results for all the documents together and for individual documents.
Each word’s color will correspond to its POS. The search box will allow
the user to search for specific words and have the matching words highlighted.
Example 2: Word Relationships
Example 3 is a prototype of the Word Count and Weighting tab. It will display
the count of each word in the set of documents along with its weight as assigned by the
weighting algorithm.
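A minimal Python sketch of the kind of table this tab could display is given here, assuming that a word's weight is simply its share of all words in the document set. That weighting rule is a placeholder assumption, since this specification says only that the weighting algorithm looks at various metrics. The function name count_and_weight is hypothetical.

```python
import re
from collections import Counter


def count_and_weight(documents):
    """Count words across all documents and weight each by relative frequency.

    Returns a list of (word, count, weight) rows sorted by descending count,
    then alphabetically. Weight = count / total words (placeholder rule).
    """
    counts = Counter()
    for text in documents:
        counts.update(w.lower() for w in re.findall(r"[A-Za-z']+", text))
    total = sum(counts.values())
    if total == 0:
        return []
    rows = [(word, n, n / total) for word, n in counts.items()]
    return sorted(rows, key=lambda r: (-r[1], r[0]))


if __name__ == "__main__":
    docs = ["The dog ran.", "The dog sat on the hill."]
    for word, n, weight in count_and_weight(docs):
        print(f"{word:<6}{n:>3}{weight:>8.3f}")
```

A fuller weighting scheme would likely also draw on POS tags and bindings, which this sketch ignores.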
Example 3: Word Count and Weighting
2.3 External Interfaces
The external interfaces that LASI will use are those of the previously
discussed COTS software components. Other than those, the interfaces involved in LASI
are custom.
2.3.1 Hardware
LASI will be a stand-alone program, so the hardware required to run the LASI prototype
will be a laptop or desktop with four to eight gigabytes of RAM and a multi-core
processor.
2.3.2 Software Interfaces
The software needed will be the third-party software to tag parts-of-speech and convert
DOC and DOCX files to TXT files, the LASI data structures and algorithms, and the
LASI GUI (Figure 1). For in-class development, a virtual machine is also being used as
a testing, demonstration, and code-writing environment.
Figure 1. Prototype Major Functional Components