Scott Minter - Old Dominion University

Running Head: Lab 1 – LASI Product Description
LASI Product Description
LASI – Red Team
Old Dominion University
Author: Scott Minter
Last Modified: March 18, 2013
Version 1.1
Table of Contents
1. Introduction
2. Product Description
2.1 Key Product Features and Capabilities
2.2 Major Functional Components (Hardware and Software)
3. Identification of Case Study
4. Prototype Description
4.1 Major Functional Components (Hardware/Software)
4.2 Features and Capabilities
4.3 Prototype Development Challenges
Glossary of Terms
References
List of Figures
Figure 1. Top Results Tab
Figure 2. Word Relationships Tab
Figure 3. Word Count and Weighting Tab
Figure 4. Current AID Process
Figure 5. AID Process with LASI
Figure 6. Real-World vs. Prototype
Figure 7. Prototype Major Functional Components
1. Introduction
Linguistic Analysis for Subject Identification (LASI) is the name of The Red
Group’s project. This project is being developed as a requirement in the Professional
Workforce Development I & II courses at Old Dominion University. Linguistic
Analysis, in the scope of LASI, is the contextual study of written works and how the
words combine to form an overall meaning. LASI will be a decision support tool to
assist users in determining common themes across multiple documents. The themes that
LASI will produce are subject-object-verb relationships. Themes are
important because they help the reader comprehend what has just been read. A reader
who has comprehended the material can then summarize it. Comprehension and
summarization are important because they help the reader communicate the content
of the material to other people.
The process of finding common themes across multiple documents may be
lengthy and repetitive. This is due to the depth of understanding needed to identify
themes across all the documents, which may not be the theme of any individual
document. Therefore, it is difficult for people to identify a common theme over a large
set of documents in a timely, consistent and objective manner.
LASI will assist in this area by providing a weighted list of potential themes from
which the user can choose the best fit for their understanding of the material. To
resolve this problem effectively, LASI will need to find themes accurately, use system
resources efficiently, and provide consistent results.
2. Product Description
LASI will be a self-contained, stand-alone piece of software. It will not require a
connection to the Internet to produce accurate results. LASI will be designed to run on a
consumer-level laptop or desktop. LASI will also be designed to serve as an open source
back-end engine for other projects. The data collected from the analysis that LASI
performs can be used to drive other projects and their respective GUIs, completely
bypassing the default GUI.
LASI’s ability to analyze multiple documents for common themes makes it a
decision support tool that is useful to anyone who has to read over large sets of
documents looking for commonality. Students could use it to verify the usefulness of
scientific publications to the topic they are researching. Teachers could use it as an initial
analysis of student research papers, verifying that the paper correctly addresses the topic.
Similarly, research analysts could use LASI to verify whether different papers and
articles address the specific area they are researching.
2.1 Key Product Features and Capabilities
LASI’s ability to find themes is based on three different sub-routines. The first is
a Part-Of-Speech (POS) tagging system that will return the input document(s) with all
Words and Word Phrases tagged for their corresponding POS. Second is a word
association algorithm that will associate Words based on their POS and their proximity to
one another. Finally, a weight is applied to each Word or Word Phrase based on its POS
and its association with other words and their POS tags.
LASI will accept DOC, DOCX, and TXT files as input. LASI will allow a user to
input any known “problem” words, such as organization-specific jargon or slang. LASI will
also allow a user to input any desired assumptions such as synonyms and acronyms. The
user will be able to specify word equivalency, allowing LASI to better analyze the
document in the context the user desires.
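A minimal sketch of how user-specified word equivalency might work is shown below. It maps each synonym or acronym onto a single canonical form before counting, so that equivalent words are tallied together. The function name normalize and the sample mappings are hypothetical illustrations, not part of LASI's design.

```python
from collections import Counter

# Hypothetical user-supplied equivalences: each variant maps to one canonical form.
EQUIVALENTS = {
    "sos": "system of systems",
    "org": "organization",
    "firm": "organization",
}

def normalize(words, equivalents):
    """Lower-case each word and replace known variants with their canonical form."""
    return [equivalents.get(w.lower(), w.lower()) for w in words]

words = ["Org", "firm", "organization", "SoS"]
counts = Counter(normalize(words, EQUIVALENTS))
print(counts["organization"])  # 3
```

Normalizing before counting keeps the later analysis steps unaware of the user's assumptions, which is one possible way to apply them consistently.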
One of the important aspects of LASI is the user experience. The results will be
output into three tabs: Top Results, Word Relationships, and Word Count and Weighting.
The user will also be able to export the results into PDF format. Figure 1 shows a
prototype of the Top Results tab displaying the most likely themes based on the
analysis.
Figure 1. Top Results Tab
Figure 2 is a prototype of the Word Relationships tab. It shows that the user will be
able to see these results for all the documents together and for the individual
documents. The colors will correspond to each word’s POS. The search box will allow the
user to search for specific words and have the matching words highlighted.
Figure 2. Word Relationships Tab
Figure 3 is a prototype of the Word Count and Weighting Tab. It will display the
count of each word in the set of documents and display their weights based on the
weighting algorithm.
Figure 3. Word Count and Weighting Tab
2.2 Major Functional Components (Hardware and Software)
LASI will be able to run on a laptop or a desktop provided the machine has a
multi-core processor and four to eight gigabytes of RAM. The third-party software
components for LASI are the SharpNLP Part-Of-Speech Tagger, WordNet Thesaurus Data,
and Document Converters. The SharpNLP Part-Of-Speech Tagger tags words and word
phrases with their corresponding POS. WordNet Thesaurus Data allows LASI to
recognize synonyms. Document Converters convert DOC and DOCX files to TXT files.
LASI’s analytical capabilities are enabled by a combination of data structures and
algorithms. The key data types fall into two categories: specialized word and phrase
constructs, and documents represented as traversable collections of those
constructs.
Specialized word and phrase constructs are assigned based on their tagged POS.
Once initialized, an instance of a Word Construct can be displayed, sorted based on
its POS (e.g. Noun, Verb, etc.), and have other constructs assigned to it based on
known syntactic and/or semantic relationships. Phrase Constructs will handle the
phrase tags that are generated by the SharpNLP POS tagging system. A phrase is a
recognized group of words that has a tagged POS (e.g. NounPhrase, VerbPhrase, etc.).
Similar to Word Constructs, Phrase Constructs can be displayed, sorted based on
their POS, and have other constructs assigned to them based on known syntactic
and/or semantic relationships.
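The Word and Phrase Constructs described above could be sketched roughly as follows. The class layout, the field names, and the bind method are assumptions made for illustration; the source does not specify LASI's actual class design.

```python
from dataclasses import dataclass, field

@dataclass
class Word:
    text: str
    pos: str  # tagged part of speech, e.g. "Noun" or "Verb"
    references: list = field(default_factory=list)  # constructs bound to this one

    def bind(self, other):
        """Record a syntactic or semantic relationship to another construct."""
        self.references.append(other)

@dataclass
class Phrase:
    words: list
    pos: str  # tagged phrase type, e.g. "NounPhrase"
    references: list = field(default_factory=list)

    @property
    def text(self):
        return " ".join(w.text for w in self.words)

    def bind(self, other):
        self.references.append(other)

the = Word("the", "Determiner")
dog = Word("dog", "Noun")
barks = Word("barks", "Verb")
subject = Phrase([the, dog], "NounPhrase")
barks.bind(subject)  # the verb now references its subject phrase

# Sorting by POS, as the text says a construct should support.
by_pos = sorted([barks, the, dog], key=lambda w: w.pos)
print([w.text for w in by_pos])  # ['the', 'dog', 'barks']
```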
Treating a document as a traversable collection allows it to be moved through
using three different methods: Word-wise, Reference-wise, and Web-wise. When
moving through a document Word-wise, the document is broken up into individual
words, with each word being an instance of the Word class. When moving through a
document Reference-wise, the document is broken up using the reference methods of
the Word class and Phrase class. This allows the document to be viewed in terms of
which words and/or phrases reference each other. Moving through a document
Web-wise allows the document to be traversed as if it were a web, with the nodes
being the words and/or phrases and the references being the connections between
nodes.
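The three traversal methods can be illustrated with the small sketch below. The index-based data layout, the function names, and the choice of a breadth-first walk for the Web-wise case are all assumptions; the source only states that the document can be traversed in these three ways.

```python
from collections import deque

# Hypothetical document: words in reading order, plus reference links by index.
words = ["the", "analyst", "reads", "reports"]
references = {1: [0, 2], 2: [3]}  # "analyst" -> "the", "reads"; "reads" -> "reports"

def word_wise(words):
    """Visit every word in reading order."""
    yield from words

def reference_wise(words, references):
    """Visit each (source, target) reference pair."""
    for src in sorted(references):
        for dst in references[src]:
            yield words[src], words[dst]

def web_wise(words, references, start):
    """Breadth-first walk over the reference web, treating links as undirected."""
    neighbours = {i: set() for i in range(len(words))}
    for src, dsts in references.items():
        for dst in dsts:
            neighbours[src].add(dst)
            neighbours[dst].add(src)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        yield words[node]
        for nxt in sorted(neighbours[node]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)

print(list(web_wise(words, references, start=3)))
# ['reports', 'reads', 'analyst', 'the']
```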
The algorithms used by LASI break down into three categories: Element-binding,
Weighting, and Conflict Resolution. Element-binding binds words and phrases together
based on each instance’s POS to create the references described above for Word and
Phrase Constructs. Element-binding will consist of Direct Binding and Indirect
Binding. Direct Binding will create subject-verb, verb-subject, adverb-verb,
adjective-noun, and determiner-noun references. Indirect Binding will create
pronoun-noun references.
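As a rough illustration of Direct Binding, the sketch below binds each modifier to the next word with the matching head POS. The rule of binding to the nearest following head is an assumption made to keep the example small, and only three of the listed reference types are covered; subject-verb binding and Indirect Binding are omitted.

```python
# Head POS that each modifier POS binds to (pairs taken from the text above).
BINDS_TO = {"Determiner": "Noun", "Adjective": "Noun", "Adverb": "Verb"}

def direct_bind(tagged):
    """Return (modifier, head) pairs for a list of (text, pos) tuples."""
    pairs = []
    for i, (text, pos) in enumerate(tagged):
        head_pos = BINDS_TO.get(pos)
        if head_pos is None:
            continue  # not a modifier this sketch handles
        for later_text, later_pos in tagged[i + 1:]:
            if later_pos == head_pos:
                pairs.append((text, later_text))
                break  # bind only to the nearest following head
    return pairs

sentence = [("the", "Determiner"), ("quick", "Adjective"), ("fox", "Noun"),
            ("quietly", "Adverb"), ("runs", "Verb")]
bindings = direct_bind(sentence)
print(bindings)  # [('the', 'fox'), ('quick', 'fox'), ('quietly', 'runs')]
```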
The Weighting algorithm gives a numeric value to both Word and Phrase
instances that indicates each instance’s importance. The algorithm looks at both
raw and relative data. For raw data it looks at word instance count, word instance
POS count, and synonym count. Word instance count tallies the number of times a
word occurs. Word instance POS count tallies the occurrences of a word that share
the same POS tag. Finally, for any recognized synonyms, the synonym count raises
the count of both synonymous words. For relative data it looks at
Subject-Object-Verb reference count and Lexical Distance. Once Word and Phrase
references have been made at a level deep enough to establish Subject-Object-Verb
(SOV) references, a count is made of the number of times each SOV instance occurs.
Weight is also based on Lexical Distance, meaning that the physical proximity of a
Word or Phrase to the reference instance will determine the weight assigned.
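The raw counts and the Lexical Distance factor might be computed along the lines of the sketch below. The sample data, the synonym group, and the 1 / (1 + distance) decay are assumptions for illustration only; the source does not give the actual formulas.

```python
from collections import Counter

# Hypothetical tagged input: (text, POS) pairs in reading order.
tagged = [("firm", "Noun"), ("plans", "Verb"), ("growth", "Noun"),
          ("company", "Noun"), ("plans", "Noun")]
SYNONYMS = [{"firm", "company"}]  # hypothetical thesaurus entry

# Word instance count: occurrences of each word regardless of POS.
instance_count = Counter(text for text, _ in tagged)

# Word instance POS count: the same word is counted separately per POS tag.
pos_count = Counter(tagged)

# Synonym count: each word in a group is credited with its synonyms' occurrences.
synonym_count = Counter()
for group in SYNONYMS:
    total = sum(instance_count[w] for w in group)
    for w in group:
        synonym_count[w] = total - instance_count[w]

def distance_weight(i, j):
    """Assumed decay: constructs that are closer together contribute more weight."""
    return 1.0 / (1 + abs(i - j))

print(instance_count["plans"], pos_count[("plans", "Verb")], synonym_count["firm"])
# 2 1 1
```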
Conflict Resolution will be important to ensuring that LASI can complete its
analysis successfully. In a document, there may be any number of unaccounted-for
items, such as incorrect grammar and unrecognized characters. Conflict Resolution
will recognize these items and try to address them; if it cannot, it will throw the
proper exception.
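One way to picture Conflict Resolution for unrecognized characters is sketched below: attempt a repair first, and raise an exception only when the repair fails. The exception name, the repair table, and the choice to skip the whole token on failure are hypothetical and only illustrate the recognize, address, or throw sequence described above.

```python
class UnrecognizedCharacterError(Exception):
    """Raised when a character cannot be repaired (hypothetical exception name)."""

# Assumed repairs: typographic characters mapped to plain equivalents.
REPAIRS = {"\u2019": "'", "\u201c": '"', "\u201d": '"', "\u2013": "-"}

def resolve(token):
    """Repair known characters; raise if an unprintable character remains."""
    repaired = "".join(REPAIRS.get(ch, ch) for ch in token)
    for ch in repaired:
        if not ch.isprintable():
            raise UnrecognizedCharacterError(repr(ch))
    return repaired

def clean(tokens):
    """Keep every token that can be resolved and skip the rest."""
    kept = []
    for token in tokens:
        try:
            kept.append(resolve(token))
        except UnrecognizedCharacterError:
            continue  # analysis carries on without the offending token
    return kept

cleaned = clean(["LASI\u2019s", "bad\x00token", "themes"])
print(cleaned)  # ["LASI's", 'themes']
```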
3. Identification of Case Study
Dr. Hester and Dr. Meyers work for an organization housed on the Old Dominion
University campus called the National Centers for System of Systems Engineering
(NCSOSE). NCSOSE analyzes organizations and their respective documents in order to
help them recognize and address internal problems. The current process utilized at
NCSOSE is called the Assessment Improvement Design (AID) process. The AID
process begins with a company coming to NCSOSE for evaluation. NCSOSE then
gathers organization-specific documents for analysis. In this analysis
phase, Drs. Hester and Meyers read over the documents multiple times in order to
find common themes specific to the structure and function of the organization. Finally,
Dr. Hester and Dr. Meyers return to the organization with their findings based on the
analysis (Fig. 4).
Figure 4. Current AID Process
It is during the document analysis phase that LASI will be utilized. Inserting
LASI into the AID process will cut down on both time and inconsistency. LASI will
reduce the time spent rereading the documents and give NCSOSE a logical grounding
for the findings they return to the organizations (Fig. 5).
Figure 5. AID Process with LASI
4. Prototype Description
A full Real-World Solution for LASI would be very difficult to develop in the
time allotted, so a prototype will be created that narrows the scope while
still demonstrating LASI’s capabilities. Figure 6 shows what the prototype
will do in comparison to the Real-World Solution.
Figure 6. Real-World vs. Prototype
4.1 Major Functional Components (Hardware/Software)
The major functional components for the prototype are very similar to those of the
Real-World Solution. The hardware required to run the LASI prototype will be a laptop
or desktop with four to eight gigabytes of RAM and a multi-core processor. The software
needed will be the third-party software to tag parts of speech and convert DOC and
DOCX files to TXT files, the LASI data structures and algorithms, and the LASI GUI
(Fig. 7). For in-class development, a Virtual Machine is also being utilized as a testing,
demonstration, and code-writing environment.
Figure 7. Prototype Major Functional Components
The third-party software is the SharpNLP Part-Of-Speech Tagger and the
B2XTranslator. The SharpNLP POS Tagger tags words and word phrases for their
respective POS in order for the LASI algorithms to use them. The B2XTranslator
converts DOC to DOCX files. This is done because DOCX files contain an XML file
that can easily be converted to a TXT file.
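To illustrate why DOCX is a convenient intermediate format, the sketch below extracts plain text from the XML inside a DOCX file, which is a ZIP archive whose main body is stored in word/document.xml. This is not the B2XTranslator or LASI's actual converter; the function name docx_to_text is hypothetical, and the demonstration builds a tiny DOCX-like archive in memory rather than reading a real file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as w:p and w:t.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_to_text(source):
    """Return the paragraphs of a DOCX file (path or file-like object) as text."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more w:t run elements.
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return "\n".join(paragraphs)

# Build a minimal archive in memory so the example runs without a real file.
body = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:body>"
    "<w:p><w:r><w:t>Hello</w:t></w:r><w:r><w:t> LASI</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
    "</w:body></w:document>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", body)
buffer.seek(0)

text = docx_to_text(buffer)
print(text)
```

A real document would also need handling for tables, headers, and footnotes, which this sketch ignores.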
The LASI data structures and algorithms needed for the prototype are reference
binding and weight assigning. In the prototype, the reference binding works the same as
in the Real-World solution. References are made between Words and Word Phrases
based on their tagged POS and how they relate to one another within the sentence,
paragraph, and document structure. The weight-assigning algorithm will assign weight
based on tagged POS, word instance count, and reference count. The reference count
tallies how many times other Words and Word Phrases refer to a given Word or Word Phrase.
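The prototype's weight assignment could be approximated as in the sketch below, which combines the three inputs named above. The POS multipliers and the way the inputs are combined are assumptions, since the source does not specify values or a formula.

```python
from collections import Counter

# Assumed POS multipliers; the source does not specify actual values.
POS_FACTOR = {"Noun": 1.0, "Verb": 0.8, "Adjective": 0.5}

def assign_weights(tagged, references):
    """tagged: list of (text, pos); references: list of (source, target) texts."""
    instances = Counter(text for text, _ in tagged)          # word instance count
    referenced = Counter(target for _, target in references)  # reference count
    pos_of = dict(tagged)  # simplification: one POS per distinct word
    return {
        text: POS_FACTOR.get(pos_of[text], 0.1) * (instances[text] + referenced[text])
        for text in instances
    }

tagged = [("quick", "Adjective"), ("fox", "Noun"), ("runs", "Verb"), ("fox", "Noun")]
references = [("quick", "fox"), ("runs", "fox")]
weights = assign_weights(tagged, references)
print(max(weights, key=weights.get))  # fox
```

Here "fox" ranks highest because it occurs twice and is referred to by two other words.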
4.2 Features and Capabilities
The prototype will be limited in its capabilities compared with the Real-World Solution
due to the time constraints of the class. One limitation is that it will
only allow five documents to be loaded into a LASI project. Also, it will only accept
DOC, DOCX and TXT files as input. In the Real-World solution there would need to be
some kind of scanned text recognition in order to correctly convert PDF files to TXT
files. However, our prototype will not accept PDF files and therefore will not have
scanned text recognition capabilities.
Some of the identified risks with LASI are trust in the output, post-semester
maintenance, individual PC system limitations, and illegal character handling. These
risks are being mitigated through various means. Trust in the output will be addressed
by the various views and tabs on our results GUI. Showing the user much of LASI’s
accumulated data will make the results verifiable. Maintenance of LASI will be performed
by the open source community and possibly by future CS410 and CS411 groups.
Crashes due to system limitations will be avoided by multithreading the LASI
algorithms and making sure the program runs as efficiently as possible. At some point,
LASI will encounter unrecognized characters in a document. When this happens,
LASI will attempt to recognize these characters based on their syntax in the document.
However, if they remain unrecognizable, then an exception will be thrown and the
character will be ignored.
4.3 Prototype Development Challenges
Some of the challenges facing the development of the LASI prototype are the
ability to correctly use all the data generated and to correctly identify themes. Identifying
POS, creating Word and Word Phrase references, and assigning weight all
help LASI identify themes correctly. This challenge will be mitigated by
designing an algorithm that uses this information to identify themes in a
timely, consistent and objective manner.
Glossary of Terms
Theme - subject-object-verb relationships that LASI is attempting to generate from the
input set
LASI - Linguistic Analysis for Subject Identification
Parser - Takes in DOC and DOCX files and converts them to TXT files
WordNet - compilers and providers of our thesaurus
Phrase - A group of words standing together as a conceptual unit, typically forming a new
component.
Analysis - Detailed examination of the elements or structure of something, typically as a
basis for interpretation.
Linguistic Analysis - The scientific analysis of a language.
Tag - A label, or the act of attaching a label, that specifies the role (such as part of
speech or location) of a selected element in a document.
Document - A document herein refers to a formally written, expository paper which
expounds, via a declarative approach, on a relatively quantifiable issue, goal, or area of
research.
Word Weight - A numeric value, associated with each syntactically and lexically unique
word in a written work, which indicates the relative significance of that word.
Tornado chart - A horizontal bar-graph-like visualization, representing the relative
frequency or significance of elements, sorted in descending order by magnitude.
Head word - A Head Word is the locally distinct word within a phrase which, by its
syntactic associations, determines the syntactic category of the phrase itself.
Word Binding - The process of binding part-of-speech to a word.
Sharp NLP - C# natural language processing tool used to parse and tag part-of-speech.
Tagged Word Object - A word that has an associated part-of-speech.
Optical Character Recognition - Conversion of scanned images to text.
Tagged Set - A group of words whose part of speech and location in a sentence have
been identified by our parser.
Lexer - A piece of our parsing tool that isolates each word and its part of speech, and
location in a sentence into machine readable tokens. These are stored as elements in an
XML file.
Syntactic Analysis - a form of Linguistic analysis that focuses on grammar in sentences
and identifies themes based on sentence structure and formatting. Unlike Semantic
Analysis, it identifies key words based on their location in the sentence, rather than their
overall meaning throughout the document.
Subject Identification - Identifies the main actor in a sentence. However, in a broader
sense, the word subject is synonymous with the theme of a document. Subject
identification is the process of determining the subjects, or themes, of a document or
documents.
Part of Speech Tagger - Software utility that associates words with the parts of speech
(i.e. Noun, Verb, etc.) in a sentence.
Semantic Analysis - Relating the syntactical structure of words to their language
independent meanings.
A.I.D. Process - Assessment Improvement Design: A process that provides quantitative
and qualitative basis to identify problems and determine the feasibility of solutions.
Strategic Document - Document produced by a client that defines their Goals,
Visions, and Missions.
Word (denoted by capital W) – an instance of LASI’s Word class
Word Phrase (denoted by capital W and capital P) – an instance of LASI’s Word Phrase
class
References
Hester, P.T., Meyers, T. (2012). Enterprise AID: A performance measurement system
for enterprise assessment, improvement, and design (NCSOSE-TR-12-001).
Norfolk, VA: National Centers for System of Systems Engineering.