Lab 1 – LASI Description

Running head: LAB 1 – LASI DESCRIPTION
Lab 1 – LASI Product Description
Brittany Johnson
CS411
Janet Brunelle
March 18, 2013
Version 2
Table of Contents
1 INTRODUCTION
2 PRODUCT DESCRIPTION
2.1 Key Product Features and Capabilities
2.2 Major Components (Hardware/Software)
3 IDENTIFICATION OF CASE STUDY
4 PRODUCT PROTOTYPE DESCRIPTION
4.1 Prototype Architecture (Hardware/Software)
4.2 Prototype Features and Capabilities
GLOSSARY
REFERENCES
List of Figures
Figure 1. Top Results Output
Figure 2. Word Relationships Output
Figure 3. Word Count and Weighting Output
Figure 4. AID Process: Assessment
Figure 5. Prototype Hardware and Software Component Diagram
List of Tables
Table 1. Feature comparison between prototype and real world product
Lab 1 – LASI Product Description
1 INTRODUCTION
LASI stands for Linguistic Analysis for Subject Identification. It is a stand-alone theme-finding
application conceived by the Old Dominion University CS410 Red Group. It is designed
to be a decision support tool for large, multi-document linguistic analysis that allows for more
accurate and consistent results. Linguistic analysis, with respect to the current project, is the
contextual study of written works and how the words combine to form an overall meaning.
Themes are subject-object-verb relationships that LASI attempts to generate from the
input set. They are important because they help the reader to comprehend and summarize what has
been read. Identifying a theme becomes more difficult as the number of documents
increases, because the theme across all of the documents may not be the theme of each
individual document. The complexity of a topic and the reader’s familiarity with it also play an
important role in the reader’s comprehension.
This comprehension, along with the ability to summarize the material, is important in
being able to communicate the content of a document. It is often difficult, however, for people to
identify a common theme over a large set of documents in a timely, consistent, and objective
manner. LASI will help the reader come to an informed conclusion by providing a
weighted list of potential themes.
2 PRODUCT DESCRIPTION
LASI will be an open-source, stand-alone piece of software designed to run on a
consumer-grade laptop. LASI will be able to detect themes across many documents and can
provide both individual and cross-document analysis to determine a single theme. LASI’s ability
to analyze multiple documents to find a common theme makes it a useful decision support tool for
teachers, students, research analysts, and others who need to read through large sets of
documents on a frequent basis.
Teachers, for example, could use LASI for an initial analysis of student papers
to check whether each paper is consistent with its assigned topic. Both students and research
analysts could use LASI to quickly assess the usefulness of scientific and literary publications for
the topic that they are researching.
2.1 Key Product Features and Capabilities
Through the use of Optical Character Recognition (OCR) and a parser that is integrated
into LASI, the user can create a project from multiple file types, including DOC,
DOCX, PPT, PPTX, TXT, and PDF. By finding the commonalities between the documents using
their parts of speech and statistical analysis, LASI can reveal a common theme.
Once a project is created, the files can be viewed in plaintext form in the LASI user
interface. Documents can be added at any time after the project has been created, including after
the existing documents have been analyzed. If the project has already been compiled, the new
documents will be analyzed and then added to the overall results. While the project is being
created, the user also has the option to add a dictionary of company-specific jargon as well as
assumptions about the content. This will help LASI tailor its analysis to the content and increase
the statistical likelihood of determining a theme.
Figure 1. Top Results Output
Once the documents have been analyzed, the results can be viewed in three different
formats: Top Results, Word Relationships, and Word Count and Weighting. The top results
will be represented graphically based on the user’s preferred chart type. The types of charts
available include tornado charts, bar graphs, and pie charts. Figure 1 shows a tornado chart
of the 10 most likely themes across all of the documents, listed in descending
order by word weight. Each of the documents may also be viewed individually, where
the data will be represented similarly.
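The Top Results ranking described above can be illustrated with a minimal sketch. It is written in Python for brevity (LASI itself is a C# application), and the function name `top_themes`, the tie-breaking rule, and the example weights are all assumptions made for illustration, not part of the LASI design.

```python
from typing import Dict, List, Tuple


def top_themes(weights: Dict[str, float], limit: int = 10) -> List[Tuple[str, float]]:
    """Return the `limit` highest-weighted candidate themes, heaviest first."""
    # Sort by weight in descending order; ties are broken alphabetically
    # (an assumption) so that the output order is stable.
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


# Hypothetical weights, invented purely for illustration.
example_weights = {"budget": 4.5, "schedule": 7.0, "staffing": 4.5, "risk": 9.25}
print(top_themes(example_weights, limit=3))
```

The resulting list is already in the order a tornado chart or bar graph would draw its bars, from the heaviest theme down.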
Figure 2. Word Relationships Output
The word relationships, as shown in Figure 2, will also be displayed for each document.
Each word is colored based on its part of speech. This will allow the user to see the
relationships between all of the words in a document. The links between the words are an
important visual aid for helping the user to understand the importance of individual words.
Figure 3. Word Count and Weighting Output
Results will also be displayed based on the individual word count and weight. The weight
that is displayed is produced by the weighting algorithm. In Figure 3, this is shown as a simple
table that can be sorted by word, frequency, or weight. This will show how each document
affected the total results and the importance of individual words. Once the project has finished
being analyzed, the results can be printed or exported as PDF, JPG, or PNG.
2.2 Major Components (Hardware/Software)
LASI has a few hardware requirements for the product to run at an optimal level. It
should preferably run on a high-end, business-grade computer with at least 8 GB of
DDR3 SDRAM and a quad-core CPU. It also requires that the user provide secondary
storage space for documentation.
The first software component of LASI is the graphical user interface (GUI). This application
runs locally on the user’s machine. It is a Windows Presentation Foundation (WPF) project
that uses XAML to define the structure of the views and C# to provide the interactivity.
The second software component is the file system. It manages converting files and
invoking the tagger. After a text file is tagged, it is passed to a tagged-file parser, which
converts the text into the word and phrase types that represent the elements of the document at run
time. B2XTranslator is third-party, open-source software that is used to convert file
types. When a document is added to a project in the GUI, B2XTranslator takes the DOCX file and
converts it to an XML file. Once the document is in XML, it can be converted again into a form
usable by the parts-of-speech tagger.
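As a rough illustration of the tagged-file parser step, the sketch below turns tagged text into word objects. It is written in Python for brevity rather than C#. It assumes the tagger writes whitespace-separated `word/TAG` tokens, which is a common convention but is not confirmed here as the format LASI uses. The names `TaggedWord` and `parse_tagged_text` are hypothetical.

```python
from typing import List, NamedTuple


class TaggedWord(NamedTuple):
    text: str  # the word as it appeared in the document
    tag: str   # its part-of-speech tag, e.g. "NN" or "VBZ"


def parse_tagged_text(tagged: str) -> List[TaggedWord]:
    """Split whitespace-separated word/TAG tokens into (text, tag) pairs."""
    words = []
    for token in tagged.split():
        # Split on the LAST slash so that a word such as "and/or" keeps its slash.
        text, slash, tag = token.rpartition("/")
        if not slash or not text or not tag:
            raise ValueError(f"not a word/TAG token: {token!r}")
        words.append(TaggedWord(text, tag))
    return words


print(parse_tagged_text("The/DT analyst/NN reads/VBZ reports/NNS"))
```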
The third software component is the parts-of-speech tagger. The tagger being used is SharpNLP,
an open-source C# natural language processing tool. SharpNLP uses the Penn Treebank
parts-of-speech tags to define the parts of speech. SharpNLP will assign each word a tag and place
groups of words into phrase types before writing them back to a file. Once a document has been
tagged, each word is assigned a word type that corresponds to the part of speech given by the
tagger. Phrase types are groups of words that have been put together. Each phrase type contains a
list of words and the attributes that the syntactic phrase type represents.
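The word and phrase types described above might be modeled along the following lines. This is only a sketch in Python (the real implementation is in C#). The tag prefixes are standard Penn Treebank tags, but the class names `Word` and `Phrase`, the coarse type names, and the phrase label "NP" are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List

# Penn Treebank tag prefixes mapped to coarse word types (type names are assumed).
_PREFIX_TO_TYPE = (("NN", "noun"), ("VB", "verb"), ("JJ", "adjective"),
                   ("RB", "adverb"), ("PRP", "pronoun"), ("DT", "determiner"))


def word_type(tag: str) -> str:
    """Map a Penn Treebank tag such as "NNS" or "VBZ" to a coarse word type."""
    for prefix, kind in _PREFIX_TO_TYPE:
        if tag.startswith(prefix):
            return kind
    return "other"


@dataclass
class Word:
    text: str
    tag: str

    @property
    def kind(self) -> str:
        return word_type(self.tag)


@dataclass
class Phrase:
    kind: str                                   # e.g. "NP" for a noun phrase
    words: List[Word] = field(default_factory=list)

    def text(self) -> str:
        return " ".join(word.text for word in self.words)


noun_phrase = Phrase("NP", [Word("The", "DT"), Word("new", "JJ"), Word("analyst", "NN")])
print(noun_phrase.text(), [word.kind for word in noun_phrase.words])
```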
The fourth software component is the LASI algorithm, which is written in
C#. The LASI algorithm ties word and phrase types together based on their syntactic
relationships via a logic flow derived from a state machine. The document can be traversed in
three ways: word-wise, reference-wise, and tree-wise. A word-wise traversal moves through the
document one word at a time. A reference-wise traversal moves through the document based on the
words and phrases that reference each other. A tree-wise traversal follows a specific word to its
referenced words, then to the words those reference, and so on.
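The three traversal modes can be sketched as follows, in Python rather than C#. The sketch assumes references are stored as a mapping from each word to the words it references, and it treats words as plain strings. Both are simplifications (a real document would need to tell repeated words apart), and the function names are hypothetical.

```python
from typing import Dict, Iterator, List, Tuple

References = Dict[str, List[str]]  # word -> words it references (assumed shape)


def word_wise(words: List[str]) -> Iterator[str]:
    """Visit the document one word at a time, in reading order."""
    for word in words:
        yield word


def reference_wise(words: List[str], refs: References) -> Iterator[Tuple[str, str]]:
    """Visit each (word, referenced word) pair, in reading order of the source word."""
    for word in words:
        for target in refs.get(word, []):
            yield (word, target)


def tree_wise(start: str, refs: References) -> Iterator[str]:
    """Follow one word to its referenced words, then to theirs, and so on."""
    seen = set()
    stack = [start]
    while stack:
        word = stack.pop()
        if word in seen:  # guard against reference cycles
            continue
        seen.add(word)
        yield word
        # Reverse so the first-listed reference is visited first.
        stack.extend(reversed(refs.get(word, [])))


words = ["it", "summarizes", "report", "budget"]
refs = {"it": ["report"], "summarizes": ["it", "budget"]}
print(list(tree_wise("summarizes", refs)))
```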
The algorithm focuses on the direct and indirect binding of words and phrases. Direct
binding includes the binding of nouns to verbs, adverbs to verbs, adjectives to nouns, and
determiners to nouns. Indirect binding includes the binding of pronouns to nouns.
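To make the binding categories concrete, here is a deliberately simple Python sketch. It is not the LASI state machine. It assumes a nearest-neighbor heuristic in which each determiner, adjective, adverb, or noun binds to the next word of its target type, and each pronoun binds to the closest preceding noun. The heuristic and every name in the code are assumptions made for illustration.

```python
from typing import List, Tuple

TaggedWord = Tuple[str, str]  # (text, Penn Treebank tag)

# Direct bindings named in the text: source type -> type it binds to.
DIRECT_TARGET = {"determiner": "noun", "adjective": "noun",
                 "adverb": "verb", "noun": "verb"}


def coarse_type(tag: str) -> str:
    for prefix, kind in (("NN", "noun"), ("VB", "verb"), ("JJ", "adjective"),
                         ("RB", "adverb"), ("PRP", "pronoun"), ("DT", "determiner")):
        if tag.startswith(prefix):
            return kind
    return "other"


def bind(sentence: List[TaggedWord]) -> List[Tuple[str, str, str]]:
    """Return (source, target, binding kind) triples for one tagged sentence."""
    kinds = [coarse_type(tag) for _, tag in sentence]
    bindings = []
    for i, (text, _) in enumerate(sentence):
        if kinds[i] == "pronoun":
            # Indirect binding: look backward for the closest noun.
            candidates = range(i - 1, -1, -1)
            target_kind = "noun"
        elif kinds[i] in DIRECT_TARGET:
            # Direct binding: look forward for the next word of the target type.
            candidates = range(i + 1, len(sentence))
            target_kind = DIRECT_TARGET[kinds[i]]
        else:
            continue
        for j in candidates:
            if kinds[j] == target_kind:
                bindings.append((text, sentence[j][0], f"{kinds[i]}-{target_kind}"))
                break
    return bindings


sentence = [("The", "DT"), ("new", "JJ"), ("analyst", "NN"),
            ("quickly", "RB"), ("reads", "VBZ"), ("reports", "NNS")]
for binding in bind(sentence):
    print(binding)
```

A word with no candidate of its target type, such as the object "reports" here, is simply left unbound in this sketch.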
Once the word and phrase binding is finished, the algorithm will begin weighting the words based
on their frequency as well as how they are used. The weighting metrics for each word will be based
on a raw frequency as well as a relative frequency. Each word will have a raw frequency that is
based on a simple word count, the number of times that the word was used in a particular
manner, and a frequency count for synonyms of that word. The relative frequency will be based
on the subject, verb, and object relationships between words as well as where a word is located in
a document. The more bindings that are made, the more accurate the results become.
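The raw-frequency part of the weighting can be sketched as below, in Python for brevity. The sketch counts each word and adds the counts of its listed synonyms. The per-word role factor standing in for the relative frequency is purely hypothetical, since the text does not give the actual formula, and all names and numbers are invented for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List


def raw_frequencies(words: Iterable[str],
                    synonyms: Dict[str, List[str]]) -> Dict[str, int]:
    """Count each word, adding the counts of its listed synonyms."""
    counts = Counter(word.lower() for word in words)
    raw = {}
    for word, count in counts.items():
        # Counter returns 0 for synonyms that never occur in the text.
        synonym_total = sum(counts[s] for s in synonyms.get(word, []) if s != word)
        raw[word] = count + synonym_total
    return raw


def apply_role_factors(raw: Dict[str, int],
                       role_factor: Dict[str, float]) -> Dict[str, float]:
    """Scale raw frequencies by a per-word factor (hypothetical stand-in
    for the subject/verb/object and position adjustments)."""
    return {word: freq * role_factor.get(word, 1.0) for word, freq in raw.items()}


words = ["budget", "cost", "budget", "plan"]
synonyms = {"budget": ["cost"], "cost": ["budget"]}
raw = raw_frequencies(words, synonyms)
print(raw)
print(apply_role_factors(raw, {"budget": 2.0}))
```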
3 IDENTIFICATION OF CASE STUDY
Dr. Patrick Hester and Dr. Tom Meyers work for the National Center for System of
Systems Engineering (NCSOSE), consulting with organizations and businesses that need an
outside view on issues or on future plans for improvement. When consulting with a client, they
use the Assessment Improvement Design (AID) methodology to assist the client in both
realizing and achieving its goals. The focus is on evaluating current performance with respect
to client intent, enhancing performance based on evaluations of current operations, and
procedure versus alternatives. Using this, they create a new method for improvement that is
aligned with their client’s intent.
Figure 4. AID Process: Assessment
In following this process, Dr. Hester and Dr. Meyers must familiarize themselves
with their client’s domain. Essentially, they must become experts in the inner workings of their
client’s organization and its field of work. The difficulty of this task depends on
whether their client provides usable documentation. LASI will assist in the process of defining
what the potential problem is and whether it coincides with what the client believes is the issue.
In Figure 4, LASI would fit into the Document Analysis portion of the Assessment phase. The
results that LASI produces can be used to verify Dr. Hester and Dr. Meyers’s assessment of the
situation and serve as visual proof of their reasoning for the client.
4 PRODUCT PROTOTYPE DESCRIPTION
Due to time constraints, the LASI prototype has reduced functionality compared with the
real-world product. LASI will still function in the same way, but in a less complex and
less process-intensive manner. The prototype narrows the scope while still providing a
product that can demonstrate LASI’s capabilities.
4.1 Prototype Architecture (Hardware/Software)
The hardware and software components for the prototype will remain largely unchanged
from the real-world solution that was discussed in Section 2.2. Figure 5 shows the
hardware and software components of the LASI prototype. The hardware required to run the
LASI prototype is a laptop or desktop with at least 8 GB of DDR3 RAM and a quad-core CPU.
For development purposes, we will be using a virtual machine as a testing and code-writing
environment. The software needed for the prototype includes our part-of-speech tagger and a
document converter that converts DOC and DOCX files to TXT files. Other software includes the
LASI algorithms and the LASI GUI.
Figure 5. Prototype Hardware and Software Component Diagram
The third-party software for the LASI prototype is the SharpNLP Part-of-Speech Tagger
and the B2XTranslator. The SharpNLP POS Tagger tags words and phrases with their respective
parts of speech for use by the LASI algorithm. The B2XTranslator converts DOC files to DOCX
files. The files can then be converted to TXT files.
In the LASI prototype, word and phrase binding works the same as it would in the real-world
solution. Words and word phrases are interrelated based on the tagged part of speech and
how they relate to one another within phrases, paragraphs, and the document. The weighting
algorithm will assign each word a weight based on its part of speech, its frequency count, and the
number of times and ways it is referenced.
4.2 Prototype Features and Capabilities
Table 1. Feature comparison between prototype and real world product
As shown in Table 1, there are a few key differences between the real-world product and our
prototype. The types of documents that the LASI prototype accepts have been limited to DOC
and DOCX. Scanned text recognition has been removed from the prototype since there is not
enough time to get the OCR software fully functioning. The prototype will limit the number of
documents that can be added to one project to three to five, and there is a size limit of 10
pages on each of those documents to ensure that the algorithm can function in a timely manner.
Rather than focusing on every part of speech, the LASI prototype will focus on noun-verb
binding. A few of the more complex features also did not make it into the
prototype, such as user-defined dictionaries, synonym identification, and content assumptions.
Despite the removal of these nonessential features, the prototype will still function very
similarly to how the real-world product would function.
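The prototype limits above lend themselves to a simple project check, sketched here in Python. It assumes that "three to five" means an upper bound of five documents and that page counts are supplied by the caller, since counting pages in a DOC or DOCX file needs a document library. The function name and message wording are hypothetical.

```python
import os
from typing import Dict, List

ALLOWED_EXTENSIONS = {".doc", ".docx"}  # prototype accepts DOC and DOCX only
MAX_DOCUMENTS = 5                       # assumed upper end of "three to five"
MAX_PAGES = 10                          # per-document page limit


def check_project(documents: Dict[str, int]) -> List[str]:
    """Return a list of problems for a mapping of file name -> page count."""
    problems = []
    if len(documents) > MAX_DOCUMENTS:
        problems.append(f"project has {len(documents)} documents; limit is {MAX_DOCUMENTS}")
    for name, pages in documents.items():
        extension = os.path.splitext(name)[1].lower()
        if extension not in ALLOWED_EXTENSIONS:
            problems.append(f"{name}: unsupported file type '{extension}'")
        elif pages > MAX_PAGES:
            problems.append(f"{name}: {pages} pages exceeds the {MAX_PAGES}-page limit")
    return problems


print(check_project({"plan.docx": 4, "scan.pdf": 2, "history.doc": 12}))
```

An empty list means the project fits within the prototype limits.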
GLOSSARY
A.I.D.: Assessment Improvement Design
A.I.D. Process: A process that provides a quantitative and qualitative basis to identify problems
and determine the feasibility of solutions.
Analysis: Detailed examination of the elements or structure of something, typically as a basis for
interpretation.
Document: A document herein refers to a formally written, expository paper which expounds,
via a declarative approach, on a relatively quantifiable issue, goal, or area of research.
Head word: A locally distinct word within a phrase which, by its syntactic associations,
determines the category of the phrase itself.
LASI: Linguistic Analysis for Subject Identification
Lexer: Part of the parsing tool that isolates each word, its part of speech, and location in a
sentence into machine readable tokens. These are stored as elements in an XML file.
Linguistic Analysis: The scientific analysis of a language.
Optical Character Recognition: Conversion of scanned images to text.
Parser: Takes in DOC and DOCX files and converts them to TXT files.
Part of Speech Tagger: Software utility that associates words with the parts-of-speech in a
sentence.
Phrase: A group of words standing together as a conceptual unit, typically forming a new
component.
Semantic Analysis: Relating the syntactical structure of words to their language independent
meanings.
SharpNLP: A natural language processing tool, written in C#, used to parse and tag parts of speech.
Strategic Document: A document produced by a client that defines its goals, vision,
and mission.
Subject Identification: Finds the main actor in a sentence. However, in a broader sense, the
word subject is synonymous with the themes of one or more documents. Subject
identification is the process of determining subjects, or themes of a document or
documents.
Syntactic Analysis: A form of Linguistic analysis that focuses on grammar in sentences and
identifies themes based on structure and formatting. Unlike Semantic Analysis, it
identifies key words based on their location in the sentence, rather than their overall
meaning throughout the document.
Theme: Subject-object-verb relationships that LASI is attempting to generate from the input set
Tag: A label, or the act of attaching a label, that specifies the role (such as part of speech or
location) of a selected element in a document
Tagged Set: A group of words, whose part of speech and location in a sentence have been
identified by the parser
Tagged Word Object: A word that has an associated part-of-speech
Tornado chart: A horizontal bar graph like visualization, representing the relative frequency or
significance of elements, sorted in descending order by magnitude
Word Binding: The process of binding part-of-speech to a word
WordNet: Compiler and provider of our thesaurus.
Word Weight: A numeric value, associated with each syntactically and lexically unique word
in a written work, indicating its significance.
REFERENCES
SharpNLP. (n.d.). Retrieved from http://sharpnlp.codeplex.com/
Office binary to open xml. (n.d.). Retrieved from http://b2xtranslator.sourceforge.net/