LASI Product Description V2

Lab 1 – LASI Description
Running head: LAB 1 – LASI DESCRIPTION
Linguistic Analysis for Subject Identification
Determining common themes across multiple documents
CS411 Red Team
Dustin Patrick
3/18/2013
Contents
List of Figures
1 Introduction
2 Product Description
2.1 Key Product Features and Capabilities
2.2 Major Functional Components
3 Identification of Case Study
4 Prototype Description
4.1 Major Prototype Functional Components
4.2 Prototype Features and Capabilities
Glossary of Terms
List of Figures
Figure 1. Top Results view (weighted tornado chart)
Figure 2. Word relationships view
Figure 3. Textual results view
Figure 4. Major functional components of the real-world product
Figure 5. Current AID process
Figure 6. LASI's contribution to the AID process
Figure 7. Differences between the real-world product and the prototype
1. Introduction
LASI, which stands for Linguistic Analysis for Subject Identification, is a tool designed to
assist researchers in finding common themes across a document or multiple
documents. It relies on an algorithm that identifies themes based on a word’s weight in a
document or documents, as well as its frequency. A word’s weight is determined by the
relationships it has with other words in the sentences in which it is used. Linguistic Analysis in
this case refers to the abstract understanding of an author’s intention for a document based on a
syntactic and semantic evaluation of the text of the document(s). LASI is a decision support tool,
not a decision making tool. This means that it will not output a single theme for a document, but
a list of possible themes, along with their likelihood of accurately reflecting the theme of a
document. The user will still need to analyze the document to determine which of the themes is
most accurate. A theme in this case refers to the main idea of an entire document. In conventional
literary analysis, it is determined by answering the following questions about a document: who,
what, when, where, why, and how. LASI takes a different, less subjective approach to finding the
theme or themes. Finding the theme of a document or multiple documents is important because
understanding the main idea of what others say and write is the foundation of human
communication. LASI does not reinvent the wheel, but it makes the process of finding the themes
of large documents significantly less time-consuming.
The problem with manually determining themes is that it is difficult to determine themes across
multiple documents in an objective, consistent, and timely manner. Determining themes is
something that most people do automatically, but two people often read the same work and
derive different themes from that same work. LASI seeks to resolve that by using an algorithmic
approach to determining themes rather than a subjective one. It can also be difficult for people to
consistently gather the same theme across multiple documents. A person can read something
more than once and get a different main idea each time because of the subjective nature of the
way people read. LASI does not follow this subjective approach and when a document is
analyzed, it will display the same results each time. The process of manually determining the
theme of a document is also incredibly time-consuming. It requires multiple read-throughs of the
same documents to make sure that the author’s meaning is not lost. It can take hours or days to
determine this information. LASI cuts this process down to a matter of minutes.
2. Product Description
LASI as a real-world product will be a computer application designed to run on the Windows
platform. It will be written in C# and will be a standalone, client-side desktop application. It
will be efficient enough to run on a high-end laptop and will be built as an engine that can be
extended with plugins that add functionality. As a real-world application, LASI
will be able to determine themes across as many documents as can be provided and will provide
cross-document analysis to determine single themes across multiple documents. It will give the
user the ability to create custom dictionaries to increase the accuracy and statistical
likelihood of determining a theme. It will also be optimized for efficiency,
incorporating multi-threading to shorten processing time and reduce resource usage. It will
contain a minimalistic user interface to avoid confusing the user. It will parse documents
in each of the following formats: PDF, DOC, DOCX, PPT, PPTX, and TXT. LASI will generate
themes for each individual document as well as across all documents. LASI will also be an
open-source project released under the GNU Lesser General Public License (LGPL).
LASI will be useful for many different demographics. It can be used by teachers to identify
plagiarism, as well as to assist with the grading of papers. The teacher just needs to run the
papers through LASI and wait for the output. Students will find this tool useful as well because it
will make reading through books and papers for research significantly faster. They can also use
LASI to make sure what they are submitting does not contain exactly the same information as
something someone else is submitting. Researchers will be able to use LASI as well because
their job consists of reading countless documents thoroughly, often in fields with which they
may not be familiar. Lawyers and contractors can use LASI to read through legal documents and
identify exactly what is being agreed to in a specific contract.
2.1. Key Product Features and Capabilities
LASI will determine themes across multiple documents by using semantic and syntactic
evaluation of a set of documents. It will determine each word’s part of speech and its location
in a sentence, that is, whether it is part of a subject, verb, or object. It will then assign the
word a weight based on a count of that word together with its part of speech, a generic word
count over the whole document, and where the word falls in a specific document. It will
evaluate the weight of all words in a document and then output the results in multiple views.
The real-world product will be able to accept multiple file types as input and will parse them all
the same way. Additional documents can be added to a project after analysis has begun, and the
user will be able to add custom dictionaries to the project to increase efficiency and accuracy.
LASI will not make any assumptions about content by default, but the user can specify a type of
document (for example, strategic documents, literary documents, or scientific reports) so that
LASI can parse documents of that type more accurately.
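The weighting idea described above can be illustrated with a minimal sketch. It is written in Python purely for illustration (LASI itself is to be written in C#), and the role multipliers and the function name weigh_words are assumptions, not values or names taken from LASI.

```python
from collections import Counter

# Hypothetical role multipliers; the real values are not given in this paper.
ROLE_MULTIPLIER = {"subject": 3.0, "verb": 2.0, "object": 2.0, "other": 1.0}

def weigh_words(tagged_words):
    """tagged_words: iterable of (word, part_of_speech, role) tuples.

    Returns {word: weight}, where each occurrence contributes the
    multiplier of the sentence role in which it appears.
    """
    weights = Counter()
    for word, _pos, role in tagged_words:
        weights[word.lower()] += ROLE_MULTIPLIER.get(role, 1.0)
    return dict(weights)

sample = [
    ("LASI", "noun", "subject"),
    ("identifies", "verb", "verb"),
    ("themes", "noun", "object"),
    ("LASI", "noun", "subject"),
    ("quickly", "adverb", "other"),
]
weights = weigh_words(sample)
```

In this sketch a word that appears twice as a subject outweighs a word that appears once as an object, which is the kind of ordering the views described below would display.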
There will be multiple levels of output that LASI will display to the user after it completes. The
user interface will be able to display the top results, which is essentially a weighted word count
displayed as a tornado chart. A visual representation of this can be found in Figure 1 below. The
Top Results page with the weighted output can be displayed for all documents together or for
individual documents.
Figure 1.
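As a rough text-only illustration of the Top Results view, the sketch below sorts weighted words in descending order and draws one horizontal bar per word. It is Python written for illustration only; the function name tornado_chart, the bar character, and the sample weights are assumptions and do not describe LASI's actual interface.

```python
def tornado_chart(weights, top_n=5, width=30):
    """Return text lines: one bar per word, sorted by descending weight."""
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    if not ranked or ranked[0][1] <= 0:
        return []
    largest = ranked[0][1]
    lines = []
    for word, weight in ranked:
        # Scale each bar relative to the heaviest word; never draw an empty bar.
        bar = "#" * max(1, round(width * weight / largest))
        lines.append(f"{word:<12} {bar} {weight:g}")
    return lines

chart = tornado_chart({"mission": 9, "strategy": 12, "budget": 3}, width=12)
print("\n".join(chart))
```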
There is also a word relationships page which will display an individual document, as well as a
color-coded representation of words in the document. The colors correspond to a word’s part of
speech. This is demonstrated in Figure 2.
Figure 2.
The last view of the output would be an in-depth textual representation of each word, its weight,
its count, and then its part of speech. A very basic implementation of this is listed in Figure 3.
This output view would be the most informative, but will likely be more difficult to read than the
other two output views.
Figure 3.
The results of LASI will also need to be exportable for use as presentation materials and visual
aids and for further analysis. Exporting the results into multiple file formats also provides a level of
convenience for the user. LASI will be able to export results in PDF, XLS, and several image
formats so that anyone using LASI will be able to make the results portable.
2.2. Major Functional Components
LASI will be efficient enough to run on the virtual machine provided by the university.
However, the real-world product will have some hardware requirements. It will require a
quad-core or better Intel Core CPU and 8 GB or more of DDR3 SDRAM. It will also require the
user to provide secondary storage space. The exact amount of storage required will be
specified at a later time. The major functional components of the real-world product can be seen
in Figure 4.
Figure 4.
There will also be several software components of LASI. The external tools LASI will use
to assist with development and functionality include the SharpNLP part-of-speech tagger, an
open-source C# port of the Java-based OpenNLP tools. Because SharpNLP is written in C#, it is
easier to incorporate into LASI, which is also written in C#. LASI will also use WordNet, a
lexical database compiled at Princeton University that groups a very large number of English
words with their known synonyms and antonyms. It will be incredibly useful for binding synonyms
together to improve the accuracy of results. LASI also takes advantage of a doc2x tool that
converts Microsoft Word 1997 – 2003 document files (.doc) into a more manageable format,
Microsoft Word 2007 document files (.docx). This conversion is necessary because .docx files
are actually a compressed archive that contains an easily parsed XML file holding all the text
of a document.
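To show why the .docx format is convenient to parse, here is a minimal sketch that opens a .docx file as a ZIP archive and reads the paragraph text from its main XML part, word/document.xml. It is Python for illustration only and is not the doc2x tool or LASI's parser; the function name docx_paragraphs is an assumption. The demo builds a tiny in-memory archive that holds only the main document part so that the sketch runs without a real Word file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part of a .docx file.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_paragraphs(source):
    """Return the text of each non-empty paragraph in a .docx (path or file object)."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> runs.
        text = "".join(run.text or "" for run in para.iter(W_NS + "t"))
        if text:
            paragraphs.append(text)
    return paragraphs

# Demo archive: only the main document part, so Word itself would not open it.
_xml = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:body><w:p><w:r><w:t>Our mission is clear.</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>Our vision </w:t></w:r><w:r><w:t>is bold.</w:t></w:r></w:p>"
    "</w:body></w:document>"
)
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr("word/document.xml", _xml)
demo.seek(0)
paragraphs = docx_paragraphs(demo)
```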
Several key data structures will be incorporated into LASI. Each word and phrase will be stored
in a C# List, which is comparable to a vector in C++. Each word will be assigned a type and then
added to a list at the initial parsing of a document. Words will be assigned a part of speech
before being added to a list, and phrases will be assigned a location in a sentence, meaning that
a phrase will be determined to be part of either the subject or the object of a sentence. These
lists will be traversable word by word. Each individual word will also link to another list of
associated words.
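The data structures described above might look roughly like the following sketch. It uses Python lists and dataclasses in place of C# List purely for illustration; the class names Word and Phrase and their fields are hypothetical, not LASI's actual types.

```python
from dataclasses import dataclass, field

@dataclass
class Word:
    text: str
    part_of_speech: str                              # assigned before the word is stored
    associated: list = field(default_factory=list)   # links to related Word objects

@dataclass
class Phrase:
    words: list
    role: str                                        # "subject" or "object"

def build_word_list(tagged):
    """tagged: list of (text, part_of_speech) pairs in document order."""
    return [Word(text, pos) for text, pos in tagged]

words = build_word_list([("the", "determiner"), ("quick", "adjective"), ("fox", "noun")])
words[2].associated.append(words[1])                 # the noun links to its adjective
subject = Phrase(words=words, role="subject")
```

Keeping the associated words on each Word object means a traversal of the main list can follow links to related words without a separate lookup table.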
The underlying algorithm of LASI will consist of several key parts: the element binding
process, the weighting process, and the high-level analysis process. An element is either a word
or a phrase. There will be both direct binding and indirect binding of elements to other elements.
Direct binding of word elements will consist of binding nouns and verbs together, adverbs to
verbs, adjectives to nouns, and determiners to nouns. Direct binding of phrase elements will
consist of binding phrases to subjects, phrases to objects, and then breaking phrases down and
binding them to each word inside them. Indirect binding will consist of binding synonyms so that
they count as a single noun, and binding pronouns to the nouns they refer to.
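A much simplified sketch of direct and indirect binding follows, in Python for illustration only. It binds adjectives and determiners to the noun that follows them, maps synonyms through a hypothetical lookup table, and binds each pronoun to the nearest preceding noun. That pronoun rule is a naive stand-in chosen for brevity, not the resolution method LASI uses, and the function name bind_elements is an assumption.

```python
def bind_elements(tagged, synonyms=None):
    """tagged: list of (word, part_of_speech) pairs for one sentence.

    Returns (direct, indirect): direct maps a noun to its modifiers,
    indirect maps a pronoun or synonym to the noun it stands for.
    """
    synonyms = synonyms or {}
    direct, indirect = {}, {}
    pending = []            # adjectives and determiners waiting for their noun
    last_noun = None
    for word, pos in tagged:
        word = word.lower()
        if pos in ("adjective", "determiner"):
            pending.append(word)
        elif pos == "noun":
            canonical = synonyms.get(word, word)
            if canonical != word:
                indirect[word] = canonical        # synonym bound to a single noun
            direct.setdefault(canonical, []).extend(pending)
            pending = []
            last_noun = canonical
        else:
            pending = []                          # modifiers do not cross other words
            if pos == "pronoun" and last_noun is not None:
                indirect[word] = last_noun        # naive: nearest preceding noun
    return direct, indirect

sentence = [
    ("The", "determiner"), ("large", "adjective"), ("firm", "noun"),
    ("said", "verb"), ("it", "pronoun"), ("leads", "verb"),
    ("the", "determiner"), ("company", "noun"),
]
direct, indirect = bind_elements(sentence, synonyms={"company": "firm"})
```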
The weighting process will be handled by two separate processes. The first is a raw weighting
system, which analyzes simple word frequency, word and phrase frequency with part of speech
and location in a sentence considered, and synonym-aware word frequency. This will
provide the foundation for LASI to modify weights via comparison with other words. The second,
the relative comparison, will count the relationships between words and modify the weight of each
word and phrase accordingly. It will measure the lexical distance between associated words in
the document set. It will also produce a pronoun-aware word frequency to increase the word counts
of the associated nouns.
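The difference between simple word frequency and synonym-aware or pronoun-aware frequency can be sketched as follows. This is illustrative Python, not LASI's weighting code; the bindings table is a hypothetical result of the indirect binding step, and both function names are assumptions.

```python
from collections import Counter

def raw_frequency(words):
    """Simple word frequency, ignoring case."""
    return Counter(w.lower() for w in words)

def merged_frequency(words, bindings):
    """Count each word under the noun it is bound to (synonym or pronoun)."""
    return Counter(bindings.get(w.lower(), w.lower()) for w in words)

words = ["Firm", "it", "company", "grows", "firm"]
bindings = {"company": "firm", "it": "firm"}   # hypothetical indirect bindings
raw = raw_frequency(words)
merged = merged_frequency(words, bindings)
```

Here the raw count credits "firm" with two occurrences, while the merged count credits it with four because the pronoun and the synonym are counted toward the noun they stand for.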
It will also provide a high-level analysis of each element’s weight and then order the highest
weighted words and phrases to form a list of coherent sentences. The algorithm will
also determine the optimal overlap of weighting metrics to produce the most accurate results. It
will then employ a process for resolving conflicts between highly weighted words.
3. Identification of Case Study
Dr. Patrick Hester & Dr. Tom Meyers work for an organization called NCSOSE. NCSOSE
stands for the National Center for Systems of Systems Engineering. NCSOSE works with
organizations and companies to improve workflow and optimize efficiency. They also generate
and provide training materials to these organizations.
At NCSOSE, Dr. Hester and Dr. Meyers currently use a process known as the AID process to
identify problem statements from groups of strategic documents. AID stands for Assessment
Improvement Design. The Assessment phase of this process is what LASI will improve
upon. This phase involves analyzing multiple strategic documents, often in domains in which
Dr. Hester may not be an expert, and identifying key components based on several criteria.
This requires extensive research and is an incredibly time-consuming and in-depth process.
Dr. Hester and Dr. Meyers then take the analyzed data from these documents and formulate a
concise and accurate problem statement. This problem statement is used to identify
organizational issues and then to offer ways to optimize the client company and improve its
efficiency. Figure 5 outlines the current AID process.
Figure 5.
There is currently a bottleneck at the Assessment part of the process that LASI
would be able to assist with. LASI will eliminate much of the thorough analysis that
Dr. Hester performs on the documentation provided and will output the same results to him
quickly. All he will need to do is analyze the results of LASI and determine the most appropriate
theme from them. Because of its objective nature, LASI will remove much of the guesswork and will
be a strong means of defending his findings. LASI will provide him with a group of themes in an
objective, consistent, and timely manner. This will resolve the bottleneck in his system and allow
him to interact with his clients while the “Assessment” step in AID is being processed. Figure 6
demonstrates what LASI will contribute to this.
Figure 6.
4. Prototype Description
Given the complex nature of the algorithm and the simple nature of both the user interface and
the hierarchy of users, the main difference between the real-world product and the
prototype is the algorithm, which needs to be scaled down. The algorithm’s output will be
approximately the same. It will still accept multiple documents and search for common themes
across them. It will still interact with the UI the same way. The results will still be displayed
graphically on the results page. The difference will be that the algorithm will be a bit less
versatile, as it will make a few more assumptions about the documents through which it searches.
The only two acceptable input types will be .doc and .docx files. It will also be forced to limit the
number of documents that it can analyze. The prototype will limit interpretable documents to 10
pages or fewer. Lastly, rather than focusing on mapping every part of speech to related parts of
speech, LASI will focus on subjects, verbs, direct objects, and indirect objects.
This prototype’s limitation on the nature of the subject matter in the documents stems from the
fact that NCSOSE deals exclusively with strategic documents. Being specific about the nature of
the documents parsed by LASI will also make our results more accurate for NCSOSE. For
instance, LASI can search for keywords like “Mission,” “Vision” and “Goals” and then place
increased emphasis on the content below those words to create a more accurate weighting
algorithm. This also saves some work for the algorithm because it leads to fewer passes of the
document.
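The keyword emphasis described above can be sketched as a single pass over the lines of a document. This Python sketch is illustrative only: the boost factor, the rule that a blank line ends an emphasized block, and the function name line_emphasis are all assumptions, since the paper does not specify them.

```python
KEYWORDS = {"mission", "vision", "goals"}
BOOST = 2.0   # hypothetical emphasis factor; not specified in this paper

def line_emphasis(lines):
    """Yield (line, factor): lines under a keyword heading get BOOST."""
    factor = 1.0
    for line in lines:
        stripped = line.strip()
        if stripped.rstrip(":").lower() in KEYWORDS:
            factor = BOOST          # heading found: emphasize what follows
            continue
        if not stripped:
            factor = 1.0            # assumption: a blank line ends the emphasized block
            continue
        yield stripped, factor

doc = [
    "Overview", "We build ships.", "",
    "Mission", "Deliver safe vessels.", "",
    "History", "Founded in 1950.",
]
emphasis = list(line_emphasis(doc))
```

Each word's weight could then be multiplied by the factor of the line in which it appears, all within one pass over the document.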
The decision to limit the input types to .doc and .docx files is due to the fact that
determining a suitable mechanism for parsing other file types, some of which require optical
character recognition, would require learning an entire API and would increase the likelihood of
errors in parsing. Optical character recognition would lead to the possibility of LASI reading the
same word or phrase differently based on input file extension. A homogeneous format for gathering
parsing input would resolve that. Also, .docx files are incredibly simple to parse and .doc files
can be converted easily to .docx.
Limiting the number of documents in the prototype will result in a simplified output. If the
prototype allows too many documents to be parsed, it will need to contain some sort of
algorithm to remove irrelevant subject/object/verb associations. Time constraints prevent this
from being feasible, so the input pool must remain small; otherwise, the results will be
confusing and difficult to read. Another reason to limit the number of input documents is that
this prototype will be a multi-threaded application. The more documents that are allowed for
input, the more memory will need to be devoted to this tool.
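A minimal sketch of capping the input pool and processing documents on a pool of worker threads is shown below. It is Python for illustration only; the cap of five documents, the placeholder count_words analysis, and the function name analyze_all are assumptions and do not reflect the prototype's actual limits or algorithm.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_DOCUMENTS = 5   # hypothetical cap; the paper does not state the exact limit

def count_words(text):
    """Placeholder for the per-document analysis."""
    return len(text.split())

def analyze_all(documents, workers=4):
    """Analyze each document on a worker thread, refusing oversized input pools."""
    if len(documents) > MAX_DOCUMENTS:
        raise ValueError(f"at most {MAX_DOCUMENTS} documents are supported")
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(count_words, documents))   # results keep input order

results = analyze_all(["one two", "three four five"])
```

Rejecting oversized input up front keeps memory use bounded, because every accepted document is held in memory while the workers run.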
Limiting the number of pages in a document accomplishes essentially the same thing as limiting
the number of documents. The program will need to rely heavily on its weighting algorithm to
determine themes of these documents. Not assuming a fixed maximum length greatly increases the
amount of required testing, and without limiting input, the 18 weeks available simply will not
provide the necessary time to debug the prototype.
Lastly, the decision to limit the prototype to identifying subjects, verbs, direct objects, and
indirect objects was made because themes can still be derived from these elements, and doing so
produces fewer incorrect associations than going deeper and analyzing every single part of speech.
If the weighting algorithm focuses on subject/object relationships, it can determine valid themes
with an increased statistical likelihood of accuracy, and it will not need to go back through and
remove false associations. This change will also remove the need to search for synonyms
individually, as they can be determined not just by verb associations, but by the associations
with the other parts of that sentence.
4.1. Major Prototype Functional Components
Course requirements, as well as the need to limit the amount of testing, call for a
homogeneous hardware platform. This means that a number of changes are needed to the
hardware platform, the algorithm, and what the prototype will accept as input. The main reasons
are to decrease the time required for testing and to decrease the
potential resource usage. The specific changes are outlined below.
One of the requirements for the semester is that the project run on the virtual machine provided
to each group by the CS department. As a result, LASI will be developed and tested on this
virtual machine. However, there may be hardware limitations on this virtual machine which
prevent it from running the code optimally. That being said, the virtual machine contains 8GB
virtual RAM, and a Quad Core Intel Core CPU.
The software will be changing from the real-world product in that the prototype will keep the
graphical interface more or less the same, but will be missing some of the underlying complexity
of the algorithm. The GUI will still contain the ability to save and load past analysis, select new
documents, and display results, but it will not be able to add documents during analysis.
The prototype will be able to convert DOC and DOCX files to a usable format,
but will not be able to handle PPT, PPTX, and PDF files. The reason for this is that Optical
Character Recognition is incredibly difficult and not accurate enough for what we are trying to
do. Also, PPT and PPTX files can contain a plethora of different formats, most of which are not
traversable by a tool that parses by sentence.
The algorithm will still contain part-of-speech tagging and a simplified weighting algorithm that
focuses on subject-object relationships rather than the robust relationships between individual
words. It will still bind phrases to words and determine whether a phrase is a subject or an object
of a sentence. It will also still bind pronouns and synonyms to nouns to increase accuracy.
4.2. Prototype Features and Capabilities
Because of the time constraint this semester, we will need to scale back our
original product from a marketable real-world product to a prototype that is
missing some functionality. The real-world product would contain everything that was listed in
the product description section of this paper. The prototype will need to make a few assumptions
about the nature of the documents that we are analyzing, and it will also need to be a bit less
robust. This will result in decreased accuracy, but it also provides more time to implement a
solution that meets all of the requirements set by NCSOSE and can be created in the
time frame that was set last semester.
Certain functionality must be eliminated from the prototype. It will need to limit the type of input
documents to the DOC, DOCX, and TXT file formats. The file length must be limited as well. The
exact length of the files will be determined when the prototype reaches the testing stage. The
number of files that can be input will need to be limited as well for the sake of testing the
accuracy of the output generated by LASI. It will also decrease the resource usage of the
program to limit the number of input files. The algorithm will also not be providing a visual
representation of the logic of LASI in the prototype as it will require modifying the output
format. The LASI prototype will exclude scanned text recognition because incorporating and
testing an Optical Character Recognition tool would be more time consuming than it would be
worth to include in the prototype. Including it would also result in a decrease in accuracy of the
results. Certain optional refinement tools must also be removed from the prototype, including
user-added items such as dictionaries, keywords, and content assumptions, because
implementing such features would require months of testing. There is also a reduction in the
number of times that LASI will search through documents in an effort to improve load time and
decrease testing. Figure 7 is a visual representation of the differences between the real-world
solution and our prototype.
Figure 7.
Glossary of Terms
Theme: subject-object-verb relationships that LASI is attempting to generate from the
input set
LASI: Linguistic Analysis for Subject Identification
Parser: Takes in DOC and DOCX files and converts them to TXT files
WordNet: The lexical database, compiled at Princeton University, that provides our thesaurus
Phrase: A group of words standing together as a conceptual unit, typically forming a new
component.
Analysis: Detailed examination of the elements or structure of something, typically as a basis for
interpretation.
Linguistic Analysis: The process of gathering information about a document’s content from the
language of that document.
Tag: A label, or the act of attaching a label, that specifies the role (such as part of speech or
location) of a selected element in a document.
Document: A document herein refers to a formally written, expository paper which expounds,
via a declarative approach, on a relatively quantifiable issue, goal, or area of research.
Word Weight: A numeric value, associated with each syntactically and lexically unique word in
a written work, which indicates the relative significance of that word.
Tornado chart: A horizontal bar-graph-like visualization representing the relative frequency or
significance of elements, sorted in descending order by magnitude.
Head word: A Head Word is the locally distinct word within a phrase which, by its syntactic
associations, determines the syntactic category of the phrase itself.
Word Binding: The process of binding part-of-speech to a word.
SharpNLP: C# natural language processing tool used to parse and tag part-of-speech.
Tagged Word Object: A word that has an associated part-of-speech.
Optical Character Recognition: Conversion of scanned images to text.
Tagged Set: a group of words whose part of speech and location in a sentence have
been identified by our parser.
Lexer: a piece of our parsing tool that isolates each word and its part of speech, and location in a
sentence into machine readable tokens. These are stored as elements in an XML file.
Syntactic Analysis: a form of Linguistic analysis that focuses on grammar in sentences and
identifies themes based on sentence structure and formatting. Unlike Semantic Analysis, it
identifies key words based on their location in the sentence, rather than their overall meaning
throughout the document.
Subject Identification: This is the process of identifying the main actor in a sentence. However,
in a broader sense, the word subject is synonymous with the theme of a document. Subject
identification is the process of determining subjects, or themes of a document or documents.
Part of Speech Tagger: Software utility that associates words with their parts of speech (e.g.,
noun, verb) in a sentence.
Semantic Analysis: Relating the syntactical structure of words to their language independent
meanings.
A.I.D. Process: Assessment Improvement Design: A process that provides a quantitative and
qualitative basis to identify problems and determine the feasibility of solutions.
Strategic Document: Document produced by a client that defines their goals, visions, and
missions.