Lab 1 – LASI Description
Running head: LAB 1 – LASI DESCRIPTION

LASI Product Description V2
Linguistic Analysis for Subject Identification
Determining common themes across multiple documents
CS411 Red Team
Dustin Patrick
3/18/2013

Contents
1 Introduction
2 LASI Product Description
2.1 Key Product Features and Capabilities
2.2 Major Functional Components
3 Identification of Case Study
4 LASI Prototype Description
4.1 Major Prototype Functional Components
4.2 Prototype Features and Capabilities
Glossary

1. Introduction

LASI, which stands for Linguistic Analysis for Subject Identification, is a tool designed to assist researchers in finding common themes across a document or multiple documents. It relies on an algorithm that identifies themes based on a word’s weight in a document or documents, as well as its frequency. A word’s weight is determined by the relationships it has with other words in the sentences in which it is used. Linguistic analysis in this case refers to the abstract understanding of an author’s intention for a document based on a syntactic and semantic evaluation of the text of the document(s).

LASI is a decision support tool, not a decision making tool. This means that it will not output a single theme for a document, but a list of possible themes, along with their likelihood of accurately reflecting the theme of the document. The user will still need to analyze the document to determine which of the themes is most accurate.

A theme in this case refers to the main idea of an entire document. In normal literary analysis it is determined by answering the following questions about a document: who, what, when, where, why, and how.
LASI takes a somewhat different approach, one that is less subjective in how it finds the theme or themes. Finding the theme of a document or multiple documents is important because understanding the main idea of what others say and write is the foundation of human communication. LASI does not reinvent the wheel, but it makes the process of finding the themes of large documents significantly less time consuming.

The problem with manually determining themes is that it is difficult to do so across multiple documents in an objective, consistent, and timely manner. Determining themes is something that most people do automatically, but two people often read the same work and derive different themes from it. LASI seeks to resolve that by using an algorithmic approach to determining themes rather than a subjective one. It can also be difficult for a person to consistently gather the same theme across multiple documents. A person can read something more than once and get a different main idea each time because of the subjective nature of the way people read. LASI does not follow this subjective approach; when a document is analyzed, it will display the same results each time. The process of manually determining the theme of a document is also incredibly time-consuming. It requires multiple read-throughs of the same documents to make sure that the author’s meaning is not lost. It can take hours or days to determine this information. LASI cuts this process down to a matter of minutes.

2. Product Description

LASI as a real-world product will be a computer application designed to run on the Windows platform. It will be written in C# and will be a standalone, client-side desktop application. It will be efficient enough to run on a high-end laptop, and it will be an engine that can be extended with plugins that add functionality.
LASI as a real-world application will be able to determine themes across as many documents as are provided and will provide cross-document analysis to determine single themes across multiple documents. It will give the user the ability to create custom dictionaries to increase the accuracy and statistical likelihood of determining a theme. It will also be optimized to improve efficiency by incorporating multi-threading to expedite processing time and decrease resource usage. It will have a minimalistic user interface to avoid confusing the user. It will also parse documents in each of the following formats: PDF, DOC, DOCX, PPT, PPTX, and TXT. LASI will generate themes across all documents, as well as for each individual document. LASI will also be an open-source project released under the GNU Lesser General Public License (LGPL).

LASI will be useful for many different groups of users. Teachers can use it to identify plagiarism, as well as to assist with the grading of papers; the teacher just needs to run the papers through LASI and wait for the output. Students will find this tool useful as well because it will make reading through books and papers for research significantly faster. They can also use LASI to make sure what they are submitting does not contain exactly the same information as something someone else is submitting. Researchers will be able to use LASI as well because their job consists of reading countless documents thoroughly, often in fields with which they may not be familiar. Lawyers and contractors can use LASI to read through legal documents and identify what exactly is being agreed to in a specific contract.

2.1. Key Product Features and Capabilities

LASI will determine themes across multiple documents by using semantic and syntactic evaluation of a set of documents.
It will accurately determine a word’s part of speech, as well as that word’s location in a sentence (whether it is part of a subject, verb, or object), and then it will assign a weight based on a count of that word together with its part of speech, a generic word count over the whole document, and where that word falls in a specific document. It will evaluate the weight of all words in a document and then output the results in multiple views.

The real-world product will be able to accept multiple file types as input and will parse them all the same way. Additional documents can be added to a project after analysis has begun, and the user will be able to add custom dictionaries to the project to increase efficiency and accuracy. LASI will not make any assumptions about content by default, but the user can specify a type of document (for example strategic documents, literary documents, or scientific reports, among many others), which will make LASI parse documents of that type more accurately.

There will be multiple levels of output that LASI will display to the user after it completes. The user interface will be able to display the top results, which is essentially a weighted word count displayed as a tornado chart. A visual representation of this can be found in Figure 1 below. The Top Results page with the weighted output can be displayed for all documents or just for individual documents.

Figure 1.

There is also a word relationships page which will display an individual document, with a color-coded representation of the words in the document. The colors correspond to a word’s part of speech. This is demonstrated in Figure 2.

Figure 2.

The last view of the output would be an in-depth textual representation of each word, its weight, its count, and its part of speech. A very basic implementation of this is shown in Figure 3.
This output view would be the most informative, but will likely be more difficult to read than the other two output views.

Figure 3.

The results of LASI will also need to be exportable for use as presentation materials, visual aids, and input to further analysis. Exporting the results into multiple file formats also provides a level of convenience for the user. LASI will be able to export results in PDF, XLS, and several image formats so that anyone using the product can make the results portable.

2.2. Major Functional Components

LASI will be efficient enough to run on the virtual machine provided by the university. However, for the real-world product, there will be some hardware requirements. It will require a quad-core or better Intel Core CPU and 8 GB or more of DDR3 SDRAM. It will also require the user to provide secondary storage space. The exact amount of storage required will be specified at a later time. The major functional components of the real-world product can be seen in Figure 4.

Figure 4.

There will also be several software components of LASI. The external tools LASI will use to assist with development and functionality include the SharpNLP part-of-speech tagger. This is an open-source C# port of the OpenNLP tools, which were developed in Java. Because SharpNLP is written in C#, it is easier to incorporate into LASI, which is also written in C#. LASI will also use WordNet, a large lexical database compiled by Princeton that groups words with their known synonyms and also records antonyms. It will be incredibly useful for binding synonyms together to improve the accuracy of results.
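As a rough illustration of how binding synonyms together can change a word count, the sketch below folds synonyms onto one representative word before counting. It is written in Python for brevity rather than in LASI's C#, and the small `SYNONYMS` table and the `synonym_aware_counts` function are hypothetical stand-ins for a WordNet lookup; they are not part of WordNet or of LASI.

```python
from collections import Counter

# Hypothetical hand-written synonym table (an assumption for this sketch);
# a real system would look synonyms up in WordNet instead.
SYNONYMS = {
    "aim": "goal",
    "objective": "goal",
    "firm": "company",
    "business": "company",
}

def synonym_aware_counts(words):
    """Count words after mapping each synonym to its representative word."""
    canonical = (SYNONYMS.get(w.lower(), w.lower()) for w in words)
    return Counter(canonical)

words = "The firm set a goal and the business met its objective".split()
counts = synonym_aware_counts(words)
print(counts.most_common(3))
```

With a plain word count, "firm" and "business" would each appear once; after binding, both contribute to a single count for "company", which is the effect the synonym binding is meant to have on the weights.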
LASI also takes advantage of a doc2x tool that converts Microsoft Word 97-2003 document files (.doc) into a more manageable format, Microsoft Word 2007 document files (.docx). This conversion is necessary because .docx files are actually a compressed format that contains an easily parsed XML file holding all the text of a document.

Several key data structures are incorporated into LASI. Each word and phrase will be stored in a C# List, which is essentially the equivalent of a vector in C++. Each word will be assigned a type and then added to a list at the initial parsing of a document. Words will be assigned a part of speech before being added to a list, and phrases will be assigned a location in a sentence, meaning that a phrase will be determined to be part of either the subject or the object of a sentence. These lists will be traversable word by word. Each individual word will also link to another list of associated words.

The underlying algorithm of LASI will consist of several key parts: the element binding process, the weighting process, and the high-level analysis process. An element is either a word or a phrase. There will be both direct binding and indirect binding of elements to other elements. Direct binding of word elements will consist of binding nouns and verbs together, adverbs to verbs, adjectives to nouns, and determiners to nouns. Direct binding of phrase elements will consist of binding phrases to subjects, phrases to objects, and then breaking phrases down and binding them to each word inside them. Indirect binding will consist of binding synonyms so that they count as a single noun, and binding pronouns to the noun they refer to.

The weighting process will be handled by two separate processes.
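Before turning to the two weighting passes, the word list and the direct binding rules described above can be pictured with a minimal sketch. It uses Python rather than LASI's C#, and the `Word` class, the simplified tag names, and the nearest-neighbor rule in `bind_direct` are assumptions made for illustration only; the source does not specify how LASI chooses which words to bind.

```python
from dataclasses import dataclass, field

@dataclass(eq=False)  # identity comparison; bound words reference each other
class Word:
    text: str
    pos: str  # simplified tag: "DET", "ADJ", "NOUN", "ADV", or "VERB"
    bound: list = field(default_factory=list)  # list of associated words

# Direct binding rules from the text: tag of a word -> tag of the word it binds to.
BINDS_TO = {"ADJ": "NOUN", "DET": "NOUN", "ADV": "VERB", "NOUN": "VERB"}

def bind_direct(sentence):
    """Bind each word to the nearest word carrying the tag it attaches to.

    The nearest-neighbor choice is an assumption of this sketch.
    """
    for i, word in enumerate(sentence):
        target_pos = BINDS_TO.get(word.pos)
        if target_pos is None:
            continue
        candidates = [(abs(i - j), j) for j, w in enumerate(sentence)
                      if w.pos == target_pos and j != i]
        if candidates:
            _, j = min(candidates)
            # Record the association on both words so either can be traversed.
            word.bound.append(sentence[j])
            sentence[j].bound.append(word)
    return sentence

sentence = [Word("the", "DET"), Word("small", "ADJ"), Word("team", "NOUN"),
            Word("quickly", "ADV"), Word("writes", "VERB"), Word("reports", "NOUN")]
bind_direct(sentence)
for w in sentence:
    print(w.text, "->", [b.text for b in w.bound])
```

Each `Word` carries its own list of associated words, mirroring the description of a traversable list in which every word links to another list. Counting the entries in `bound` is one simple way such associations could later feed a relationship-based weight.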
There will be a raw weighting system, which analyzes simple word frequency, word and phrase frequency with part of speech and location in a sentence considered, and synonym-aware word frequency. This will provide the foundation for LASI to modify weights via comparison with other words. The relative comparison will count the relationships between words and modify the weight of each word and phrase accordingly. It will measure the lexical distance between associated words in the document set. It will also produce a pronoun-aware word frequency to increase the word count of the associated noun. LASI will then perform a high-level analysis of each element’s weight and order the highest weighted words and phrases to form a list of coherent sentences. The algorithm will also determine the optimal overlap of weighting metrics to produce the most accurate results, and will then resolve conflicts between highly weighted words.

3. Identification of Case Study

Dr. Patrick Hester and Dr. Tom Meyers work for an organization called NCSOSE, which stands for the National Center for Systems of Systems Engineering. NCSOSE works with organizations and companies to improve workflow and optimize efficiency. They also generate and provide training materials to these organizations.

At NCSOSE, Dr. Hester and Dr. Meyers currently use a process known as the AID process to identify problem statements from groups of strategic documents. AID stands for Assessment Improvement Design. The Assessment phase of this process is what LASI will improve upon. This phase involves analyzing multiple strategic documents, in a range of domains in which the analyst may not be an expert, based on several criteria and the identification of key components. This requires extensive research that would otherwise be unnecessary, and it is an incredibly time-consuming and in-depth process. Dr. Hester and Dr. Meyers then take the analyzed data from these documents and formulate a concise and accurate problem statement. This problem statement is used to identify organizational issues and then to offer ways to improve the efficiency of the company that consults with Dr. Hester and Dr. Meyers. Figure 5 outlines the current AID process.

Figure 5.

There is currently a bottleneck at the Assessment part of the process that LASI would be able to assist with. LASI will eliminate much of the thorough manual analysis that Dr. Hester performs on the documentation provided and will output the same results to him quickly. All he will need to do is analyze the results of LASI and determine the most appropriate theme from them. Because LASI is objective, it will remove much of the guesswork and will give him a means of defending his findings. LASI will provide him with a group of themes in an objective, consistent, and timely manner. This will resolve the bottleneck and allow him to interact with his clients while the “Assessment” step in AID is being processed. Figure 6 demonstrates what LASI will contribute to this.

Figure 6.

4. Prototype Description

Because the algorithm is complex while both the user interface and the hierarchy of users are simple, the main difference between the real-world product and the prototype is that the algorithm needs to be scaled down. The algorithm’s output will be approximately the same. It will still accept multiple documents and search for common themes across them. It will still interact with the UI the same way. The results will still be displayed graphically on the results page. The difference is that the algorithm will be a bit less versatile, as it will make a few more assumptions about the documents through which it searches. The only two acceptable input types will be .doc and .docx files.
It will also be forced to limit the number of documents that it can analyze. The prototype will limit interpretable documents to 10 pages or fewer. Lastly, rather than focusing on mapping every part of speech to related parts of speech, LASI will focus on subjects, verbs, direct objects, and indirect objects.

The prototype’s limitation on the nature of the subject matter in the documents stems from the fact that NCSOSE deals exclusively with strategic documents. Being specific about the nature of the documents parsed by LASI will also make the results more accurate for NCSOSE. For instance, LASI can search for keywords like “Mission,” “Vision,” and “Goals” and then place increased emphasis on the content below those words to create a more accurate weighting algorithm. This also saves some work for the algorithm because it leads to fewer passes over the document.

The decision to limit the input types to .doc and .docx files is due to the fact that a suitable mechanism for parsing other file types that require optical character recognition would mean learning an entire API and would increase the likelihood of errors in parsing. Optical character recognition would create the possibility of LASI reading the same word or phrase differently depending on the input file type. A homogeneous input format resolves that. Also, .docx files are incredibly simple to parse, and .doc files can be converted easily to .docx.

Limiting the number of documents in the prototype will result in a simplified output. If the prototype allowed too many documents to be parsed, it would need some sort of algorithm to remove irrelevant subject/object/verb associations. Time constraints prevent this from being feasible, so the input pool must remain small; otherwise, the results will be confusing and difficult to read.
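The claim above that .docx files are simple to parse can be illustrated with a short sketch. A .docx file is a ZIP archive whose main text lives in the XML part `word/document.xml`; the code below, in Python rather than LASI's C#, pulls the paragraph text out using only the standard library. The function name `docx_paragraphs` and the tiny in-memory archive used for the demonstration are assumptions for illustration, and the archive holds only the one part this sketch reads, so it is not a complete Word file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_paragraphs(file):
    """Return the text of each non-empty paragraph in a .docx path or file object."""
    with zipfile.ZipFile(file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(W + "p"):
        # A paragraph's text is split across one or more <w:t> runs.
        text = "".join(t.text or "" for t in p.iter(W + "t"))
        if text:
            paragraphs.append(text)
    return paragraphs

# Demo input: an in-memory ZIP holding only the document part (not a full Word file).
DOCUMENT_XML = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:body>"
    "<w:p><w:r><w:t>Mission</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>Improve </w:t></w:r><w:r><w:t>workflow.</w:t></w:r></w:p>"
    "</w:body></w:document>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", DOCUMENT_XML)
buffer.seek(0)

paragraphs = docx_paragraphs(buffer)
print(paragraphs)
```

No optical character recognition is involved at any point, which is the property the prototype relies on: the same text is recovered every time from the same file.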
Another reason to limit the number of input documents is that this prototype will be a multi-threaded application. The more documents that are allowed as input, the more memory will need to be devoted to the tool.

Limiting the number of pages in a document accomplishes essentially the same thing as limiting the number of documents. The program will need to rely heavily on its weighting algorithm to determine the themes of these documents. Not assuming a fixed maximum length greatly increases the amount of required testing, and the 18 weeks available simply will not provide the time necessary to debug the prototype without limiting input.

Lastly, the decision to limit the prototype to identifying subjects, verbs, direct objects, and indirect objects was made because themes can still be derived from these, while producing fewer incorrect associations than going deeper and analyzing every single part of speech. If the weighting algorithm focuses on subject/object relationships, it can determine valid themes with an increased statistical likelihood of accuracy, and it will not need to go back through and remove false associations. This change will also remove the need to search for synonyms individually, as they can be determined not just by verb associations but by the associations with the other parts of the sentence.

4.1. Major Prototype Functional Components

Course requirements, as well as the need to limit the amount of testing, call for a homogeneous hardware platform. This means a number of changes to the hardware platform, the algorithm, and what the prototype will accept as input. The main reasons are to decrease the time required for testing and to decrease potential resource usage. The specific changes are outlined below.
One of the requirements for the semester is that the project run on the virtual machine provided to each group by the CS department. As a result, LASI will be developed and tested on this virtual machine. There may be hardware limitations on this virtual machine that prevent it from running the code optimally; that being said, the virtual machine has 8 GB of virtual RAM and a quad-core Intel Core CPU.

The software will differ from the real-world product in that the prototype will keep the graphical interface more or less the same but will be missing some of the underlying complexity of the algorithm. The GUI will still be able to save and load past analyses, select new documents, and display results, but it will not be able to add documents during analysis. The prototype will be able to convert DOC and DOCX files to a usable format, but will not be able to handle PPT, PPTX, and PDF files. The reason for this is that optical character recognition is incredibly difficult and not accurate enough for what we are trying to do. Also, PPT and PPTX files can contain a plethora of different content formats, most of which are not traversable by a tool that parses by sentence.

The algorithm will still contain part-of-speech tagging and a simplified weighting algorithm that focuses on subject-object relationships rather than the full set of relationships between individual words. It will still bind phrases to words and determine whether a phrase is a subject or an object of a sentence. It will also still bind pronouns and synonyms to nouns to increase accuracy.

4.2. Prototype Features and Capabilities

Because of the time constraint this semester, we will need to scale back our original product from a marketable real-world product to a prototype that is missing some functionality.
The real-world product would contain everything that was listed in the product description section of this paper. The prototype will need to make a few assumptions about the nature of the documents being analyzed, and it will also need to be a bit less robust. This will result in decreased accuracy, but it also provides more time to implement a solution that meets all of the requirements set by NCSOSE and can be created in the time frame that was set last semester.

Certain functionality must be eliminated from the prototype. It will need to limit the type of input documents to the DOC, DOCX, and TXT file formats. The file length must be limited as well; the exact limit will be determined when the prototype reaches the testing stage. The number of files that can be input will also need to be limited, both for the sake of testing the accuracy of the output generated by LASI and to decrease the resource usage of the program. The prototype will also not provide a visual representation of the logic of LASI, as that would require modifying the output format.

The LASI prototype will exclude scanned text recognition because incorporating and testing an optical character recognition tool would be more time consuming than it would be worth. Including it would also decrease the accuracy of the results. Certain optional refinement tools must also be removed from the prototype, including user-added items like dictionaries and keywords, as well as content assumptions, because implementing such features would require months of testing. There is also a reduction in the number of times that LASI will search through documents, in an effort to improve load time and decrease testing. Figure 7 is a visual representation of the differences between the real-world solution and our prototype.
Figure 7.

Glossary of Terms

Theme: Subject-object-verb relationships that LASI is attempting to generate from the input set.
LASI: Linguistic Analysis for Subject Identification.
Parser: Takes in DOC and DOCX files and converts them to TXT files.
WordNet: The thesaurus database, compiled by Princeton, that LASI uses to look up synonyms and antonyms.
Phrase: A group of words standing together as a conceptual unit, typically forming a new component.
Analysis: Detailed examination of the elements or structure of something, typically as a basis for interpretation.
Linguistic Analysis: The process of gathering information about a document’s content from the language of that document.
Tag: A label, or the act of attaching a label, that specifies the role (such as part of speech or location) of a selected element in a document.
Document: A document herein refers to a formally written, expository paper which expounds, via a declarative approach, on a relatively quantifiable issue, goal, or area of research.
Word Weight: A numeric value, associated with each syntactically and lexically unique word in a written work, which indicates the relative significance of that word.
Tornado chart: A horizontal, bar-graph-like visualization representing the relative frequency or significance of elements, sorted in descending order by magnitude.
Head word: The locally distinct word within a phrase which, by its syntactic associations, determines the syntactic category of the phrase itself.
Word Binding: The process of binding a part of speech to a word.
SharpNLP: C# natural language processing tool used to parse text and tag parts of speech.
Tagged Word Object: A word that has an associated part of speech.
Optical Character Recognition: Conversion of scanned images to text.
Tagged Set: A group of words whose part of speech and location in a sentence have been identified by our parser.
Lexer: A piece of our parsing tool that isolates each word, its part of speech, and its location in a sentence into machine-readable tokens. These are stored as elements in an XML file.
Syntactic Analysis: A form of linguistic analysis that focuses on grammar in sentences and identifies themes based on sentence structure and formatting. Unlike semantic analysis, it identifies key words based on their location in the sentence, rather than their overall meaning throughout the document.
Subject Identification: In the narrow sense, the process of identifying the main actor in a sentence. In a broader sense, the word subject is synonymous with the theme of a document, so subject identification is the process of determining the subjects, or themes, of a document or documents.
Part of Speech Tagger: Software utility that associates words with their parts of speech (e.g. noun, verb) in a sentence.
Semantic Analysis: Relating the syntactical structure of words to their language-independent meanings.
A.I.D. Process: Assessment Improvement Design. A process that provides a quantitative and qualitative basis for identifying problems and determining the feasibility of solutions.
Strategic Document: Document produced by a client that defines the client’s goals, visions, and missions.