Running Head: Lab 1 – LASI Product Description

LASI Product Description
LASI – Red Team
Old Dominion University
Author: Scott Minter
Last Modified: March 18, 2013
Version 1.1

Table of Contents
1. Introduction
2. Product Description
2.1 Key Product Features and Capabilities
2.2 Major Functional Components (Hardware and Software)
3. Identification of Case Study
4. Prototype Description
4.1 Major Functional Components (Hardware/Software)
4.2 Features and Capabilities
4.3 Prototype Development Challenges
Glossary
References

List of Figures
Figure 1. Top Results Tab
Figure 2. Word Relationships Tab
Figure 3. Word Count and Weighting Tab
Figure 4. Current AID Process
Figure 5. AID Process with LASI
Figure 6. Real-World vs. Prototype
Figure 7. Prototype Major Functional Components

1. Introduction

Linguistic Analysis for Subject Identification (LASI) is the name of the Red Group’s project, which is being developed as a requirement of the Professional Workforce Development I & II courses at Old Dominion University. Linguistic analysis, in the scope of LASI, is the contextual study of written works and of how words combine to form an overall meaning. LASI will be a decision support tool that assists users in determining common themes across multiple documents. The themes that LASI produces will be subject-object-verb relationships. Themes are important because they help the reader comprehend what has just been read.
Once the reader has comprehended the material, the reader can summarize it. Comprehension and summarization are important because they help the reader communicate the content of the material to other people.

The process of finding common themes across multiple documents can be lengthy and repetitive because of the depth of understanding needed to identify a theme that runs across all the documents, which may not be the theme of any individual document. It is therefore difficult for people to identify a common theme over a large set of documents in a timely, consistent, and objective manner. LASI will assist in this area by providing a weighted list of potential themes from which users can choose the best fit for their understanding of the material. For LASI to effectively address this societal problem, it will need to find themes accurately, use system resources efficiently, and provide consistent results.

2. Product Description

LASI will be a self-contained, stand-alone piece of software. It will not require a connection to the Internet to produce accurate results. LASI will be designed to run on a consumer-level laptop or desktop. LASI will also be designed to serve as an open source back-end engine for other projects. The data collected from the analysis that LASI performs can be used to drive other projects and their respective GUIs, completely bypassing the default GUI.

LASI’s ability to analyze multiple documents for common themes makes it a decision support tool that is useful to anyone who has to read over large sets of documents looking for commonality. Students could use it to verify the usefulness of scientific publications to the topic they are researching. Teachers could use it for an initial analysis of student research papers, verifying that each paper correctly addresses the topic.
Similar to students, research analysts could use LASI to verify whether different papers and articles address the specific area they are researching.

2.1 Key Product Features and Capabilities

LASI’s ability to find themes is based on three sub-routines. The first is a Part-Of-Speech (POS) tagging system that will return the input document(s) with all Words and Word Phrases tagged with their corresponding POS. The second is a word association algorithm that will associate Words based on their POS and their proximity to one another. Finally, a weight is applied to each Word or Word Phrase based on its POS and its association with other words and their POS tags.

LASI will accept DOC, DOCX, and TXT files as input. LASI will allow a user to input any known “problem” words, such as organization-specific jargon or slang. LASI will also allow a user to input any desired assumptions, such as synonyms and acronyms. The user will be able to specify word equivalency, allowing LASI to better analyze the document in the context the user desires.

One of the important aspects of LASI is the user experience. The results will be output into three tabs: Top Results, Word Relationships, and Word Count and Weighting. The user will also be able to export the results to PDF format. Figure 1 shows a prototype of the Top Results tab, which displays the likeliest themes based on the analysis.

Figure 1. Top Results Tab

Figure 2 is a prototype of the Word Relationships tab. It also shows that the user will be able to see these results for all the documents together and for individual documents. The colors will correspond to each word’s POS. The search box will allow the user to search for specific words and have the matching words highlighted.

Figure 2. Word Relationships Tab

Figure 3 is a prototype of the Word Count and Weighting tab. It will display the count of each word in the set of documents along with each word’s weight as assigned by the weighting algorithm.

Figure 3. Word Count and Weighting Tab

2.2 Major Functional Components (Hardware and Software)

LASI will be able to run on a laptop or a desktop, provided the machine has a multi-core processor and four to eight gigabytes of RAM. The third-party software components for LASI are the SharpNLP Part-Of-Speech Tagger, WordNet Thesaurus Data, and Document Converters. The SharpNLP Part-Of-Speech Tagger handles the tagging of words and word phrases with their corresponding POS. WordNet Thesaurus Data allows LASI to recognize synonyms. The Document Converters convert DOC and DOCX files to TXT files.

LASI’s analytical capabilities are enabled by a combination of data structures and algorithms. The key data types fall into two categories: specialized word and phrase constructs, and structures that allow a document to be traversed as a collection of those constructs.

Specialized word and phrase constructs are assigned based on their tagged POS. Once initialized, an instance of a Word Construct can be displayed, sorted based on its POS (e.g., Noun, Verb), and have other constructs assigned to it based on known syntactic and/or semantic relationships. Phrase Constructs will handle the phrase tags generated by the SharpNLP POS tagging system. A phrase is a recognized group of words that has a tagged POS (e.g., NounPhrase, VerbPhrase). Like Word Constructs, Phrase Constructs can be displayed, sorted based on their POS, and have other constructs assigned to them based on known syntactic and/or semantic relationships.
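As a rough illustration of the Word and Phrase Constructs described above, the following Python sketch models constructs that carry a tagged POS, can be sorted by POS, and can have other constructs assigned to them. The class names, fields, and the bind and sort_by_pos helpers are hypothetical and chosen only for illustration; LASI itself builds on SharpNLP, and its actual classes may differ.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Word:
    """Hypothetical stand-in for a Word Construct: one word plus its tagged POS."""
    text: str
    pos: str  # e.g. "Noun", "Verb"
    # Constructs this word refers to; left out of repr/eq to avoid cycles.
    references: List["Construct"] = field(
        default_factory=list, repr=False, compare=False
    )

    def bind(self, other: "Construct") -> None:
        """Assign another construct to this one (a syntactic or semantic reference)."""
        self.references.append(other)


@dataclass
class Phrase:
    """Hypothetical stand-in for a Phrase Construct: a tagged group of words."""
    words: List[Word]
    pos: str  # e.g. "NounPhrase", "VerbPhrase"
    references: List["Construct"] = field(
        default_factory=list, repr=False, compare=False
    )

    @property
    def text(self) -> str:
        return " ".join(word.text for word in self.words)

    def bind(self, other: "Construct") -> None:
        self.references.append(other)


Construct = Union[Word, Phrase]


def sort_by_pos(constructs: List[Construct]) -> List[Construct]:
    """Sort constructs by POS tag, then alphabetically by their text."""
    return sorted(constructs, key=lambda c: (c.pos, c.text.lower()))


# Example sentence: "the dog barks"
dog = Word("dog", "Noun")
barks = Word("barks", "Verb")
the_dog = Phrase([Word("the", "Determiner"), dog], "NounPhrase")
barks.bind(the_dog)  # the verb refers to its subject phrase

for construct in sort_by_pos([barks, the_dog, dog]):
    print(f"{construct.pos}: {construct.text}")
```

Giving Word and Phrase the same small interface (pos, text, references, bind) is one way to let a single collection hold both kinds of construct, which is what sorting and reference assignment across words and phrases would require.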
Treating a document as a traversable collection allows it to be moved through using three different methods: Word-wise, Reference-wise, and Web-wise. In a Word-wise traversal, the document is broken up into individual words, with each word being an instance of the Word class. In a Reference-wise traversal, the document is broken up using the reference methods of the Word and Phrase classes, which allows the document to be viewed in terms of which words and/or phrases reference each other. A Web-wise traversal treats the document as a web in which the nodes are the words and/or phrases and the references are the connections between nodes.

The algorithms used by LASI break down into three categories: Element-binding, Weighting, and Conflict Resolution.

Element-binding binds words and phrases together based on each instance’s POS to create the references described above for Word and Phrase Constructs. Element-binding will consist of Direct Binding and Indirect Binding. Direct Binding will create subject-verb, verb-subject, adverb-verb, adjective-noun, and determiner-noun references, whereas Indirect Binding will create pronoun-noun references.

The Weighting algorithm gives a numeric value to each Word and Phrase instance, which serves as the instance’s weight when its importance is considered. The algorithm looks at both raw and relative data. For raw data, it looks at the word instance count, the word instance POS count, and the synonym count. The word instance count tallies the number of times a word occurs. The word instance POS count tallies the occurrences of a word that share the same POS tag. Finally, the synonym count raises the counts of both synonymous words for any recognized synonyms. For relative data, it looks at the Subject-Object-Verb reference count and lexical distance.
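The three raw counts described above could be tallied roughly as in the following Python sketch. The raw_counts function, the (word, POS) tuple input, and the synonym-group format are assumptions made for illustration and are not LASI’s actual interfaces. In particular, treating the synonym count of a word as the combined occurrences of all its recognized synonyms is only one possible reading of the description.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

TaggedWord = Tuple[str, str]  # (word, POS tag), as a tagger might emit


def raw_counts(tagged: List[TaggedWord], synonym_groups: Iterable[Set[str]]):
    """Return (word instance count, word instance POS count, synonym count)."""
    # Word instance count: how many times each word occurs.
    word_count = Counter(word.lower() for word, _ in tagged)

    # Word instance POS count: occurrences of a word with the same POS tag.
    word_pos_count = Counter((word.lower(), pos) for word, pos in tagged)

    # Synonym count (assumed reading): every word in a synonym group that
    # appears in the text is credited with the group's combined occurrences.
    synonym_count = Counter(word_count)
    for group in synonym_groups:
        members = {w.lower() for w in group if word_count[w.lower()] > 0}
        total = sum(word_count[w] for w in members)
        for w in members:
            synonym_count[w] = total

    return word_count, word_pos_count, synonym_count


tagged = [
    ("car", "Noun"),
    ("automobile", "Noun"),
    ("car", "Noun"),
    ("drive", "Verb"),
    ("drive", "Noun"),
]
word_count, word_pos_count, synonym_count = raw_counts(
    tagged, [{"car", "automobile"}]
)

print(word_count["car"])                  # 2
print(word_pos_count[("drive", "Verb")])  # 1
print(synonym_count["automobile"])        # 3
```

Keeping the three tallies separate, rather than folding them into one number immediately, leaves the choice of how to combine them into a final weight to a later step.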
Once Word and Phrase references have been made at a level deep enough to establish Subject-Object-Verb (SOV) references, a count is made of the number of times each SOV instance occurs. Weight is also based on lexical distance: the physical proximity of a Word or Phrase to the reference instance helps determine the weight assigned.

Conflict Resolution will be important to ensuring that LASI can complete its analysis successfully. A document may contain any number of unaccounted-for items, such as incorrect grammar and unrecognized characters. Conflict Resolution will recognize these items and try to address them; if it cannot, it will throw the proper exception.

3. Identification of Case Study

Dr. Hester and Dr. Meyers work for an organization housed on the Old Dominion University campus called the National Centers for System of Systems Engineering (NCSOSE). NCSOSE analyzes organizations and their respective documents in order to help them recognize and address internal problems. The current process utilized at NCSOSE is called the Assessment Improvement Design (AID) process. In the AID process, a company comes to NCSOSE for evaluation, at which point NCSOSE gathers organization-specific documents for analysis. In this analysis phase, Drs. Hester and Meyers read over the documents multiple times in order to find common themes specific to the structure and function of the organization. Finally, Dr. Hester and Dr. Meyers return to the organization with their findings based on the analysis (Fig. 4).

Figure 4. Current AID Process

It is during the document analysis phase that LASI will be utilized. Inserting LASI into the AID process will cut down on both time and inconsistency. LASI will reduce the time spent rereading the documents and give NCSOSE logical grounding for the findings it returns to organizations (Fig. 5).

Figure 5. AID Process with LASI

4. Prototype Description

A full Real-World Solution for LASI would be highly difficult to develop in the time allotted, so a prototype needs to be created that narrows the scope while still demonstrating LASI’s capabilities. Figure 6 shows what the prototype will do in comparison to the Real-World Solution.

Figure 6. Real-World vs. Prototype

4.1 Major Functional Components (Hardware/Software)

The major functional components of the prototype are very similar to those of the Real-World Solution. The hardware required to run the LASI prototype will be a laptop or desktop with four to eight gigabytes of RAM and a multi-core processor. The software needed will be the third-party software that tags parts of speech and converts DOC and DOCX files to TXT files, the LASI data structures and algorithms, and the LASI GUI (Fig. 7). For in-class development, a Virtual Machine is also being utilized as a testing, demonstration, and code-writing environment.

Figure 7. Prototype Major Functional Components

The third-party software is the SharpNLP Part-Of-Speech Tagger and the B2XTranslator. The SharpNLP POS Tagger tags words and word phrases with their respective POS so that the LASI algorithms can use them. The B2XTranslator converts DOC files to DOCX files. This is done because DOCX files contain an XML file that can easily be converted to a TXT file.

The LASI data structures and algorithms needed for the prototype are reference binding and weight assigning. In the prototype, reference binding works the same as in the Real-World Solution. References are made between Words and Word Phrases based on their tagged POS and how they relate to one another within the sentence, paragraph, and document structure. The weight assigning algorithm will assign weight based on tagged POS, word instance count, and reference count.
The reference count is the number of times other Words and Word Phrases refer to a given Word or Word Phrase.

4.2 Features and Capabilities

Compared with the Real-World Solution, the prototype will be limited in its capabilities because of the time constraints of the class. One limitation is that it will allow only five documents to be loaded into a LASI project. It will also accept only DOC, DOCX, and TXT files as input. The Real-World Solution would need some kind of scanned text recognition in order to correctly convert PDF files to TXT files. The prototype, however, will not accept PDF files and therefore will not have scanned text recognition capabilities.

Some of the identified risks with LASI are trust in the output, post-semester maintenance, individual PC system limitations, and illegal character handling. These risks are being mitigated through various means. Trust in the output will be handled by the various views and tabs in the results GUI: by showing the user much of LASI’s accumulated data, the results can be verified. Maintenance of LASI will be performed by the open source community and possibly by future CS410 and CS411 groups. Crashes due to system limitations will be avoided by multithreading the LASI algorithms and making sure the program runs as efficiently as possible. At some point, LASI will encounter unrecognized characters in a document. When this happens, LASI will attempt to recognize these characters based on their syntax in the document. If they remain unrecognizable, an exception will be thrown and the character will be ignored.

4.3 Prototype Development Challenges

Some of the challenges facing the development of the LASI prototype are the ability to correctly use all the data generated and to correctly identify themes.
Identifying POS, creating Word and Word Phrase references, and assigning weight are all constructs that assist and enable LASI in correctly identifying themes. These challenges will be mitigated by creating an algorithm that uses this information to identify themes in a timely, consistent, and objective manner.

Glossary of Terms

Theme - Subject-object-verb relationships that LASI is attempting to generate from the input set.
LASI - Linguistic Analysis for Subject Identification.
Parser - Takes in DOC and DOCX files and converts them to TXT files.
WordNet - Compilers and providers of our thesaurus.
Phrase - A group of words standing together as a conceptual unit, typically forming a new component.
Analysis - Detailed examination of the elements or structure of something, typically as a basis for interpretation.
Linguistic Analysis - The scientific analysis of a language.
Tag - A label, or the act of attaching a label, that specifies the role (such as part of speech or location) of a selected element in a document.
Document - A document herein refers to a formally written, expository paper which expounds, via a declarative approach, on a relatively quantifiable issue, goal, or area of research.
Word Weight - A numeric value, associated with each syntactically and lexically unique word in a written work, which indicates the relative significance of that word.
Tornado chart - A horizontal bar graph-like visualization, representing the relative frequency or significance of elements, sorted in descending order by magnitude.
Head word - The locally distinct word within a phrase which, by its syntactic associations, determines the syntactic category of the phrase itself.
Word Binding - The process of binding part-of-speech to a word.
SharpNLP - C# natural language processing tool used to parse and tag part-of-speech.
Tagged Word Object - A word that has an associated part-of-speech.
Optical Character Recognition - Conversion of scanned images to text.
Tagged Set - A group of words whose part of speech and location in a sentence have been identified by our parser.
Lexer - A piece of our parsing tool that isolates each word, its part of speech, and its location in a sentence into machine-readable tokens. These are stored as elements in an XML file.
Syntactic Analysis - A form of linguistic analysis that focuses on grammar in sentences and identifies themes based on sentence structure and formatting. Unlike Semantic Analysis, it identifies key words based on their location in the sentence, rather than their overall meaning throughout the document.
Subject Identification - Identifies the main actor in a sentence. In a broader sense, however, the word subject is synonymous with the theme of a document. Subject identification is the process of determining the subjects, or themes, of a document or documents.
Part of Speech Tagger - Software utility that associates words with their parts of speech (e.g., Noun, Verb) in a sentence.
Semantic Analysis - Relating the syntactical structure of words to their language-independent meanings.
A.I.D. Process - Assessment Improvement Design: a process that provides a quantitative and qualitative basis to identify problems and determine the feasibility of solutions.
Strategic Document - Document produced by a client that defines their Goals, Visions, and Missions.
Word (denoted by capital W) - An instance of LASI’s Word class.
Word Phrase (denoted by capital W and capital P) - An instance of LASI’s Word Phrase class.

References

Hester, P. T., & Meyers, T. (2012). Enterprise AID: A performance measurement system for enterprise assessment, improvement, and design (NCSOSE-TR-12-001). Norfolk, VA: National Centers for System of Systems Engineering.