Lab 1 – LASI Description
Running head: LAB 1 – LASI DESCRIPTION

Lab 1 – LASI Product Description
Brittany Johnson
CS411
Janet Brunelle
March 18, 2013
Version 2

Table of Contents
1 INTRODUCTION
2 PRODUCT DESCRIPTION
2.1 Key Product Features and Capabilities
2.2 Major Components (Hardware/Software)
3 IDENTIFICATION OF CASE STUDY
4 PRODUCT PROTOTYPE DESCRIPTION
4.1 Prototype Architecture (Hardware/Software)
4.2 Prototype Features and Capabilities
GLOSSARY
REFERENCES

List of Figures
Figure 1. Top Results Output
Figure 2. Word Relationships Output
Figure 3. Word Count and Weighting Output
Figure 4. AID Process: Assessment
Figure 5. Prototype Hardware and Software Component Diagram

List of Tables
Table 1. Feature comparison between prototype and real world product

Lab 1 – LASI Product Description

1 INTRODUCTION

LASI stands for Linguistic Analysis for Subject Identification. It is a stand-alone theme-finding application conceived by the Old Dominion University CS410 Red Group. It is designed to be a decision support tool for large, multi-document linguistic analysis that allows for more accurate and consistent results. Linguistic analysis, with respect to the current project, is the contextual study of written works and how their words combine to form an overall meaning. Themes are the subject-object-verb relationships that LASI attempts to generate from the input set; they are important because they help the reader comprehend and summarize what has been read. Identifying a theme becomes more difficult as the number of documents increases, because the theme across all of the documents may not be the theme of each individual document. The complexity of a topic and the reader’s familiarity with it also play an important role in the reader’s comprehension. This comprehension, along with the ability to summarize the material, is important for communicating the content of a document. Thus, it is often difficult for people to identify a common theme over a large set of documents in a timely, consistent, and objective manner. LASI will help the reader come to an informed conclusion by providing a weighted list of potential themes.

2 PRODUCT DESCRIPTION

LASI will be an open-source, stand-alone piece of software designed to run on a consumer-grade laptop. LASI will be able to detect themes across many documents and can provide both individual and cross-document analysis to determine a single theme. LASI’s ability to analyze multiple documents to find a common theme makes it a useful decision support tool for teachers, students, research analysts, and others who need to read through large sets of documents on a frequent basis. Teachers, for example, would be able to use LASI for an initial analysis of student papers to check whether each paper is consistent with its topic. Both students and research analysts could use LASI to quickly assess the usefulness of scientific and literary publications for the topic that they are researching.

2.1 Key Product Features and Capabilities

Through the use of Optical Character Recognition (OCR) and a parser that is integrated into LASI, the user can create a project from multiple file types, including DOC, DOCX, PPT, PPTX, TXT, and PDF. LASI reveals a common theme by finding the commonalities between the documents using their parts of speech and statistical analysis. Once a project is created, the files can be viewed in plaintext form in the LASI user interface. Documents can be added once the project has been created, as well as after the existing documents have already been analyzed. If the project has already been compiled, the new documents will be analyzed and then added to the overall results. While the project is being created, the user also has the option to add a dictionary of company-specific jargon as well as assumptions about the content. This will help LASI tailor its analysis to the content and increase the statistical likelihood of determining a theme.

Figure 1. Top Results Output

Once the documents have been analyzed, the results can be viewed in three formats: Top Results, Word Relationships, and Word Count and Weighting. The top results will be represented graphically based on the user’s preferred chart type. The available chart types include tornado charts, bar graphs, and pie charts. Figure 1 shows a tornado chart of the 10 most likely themes across all of the documents, listed in descending order of word weight. Each document may also be viewed individually, with its data represented in the same way.

Figure 2. Word Relationships Output

The word relationships, as shown in Figure 2, will also be displayed for each document. Each word is colored according to its part of speech. This allows the user to see the relationships between all of the words in a document. The links between the words are an important visual aid for helping the user understand the importance of individual words.

Figure 3. Word Count and Weighting Output

Results will also be displayed by individual word count and weight. The weight displayed is produced by the weighting algorithm. In Figure 3, this is shown as a simple table that can be sorted by word, frequency, or weight. This shows how each document affected the total results and the importance of individual words. Once the project has been analyzed, the results can be either printed or exported as PDF, JPG, or PNG.

2.2 Major Components (Hardware/Software)

LASI has a few hardware requirements for the product to run at an optimal level. It should preferably be run on a high-end, business-grade computer with at least 8 GB of DDR3 SDRAM and a quad-core CPU.
It also requires that the user provide secondary storage space for documentation.

The first software component of LASI is the graphical user interface (GUI). This application can be run locally on the user’s machine. It is a Windows Presentation Foundation (WPF) project that uses XAML to define the structure of the views and C# to provide the interactivity.

The second software component is the file system. It manages converting files and invoking the tagger. After a text file is tagged, it is passed to a tagged-file parser, which converts the text into the word and phrase types that represent the elements of the document at run time.

B2XTranslator is third-party, open-source software used to convert file types. When documents are added to a project in the GUI, it takes a DOCX file and converts it to an XML file. Once the document is in XML, it can be converted again into a form usable by the parts-of-speech tagger.

The parts-of-speech tagger being used is SharpNLP, an open-source C# natural language processing tool. SharpNLP uses the Penn Treebank parts-of-speech tags to define the parts of speech. SharpNLP will assign each word a type and place groups of words into phrase types before writing them back to a file. Once a document has been tagged, each word is assigned a word type that corresponds to the part of speech given by the tagger. Phrase types are groups of words that have been put together. Each phrase type contains a list of words and the attributes that the syntactic phrase type represents.

The fourth software component is the LASI algorithm, which is written in C#. It ties word and phrase types together based on their syntactic relationships via a state-machine-derived logic flow. The document can be traversed in multiple ways: word-wise, reference-wise, and tree-wise. When moving through the document in a word-wise manner, the document is broken up by individual words.
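
The word and phrase types described above, together with word-wise traversal over them, might be sketched as follows. This is a minimal Python illustration under stated assumptions, not LASI’s C# implementation: the class names Word, Phrase, and Document and the sample sentence are hypothetical, while the tags (DT, NN, VBZ, NNS) and phrase labels (NP, VP) are taken from the Penn Treebank tag set.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Word:
    """A single tagged word (hypothetical type, for illustration only)."""
    text: str
    tag: str  # Penn Treebank part-of-speech tag, e.g. "NN" or "VBZ"


@dataclass
class Phrase:
    """A group of words that the tagger has placed together."""
    kind: str  # Penn Treebank phrase label, e.g. "NP" or "VP"
    words: List[Word] = field(default_factory=list)


@dataclass
class Document:
    """A tagged document held as an ordered list of phrases."""
    phrases: List[Phrase]

    def words(self) -> Iterator[Word]:
        """Word-wise traversal: yield every word in reading order."""
        for phrase in self.phrases:
            yield from phrase.words


# Hypothetical tagged sentence: "The analyst reads reports"
doc = Document([
    Phrase("NP", [Word("The", "DT"), Word("analyst", "NN")]),
    Phrase("VP", [Word("reads", "VBZ")]),
    Phrase("NP", [Word("reports", "NNS")]),
])

for word in doc.words():
    print(word.text, word.tag)
```

Keeping the phrases as the stored structure and deriving the flat word sequence from them means the same data could also back other traversal orders without being copied.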
When moving through the document in a reference-wise manner, the document is viewed based on the words and phrases that reference each other. Moving in a tree-wise manner follows a specific word to the words it references, and so on. The algorithm focuses on the direct and indirect binding of words and phrases. Direct binding includes the binding of nouns to verbs, adverbs to verbs, adjectives to nouns, and determiners to nouns. Indirect binding will include the binding of pronouns to nouns. Once the word and phrase binding is finished, the algorithm will begin weighting the words based on their frequency as well as how they are used. The weighting metrics for each word will be based on a raw frequency and a relative frequency. The raw frequency is based on a simple word count, the number of times that the word was used in a particular manner, and a frequency count for synonyms of that word. The relative frequency will be based on the subject, verb, and object relationships between words as well as where a word is located in a document. The more bindings that are made, the more accurate the results are.

3 IDENTIFICATION OF CASE STUDY

Dr. Patrick Hester and Dr. Tom Meyers work for the National Center for System of Systems Engineering (NCSOSE), consulting with organizations and businesses that need an outside view on issues or on future plans for improvement. When consulting with a client, they use the Assessment Improvement Design (AID) methodology to help the client both realize and achieve its goals. The focus is on evaluating current performance with respect to client intent, enhancing performance based on evaluations of current operations, and weighing procedures against alternatives. Using this, they create a new method for improvement that is aligned with their client’s intent.

Figure 4. AID Process: Assessment

In following this process, both Dr. Hester and Dr. Meyers must familiarize themselves with their client’s domain. Essentially, they must become experts in the inner workings of their client’s organization and its field of work. The difficulty of this task depends on whether the client provides usable documentation. LASI will assist in the process of defining what the potential problem is and whether it coincides with what the client believes the issue to be. In Figure 4, LASI would fit into the Document Analysis portion of the Assessment phase. The results that LASI produces can be used to verify Dr. Hester and Dr. Meyers’s assessment of the situation and serve as visual proof of their reasoning for the client.

4 PRODUCT PROTOTYPE DESCRIPTION

Due to time constraints, the LASI prototype has much less functionality than the real-world product. LASI will still function in the same way, but in a less complex and less process-intensive manner. A prototype needs to be developed in order to narrow the scope while still having a product that can demonstrate its capabilities.

4.1 Prototype Architecture (Hardware/Software)

The hardware and software components for the prototype will remain largely unchanged from the real-world solution discussed in Sections 2 and 2.2. Figure 5 shows the hardware and software components of the LASI prototype. The hardware required to run the LASI prototype is a laptop or desktop with at least 8 GB of DDR3 RAM and a quad-core CPU. For development purposes, we will be using a virtual machine as a testing and code-writing environment. The software needed for the prototype includes our part-of-speech tagger and a document converter that converts DOC and DOCX files to TXT files. Other software includes the LASI algorithms and the LASI GUI.

Figure 5. Prototype Hardware and Software Component Diagram

The third-party software for the LASI prototype is the SharpNLP part-of-speech (POS) tagger and the B2XTranslator. The SharpNLP POS tagger tags words and phrases with their respective parts of speech for use by the LASI algorithm. The B2XTranslator converts DOC files to DOCX files. The files can then be converted to TXT files.

In the LASI prototype, word and phrase binding works the same as it would in the real-world solution. Words and phrases are interrelated based on their tagged parts of speech and how they relate to one another within phrases, paragraphs, and the document. The weighting algorithm will assign each word a weight based on its part of speech, its frequency count, and the number of times and ways it is referenced.

4.2 Prototype Features and Capabilities

Table 1. Feature comparison between prototype and real world product

As shown in Table 1, there are a few key differences between the real-world product and our prototype. The types of documents that the LASI prototype accepts have been limited to just DOC and DOCX. Scanned text recognition has been removed from the prototype since there is not enough time to get the OCR software fully functioning. The prototype will limit the number of documents that can be added to one project to between three and five, and each document is limited to 10 pages to ensure that the algorithm can function in a timely manner. Rather than focusing on every part of speech, the LASI prototype will focus on noun-verb binding. A few of the more complex features also did not make it into the prototype, such as user-defined dictionaries, synonym identification, and content assumptions. Despite the removal of these nonessential features, the prototype will still function very similarly to how the real-world product would.

GLOSSARY

A.I.D.: Assessment Improvement Design
A.I.D. Process: A process that provides a quantitative and qualitative basis to identify problems and determine the feasibility of solutions.
Analysis: Detailed examination of the elements or structure of something, typically as a basis for interpretation.
Document: A document herein refers to a formally written, expository paper which expounds, via a declarative approach, on a relatively quantifiable issue, goal, or area of research.
Head word: A locally distinct word within a phrase which, by its syntactic associations, determines the category of the phrase itself.
LASI: Linguistic Analysis for Subject Identification
Lexer: Part of the parsing tool that isolates each word, its part of speech, and its location in a sentence into machine-readable tokens. These are stored as elements in an XML file.
Linguistic Analysis: The scientific analysis of a language.
Optical Character Recognition: Conversion of scanned images to text.
Parser: Takes in DOC and DOCX files and converts them to TXT files.
Part of Speech Tagger: Software utility that associates words with their parts of speech in a sentence.
Phrase: A group of words standing together as a conceptual unit, typically forming a new component.
Semantic Analysis: Relating the syntactical structure of words to their language-independent meanings.
SharpNLP: A natural language processing tool, written in C#, used to parse and tag parts of speech.
Strategic Document: Document produced by a client that defines their goals, visions, and missions.
Subject Identification: Finds the main actor in a sentence. However, in a broader sense, the word subject is synonymous with the themes of one or more documents. Subject identification is the process of determining the subjects, or themes, of a document or documents.
Syntactic Analysis: A form of linguistic analysis that focuses on grammar in sentences and identifies themes based on structure and formatting.
Unlike semantic analysis, it identifies key words based on their location in the sentence, rather than their overall meaning throughout the document.
Theme: Subject-object-verb relationships that LASI is attempting to generate from the input set.
Tag: A label, or the act of attaching a label, that specifies the role (such as part of speech or location) of a selected element in a document.
Tagged Set: A group of words whose part of speech and location in a sentence have been identified by the parser.
Tagged Word Object: A word that has an associated part-of-speech.
Tornado chart: A horizontal, bar-graph-like visualization representing the relative frequency or significance of elements, sorted in descending order by magnitude.
Word Binding: The process of binding part-of-speech to a word.
WordNet: Compiler and provider of our thesaurus.
Word Weight: A numeric value, associated with each syntactically and lexically unique word in a written work, indicating its significance.

REFERENCES

SharpNLP. (n.d.). Retrieved from http://sharpnlp.codeplex.com/
Office binary to open XML. (n.d.). Retrieved from http://b2xtranslator.sourceforge.net/