Running Head: Lab 2 – LASI Prototype Product Specification

Lab 2 – Prototype Product Specification For LASI
Red Team
Scott Minter
April 2, 2013
Version 1

Table Of Contents

1 Introduction
  1.1 Purpose
  1.2 Scope
  1.3 Definitions, Acronyms and Abbreviations
  1.4 References
  1.5 Overview
2 General Description
  2.1 Prototype Architecture Description
  2.2 Prototype Functional Description
  2.3 External Interfaces
3 Specific Requirements (See Group Printout)

List of Figures

Example 1: Top Results Screen
Example 2: Word Relationships
Example 3: Word Count and Weighting
Figure 1: Prototype Major Functional Components

1 Introduction

Linguistic Analysis for Subject Identification (LASI) is the name of the Red Group’s project. LASI is being developed as a requirement in the Professional Workforce Development I & II courses at Old Dominion University. Linguistic analysis, in the scope of LASI, is the contextual study of written works and of how their words combine to form an overall meaning. LASI will be a decision support tool that assists users in determining common themes across multiple documents. The themes that LASI produces will be subject-object-verb relationships. Themes are important because they help the reader comprehend what has just been read. A reader who has comprehended the material can then summarize it. Comprehension and summarization are important because they help the reader communicate the content of the material to other people.
The process of finding common themes across multiple documents may be lengthy and repetitive. This is due to the depth of understanding needed to identify a theme that spans all the documents, which may not be the theme of any individual document. It is therefore difficult for people to identify a common theme over a large set of documents in a timely, consistent, and objective manner. LASI will assist in this area by providing a weighted list of potential themes from which users can choose the best fit for their understanding of the material. To resolve this societal problem effectively, LASI will need to find themes accurately, use system resources efficiently, and provide consistent results.

1.1 Purpose

LASI will be a self-contained, stand-alone piece of software. It will not require a connection to the Internet to produce accurate results. LASI will be designed to run on a consumer-level laptop or desktop. LASI will also be designed to serve as an open-source back-end engine for other projects. The data collected from the analysis performed by LASI can be used to drive other projects and their respective GUIs.

1.2 Scope

LASI’s ability to identify common themes from multiple documents makes it useful to anyone who reads large sets of documents looking for commonality. Students could use LASI to assess how useful certain scientific publications are to the topic they are researching. Teachers could use it as an initial analysis tool to verify that student papers are staying on topic. Similarly, research analysts could use LASI to verify whether different papers and articles address the specific areas they are researching.

The LASI prototype will demonstrate an ability to analyze documents syntactically and semantically in order to extract themes from multiple documents. The analysis performed by the prototype will be more rudimentary than that of a full working model.
The difference lies mainly in the depth to which the prototype can understand and recognize the relationships of words to each other across the document as a whole.

1.3 Definitions, Acronyms and Abbreviations

A.I.D.: Assessment, Improvement, and Design.
A.I.D. Process: A process that provides a quantitative and qualitative basis for identifying problems and determining the feasibility of solutions.
Analysis: Detailed examination of the elements or structure of something, typically as a basis for interpretation.
Document: A formally written, expository paper which expounds, via a declarative approach, on a relatively quantifiable issue, goal, or area of research.
Head word: A locally distinct word within a phrase which, by its syntactic associations, determines the category of the phrase itself.
LASI: Linguistic Analysis for Subject Identification.
Linguistic Analysis: The scientific analysis of a language.
Parser: Software that takes in DOC and DOCX files and converts them to TXT files.
Part of Speech Tagger: A software utility that associates the words in a sentence with their parts of speech.
Phrase: An instance of the Phrase class.
Phrase: (Linguistically) A group of words standing together as a conceptual unit.
Phrase Class: The root of the taxonomy of class types which correspond to syntactic roles at the phrase level and whose instances contain a collection of Words which together represent a linguistic phrase.
Semantic Analysis: Relating the syntactic structure of words to their language-independent meanings.
SharpNLP: A natural language processing tool, written in C#, used to parse text and tag parts of speech.
Strategic Document: A document produced by a client that defines their Goals, Visions and Missions.
Subject Identification: The process by which the subject matter and thematic content of documents is determined.
Syntactic Analysis: Identifies key words based on their location in the sentence, rather than their overall meaning throughout the document.
Tagged: The type of file that stores the output of the part-of-speech tagger, containing all of the text of the document with embedded syntactic annotations.
Theme: A subject-object-verb relationship that LASI is attempting to generate from the input set.
Tag: A label, or the act of attaching a label, that specifies the syntactic role of a selected element in a document.
Tagged Set: A group of words whose part of speech and location in a sentence have been identified by the parser.
WordNet: Compiler and provider of the data files which form the basis for the LASI thesaurus.
Word Class: The root of the taxonomy of class types which correspond to parts of speech at the word level and whose instances encapsulate each occurrence of a textually identified word.
Word Weight: A numeric value, associated with each syntactically and lexically unique word in a written work, indicating its significance.

1.4 References

Hester, P. T., & Meyers, T. (2012). Enterprise AID: A performance measurement system for enterprise assessment, improvement, and design (NCSOSE-TR-12-001). Norfolk, VA: National Centers for System of Systems Engineering.

1.5 Overview

The product prototype specification provides the hardware and software needs, algorithm data types, graphical user interfaces, and features of the LASI prototype. The rest of this document contains an architecture description, a functional description, external interface descriptions, functional requirements, performance requirements, assumptions and constraints, and non-functional requirements.

2 General Description

The LASI prototype will be able to identify themes. However, it will do so without fully associating all of the words in a document.
The LASI prototype will attempt to identify themes correctly using lower-level associations such as word counting, part-of-speech (POS) tagging, adjective-to-noun binding, and adverb-to-verb binding. It will leave out more difficult associations such as pronoun-to-noun binding.

2.1 Prototype Architecture Description

The LASI prototype consists of four commercial off-the-shelf (COTS) components, two algorithms, and a user interface. The four COTS components are the SharpNLP Part of Speech Tagger, the WordNet Thesaurus data, the WPF Toolkit, and a document converter. The two main algorithms are binding and weighting. The user interface will be discussed in a later section.

The SharpNLP Part of Speech Tagger is software used to take in TXT files, analyze them, and return them with words and phrases tagged with the corresponding POS (e.g., dog->NOUN). The WordNet Thesaurus data allows LASI to identify synonyms for words. The WPF Toolkit library enables LASI’s charting capabilities for the GUI. The document converter for LASI is the B2XTranslator, which allows LASI to take in DOC and DOCX files and convert them to TXT files.

The binding algorithm allows LASI to understand how words and phrases relate to one another. It binds words and phrases that go together in meaningful ways. Consider the statement “The big blue dog ran up the hill.” Here LASI would bind “big” and “blue” to “dog” because both describe an aspect of the dog, giving LASI a more complete understanding of how “dog” is used in this instance. The weighting algorithm will look at various metrics to weight words and phrases by their relative importance in the document.

2.2 Prototype Functional Description

A user will interact with the LASI prototype through a GUI. The GUI will allow a user to start a new project or open an existing one. The user will then be able to add documents to, or remove them from, the project.
The user will then be able to preview the documents added to the project before starting analysis. After analysis, the results will be displayed in three tabs: Top Results, Word Relationships, and Word Count and Weighting. The user will also be able to export the results to PDF format. Example 1 shows a prototype of the Top Results tab, which displays the likeliest themes based on the analysis.

Example 1: Top Results

Example 2 is a prototype of the Word Relationships tab. It shows that the user will be able to see these results both for all the documents together and for each individual document. The color of each word will correspond to its POS. The search box will allow the user to search for specific words and have the matching words highlighted.

Example 2: Word Relationships

Example 3 is a prototype of the Word Count and Weighting tab. It will display the count of each word in the set of documents and the weight assigned to it by the weighting algorithm.

Example 3: Word Count and Weighting

2.3 External Interfaces

The external interfaces that LASI will use are those of the previously discussed COTS software components. All other interfaces in LASI are custom.

2.3.1 Hardware

LASI will be a stand-alone program, so the only hardware required to run the prototype is a laptop or desktop with four to eight gigabytes of RAM and a multi-core processor.

2.3.2 Software Interfaces

The software needed will be the third-party software to tag parts of speech and convert DOC and DOCX files to TXT files, the LASI data structures and algorithms, and the LASI GUI (Fig. 1).
For in-class development, a virtual machine is also being used as a testing, demonstration, and code-writing environment.

Figure 1. Prototype Major Functional Components
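
As an illustration of the binding and weighting algorithms described in Section 2.1, the following is a minimal sketch and not the prototype's implementation. It is written in Python for brevity, although the prototype's tooling is C#. It assumes the sentence has already been tagged and uses hypothetical coarse tag names (ADJ, ADV, NOUN, VERB, DET, PREP) in place of real SharpNLP output. The function names, the adjacency-based binding rule, and the modifier_bonus metric are all assumptions made for this example.

```python
from collections import Counter, defaultdict

# Hypothetical coarse tags: each modifier tag maps to the tag of the head it binds to.
# A real tagger such as SharpNLP emits its own tag set.
MODIFIER_TARGET = {"ADJ": "NOUN", "ADV": "VERB"}


def bind_modifiers(tagged):
    """Bind each adjective to the next noun and each adverb to the next verb.

    `tagged` is a list of (word, tag) pairs for one sentence.
    Returns a dict mapping each head word to the list of its modifiers.
    """
    bindings = defaultdict(list)
    pending = defaultdict(list)  # head tag -> modifiers still waiting for a head
    for word, tag in tagged:
        word = word.lower()
        if tag in MODIFIER_TARGET:
            pending[MODIFIER_TARGET[tag]].append(word)
        elif tag in pending:
            # A head word arrived: attach every modifier that was waiting for it.
            bindings[word].extend(pending.pop(tag))
    return dict(bindings)


def weight_words(tagged, bindings, modifier_bonus=0.5):
    """Weight nouns and verbs by occurrence count plus a bonus per bound modifier.

    This metric is an assumption for illustration only. Weights are normalized
    so that they sum to 1 across the sentence.
    """
    counts = Counter(w.lower() for w, tag in tagged if tag in ("NOUN", "VERB"))
    raw = {w: c + modifier_bonus * len(bindings.get(w, [])) for w, c in counts.items()}
    total = sum(raw.values())
    return {w: v / total for w, v in raw.items()} if total else {}


# The example sentence from Section 2.1, tagged by hand for this sketch.
SENTENCE = [
    ("The", "DET"), ("big", "ADJ"), ("blue", "ADJ"), ("dog", "NOUN"),
    ("ran", "VERB"), ("up", "PREP"), ("the", "DET"), ("hill", "NOUN"),
]

if __name__ == "__main__":
    bound = bind_modifiers(SENTENCE)
    print(bound)
    print(weight_words(SENTENCE, bound))
```

For the example sentence, this sketch binds "big" and "blue" to "dog" and, because of the two bound modifiers, gives "dog" a larger weight than "ran" or "hill".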