LAB II – Product Specification Outline

CS 411W Lab II
Prototype Product Specification For LASI

Prepared by: Erik Rogers
Date: 03.April.2013
Version 1.0

Table of Contents

1 Introduction
  1.1 Purpose
  1.2 Scope
  1.3 Definitions, Acronyms, and Abbreviations
  1.4 References
  1.5 Overview
2 General Description
  2.1 Prototype Architecture Description
  2.2 Prototype Functional Description
  2.3 External Interfaces
    2.3.1 Hardware Interfaces
    2.3.2 Software Interfaces
    2.3.3 User Interfaces
    2.3.4 Communications Protocols and Interfaces
3 Specific Requirements
  3.1 Functional Requirements
    3.1.1 User Interface
      3.1.1.1 Start-up Screen
      3.1.1.2 Create New Project Screen
      3.1.1.3 Project Preview Screen
      3.1.1.4 Processing Screen
      3.1.1.5 Results Screen
    3.1.2 File Manager
    3.1.3 Tagged File Parser
    3.1.4 Word Association
    3.1.5 Weighting Algorithm
  3.2 Assumptions and Constraints
    3.2.1 Assumptions
    3.2.2 Constraints
    3.2.3 Dependencies
  3.3 Non-Functional Requirements
    3.3.1 Security
    3.3.2 Maintainability

List of Figures

Figure 1 - Major Component Model
Figure 2 - LASI's 3-Sector Algorithm Overview
Figure 3 - Document Traversal
Figure 4 - Word Type Class Diagram
Figure 5 - Phrase Type Diagram
Figure 6 - Interface Types
Figure 7 - UI - Top Results Tab
Figure 8 - UI - Word Relationships Tab
Figure 9 - UI - Word Count and Weighting Tab

List of Tables

Table 1. Effects of Assumptions, Dependencies, and Constraints on Requirements

1 Introduction

1.1 Purpose

Linguistic Analysis for Subject Identification (LASI) is a computer application currently being developed by the CS411 Red Team. Linguistic analysis is the examination of language form, language meaning, and the ways in which these two combine to form language context.
LASI, a linguistic analyzer that serves as a decision support tool, extracts themes (specific qualities and characteristics) from a document or range of documents. Locating the themes of documents is necessary because it allows the reader to comprehend what has been read; having comprehended it, the reader can summarize and share the material. LASI will take various document types as input and return a weighted list of themes for each document individually, along with any common themes found across the group of input documents. LASI will not make decisions for the user directly; rather, it will allow the user to infer information based on the results.

The CS411 Red Team imagines that LASI will not only aid the persons who presented us with the problem, Dr. Hester and Dr. Meyers (introduced in Lab 1, Section 1), but will also benefit numerous professions. Students will be able to use LASI to determine whether publications across the Internet relate to their areas of study or research paper topics. Teachers could use LASI in the classroom to grade student papers or to provide examples of how language is used and interpreted. Research analysts and statisticians should be able to parse numerous documents in order to quickly locate the topics relevant to their data analysis. Contractors, consultants, and related professions would be able to adapt all of the previous uses to suit their own or their clients’ needs.

1.2 Scope

LASI will implement a handful of features detailed later in this document, a few of which are advanced for the field of linguistic analysis. LASI will accept multiple documents in the supported text formats (DOC, DOCX, and TXT; see Section 3.2.2) and deduce the individual themes of each document as well as the themes and commonalities across all documents included. The prototype will allow users to infer important themes without having to manually parse the documents.

1.3 Definitions, Acronyms, and Abbreviations

A.I.D. Process: A process that provides a quantitative and qualitative basis to identify problems and determine the feasibility of solutions.
Analysis: Detailed examination of the elements or structure of something, typically as a basis for interpretation.
Document: A formally written, expository paper which expounds, via a declarative approach, on a relatively quantifiable issue, goal, or area of research.
Head word: A locally distinct word within a phrase which, by its syntactic associations, determines the category of the phrase itself.
LASI: Linguistic Analysis for Subject Identification.
Linguistic Analysis: The scientific analysis of a language.
Parser: Takes in DOC and DOCX files and converts them to TXT files.
Part of Speech Tagger: Software utility that associates words with their parts of speech in a sentence.
Phrase: An instance of the Phrase class.
Phrase: (Linguistically) A group of words standing together as a conceptual unit.
Phrase Class: The root of the taxonomy of class types which correspond to syntactic roles at the phrase level and whose instances contain a collection of Words which together represent a linguistic phrase.
Semantic Analysis: Relating the syntactical structure of words to their language-independent meanings.
SharpNLP: A natural language processing tool, written in C#, used to parse and tag parts of speech.
Strategic Document: Document produced by a client that defines their Goals, Visions, and Missions.
Subject Identification: The process by which the subject matter and thematic content of documents is determined.
Syntactic Analysis: Identifies key words based on their location in the sentence, rather than their overall meaning throughout the document.
.TAGGED: The type of file that stores the output of the part-of-speech tagger, containing all of the text of the document with embedded syntactic annotations.
Theme: Subject-object-verb relationships that LASI is attempting to generate from the input set.
Tag: A label, or the act of attaching a label, that specifies the syntactic role of a selected element in a document.
Tagged Set: A group of words whose part of speech and location in a sentence have been identified by the parser.
WordNet: Compiler and provider of the data files which form the basis for the LASI thesaurus.
Word Class: The root of the taxonomy of class types which correspond to parts of speech at the word level and whose instances encapsulate each occurrence of a textually identified word.
Word Weight: A numeric value, associated with each syntactically and lexically unique word in a written work, indicating its significance.

1.4 References

"WordNet." Princeton University, 27 Dec. 2012. Web. Fall 2012. <http://wordnet.princeton.edu/>.
"The Stanford NLP (Natural Language Processing) Group." Stanford University, n.d. Web. Fall 2012. <http://www-nlp.stanford.edu/>.
Rogers, Erik P. CS411 Red Team - LASI - Lab 1. Paper. Old Dominion University, 2013. Print.

1.5 Overview

This product specification provides the hardware and software configuration, external interfaces, capabilities, and features of the LASI prototype. The remaining sections of this document include a detailed description of the hardware, software, and external interface architecture of the LASI prototype; the key features of the prototype; the parameters that will be used to control, manage, and establish each feature; and the performance characteristics of each feature in terms of outputs, displays, and user interaction.

2 General Description

2.1 Prototype Architecture Description

The LASI prototype is composed of two main components: hardware and software. It will not implement any hardware; rather, it will require preexisting hardware that meets minimum technical specifications in order to run successfully and within optimal constraints.
LASI’s software is divided into two subcomponents: the algorithm and the user interface. The algorithm consists of three sectors (a primary analysis, a secondary analysis, and a tertiary analysis). The user interface will allow users to interact with the algorithm while hiding information that is irrelevant or likely to confuse them.

Figure 1 - Major Component Model

The Major Component Model in Figure 1 displays the minimum technical requirements for the hardware (whether a physical or a virtual machine) that runs the software package, as well as the separation of the two subcomponents within the software package. The LASI prototype will require a system with a quad-core processor, 8 gigabytes of random access memory, and a solid-state storage drive or better. LASI’s internal calculations and passes will be hidden from the user by a user interface that displays only the relevant, final results.

2.2 Prototype Functional Description

LASI’s algorithm is conceptualized in three phases for effective programming. Each of these phases is essential to producing the expected output. The phase-by-phase breakdown follows.

Figure 2 - LASI's 3-Sector Algorithm Overview

Figure 2 depicts LASI’s 3-sector algorithm overview. The stages are executed in sequence. The “Primary Analysis” phase will parse input documents word by word, tag each part of speech, and count the frequency of each word and part of speech found. The “Secondary Analysis” phase will bind pronouns and adjectives to nouns and tag the respective relationships as phrase types. The “Tertiary Analysis” phase will locate and relate synonyms based on a set taxonomy, assess subject-object-verb relationships and assign them a phrase type, parse through each word/phrase relationship and assign weights, and finally output the final weights to the user interface.
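
The three phases described above can be sketched as a small pipeline. This is only an illustrative Python sketch, not the prototype's C# implementation: the function names, the simplified tag names (DET, ADJ, NOUN, VERB), the hard-coded sample input, and the weighting rule (word frequency plus one per bound adjective) are all assumptions made for the example. Pronoun binding and synonym lookup are omitted.

```python
from collections import Counter

# Hypothetical pre-tagged input: (word, part_of_speech) pairs. In LASI the tags
# would come from the part-of-speech tagger; here they are hard-coded.
SAMPLE = [
    ("The", "DET"), ("red", "ADJ"), ("team", "NOUN"), ("builds", "VERB"),
    ("the", "DET"), ("prototype", "NOUN"),
]

# Assumption for this sketch: only nouns and verbs receive a weight.
CONTENT_TAGS = {"NOUN", "VERB"}


def primary_analysis(tagged):
    """Primary phase: count the frequency of each word and each part of speech."""
    word_counts = Counter(word.lower() for word, _ in tagged)
    pos_counts = Counter(pos for _, pos in tagged)
    return word_counts, pos_counts


def secondary_analysis(tagged):
    """Secondary phase (simplified): bind each adjective to the next noun."""
    bindings, pending = [], []
    for word, pos in tagged:
        if pos == "ADJ":
            pending.append(word.lower())
        elif pos == "NOUN":
            bindings.extend((adj, word.lower()) for adj in pending)
            pending = []
    return bindings


def tertiary_analysis(tagged, word_counts, bindings):
    """Tertiary phase (simplified): weight = frequency + number of bound adjectives."""
    bound = Counter(noun for _, noun in bindings)
    weights = {}
    for word, pos in tagged:
        key = word.lower()
        if pos in CONTENT_TAGS:
            weights[key] = float(word_counts[key] + bound[key])
    # Highest weight first; ties are broken alphabetically for a stable order.
    return sorted(weights.items(), key=lambda item: (-item[1], item[0]))


def run_pipeline(tagged):
    """Run the three phases in sequence and return the sorted weights."""
    word_counts, _ = primary_analysis(tagged)
    bindings = secondary_analysis(tagged)
    return tertiary_analysis(tagged, word_counts, bindings)


if __name__ == "__main__":
    print(run_pipeline(SAMPLE))
```

Each phase consumes only the output of the phases before it, which mirrors the sequential execution shown in Figure 2.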
Figure 3 - Document Traversal

The first function that the prototype performs is parsing the document. The Document Traversal, presented in Figure 3, shows the structure of a document. This organization scheme allows LASI to divide and conquer documents by splitting them into subtypes. When LASI first parses a document, it will separate the document into paragraphs, sentences, and words. The CS411 Red Team has included both clause and phrase types in this class structure for use in later phases.

Figure 4 - Word Type Class Diagram

The CS411 Red Team has integrated SharpNLP’s part-of-speech tagger to accomplish part of the parsing phase. SharpNLP parses the documents and tags the parts of speech in a format that LASI’s algorithm can import and use. The Word type construct, built by the CS411 Red Team on the class diagram in Figure 4, will wrap the textual tokens imported from SharpNLP’s results word by word and link them by role. This functionality will allow the prototype’s algorithm to access and manipulate tokens based on their part of speech and relative use in sentences. Once LASI has linked each token by role, it will be able to begin the second stage: binding and relating parts of speech to one another to construct phrases.

Figure 5 - Phrase Type Diagram

Figure 5 presents the class structure for determining phrase types. Once a document has passed through the “Primary Phase”, it is eligible for the “Secondary Phase”. Here, each token in the document will be re-analyzed syntactically together with its neighboring tokens. If there is any relevant relationship between tokens within the same sentence or paragraph, the relationship will be stored and a phrase type will be assigned.

Figure 6 - Interface Types

In order to aid the construction of phrases, LASI implements an interface hierarchy, pictured in Figure 6.
This interface hierarchy allows each independent entity contained within a phrase to assume certain roles based on its characteristics. These roles can be used in the “Tertiary Phase” to strengthen the syntactic relationships between phrases and infer levels of semantic relationships.

Once a document enters the “Tertiary Phase”, it will be parsed again, token by token, in search of synonyms. A collection of database files, provided by WordNet and arranged by part of speech, is parsed, and each token in a database file is compared to each token in the document. When a match is found in the database file for the same part of speech, the token and its synonym will be inspected by their philosophical category, also provided by WordNet. If the token in the document and its matched synonym are within the same category, they will be marked as synonyms.

LASI will then assign each token in the document a weight determined by its part of speech. Weighting each token individually provides an objective baseline for further weighting modifiers. After each token possesses a weight, LASI will search for the token-to-token relationships defined in the “Secondary Phase”, as well as phrase relationships, and assign weight modifiers determined by both syntactic and semantic relationships. Weights of synonyms will be handled separately to ensure a stable distribution of importance. The final weights of each phrase will be leveled by the token standardization. Once the weights are finalized, they will be sorted. The final component of the “Tertiary Phase” will allow the user interface to retrieve the results for display.

2.3 External Interfaces

LASI has been designed as a stand-alone, open-source software application that runs on a local (or virtual) machine such as a laptop or desktop computer. External interfaces are therefore limited to standard hardware provided by the user, and all third-party software has been incorporated internally.
Users will not be required to purchase hardware specifically for the prototype, nor will they need any outside software.

2.3.1 Hardware Interfaces

No hardware interfaces will be constructed for this prototype. A virtual machine, hosted by Old Dominion University, will be used to demonstrate LASI. LASI can also be run on a physical machine within the lab.

2.3.2 Software Interfaces

No software interfaces will be required for this prototype to run. LASI contains all of the functions for, and calls to, third-party software internally. Output results can be exported to Microsoft Office applications if the user so chooses, but those applications are not required for LASI to function.

2.3.3 User Interfaces

One of the key features of LASI is the ability to graphically display the parsed results in a manageable format. The CS411 Red Team will provide three tabs: Top Results, Word Relationships, and Word Count and Weighting. These three tabs allow the user to view the results in three separate modes and provide sub-tabs to view each document individually in addition to the collective results of all documents.

Figure 7 - UI - Top Results Tab

Figure 7 illustrates the Top Results tab. This tab allows the user to view, in chart form, the top results: the most likely themes for the document or documents analyzed. Results are available for each document individually as well as for the collection of documents as a single entity. These results can be exported to Microsoft Office applications.

Figure 8 - UI - Word Relationships Tab

The Word Relationships tab, displayed in Figure 8, gives users the option to view the word relationships, including parts of speech, in each document individually. In this view, users will be able to mouse over each word in order to see the relationships, individual word statistics, and phrase statistics.
A key will be added to this view to explain the color-coding.

Figure 9 - UI - Word Count and Weighting Tab

Figure 9 displays the Word Count and Weighting tab, which will show each word’s count and weight for both the individual and the collective documents. The count for each word will be represented by a whole number as well as a percentage of frequency in comparison to other words. Weights will be represented as decimals.

LASI gives users the option to print or export their results. Results will be exportable to Microsoft Office applications (such as Excel, Word, and PowerPoint), to graphical images (with .JPG, .PNG, etc. extensions), and to PDF format. The CS411 Red Team hopes that this feature will allow for extended use of the results.

2.3.4 Communications Protocols and Interfaces

No communication protocols or interfaces will be required for the prototype to run. LASI was built to run without connection to sources external to the machine on which it is executed. In order to demonstrate LASI’s functionality, Transmission Control Protocol/Internet Protocol (TCP/IP) over a standard Ethernet connection will be used to access the designated virtual machine.

3 Specific Requirements

The following section describes the specific functional and non-functional requirements, along with the assumptions and constraints, of the LASI prototype.

3.1 Functional Requirements

The functional requirements describe the capabilities of the LASI prototype. They describe what the product must do in order to meet the previously discussed goals and objectives of the project.

3.1.1 User Interface

The LASI GUI is the means by which the user interacts with LASI and views the results.

3.1.1.1 Start-up Screen

The Start-up Screen will provide two distinct paths to LASI’s functionality. It will allow the user to create or load a project.

1. The user shall be able to create new projects.
2. The user shall be able to load existing projects.

3.1.1.2 Create New Project Screen

The Create New Project Screen guides the user through the new project creation process. It will prompt the user to enter the necessary information. During this process the screen will display a running list of the documents selected for analysis. The following functional capabilities shall be provided:

1. The user shall be able to add files of the following types to a project:
   a. DOC
   b. DOCX
   c. TXT
2. The user shall not be able to load any other file types.
3. The program shall allow a maximum of five documents to be loaded into a single project.
4. Documents added to the project shall be displayed in a document queue.
5. The user shall be able to remove documents from the document queue.
6. All fields must be correctly filled out to create a new project.

3.1.1.3 Project Preview Screen

The Project Preview Screen will provide a preview of the documents selected on the Create New Project Screen. It allows the user to verify that the correct documents have been selected and to add or remove documents. This screen will allow the user to start the analysis process. The following functional capabilities shall be provided:

1. LASI shall provide a preview of uploaded documents.
   a. Each tab in the document preview shall display the title of its document.
   b. Each tab shall display the text of the document.
2. The user shall be able to add accepted files to a project.
   a. The program shall allow a maximum of 5 files.
3. The user shall be able to remove documents from a project.
4. LASI must have at least 1 document to start analyzing.

3.1.1.4 Processing Screen

The Processing Screen will display a moving graphic to show that the system has not frozen. The user will also be able to interrupt analysis. The following functional capabilities shall be provided:

1. The user shall be able to interrupt analysis.
   a. The user shall be returned to the Project Preview Screen.
   b. All temporary data shall be discarded.
2. LASI shall display a visual indication that analysis is still ongoing.

3.1.1.5 Results Screen

The Results Screen will allow the user to toggle between different scopes, perspectives, and levels of detail. The following functional capabilities shall be provided:

1. LASI shall render results in multiple views.
   a. The Top Results View shall provide a visualized summary of the analysis.
      a.1. The user shall be able to toggle between different graphical views of the top results.
      a.2. This view shall allow the user to toggle between individual and collective document scopes.
   b. The Word Relationships View shall provide visuals to show all relationships and bindings of words throughout each document.
   c. The Word Count & Weighting View shall provide the data behind the analysis.
      c.1. The user shall be presented with the quantitative data used in the analysis.
      c.2. This view shall allow the user to toggle between individual and collective document scopes.
2. LASI shall be able to export results.

3.1.2 File Manager

The File Manager verifies that the documents loaded into a project are of the types allowed. It provides file conversion routines to format documents into plain text. It also manages the tagging process and the resulting TAGGED files. The following functional capabilities shall be provided:

1. The File Manager shall accept a path to a document.
2. The File Manager shall verify that the document is in one of the following file formats:
   a. DOC
   b. DOCX
   c. TXT
3. The File Manager must be able to convert a DOC file to DOCX.
4. The File Manager must be able to convert a DOCX file to TXT.
5. The File Manager shall invoke the SharpNLP tagger to process each TXT file into a new TAGGED file containing:
   a. The original text of the document.
   b. The part of speech of each word.
   c. The type of every phrase.
6. The File Manager shall provide functionality to back up the entire project directory.
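
The File Manager's format check and conversion chain (DOC to DOCX to TXT to TAGGED) can be illustrated with a short sketch. It is written in Python for brevity and is not the prototype's code: the names `document_format`, `is_accepted`, and `conversion_plan` are hypothetical, and no actual file conversion or tagging is performed; the sketch only decides which steps a given path would need.

```python
from pathlib import Path

# Formats named in the File Manager requirements, listed in conversion order.
# A document enters the chain at its own format and moves right until TAGGED.
CONVERSION_CHAIN = ["DOC", "DOCX", "TXT", "TAGGED"]
ACCEPTED_FORMATS = {"DOC", "DOCX", "TXT"}


def document_format(path):
    """Return the upper-case file extension of `path`, without the dot."""
    return Path(path).suffix.lstrip(".").upper()


def is_accepted(path):
    """Requirement 2: only DOC, DOCX, and TXT documents are accepted."""
    return document_format(path) in ACCEPTED_FORMATS


def conversion_plan(path):
    """List the (source, target) conversion steps needed to reach a TAGGED file."""
    fmt = document_format(path)
    if fmt not in ACCEPTED_FORMATS:
        raise ValueError(f"unsupported document type: {path!r}")
    start = CONVERSION_CHAIN.index(fmt)
    remaining = CONVERSION_CHAIN[start:]
    # Pair each format with its successor: DOC->DOCX, DOCX->TXT, TXT->TAGGED.
    return list(zip(remaining, remaining[1:]))


if __name__ == "__main__":
    for name in ["report.doc", "notes.txt"]:
        print(name, conversion_plan(name))
```

Expressing the chain as an ordered list keeps the accepted formats and the conversion order in one place, so a TXT file simply skips the earlier steps.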
3.1.3 Tagged File Parser

The Tagged File Parser loads the TAGGED files and creates an in-memory data structure of the documents. The following functional capabilities shall be provided:

1. The Tagged File Parser shall only accept a TAGGED file.
2. The Tagged File Parser shall create an instance of the Word subclass corresponding to the annotation embedded for that word in the TAGGED file.
3. The Tagged File Parser shall create an instance of the Phrase subclass corresponding to the annotation embedded for that phrase in the TAGGED file.

3.1.4 Word Association

The Word Association algorithm will associate words and phrases with one another based on their part of speech (POS) and their syntax within the document. The following functional capabilities shall be provided:

1. The Subject binder determines which noun phrases are the subjects of verb phrases.
2. The Object binder determines which noun phrases are the direct objects of verb phrases.
3. The Object binder determines which noun phrases are the indirect objects of verb phrases.
4. The Thesaurus correctly identifies synonyms.
5. An adjective or adjective phrase describing a noun or noun phrase is bound as a describer to that noun or noun phrase.
6. A noun or noun phrase that is the subject of a verb or verb phrase is bound as a subject to that verb or verb phrase.
7. A noun or noun phrase that is the direct object of a verb or verb phrase is bound as a direct object to that verb or verb phrase.
8. A noun or noun phrase that is the indirect object of a verb or verb phrase is bound as an indirect object to that verb or verb phrase.
9. Adverb phrases are associated with the adjective phrases or verb phrases that they modify.

3.1.5 Weighting Algorithm

The Weighting Algorithm will calculate numeric weights for each Word and Phrase based on their syntactic associations. Themes are assembled based on this analysis. The following functional capabilities shall be provided:

1. Each Word instance shall start with an equivalent initial weight.
2. Each Phrase instance shall start with an equivalent initial weight.
3. The weighting algorithm shall update the previous weight of each word when the word is encountered again.
4. The weighting algorithm shall update the previous weight of each word when its association count is incremented.
5. The weighting algorithm shall increase the weight of a Word if a synonym is encountered.
6. The weighting algorithm shall increase the weight of a Word if:
   a. A Word is associated with it.
   b. A Phrase is associated with it.
7. The weighting algorithm shall increase the weight of a Phrase if:
   a. A Word is associated with it.
   b. A Phrase is associated with it.
8. The algorithm shall exclude weights of commonly used words such as 'the', 'to', 'a', etc. on an individual word-by-word basis.
9. The distance between words and phrases shall be used as a weight modifier.
10. Each Word instance shall store its weight with respect to:
   a. The individual document containing it.
   b. All of the documents in the project.
11. Each Phrase instance shall store its weight with respect to:
   a. The individual document containing it.
   b. All of the documents in the project.
12. There shall be an aggregate result computed across all documents.

3.2 Assumptions and Constraints

The LASI prototype will operate on a set of assumptions and constraints that will act as boundaries for the prototype functionality. Table 1 contains the full list of assumptions, constraints, and dependencies for the prototype.

Condition | Type | Effect on Requirements
Document types are limited to DOC, DOCX, and TXT. | Constraint | If a document cannot be converted into raw text, it cannot be accepted.
A project is limited to 5 documents. | Constraint | This restricts the number of documents to a testable set.
Documents submitted shall consist of entirely grammatically correct statements. | Assumption | Allows for minimal error checking for the purposes of developing and demonstrating the prototype.
The host machine will have a sufficient amount of RAM and at least one multicore processor. | Assumption | Allows for minimal error checking for the purposes of demonstrating the prototype.
The host machine must have .NET version 4.5 or greater. | Dependency | The prototype cannot be demonstrated without this.
The host machine must be using a 64-bit Windows operating system. | Dependency | The prototype is not designed to work on other operating systems. It lacks a GUI for other operating systems and is untested.

Table 1. Effects of Assumptions, Dependencies, and Constraints on Requirements

3.2.1 Assumptions

Two assumptions are made with respect to the LASI prototype. First, the documents used for analysis are outside LASI’s control. It is expected that the documents submitted shall consist of entirely grammatically correct statements. This will allow for minimal error checking for the purposes of developing and demonstrating the prototype. Second, the host machine is assumed to have sufficient specifications to run the LASI prototype.

3.2.2 Constraints

A number of constraints will be used to limit the scope of the prototype and simplify the development process. First, the types of documents that the LASI prototype can accept have been limited to DOC, DOCX, and TXT. Images cannot be analyzed. This also means that if a document cannot be converted into raw text, it cannot be accepted. Second, the prototype will limit the number of documents that can be added to one project to five. This is to ensure that the algorithm can function in a timely manner.

3.2.3 Dependencies

There are two dependencies that have been identified for the LASI prototype. The hardware used for the prototype demonstration must have .NET version 4.5 or greater installed.
The host machine must be running a 64-bit Windows operating system, since the prototype is not designed to work on other operating systems: it lacks a GUI designed for them and is untested in such environments. ODU servers are expected to be available to host the LASI components. If the ODU servers are unavailable, personal hardware will need to be used.

3.3 Non-Functional Requirements

3.3.1 Security

The prototype has no security requirements. Users are advised to keep sensitive documents protected within their personal storage drives. Sensitive documents handled by LASI are subject to security risks only if the user leaves their personal machine unprotected.

3.3.2 Maintainability

The prototype’s entire functionality can be maintained. The CS411 Red Team plans to release the software as open source so that the community at large can continue to improve upon our intended output or adapt the current infrastructure to accomplish tasks for which LASI was not initially intended.