Lab 1 - LASI Product Description

Red
Aluan Haddad
CS411W
Gene H. Price, Janet B. Brunelle
03/18/2013
Version 2

Contents

1. Introduction
2. Product Description
2.1 Key Product Features and Capabilities
2.1.1 Deployment and Modularity
2.1.2 Native Source Document Compatibility
2.1.3 Output
2.2 Major Functional Components (Hardware and Software)
2.2.1 Current off the Shelf Software Components
2.2.4 Key Type System and Grammatical Representation Foundations
2.2.3 Key Algorithms
3. Identification of Case Study
4. Prototype Description
4.1 Major Functional Components (Hardware and Software)
4.2 Features and Capabilities
Glossary of Terms

List of Figures

Figure 1 the user flow through the LASI GUI
Figure 2 Tornado Chart of words in a paragraph written by Drs. Hester and Meyers
Figure 3 demonstrates the use of syntax highlighting in the GUI
Figure 4 demonstrates a statistical view of a parsed document
Figure 6 provides an abstract conceptual description of the syntactic analysis process
Figure 7 loosely describes LASI's hardware and software separation
Figure 8 contrasts the functionality of the prototype with the conceptual package

1. Introduction

LASI (Linguistic Analysis for Subject Identification) is a natural language processing engine that will combine raw lexical analysis heuristics with sophisticated syntax-aware heuristics and thereby form a basis to extrapolate, determine ways to interrelate, and abstract statistically derived semantic content over an input domain containing multiple English written works. The process of linguistic analysis, defined herein as the procedural study of how the words and phrases within a written work compose to form emergent meanings, is the central concept behind the LASI project.
The use of language is a constantly evolving, self-describing process which builds on the composition of syntactic rules to express complex, emergent ideas, which in turn compose together to form themes: the distillation and interrelation of the relationships literally described by the document. Essentially, linguistic analysis in this context aims, at least conceptually, to quantify qualitative information sufficiently that trivial disagreements can be dispelled, allowing for faster and more effective decision making. Such analysis tools can provide key services in their role as decision support tools. LASI is such a tool.

The notion of theme refers to emergent, overlapping, intra- and inter-textually derived mental constructs which represent one of the key bases for human communication. In a sense, themes provide an abstraction interface which allows for the expression of linguistic ideas. However, although humans would be unable to express complex ideas to one another without communicating via thematic abstraction, interpreting and expressing themes is often fraught with misunderstandings, conflation, and subject-arbitrary emotional associations. Any of these pitfalls can impede and stifle linguistic communication. For example, consider the case of an author who genuinely expresses a certain theme with eloquence and brevity, yet is criticized for stating something he did not in fact assert. The reference frame of the reader unpredictably clashed with that of the author in a way neither of them was capable of predicting, perhaps resulting in a mutual perception of disagreement over a subject on which, had the author's words been parameterized differently, they might have wholeheartedly agreed.
Thus, in spite of or perhaps because of the critical role which themes play in communication and expression, a multitude of potentially baseless or conflated concepts are communicated between authors and readers as well as between individual readers. In the case of readers, small differences in their respective interpretations of some works can compose into serious disagreement over what a given author is trying to express. While this has many powerful and sometimes even positive implications, and while it forms one of the key underpinnings for metaexplorative disciplines such as epistemology and philosophy, discord over needless misunderstanding can have very harmful effects in areas where justifiable, imperative decisions must be made based solely on textual perusal. For example, consider a situation in which government agencies or large corporations must carefully determine how to allocate scarce resources, or must make time-critical financial or military decisions. As these situations involve multiple individuals doing independent research and then pooling their knowledge, and since some degree of semantic disagreement is inevitable in a relatively democratic environment, serious problems such as needless delays, resource misappropriations, or outright inaction may result and cause severe damage.

2. Product Description

LASI is a software package which aims to provide such decision support and validation via combinations of high performance, context-aware statistical heuristics and a graphical front end which will allow researchers to glean more meaningful results more quickly from their sources of written information. The pairing of these sophisticated algorithms and a graphical user interface (GUI) will allow the tool to reach and assist a broad range of individuals with widely varying technical backgrounds and areas of research.
Additionally, because of the broad societal significance of the problem it approaches, LASI has many potential applications which exceed the domain of pure research. For example, its pattern recognition and synonym generalization features could help professors identify plagiarism that has been wrapped in a thin veneer of basic paraphrasing. Likewise, its contextual awareness capabilities could help students quickly find relevant sources to cite for written assignments. In more in-depth contexts, advanced LASI users such as researchers, like the progenitor of the LASI project, Dr. Patrick T. Hester, can reap the benefits of the algorithms' inferential capabilities to provide their clients with quantitatively verifiable assessments of the complex systems they assess, thereby providing recommendations that are more specialized. Broadly speaking, any individual needing to quickly become familiar with a specific area of a broad field of study or knowledge could employ LASI's unique functionality to quickly home in on increasingly relevant written resources.

Due to both the selection of C# and Dr. Hester's use of Windows enterprise software, LASI will initially target Microsoft's .NET Framework. However, due to the availability of reliable .NET framework implementations for non-Windows platforms, the slow but steady transition Microsoft is making towards supporting open source programs, and the conservative selection of core language features used in its implementation, LASI will ultimately be accessible to users of a wide array of software platforms including Windows 7 and 8, various iterations of Mac OS X, and a multitude of Linux- and Unix-based platforms including Red Hat and BSD.
2.1 Key Product Features and Capabilities

LASI will feature a number of different assessment techniques, over which the user can exert as much control as desired, to extrapolate and construct, from a linear set of word-tokens, a reflexive web of syntactic and semantic associations, all of which will be revised and refined recursively as LASI continues to infer potential relationships. Before any higher level analysis is attempted, a set of syntactic parsing libraries will examine the text and identify, statistically and locally, the likely part of speech of each of the text's lexical constructs. The result of this phase is a collection of words and phrases which have been categorized by usage and thus mapped to program constructs which encapsulate their syntactic roles.

After this initial step, which results in a dynamic data model driven by word and phrase behavior, a large number of independent statistical functions and element association techniques will be applied, their results compared, procedures potentially reordered and reevaluated, and the results finally interrelated over multiple sources and representations in an attempt to find the common thematic ideas and shared concepts of the input domain. A key technique that allows this to be accomplished is the assignment of a variety of numerical weights, both to individual word and phrase elements and to sets of potentially associated constructs, which are iteratively modified and scaled by each subsequent metric applied.

2.1.1 Deployment and Modularity

LASI is an open source, portable, and standalone application. After its release version is finalized, its source code will be made freely available under an as yet undetermined open source license.
Although much of the functionality it aims to provide is designed to be hardware platform agnostic, a mid-range laptop or desktop computer, currently valued at roughly 600 USD, will be required in order to gain timely results from comparative analysis of multiple documents.

The core of the LASI Framework consists of a set of modular algorithms and data structures, which together function as an engine that weaves English textual information into a queryable run-time data structure, thus allowing new statistical functions to be added and new orderings of metrics to be defined with minimal overhead. This facilitates efficient debugging and ever increasing understanding as the LASI team develops the project and, once the source is publicly released, inherently defines a convenient, flexible API under which new programmers can write extensions or add to the base.

Because LASI is designed as a standalone desktop application that can be run on affordable, commonplace computers, and because it has been designed with open source extensibility and accepted, familiar GUI conventions in mind, it has the potential to reach beyond the archetypal science academics who generally make use of heuristic data processing engines.

The primary goal of LASI is to assess, compose, and interrelate large quantities of Sufficiently Contiguous English text. Initially, the domain of input documents will be limited to peer-reviewed research papers and academic journal articles in order to reduce early issues that may arise from the increasingly accepted use of colloquialisms in popular writing. LASI will then process these documents in order to extrapolate thematic content and ultimately, via a concept roughly analogous to set intersection, to determine a valid, intertextual commonality between them if one can be found.
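The set-intersection analogy above can be made concrete with a small sketch. Python is used here purely for brevity (LASI itself targets C#), and the theme sets and the `common_themes` helper are hypothetical illustrations, not part of LASI. The sketch treats each document as a plain set of theme labels, which is a simplifying assumption; the text only says the concept is roughly analogous to intersection.

```python
from functools import reduce


def common_themes(theme_sets):
    """Return the themes shared by every document (plain set intersection).

    `theme_sets` is a list of sets, one per document. This is a hypothetical
    helper for illustration; it ignores weights and significance thresholds.
    """
    if not theme_sets:
        return set()
    return reduce(lambda shared, themes: shared & themes, theme_sets)


# Hypothetical per-document themes, for illustration only.
doc_a = {"decision support", "systems engineering", "risk"}
doc_b = {"decision support", "risk", "budgeting"}
doc_c = {"risk", "decision support", "logistics"}

shared = common_themes([doc_a, doc_b, doc_c])
print(sorted(shared))  # ['decision support', 'risk']
```

A strict intersection like this one discards any theme missing from even a single document, which is one reason a weighted comparison with significance thresholds is likely to be more useful in practice.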
Because it can be assumed there will be many cases where common patterns are found between documents that have little to do with one another, determining a range of valid thresholds for statistical significance, especially when analyzing the results of some of the more naïve metrics, is a key aspect of the implementation.

2.1.2 Native Source Document Compatibility

In terms of common capabilities and user accessibility features, the LASI software package will accept English textual works in multiple popular file formats. Currently, all Microsoft Word document types as well as raw ASCII text files are fully supported. LASI will also provide native support for adding Adobe Acrobat documents directly to user projects at some point. Implementing this functionality has been given a relatively low priority by the team, as it requires that an optical character recognition system be implemented or integrated successfully, such that all potentially erroneous characters parsed from an Acrobat document containing scanned text are identified and fully dealt with before the text is passed to the tagging module.

In addition to parsing the data provided by the user, LASI allows users to provide custom dictionary-like inputs containing weight adjustments, static associations, explicit synonym collections, and syntactic-role overrides for lexical entities in order to facilitate more focused, user-intent-driven results. While this has the advantage of increasing user control over the process and allowing for more customizable selection of results and their arrangements, it is inseparably tied to a loss of demonstrable validity, detracting from any assertions developers can make about accuracy and likelihood of bias when shipping an iteration of LASI which provides such a feature.
The most agreeable middle ground is probably an approach that allows users to make some adjustments through a properly abstracted interface while providing clear, unmistakable warnings regarding the decreasing verifiability of results.

The user interface provides standard, responsive navigation functions that explicitly provide for all of the possible branches, as illustrated by Figure 1.

Figure 1 the user flow through the LASI GUI

2.1.3 Output

The default UI results format will consist of a colorized graphical display of the key results. Tabbed views and context menus will allow the user to filter and organize the results by specifying particular source documents, word and phrase relationships, and correlation views. A simple but expressive example of such a view is illustrated by Figure 2.

Figure 2 Tornado Chart of words in a paragraph written by Drs. Hester and Meyers

An additional results format, also exhibited by the prototype GUI, conveys syntactic information to the user via part-of-speech-colorized syntax highlighting. While static at the time of the rasterization below, the dynamic nature of the run-time representation of textual entities allows for detailed association information to be displayed for each instance of each word via syntactic inference. The sample colorized output is shown in Figure 3.

Figure 3 demonstrates the use of syntax highlighting in the GUI

In addition to these contextual, high level views of LASI's analysis results, more austerely quantitative representations will be provided. Such views, a prototype example of which can be seen in Figure 4, which displays the overall frequency of every word in the document sorted by part of speech and then by text content, have the two-fold advantage of providing the user with comprehensive results and serving as a manual validation tool.
Figure 4 demonstrates a statistical view of a parsed document

The goal is to provide, for each category of information which the LASI engine can infer from a document, a human-readable view that is not overloaded with numbers, highlights the relevant information, and provides contextual navigation to other perspectives. With this in mind, the prototype screen rasterizations shown in Figure 2, Figure 3, and Figure 4 are only some of the views that are intended.

In addition to in-program dynamic result renderings, the LASI UI will facilitate exporting static views of all results to common presentational, tabular, and/or serialization-oriented file formats such as Adobe Acrobat and Microsoft Excel formats, in addition to simple non-proprietary formats such as CSV (Comma-Separated Values), XML (Extensible Markup Language), and JSON (JavaScript Object Notation). This allows for results to be flexibly retained, viewed, and shared independently of the LASI environment itself.

2.2 Major Functional Components (Hardware and Software)

The requirements for the host operating system are fairly standard, consisting of an up-to-date build of Microsoft Windows 7 Home Premium or above and an up-to-date version of the Microsoft .NET Framework v4.5 or above. Support for non-Microsoft platforms, such as a Red Hat build paired with the Mono Framework, is a planned feature. Support for the DotGNU framework on UNIX platforms is also a future possibility. The physical hardware requirements being targeted, irrespective of the operating system hosting LASI, are those of a fast but affordable desktop or notebook computer.
While some requirements are more flexible than others, the absolute minimum system specifications are a processor with at least four logical cores (via a dual-core Intel Core series processor with hyper-threading support enabled, or a quad-core AMD processor) clocked at or above 2.0 GHz, and a minimum of eight gigabytes of total system memory of type DDR3 (Double Data Rate type 3) clocked at or above 1,066 MHz. For an optimal experience, or for an open source developer experimenting with the code post release, the recommended hardware consists of a processor having at least eight logical cores (via a quad-core Intel Core series processor with hyper-threading support enabled, or an eight-core AMD processor), eight gigabytes of low latency DDR3 clocked at or above 1,333 MHz with memory access timings not greater than 9-9-9, and a solid state data storage medium for document retrieval having at least 128 megabytes of onboard DDR3 cache and a rated random read speed of at least 40 megabytes per second for arbitrary 512 kilobyte data blocks.

2.2.1 Current off the Shelf Software Components

The LASI project library contains source and executable code files from two preexisting open source C# projects. First, LASI incorporates executable code files from b2xtranslator, an open source binary to XML file format converter. Specifically, LASI contains two of its child programs: the precompiled executable doc2x, which converts legacy (1997) Microsoft Word .doc files to .docx Open XML files, and the precompiled executable ppt2x, which converts legacy (1997) Microsoft PowerPoint .ppt files to .pptx Open XML files. Both are included and used under the FreeBSD open source license.
Secondly, and far more significantly, LASI contains the part-of-speech-tagging library SharpNLP, an open source C# port of OpenNLP, together with its dynamic link libraries, which are included and used under the GNU Lesser General Public License (LGPL). The methods provided therein give critical support to the LASI project, as they are used to convert ASCII text files containing whitespace-delimited word-tokens into tagged files wherein these tokens are represented as serialized Tagged Word Objects (TWO), which contain the original lexical string annotated with embedded syntactic role information.

The reasons for returning a constructor to an object instead of the object itself in this case are twofold. First, the pattern of returning a constructor provides beneficial abstraction between the Word and Phrase types used by the algorithm, only requiring that instantiated objects derive from the abstract class Word. Secondly, it allows for deferred execution of object instantiation, which can be used with other patterns, such as monadic function composition, to provide unique and useful behavior not efficiently achieved otherwise.

An additional third party asset used by LASI, though not strictly software, is Princeton University's free, manually compiled set of synonym database text files. These files are mapped at runtime to thesaurus constructs which provide various types of synonym lookup. These thesauri make it possible for LASI to generalize many patterns that would otherwise be random or unsafe, thereby providing potentially higher-quality results and allowing for significant performance increases.
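The constructor-returning pattern described in this section can be sketched as follows. This is a minimal illustration in Python rather than LASI's C#, and every name in it (`CONSTRUCTORS`, `constructor_for`, `parse_tagged`, the `word/TAG` token format, and the handful of Penn Treebank-style tags) is an assumption made for the example, not SharpNLP's or LASI's actual API.

```python
from abc import ABC


class Word(ABC):
    """Base type that all word objects derive from (mirrors the abstract class Word)."""

    def __init__(self, text):
        self.text = text

    def __repr__(self):
        return f"{type(self).__name__}({self.text!r})"


class Noun(Word):
    pass


class Verb(Word):
    pass


class Adjective(Word):
    pass


class UnknownWord(Word):
    pass


# Hypothetical mapping from Penn Treebank-style tags to constructors.
CONSTRUCTORS = {"NN": Noun, "NNS": Noun, "VB": Verb, "VBD": Verb, "JJ": Adjective}


def constructor_for(tag):
    """Return the constructor for a tag, not an instance."""
    return CONSTRUCTORS.get(tag, UnknownWord)


def parse_tagged(tagged_text):
    """Yield (text, constructor) pairs; nothing is instantiated at this point."""
    for token in tagged_text.split():
        text, _, tag = token.rpartition("/")
        yield text, constructor_for(tag)


# Instantiation is deferred until the caller decides to invoke each constructor.
pairs = list(parse_tagged("quick/JJ fox/NN jumped/VBD"))
words = [make(text) for text, make in pairs]
print(words)  # [Adjective('quick'), Noun('fox'), Verb('jumped')]
```

Because the caller receives a constructor, it only needs to know that the result derives from `Word`, and it can delay or skip instantiation entirely, which is the deferred execution benefit the text describes.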
2.2.4 Key Type System and Grammatical Representation Foundations

The core analysis functionalities LASI implements are built around compositions and permutations of enumerable collections of redundantly linked data structures which directly represent words and phrases as instances of corresponding class types. Figure 5 provides a detailed view of the static composition of the representation of a document object at runtime, excluding syntactic links. In particular, note the multidirectional many-to-one and one-to-many aggregation relationships as well as the deliberate multi-parent and multi-child redundancy, which allow for independent iteration of the document representation from any construct within it. For example, this allows for useful data abstractions such as functions which can return free words without the need to store, maintain, and return the unused context of the word.

Figure 5 illustrates the reflexive relationships of lexical elements

To help overcome the challenge of mapping the fluidity of relationships between words in the English language, it becomes necessary to associate objects which have no suitable direct inheritance relationship. For example, the object of a transitive verb is a role compatible with both nouns and noun phrases, but it is conceptually, and here programmatically, incorrect to have noun and noun phrase share compositional inheritance relationships, because phrases are compositions of words and not words themselves. To cause noun phrase to derive from noun would introduce a literal and intellectual circular dependency and additionally lead to inexpressive, awkwardly written functions.
To provide the desired syntactic role generalization between words and phrases which have parallel behaviors but not compositions or derivations, a number of interface types are defined, which allow for elegant coding patterns and a level of abstraction which more closely matches one's mental concept of their parallel relationships.

2.2.3 Key Algorithms

The main analysis process can be conceptually broken down into three distinct phases. Each phase is designed to build upon the links established and statistical information gathered during the previous one. It is significant that, while this linear breakdown provides a useful conceptual view of how the system operates, the component algorithms may be reordered and/or executed multiple times, depending on the nature of the input set.

Figure 6 provides an abstract conceptual description of the syntactic analysis process

2.2.2.3. A Primary Phase

During the initial phase of analysis, the primary focus will be to build a textually agnostic set of associations based on the determined part of speech for each token in a document. At this relatively low level of abstraction, the document will be represented by a text-wise-linear collection of role-specialized word objects, each corresponding to a text token and an inferred part of speech. From this, a frequency assessment of the words will be used to determine an initial, naïve significance metric for each usage of each word.

2.2.2.3. B Secondary Phase

After the initial assessment, this naïve metric will be refined through two syntactic association procedures. First, binding instances of pronouns to the nouns they refer to will significantly increase the accuracy of the noun weights determined during the primary phase. Further, by binding adjectives to the nouns they describe, we begin to associate adjective counts with specific noun instances.
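The primary-phase frequency metric and the adjective binding step can be sketched as below. This is an illustrative Python sketch only (LASI is written in C#). It assumes relative frequency as the naïve metric and binds each adjective to the next noun that directly follows it; both choices, along with the `naive_weights` and `bind_adjectives` names and the simplified tag labels, are assumptions for the example rather than LASI's actual procedure.

```python
from collections import Counter


def naive_weights(tagged):
    """Relative frequency of each (lowercased word, tag) pair in the document."""
    counts = Counter((text.lower(), tag) for text, tag in tagged)
    total = len(tagged)
    return {key: n / total for key, n in counts.items()}


def bind_adjectives(tagged):
    """Attach each run of adjectives to the noun that immediately follows it.

    A crude adjacency heuristic used only for illustration; any other token
    between the adjectives and the noun breaks the binding.
    """
    bindings = []
    pending = []
    for text, tag in tagged:
        if tag == "ADJ":
            pending.append(text)
        elif tag == "NOUN":
            bindings.append((text, pending))
            pending = []
        else:
            pending = []
    return bindings


# Hypothetical tagged input, for illustration only.
tagged = [
    ("The", "DET"), ("red", "ADJ"), ("fox", "NOUN"), ("saw", "VERB"),
    ("the", "DET"), ("lazy", "ADJ"), ("old", "ADJ"), ("fox", "NOUN"),
]

weights = naive_weights(tagged)
print(weights[("fox", "NOUN")])  # 0.25
print(bind_adjectives(tagged))   # [('fox', ['red']), ('fox', ['lazy', 'old'])]
```

Note that the two occurrences of "fox" share one frequency-based weight but carry different adjective bindings, which shows how the secondary phase starts to distinguish individual noun instances.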
Both of these procedures are significant in that they begin the process of associating the linear text into constructs, thus beginning to raise the level of abstraction from individual words to inter-word relationships.

2.2.2.3. C Tertiary Phase

Following the secondary phase, we will begin to bind subject and object entities to each other by way of the verbs which associate them. Through this, we can identify and correlate the ways entities are related to one another and the significance of these associations. These linkages form the basis for statistical synonym association and, most significantly, form the basis for the themes the LASI analysis engine will identify for the user.

2.2.2.3. D Progression and Refinement

As more and more constructs are linked together, the relationships between them grow into abstract semantic associations, allowing for continuing refinement of the results via multi-pass, iterative refinements and branching assessment of parallel possible implications. These are designed to form a basis for future higher order algorithm phases to be developed as abstractions on top of the analysis provided by the three outlined here.

3. Identification of Case Study

The case study which sparked the creation of LASI and gave it a realistic purpose derives from the work of Dr. Patrick T. Hester and Dr. Tom Meyers through their organization NCSOSE (the National Centers for System of Systems Engineering). Through NCSOSE, they provide critical research and analysis services to corporations and government agencies. In this context, they provide information and perspective to help guide high level decisions. In order to provide such a service, they must combine domain specific insight with strong independent research. They generally spend hours and even days researching the technical aspects inherent in the complex organizational systems of their clients.
A high-performance linguistic analysis tool that could elucidate and validate the key areas of interest with respect to their client would help them efficiently research a domain. Additionally, the ability to synthesize concise, human-readable synopses derived orthogonally to its input data, encapsulating the commonalities between documents and what they focus on in a timely, objective manner, would be a powerful asset which would increase accuracy, efficiency, and client satisfaction.

4. Prototype Description

The LASI prototype represents a scaled back version of the real world solution. Although it retains the majority of the language processing functionality proposed, it has been reduced in scope due to the time constraints imposed by the nature of undergraduate university education. It has, however, been constrained in such a way as to allow the truncated features to be developed and easily integrated in the future.

4.1 Major Functional Components (Hardware and Software)

The major functional components of the LASI project are divided into hardware and software categories. As LASI is an application designed for the desktop, software components comprise the bulk of the discrete elements of the package. Broadly, these consist of a set of composable algorithms which perform weighting, syntactic element binding, and semantic referencing. On the hardware front, all that will be required is a single personal computer.

Figure 7 loosely describes LASI's hardware and software separation

4.2 Features and Capabilities

Although it remains fairly robust, retaining most of its planned core features, the scaled back prototype version lacks several noteworthy features. These include a human-readable explanation of its reasoning process and support for scanned text formats.
Additionally, the LASI prototype will support an input set of at most five documents, while the real-world solution would be able to handle an arbitrary number of documents.

Figure 8 contrasts the functionality of the prototype with the conceptual package

Glossary of Terms

Theme: A subject-object-verb relationship that LASI is attempting to generate from the input set.
LASI: Linguistic Analysis for Subject Identification.
Document Converter: Takes in DOC and DOCX files and converts them to TXT files.
WordNet: A library of synonym information provided by Princeton University.
Word (noted by a capital W): An instance of the Word class.
Phrase (noted by a capital P): An instance of the Phrase class or one of its descendants.
Phrase: A group of words standing together as a conceptual unit, typically forming a new component.
Analysis: Detailed examination of the elements or structure of something, typically as a basis for interpretation.
Linguistic Analysis: The technical analysis of language.
Tag: A label, or the act of attaching a label, that specifies the role (such as part of speech or location) of a selected element in a document.
Document: A document herein refers to a formally written, expository paper which expounds, via a declarative approach, on a relatively quantifiable issue, goal, or area of research.
Word Weight: A numeric value, associated with each syntactically and lexically unique word in a written work, which indicates the relative significance of that word.
Tornado Chart: A visualization resembling a horizontal bar graph, representing the relative frequency or significance of elements, sorted in descending order by magnitude.
Head Word: The locally distinct word within a phrase which, by its syntactic associations, determines the syntactic role of the phrase itself.
C#: A programming language which provides safe, flexible, and performant abstractions across multiple programming paradigms. LASI will initially target Microsoft's .NET Framework. However, due to the availability of reliable framework implementations for non-Windows platforms, and the conservative selection of core language features used in its implementation, LASI will ultimately target any number of operating platforms including Windows 7 and 8, Mac OS X, and a multitude of Linux- and Unix-based platforms including Red Hat and BSD.
SharpNLP: A natural language processing tool, written in C#, used to parse and tag parts of speech.
Part of Speech Tagging: The process of binding a part of speech to a word.
Tagged Word Object: A word that has an associated part of speech.
Tagged Set: A group of words whose parts of speech have been identified by a parser.
Lexer: A piece of our parsing tool that isolates each word, its part of speech, and its location in a sentence into machine-readable tokens.
Syntactic Analysis: A form of linguistic analysis that focuses on grammar in sentences and identifies themes based on sentence structure and formatting.
Semantic Analysis: A form of linguistic analysis that identifies key words based on their location in the sentence rather than their overall meaning throughout the document.
Subject Identification: Identifies the main actor in a sentence. However, in a broader sense, the word subject is synonymous with the theme of a document. Subject identification is the process of determining the subjects, or themes, of a document or documents.
Part of Speech Tagger: Software that parses text and assigns labels to words identifying their use.
Semantic Analysis: Relating syntactic structures to the independent meaning of words.
A.I.D. Process: Assessment Improvement Design Process developed and utilized by Dr. Patrick T. Hester and Dr. Tom Meyers to determine problems and solutions.
Strategic Document: A document that defines goals, visions, and mission statements.
Sufficiently Contiguous: The requirement that there must be at least a single paragraph from a single written work for results beyond to be extracted. A corollary or caveat of the above is that, if a source file is merely stored contiguously but actually contains sentences which, while grammatically correct, are strung together arbitrarily, the same condition applies. In both cases, LASI's creators, Red Team, do not provide any guarantee of rational output given nonsensical input. Conversely, Red Team unfortunately cannot guarantee that LASI will recognize and reject all nonsense. That is, invoking LASI on nonsense is not necessarily closed over a semantic nonsense subspace because, when comparing multiple or very long nonsensical files, patterns may emerge and be considered.