Aluan Haddad - ODU Computer Science

1 Introduction

LASI (Linguistic Analysis for Subject Identification) is a natural language processing engine that combines raw lexical analysis heuristics with sophisticated, syntax-aware heuristics. Together these form a basis for extrapolating, interrelating, and abstracting statistically derived semantic content over an input domain containing multiple written English works. The process of linguistic analysis, defined herein as the procedural study of how the words and phrases within a written work compose to form emergent meanings, is the central concept behind the LASI project. The use of language is a constantly evolving, self-describing process which builds on the composition of syntactic rules to express complex, emergent ideas. These ideas in turn compose to form themes: the distillation of the relationships literally described by the document.

1.1 Purpose

LASI is a software package which aims to provide decision support and validation by pairing a set of high-performance, context-sensitive heuristics with a graphical user interface (GUI), in order to assist researchers in quickly gleaning meaningful content from written sources of information. The GUI also allows LASI to assist a broad range of individuals with widely varying areas of research and levels of technical proficiency. Furthermore, because of the broad societal significance of the problem it approaches, there are many potential applications which go beyond the domain of pure research. For example, its pattern recognition and synonym generalization features could assist professors in identifying non-verbatim plagiarism, such as content which has been wrapped in a thin veneer of basic paraphrasing. Its contextual awareness capabilities could likewise help students quickly find relevant sources to cite for written assignments.
In more in-depth contexts, advanced users, such as researchers like Dr. Patrick T. Hester, the progenitor of the LASI project, could use the algorithms’ inferential capabilities to provide their clients with more specialized, quantitatively verifiable assessments of complex systems. Broadly speaking, any individual needing to quickly become familiar with a specific area of a broad topic could employ LASI’s functionality to home in on increasingly relevant written resources. Essentially, linguistic analysis in this context aims, at least conceptually, to sufficiently quantify qualitative information so that trivial disagreements can be dispelled, potentially allowing for faster and more effective decision making. Such analysis tools can provide key services in their role as decision support tools. LASI is such a tool.

The notion of theme refers to emergent, overlapping, intra- and inter-textually derived mental constructs which represent one of the key bases for human communication. In a sense, themes provide an abstraction interface which allows for the expression of linguistic ideas. However, although humans would be unable to express complex ideas to one another without thematic abstraction, interpreting and expressing themes is often fraught with misunderstandings, conflation, and subject-arbitrary emotional associations. Any of these pitfalls can impede and stifle linguistic communication.
For example, consider an author who expresses a certain theme with eloquence and brevity, yet is criticized for stating something he did not in fact assert. The reference frame of the reader clashed with that of the author in a way neither could have predicted, resulting perhaps in a mutual perception of disagreement over a subject on which, had the author’s words been phrased differently, they might have wholeheartedly agreed. Thus, in spite of, or perhaps because of, the critical role which themes play in communication and expression, a multitude of potentially baseless or conflated concepts are communicated between authors and readers as well as between individual readers. Among readers, small differences in their respective interpretations of a work can compound into serious disagreement over what its author is trying to express. While this has many powerful and sometimes even positive implications, and while it forms one of the key underpinnings of meta-explorative disciplines such as epistemology and philosophy, discord arising from needless misunderstanding can have very harmful effects in areas where justifiable, imperative decisions must be made based solely on textual perusal. Consider, for example, government agencies or large corporations which must carefully determine how to allocate scarce resources, or make time-critical financial or military decisions. As these situations involve multiple individuals doing independent research and then pooling their knowledge, and since some degree of semantic disagreement is inevitable in a relatively democratic environment, serious problems such as needless delays, resource misappropriations, or outright inaction may result and cause severe damage.
1.2 Scope

The prototype version of LASI, while it retains and implements much of the real-world product’s proposed functionality, nevertheless omits some significant features. These cutbacks were made to cope with the time and manpower constraints imposed by the undergraduate academic schedule on the software development process. Pronoun binding algorithms, PDF input file parsing, and session suspension are among the key features which will not be part of the prototype package.

1.3 Definitions, Acronyms, and Abbreviations

A.I.D.: Assessment Improvement Design.
A.I.D. Process: A process that provides a quantitative and qualitative basis to identify problems and determine the feasibility of solutions.
Analysis: Detailed examination of the elements or structure of something, typically as a basis for interpretation.
Document: A formally written, expository paper which expounds, via a declarative approach, on a relatively quantifiable issue, goal, or area of research.
Head word: A locally distinct word within a phrase which, by its syntactic associations, determines the category of the phrase itself.
LASI: Linguistic Analysis for Subject Identification.
Linguistic Analysis: The scientific analysis of a language.
Parser: The component which takes in DOC and DOCX files and converts them to TXT files.
Part of Speech Tagger: A software utility that associates the words in a sentence with their parts of speech.
Phrase: An instance of the Phrase class.
Phrase (linguistically): A group of words standing together as a conceptual unit.
Phrase Class: The root of the class taxonomy whose members correspond to the syntactic roles of phrase-level elements. Instances of the types derived from the Phrase class contain a collection of Word instances which together represent a linguistic phrase.
Semantic Analysis: The process of relating the syntactic structure of words to their language-independent meanings.
SharpNLP: A natural language processing tool, written in C#, used to parse text and tag parts of speech.
Strategic Document: A document produced by a client that defines their goals, visions, and missions.
Subject Identification: The process by which the subject matter and thematic content of documents are determined.
Syntactic Analysis: The identification of key words based on their location in the sentence, rather than their overall meaning throughout the document.
.TAGGED: The type of file that stores the output of the part-of-speech tagger, containing all of the text of the document with embedded syntactic annotations.
Theme: The subject-object-verb relationships that LASI attempts to generate from the input set.
Tag: A label, or the act of attaching a label, that specifies the syntactic role of a selected element in a document.
Tagged Set: A group of words whose parts of speech and locations in a sentence have been identified by the parser.
WordNet: The lexical database, compiled and provided by Princeton University, whose data files form the basis for the LASI thesaurus.
Word Class: The root of the taxonomy of class types which correspond to parts of speech at the word level and whose instances encapsulate each occurrence of a textually identified word.
Word Weight: A numeric value, associated with each syntactically and lexically unique word in a written work, indicating its significance.

1.4 References

Haddad, A. (2013). Lab 1 - LASI product description. Unpublished manuscript, Computer Science, Old Dominion University, Norfolk, VA.

1.5 Overview

The following is a description of the core modules which comprise the LASI software package. It contains an overview of component interactions, conceptual algorithm descriptions, and abstract data type descriptions.

2 General Description

2.1 Prototype Architecture Description

Architecturally, the prototype version of LASI is broken down into three components or modules: the Algorithm, the File System, and the User Interface.
The interaction between these components is illustrated by Figure 1. As shown, the modules interact via a constrained set of public interface functions.

Figure 1 Major Functional Component Modules

2.2 Prototype Functional Description

LASI will feature a number of different assessment techniques which attempt to extrapolate and construct, from a linear sequence of words, a reflexive web of syntactic and semantic associations which is revised and refined recursively as LASI continues to infer potential relationships. Before any higher-level analysis is attempted, a set of syntactic parsing libraries examines the text and identifies, statistically and locally, the likely part of speech of each of the text’s lexical constructs. The result of this phase is a collection of words and phrases which have been categorized by usage and mapped to program constructs which encapsulate their syntactic roles. After this initial step, which yields a dynamic data model driven by word and phrase behavior, a large number of independent statistical functions and element association techniques are applied. Their results are compared, procedures are potentially reordered and reevaluated, and the findings are finally interrelated over multiple sources and representations in an attempt to find the common thematic ideas and shared concepts of the input domain. A key technique that allows this to be accomplished is the assignment of a variety of numerical weights, both to individual word and phrase elements and to sets of potentially associated constructs; these weights are iteratively modified and scaled by each subsequent metric applied.

2.1.1 Source Document Formats

In terms of common capabilities and user accessibility features, the LASI software package will accept English textual works in multiple popular file formats.
Currently, all Microsoft Word document types as well as raw ASCII text files are fully supported. LASI will also provide native support for adding Adobe Acrobat documents directly to user projects at some point. Implementing this functionality has been given a relatively low priority by the team, as it requires that an optical character recognition system be implemented and integrated such that all potentially erroneous characters parsed from an Acrobat document containing scanned text are identified and dealt with before the text is passed to the tagging module.

In addition to parsing the data provided by the user, LASI allows users to provide custom dictionary-like inputs containing weight adjustments, static associations, explicit synonym collections, and syntactic-role overrides for lexical entities in order to facilitate more focused, user-intent-driven results. While this has the advantage of increasing user control over the process and allowing for more customizable selection and arrangement of results, it is inseparably tied to a loss of demonstrable validity, detracting from any assertions developers can make about accuracy and likelihood of bias when shipping an iteration of LASI which provides such a feature. The most agreeable middle ground is probably an approach which allows users to make some adjustments through a properly abstracted interface while providing clear, unmistakable warnings regarding the decreasing verifiability of results.

The user interface provides standard, responsive navigation functions that explicitly provide for all of the possible branches illustrated by Figure 2.

Figure 2 User Interaction Flow through the GUI

The User Interface thus provides, for each category of information which the LASI engine can infer from a document, a human-readable view which highlights the relevant information and provides contextual navigation to other perspectives.
Beyond dynamic result renderings, however, the LASI UI will facilitate exporting static representations of all results to common presentational, tabular, and serialization-oriented file formats such as Adobe Acrobat and Microsoft Excel, as well as simple non-proprietary formats such as CSV (Comma-Separated Values), XML (Extensible Markup Language), and JSON (JavaScript Object Notation). This allows results to be flexibly retained, viewed, and shared independently of the LASI environment itself.

2.1.1 Host Operating System and Software Platform Description

Due to both the selection of C# and Dr. Hester’s use of Windows enterprise software, LASI will initially target Microsoft’s .NET Framework. However, due to the availability of reliable C# framework implementations for non-Windows platforms, the slow but steady transition Microsoft is making towards supporting open source programs, and the conservative selection of core language features used in its implementation, LASI will ultimately be accessible to users of a wide array of software platforms including Windows 7 and 8, various iterations of Mac OS X, and a multitude of Unix-like platforms including Red Hat Linux and BSD.

The requirements for the host operating system are fairly standard, consisting of an up-to-date, 64-bit build of Microsoft Windows 7 Home Premium or above. The software framework requirements are equally standard, consisting of an up-to-date installation of the Microsoft .NET Framework v4.5 or above. Support for non-Microsoft platforms, such as a Red Hat build paired with the Mono Framework, is a planned feature. Support for DotGNU on UNIX platforms is also a future possibility.

2.1.2 Hardware Platform Description

The physical hardware requirements being targeted, irrespective of the operating system hosting LASI, are those of a fast but affordable desktop or notebook computer.
While some requirements are more flexible than others, the absolute minimum system specifications are a processor with at least four logical cores (via a dual-core Intel Core series processor with Hyper-Threading enabled, or a quad-core AMD processor) clocked at or above 2.0 GHz, and at least eight gigabytes of DDR3 (Double Data Rate type 3) system memory clocked at or above 1,066 MHz. For an optimal experience, or for an open source developer experimenting with the code post-release, the recommended hardware consists of a processor with at least eight logical cores (via a quad-core Intel Core series processor with Hyper-Threading enabled, or an eight-core AMD processor); eight gigabytes of low-latency DDR3 memory clocked at or above 1,333 MHz with access timings not greater than 9-9-9; and a solid-state storage device for document retrieval with at least 128 megabytes of onboard DDR3 cache and a rated random read speed of at least 40 megabytes per second for arbitrary 512-kilobyte data blocks.

2.2.1 External Interfaces and Third Party Components

The LASI project library contains source and executable code files from two preexisting open source C# projects. First, LASI incorporates executable code files from b2xtranslator, an open source binary to XML file format converter. Specifically, LASI contains two of its child programs: the precompiled executable doc2x, which converts legacy (1997) Microsoft Word DOC files to DOCX Open XML files, and the precompiled executable ppt2x, which converts legacy (1997) Microsoft PowerPoint PPT files to PPTX Open XML files. Both are included and used under the FreeBSD open source license. Secondly, and far more significantly, LASI contains the part-of-speech tagging library SharpNLP, an open source C# port of OpenNLP, which is included and used under the GNU Lesser General Public License (LGPL).
The methods provided therein give critical support to the LASI project, as they are used to convert ASCII text files containing whitespace-delimited word tokens into TAGGED files, wherein these tokens are re-serialized to incorporate the original lexical string annotated with embedded syntactic role information. The reasons for returning a constructor for an object, instead of the object itself, are twofold in this case. First, the pattern of returning a constructor provides a beneficial abstraction between the Word and Phrase types used by the algorithm, requiring only that instantiated objects derive from the abstract class Word. Secondly, it allows for deferred execution of object instantiation, which can be used with other patterns, such as monadic function composition, to provide unique and useful behavior not efficiently achieved otherwise.

An additional third-party asset used by LASI, though not strictly software, is Princeton University’s free, manually compiled set of synonym database files. These files are mapped at runtime to thesaurus constructs which provide various types of synonym lookup. These thesauri make it possible for LASI to generalize many patterns that would otherwise rely on random guessing techniques, thereby potentially improving the quality of results and allowing for significant performance increases.

2.2.4 Fundamental Data Abstractions and Document Representation

The core analysis functionalities LASI implements are built around compositions and permutations of Enumerable collections of redundantly linked data structures which directly represent words and phrases as instances of corresponding class types. Figure 3 provides a detailed view of the static composition of the linear and compositional relationships between the objects which describe a document at runtime.
Of particular importance are the multidirectional many-to-one and one-to-many aggregation relationships, as well as the deliberate multi-parent and multi-child redundancy relationships, which allow independent iteration over the contents to begin at any construct. This allows for useful data abstractions such as functions which can return free words or phrases without the need to store, maintain, and return their indirect lexical contexts.

Figure 3 Reflexive Links between Lexical Elements

2.2.4.1 Word Level Syntactic Class Types

The class taxonomy which defines lexical elements at the word level consists of classes which represent the text of individual words together with strongly typed syntactic behaviors corresponding to their parts of speech. Instances of word types wrap lexically distinct words and encapsulate their behavioral capabilities. Figures 4 and 5 illustrate the sets of word classes, including those which represent nouns and verbs.

Figure 4 Class Hierarchy of Verb Types

Figure 5 Additional Word Types

2.2.4.2 Phrase Level Syntactic Class Types

The class taxonomy which defines lexical elements at the phrase level consists of classes which represent the aggregate of one or more words together with a parallel, but more generalized, concept of syntactic specialization. Many of the core algorithms within the LASI prototype operate primarily on instances of these types. Figure 6 illustrates the set of phrase classes and their inheritance relationships.

Figure 6 Class Hierarchy of Phrase Types

2.2.4.3 Generalizing Syntactic Interface Types

To represent the fluidity of relationships between constructs within the English language, it becomes necessary to associate objects which have no suitable direct inheritance relationships.
For example, the object of a transitive verb is a role compatible with both nouns and noun phrases. It is conceptually, and here programmatically, incorrect for noun and noun phrase to share an inheritance relationship: phrases are compositions of words and not words themselves, so deriving noun phrase from noun would introduce a literal and intellectual circular dependency and lead to inexpressive, awkwardly written functions. To provide the desired syntactic role generalization between words and phrases which have parallel behaviors but not parallel compositions or derivations, a number of interface types are defined. These allow for elegant coding patterns and a level of abstraction which more closely matches one’s mental concept of their parallel relationships.

Figure 7 Hierarchy of Interface Types

2.2 Prototype Functional Description

2.2.1 Binding Algorithms

The two primary binding algorithms within the LASI prototype, a Subject Binder and an Object Binder, operate primarily at the phrasal level. Both operate at similar levels of abstraction and primarily serve to associate nouns, noun phrases, and other explicitly mentioned entities with the verb phrases which specify their relationships and behaviors. Together, they comprise the core logic of the analysis process by attempting to determine, broadly speaking, who does what to whom.

The core logic of the subject binder is most transparently modeled and understood through the lens of finite state automaton logic. The process illustrated by the state diagram in Figure 8 identifies and associates the subjects of each verb phrase. The subject information embedded in the links established by the subject binder comprises the most significant associations made during analysis.
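
The idea of binding subjects with a finite state automaton can be sketched in a few lines. The state diagram itself is not reproduced in this text, so everything below is an assumption made for illustration: the two states, the "NP"/"VP" phrase tags, the Phrase record, and the rule that a verb phrase takes the nearest preceding noun phrase as its subject. The sketch is written in Python for brevity and is not LASI's C# implementation.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class State(Enum):
    SEEKING_SUBJECT = auto()  # no candidate noun phrase is waiting
    HAVE_CANDIDATE = auto()   # a noun phrase is waiting for a verb phrase


@dataclass
class Phrase:
    kind: str                 # hypothetical tag: "NP", "VP", or anything else
    text: str
    subjects: list = field(default_factory=list)  # filled in for verb phrases


def bind_subjects(phrases):
    """Link each verb phrase to the nearest preceding, still unbound noun phrase."""
    state = State.SEEKING_SUBJECT
    candidate = None
    for phrase in phrases:
        if phrase.kind == "NP":
            # A newer noun phrase replaces any older candidate.
            candidate = phrase
            state = State.HAVE_CANDIDATE
        elif phrase.kind == "VP" and state is State.HAVE_CANDIDATE:
            phrase.subjects.append(candidate)
            candidate = None
            state = State.SEEKING_SUBJECT
        # Any other phrase kind leaves the state unchanged.
    return phrases


sentence = [
    Phrase("NP", "The committee"),
    Phrase("ADVP", "quickly"),
    Phrase("VP", "approved"),
    Phrase("NP", "the budget"),
]
bind_subjects(sentence)
print([s.text for s in sentence[2].subjects])  # ['The committee']
```

A real binder would need more states than this, for instance to handle passive constructions, conjunctions, and clause boundaries; the sketch only shows how phrase-level tags can drive the transitions.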
Figure 8 Subject Binder Logic State Diagram

Similarly, the core logic of the object binder is most transparently modeled and understood through the lens of finite state automaton logic. The process illustrated by the state diagram in Figure 9 identifies and associates the objects of each verb phrase. The object binder additionally attempts to distinguish between direct objects, indirect objects, and prepositional objects.

Figure 9 Object Binder Logic State Diagram
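
As a rough illustration of how an automaton might separate direct, indirect, and prepositional objects, consider the following Python sketch. It is not LASI's C# implementation and does not reproduce the state diagram, which is not available in this text. The states, the "NP"/"VP"/"PREP" tags, the VerbPhrase record, and the classification rules (a noun phrase after a preposition is a prepositional object; of two bare noun phrases after a verb, the first is indirect and the second direct) are all simplifying assumptions.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class State(Enum):
    SEEKING_VERB = auto()       # no verb phrase seen yet
    AFTER_VERB = auto()         # collecting bare noun phrases
    AFTER_PREPOSITION = auto()  # the next noun phrase is a prepositional object


@dataclass
class VerbPhrase:
    text: str
    direct_objects: list = field(default_factory=list)
    indirect_objects: list = field(default_factory=list)
    prepositional_objects: list = field(default_factory=list)


def _classify_bare(verb, nouns):
    # Assumed rule: "sent [the client] [a report]" -> indirect, then direct.
    if len(nouns) >= 2:
        verb.indirect_objects.append(nouns[0])
        verb.direct_objects.extend(nouns[1:])
    else:
        verb.direct_objects.extend(nouns)


def bind_objects(tagged):
    """tagged: (kind, text) pairs for one clause; returns the verb phrases found."""
    verbs, current, bare = [], None, []
    state = State.SEEKING_VERB
    for kind, text in tagged:
        if kind == "VP":
            if current is not None:
                _classify_bare(current, bare)  # close out the previous verb
            current, bare = VerbPhrase(text), []
            verbs.append(current)
            state = State.AFTER_VERB
        elif kind == "PREP" and current is not None:
            state = State.AFTER_PREPOSITION
        elif kind == "NP" and state is State.AFTER_VERB:
            bare.append(text)
        elif kind == "NP" and state is State.AFTER_PREPOSITION:
            current.prepositional_objects.append(text)
            state = State.AFTER_VERB
        # Noun phrases seen before any verb are left to the subject binder.
    if current is not None:
        _classify_bare(current, bare)
    return verbs


clause = [("NP", "The analyst"), ("VP", "sent"), ("NP", "the client"),
          ("NP", "a report"), ("PREP", "on"), ("NP", "Friday")]
verb = bind_objects(clause)[0]
print(verb.indirect_objects, verb.direct_objects, verb.prepositional_objects)
```

Because this sketch has no notion of clause boundaries, it should be fed one clause at a time; otherwise the subject of a following clause would be mistaken for an object of the preceding verb.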
