Object-oriented Models for Text and Document Processing
Dr. Tom Horton
Dept. of Computer Science and Engineering
Florida Atlantic University
Boca Raton, FL 33431
Email: tom@cse.fau.edu
© T. Horton

My Areas of Interest
• CS and Software Engineering Education
• Software Engineering and Development
• Humanities Computing

Software Engineering and Development
• My interests in these topics are primarily:
  – How to teach such things better
  – How to apply them to certain problem areas
• Object-oriented analysis and design
  – Stuff taught in CS494 (patterns, architecture)
  – SW architecture (particularly in problem domains)
  – GUI development, testing
  – HCI and usability (CS305)
    • (including evaluation experiments)
• Pen and ink in tablet PCs

CS and SW Engin. Education
• Important: Needs to be an education-research-oriented project, which means:
  – create something new
  – try it out experimentally
  – evaluate results
• Doing this with real subjects can:
  – require lots of advance planning
  – be time-consuming
  – be messy

CS and SW Engin. Education
• Possible topic areas:
  – How do beginning students learn how to code, or use tools?
  – How do more experienced students learn how to design, debug, problem-solve?
  – Usability of tools (like IDEs, debuggers, design tools)
  – Creating tools or environments to support education
    • collecting feedback on student learning, etc.

Humanities Computing: An “Old” User Community
• 1964 Literary Data Processing Conference
  – Papers on corpus preparation, stylistics, dictionaries
  – Most common software tool: concordance program
• Joseph Raben’s research problem:
  – Given two texts, find pairs of sentences that contain verbal echoes
  – Shelley’s Prometheus Unbound heavily influenced by the language of Milton’s Paradise Lost
  – Raben’s papers emphasize the algorithm

Humanities Computing
• Applying software, algorithms, etc.
to problems of interest to humanities scholars
  – In particular, literary text analysis
  – Tools for finding things, features, “chunks”, or showing relationships between texts
• Data mining and text mining
• Information visualization
• XML processing, databases of features from documents structured in XML
• Working with real users, real problems here at UVa

Overview
• Background: text processing, humanities computing
• Domain analysis and engineering
• Applying domain analysis (DA) to text processing
• A common architecture
  – Regions
  – Object model
• Some benefits
  – Relation to XML solutions
  – Query visualization
• Conclusions

Background and Assumptions
• More and better software tools are possible
• Modern markup makes tool development harder
  – So many new possibilities for using markup! How?
  – Processing SGML has not been simple
• New developments will help:
  – XML, XSL, Document Object Model (DOM)
  – Java, Unicode, SGML/XML tool support
• Perhaps it’s time for tools to catch up with markup

My Interests
• Improving our ability to develop new text processing and analysis software tools for humanities users
  – Based on my viewpoint as a software engineer
  – Includes study of user requirements, designs, frameworks, reusable components
• We could develop a family of software tools that:
  – satisfy common core requirements in the same way
  – share common core concepts and approaches
  – are based on a general model of text processing
  – use a flexible software architecture

Background: Text Processing and Humanities Computing
• Users: scholars studying texts, linguists, etc. for the purpose of
  – preparing editions,
  – carrying out stylistic or authorship analyses,
  – finding relationships between multiple texts,
  – finding parts of a text (perhaps in a large corpus) with certain traits.
• Characteristics:
  – Texts often not modern English
  – Mark-up such as SGML or XML very important
Software Tools Needed
• Few software tools have been developed
  – The user community recognizes this as a major problem
  – Example: No program exists to solve Raben’s problem
• Reasons:
  – Community is dispersed and is not resource rich
  – Tools must support multilingual definitions of alphabets, collation sequences, etc., and their output
  – Recently: SGML markup adds complexity
  – Users need good user interfaces

Existing Software Tools
• Concordance programs
  – Define text characteristics
  – Find and group words, in a variety of formats
• Text retrieval systems
  – Centered on a data repository
  – Possibly client-server (e.g. Sara)
  – Interactive (not batch), which promotes text exploration
  – Work at a finer level of text granularity than traditional information retrieval systems

Existing Software Tools (2)
• Collation programs
  – Process and group two or more related texts

Domain Engineering and Text Processing
• Use a domain-specific approach to software development for the text analysis domain.
• Develop common definitions of requirements, objects, desired features, etc.
• We should work to develop requirements, architectures and components that are
  – reusable
  – not limited to one environment or architectural approach
Domain Engineering
• Domain engineering is an approach to support systematic reuse when developing multiple systems in a product family
• Domain: a problem space for a family of applications with similar requirements
• Domain boundary: separates one domain from another (possibly related) domain
• Subdomain: a subset of a domain that describes related components or assets

Software Reuse
• Three levels of reuse
  – ad hoc, opportunistic, and systematic
• Systematic reuse:
  – well-planned; cost-effective; part of an organization’s process
  – requires infrastructure, asset management, culture change, standards, policies
• Question: Can we hope for systematic reuse in the text processing domain? Probably not:
  – No “central” development organization

Domain Engineering Model
[Figure: Domain Analysis, Domain Design and Domain Implementation, together with a Domain Repository]

Domain Analysis
• A “metalevel” version of conventional requirements analysis
  – Define a domain engineering lifecycle similar to software engineering, but with the goal of developing reusable requirements, designs, components, etc.
  – Output of domain analysis is used in the next stage to “design” reusable architectures and components that can be used in many systems.
• Similar in concept to knowledge engineering in expert systems development

Domain Analysis Activities
• Find and agree upon:
  – domain definition and boundary
  – common objects, functions, features
  – user characteristics
  – scenarios to determine what the common interactions are
• Outputs: Domain model
  – common vocabulary
  – generic requirements that apply to more than one system
  – combinations and trade-offs of features
• Form: text, UML diagrams, etc.

Approaches to Domain Analysis
• Two views of domain analysis:
  – Study the domain top-down, as an abstract problem.
  – Study existing products and systems.
• For the second approach:
  – Analyze existing software applications
  – What do they have in common?
  – Why are there differences? “Accidental” or necessary?

How to Use the Domain Model
• As a shared language for describing existing systems (or new systems)
• To define product requirements for new systems to be developed
  – derived requirements
• To design a reusable domain architecture and smaller reusable components

Applying All This to Text Processing
• Broad categories of common requirements:
  – Definition and description of words and text structure
  – Initial text processing, e.g. recognizing tokens, mark-up
  – Defining and selecting context or regions within a text
  – Retrieval or search
  – Document organization
  – Quantitative results, e.g. counts, statistics
  – User-interface issues (displaying context; displaying tagged text; export of information)
• The following slides give examples of reusable components and systems they might produce.

How is Mark-up Used?
• Highest level of description (abstraction):
  – (A) To select one or more subdocuments, or restrict an action to part of a document.
  – (B) To identify a location in a text for output.
    • Reference identifiers in KWIC, etc.
  – (C) To navigate within a text, or reference next entity, parent, etc.
    • Get next sentence, or get chapter number for current sentence.

How is Mark-up Used? (cont’d)
  – (D) To contain a word-level alternative for some part of the text for processing.
    • E.g., in TEI: Bob likes <corr sic="it's"> its </corr> name.
  – (E) To qualify a word to distinguish it from words with the same orthographic representation.
    • E.g. <mentioned lang=fra>Savoir-faire</mentioned> is French for know-how. <name>Mark</name> put his mark there.

SW Requirements To Support These Needs
• Store and navigate an SGML/XML element tree (Needs A, C)
• Given a location, find the set of elements it is stored in (Needs B, C)
• Retrieve elements based on attribute values (Need A)
  – E.g.
all elements of any type with lang=fra

SW Requirements (cont’d)
• For word extraction from PCDATA:
  – Choose among alternatives for <corr>, <del>, etc.
  – Distinguish words with the same string values but with mark-up for name, foreign, etc.

Reusable Requirements
• (R1) Standards for word, character set, alphabet, etc.
• (R2) Definitions and requirements for selecting a region or sub-document in a text.
• (R3) Search, defined in terms of abstract concepts relating to words, regions, mark-up, etc.

Reusable Designs, Components, Interfaces, Etc.
• (D1) SGML or XML front-end components.
• (D2) Data structure models for R1, word and alphabet standards.
• (D3) Fuzzy word matching algorithm and implementation.
• (D4) An API for a text-base query language (XML XPath)
• (D5) A design and reusable code for a text database management system (Berkeley DB)
• (D6) An API for document structure manipulation (XML DOM).

Programs and Tools That Could Result
• Repository-centered systems
  – New TACT: an information-retrieval-like system
  – WordCluster: find parts of a document with concentrations of words from one or more categories or themes.
  – PageTurner: a program to manipulate and “conditionally” display a text according to its mark-up
• Pipeline (or pipe and filter) systems:
  – New concordance-like programs
  – TextComp: an implementation of Raben’s solution for finding passages in a pair of texts that echo each other
• Others might include: Collate, an SGML Tag Editor

A Region-based Architecture
• Used by sgrep, a command-line search (or grep) that:
  – models “things” in files as regions (AKA spans)
  – is similar to the TIPSTER Info. Retr. architecture (GATE)
• Why isn’t a markup-centered approach enough?
  – The DOM for XML provides a model and an API for software to manipulate structured documents.
  – Query languages are coming: XQL or ???
• We need both: A markup-centered approach does not completely address:
  – finer-grained features not marked up
  – non-hierarchical features

Definitions
• sgrep queries and results defined as:
  – region: a chunk of text defined by a starting and ending byte-position in a file
  – region set: a collection of regions
• Region sets can be combined in powerful ways
  – sgrep finds nested regions
  – given two region sets, finds the subset of one that is contained in the other
  – given two region sets, finds the subset of one that contains some member of the other
  – sets can be merged etc.

Example: Regions and Region Sets
[Figure: three views of the same text, each marking a region set]
(1) All occurrences of “honor”
(2) All occurrences of DIV1 elements
(3) All occurrences of “honor” in a DIV1 element

Region Examples
• Region sets:
  – all words in a text; all characters; all syllables
  – all occurrences of a given token
  – all DIV1 elements
  – all elements that have an attribute with a given value
• Queries or subtext selection. Examples: Let’s find:
  – Speeches by Hamlet in Acts 4 and 5,...
  – choosing only those marked up as verse,...
  – choosing only those with the word “honor”
• This example illustrates selection of a document or subdocument, a core user requirement

Benefits of Using Regions
• Provides a general model of things in texts
  – Markup such as SGML elements can be modeled as regions
  – sgrep’s model of regions uses a concept of nesting or inclusion, which is naturally useful
• A good model can be quite powerful
  – Spreadsheets: cells, rows, columns, ranges

A General Model for Text Processing
• Text Object (TO): A thing that a user wants to identify and manipulate in a text. Examples: markup elements, tokens, syllables, user-defined regions (e.g. an “analogy”)
• Text Object Occurrence (TOO): an occurrence of a TO, represented by a region. (Or a region set if noncontiguous.)
Examples:
  – the particular word-token found on line 3
  – the 2nd DIV1 markup element in the file
  – the next-to-last syllable of a given word
  – a region selected by the user using the mouse

A General Model (continued)
• Text Object Occurrence List (TOO-List): a Text Object and a set of occurrences (TOOs) associated with it. Examples:
  – All word tokens
  – All occurrences of “the”
  – All pronouns
  – All occurrences of “he”
  – All syllables
  – All markup elements
  – All DIV1 elements
  – All elements with LANG attribute value = “DE”

Manipulating TOO-Lists
• New lists can be found using sgrep-like operations
  – Occurrences of “the” in DIV1 elements
  – All DIV1 elements where element attribute ATTR="y"
  – Occurrences of syllable “-er” in pronouns
  – DIV1 elements that contain pronouns
  – Pronouns containing “er” spoken by Hamlet
  – Occurrences of “the” or words with “-er” in that chunk of text the user selected with the mouse and in DIV1 elements that are not in German
• Or any combination of these!

Where do TOO-Lists Come From?
• TOO-Lists can be found by software components:
  – An XML parser or sgrep could find all elements
  – A tokenizer could find all word-tokens
  – Markup tags could be used to identify all pronouns
  – A linguistic tool or an exhaustive dictionary could be used to create a list of syllable occurrences
  – A user could select a set of regions by XML query, mouse-selection, etc.
• Some TOO-Lists are stored in a tool’s database
• Some are calculated on the fly in response to queries, operations

Benefits: Supports Core Needs
• Does it support core user requirements?
  – Selection of documents and parts of documents
  – Identify occurrences of text features
  – Allow queries and other operations on text features within a document selection
  – Support common operations other than search:
    • Identify, name and store
    • Output
    • Visualization
    • Number and count occurrences.
    • Operations based on sequencing. (E.g.
find “next” word.)
    • Other processing or analysis. Linguistic, etc.

Common View of Markup, Words, Etc.
• Markup and words (or other objects) have a common representation.
  – Search for any Text Object (TO) is handled the same way by the software
  – Query language for the user may not be simple!
  – There’s no difference between:
    • Finding a markup unit containing a particular word. E.g. all occurrences of “love” in a DIV1
    • Finding a word containing a particular markup tag. E.g. all word-tokens that have CORR tags
• A common representation simplifies software development, even if the user perceives differences.

Multifile Documents and Corpora
• sgrep finds patterns in files
  – A region is a part of a file
• We need more:
  – SGML/XML allows more than one file to comprise a single document
  – Corpora and Digital Libraries
• Extend a region to include a document or file identifier:
  – A triple, not a pair: (docID, start, end)
  – Requires a relative ordering of docIDs for multifile documents

Non-contiguous Text Objects
• We may wish to define a Text Object (TO) that cannot be represented by one single span. Examples:
  – Occurrence of verb forms in German
  – Some document-defined element
  – User-defined objects
• A Text Object Occurrence becomes a region set instead of a region.
• Modify primitive region operations like includes, contained in, etc. to work on two region sets

Integrating Markup- and Region-Centered Approaches
• Clearly the hierarchical structure of SGML/XML documents is powerful and necessary
• The Document Object Model (DOM) for XML
  – A model for accessing XML structures by software
  – Defines APIs for accessing nodes, subtrees, etc.
  – XML parsers can build a DOM
  – Built-on-the-fly and temporary vs. persistent
• Query languages for XML documents
  – Retrieve a node, a subtree, or a set of nodes.
  – Not yet standard. Proposals, like XSL.
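The (docID, start, end) triple and the region-set form of a Text Object Occurrence described in the two preceding slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names Region, contained_in and select_contained are hypothetical (they are not sgrep's interface), docIDs are assumed to be integers that already reflect the order of the files, and "contained in" for two region sets is assumed to mean that every region of the inner set lies inside some region of the outer set. Other definitions are possible.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True, order=True)
class Region:
    """A span in one file of a possibly multifile document."""
    doc_id: int  # assumed: an integer giving the file's place in document order
    start: int   # offset of the first character, inclusive
    end: int     # offset just past the last character, exclusive

    def contains(self, other: "Region") -> bool:
        """True if `other` lies entirely inside this region, in the same file."""
        return (self.doc_id == other.doc_id
                and self.start <= other.start
                and other.end <= self.end)


# One occurrence of a Text Object is a set of regions, so non-contiguous
# objects are handled the same way as contiguous ones.
Occurrence = FrozenSet[Region]


def contained_in(inner: Occurrence, outer: Occurrence) -> bool:
    """Assumed semantics: every region of `inner` is inside some region of `outer`."""
    return all(any(o.contains(i) for o in outer) for i in inner)


def select_contained(candidates: Iterable[Occurrence],
                     containers: Iterable[Occurrence]) -> List[Occurrence]:
    """Keep the candidates that lie inside at least one container."""
    containers = list(containers)
    return [c for c in candidates
            if any(contained_in(c, k) for k in containers)]


text = "Ich rufe dich an."
sentence = frozenset({Region(0, 0, len(text))})

# A separable German verb occupies two spans of the sentence.
rufe = text.index("rufe")
an = text.index("an")
verb = frozenset({Region(0, rufe, rufe + 4), Region(0, an, an + 2)})

# The same offsets in a different file must not match.
elsewhere = frozenset({Region(1, rufe, rufe + 4)})

hits = select_contained([verb, elsewhere], [sentence])
ordered = sorted(verb | elsewhere)  # ordered by (doc_id, start, end)
print(hits == [verb])
```

Because a word, a markup element and a mouse selection would all be stored as Occurrence values, the same select_contained call could serve for "love" in a DIV1 and for any other pairing of Text Objects.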
Integrating the Two Approaches
• Region view can address aspects not well-addressed in a DOM:
  – Objects at a finer grain of detail than an XML node
  – Objects defined without markup
• DOM nodes correspond to regions
  – A DOM implementation could include region information
  – Or, a separate index could be created to map DOM node-IDs to regions
• Thus, a software tool could have two integrated databases reflecting both views
  – Allowing XSL-like queries by the user
  – Allowing sgrep-like manipulations too

Overlapping Hierarchies
• Region-based view not limited by hierarchical structures
• Example: consider use of milestones to indicate verses in a play where speeches are marked up using tags.
  – A verse may span several speeches
  – A verse may not start or end at a speech boundary
• Perhaps we wish to do some operation on word tokens by speech and also by verse (e.g. word-list of all words, find first word, etc.)
• Regions used to represent words, and also
  – SPEECH elements in SGML/XML
  – Text between verse-begin and verse-end milestone tags

Example: Overlapping Hierarchies
[Figure: the same text shown twice, once with words grouped into lines and once with words grouped into verses]

Numbering or Sequencing Elements
• Relative position of Text Objects is needed:
  – Find all words after “of”.
  – Find occurrences of “in the”.
  – Select first and second speeches of the first two scenes of each act in a play.
  – Select successive 500-word blocks.
  – Find the first noun (per markup) in each sentence.
  – Find the first verb (per markup) after each noun (per markup).
  – Find the next syllable after “er” (if any) in words.

Problem: How to Number Things?
• How to handle markup boundaries?
  – Is the first word of the following sentence the “next” word after the last word in a sentence?
  – What if there’s a chapter break?
• Intervening markup elements. E.g. a footnote.
• In critical editions, how to handle tags for SIC/CORR, REG/ORIG, GAP/ADD/DEL etc.?
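The overlapping-hierarchies case above (words grouped by speech and, independently, by verse) can be sketched with plain regions. The Python below is a minimal illustration, not an implementation from the talk: the text, the speech and verse boundaries, and the helper names group_by_unit and first_words are all invented for the example, and offsets are character positions with an exclusive end. The second verse deliberately starts in one speech and ends in the next.

```python
import re
from typing import Dict, List, Tuple

Region = Tuple[int, int]  # (start, end) character offsets, end exclusive

# Placeholder text and hand-made unit boundaries, invented for illustration.
text = "alpha beta gamma delta epsilon zeta"
speeches: List[Region] = [(0, 16), (17, 35)]          # e.g. from SPEECH elements
verses: List[Region] = [(0, 10), (11, 30), (31, 35)]  # e.g. between milestone tags

# Word tokens become regions too, found here with a naive tokenizer.
words: List[Region] = [m.span() for m in re.finditer(r"\w+", text)]


def group_by_unit(items: List[Region],
                  units: List[Region]) -> Dict[Region, List[Region]]:
    """Map each unit to the items that lie entirely inside it, in text order."""
    return {
        (us, ue): [(s, e) for (s, e) in items if us <= s and e <= ue]
        for (us, ue) in units
    }


def first_words(units: List[Region]) -> List[str]:
    """The first word of each unit that contains at least one word."""
    groups = group_by_unit(words, units)
    return [text[ws[0][0]:ws[0][1]] for ws in groups.values() if ws]


print(first_words(speeches))  # first word of each speech
print(first_words(verses))    # first word of each verse
```

Neither grouping needs to know about the other, which is the point of the region view: the speech hierarchy and the verse hierarchy never have to be reconciled into a single tree.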
Sequencing of TOs using Region Sets
• Goal: Create a sequence of Text Objects (TOs)
• Define a Sequence Context to be the set of “units” inside of which we wish to number.
  – Each “unit” might be a contiguous region, or it might be a region set
  – Thus the Sequence Context is a sequence of region sets
• To sequence occurrences of a given TO, process its region set against the Sequence Context
  – Ignore TO Occurrences not in the context
  – Optionally restart the counter for each region set in the context

Example: Sequencing TOs
[Figure: a text with its TOs numbered inside the units of Context 1, ignoring TOs that fall outside the context; and the same kind of numbering (1 2 3, 1 2 3) inside Context 2]

Visualization
• Previous slide shows a simple visualization of TOs inside of regions
• Potential for a general model of visualization of TOs (words, markup elements etc.) in relation to other TOs (regions, markup elements, etc.)
• Examples:
  – Show presence or absence of some TO in a selected region
  – Counts of how many occurrences in each region
• Given a region-based database of TO Occurrences, we have a dynamic and flexible way to visualize all TOs known to the tool

Tilebars
• Marti Hearst developed tilebars as a method for visualizing query results
• Dimensions:
  – documents,
  – terms (text objects),
  – segments (regions)
• Shading of each cell shows strength of occurrence of a term
• Adjacent shaded cells show co-occurrence of terms

Tilebars (example)
[Figure: tilebars example]

Conclusions
• Regions and region sets show promise as a general model for representing all types of Text Objects
  – Integrates markup and other document features such as word tokens
• Supports selection of documents and subdocuments
  – Apply operations (display, count, remember, etc.) to a selection
• Should integrate with markup-based document models (e.g. the XML DOM)
• Supports numbering/sequencing and visualization techniques.

What’s Next?
• A proof-of-concept application to show the usefulness of the approach
• Development of software components based on regions to support the “middle-level” of software architectures
• How best to store region-based data for use by software applications
• Develop a model for distributed text-bases based on Text Objects, TOO-Lists, etc.
  – Implement this for existing texts in an existing Digital Library

Educational Research in CS/SE
• We have a CS Education Group here at UVa
  – Teaching Faculty: Horton, Milner, Anderson, Prey
  – Also, others like Knight, Evans, Cohoon
• Education topics are suitable for a Senior Thesis!

Education Projects
• SW Engineering Education (Horton and Knight)
  – Share across the Net with other universities
    • CS201 camera client/server system
    • CS340 Robot systems?
  – Web Repository for educational artifacts
    • MySQL and PHP
  – Winding down for now, but future could include:
    • Case studies for use in CS202, CS340, etc.
    • Pair-programming (like XP) in CS101?
    • New robot-like project suitable for CS340?

Projects (cont’d)
• Next Generation Computer Labs (Knight, Horton, Milner, Anderson)
  – Focuses on CS101 and CS201
  – Do they have to be in lab? If so, what should they be doing as a group with TAs?
  – Can we use technology better?
    • Netmeeting, Instant Messaging, Web discussion boards
    • Collaborative learning: small groups, using a Web board etc.
  – Studio Labs (like for CS340) in CS201?
  – Evaluate effectiveness (surveys, interviews, etc.)

Projects (cont’d)
• Fourth-year course/project that does:
  – Larger scale system development
  – Distributed computing
    • Client/server
    • Database
  – Integration of custom code with existing components
    • Database, middleware, COTS components
  – Uses .NET, SOAP/XML, etc.
• What would this course or project look like?