Object-oriented Models for Text and Document Processing

Dr. Tom Horton
Dept. of Computer Science and Engineering
Florida Atlantic University
Boca Raton, FL 33431
Email: tom@cse.fau.edu
© T. Horton
My Areas of Interest
• CS and Software Engineering
Education
• Software Engineering and Development
• Humanities Computing
Software Engineering and
Development
• My interests in these topics are primarily:
– How to teach such things better
– How to apply them to certain problem areas
• Object-oriented analysis and design
– Stuff taught in CS494 (patterns, architecture)
– SW architecture (particularly in problem domains)
– GUI development, testing
– HCI and usability (CS305)
• (including evaluation experiments)
• Pen and ink in tablet PCs
CS and SW Engin. Education
• Important: Needs to be an education research
oriented project, which means:
– create something new
– try it out experimentally
– evaluate results
• Doing this with real subjects can:
– require lots of advance planning
– be time-consuming
– be messy
CS and SW Engin. Education
• Possible topic areas:
– How do beginning students learn how to code, or
use tools?
– How do more experienced students learn how to
design, debug, problem-solve?
– Usability of tools (like IDEs, debuggers, design
tools)
– Creating tools or environments to support
education
• collecting feedback on student learning, etc.
Humanities Computing:
An “Old” User Community
• 1964 Literary Data Processing Conference
– Papers on corpus preparation, stylistics,
dictionaries
– Most common software tool: concordance
program
• Joseph Raben’s research problem:
– Given two texts, find pairs of sentences that
contain verbal echoes
– Shelley’s Prometheus Unbound heavily influenced
by the language of Milton’s Paradise Lost
– Raben’s papers emphasize algorithm
Humanities Computing
• Applying software, algorithms, etc. to problems of
interest to humanities scholars
– In particular, literary text analysis
– Tools for finding things, features, “chunks”, or
showing relationships between texts
• Data mining and text mining
• Information visualization
• XML processing, databases of features from
documents structured in XML
• Working with real users, real problems here at UVa
Overview
• Background: text processing, humanities computing
• Domain analysis and engineering
• Applying DA to text processing
• A common architecture
– Regions
– Object model
• Some benefits
– Relation to XML solutions
– Query visualization
• Conclusions
Background and Assumptions
• More and better software tools are possible
• Modern markup makes tool development harder
– So many new possibilities for using markup! How?
– Processing SGML has not been simple
• New developments will help:
– XML, XSL, Document Object Model (DOM)
– Java, Unicode, SGML/XML tool support
• Perhaps it’s time for tools to catch up with markup
My Interests
• Improving our ability to develop new text processing
and analysis software tools for humanities users
– Based on my viewpoint as a software engineer
– Includes study of user requirements, designs,
frameworks, reusable components
• We could develop a family of software tools that:
– satisfy common core requirements in the same
way
– share common core concepts and approaches
– are based on a general model of text processing
– use a flexible software architecture
Background: Text Processing and
Humanities Computing
• Users: scholars studying texts, linguists, etc., whose
purposes include
– preparing editions
– carrying out stylistic or authorship analyses
– finding relationships between multiple texts
– finding parts of a text (perhaps in a large corpus)
with certain traits
• Characteristics:
– Texts often not modern English
– Mark-up such as SGML or XML very important
Software Tools Needed
• Few software tools have been developed
– The user community recognizes this as a major
problem
– Example: No program exists to solve Raben’s
problem
• Reasons:
– Community is dispersed and is not resource rich
– Supporting multilingual definitions of alphabets,
collation sequences, etc., and their output is difficult
– Recently: SGML markup adds complexity
– Users need good user interfaces
Existing Software Tools
• Concordance programs
– Define text characteristics
– Find and group words, in a variety of formats
• Text retrieval systems
– Centered on data repository
– Possibly client-server (e.g. Sara)
– Interactive (not batch) so promotes text
exploration
– Work at finer level of text granularity than
traditional information retrieval systems
Existing Software Tools (2)
• Collation programs
– Process and group two or more related texts
Domain Engineering and Text
Processing
• Use domain-specific approach for software
development for the text analysis domain.
• Develop common definitions of requirements,
objects, desired features, etc.
• We should work to develop requirements,
architectures and components that are
– reusable
– not limited to one environment or architectural
approach
Domain Engineering
• Domain engineering is an approach to support
systematic reuse when developing multiple systems
in a product family
• Domain: a problem space for a family of applications
with similar requirements
• Domain boundary: separates one domain from
another (possibly related) domain
• Subdomain: a subset of a domain that describes
related components or assets
Software Reuse
• Three levels of reuse
– ad hoc, opportunistic, and systematic
• Systematic reuse:
– well-planned; cost-effective; part of an
organization’s process
– requires infrastructure, asset management,
culture-change, standards, policies
• Question: Can we hope for systematic reuse in the
text processing domain? Probably not:
– No “central” development organization
Domain Engineering Model
[Diagram: Domain Analysis, Domain Design, and Domain Implementation, together with a Domain Repository]
Domain Analysis
• A “metalevel” version of conventional requirements
analysis
– Define a domain engineering lifecycle similar to
software engineering, but with the goal of
developing reusable requirements, designs,
components, etc.
– Output of domain analysis is used in the next
stage to “design” reusable architectures and
components that can be used in many systems.
• Similar in concept to knowledge engineering in expert
systems development
Domain Analysis Activities
• Find and agree upon:
– domain definition and boundary
– common objects, functions, features
– user characteristics
– scenarios to determine what are common interactions
• Outputs: Domain model
– common vocabulary
– generic requirements that apply to more than one
system
– combinations and trade-offs of features
• Form: text, UML diagrams, etc.
Approaches to Domain Analysis
• Two views of domain analysis:
– Study the domain top-down, as an abstract
problem.
– Study existing products and systems.
• For second approach:
– Analyze existing software applications
– What do they have in common?
– Why are there differences? “Accidental” or
necessary?
How to Use the Domain Model
• As a shared language for describing existing systems
(or new systems)
• To define product requirements for new systems to
be developed
– derived requirements
• To design a reusable domain architecture and smaller
reusable components
Applying All This to Text Processing
• Broad categories of common requirements:
– Definition and description of words and text structure.
– Initial text processing, e.g. recognizing tokens, mark-up
– Defining and selecting context or regions within a text
– Retrieval or search.
– Document organization
– Quantitative results, e.g. counts, statistics.
– User-interface issues (displaying context; displaying
tagged text; export of information)
• The following slides give examples of reusable
components and systems they might produce.
How is Mark-up Used?
• Highest level of description (abstraction):
– (A) To select a subdocument(s), or restrict an
action to part of a document.
– (B) To identify a location in a text for output.
• Reference identifiers in KWIC, etc.
– (C) To navigate within a text, or reference next
entity, parent, etc.
• Get next sentence, or get chapter number for current
sentence.
How is Mark-up Used? (cont’d)
– (D) To contain a word-level alternative for some
part of the text for processing.
• E.g., in TEI: Bob likes <corr sic="it's">its</corr> name.
– (E) To qualify a word to distinguish it from words
with same orthographic representation.
• E.g. <mentioned lang=fra>Savoir-faire</mentioned> is French for know-how.
• E.g. <name>Mark</name> put his mark there.
SW Requirements To Support These
Needs
• Store and navigate an SGML/XML element tree
(Needs A, C)
• Given a location, find the set of elements it is contained in
(Needs B, C)
• Retrieve elements based on attribute values (Need A)
– E.g. all elements of any type with lang=fra
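A minimal sketch of the attribute-based retrieval requirement, assuming Python's standard xml.etree.ElementTree module. The sample document and the helper name elements_with_attribute are made up for illustration; they are not part of any tool described here.

```python
import xml.etree.ElementTree as ET

# Made-up sample document, for illustration only.
SAMPLE = (
    "<text>"
    "<p>She has real <mentioned lang='fra'>savoir-faire</mentioned>.</p>"
    "<p>He said <q lang='fra'>bonjour</q> and <q lang='deu'>danke</q>.</p>"
    "</text>"
)


def elements_with_attribute(root, name, value):
    """Return every element, of any type, whose attribute `name` equals `value`.

    Elements are returned in document order (hypothetical helper).
    """
    return [el for el in root.iter() if el.get(name) == value]


root = ET.fromstring(SAMPLE)
french = elements_with_attribute(root, "lang", "fra")
print([(el.tag, el.text) for el in french])
```

The search ignores element type on purpose, matching the "all elements of any type" wording of the requirement.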
SW Requirements (cont’d)
• For word extraction from PCdata:
– Choose among alternatives for <corr>, <del>, etc.
– Distinguish words with same string values but with
mark-up for name, foreign, etc.
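One way these two word-extraction requirements might look in code, as a sketch only. It assumes Python's xml.etree.ElementTree, reuses the <corr sic="..."> and <name> examples from the earlier slides, and pairs each word with the tag it was found in so that "Mark" and "mark" stay distinguishable. The function name extract_words and the simple word pattern are assumptions.

```python
import re
import xml.etree.ElementTree as ET

WORD = re.compile(r"[A-Za-z']+")  # simplistic word pattern (an assumption)


def extract_words(node, use_correction=True, out=None):
    """Collect (word, enclosing_tag) pairs in document order.

    For <corr sic="..."> elements, take either the corrected content or the
    original reading stored in the `sic` attribute.
    """
    out = [] if out is None else out
    if node.tag == "corr" and not use_correction:
        out.extend((w, node.tag) for w in WORD.findall(node.get("sic", "")))
        return out
    out.extend((w, node.tag) for w in WORD.findall(node.text or ""))
    for child in node:
        extract_words(child, use_correction, out)
        # Text after a child's end tag belongs to the current (parent) node.
        out.extend((w, node.tag) for w in WORD.findall(child.tail or ""))
    return out


SAMPLE = """<s>Bob likes <corr sic="it's">its</corr> name. <name>Mark</name> put his mark there.</s>"""
root = ET.fromstring(SAMPLE)
corrected = extract_words(root)
as_written = extract_words(root, use_correction=False)
print(corrected)
print(as_written)
```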
Reusable Requirements
• (R1) Standards for word, character set, alphabet, etc.
• (R2) Definitions and requirements for selecting a
region or sub-document in a text.
• (R3) Search, defined in terms of abstract concepts
relating to words, regions, mark-up, etc.
Reusable Designs, Components,
Interfaces, Etc.
• (D1) SGML or XML front-end components.
• (D2) Data structure models for R1, word and
alphabet standards.
• (D3) Fuzzy word matching algorithm and
implementation.
• (D4) An API for a text-base query language (XML
XPath)
• (D5) A design and reusable code for a text database
management system (Berkeley DB)
• (D6) An API for document structure manipulation
(XML DOM).
Programs and Tools That Could Result
• Repository-centered systems
– New TACT: an information-retrieval-like system
– WordCluster: find parts of a document with
concentrations of words from one or more categories
or themes.
– PageTurner: a program to manipulate and
“conditionally” display a text according to its mark-up
• Pipeline (or pipe and filter) systems:
– New concordance-like programs
– TextComp: an implementation of Raben’s solution for
finding passages in a pair of texts that echo each other
• Others might include: Collate, an SGML Tag Editor
A Region-based Architecture
• Used by sgrep, a command-line search tool (like grep) that:
– models “things” in files as regions (AKA spans)
– is similar to the TIPSTER Info. Retr. architecture (GATE)
• Why isn’t a markup-centered approach enough?
– The DOM for XML provides a model and an API
for software to manipulate structured documents.
– Query languages are coming: XQL or ???
• We need both: A markup-centered approach does not
completely address:
– finer-grained features not marked up
– non-hierarchical features
Definitions
• sgrep queries and results defined as:
– region: a chunk of text defined by a starting and
ending byte-position in a file
– region sets: a collection of regions
• Region sets can be combined in powerful ways
– sgrep finds nested regions
– given two region sets, finds subset of one that is
contained in another
– given two region-sets, finds subset of one that
contains some member of the other
– sets can be merged etc.
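The definitions above can be sketched in a few lines of Python. This is an illustration of the region model, not sgrep's implementation: the Region type, the function names, and the choice of half-open character offsets (rather than byte positions in a file) are all assumptions made for the example.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    start: int  # offset of the first character in the region
    end: int    # offset just past the last character (half-open; an assumption)


def contains(outer, inner):
    """True if `inner` lies entirely within `outer`."""
    return outer.start <= inner.start and inner.end <= outer.end


def contained_in(a, b):
    """Subset of region set `a` whose members lie inside some member of `b`."""
    return [r for r in a if any(contains(o, r) for o in b)]


def containing(a, b):
    """Subset of region set `a` whose members contain some member of `b`."""
    return [r for r in a if any(contains(r, i) for i in b)]


def merge(a, b):
    """Union of two region sets, in text order."""
    return sorted(set(a) | set(b))


text = "<DIV1>honor and glory</DIV1> lost honor <DIV1>no such word</DIV1>"
honor = [Region(m.start(), m.end()) for m in re.finditer("honor", text)]
div1 = [Region(m.start(), m.end()) for m in re.finditer(r"<DIV1>.*?</DIV1>", text)]

honor_in_div1 = contained_in(honor, div1)    # "honor" inside a DIV1 element
div1_with_honor = containing(div1, honor)    # DIV1 elements that contain "honor"
print(honor_in_div1, div1_with_honor)
```

Because every operation takes region sets and returns a region set, results can be fed straight into further operations.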
Example: Regions and Region Sets
[Figure: three diagrams marking regions against the text:
(1) All occurrences of “honor”
(2) All occurrences of DIV1 elements
(3) All occurrences of “honor” in a DIV1 element]
Region Examples
• Region sets:
– all words in a text; all characters; all syllables
– all occurrences of a given token
– all DIV1 elements
– all elements that have attribute with a given value
• Queries or subtext selection. Examples: Let’s find:
– Speeches by Hamlet in Acts 4 and 5,...
– choosing only those marked-up as verse,...
– choosing only those with the word “honor”
• This example illustrates selection of a document or
subdocument, a core user requirement
Benefits of Using Regions
• Provides a general model of things in texts
– Markup such as SGML elements can be modeled
as regions
– sgrep’s model of regions uses a concept of
nesting or inclusion, which is naturally useful
• A good model can be quite powerful
– Spreadsheets: cells, rows, columns, ranges
A General Model for Text Processing
• Text Object (TO): A thing that a user wants to
identify and manipulate in a text.
Examples: markup elements, tokens, syllables, user-defined regions (e.g. an “analogy”)
• Text Object Occurrence (TOO): an occurrence of a
TO, represented by a region. (Or a region set if non-contiguous.) Examples:
– the particular word-token found on line 3
– the 2nd DIV1 markup element in the file
– the next-to-last syllable of a given word
– a region selected by the user using the mouse
A General Model (continued)
• Text Object Occurrence List (TOO-List): a Text
Object and a set of occurrences (TOOs) associated
with it. Examples:
– All word tokens
– All occurrences of “the”
– All pronouns
– All occurrences of “he”
– All syllables
– All markup elements
– All DIV1 elements
– All elements with LANG attribute value = “DE”
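The three concepts (Text Object, Text Object Occurrence, TOO-List) might be expressed as an object model along the following lines. This is a sketch under assumptions: the class and field names are invented for the example, and regions are plain (start, end) character offsets.

```python
import re
from dataclasses import dataclass, field
from typing import List, Tuple

Region = Tuple[int, int]  # (start, end) character offsets; an assumption


@dataclass(frozen=True)
class TextObject:
    """A thing the user wants to identify and manipulate in a text."""
    name: str  # e.g. 'occurrences of "the"' or "DIV1 element"
    kind: str  # e.g. "token", "markup", "syllable", "user-defined"


@dataclass(frozen=True)
class TextObjectOccurrence:
    """One occurrence of a Text Object: one region, or several if non-contiguous."""
    text_object: TextObject
    regions: Tuple[Region, ...]


@dataclass
class TOOList:
    """A Text Object together with the occurrences found for it."""
    text_object: TextObject
    occurrences: List[TextObjectOccurrence] = field(default_factory=list)

    def add(self, *regions: Region) -> None:
        self.occurrences.append(TextObjectOccurrence(self.text_object, tuple(regions)))


text = "the cat saw the dog"
the_list = TOOList(TextObject('occurrences of "the"', "token"))
for m in re.finditer(r"\bthe\b", text):
    the_list.add((m.start(), m.end()))
print([occ.regions for occ in the_list.occurrences])
```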
Manipulating TOO-Lists
• New lists can be found using sgrep-like operations
– Occurrences of “the” in DIV1 elements
– All DIV1 elements where element attribute
ATTR="y"
– Occurrences of syllable "-er" in pronouns.
– DIV1 elements that contain pronouns.
– Pronouns containing "er" spoken by Hamlet.
– Occurrences of “the” or words with "-er" in that
chunk of text the user selected with the mouse
and in DIV1 elements that are not in German.
• Or any combination of these!
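As a sketch of how such combinations compose, the longest query above (occurrences of “the” or “-er” words, inside the user's selection and inside DIV1 elements that are not in German) can be built from three small set operations. The region values below are invented placeholders, and the function names are assumptions.

```python
def inside(items, containers):
    """Regions in `items` that lie within some region of `containers`."""
    return [r for r in items if any(c[0] <= r[0] and r[1] <= c[1] for c in containers)]


def union(a, b):
    return sorted(set(a) | set(b))


def difference(a, b):
    return sorted(set(a) - set(b))


# Hypothetical TOO-Lists, each a list of (start, end) regions.
the = [(0, 3), (40, 43), (90, 93)]
er_words = [(10, 16), (50, 55), (120, 126)]
div1 = [(0, 60), (80, 100), (110, 130)]
german_div1 = [(110, 130)]
user_selection = [(5, 95)]

targets = union(the, er_words)                 # "the" OR words with "-er"
non_german = difference(div1, german_div1)     # DIV1 elements not in German
result = inside(inside(targets, user_selection), non_german)
print(result)
```

Each step yields another list of regions, which is what allows "any combination of these".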
Where do TOO-Lists Come From?
• TOO-Lists can be found by software components:
– An XML parser or sgrep could find all elements
– A tokenizer could find all word-tokens
– Markup tags could be used to identify all pronouns
– A linguistic tool or an exhaustive dictionary could
be used to create a list of syllable occurrences
– A user could select a set of regions by XML query,
mouse-selection, etc.
• Some TOO-Lists are stored in a tool’s database
• Some are calculated on the fly in response to
queries, operations
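For instance, a tokenizer component could produce both the list of all word tokens and one list per distinct word in a single pass. The sketch below assumes a simple regular-expression tokenizer and (start, end) character offsets; a real tool would need the configurable word and alphabet definitions discussed earlier.

```python
import re
from collections import defaultdict

# Letters, optionally joined by an internal apostrophe or hyphen (an assumption).
TOKEN = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")


def tokenize(text):
    """Yield (normalized_token, (start, end)) for each word token."""
    for m in TOKEN.finditer(text):
        yield m.group().lower(), (m.start(), m.end())


def build_token_lists(text):
    """Return (all token regions, regions grouped by normalized word)."""
    all_tokens = []
    by_word = defaultdict(list)
    for word, region in tokenize(text):
        all_tokens.append(region)
        by_word[word].append(region)
    return all_tokens, dict(by_word)


all_tokens, by_word = build_token_lists("To be, or not to be")
print(len(all_tokens), by_word["be"])
```

The per-word lists here are the kind of data that could be stored in a tool's database, while something like "all words ending in -er" could be computed on the fly from all_tokens.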
Benefits: Supports Core Needs
• Does it support core user requirements?
– Selection of documents and part of documents
– Identify occurrences of text features
– Allow queries and other operations on text
features within a document selection
– Support common operations other than search:
• Identify, name and store
• Output
• Visualization
• Number and count occurrences. Operations based on
sequencing. (E.g. find “next” word.)
• Other processing or analysis. Linguistic, etc.
Common View of Markup, Words, Etc.
• Markup and words (or other objects) have a common
representation.
– Search for any Text Object (TO) is handled the
same by the software
– Query language for the user may not be simple!
– There’s no difference between:
• Finding markup unit containing a particular word
E.g. All occurrences of “love” in a DIV1
• Finding word containing a particular markup tag
E.g. All word-tokens that have CORR tags
• A common representation simplifies software
development, even if user perceives differences.
Multifile Documents and Corpora
• sgrep finds patterns in files
– A region is a part of a file
• We need more:
– SGML/XML allows more than one file to comprise
a single document
– Corpora and Digital Libraries
• Extend a region to include a document or file
identifier:
– A triple not a pair: (docID, start, end)
– Requires a relative ordering of docIDs for multifile
documents
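A sketch of the extended region, assuming Python and made-up file names. The explicit doc_order list stands in for the "relative ordering of docIDs"; note that sorting the identifiers alphabetically would not give the right document order here.

```python
from typing import List, NamedTuple


class DocRegion(NamedTuple):
    doc_id: str
    start: int
    end: int


def make_sort_key(doc_order: List[str]):
    """Build a sort key that orders regions by document position, then offset."""
    rank = {doc_id: i for i, doc_id in enumerate(doc_order)}

    def key(region: DocRegion):
        return (rank[region.doc_id], region.start, region.end)

    return key


def contains(outer: DocRegion, inner: DocRegion) -> bool:
    """Containment only makes sense within the same file in this sketch."""
    return (outer.doc_id == inner.doc_id
            and outer.start <= inner.start
            and inner.end <= outer.end)


# Hypothetical multifile document: the order of its files must be declared.
DOC_ORDER = ["front.xml", "act1.xml", "act2.xml"]
regions = [
    DocRegion("act2.xml", 0, 50),
    DocRegion("front.xml", 10, 20),
    DocRegion("act1.xml", 300, 400),
    DocRegion("act1.xml", 5, 9),
]
ordered = sorted(regions, key=make_sort_key(DOC_ORDER))
print(ordered)
```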
Non-contiguous Text Objects
• We may wish to define a Text Object (TO) that cannot
be represented by one single span. Examples:
– Occurrence of verb forms in German
– Some document-defined element
– User-defined objects
• A Text Object Occurrence becomes a region set
instead of a region.
• Modify primitive region operations like includes,
contained in, etc. to work on two region sets
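A sketch of how the primitives might be lifted to region sets, using a German separable verb as the non-contiguous Text Object. The definition chosen here (a set is contained in another if every one of its regions lies inside some region of the other) is one plausible reading, stated as an assumption; the function names are invented.

```python
def region_in(region, container_set):
    """True if `region` lies inside at least one region of `container_set`."""
    return any(c[0] <= region[0] and region[1] <= c[1] for c in container_set)


def set_contained_in(a, b):
    """Assumed definition: every region of `a` lies inside some region of `b`."""
    return all(region_in(r, b) for r in a)


def set_includes(a, b):
    """`a` includes `b` exactly when `b` is contained in `a`."""
    return set_contained_in(b, a)


def extent(a):
    """Smallest single region covering the whole region set."""
    return (min(r[0] for r in a), max(r[1] for r in a))


text = "Ich rufe dich morgen an."
verb = ((4, 8), (21, 23))     # "rufe" ... "an": one occurrence of "anrufen"
sentence = ((0, 24),)
first_words = ((0, 13),)      # "Ich rufe dich"

print([text[s:e] for s, e in verb])
print(set_contained_in(verb, sentence), set_contained_in(verb, first_words))
```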
Integrating Markup- and Region-Centered Approaches
• Clearly hierarchical structure of SGML/XML
documents is powerful and necessary
• The Document Object Model (DOM) for XML
– A model for accessing XML structures by software
– Defines APIs for accessing nodes, subtrees, etc.
– XML parsers can build a DOM
– Built-on-the-fly and temporary vs. persistent
• Query languages for XML documents
– Retrieves a node, a subtree, or a set of nodes.
– Not yet standard. Proposals, like XSL.
Integrating the Two Approaches
• Region view can address aspects not well-addressed
in a DOM:
– Objects at a finer grain of detail than XML node
– Objects defined without markup
• DOM nodes correspond to regions
– A DOM implementation could include region-info
– Or, a separate index could be created to map
DOM node-IDs to regions
• Thus, a software tool could have two integrated
databases reflecting both views
– Allowing XSL-like queries by the user
– Allowing sgrep-like manipulations too
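The "separate index" option could be sketched as below, assuming Python's standard xml.parsers.expat module, whose parser exposes a CurrentByteIndex attribute during callbacks. The sequential integer IDs are a stand-in for real DOM node IDs, and the region recorded for each element runs from the first byte of its start tag to the first byte of its end tag; both choices are assumptions made for this example.

```python
import itertools
import xml.parsers.expat


def index_element_regions(data: bytes):
    """Map a sequential node ID to (tag, start, end) byte offsets.

    `start` is where the element's start tag begins; `end` is where its
    end tag begins (the parse position when the end-tag event is reported).
    """
    parser = xml.parsers.expat.ParserCreate()
    ids = itertools.count()
    open_elements = []  # stack of (node_id, tag, start)
    index = {}

    def start_element(name, attrs):
        open_elements.append((next(ids), name, parser.CurrentByteIndex))

    def end_element(name):
        node_id, tag, start = open_elements.pop()
        index[node_id] = (tag, start, parser.CurrentByteIndex)

    parser.StartElementHandler = start_element
    parser.EndElementHandler = end_element
    parser.Parse(data, True)
    return index


data = b"<play><speech>To be</speech><speech>or not</speech></play>"
index = index_element_regions(data)
for node_id in sorted(index):
    tag, start, end = index[node_id]
    print(node_id, tag, data[start:end])
```

With such an index, the result of a markup query (a set of nodes) can be converted to a region set and then combined with word-level regions.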
Overlapping Hierarchies
• Region-based view not limited by hierarchical structures
• Example: consider use of milestones to indicate verses
in a play where speeches are marked-up using tags.
– A verse may span several speeches
– A verse may not start or end at a speech boundary
• Perhaps we wish to do some operation on word tokens
by speech and also by verse (e.g. word-list of all words,
find first word, etc.)
• Regions used to represent words, and also
– SPEECH elements in SGML/XML
– Text between verse-begin and verse-end milestone
tags
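A small sketch of the point, with invented data: the same word regions are grouped once by speech and once by verse, even though the first verse crosses a speech boundary. No tree structure is needed, only containment tests.

```python
import re


def group_by(units, words):
    """Map each unit region to the word regions it contains, in text order."""
    return {u: [w for w in words if u[0] <= w[0] and w[1] <= u[1]] for u in units}


text = "To be or not to be"
words = [(m.start(), m.end()) for m in re.finditer(r"\w+", text)]

# Hypothetical regions: speeches from SPEECH elements, verses from the text
# between verse-begin and verse-end milestones. The first verse spans both speeches.
speeches = [(0, 8), (9, 18)]    # "To be or" / "not to be"
verses = [(0, 12), (13, 18)]    # "To be or not" / "to be"

by_speech = group_by(speeches, words)
by_verse = group_by(verses, words)

first_word_per_speech = [text[ws[0][0]:ws[0][1]] for ws in by_speech.values()]
first_word_per_verse = [text[ws[0][0]:ws[0][1]] for ws in by_verse.values()]
print(first_word_per_speech, first_word_per_verse)
```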
Example: Overlapping Hierarchies
[Figure: two diagrams over the same text: words grouped in lines, and words grouped in verses]
Numbering or Sequencing Elements
• Relative position of Text Objects is needed:
– Find all words after “of”.
– Find occurrences of “in the”.
– Select first and second speeches of the first two
scenes of each act in a play.
– Select successive 500-word blocks.
– Find the first noun (per markup) in each sentence.
– Find the first verb (per markup) after each noun
(per markup).
– Find the next syllable after “er” (if any) in words.
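Three of these queries can be sketched once word tokens are held as an ordered list of regions. The sample sentence, the simple word pattern, and the function names are assumptions; markup boundaries are ignored here.

```python
import re


def word_regions(text):
    """Ordered list of (normalized_word, start, end)."""
    return [(m.group().lower(), m.start(), m.end())
            for m in re.finditer(r"[A-Za-z]+", text)]


def words_after(words, target):
    """Words that immediately follow an occurrence of `target`."""
    return [b for a, b in zip(words, words[1:]) if a[0] == target]


def phrase_occurrences(words, first, second):
    """Regions spanning `first` immediately followed by `second`."""
    return [(a[1], b[2]) for a, b in zip(words, words[1:])
            if a[0] == first and b[0] == second]


def blocks(words, size):
    """Successive blocks of `size` words, as regions (the last may be shorter)."""
    return [(words[i][1], words[min(i + size, len(words)) - 1][2])
            for i in range(0, len(words), size)]


text = "In the beginning of the play, the king of Denmark is in the garden."
words = word_regions(text)
print([w[0] for w in words_after(words, "of")])
print([text[s:e] for s, e in phrase_occurrences(words, "in", "the")])
print(blocks(words, 5))
```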
Problem: How to Number Things?
• How to handle markup boundaries?
– Does the first word of the following sentence count as
the “next word” after the last word in a sentence?
– What if there’s a chapter break?
• Intervening markup elements. E.g. a footnote.
• In critical editions, how to handle tags for SIC/CORR,
REG/ORIG, GAP/ADD/DEL etc.
Sequencing of TOs using Region Sets
• Goal: Create a sequence of Text Objects (TOs)
• Define a Sequence Context to be the set of “units”
inside of which we wish to number.
– Each “unit” might be a contiguous region, or it
might be a region set
– Thus the Sequence Context is a sequence of
region sets
• To sequence occurrences of a given TO, process its
region-set against the Sequence Context
– Ignore TO Occurrences not in the context
– Optionally restart counter for each region-set in
the context
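The procedure above can be sketched as follows. The representation (a context as a list of units, each unit a list of regions) and the function name sequence are assumptions made for the example.

```python
def sequence(occurrences, context, restart=True):
    """Number occurrences of a Text Object against a Sequence Context.

    `occurrences`: list of (start, end) regions for one Text Object.
    `context`: list of units; each unit is a region set (list of regions).
    Returns (unit_index, number, region) triples. Occurrences that fall in
    no unit are ignored. With restart=True the counter restarts per unit.
    """
    numbered = []
    counter = 0
    for unit_index, unit in enumerate(context):
        if restart:
            counter = 0
        for occ in sorted(occurrences):
            if any(s <= occ[0] and occ[1] <= e for s, e in unit):
                counter += 1
                numbered.append((unit_index, counter, occ))
    return numbered


# Hypothetical data: six occurrences, and a context of two units, the second
# of which is a non-contiguous region set.
occurrences = [(0, 2), (5, 7), (12, 14), (20, 22), (25, 27), (40, 42)]
context = [[(0, 10)], [(18, 23), (38, 45)]]

per_unit = sequence(occurrences, context)
running = sequence(occurrences, context, restart=False)
print(per_unit)
print(running)
```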
Example: Sequencing TOs
[Figure: (1) Numbered TOs in some context (Context 1), ignoring some; (2) Numbered TOs in another context (Context 2). Numbering restarts at 1 within each unit of the context.]
Visualization
• Previous slide shows a simple visualization of TOs
inside of regions
• Potential for a general model of visualization of TOs
(words, markup elements etc.) in relation to other
TOs (regions, markup elements, etc.)
• Examples:
– Show presence or absence of some TO in a
selected region
– Counts of how many occurrences in each region
• Given a region-based database of TO Occurrences,
we have a dynamic and flexible way to visualize all
TOs known to the tool
Tilebars
• Marti Hearst developed tilebars as a method for
visualizing query results
• Dimensions:
– documents,
– terms (text objects),
– segments (regions)
• Shading of each cell shows strength of occurrence of
a term
• Adjacent shaded cells show co-occurrence of terms
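A rough text-only sketch of the idea for a single document, given region data for terms and segments. This is not Hearst's implementation: the shading scale, the function names, and the sample regions are assumptions, and the documents dimension is left out.

```python
def tile_counts(term_regions, segments):
    """Rows are terms, columns are segments, cells count occurrences."""
    return {
        term: [sum(1 for s, e in regions if seg[0] <= s and e <= seg[1])
               for seg in segments]
        for term, regions in term_regions.items()
    }


SHADES = " .:#"  # hypothetical scale: 0, 1, 2, and 3-or-more occurrences


def render(counts):
    """Draw one row per term, one shaded character per segment."""
    lines = []
    for term, row in counts.items():
        cells = "".join(SHADES[min(c, len(SHADES) - 1)] for c in row)
        lines.append(f"{term:<8}|{cells}|")
    return "\n".join(lines)


# Invented data: three segments and two terms with their occurrence regions.
segments = [(0, 100), (100, 200), (200, 300)]
term_regions = {
    "honor": [(10, 15), (40, 45), (210, 215)],
    "love": [(20, 24), (120, 124), (130, 134), (140, 144), (150, 154)],
}

counts = tile_counts(term_regions, segments)
print(render(counts))
```

Reading down a column shows which terms co-occur in the same segment.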
Tilebars (example)
Conclusions
• Regions and region sets show promise as a general
model for representing all types of Text Objects
– Integrates markup and other document features
such as word tokens
• Supports selection of documents and subdocuments
– Apply operations (display, count, remember, etc.)
to a selection
• Should integrate with markup-based document
models (e.g. the XML DOM)
• Supports numbering/sequencing and visualization
techniques.
What’s Next?
• A proof-of-concept application to show the usefulness
of the approach
• Development of software components based on
regions to support the “middle-level” of software
architectures
• How to best store region-based data for use by
software applications
• Develop a model for distributed text-bases based on
Text Objects, TOO-Lists, etc.
– Implement this for existing texts in an existing
Digital Library
Educational Research in CS/SE
• We have a CS Education Group here at UVa
– Teaching Faculty: Horton, Milner, Anderson, Prey
– Also, others like Knight, Evans, Cohoon
• Education topics are suitable for a Senior Thesis!
Education Projects
• SW Engineering Education (Horton and Knight)
– Share across the Net with other universities
• CS201 camera client/server system
• CS340 Robot systems?
– Web Repository for educational artifacts
• MySQL and PHP
– Winding down for now, but future could include:
• Case studies for use in CS202, CS340, etc.
• Pair-programming (like XP) in CS101?
• New robot-like project suitable for CS340?
Projects (cont’d)
• Next Generation Computer Labs (Knight, Horton,
Milner, Anderson)
– Focuses on CS101 and CS201
– Do they have to be in lab? If so, what should they
be doing as a group with TAs?
– Can we use technology better?
• Netmeeting, Instant Messaging, Web discussion boards
• Collaborative learning: small groups, using a Web board
etc.
– Studio Labs (like for CS340) in CS201?
– Evaluate effectiveness (surveys, interviews, etc.)
Projects (cont’d)
• Fourth-year course/project that does:
– Larger scale system development
– Distributed computing
• Client/server
• Database
– Integration of custom code with existing
components
• Database, middle-ware, COTS components
– Uses .NET, SOAP/XML, etc.
• What would this course or project look like?