SEASR (Software Environment for the Advancement of Scholarly Research)
Engineering Knowledge for the Humanities
SEASR’s mission is to advance and innovate technologies
to support knowledge discovery in the humanities.
______________________
principal investigators:
Michael Welge, NCSA, GSLIS
Loretta Auvil, NCSA
John Unsworth, GSLIS
a collaboration between the National Center for Supercomputing Applications
and the Graduate School of Library and Information Science
at the University of Illinois at Urbana-Champaign
initiated June 2007, funded by The Andrew W. Mellon Foundation
______________________
Over the past twenty years, humanities
computing has developed technologies to support
research–from the archiving of electronic texts,
images, audio, and email; to the sharing of web-based
research sites and exhibitions; to the establishing of
electronic discussion lists and research forums; to the
publishing of electronic journals; to the creation of
ever-more advanced search applications. As a result,
humanities researchers have greater information
access and search capabilities than ever before. But
because scholars lack sufficiently integrated and
refined data-mining tools, they cannot yet analyze
large collections in the sophisticated ways that can
inspire new insights.
The informatics specialists at NCSA and GSLIS
behind The Andrew W. Mellon Foundation-funded
SEASR project–the Software Environment for the
Advancement of Scholarly Research–saw the digital
humanities’ need to improve data-mining tools for
analyzing (and not simply searching) large bodies of
information, to build software bridges for
communicating between applications, and to create
enhanced environments for technology and
information sharing. In answer, SEASR is
developing a state-of-the-art software environment
for managing unstructured data; analyzing digital
libraries, repositories, and archives; and providing an
improved end-to-end software system that integrates
workflow engines, analysis and visualization tools,
and collaborative tools. In doing so, SEASR
advances the informatics capabilities developed in
some of the very latest digital humanities efforts:
Wordhoard, led by Martin Mueller of Northwestern
University; the multi-institutional Nora, led by John
Unsworth of the University of Illinois; IMIRSEL
(International Music Information Retrieval Systems
Evaluation Laboratory), led by J. Stephen Downie of
the University of Illinois; and MONK (Metadata Offer
New Knowledge), the two-year combined project that
furthers Wordhoard and Nora’s technologies, led by
John Unsworth and Martin Mueller.
Drawing upon the most current informatics
research, the SEASR team has combined IBM’s
UIMA (Unstructured Information Management
Architecture) and NCSA’s D2K (Data to Knowledge)
machine learning to produce a powerful,
component-based analytic framework. Component-based
applications like SEASR have an advantage over
massive, single-domain applications in that they
allow for sharing and reusing components across
domains. This approach enables SEASR’s
developers to develop, integrate, deploy, and sustain
a set of reusable and extensible software
components—and empowers other digital humanities
developers who use SEASR to do the same.
Moreover, SEASR’s framework
provides portability through developing tools
that can be installed on hardware footprints
small and large, so that the tools can be
brought to data sets where they are housed
and where users choose to work with them;
creates a repository for open-source
components and workflows that supports
sharing and publishing among users; and
enables scalability so that components may
run on a large variety of hardware footprints,
including shared memory processors and
clusters.
How does SEASR work? SEASR employs
leading technology to transform raw data into the
semi-structured and structured information that can
be processed by machine learning and data analysis
applications. Specifically, SEASR uses IBM’s
open-source UIMA to construct data services that access
and normalize unstructured information. UIMA,
according to IBM, advances data synthesis by
providing “a technology designed to support a new
breed of software applications that can process text
within documents and other content sources to
understand the latent meaning, relationship and
relevant facts buried within…” The SEASR team
has chosen to work with UIMA, since it has become
a new formal standard with wide use in many fields.
Members of the SEASR development team actively
serve on the OASIS technical committee to establish
semantic search and content analytics specifications
for UIMA.
Technically, UIMA’s appeal is two-fold: it
offers a rich metadata standard that allows for
expressing structure in complex ways and it provides
a run-time environment in which developers can
build, deploy, plug in, and run UIMA component
implementations, along with other independently
developed components. UIMA’s component-based
framework enables reuse, so that developers can
leverage third-party code across platforms and
development environments.
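
As a rough illustration of the idea described above (components that add structured metadata to unstructured text and can be reused by other components), consider the following Python sketch. It does not use UIMA's actual API. The names Annotation, Document, token_annotator, capitalized_annotator, and run_pipeline are hypothetical and were chosen only for this example.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """A typed span over the document text (hypothetical, not UIMA's own type)."""
    type: str
    begin: int
    end: int


@dataclass
class Document:
    """Raw text plus the annotations that components have added to it."""
    text: str
    annotations: list = field(default_factory=list)

    def covered_text(self, ann):
        # Return the slice of text that an annotation points at.
        return self.text[ann.begin:ann.end]


def token_annotator(doc):
    """Mark every run of word characters as a 'Token'."""
    for match in re.finditer(r"\w+", doc.text):
        doc.annotations.append(Annotation("Token", match.start(), match.end()))


def capitalized_annotator(doc):
    """Mark tokens that start with an uppercase letter.

    This component reuses the output of token_annotator instead of
    re-tokenizing, which is the kind of reuse a shared metadata
    model makes possible.
    """
    tokens = [a for a in doc.annotations if a.type == "Token"]
    for ann in tokens:
        if doc.covered_text(ann)[0].isupper():
            doc.annotations.append(Annotation("Capitalized", ann.begin, ann.end))


def run_pipeline(text, annotators):
    """Apply each annotator in order to a fresh Document."""
    doc = Document(text)
    for annotate in annotators:
        annotate(doc)
    return doc


doc = run_pipeline(
    "Martin Mueller leads Wordhoard.",
    [token_annotator, capitalized_annotator],
)
names = [doc.covered_text(a) for a in doc.annotations if a.type == "Capitalized"]
print(names)
```

Because each annotator only reads and writes annotations on a shared document object, a third-party component could be added to the list without changing the existing ones.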
The unstructured information SEASR
transforms through component-based data services is
processed a step further by our service-oriented
architecture, which is also component-based and
includes ontology libraries and analytics services. To
mine semi-structured and structured data and enable
previously incompatible formats and platforms to
communicate, SEASR draws upon the best practices
developed in NCSA’s D2K project over the last
decade. D2K is a rapid, flexible data mining and
machine learning system that integrates analytical
data mining methods for prediction, discovery, and
deviation detection with data and information
visualization tools. D2K’s visual programming
environment allows developers to connect
programming modules together to build data mining
applications and supplies a core set of modules,
application templates, and a standard API for
software component development.
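
The notion of connecting modules into a data mining application can be sketched in a few lines of Python. This is only an illustration under our own assumptions, not D2K's API or its visual environment: the Module class, its connect_from method, the run function, and the tiny word-count flow are all hypothetical.

```python
from collections import Counter


class Module:
    """A named processing step; `func` receives the upstream outputs as arguments."""

    def __init__(self, name, func):
        self.name = name
        self.func = func
        self.inputs = []  # upstream modules, in argument order

    def connect_from(self, *upstream):
        # Wire the outputs of the upstream modules into this module.
        self.inputs.extend(upstream)
        return self


def run(module, cache=None):
    """Evaluate a module after its upstream modules, computing each module once."""
    cache = {} if cache is None else cache
    if module.name not in cache:
        args = [run(upstream, cache) for upstream in module.inputs]
        cache[module.name] = module.func(*args)
    return cache[module.name]


# A small flow: load documents -> tokenize -> count words -> keep the top two.
load = Module("load", lambda: ["the whale", "the sea", "a whale"])
tokenize = Module(
    "tokenize", lambda docs: [word for d in docs for word in d.split()]
).connect_from(load)
count = Module("count", lambda words: Counter(words)).connect_from(tokenize)
top = Module("top", lambda counts: counts.most_common(2)).connect_from(count)

result = run(top)
print(result)
```

Any module in such a flow can be swapped for another with the same inputs and outputs, which is the practical benefit of building applications from a core set of modules.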
The technical legacy of D2K’s data and
visualization advances for developers is passed on to
humanities researchers with SEASR. Users will
leverage SEASR through NCSA- and GSLIS-designed
developer toolkits (as rich client applications) and
user toolkits (in the form of rich internet
applications), or through custom user interfaces
developed by anyone in the digital
humanities development community.
The architectural and software advancements
SEASR makes are clear, but how does SEASR
improve upon existing digital knowledge discovery?
SEASR provides an enhanced range of data synthesis
operations over those currently available to digital
humanities applications: from focused data retrieval
and data integration, to intelligent human-computer
interactions for knowledge access, to semantic data
enrichment, to entity and relationship discovery, to
knowledge discovery and hypothesis generation.
Truly, SEASR is engineering knowledge for the
humanities.
How can you participate in SEASR? We are
eager for members of the digital humanities and
humanities communities to collaborate on application
development and ontology creation, as well as to
contribute to component development for analytics
and data access. We also invite community members
to participate in visualization and UI design. And
we welcome expert advisors who can help SEASR
make the best possible contributions to the evolving
cyberinfrastructure for the humanities.
www.seasr.org