SEASR (Software Environment for the Advancement of Scholarly Research)
Engineering Knowledge for the Humanities

SEASR’s mission is to advance and innovate technologies to support knowledge discovery in the humanities.
______________________
principal investigators: Michael Welge, NCSA GSLIS; Loretta Auvil, NCSA; John Unsworth, GSLIS
a collaboration between the National Center for Supercomputing Applications and the Graduate School of Library and Information Science at the University of Illinois at Urbana-Champaign
initiated June 2007, funded by The Andrew W. Mellon Foundation
______________________

Over the past twenty years, humanities computing has developed technologies to support research, from the archiving of electronic texts, images, audio, and email; to the sharing of web-based research sites and exhibitions; to the establishing of electronic discussion lists and research forums; to the publishing of electronic journals; to the creation of ever-more advanced search applications. As a result, humanities researchers have greater information access and search capabilities than ever before. But because scholars lack sufficiently integrated and refined data-mining tools, they don’t yet have the potential for analyzing large collections in the sophisticated ways that can inspire new insights.

The informatics specialists at NCSA and GSLIS behind The Andrew W. Mellon Foundation-funded SEASR project, the Software Environment for the Advancement of Scholarly Research, saw the digital humanities’ need to improve data-mining tools for analyzing (and not simply searching) large bodies of information, to build software bridges for communicating between applications, and to create enhanced environments for technology and information sharing.
In answer, SEASR is developing a state-of-the-art software environment for managing unstructured data; analyzing digital libraries, repositories, and archives; and providing an improved end-to-end software system that integrates workflow engines, analysis and visualization tools, and collaborative tools. In doing so, SEASR advances the informatics capabilities developed in some of the very latest digital humanities efforts: Wordhoard, led by Martin Mueller of Northwestern University; the multi-institutional Nora, led by John Unsworth of the University of Illinois; IMIRSEL (International Music Information Retrieval Systems Evaluation Laboratory), led by J. Stephen Downie of the University of Illinois; and MONK (Metadata Offer New Knowledge), the two-year combined project that furthers Wordhoard and Nora’s technologies, led by John Unsworth and Martin Mueller.

Drawing upon the most current informatics research, the SEASR team has combined IBM’s UIMA (Unstructured Information Management Architecture) and NCSA’s D2K (Data to Knowledge) machine learning system to produce a powerful, component-based analytic framework. Component-based applications like SEASR have an advantage over massive, single-domain applications in that they allow for sharing and reusing components across domains. This approach enables SEASR’s developers to develop, integrate, deploy, and sustain a set of reusable and extensible software components, and it empowers other digital humanities developers who use SEASR to do the same.
Moreover, SEASR’s framework provides portability, through tools that can be installed on hardware footprints small and large, so that the tools can be brought to data sets where they are housed and where users choose to work with them; creates a repository for open-source components and workflows that supports sharing and publishing among users; and enables scalability, so that components may run on a large variety of hardware footprints, including shared memory processors and clusters.

How does SEASR work?

SEASR employs leading technology to transform raw data into the semi-structured and structured information that can be processed by machine learning and data analysis applications. Specifically, SEASR uses IBM’s open-source UIMA to construct data services that access and normalize unstructured information. UIMA, according to IBM, advances data synthesis by providing “a technology designed to support a new breed of software applications that can process text within documents and other content sources to understand the latent meaning, relationship and relevant facts buried within…” The SEASR team has chosen to work with UIMA, since it has become a new formal standard with wide use in many fields. Members of the SEASR development team actively serve on the OASIS technical committee to establish semantic search and content analytics specifications for UIMA.

Technically, UIMA’s appeal is two-fold: it offers a rich metadata standard that allows for expressing structure in complex ways, and it provides a run-time environment in which developers can build, deploy, plug in, and run UIMA component implementations, along with other independently developed components. UIMA’s component-based framework enables reuse, so that developers can leverage third-party code across platforms and development environments.
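To make the component-based idea concrete, here is a minimal sketch in Python of small, independent components chained into one flow that turns raw text into structured counts. It is an illustration only, not SEASR, UIMA, or D2K code: the names `normalize`, `tokenize`, `count_terms`, and `run_flow` are hypothetical, and the dictionary passed between components merely stands in for a shared, annotated document record.

```python
import re
from collections import Counter
from typing import Callable, Dict, List

# Hypothetical component type (an assumption for this sketch):
# a component takes a document record and returns an enriched copy.
Component = Callable[[Dict], Dict]


def normalize(doc: Dict) -> Dict:
    """Collapse runs of whitespace and lowercase the raw text."""
    text = re.sub(r"\s+", " ", doc["raw"]).strip().lower()
    return {**doc, "text": text}


def tokenize(doc: Dict) -> Dict:
    """Split the normalized text into simple word tokens."""
    return {**doc, "tokens": re.findall(r"[a-z']+", doc["text"])}


def count_terms(doc: Dict) -> Dict:
    """Attach a term-frequency table, a small piece of structured data."""
    return {**doc, "counts": Counter(doc["tokens"])}


def run_flow(components: List[Component], doc: Dict) -> Dict:
    """Run the components in order, feeding each one the previous result."""
    for component in components:
        doc = component(doc)
    return doc


# Components are independent, so a flow is just a list that can be reordered,
# extended, or shared with another project.
flow = [normalize, tokenize, count_terms]
result = run_flow(flow, {"raw": "To be,  or not\nto be"})
print(result["counts"].most_common(2))
```

Because each component only reads and extends the shared record, a different analysis (for example, a component written by another group) could be swapped in for `count_terms` without touching the earlier steps.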
The unstructured information SEASR transforms through component-based data services is processed a step further by our service-oriented architecture (which is also component-based), which includes ontology libraries and analytics services. To mine semi-structured and structured data and enable previously incompatible formats and platforms to communicate, SEASR draws upon the best practices developed in NCSA’s D2K project over the last decade. D2K is a rapid, flexible data mining and machine learning system that integrates analytical data mining methods for prediction, discovery, and deviation detection with data and information visualization tools. D2K’s visual programming environment allows developers to connect programming modules together to build data mining applications and supplies a core set of modules, application templates, and a standard API for software component development.

The technical legacy of D2K’s data and visualization advances for developers is passed on to humanities researchers with SEASR. Users will leverage SEASR through NCSA- and GSLIS-designed developer toolkits (rich client applications) and user toolkits (rich internet applications), or through custom user interfaces developed by anyone in the digital humanities development community.

The architectural and software advancements SEASR makes are clear, but how does SEASR improve upon existing digital knowledge discovery? SEASR provides an enhanced range of data synthesis operations over those currently available to digital humanities applications: from focused data retrieval and data integration, to intelligent human-computer interactions for knowledge access, to semantic data enrichment, to entity and relationship discovery, to knowledge discovery and hypothesis generation. Truly, SEASR is engineering knowledge for the humanities.

How can you participate in SEASR?
We are eager for members of the digital humanities and humanities communities to collaborate on application development and ontology creation, as well as to contribute to component development for analytics and data access. We also invite community members to participate in visualization and UI design. And we welcome expert advisors who can help SEASR make the best possible contributions to the evolving cyberinfrastructure for the humanities.

www.seasr.org