SEEK: Accomplishing Enterprise Information Integration Across Heterogeneous Sources

Authors
William O’Brien, Ph.D., Dept. of Civil & Coastal Engineering, Ph: 352-392-7213, Em: wjob@ce.ufl.edu
R. Raymond Issa, Ph.D., J.D., M.E. Rinker School of Building Construction, Ph: 352-392-7438, Em: raymond-issa@ufl.edu
Joachim Hammer, Ph.D., Computer & Information Science & Engineering, Ph: 352-392-2687, Em: jhammer@cise.ufl.edu
Mark Schmalz, Ph.D., O.D., Computer & Information Science & Engineering, Ph: 352-392-6831, Em: mssz@cise.ufl.edu
Joseph Geunes, Ph.D., Dept. of Industrial & Systems Engineering, Ph: 352-392-1220, Em: geunes@ise.ufl.edu
Sherman Bai, Ph.D., Dept. of Industrial & Systems Engineering, Ph: 352-392-1220, Em: bai@ise.ufl.edu
University of Florida, Gainesville, FL 32611-6120, USA

Keywords
Knowledge sharing, legacy system integration, knowledge capture, supply chain management, value added analysis

Background
This paper describes a set of enabling technologies to support integration of information stored in the heterogeneous legacy systems of the many firms on construction projects. Specifically, we provide a new approach, known as SEEK (Scalable Extraction of Enterprise Knowledge), that supports extraction and composition of knowledge across legacy sources that are heterogeneous both physically and semantically. SEEK is not meant to be a general-purpose extraction/composition toolkit. Rather, it supports extraction of specific and limited forms of knowledge. Current instantiations of SEEK extract knowledge supporting applications in construction scheduling and supply-chain management.

In particular, our approach to knowledge integration differs significantly from approaches based on data standards such as the IFC and AECXML. Practically, we believe that it is implausible that the potentially hundreds of firms in a project supply chain will uniformly subscribe to a single data standard or even a compatible set of standards.
This makes connecting to heterogeneous information systems and translating the semantically heterogeneous data in those systems dual challenges that must be overcome if knowledge stored in legacy systems is to be leveraged.

Beyond connection and extraction, there are challenges in knowledge composition. Raw data must often be transformed to a form suitable for decision-making. Consider, for example, that much of the data used for operations in firms is detailed in nature, often mimicking accounting details or a detailed work breakdown structure. This data is too detailed for many enterprise-level decision support and analysis tools (for example, quantitative supply chain models). In general, we must compose the data needed as input for analysis tools from data used by legacy applications that were developed for other purposes.

Objectives
SEEK is an attempt to overcome the collective challenges of assembling knowledge resident in numerous legacy information systems. Instantiation of SEEK will allow the implementation of enterprise-level decision support tools to improve construction performance. SEEK is designed with several aspects of scalability that promise:
- Rapid configuration with semi-automatic set-up: SEEK can be flexibly configured to accept a wide variety of legacy systems, removing the burden of set-up from the firm.
- Customization of specific instantiations of SEEK components through user configuration and tuning, assisted by domain experts or knowledge engineers.
- Composition of knowledge via analysis and processing of data extracted from the legacy source, allowing queries beyond those natively supported by the source.
- Protection of source-specific, proprietary knowledge by establishing a layer between the source and the enterprise/decision maker. (SEEK tools may be provided by a third party separate from the decision maker, encouraging adoption.)
- Extended capabilities through upgrades to the modular components of SEEK.
- Application to numerous domains, such as design, prototyping, test, manufacturing, and maintenance, via the basic SEEK modular architecture.

Methodology
A high-level view of the SEEK architecture is shown in Figure 1. SEEK provides a middleware layer that bridges the gap between legacy information sources and decision makers or decision support tools employing information from legacy systems. SEEK thus follows established mediation/wrapper methodologies.

[Figure 1 labels: applications/decision support; end users and decision support; SEEK analysis module, knowledge extraction module, and wrapper; source expert; executive; sources (legacy data and systems). Legend: run-time/operational data flow; build-time/set-up/tuning data flow.]
Figure 1: Schematic diagram of SEEK logical architecture.

abstract for ITCON – Special Edition on Knowledge Management in Construction, University of Florida

Novel aspects of SEEK include an analysis module and a knowledge extraction module integrated with the legacy data interface. The analysis module supports advanced processing of data extracted from legacy sources by wrappers, further supporting composition of knowledge required by decision makers or support tools. The knowledge extraction module directs setup of the analysis module and wrapper, thus supporting (a) automatic connection of legacy sources with decision makers, and (b) fine-tuning of the knowledge extraction process by domain experts and knowledge engineers or managers. In practice, the knowledge extraction and analysis modules significantly extend existing wrappers by (1) accessing knowledge encoded in applications as database schemas, business rules, high-level code, etc., rather than accessing data only; (2) supporting discovery of operational knowledge with customizable templates; and (3) integrating machine learning techniques to simplify and speed up customization of the knowledge extraction templates.
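As an illustration of the layering described above, the following minimal Python sketch separates a source-specific wrapper from a source-independent analysis function. It is not the SEEK implementation: the table `task_assignment`, the class `SqlWrapper`, the function `weekly_availability`, the sample rows, and the 40-hour weekly capacity are all assumptions made for this example, and an in-memory SQLite database stands in for a legacy source.

```python
import sqlite3
from collections import defaultdict


def make_legacy_source():
    """Build a stand-in legacy source (hypothetical schema and made-up rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE task_assignment "
        "(task TEXT, resource TEXT, week INTEGER, hours REAL)"
    )
    conn.executemany(
        "INSERT INTO task_assignment VALUES (?, ?, ?, ?)",
        [
            ("pour slab", "crew A", 1, 20.0),
            ("form walls", "crew A", 1, 10.0),
            ("form walls", "crew A", 2, 20.0),
            ("rough-in", "crew B", 2, 40.0),
        ],
    )
    return conn


class SqlWrapper:
    """Source-specific part: knows the schema, returns neutral records."""

    QUERY = "SELECT resource, week, hours FROM task_assignment"

    def __init__(self, conn):
        self.conn = conn

    def extract(self):
        # Translate source rows into a source-independent record format.
        return [
            {"resource": resource, "week": week, "hours": hours}
            for resource, week, hours in self.conn.execute(self.QUERY)
        ]


def weekly_availability(records, capacity_hours=40.0):
    """Source-independent part: compose task-level detail into availability.

    The weekly capacity is an assumed constant, not taken from the paper.
    """
    committed = defaultdict(float)
    for record in records:
        committed[(record["resource"], record["week"])] += record["hours"]
    return {key: capacity_hours - hours for key, hours in committed.items()}


if __name__ == "__main__":
    wrapper = SqlWrapper(make_legacy_source())
    for key, hours in sorted(weekly_availability(wrapper.extract()).items()):
        print(key, hours)
```

In this sketch only the aggregated availability figures would leave the source side, while the task-level rows stay behind the wrapper. Connecting to a source with a different schema would mean changing `SqlWrapper` alone.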
Since the wrapper and analysis tools are integrated in SEEK, the amount of data transferred between legacy sources and decision makers can be reduced. We also note that security of information exchange can be improved, since the extracted information can be filtered and summarized in the analysis module prior to transmission over the enterprise network.

Results
Our Phase-I development includes initial prototype design, implementation, and validation. Fig. 2 shows the prototype architecture (an elaboration of the high-level view of Fig. 1). The wrapper works as follows. Requests from the hub are represented in XML using the global schema defined in the hub. Based on the given input data and the desired output described in the query, the wrapper selects a suitable execution plan using a library of customized analysis functions for decision support and planning.

[Figure 2 labels: SEEK decision-support request; send/receive (sockets); query XML doc; XML parser; DTD for hub schema; query template (XML doc); parsed query template; parsed XML query DOM; signature and given values matcher; invalid signature; analysis library; invoke wrapper; extractor; SQL wrapper (SQL query library); MS Project (JDBC/ODBC); result set for the SQL query; intermediate result set; invoke analysis; analysis module; final result; XML result generator; full result XML doc. Component groups: decision-support-specific, SEEK-general, and SEEK firm-specific modules; firm data.]
Figure 2: Conceptual Overview of the SEEK Prototype Architecture.

To maximize scalability, the wrapper components are carefully separated into three groups: (1) decision support tool-specific components (shown as white boxes in Fig. 2), (2) SEEK general components (dark shaded boxes), and (3) SEEK firm-specific components (light shaded boxes).
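The request-handling steps described above (parse the XML request, match the given inputs and desired output against a library of analysis functions, and return an XML result) might be sketched as follows. This is a hedged illustration rather than the prototype's code: the element names `query`, `given`, `output`, `result`, and `error`, the functions `available_hours` and `total_hours`, the sample rows, and the 40-hour capacity are all hypothetical, and the hub's actual schema is not reproduced here.

```python
import xml.etree.ElementTree as ET

# Intermediate result set as an extractor might return it (made-up data).
ROWS = [
    {"resource": "crew A", "week": 1, "hours": 30.0},
    {"resource": "crew A", "week": 2, "hours": 20.0},
]


def available_hours(given, rows, capacity=40.0):
    """Hypothetical analysis function; the capacity is an assumed constant."""
    used = sum(
        row["hours"]
        for row in rows
        if row["resource"] == given["resource"] and row["week"] == int(given["week"])
    )
    return capacity - used


def total_hours(given, rows):
    """Hypothetical analysis function: all hours committed for a resource."""
    return sum(row["hours"] for row in rows if row["resource"] == given["resource"])


# A signature is (names of the given inputs, name of the desired output).
ANALYSIS_LIBRARY = {
    (frozenset({"resource", "week"}), "available_hours"): available_hours,
    (frozenset({"resource"}), "total_hours"): total_hours,
}


def parse_request(xml_text):
    """Read the given values and the requested output from an XML request."""
    root = ET.fromstring(xml_text)
    given = {element.get("name"): element.text for element in root.findall("given")}
    output = root.find("output").get("name")
    return given, output


def handle_request(xml_text, rows):
    """Match the request signature, run the analysis, return an XML string."""
    given, output = parse_request(xml_text)
    function = ANALYSIS_LIBRARY.get((frozenset(given), output))
    if function is None:
        error = ET.Element("error")
        error.text = "invalid signature"
        return ET.tostring(error, encoding="unicode")
    result = ET.Element("result", {"name": output})
    result.text = str(function(given, rows))
    return ET.tostring(result, encoding="unicode")


if __name__ == "__main__":
    request = (
        '<query><given name="resource">crew A</given>'
        '<given name="week">1</given><output name="available_hours"/></query>'
    )
    print(handle_request(request, ROWS))
```

Keying the library by signature means that a new analysis function can be registered without touching the parsing or result-generation code, which is one way to read the separation into tool-specific, general, and firm-specific components.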
The modular architecture supports incremental code enhancement to specific modules without impacting other modules. For example, we separated the generation of firm-specific Structured Query Language (SQL) queries in the wrapper module from the code in the analysis module that uses an internal data format to represent data or requests. Hence, when the wrapper connects to a firm that supports a different data model or query interface (e.g., PERL), only the wrapper module needs to be modified. This setup is accomplished semi-automatically via the knowledge extraction module (not shown in Fig. 2).

Figure 3 contains a screenshot of our SEEK prototype that depicts the results of a query about resource availability over time. The source being queried is an MS Project application.

[Figure 3 panel title: Subcontractor Resource Availability.]
Figure 3: Snapshot of our SEEK prototype, illustrating a query about resource availability. (Note that we have not invested in interface design, as SEEK will likely exist between applications and data sources, as shown in Figure 1.)

Acknowledgements
This material is based upon work supported by the National Science Foundation under grant number CMS-0075407.