Frameworks for Internet Content Extraction, Aggregation, and Personalization

A Position Paper for the OOPSLA 2000 Workshop on Enterprise Frameworks: Adequacies and Inadequacies

By Jagdish Bansiya
California State University, Hayward
jbansiya@csuhayward.edu

Abstract

In today’s rapidly growing Internet content environment, corporations and individuals need to manage and react to large volumes of information more rapidly than ever. To manage and use this information successfully, they need it aggregated and personalized to their business and personal needs. According to a Gartner Group research publication, Internet and intranet content is doubling every 2-3 months. Managing this overwhelming growth in content, and delivering relevant, aggregated, and personalized information to businesses and users, is driving the creation of suites of new software solutions that rely heavily on enterprise framework technology. These solutions provide flexible information management frameworks for structured and unstructured content extraction (retrieving relevant snippets of information from a large document), content aggregation (putting together multiple snippets of content to construct an aggregated document), and content delivery (making the aggregated documents available for use) to a host of enterprise content and business applications. The objective of these framework solutions is to automate content extraction and management, followed by delivery of aggregated and transformed information content to businesses, applications, and individual users. This position paper describes experiences in using enterprise framework technology to design and develop solutions for Internet content extraction, aggregation, and delivery.

1. Introduction

In the Internet and information age, knowledge has become the competitive advantage; businesses and individuals know that the key to success lies in their ability to handle large volumes of information rapidly and accurately and to make decisions based on that information. The volume and sources of information have exploded with the tremendous growth of web content, the availability of legacy repositories, and the deployment of new applications. Furthermore, the information that resides at these different data sources is presented in multiple formats and stored in numerous repositories, each with its own access methods. Additionally, each enterprise content management application has its own taxonomy for information usage and workflow management. This creates a substantial hurdle in building enterprise-wide and inter-enterprise information management frameworks.

Current information content management frameworks are not optimized for content extraction and delivery of personalized content to applications and users. Most information management systems are built for specific applications and require substantial rework if they are to extract, aggregate, and personalize information for businesses and users. Moreover, extraction, aggregation, and personalization must be driven by both enterprise-defined and user-defined rules.

2. Solving the Problem

Several key requirements must be addressed to successfully extract information, aggregate information, and deliver personalized information.

A significant part of Internet and intranet content is in the form of HTML-based web pages. There is therefore a need to selectively identify (i.e., mark up) snippets of web objects that are of business and personal value, and then to automatically extract the information content corresponding to the selected web objects.
This requires the development of a content markup language that uses characteristics and features of selected objects to generate an abstract description of the selected (desired) content. Executing the abstract description of desired content against web resources (pages) results in the extraction of web content that closely matches the content of interest. An XML-based, framework-driven markup language has been developed for this purpose.

Snippets of information collected from the Internet and other data sources become valuable when the content can be aggregated and business rules can be attached to make the information actionable. To ensure that individual snippets of content are automatically or semi-automatically actionable when events take place, the frameworks must support the establishment of dynamic information filtering and event response rules. The frameworks should also support the integration of custom filtering and event response plug-ins.

As the number of applications and devices that need aggregated and personalized information grows and the types of connectivity multiply, the frameworks should support highly individualized event notifications, appropriately formatted for any application and any device.

Lastly, the frameworks should be flexible enough to adapt quickly to continual changes in the capabilities and needs of applications and users, and should allow rapid reconfiguration without major programming effort or deployment lead time.

3. The OnePage Solution

OnePage.com is developing a suite of solutions that address the problems of Internet content extraction, aggregation, personalization, and content delivery to a host of business applications and individual users. The core of the solutions is built using a host of enterprise-level framework technologies. Frameworks have been developed for each of the following key parts of the solutions.
- An agent technology based on frameworks, designed for optimal acquisition of targeted data from any data source on the Internet, either on demand or through the execution of rules.
- A publish/subscribe framework technology that allows for active matching of complex, user-defined, event-based filters and for document handling.
- A document exchange framework technology that allows applications and users to define new document formats or transform document content into standardized or customized document structures.
- A framework for content delivery into any number of applications and devices, using technology adapters that understand the formats and protocols of the interfacing applications.

Figure 1 below shows the stacking of these framework technology solutions in the creation of a comprehensive Internet content extraction, management, and personalization solution.

[Figure 1: Framework Technologies Used in OnePage's Content Extraction, Aggregation, and Delivery Solution. Labels shown in the figure: Agents (Capture, Import, Subscribe); Enterprise Applications; Portal Applications; XML Content Delivery Framework; Publish/Subscribe Framework; Agent Framework; Document Server Framework; Model-View-Control Portal Framework; Object Model; XML DOM.]

4. Basic Dataflow

Figure 2 shows a data flow diagram that describes navigation to content data sources, content markup, extraction, aggregation, document management, personalization, and delivery. The following paragraphs describe the dataflow and the usage of the overall content management solution.

Users and applications navigate to documents at the data sources from which they want to extract content. The data sources that provide information content of interest include websites, legacy applications, databases, corporate file systems, and data feeds. Since a major part of the content is in the form of web pages available from websites, a customized web browser has been developed that can be used to navigate to web pages of interest.
Non-web-based data sources will typically need other types of applications to navigate to the data. Once at the desired data source, the navigating application provides a point-and-click mechanism to select and identify snippets of desirable information. The navigating application analyzes the characteristics and features of the selected object snippet and stores an abstracted representation of it in a database, along with additional meta-attributes about the object and the data source.

[Figure 2: Data Flow Diagram Describing Content Markup, Aggregation, Extraction, Management, Personalization, and Delivery. Labels shown in the figure: Data (Web sites, IBM 37XX); Navigate and Capture; OnePage Applications; Intelligent Agents; Publish/Subscribe System; Database; Transformation and Notification; Application Interfaces; Enterprise Applications. Annotations shown in the figure: "Browse to the desired data source and select objects of interest"; "Tools for composing, aggregating and personalizing snippets of information content"; "Execute aggregated descriptions of content snippets based on rules and events, that generate content documents"; "Stores and manages aggregated documents"; "Personalizes information content, changes content data format and presentation"; "Content management systems, Portals, Handheld devices such as Palm, Cellphones, etc."]

OnePage applications are used to aggregate and personalize snippets of content. The applications allow snippets to be grouped and business rules to be associated with them in order to sequence content extraction and aggregation. Applications and users can define content format and UI transformations. Aggregated document descriptions that represent one or more content snippets are called Windows. Content personalization attributes are also associated with the definition of a Window. Window descriptions are XML documents that are made persistent by storing them in a repository.

Intelligent agents provide the framework for executing groups of logically related tasks.
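To make the notion of a Window description concrete, the following is a minimal sketch in Java. The paper does not give the Window schema, so the XML layout here (the element names window, snippet, and personalization, and their attributes) is purely an assumption for illustration, as are the example URLs and the class name WindowSketch. Only the standard Java DOM parser is used.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch: element and attribute names are illustrative
// assumptions, not the actual OnePage Window schema.
final class WindowSketch {

    // An assumed Window description: two snippets and one personalization attribute.
    static final String WINDOW_XML =
        "<window name=\"morning-briefing\">"
      + "<snippet id=\"headlines\" source=\"http://news.example.com/\"/>"
      + "<snippet id=\"quotes\" source=\"http://stocks.example.com/\"/>"
      + "<personalization key=\"locale\" value=\"en_US\"/>"
      + "</window>";

    // Returns the source URL of every snippet listed in a Window description.
    static List<String> snippetSources(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList snippets =
                doc.getDocumentElement().getElementsByTagName("snippet");
            List<String> sources = new ArrayList<>();
            for (int i = 0; i < snippets.getLength(); i++) {
                sources.add(((Element) snippets.item(i)).getAttribute("source"));
            }
            return sources;
        } catch (Exception e) {
            // Parser configuration, I/O, and well-formedness errors all end up here.
            throw new IllegalArgumentException("not a well-formed Window description", e);
        }
    }

    public static void main(String[] args) {
        for (String source : snippetSources(WINDOW_XML)) {
            System.out.println(source);
        }
    }
}
```

Because the description is ordinary XML, it can be stored in a repository and read back by any component that needs to know which snippets make up the aggregated document.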
Tasks represent actionable execution and are constructed using Window definitions and scheduling parameters. Tasks are created and logically grouped using OnePage applications. Agents fetch the definitions of the tasks to execute and perform the task execution based on the schedule information included in the task description. Agents execute content extraction, sequence tasks, aggregate content from different snippets, and publish content documents to the publish-subscribe system for storage and management.

The publish-subscribe system stores and manages documents. It implements application and user authentication and document-level access control using Oracle’s LDAP local directory server. Requests for document retrieval, modification, and creation are handled using a hierarchical access control model that is based on user roles and privileges. Static document content is stored in relational database tables with capabilities for attribute-based search and direct content indexing.

The final step involves delivering aggregated content documents to applications and users. Requests for content documents are either initiated by applications and users (pulled) or pushed by the publish-subscribe system based on events. Requests for documents can include specifications for content transformation to other formats (XML, WML, HTML) and for presentation-related transformations. Personalization of document content is also executed on content delivery. Content delivery into enterprise applications requires developing adapters that interface with, and stream content to, the integrating enterprise applications. Adapters have been developed to stream aggregated and personalized content into portal applications, Microsoft Digital Dashboard, and enterprise content management systems such as Vignette and BroadVision.

5. Experiences with the use of framework technology

The solutions described above have all been built using framework technology.
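One way to picture the framework style behind the publish-subscribe step of Section 4 is a small hub that accepts filter plug-ins and pushes each published document to the subscribers whose filters match. The sketch below is only an illustration under stated assumptions: the names PubSubSketch, ContentDocument, subscribe, and publish are hypothetical, the filter is a plain Java predicate, and nothing here is taken from the OnePage implementation (which also handles storage, authentication, and access control).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Hypothetical sketch of a publish-subscribe hub with plug-in filters.
// None of these class or method names come from the OnePage frameworks.
final class PubSubSketch {

    // A content document reduced to a title plus searchable attributes.
    static final class ContentDocument {
        final String title;
        final Map<String, String> attributes;

        ContentDocument(String title, Map<String, String> attributes) {
            this.title = title;
            this.attributes = attributes;
        }
    }

    // One subscriber: a filter plug-in and the action to run on a match.
    private static final class Subscription {
        final Predicate<ContentDocument> filter;
        final Consumer<ContentDocument> handler;

        Subscription(Predicate<ContentDocument> filter,
                     Consumer<ContentDocument> handler) {
            this.filter = filter;
            this.handler = handler;
        }
    }

    private final List<Subscription> subscriptions = new ArrayList<>();

    // Registers a user-defined filter together with its notification handler.
    void subscribe(Predicate<ContentDocument> filter,
                   Consumer<ContentDocument> handler) {
        subscriptions.add(new Subscription(filter, handler));
    }

    // Pushes the document to every subscriber whose filter accepts it
    // and returns how many subscribers were notified.
    int publish(ContentDocument doc) {
        int notified = 0;
        for (Subscription s : subscriptions) {
            if (s.filter.test(doc)) {
                s.handler.accept(doc);
                notified++;
            }
        }
        return notified;
    }

    public static void main(String[] args) {
        PubSubSketch hub = new PubSubSketch();
        hub.subscribe(d -> "finance".equals(d.attributes.get("topic")),
                      d -> System.out.println("push: " + d.title));
        hub.publish(new ContentDocument("Q3 results",
            Collections.singletonMap("topic", "finance")));
        hub.publish(new ContentDocument("Match report",
            Collections.singletonMap("topic", "sports")));
    }
}
```

Keeping the filter and the handler as interfaces is what lets custom filtering and event response plug-ins be added without changing the hub itself.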
Important characteristics of these framework-based solutions include:
1. 100% Java-based solutions built using J2EE.
2. XML-based document exchange technology with the capability to rapidly develop support for new document types within the exchange.
3. Highly scalable, stateless framework solutions that allow fail-over and load balancing to be handled using standard web farm architectures.
4. Co-branded and internationalized versions of the software solutions, made possible by a Model 2 framework implementation that separates presentation code from business code.
5. Access to all framework functionality using XML-SOAP (i.e., XML over HTTP), providing a distributed object implementation.

6. Benefits resulting from the use of framework technology

1. Scheduled development time was cut by about half. Projects planned for a 12-month development cycle were completed in about 6 months, yielding a significant cost benefit.
2. Integration with other enterprise framework solutions became significantly quicker because of the abstraction designed and implemented using the framework approach. Adapters developed for the integration translated the protocols of the two connecting frameworks at the lowest levels of data streaming and format handling.
3. Co-branding and internationalization of the software solutions and document content were made easy by the separation of business and presentation code.
4. Overall software development activities were streamlined and structured. Engineers were limited to using the framework application development styles, which resulted in consistency and uniformity in the code developed.

7. Limitations observed with the use of framework technology

1. Collaborative (i.e., parallel and concurrent) development of framework technologies was tedious and difficult to manage.
2. Off-the-shelf reusable software components were not readily available for use in framework technologies.
As a result, all parts of the framework solutions had to be developed and built in-house.
3. Framework development is still an evolutionary process. At least two reworks were required in each of the framework solutions to get them right. The first implementations were always found to have deficiencies and limitations that were corrected in a subsequent major rework.