Framework for Internet content extraction and Personalization

advertisement
Frameworks for Internet content extraction, aggregation, and
personalization
A Position Paper for the OOPSLA 2000 Workshop
On
Enterprise Frameworks: Adequacies and Inadequacies
By
Jagdish Bansiya
California State University, Hayward
jbansiya@csuhayward.edu
Abstract
In today’s rapidly growing Internet content environment, corporations, and individuals need to
manage and react to large volumes of information more rapidly than ever. For them to be
successful in managing and using the information, it should be aggregated and personalized to
their businesses and personal needs. According to a Gartner group research publication Internet
and Intranet content is doubling every 2-3 months. Managing this overwhelming growth in
content and delivering businesses and users relevant, aggregated, and personalized information
content is resulting in the creation of suites of new software solutions that heavily rely and
leverage from the use of enterprise framework technology.
These solutions provide flexible information management frameworks for structured and
unstructured content extraction (retrieving relevant snippets of information from a large
document), content aggregation (putting together multiple snippets of content to construct an
aggregated document), and content delivery (making the aggregated documents available for use)
to a host of enterprise content and business applications. The objectives of these framework
solutions is to automate content extraction and management followed by delivery of aggregated
and transformed information content to business, applications, and individual users.
This position paper describes the experiences in using enterprise framework technology in
designing and developing solutions for Internet content extraction, aggregation, and delivery.
1. Introduction
In the Internet and Information age, knowledge has become the competitive advantage; business
and individuals know that the key to success lies in their ability to handle large volumes of
information rapidly and accurately and make decisions based on the information. The volume
and sources of information have exploded with tremendous growth of web content, the
availability of legacy repositories, and deployment of new applications.
Furthermore, the information that resides at these different data sources is presented in multiple
formats and stored in numerous repositories with their own access methods. Additionally, each
enterprise content management application has its own taxonomy for information usage and
1
workflow management. This creates a substantial hurdle in building enterprise-wide and interenterprise information management frameworks.
Current information content management frameworks are not optimized for content extraction
and delivery of personalized content to applications and users. Most information management
systems are built for specific applications that require substantial rework if they are to extract,
aggregate, and personalize information for businesses and users. Moreover, these mechanisms
must be based on both enterprise and user-defined rules.
2. Solving the Problem
Several key requirements must be addressed to successfully extract information, aggregate
information, and deliver personalized information.
A significant part of the Internet/Intranet content is primarily in the form of HTML based web
pages. Therefore there is the need to be able to selectively identify (i.e. mark-up) snippets of web
objects that are of business and personal value and than be able to automatically extract the
information content corresponding to the selected web objects. This requires the development of
a content markup language that uses characteristics and features of selected objects to generate an
abstract description of the selected (desired) content. Executing the abstract description of
desired content against web resources (pages) results in the extraction of web content that closely
matches the content of interest. A XML represented and framework driven markup language has
been developed for this purpose.
Snippets of information collected from the Internet and other data sources become valuable when
the content can be aggregated and business rules can be attached to make the information
actionable. To ensure that individual snippets of content are automatically or semi-automatically
actionable when events take place, the frameworks must support the establishment of dynamic
information filtering and event response rules. The frameworks should also support the
integration of custom filtering and event response plug-ins.
As the number of applications and devices that need aggregated and personalized information
grows and the different types of connectivity increase, the frameworks should support highly
individualized event notifications appropriately formatted for any application and any device.
Lastly, the frameworks should be flexible enough to quickly adapt to the continual changes in the
capabilities and needs of applications and users, and deliver rapid reconfiguration functionality
without major programming or lead-time for deployment.
3. The OnePage Solution
OnePage.com is working on developing a suite of solutions that address the problems of Internet
content extract, aggregation, personalization, and content delivery to a host of business
applications and individual users.
The core of the solutions is built using a host of enterprise level frameworks technologies.
Frameworks have been developed for each of the following key parts of the solutions.
2




An agent technology based on frameworks has been designed from optimal acquisition of
targeted data upon demand or execution of rules from any data source on the Internet.
A publish/subscribe framework technology that allows for active matching of complex,
user-defined event based filters and document handling.
A document exchange framework technology that allows applications and users to define
new document formats or transform document content into standardized or customized
document structures.
A framework for content delivery into any number of applications and devices using
technology adapters that understand the formats and protocols of interfacing applications.
Figure 1 below shows the stacking of these various framework technology solutions in the
creation of a comprehension Internet content extraction, management, and personalization
solution.
Agents
(Capture, Import,
Subscribe)
Enterprise Applications
Portal Applications
XML Content Delivery Framework
Publish/Subscribe
Framework
Agent Framework
Document Server
Framework
ModelView
Control
Portal Framework
Object Model
XML DOM
Figure 1: Framework Technologies Used in OnePages
Content Extraction, Aggregation, and Delivery Solution
4. Basic Dataflow
Figure 2 shows a data flow diagram that describes navigation to content data sources, content
markup, extraction, aggregation, document management, personalization, and delivery. The
following paragraphs describe the dataflow and the usage of the overall content management
solution.
Users and applications navigate to documents of data sources from which they want to extract
content. The data sources that provide information content of interest include websites, legacy
applications, databases, corporate file systems, and data feeds. Since a major part of the content
is in the form of web pages available from websites, a customized web browser has been
developed that can be used to navigate to web pages of interest. Non-web based data sources will
typically need other types of application to navigate to the data. Once navigated to the desired
3
data source the navigating application allows for a point and click mechanism to select and
identify snippets of desirable information. The navigating application analyzes the characteristics
and features of the object snippet and stores an abstracted representation of the selected object
snippet in database along with additional meta attributes about the object and the data source.
Intelligent Agents
Navigate and Capture
Browse to the desired data
source and select objects
of interest
Data
WebSites
IBM 37XX
OnePage
Applications
Tools for composing,
aggregating and
personalizing snippets of
information content
Execute aggregated
descriptions of content
snippets based on rules
and events, that generate
content documents
Publish Subscribe
Systems
Stores and manages
aggregated documents.
Database
Enterprise
Applications
Transformation and
Notification
Content management
systems, Portals,
Handheld devices such as
Palm, Cellphones, etc.
Personalizes information
content, changes content
data format and
presentation
Application Interfaces
Figure 2: Data Flow Diagram Describing Content Markup, Aggregation,
Extraction, Management, Personalization, and Delievery
OnePage applications are used to aggregate and personalize snippets of content. The applications
allow for grouping snippets and associating business rules to sequence content extraction and
aggregation. Applications and users can define content format and UI transformations.
Aggregated document descriptions that represent one or more content snippets are called
Windows. Content personalization attributes are also associated with the definition of a Window.
Windows descriptions are XML documents that are made persistent by storing them in a
repository.
Intelligent agents provide the framework for executing groups of logically related tasks. Tasks
represent actionable execution and are constructed using Window definitions and scheduling
parameters. Tasks are created and logically grouped using OnePage applications. Agents fetch
the definitions of task to execute and perform the task execution based on the schedule
information included in the task description. Agents execute content extraction, sequence
tasking, aggregate content from different snippets, and publish content documents to the publishsubscribe system for storage and management.
4
The publish-subscribe system stores and manages documents. The publish-subscribe system
implements applications/user authentication and document level access control using Oracle’s
LDAP local directory server. Requests for document retrieval, modification, and creation are
handled using a hierarchical access control model that is based on user roles and privileges.
Static document content is stored in relational database tables with capabilities for attribute based
search and direct content indexing.
The final step involves delivering aggregated content documents to applications and users.
Request for content documents is either initiated by applications and users (pulled) or can be
pushed by the publish-subscribe system based on events. Request for documents can include
specifications for content transformation to other formats (XML, WML, HTML) and presentation
related transformations. Personalization of document content is also executed on content
delivery. Content delivery into enterprise applications requires the need to develop Adapters that
interface and stream content to the integrating enterprise applications. Adapters have been
developed to stream aggregated and personalized content into Portal applications, Microsoft
digital dashboard, and enterprise content management systems such as Vignette and Boardvision.
5. Experiences with the use of framework technology
The solutions described above have all been built using framework technology. Important
characteristics of these framework-based solutions include:
1. 100% Java based solutions built using J2EE.
2. XML based document exchange technology with the capabilities to rapidly develop new
document type support within the exchange.
3. Architecting a highly scalable and stateless framework solutions that allows for handling
fail-over and load balancing using standard web farm architectures.
4. Handling co-branded and internationalized versions of the software solutions made
possible by using a Model2 framework implementation that separates the handling of
presentation and business codes.
5. Access to all framework functionalities using XML-SOAP i.e. XML over HTTP
providing for a distributed object implementation.
6. Benefits resulting from the use of framework technology
1. Cut scheduled development time by about half. Projects planned for a 12-month
development cycle were completed in about 6 months. Therefore a significant cost
benefit was achieved.
2. Integration with other enterprise framework solutions became significantly quicker
because of the abstraction designed and implemented using the framework approach.
Adapters developed for the integration translated the protocols of the two connecting
frameworks at the lowest levels of data streaming and format handling.
3. Time required for co-branding and internationalization of the software solutions and
document content was made easy because of separation of business and presentation
code.
4. Streamlined and structured the overall software development activities. Engineers were
limited to using the framework application development styles, which resulted in
consistency and uniformity in the code developed.
5
7. Limitations observed with the use of framework technology
1. Collaborative (i.e. parallel and concurrent) development of framework technologies was
tedious and difficult to manage.
2. Off the shelf reusable software components were not readily available for use in
framework technologies. As a result all parts of the framework solutions had to be
developed and built in-house.
3. Framework development is still an evolutionary process. At least two reworks were
required in all of the framework solutions to get them right. The first implementations
were always found to have deficiencies and limitations that were corrected in a
subsequent major rework.
6
Download