Intelligent Information Source Selection for
The Context Interchange System
Kenneth C. Ng
Submitted to the Department of Electrical Engineering and Computer Science
in Partial Fulfillment of the Requirements of the Degrees of
Bachelor of Science in Computer Science and Engineering
and Master of Engineering in Electrical Engineering and Computer Science
at the Massachusetts Institute of Technology
May 21, 1999
The Context Interchange (COIN) project aims to develop tools and technologies for
supporting access to heterogeneous information sources on the Internet. Under the current
COIN framework, a user has to specify the sources when issuing a query. As the number
of information sources on the Internet increases rapidly, a user can choose from many
similar information sources. Users would like to send their queries to sources that are
most relevant to their contexts. This thesis looks into the problem of data quality and
source selection. We present an approach to source selection based on context mediation.
We show that context mediation can handle semantic conflicts, such as consistency and
timeliness issues, among various sources. For data discrepancies, we propose to adopt
the strategy of derived data, i.e. ontology mapping.
Thesis Supervisor: Stuart Madnick
Title: John Norris Maguire Professor of Information Technology,
Sloan School of Management
Thesis Supervisor: Michael Siegel
Title: Principal Research Scientist,
Sloan School of Management
Corporate information integration is essential for business analyses in many fields
such as consulting and finance. Management consultants need to understand the
competition and the dynamics of an industry. Equity research analysts have to value
companies based on fundamentals and market information. An integrated approach to
obtaining corporate information would result in tremendous time and cost savings.
The advent of the Internet extended the frontier of information integration. Information
sources in different parts of the world are now just a few mouse clicks away. However,
the inherent semantic heterogeneity in human languages is a major obstacle for correct
interpretation of information. Different sources operate in different contexts. For
example, a US customer visiting the web-site of an online bookshop in the UK might
misinterpret the prices in pounds to be prices in dollars. When the currency is not stated
clearly, people would naturally assume the prices are in their local currencies. In other
words, people assume they are within the same contexts as the information source.
Context differences could lead to large-scale semantic heterogeneity and hinder the
information integration process.
The COntext INterchange (COIN) project aims to develop tools and technologies for
supporting access to heterogeneous information sources. The heart of the COIN
framework is the notion of context, which determines the underlying meaning and
interpretation of information. The contexts are collections of statements defining how
data should be interpreted and how potential conflicts might be resolved.
The COIN framework is designed to work with a wide range of distributed
information systems. A natural extension of the COIN framework is the integration of
information on the Internet with context mediation. The number of information sources
on the Internet is growing rapidly. People can obtain detailed
information on virtually every topic: from microbiology to astrophysics, and from
Chinese history to American technologies.
1.1.1 Data Quality
As the number of information sources on the Internet increases, a user can typically
choose from many similar information sources. Due to time and cost considerations,
people would only want to query the most appropriate sources. This can be achieved
either by the user or by an automated process. Hence, we assume multiple information
sources with partially overlapping content that can be uniformly queried by some system.
The sources can be document collections or Internet web sites. The information sources
show a varying quality of information and a varying quality of access, i.e. varying query
Querying information from Internet sources is usually divided into three tasks:
Source selection, i.e. choosing the best possible information sources to evaluate a
Query evaluation at the sources
Merging the query results [Gravano, Chang, and Garcia-Molina 1997]
Given a query and a set of information sources that are capable of answering the
query to some extent, we address the problem of deciding which of the sources to issue
the query upon. This decision should be based on data quality. Without considering
quality, a system might return an answer that is useless, inaccurate or incomplete.
Among the data quality dimensions suggested by Wang and Strong [1996], we
choose four criteria that are most relevant to context mediation: accuracy, completeness,
consistency, and timeliness. There is no exact definition for accuracy. From the literature, it
appears that accuracy is viewed as equivalent to correctness. We consider data complete when
all values required by the user for a certain variable are recorded. Consistency is
related to the values of data. A data value can only be expected to be the same for the
same situation. If different values for the same piece of data under the same situation are
reported by different sources, then these sources are inconsistent. Timeliness has been
defined in terms of whether the data is out of date [Ballou and Pazer 1985] and
availability of output on time [Kriebel 1979].
1.1.2 Source Selection Problem
Source selection is usually handled in a straightforward manner by a selector component
that analyzes source capabilities and source contents. Matching the query against
the capabilities of the sources determines which combinations of sources are capable of
answering the query. Matching the query against the source contents determines the
sources that will likely provide the most and the most relevant information. This
technique relies on statistical information giving the total number of appearances of each
distinct word, the document frequency for each word, and the total number of documents
in the source. With this information, the appropriateness of each source for evaluating the
query can be estimated. The sources with the highest estimates are then chosen to be
queried. An information source is thus considered to be appropriate if the keywords of a
query appear often and in many documents.
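The statistical matching described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SourceStats` summary, the function names, and the scoring rule (share of documents containing a word, multiplied by the average number of appearances per matching document) are hypothetical and are not taken from any particular selection system.

```python
from dataclasses import dataclass, field


@dataclass
class SourceStats:
    """Hypothetical per-source summary holding the three statistics named in the text."""
    name: str
    num_docs: int                                   # total number of documents in the source
    term_count: dict = field(default_factory=dict)  # word -> total number of appearances
    doc_freq: dict = field(default_factory=dict)    # word -> number of documents containing it


def estimate(source, query_words):
    """Assumed scoring rule: reward words that appear often and in many documents."""
    score = 0.0
    for word in query_words:
        df = source.doc_freq.get(word, 0)
        tc = source.term_count.get(word, 0)
        if df == 0 or source.num_docs == 0:
            continue  # the word never appears in this source
        coverage = df / source.num_docs   # share of documents that mention the word
        density = tc / df                 # average appearances per matching document
        score += coverage * density
    return score


def select_sources(sources, query_words, k=1):
    """Return the names of the k sources with the highest estimates."""
    ranked = sorted(sources, key=lambda s: estimate(s, query_words), reverse=True)
    return [s.name for s in ranked[:k]]
```

A score of this kind looks only at keyword statistics, which is exactly the limitation discussed next.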
We believe there is more to finding the best sources than just counting the
appearances of certain key words. Consider a source containing not only text documents
but explanatory graphics on the subject. Should this source be valued higher than the one
without graphics? Consider a source that matches the query well but is very concise,
hardly containing anything but the key words of the query. Should this document be
downloaded? Finally, consider a source containing matching but outdated information. Is
such a source worth exploring at all? Quality of a source and the quality of the documents
it contains must be measured with more than one criterion or quality dimension.
Another concern is the context of the user. Data with quality considered appropriate
for one use may not possess sufficient quality for another use. For example, a financial
analyst would like to obtain the geographical breakdown of sales of a company while a
day-trader only wants to know the last quoted price of the company's stock. In this case,
any one of the stock quote servers on the Internet would probably satisfy the day-trader. The
financial analyst, however, might have to dig up the annual report or SEC filings of the
company for the geographical breakdown of sales.
Two main problems arise when the COIN model is applied to sources of unknown
quality. First, there is the problem of semantic conflicts. Different sources may have
different representations for the same piece of data. Second, there is the problem of data
discrepancies. Sources sometimes provide conflicting values for the same piece of data
with no apparent semantic conflicts.
We illustrate the problem of source selection with a scenario from personal
investing. Researching equity stocks is a very tedious process. Investors might want to
obtain key financial ratios of various companies. For example, they would need to
compare valuations using Price/Earnings ratios (P/E ratios). However, different Internet
sources might give different P/E ratios for the same company. In this case, we
want to know the reason behind the inconsistency. Is it due to semantic conflicts? Or
is it due to data discrepancies? We show how context mediation can help the
investor obtain the information from the source that best fits his context.
Goal of Thesis
The goal of this thesis is to present an approach to source selection based on context
mediation. We show that context mediation can handle semantic conflicts, such as
consistency and timeliness issues, among various sources. For data discrepancies, we
propose to adopt the strategy of derived data, i.e. ontology mapping.
Structure of Thesis
In chapter 2, we first give a brief introduction to the COIN model on which further
extensions would be based. Our focus is on the context mediation process.
In chapter 3, we give details of the context mediation process, defining the domain
model, elevation axioms and the mediation system. Also, we discuss the limitations of the
current context mediator when dealing with data conflicts.
In chapter 4, we discuss the meanings of data quality based on previous research in
this area. There are four main quality criteria that we will focus on: accuracy,
completeness, consistency, and timeliness. Then, we move on to present a literature
review on source selection approaches.
In chapter 5, we define the problem of our thesis by pointing out that most of the
source selection processes have ignored the contexts of data providers and data
consumers. In order to illustrate the source selection concretely, we present two scenarios
that highlight the problem. There are two main types of conflicts among information
sources: semantic conflicts and data discrepancies.
In chapter 6, we introduce the approach of source selection based on context
mediation. We show that context mediation can handle semantic conflicts, such as
consistency issues and timeliness issues, among various sources. For data discrepancies,
we propose to adopt the strategy of derived data, i.e. ontology mapping. At the end of the
chapter, we will leave open-ended questions for further research.
In chapter 7, we conclude the thesis with our research findings and proposed
2 Background
Context Interchange System
The COntext INterchange (COIN) strategy seeks to address the problem of semantic
interoperability by consolidating distributed data sources and providing a unified view to
them. COIN technology presents all data sources as SQL databases by providing generic
wrappers for them. The underlying integration strategy, the COIN model, defines a novel
approach for mediated data access in which semantic conflicts among heterogeneous
systems are automatically detected and reconciled by the Context Mediator. As a result,
the COIN approach integrates disparate data sources by providing semantic
interoperability (the ability to exchange data meaningfully) among them [Bressan et al.
The COIN Framework
The COIN framework is composed of both a data model and a logical language,
COINL, derived from the family of F-Logic [Kifer, Lausen, and Wu 1995]. The data
model and language define the domain model of the receiver, data source, and the
contexts [McCarthy 1987] associated with them. The data model contains the definitions
for the "types" of information units (called semantic-types) that constitute a common
vocabulary for capturing the semantics of data in disparate systems. Contexts, associated
with both information sources and receivers, are collections of statements defining how
data should be interpreted and how potential conflicts (differences in the interpretation)
should be resolved. Concepts such as semantic-objects, attributes, modifiers, and
conversion functions define the semantics of data inside and across contexts. Together
with the deductive and object-oriented features inherited from F-Logic, the COIN data
model and COINL constitute an expressive framework for representing semantic
knowledge and reasoning about semantic heterogeneity.
2.1.2 Context Mediator
The Context Mediator is the heart of the COIN project. It is the unit that provides
mediation for user queries. Mediation is the process of rewriting queries posed in the
receiver's context into a set of mediated queries where all potential conflicts are explicitly
solved. This process is based on an abduction procedure that determines what
information is needed to answer the query and how conflicts should be resolved by using
the axioms in the different contexts involved. Answers generated by the mediation unit
can be both extensional and intensional. Extensional answers correspond to the actual
data retrieved from the various sources involved. Intensional answers, on the other hand,
provide only a characterization of the extensional answer without actually retrieving data
from the data sources. Furthermore, the mediation process supports queries on the
semantics of data that are implicit in the different systems. These are referred to as
knowledge-level queries as opposed to data-level queries that are inquiries on the factual
data present in data sources. Finally, integrity knowledge on one source or across sources
can be naturally involved in the mediation process to improve the quality and information
content of the mediated queries and ultimately aid in the optimization of the data access.
2.1.3 System Perspective
From a systems perspective, the COIN strategy combines the best features of the
loose- and tight-coupling approaches to semantic interoperability among autonomous and
heterogeneous systems. Its modular design and implementation funnel the complexity of
the system into manageable chunks, enable sources and receivers to remain loosely
coupled to one another, and sustain an infrastructure for data integration.
This modularity, both in the components and the protocol, also keeps our
infrastructure scalable, extensible, and accessible. By scalability, we mean that the
complexity of creating and administering the mediation services does not increase
exponentially with the number of participating sources and receivers. Extensibility refers
to the ability to incorporate changes into the system in a graceful manner; in particular,
local changes do not have adverse effects on other parts of the system. Finally,
accessibility refers to how a user perceives the system in terms of its ease of use and its
flexibility in supporting a variety of queries.
2.1.4 Application Domains
The COIN technology can be applied to a variety of scenarios where information
needs to be shared amongst heterogeneous services and receivers. The need for this novel
technology in the integration of disparate data sources can be readily seen in the
following examples.
A useful application of the COIN technology is in the financial domain. The COIN
model could assist financial analysts in conducting research and valuing companies. The
technology is particularly valuable when comparing companies across borders. Different
countries have different reporting requirements and accounting standards. Furthermore,
each country has its own financial information providers. All this information might be
presented in vastly different formats and of different qualities. Some major discrepancies
are due to scale-factors and currency representations. The COIN framework could help
resolve the semantic heterogeneity among various sources and aid in financial decision
In the domain of manufacturing inventory control, the ability to access design,
engineering, manufacturing, and inventory data pertaining to all parts, components, and
assemblies is vital to any large manufacturing process. Typically, thousands of
contractors play roles and each contractor tends to set up its data in its own individualistic
manner. COIN technology can play an important role in standardizing various formats
the contractors follow in their bids. This would help managers optimize inventory levels
and ensure overall productivity and effectiveness.
Finally, the modern health care enterprise lies at the nexus of several different
industries and institutions. Within a single hospital, different departments (e.g. internal
medicine, medical records, pharmacy, admitting, billing) maintain separate information
systems yet must share data in order to ensure high levels of care. Medical centers and
local clinics not only collaborate with one another but also with State and Federal
regulators, insurance companies, and other payer institutions. This sharing requires
reconciling differences such as those of procedure codes, medical supplies, classification
schemes, and patient records.
COIN Architecture
The feasibility and features of the proposed strategy of using context mediation to
solve semantic differences between various heterogeneous data sources are demonstrated
in a working system that provides mediated access to both on-line structured databases
and semi-structured data sources such as web sites. This demonstration system implements
most of the important concepts of the context interchange strategy and is called the COIN
system. This section introduces the COIN system and its high level architecture. The
infrastructure leverages the World Wide Web in two ways. First, we rely on the
hypertext transfer protocol for the physical connectivity among sources and receivers and
the different mediation components and services. Second, we employ the hypertext
markup language and Java for the development of portable user interfaces. Figure 1
shows an overview of the COIN architecture [Bressan et al. 1997].
Figure 1: COIN architecture overview [Bressan et al. 1997].
2.2.1 Client Processes
Client processes provide the interaction with receivers and route all database requests
to the Context Mediator. An example of a client process is the multi-database browser,
which provides a point-and-click interface for formulating queries to multiple sources
and for displaying the answers obtained. Specifically, any application program that posts
queries to one or more sources can be considered a client process. This can include all the
programs (e.g. desktop applications such as Excel or Access) that can
communicate using the ODBC bridge to send SQL queries and receive results.
2.2.2 Server Processes
Server processes refer to database gateways and wrappers. Database gateways
provide physical connectivity to databases on a network. The goal is to insulate the
Mediator Process from the idiosyncrasies of different database management systems by
providing a uniform protocol for database access as well as a canonical query language
(and data model) for formulating the queries. Wrappers, on the other hand, provide richer
functionality by allowing semi-structured documents on the World Wide Web to be
queried as if they were relational databases. This is accomplished by defining an export
schema for each of these web sites and describing how attribute-values can be extracted
from a web site using pattern matching.
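The wrapper idea, an export schema plus pattern matching, can be sketched as follows. This is an illustration only: the page markup, the regular expression, and the sample values are hypothetical, and the relation name `quotes` with attributes `cname` and `last` is borrowed from the queries used later in this thesis rather than from the actual wrapper specification.

```python
import re

# Hypothetical export schema for a stock-quote page: relation name and attributes.
EXPORT_SCHEMA = {"relation": "quotes", "attributes": ["cname", "last"]}

# Assumed page layout: one table row per company, e.g.
# <tr><td>DAIMLER-BENZ</td><td>97.25</td></tr>
ROW_PATTERN = re.compile(
    r"<tr><td>(?P<cname>[^<]+)</td><td>(?P<last>[0-9.]+)</td></tr>"
)


def extract_tuples(html):
    """Turn every matching row of the page into a relational tuple (a dict)."""
    rows = []
    for match in ROW_PATTERN.finditer(html):
        rows.append({
            "cname": match.group("cname").strip(),
            "last": float(match.group("last")),
        })
    return rows


def select(rows, attribute, value):
    """Evaluate a simple equality selection over the extracted tuples."""
    return [row for row in rows if row[attribute] == value]
```

Once the rows are extracted, a condition such as `cname = "ACME"` can be evaluated exactly as it would be against a relational table, which is what lets the mediator treat the web site as one more database.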
2.2.3 Mediator Processes
Mediator processes refer to the system components that collectively provide the
mediation services. These include the SQL-to-datalog compiler, the context mediator, the query
planner/optimizer, and the multi-database executioner. The SQL-to-datalog compiler translates a
SQL query into its corresponding datalog format. The context mediator rewrites the
user-provided query into a mediated query with all the conflicts resolved. The
planner/optimizer produces a query evaluation plan based on the mediated query. The
multi-database executioner executes the query plan generated by the planner. It
dispatches subqueries to the server processes, collates the intermediary results, and
returns the final answer to the client processes.
In our discussion, we focus on the mediator processes. In the next chapter, we
introduce context mediation with a scenario of an equity research analyst gathering
information from a variety of sources. Then, we elaborate on that example and
explain the subsystems within the context mediator in detail. Finally, we identify
the potential problems of poor information quality in context mediation.
Context Mediation
We introduce context mediation with an example of an equity research analyst
researching Daimler Benz. He would like to find out the net income, net sales, and
total assets of Daimler Benz for the fiscal year ending 1993. He normally uses the
financial data in the database Worldscope. However, for this particular research exercise,
Worldscope does not have all the information he needs. He learns from his colleagues
about two new databases, Datastream and Disclosure. He hopes to obtain all the
necessary information from the three databases. He starts with the Worldscope database,
which has total assets for all the companies. He logs into the Oracle server containing the
Worldscope data and issues a query:
select companyname, totalassets
from worldscope
where companyname = "DAIMLER-BENZ AG";
The result returned is:
Figure 2: Scenario for context mediation (labels: Total Assets, Net Income, Net Sales).
The analyst continues to look for net income and net sales figures for Daimler Benz.
He realizes that Disclosure has net income data whereas Datastream has net sales data.
For net income, he issues the query:
select company name,
from disclosure
where companyname = "DAIMLER-BENZ
The query does not return any records. He checks for typos and tries again as he is
pretty sure that Disclosure has the information. He refines the query by entering the
partial name for Daimler Benz:
select companyname, netincome
from disclosure
where companyname like "DAIMLER%";
The result returned is:
He then realizes that the data sources do not conform to the same standards. The
failure of the initial query is due to different representations of company names in the two
databases. Finally, he issues the query for net sales:
select name, totalsales
from datastream
where name like "DAIMLER%";
The result returned is:
After obtaining all the information needed, he begins to analyze the numbers.
However, there are a number of things unusual for this data set. First, the total sales are
twice as much as the total assets of the company, which is quite unlikely for a company
like Daimler Benz. Another disturbing phenomenon is that net income is almost ten times
as much as total assets. He immediately notices that something is wrong and tries to solve
the mystery by digging into the fact sheets of the databases. Upon a detailed examination,
he finds some interesting facts about the data. First, there is a difference in scale factors.
Datastream has a scale factor of 1000 for all the financial amounts, while Disclosure uses
a scale factor of one. Second, there is a difference in currency denominations. Both
Datastream and Disclosure use the currency of the country of incorporation, while
Worldscope uses a scale factor of 1000 and reports every number in USD. After recognizing
the semantic differences of the data sources, he needs a data source of historical
currency exchange rates in order to reconcile the differences.
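The arithmetic the analyst would have to carry out by hand can be sketched as follows. This is a minimal illustration, not part of the COIN system: the `Context` class and the `convert` function are hypothetical, "DEM" assumes a company incorporated in Germany, and the exchange rate is a placeholder rather than a historical figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Context:
    scale_factor: int   # reported number times scale_factor gives the amount in currency units
    currency: str


# Contexts as described in the text; "DEM" is an assumption for a German company.
WORLDSCOPE = Context(scale_factor=1000, currency="USD")
DATASTREAM = Context(scale_factor=1000, currency="DEM")
DISCLOSURE = Context(scale_factor=1, currency="DEM")

# Placeholder exchange rate, NOT a historical figure. A real system would look the
# rate up for the relevant date in a source of historical exchange rates.
RATES = {("DEM", "USD"): 0.5}


def convert(value, source, target):
    """Re-express a value reported in the source context in the target context."""
    amount = value * source.scale_factor          # undo the source scale factor
    if source.currency != target.currency:
        amount *= RATES[(source.currency, target.currency)]
    return amount / target.scale_factor           # apply the target scale factor
```

With these placeholder numbers, a Disclosure figure of 2,000,000 and a Datastream figure of 3,000 become 1,000 and 1,500 in the Worldscope context, which shows how raw figures from the three databases can differ by orders of magnitude without being wrong.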
With context mediation, the system automatically detects and resolves the semantic
conflicts between all the data sources. The results would be presented in the format that
the user is familiar with. In the above scenario, if the equity research analyst were using
a context mediation system instead, all he would have to do is formulate and issue a single
query, despite the underlying semantic differences among the data sources. For
example, if he wants the result returned in the Worldscope context, then he would
issue the following query to the context mediation system:
select worldscope.totalassets,
datastream.totalsales, disclosure.netincome
from worldscope, datastream, disclosure
where worldscope.companyname = "DAIMLER-BENZ AG" and
datastream.asofdate = "01/05/94" and
worldscope.companyname = datastream.name and
worldscope.companyname = disclosure.companyname;
The system basically detects all the conflicts that the analyst has to deal with, and
resolves these conflicts without the analyst's explicit intervention.
In the next section, we will discuss the concept of a domain model based on the
above scenario.
Domain Model
A domain model specifies the semantics of the "types" of information units, which
constitutes a common vocabulary used in capturing the semantics of data in disparate
sources. In other words, it defines the ontology that will be used. The various semantic
types, the type hierarchy, and the type signatures (for attributes and modifiers) are all
defined in the domain model. Types in the generalized hierarchy are rooted to system
types, i.e. types native to the underlying system such as integers, strings, real numbers etc
[Shah 1998].
Figure 3: Financial Domain Model [Shah 1998].
Inheritance: This is the classic type of inheritance relationship. All semantic types
inherit from basic system types. In the domain model, type companyFinancials
inherits from basic type string.
Attributes: In the COIN framework, objects have two forms of properties, those which are
structural properties of the underlying data source and those that encapsulate the
underlying assumptions about a particular piece of data. Attributes access structural
properties of the semantic object in question. For instance, the semantic type
companyFinancials has two attributes, company and fyEnding. Intuitively,
these attributes define a relationship between objects of the corresponding semantic
types. The relationship formed by the company attribute states that for any company
financial in question, there must be a corresponding company to which that company
financial belongs. Similarly, the fyEnding attribute states that every company financial
object has a date when it was recorded.
Modifiers: Modifiers also define a relationship between semantic objects of the
corresponding semantic types. The difference is that the values of the semantic
objects defined by the modifiers have varying interpretations depending on the context.
Referring to the domain model, the semantic type companyFinancials defines two
modifiers, scaleFactor and currency. The value of the object returned by the
modifier scaleFactor depends on a given context.
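The distinction between attributes and modifiers can be sketched as a small data structure. This is an illustration in Python rather than COINL: the class layout, the context labels `c_ds` and `c_dt` (the thesis only names `c_ws`), and the use of "DEM" as the currency of the country of incorporation are assumptions made for the example.

```python
from dataclasses import dataclass

# Modifier values per context. Only c_ws is named in the thesis; c_ds (Disclosure)
# and c_dt (Datastream) are labels invented for this sketch. The values follow the
# analyst scenario, with "DEM" standing in for the currency of the country of
# incorporation.
MODIFIERS = {
    "c_ws": {"scaleFactor": 1000, "currency": "USD"},
    "c_dt": {"scaleFactor": 1000, "currency": "DEM"},
    "c_ds": {"scaleFactor": 1, "currency": "DEM"},
}


@dataclass
class CompanyFinancials:
    """Semantic object of type companyFinancials."""
    value: float      # the underlying number, as reported by the source
    company: str      # attribute: the company the figure belongs to
    fyEnding: str     # attribute: the fiscal year end the figure was recorded for

    def modifier(self, name, context):
        """Modifiers, unlike attributes, are resolved against a context."""
        return MODIFIERS[context][name]
```

An attribute such as `company` has one value regardless of who is asking, whereas `modifier("scaleFactor", ...)` returns 1000 in `c_ws` and 1 in `c_ds` for the same object.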
Elevation Axioms
The mapping of data and data-relationship from the sources to the domain model is
accomplished through the elevation axioms. There are three distinct operations that
define the elevation axioms:
Define a virtual semantic relation corresponding to each extensional relation
Assign to each semantic object defined its value in the context of the source
Map the semantic objects in the semantic relation to semantic types defined in the
domain model and make explicit any implicit links (attribute initialization)
represented by the semantic relation
We illustrate how the relation is elevated with the Worldscope example. The
Worldscope relation is a table in an Oracle database and has the following columns:
Mediation System
In the following sections, we will describe the workings of the mediation system by
means of the financial analyst application scenario. We will begin with the query, and
then we will describe the compilation process. In addition, we will discuss the domain
and the specification of information context. The query is as follows:
select worldscope.totalassets, datastream.totalsales, disclosure.netincome, quotes.Last
from worldscope, datastream, disclosure, quotes
where worldscope.companyname = "DAIMLER-BENZ AG" and
datastream.asofdate = "01/05/94" and
worldscope.companyname = datastream.name and
worldscope.companyname = disclosure.companyname and
worldscope.companyname = quotes.cname;
The above query is asked in the Worldscope context. Once the mediation system has
processed all parts of the query, it converts the result into the Worldscope
context and returns the converted result to the user.
3.4.1 SQL to Datalog Query Compiler
The SQL query is fed to the SQL query compiler when it is entered into the system.
The query compiler takes in the SQL query and parses the query into its corresponding
datalog form. At the same time, by means of elevation axioms, the compiler elevates the
data sources into their corresponding elevated data objects. The corresponding datalog
query for the SQL query above is as follows:
net income,
WorldcAF_p(V27, V26, V25, V24, V23, V22, V21),
DiscAF_p(V20, V19, V18, V17, V16, V15, V14),
DstreamAF_p(V13, V12, V11, V10, V9, V8),
Quotes_p(V7, qlast),
Value(V27, c_ws, V5),
Value(V13, c_ws, V4),
V4 = "01/05/94",
Value(V12, c_ws, V3),
V5 = V3,
Value(V20, c_ws, V2),
V5 = V2,
Value(V7, c_ws, V1),
V5 = V1,
Value(V22, c_ws, totalassets),
Value(V17, c_ws, totalsales),
Value(V11, c_ws, netincome),
Value(qlast, c_ws, last).
The query now contains elevated data sources along with a set of predicates that map
each attribute to its value in the corresponding context. Since the user asked the query in
the Worldscope context (denoted by c_ws), the last four predicates in the translated query
ensure that the actual values returned as the solution of the query are in Worldscope
context. The resulting unmediated datalog query is then fed to the mediation engine.
3.4.2 Mediation Engine
The mediation engine is the core of the COIN system. It detects and resolves
possible semantic conflicts. In short, the mediation is a query rewriting process. The
actual mechanism of mediation is based upon an abduction engine [Kakas, Kowalski,
and Toni 1993]. The engine takes a datalog query and a set of domain model axioms and
computes a set of abducted queries such that the abducted queries have all the semantic
differences resolved. The system incrementally tests for potential semantic conflicts and
introduces conversion functions to resolve the conflicts. The mediation engine outputs a
set of queries that accounts for all possible cases of conflicts. Shah [1998] presents
detailed examples of abducted queries.
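The effect of the rewriting step, testing for conflicts and introducing conversion functions, can be sketched as follows. This is a deliberately simplified illustration: it compares modifier values directly and is not the abduction procedure the engine uses. The function names and the `cvt_<modifier>` naming scheme are hypothetical, and "DEM" is an assumed source currency.

```python
def detect_conflicts(source_ctx, receiver_ctx):
    """List the modifiers whose values differ between the two contexts."""
    return [m for m in sorted(receiver_ctx) if source_ctx.get(m) != receiver_ctx[m]]


def mediate(column, source_ctx, receiver_ctx):
    """Rewrite a column reference, wrapping it in one conversion per detected conflict."""
    expr = column
    for modifier in detect_conflicts(source_ctx, receiver_ctx):
        # cvt_<modifier> is a hypothetical conversion-function name.
        expr = "cvt_{}({}, {!r}, {!r})".format(
            modifier, expr, source_ctx[modifier], receiver_ctx[modifier]
        )
    return expr


# Example contexts following the analyst scenario ("DEM" is an assumption).
C_WS = {"scaleFactor": 1000, "currency": "USD"}
C_DS = {"scaleFactor": 1, "currency": "DEM"}
```

A column that already lives in the receiver's context is returned unchanged, while a Disclosure column asked for in the Worldscope context picks up one conversion for currency and one for scale factor. When a modifier value is only known at run time, a single rewriting like this is not enough, which is why the engine outputs a set of queries covering all possible cases.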
3.4.3 Query Planner and Optimizer
The query planner module takes the set of datalog queries produced by the mediation
engine and produces a query plan. It ensures that an executable plan exists which will
produce a result that satisfies the initial query. A query planner is necessary because there
are sources that restrict the type of queries they can service. Another limitation is the
types of operators sources can handle. For example, some web sources do not export an
interface that supports all the SQL operators. Once the planner verifies that an executable
plan exists, it generates a set of constraints on the order in which the different sub-queries
can be executed. Under these constraints, the optimizer applies standard optimization
heuristics to generate the query execution plan.
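The ordering constraints can be pictured as a dependency graph over sub-queries. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later) to derive one admissible order. The dependencies shown are hypothetical and chosen for illustration; the actual planner and its heuristics are not reproduced here.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical sub-queries, each mapped to the sub-queries whose results it needs
# as input bindings. An empty set means the sub-query can run on its own.
dependencies = {
    "worldscope": set(),
    "datastream": {"worldscope"},
    "disclosure": {"worldscope"},
    "quotes": {"worldscope"},
}


def execution_order(deps):
    """Return one order that respects every dependency.

    graphlib raises CycleError if the constraints are circular, which corresponds
    to the case where no executable plan exists.
    """
    return list(TopologicalSorter(deps).static_order())
```

Any order returned places a sub-query after everything it depends on, so a sub-query that needs the company name from Worldscope is never dispatched before the Worldscope sub-query has run.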
The query execution plan is an algebraic operator tree in which each operation is
represented by a node. There are two types of nodes:
Access Nodes: Access nodes represent access to remote data sources. Two subtypes
of access nodes are:
sfw nodes: These nodes represent access to data sources that do not require input
bindings from other sources in the query
join-sfw nodes: These nodes require input from other data sources in the query.
Thus these nodes have to come after the nodes that they depend on while
traversing the query plan tree
Local Nodes: These nodes represent local operations in the local execution engine. Four
subtypes of local nodes are:
join nodes: These nodes join two trees
select nodes: These nodes apply conditions to intermediate results
CVT nodes: These nodes apply conversion functions to intermediate query
union nodes: These nodes represent unions of results obtained by executing the
3.4.4 Runtime Engine
The runtime execution engine executes the query plan. For a given query plan, the
execution engine traverses the query plan tree in a depth-first manner starting from the
root node. At each node, it computes the sub-trees for that node and then applies the
operation specified for that node. For each sub-tree, the engine recursively goes down the
tree until it encounters an access node. At an access node, it composes and sends a SQL
query to the remote source. The results of that query are then maintained in local storage.
The operation at a node will not be carried out unless all the sub-trees have been executed
and all the results available are in local storage. This whole operation propagates all the
way up to the root node. Upon reaching the root node, the execution engine should have
the required set of results corresponding to the original query. The results are then
presented to the user and the whole context mediation process is completed.
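The traversal described above amounts to a post-order walk over the operator tree. The following sketch is illustrative only: the class names, the `fetch` callables that stand in for remote SQL access, and the sample rows are assumptions and are not taken from the COIN implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass
class AccessNode:
    """Leaf node: stands in for sending a SQL query to a remote source."""
    fetch: Callable[[], List[Row]]  # hypothetical stand-in for remote access


@dataclass
class LocalNode:
    """Inner node: a local operation applied to the results of its sub-trees."""
    op: Callable[[List[List[Row]]], List[Row]]
    children: list = field(default_factory=list)


def execute(node) -> List[Row]:
    """Evaluate every sub-tree first, then apply the node's own operation."""
    if isinstance(node, AccessNode):
        return node.fetch()
    results = [execute(child) for child in node.children]
    return node.op(results)


def union(results):
    """Union node: concatenate the rows of all sub-trees."""
    return [row for rows in results for row in rows]


def select(predicate):
    """Select node: keep the rows of the single sub-tree that satisfy predicate."""
    return lambda results: [row for row in results[0] if predicate(row)]


# Made-up sample plan: select over a union of two access nodes.
plan = LocalNode(
    op=select(lambda row: row["dividend"] > 0.5),
    children=[LocalNode(op=union, children=[
        AccessNode(lambda: [{"symbol": "MRK", "dividend": 1.08}]),
        AccessNode(lambda: [{"symbol": "PFE", "dividend": 0.22}]),
    ])],
)
result = execute(plan)
```

The recursion makes the ordering rule explicit: a node's operation runs only after all of its sub-trees have returned their results.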
Web Wrapper
Web wrapping is the technology that enables users to treat web sites as relational
databases. With this technology, users can issue SQL queries to web sources just as they
would to any relation in a relational database. The implementation of this technology is
the web wrapping engine [Bressan and Bonnet 1997]. With the web wrapper engine,
application developers can rapidly wrap a structured or semi-structured web site and
export the schema for users' queries.
Figure 4: Generic Web Wrapper Architecture [Bressan and Bonnet 1997].
Figure 4 shows the architecture of the COIN web wrapper. The system takes the
SQL query as input. It parses the query along with the specifications for the given web
site. A query plan is then constructed. The query plan contains a detailed list of web sites
to which HTTP requests are sent, the order of those requests, and the list of documents that will be
fetched from those web sites. The executioner then executes the plan. Once the pages are
fetched, the executioner then extracts the required information from the pages and
presents the collated results to the user.
Source selection in Context Mediation
Context mediation is a powerful tool for resolving semantic differences among
various data sources. However, the extensive use of the Internet poses the problem of
information quality. As the number of information sources increases, users need to
query only the most appropriate sources. The data quality offered by these sources can
and must be a criterion for source selection. When two different sources give conflicting
information on the same subject, which one should we choose?
In the next chapter, we will discuss the issue of data quality and some quality
criteria. In chapter 5, we will define the problem of source selection with a scenario in
which an investor is trying to make an investment decision by comparing the financial
ratios of several companies in an industry. He tries to leverage the resources on the
Internet but faces the problem of semantic conflicts and data discrepancies among the
4 Data Quality and Source Selection
4.1 Data Quality
4.1.1 Defining Data Quality
There is much database research showing how important data quality is to businesses
and users. The research usually aims at ensuring the quality of data in databases. With
few exceptions, however, data quality is treated as an intrinsic concept, independent of
the context in which data is produced and used. This focus on intrinsic data quality
problems fails to recognize the role of the data consumer in data quality management. In
contrast to this intrinsic view, Strong, Lee and Wang [1997] propose that quality cannot
be assessed independent of consumers who choose and use products. Similarly, the
quality of data cannot be assessed independent of the people who use data - data
consumers. Data consumers' assessments of data quality are important because
consumers now have more choices and control over their computing environment and the
data they use.
Tayi and Ballou [1998] define data quality as "fitness for use", which implies that the
concept of data quality is relative. Thus data with quality considered appropriate for one
use may not possess sufficient quality for another use. The trend toward multiple uses of
data has highlighted the need to address data quality concerns.
In addition, fitness for use implies that we need to look beyond traditional concerns
with the accuracy of the data. Data in stock-portfolio management systems may be
accurate but unfit for use if that data is not sufficiently timely. Also, personnel databases
situated in different divisions of a company may be correct but unfit for use if the desire
is to combine the two and they have incompatible formats.
4.1.2 Quality Criteria for Information Sources
Wang and Strong [1996] have identified fifteen quality criteria and have classified
these into four categories: "intrinsic quality", "accessibility", "contextual quality", and
"representational quality".
Data Quality Category    Data Quality Dimensions
Intrinsic                Accuracy, Objectivity, Believability, Reputation
Accessibility            Accessibility, Access security
Contextual               Relevancy, Value-Added, Timeliness, Completeness, Amount of data
Representational         Interpretability, Ease of understanding, Concise representation, Consistent representation
Table 1: Data quality categories and dimensions.
Among the data quality dimensions suggested by Wang and Strong [1996], we
choose four criteria that are most relevant to context mediation: accuracy, completeness,
consistency, and timeliness.
There is no exact definition for accuracy. For example, Kriebel [1979] characterizes
accuracy as "the correctness of the output information". Ballou and Pazer [1985] describe
accuracy as "the recorded value is in conformity with the actual value." Thus it appears
that accuracy is viewed as equivalent to correctness. Instead of defining accuracy, Wand
and Wang [1996] try to define inaccuracy. Inaccuracy implies that the information system
represents a real-world state different from the one that should have been represented.
Completeness is the ability of an information system to represent every meaningful
state of the represented real world system. Ballou and Pazer [1985] view a set of data as
complete if all necessary values are included: "All values for a certain variable are
recorded". This definition, however, fails to consider the context of the user. Data
considered complete by one user may not be sufficient for another one. We define
completeness as the condition that all values required by the user for a certain variable are recorded.
Consistency is related to the values of data. A data value can only be expected to be
the same for the same situation. If different values for the same piece of data under the
same situation are reported by different sources, then these sources are inconsistent.
Inconsistency occurs in two types of data conflicts: semantic conflicts and data
discrepancies. Semantic conflicts may be due to different representations of the same
piece of data or timeliness issues. Data discrepancies may be a direct result of
Timeliness has been defined in terms of whether the data is out of date [Ballou and
Pazer 1985] and availability of output on time [Kriebel 1979]. Timeliness is affected by
three factors: How fast the information system state is updated after the real-world
system changes; the rate of change of the real-world system; and the time the data is
actually used. Lack of timeliness may lead to a state of the information system reflecting
a past state of the real world.
4.2 Source Selection
Source selection approaches can be broadly divided into two categories: the information
retrieval (IR) approach and the database (DB) approach.
4.2.1 Information Retrieval Approach
The information retrieval (IR) approach handles source selection in a straightforward
manner with some selector-component analyzing source capabilities and source contents.
Matching the query against the capabilities of the sources determines which combinations
of sources are capable of answering the query. Matching the query against the source
contents determines the sources that will likely provide the most and the most relevant
Under the IR approach, retrieval of desired information dispersed in multiple sources
requires general familiarity with their contents and structure, query languages, location
on existing networks etc. The user must break down a given retrieval task into a sequence
of actual queries to databases.
In the GlOSS system [Gravano, Garcia-Molina, and Tomasic 1994], the authors
assume that each participating source provides information on the total number of
documents in the source and for each word the number of documents it appears in. These
values are used to estimate the percentage of query-matching documents in a source. The
source with the highest percentage is chosen for querying.
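As a rough illustration of how such per-word statistics can rank sources, the sketch below estimates the fraction of documents that contain every query word. It assumes that words occur independently of one another, which is a simplifying assumption of this sketch; the source names and counts are invented.

```python
from math import prod


def estimated_match_fraction(total_docs, doc_freq, query_words):
    """Estimate the fraction of documents containing every query word,
    assuming the words occur independently of one another."""
    if total_docs == 0:
        return 0.0
    return prod(doc_freq.get(word, 0) / total_docs for word in query_words)


def pick_source(sources, query_words):
    """Return the name of the source with the highest estimated fraction."""
    return max(
        sources,
        key=lambda name: estimated_match_fraction(
            sources[name]["total_docs"], sources[name]["doc_freq"], query_words
        ),
    )


# Invented statistics for two hypothetical sources.
sources = {
    "source_a": {"total_docs": 1000, "doc_freq": {"merck": 100, "dividend": 200}},
    "source_b": {"total_docs": 500, "doc_freq": {"merck": 10, "dividend": 400}},
}
best = pick_source(sources, ["merck", "dividend"])
```

With these numbers the estimate is 0.1 x 0.2 = 0.02 for source_a and 0.02 x 0.8 = 0.016 for source_b, so source_a is chosen.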
Florescu, Koller, and Levy [1997] try to describe quantitatively the contents of
information sources using probabilistic measures. In their model two values are
calculated: coverage of information sources, determining the probability that a matching
document is found in the source, and overlap between two information sources,
determining the probability that an arbitrary document is found in both sources. These
probabilities are calculated with the help of word-count statistics. This information is
then used to determine the order in querying the sources.
4.2.2 Database Approach
The database (DB) approach involves creating a knowledge server that will form the
interface between information sources and applications in need of that information. A
user queries against the knowledge server in a manner that is independent of the
distribution of information over various sources. It is up to the knowledge server to
determine how to obtain the desired information and which data sources to use.
The Services and Information Management for Decision Systems (SIMS) approach
proposed by Arens and Knoblock [1993] accepts queries in the form of a description of a
class of objects about which information is desired. A complete semantic model of the
application domain is created and used in order to provide a collection of terms with
which to describe the contents of available information sources. The user is not presumed
to know how information is distributed over the data- and knowledge bases to which
SIMS has access. SIMS then proceeds to reformulate the user's query as a collection of
more elementary statements that refer to data stored in available information sources. A
plan is created for retrieving the desired information, establishing the order and content of
the various plan steps/subqueries. The resulting plan is then executed by performing local
data manipulation that generates the final translation into database queries in the
appropriate languages.
Liu and Pu [1997] propose a metadata approach to identify relevant and capable
information sources. For each query the query scope and the query capacity are
determined. The query scope describes synonyms for each part of the query; the query
capacity describes the information source capability requirements for each part of the
query. This metadata is matched with the source capability profiles of the information
sources, which describe category, content and capabilities of a source.
Naumann, Freytag, and Spiliopoulou [1998] propose to use Data Envelopment
Analysis (DEA) for quality-driven source selection. This method does not directly
compare sources, but focuses on individual sources, determining their efficiency in terms
of information quality and cost. A set of efficient sources is then identified. Finer source
selection will be based upon further criteria imposed on the efficient set.
5 Source Selection Problem
5.1 Defining the Problem
In chapter 4, we discuss the meaning of data quality and several approaches to source
selection in previous literature. Most of the IR approaches rely on statistical information
giving the total number of appearances of each distinct word, the document frequency for
each word, and the total number of documents in the source. The DB approaches are slightly
different. They set up a knowledge base that will form the interface between information
sources and applications in need of that information.
We believe there is more to finding the best sources than just counting the
appearances of certain key words. Consider a source containing not only text documents
but explanatory graphics on the subject. Should this source be valued higher than the one
without graphics? Consider a source containing matching but outdated information. Is
such a source worth exploring at all? Quality of source is a relative measure. It depends
very much on the requirements of the user.
Two main types of conflicts may arise in the source selection process: semantic
conflicts and data discrepancies. In the first case, different sources may have different
representations for the same piece of data. In the second case, sources sometimes provide
conflicting values for the same piece of data with no apparent semantic conflicts.
In the next section, we will present two scenarios, one for an individual investor and
the other for a professional financial analyst. Both of them are trying to obtain financial
ratios and corporate information of several companies in the same industry. However,
they are essentially in different contexts and will have different source requirements.
Another complication is the discrepancies among different sources. When they try to
integrate information from the Internet, they realize that different sources present
conflicting data on the same subject. With the two scenarios, we will show how semantic
conflicts and data discrepancies affect the source selection process and how the user's
context plays a part.
5.2 Scenario Analysis
5.2.1 Background
John is an individual investor who does not want to miss the recent rally in the US
stock market. Currently, he has a concentrated holding of technology stocks and would
like to diversify his portfolio by investing in pharmaceutical companies. Before making
any investment decisions, he needs to obtain the financials and analyst recommendations
of several pharmaceutical companies.
Jessica, on the other hand, is a financial analyst with a prominent investment bank.
She covers major pharmaceutical companies in the US and Europe. In compiling her
research reports, she needs to obtain detailed information on the company's operations.
The following is a list of dominant drug companies that both John and Jessica are
interested in:
Johnson & Johnson (NYSE: JNJ), a US health-care product manufacturer
Merck (NYSE: MRK), a US pharmaceutical company
Pfizer (NYSE: PFE), a US pharmaceutical company
Rhone-Poulenc S.A. (NYSE: RP), a French chemical company
5.2.2 Financial Ratios
With a working knowledge of the Internet, John looks up several financial web-sites
for the desired information. He is most interested in obtaining the key ratios of a
company so that he can compare it with its competitors and the industry average. The key
ratios that he is looking at are P/E (price/earnings) ratios, EPS (earnings per share),
estimated earnings growth in the next 5 years, dividend and dividend yield (Appendix A).
He obtains the numbers from several sources (Appendix B). The results of his research
are as follows:
Q: Quarterly; Y: Yearly; TTM: Trailing 12 months
Fields: Share Price, P/E ratio, 5-yr EPS growth (%), Dividend yield (%), Shares (mm)
Legible entries: 14.37, 1.11, 39.73 (TTM), 2.23 (TTM), 0.25 (Q), 1.00 (Y)
Table 2: The key ratios of Johnson & Johnson as of market close on 3/16/99.
Q: Quarterly; Y: Yearly; TTM: Trailing 12 months
Fields: Share Price, P/E ratio, 5-yr EPS growth (%), Dividend yield (%), Shares (mm)
Legible entries: 85 7/8, 0.27 (Q), 38.98 (TTM), 2.15 (TTM), 1.08 (Y), 47 5/8
Table 3: The key ratios of Merck as of market close on 3/16/99.
Fields: Share Price, P/E ratio, 5-yr EPS growth (%), Dividend yield (%), Shares (mm)
Legible entries: 0.22 (Q), 94.98 (TTM), 2.64 (TTM), 0.88 (Y)
Table 4: The key ratios of Pfizer as of market close on 3/16/99.
Fields: Share Price, P/E ratio, 5-yr EPS growth (%), Dividend yield (%), Shares (mm)
Legible entries: 47 5/8, 0.47 (Y)
Table 5: The key ratios of Rhone-Poulenc S.A. as of market close on 3/16/99.
To John's surprise, there are lots of discrepancies among various sources. In each
field, he highlights the numbers that are significantly different from the rest. P/E ratios,
EPS and estimated 5-year EPS growth rate have the most discrepancies. However, what
really disturbs John is that there are even differences in dividend and shares outstanding.
Dissatisfied with the results, John decides to revisit the web sites and look for
5.2.3 Timeliness Issues
John notices the difference in reported EPS values of Pfizer between Market Guide
($2.64 per share) and other sources ($2.55 per share). He revisits the Market Guide website and quickly realizes that it reports trailing 12 months' earnings. The EPS number
provided by DBC is the same as that of Market Guide. John suspects that DBC is
reporting trailing 12 months' earnings as well. In order to verify his hypothesis, he visits
the DBC web-site and looks for a definition for EPS. The following is the definition he
gets from DBC:
"Earnings per share (EPS)
EPS, as it is called, is a company's profit divided by its number of shares.
If a company earned $2 million in one year had 2 million shares of stock
outstanding, its EPS would be $1 per share."
From DBC's explanation, it is unclear how "one year" is defined. It can be "fiscal
year", "trailing 12 months" or just a general definition of "January to December". But
based on his findings in the Market Guide web-site, John assumes that DBC defines "one
year" as "trailing 12 months".
John re-visits the Quicken web-site and finds out that Quicken reports EPS in fiscal
year 1998. The fiscal year of Pfizer ends in December. Since both RapidResearch and
Yahoo! Finance report the same values for EPS as that of Quicken, John suspects that
RapidResearch and Yahoo! Finance report EPS in fiscal year 1998 as well. He visits the
RapidResearch web-site and notices that it does report EPS in fiscal year 1998. However,
he cannot find any explanation on the Yahoo! Finance web-site.
The EPS value reported by Bloomberg ($2.12 per share) is different from all the
other sources. John visits the Bloomberg web-site and only finds an explanation of
"Earnings: 12 months". Similar to DBC's explanation, it is unclear how "12 months" is
defined. But since the value reported by Bloomberg is very close to that reported by
Market Guide and DBC, John believes that Bloomberg reports trailing 12 months'
earnings. He suspects the difference in reported EPS is due to dilution. Reduction in
common EPS may occur if convertible securities are converted, stock options and
warrants are exercised or other shares are issued. In this case, it is possible that
Bloomberg reports diluted EPS which is lower than primary EPS reported by Market
Guide and DBC.
In the above example, we notice that timeliness can be a source of data conflicts.
Data conflicts do not necessarily mean that some or all of the data are inaccurate. All the
EPS values reported by the sources can be correct. The discrepancies are due to different
information service providers having different contexts.
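The difference between the two reporting conventions can be made concrete with a short sketch. The quarterly figures below belong to a hypothetical company and are invented for illustration; they are not taken from any of the sources discussed above.

```python
def fiscal_year_eps(quarterly_eps, fiscal_year):
    """Sum the quarterly EPS values that fall in one fiscal year."""
    total = sum(eps for (year, _quarter), eps in quarterly_eps.items()
                if year == fiscal_year)
    return round(total, 2)


def trailing_twelve_month_eps(quarterly_eps):
    """Sum the four most recent quarterly EPS values."""
    latest_four = sorted(quarterly_eps)[-4:]
    return round(sum(quarterly_eps[key] for key in latest_four), 2)


# Invented quarterly EPS for a hypothetical company, keyed by (year, quarter).
quarterly_eps = {
    (1998, 1): 0.50, (1998, 2): 0.55, (1998, 3): 0.60, (1998, 4): 0.65,
    (1999, 1): 0.70,
}

fy_1998 = fiscal_year_eps(quarterly_eps, 1998)   # 0.50 + 0.55 + 0.60 + 0.65
ttm = trailing_twelve_month_eps(quarterly_eps)   # 0.55 + 0.60 + 0.65 + 0.70
```

Both values are correct under their own definitions. They coincide only when the most recent quarter is the last quarter of the fiscal year.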
5.2.4 Consistency Issues
John finds that the dividends of Merck reported by Yahoo! Finance and Market
Guide ($1.08 per share) are four times those reported by Bloomberg, DBC and
Quicken ($0.27 per share). At first glance, there seem to be large discrepancies in
dividends reported by various sources. He looks up the DBC web-site again and
discovers that DBC explicitly reports the quarterly dividend. Then, he tries Quicken and
notices the following explanation concerning dividend:
A dividend is an amount of money or stock that a corporation pays to its
From the explanation on dividend offered by Quicken, it is clear that Quicken is
reporting the value on a quarterly basis. Finally, John visits the Bloomberg web-site but it
offers no explanation on the dividend reported.
Based on the fact that Bloomberg, DBC and Quicken are reporting quarterly
dividend and their values are one quarter of those of Yahoo! Finance and Market
Guide, John suspects that Yahoo! Finance and Market Guide are reporting dividend on an
annual basis. He re-visits the Market Guide web-site and finds out that Market Guide
explicitly reports annual dividend. He tries to look for an explanation at Yahoo! Finance
but finds nothing. At this point, John is confident that the fourfold difference is caused
by different representations of the same piece of data.
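John's reasoning can be checked with a few lines of arithmetic. The sketch below normalizes the Merck dividends to an annual basis, using only the sources whose reporting frequency is stated in the text; the function and variable names are illustrative.

```python
PAYMENTS_PER_YEAR = {"quarterly": 4, "annual": 1}


def to_annual(amount, frequency):
    """Convert a reported dividend to an annual figure."""
    return round(amount * PAYMENTS_PER_YEAR[frequency], 2)


# Merck dividends as reported, with the reporting frequency each source states.
reported = {
    "DBC": (0.27, "quarterly"),
    "Quicken": (0.27, "quarterly"),
    "Market Guide": (1.08, "annual"),
}
annualized = {source: to_annual(amount, frequency)
              for source, (amount, frequency) in reported.items()}
```

Once the values are expressed on the same basis, the apparent conflict disappears: 0.27 x 4 = 1.08.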
John further notices that RapidResearch consistently reports a lower yearly dividend
than other sources. He visits the RapidResearch web-site and it reports the dividend of fiscal
year 1998 ($0.95 per share). John wants to understand the reason for the inconsistency
and he looks for explanations of dividend in other sources. Finally, he discovers the
following at the Market Guide web-site:
"Dividend Rate ($ per share)
This value is the total of the expected dividend payments over the next
twelve months. It is generally the most recent cash dividend paid or declared
multiplied by the dividend payment frequency, plus any recurring extra
After reading the explanation from Market Guide, John now understands why
RapidResearch reports an annual dividend that is different from that of Market Guide.
The value reported by RapidResearch is the dividend paid out during fiscal year 1998. On
the other hand, the value reported by Market Guide is a 12-month projection based upon
the current quarterly dividend.
In the above example, we notice that inconsistencies arise when values of the same
piece of data are reported in different representations. Again, we believe that all the
sources report the dividend values correctly. The user's context determines the source
that is most appropriate to query against.
5.2.5 Completeness Issues
Despite some of the timeliness issues and consistency issues, John has been quite
satisfied with the quality and extent of the information provided by the financial websites. As an individual investor, he can basically get all the information he needs from
these sources. John finds these sources to have very high completeness. He goes on to
recommend these financial web-sites to his friend Jessica who works as a financial
analyst with a prominent investment bank.
Jessica is compiling a research report to investigate the effect of global economic
meltdown on US pharmaceutical companies. She is particularly interested in analyzing
the geographical breakdown of sales of Johnson & Johnson. As a professional financial
analyst, she knows she can get the sales figures from Johnson & Johnson's annual report.
However, since John highly recommends the Internet financial web-sites, she decides to
use the Internet this time and saves the trouble of flipping through the annual report. She
looks up the Market Guide web-site for the sales information. To her disappointment, she
only finds the following information:
Revenue ($ mm)
12 months ending 01/03/99
Table 6: Sales of Johnson & Johnson in fiscal year 1998 reported by Market Guide.
Jessica tries several other financial web-sites but still without luck in obtaining the
geographical breakdown of sales. Finally, she resorts to the 1998 annual report of
Johnson & Johnson and obtains the following information:
(Dollars in Millions)
United States
Sales to Customers in 1998
Western Hemisphere excluding US
Asia-Pacific, Africa
Segments Total
Table 7: Geographical breakdown of sales of Johnson & Johnson in fiscal year 1998 (Source:
Johnson & Johnson 1998 Annual Report).
Jessica will not consider the Internet financial sources to be complete. They do not
have the sales information that Jessica needs. She only finds the Johnson & Johnson
Annual Report to have high completeness. This example illustrates that completeness is a
quality that depends very much on users' contexts.
5.2.6 Accuracy Issues
Jessica is looking at Rhone-Poulenc, a French chemical company. She is puzzled by
the differences in the EPS values reported by the various sources. Unlike the example in
section 5.2.3, the data discrepancies are widespread in this case. Quicken and
Yahoo! Finance both report a gain for Rhone-Poulenc ($1.94 per share) while DBC and
Market Guide both report a loss (-$0.42 per share). RapidResearch reports an even deeper
loss (-$2.47 per share). Bloomberg does not have a value for the EPS. Rhone-Poulenc is a
French company and there is probably limited financial information on it.
Jessica understands that Rhone-Poulenc reported a loss in fiscal year 1997. So the
discrepancies in gain/loss among the sources are most likely timeliness issues. However,
the disturbing fact is that there are discrepancies even among sources that are reporting a
loss. The only reliable way to obtain the EPS is to derive it from net earnings and number
of shares outstanding. She looks up the annual report of Rhone-Poulenc and obtains the
following information:
(in Millions of French Francs)
Net Income
Earnings per Share
Average Number of Shares Outstanding
Table 8: Consolidated Financials of Rhone-Poulenc in fiscal year 1998 (Source: Rhone-Poulenc 1998
Annual Report).
From the annual report of Rhone-Poulenc, Jessica finds out that the EPS is 11.48 FF
per share. However, since the audience of her report is US-based investors, she has to
convert the EPS figure into US dollars. With the help of the Oanda currency database, she
obtains the historical FF/USD exchange rate of 1 FF = $0.1779 on 12/31/1998.
Therefore, the EPS of Rhone-Poulenc in fiscal year 1998 is $2.04 per share.
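The conversion can be reproduced as follows. The sketch uses decimal arithmetic and rounds to the nearest cent; the function name and the choice of rounding mode are assumptions made for illustration.

```python
from decimal import Decimal, ROUND_HALF_UP


def convert_eps(eps_local, rate_to_usd):
    """Convert an EPS figure to US dollars and round to the nearest cent."""
    usd = Decimal(eps_local) * Decimal(rate_to_usd)
    return usd.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# 11.48 FF per share at 1 FF = $0.1779 (rate on 12/31/1998, as given in the text).
eps_usd = convert_eps("11.48", "0.1779")
```

The product is 11.48 x 0.1779 = 2.042292, which rounds to $2.04 per share.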
The EPS value derived by Jessica from net earnings and number of shares
outstanding is different from any of the values reported by the Internet sources. There are
no apparent reasons for the discrepancies. A possible reason is the use of different
exchange rates. For example, some sources may be mistakenly using the current
exchange rate rather than the exchange rate on 12/31/1998.
In this example, the data discrepancies without apparent semantic conflicts are
considered to be accuracy issues. Wang and Reddy [1992] propose to resolve accuracy
issues by asking for user's judgement, which is likely to be based upon past experiences.
We show that an alternative solution is using other high quality data to derive the correct
values for inaccurate data.
5.3 COIN Approach
The COIN system is an elegant system for resolving semantic conflicts among
various data sources. We propose to leverage the COIN system to resolve the semantic
conflicts and data discrepancies presented in the scenarios in section 5.2.
In chapter 6, we introduce the approach of source selection based on context
mediation. Our goal is to select the source that best fits the user's context. We show that
context mediation can handle semantic conflicts, such as consistency and timeliness
issues, among various sources in a straightforward manner. For data discrepancies, such
as accuracy and completeness issues, we propose to adopt the strategy of derived data.
With appropriate ontology mapping, high quality data can be used to derive the correct
values for inaccurate data.
6 Solution with Context Mediation
In this chapter we propose a solution to source selection based upon context
mediation. With the COIN approach, the user is unaware of the data discrepancies among
the data sources. The system accesses the data sources implicitly without the user ever
specifying them in the query. In order to resolve the semantic conflicts and data
discrepancies, the COIN system should be aware of the context of the data sources and
that of the user.
[Figure 5 diagram: User Query and Context; Context of Data Sources; Source Selection with Context Mediation; Query Result]
Figure 5: Overview of the COIN approach in source selection.
For example, a user wants to obtain the annual dividend of a company. The COIN
system should be aware that some sources report dividend on a quarterly basis while
others report dividend on an annual basis. In the user's context, he expects the query
result to be annual dividend. The COIN system should query against a source that reports
annual dividend and return the query result to the user. In case the source that reports
annual dividend is unavailable due to network or server error, the COIN system should
still be able to detect the semantic relation between quarterly dividend and annual
dividend. It will query against a source that reports quarterly dividend. Upon receiving
the query result, the COIN system should perform the necessary conversion before
passing the result to the user. In this case, the quarterly dividend is multiplied by four to
yield the annual dividend.
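The behavior described in this example can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the COIN mediation engine: the `Source` class is hypothetical, and a dividend of `None` is used to model a source that is unavailable.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

PAYMENTS_PER_YEAR = {"annual": 1, "quarterly": 4}


@dataclass
class Source:
    name: str
    frequency: str              # context of the source: "annual" or "quarterly"
    dividend: Optional[float]   # None models a network or server error


def annual_dividend(sources: List[Source]) -> Tuple[str, float]:
    """Prefer a source that already matches the user's context (annual);
    otherwise fall back to a quarterly source and convert its value."""
    ranked = sorted(sources, key=lambda s: s.frequency != "annual")
    for source in ranked:
        if source.dividend is None:
            continue  # source unavailable, try the next one
        factor = PAYMENTS_PER_YEAR[source.frequency]
        return source.name, round(source.dividend * factor, 2)
    raise LookupError("no source available")


# Market Guide is unavailable here, so the quarterly source is used instead.
chosen, value = annual_dividend([
    Source("DBC", "quarterly", 0.27),
    Source("Market Guide", "annual", None),
])
```

The user never names a source. The selection and the conversion both follow from comparing the context of each source with the context of the user.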
In the following sections, we will continue to use the two scenarios of an individual
investor and a financial analyst to illustrate how context mediation can be applied to
source selection.
6.2 Data Consistency
In section 5.2.4, we present a scenario in which John, an individual investor, notices
the inconsistencies in reported dividend values among various sources. The reason behind
the inconsistencies is that different sources have different contexts. Some sources report
dividend on a quarterly basis while others report it on an annual basis. We will illustrate
how context mediation can resolve the data inconsistency in this case.
Shah [1998] describes the steps of building a COIN application in detail. In order to
leverage the COIN system in resolving semantic conflicts, we need to carefully set up the
domain model, the context definitions and the conversion functions.
6.2.1 Defining the Domain Model
In the domain model, we have two types of objects, basic objects (denoted by
squares) and semantic objects (denoted by ellipses). Basic objects have scalar types.
Examples are numbers and strings. The semantic objects are the ones that are used in the
system as they carry rich semantic information about the objects in a given domain.
Furthermore, there are two types of relations specified between the objects: attribute
relation and modifier relation. Details of these relations are described in chapter 3. Figure
6 presents the dividend domain model:
Field Name
frequency format
Figure 6: Dividend Domain Model.
We define the domain model for the COIN system with the help of Figure 6. We use
the example of the semantic type Dividend to demonstrate how to define an object and
create the relations for that object. The object Dividend has three attributes, company,
field and frequency, and one modifier, unit. We first define the Dividend semantic type.
The construct to define a semantic object is:
After we have defined the object, we then define the attributes and modifiers:
attributes(Dividend, [company, field, frequency])
modifiers(Dividend, [unit])
Definitions for other objects in the domain model are listed in Appendix C.
6.2.2 Defining the Context
We have to identify all the semantic differences among the data sources before we
can create their contexts in COIN. The following is a context table for the dividend field
of the six sources:
Data Source
Field Name
Stock Symbol
"Dividend Amount"
"Quarterly" Number
Market Guide
"Annual Dividend"
Last Fiscal Year
Yahoo! Finance
Table 9: Context Table for Dividend Domain Model.
In Table 9, each row corresponds to a data source for which we have to specify a
context and lists all the characteristics of that data source. We also need to give a unique
name to each context that we are going to create for each of the data sources. For
example, we will use cmg as the name of the context for Market Guide.
Once we have specified and named the contexts, we create the actual contexts.
Referring to the context table, we look at the row for the Market Guide data source. Each
column of the table refers to a semantic difference present in the source that needs to be
resolved. The column Field Name refers to the string identifying the key to the dividend
value in a source. For example, in order to successfully query Market Guide for
the dividend value, we have to specify the field as "Annual Dividend". If we just type the
field as "Dividend", the query will likely return no result. On the other hand, when
we are querying Yahoo! Finance for the dividend, we have to specify the field as
"Div/Shr". The column Frequency refers to the frequency of the reported dividend value in a
source. Market Guide reports the annual dividend whereas DBC reports the quarterly
dividend. The column Unit refers to the format of the dividend value in a source. For
example, in DBC the dividend is reported as a number followed by the string "Quarterly".
In all the other sources, the dividend is just reported as a number. The last column, Stock
Symbol, refers to the key used by the data source to identify the companies. There is no
discrepancy among the sources in this aspect. All the sources use the stock symbol to
identify the companies.
After we have recognized all the potential semantic conflicts, we can create the
contexts. We will use Market Guide as an example to show the context creation process.
The context for Market Guide in Prolog is as follows:
modifier(dividend, O, unit, cmg, M)
cste(basic, M, cmg, "Number")
modifier(fieldName, O, fieldFormat, cmg, M)
cste(basic, M, cmg, "Annual Dividend")
modifier(freqRepresentation, O, freqFormat, cmg, M)
cste(basic, M, cmg, "Annual")
modifier(stockSymbol, O, format, cmg, M)
cste(basic, M, cmg, "String")
Each statement refers to a potential conflict that needs to be resolved by the COIN
system. In other words, each statement corresponds to a modifier relation in the actual
domain model. From the domain model in Figure 6, we notice that the object FieldName
has a modifier fieldFormat. The second statement in the above example corresponds to
this modifier. Referring to Table 9, we notice that the value of fieldFormat in the Market
Guide context is "Annual Dividend". The statement represents this fact. It states that the
modifier fieldFormat for the object O of type fieldName in the context cmg is the object
M, where the object M is a constant (cste) of type basic and has a value of "Annual
Dividend" in the context cmg. The type basic refers to all the scalar types: numbers,
reals and strings.
Context definitions for other objects are listed in Appendix D.
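The modifier statements above can be read as a lookup from (semantic type, modifier) to a constant value within a named context. The following Python fragment is a minimal sketch of that reading and is not part of the COIN implementation. The context name cdbc and the DBC values are assumptions taken from the prose description of Table 9; only the cmg entries come from the statements listed above.

```python
# Each context maps (semantic type, modifier) -> constant value, mirroring
# the modifier/cste pairs listed for Market Guide.
CONTEXTS = {
    "cmg": {
        ("dividend", "unit"): "Number",
        ("fieldName", "fieldFormat"): "Annual Dividend",
        ("freqRepresentation", "freqFormat"): "Annual",
        ("stockSymbol", "format"): "String",
    },
    # Hypothetical context for DBC; the name "cdbc" and these values are
    # assumptions based on the prose, not declarations from the thesis.
    "cdbc": {
        ("dividend", "unit"): '"Quarterly" Number',
        ("fieldName", "fieldFormat"): "Dividend Amount",
        ("freqRepresentation", "freqFormat"): "Quarterly",
        ("stockSymbol", "format"): "String",
    },
}


def modifier_value(context, semantic_type, modifier):
    """Return the constant assigned to a modifier in the given context."""
    return CONTEXTS[context][(semantic_type, modifier)]


def conflicts(source_ctx, target_ctx):
    """List the (semantic type, modifier) pairs whose values differ."""
    src, tgt = CONTEXTS[source_ctx], CONTEXTS[target_ctx]
    return sorted(key for key in src if src[key] != tgt.get(key))


print(modifier_value("cmg", "fieldName", "fieldFormat"))
print(conflicts("cdbc", "cmg"))
```

Each pair returned by conflicts is a modifier for which a conversion function would be needed when data moves between the two contexts.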
6.2.3 Defining the Conversion Function
We need to specify conversion functions for the contexts after we have defined them.
The conversion functions are used during context mediation when the system needs to
convert objects between different contexts. We have to provide conversion functions for
all the modifiers of all the objects defined in the domain.
For the modifier fieldFormat, the conversion between different contexts is
straightforward text manipulation. For example, if we have to convert from the DBC
context to the Market Guide context, we will change the FieldName from "Dividend
Amount" to "Annual Dividend".
Similarly, for the modifier unit, the conversion is also text manipulation. We assume
the user would expect the query result of dividend to be a number. For example, if we
have to convert from the DBC context to the user's context, we will strip the string
"Quarterly" from the query result and just return the number to the user.
Finally, for the modifier freqFormat, the conversion is less direct. Converting
between quarterly and annual dividend requires an arithmetic operation. For example, if we
have to convert from the Quicken context to the Yahoo! Finance context, we will
multiply the quarterly dividend by four to get the annual dividend.
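The three conversions described above can be sketched in Python as follows. This is an illustration only, under stated assumptions: the function names are hypothetical, the raw DBC value is assumed to look like a number followed by the word "Quarterly", and the frequency conversion assumes four equal quarterly payments, as in the multiply-by-four rule.

```python
import re

# Field names per source, as given in the discussion of Table 9.
FIELD_NAMES = {
    "DBC": "Dividend Amount",
    "Market Guide": "Annual Dividend",
    "Yahoo! Finance": "Div/Shr",
}

# Number of dividend payments per year for each frequency value.
PAYMENTS_PER_YEAR = {"Quarterly": 4, "Annual": 1}


def convert_field_name(target_source):
    """fieldFormat conversion: use the field name the target source expects."""
    return FIELD_NAMES[target_source]


def strip_unit(raw_value):
    """unit conversion: extract the number from a value such as '0.25 Quarterly'."""
    match = re.search(r"-?\d+(?:\.\d+)?", raw_value)
    if match is None:
        raise ValueError(f"no number found in {raw_value!r}")
    return float(match.group())


def convert_frequency(amount, source_freq, target_freq):
    """freqFormat conversion: rescale between quarterly and annual dividends."""
    return amount * PAYMENTS_PER_YEAR[source_freq] / PAYMENTS_PER_YEAR[target_freq]


quarterly = strip_unit("0.25 Quarterly")  # illustrative value, not source data
print(convert_field_name("Market Guide"))
print(convert_frequency(quarterly, "Quarterly", "Annual"))
```

The dividend value 0.25 is a placeholder chosen for the illustration and is not taken from any of the sources.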
6.3 Data Timeliness
In section 5.2.3, John notices there are differences in reported EPS values for the
same company. He realizes the conflict is a timeliness issue. Some sources report trailing
12 months' earnings while others report earnings in the last fiscal year. We will show that
context mediation can help resolve the timeliness issue in this scenario by reconciling the
EPS in different time frames. In our following discussion, we will focus on three sources:
Market Guide, Yahoo! Finance and RapidResearch. Market Guide reports EPS in the last
fiscal year. Yahoo! Finance reports EPS on a trailing 12 months' basis, i.e. total EPS of
the most recent 4 quarters. RapidResearch reports quarterly EPS. We intend to use
RapidResearch's quarterly EPS data to resolve the difference between fiscal EPS and
trailing 12 months' EPS.
The process in this example is similar to that in the data consistency case. We have
to define a domain model, a context table and the conversion functions.
6.3.1 Defining the Domain Model
Figure 7: EPS Domain Model
Figure 7 presents the EPS domain model. The definitions for the attributes and
modifiers are similar to those in section 6.2.1. A complete list of definitions for the objects
in the domain model is in Appendix E.
6.3.2 Defining the Context
We identify all the semantic differences among the three data sources that we are
interested in. The following is a context table for the EPS field of the sources:
[Table 10 is only partially recoverable from the source. Surviving cells: Data Source, Market Guide, Field Name, "Earnings (TTM) $", Date Format, Time Format, Stock Symbol.]
Table 10: Context Table for EPS Domain Model.
In Table 10, the column Field Name refers to the string identifying the key to the
EPS value in a source. For example, in order to obtain the EPS value from Market Guide,
we have to specify the field as "Earnings (TTM) $". The column Date Format refers to
the way dates are represented in the data source. For example, in Market Guide,
December 31, 1998 is represented as 12/31/98; in Yahoo!, it is represented as Dec/31;
and in RapidResearch, it is represented as 9812. The column Time Format refers to the
time frame of the reported EPS value in a source. For example, Market Guide reports the
EPS in the last fiscal year while Yahoo! reports the trailing 12 months' EPS. The column
Unit refers to the numerical format of the EPS values. There is no discrepancy in this
field. The last column, Stock Symbol, refers to the key used by the data source to identify
the companies. There is no discrepancy among the sources in this field either.
The context creation process is similar to that in section 6.2.2. A list of context
definitions for the EPS model is in Appendix F.
6.3.3 Defining the Conversion Function
The conversion functions in this scenario are more complicated than those in section
6.2.3. For the modifiers fieldFormat and dateFormat, the conversions between different
contexts are straightforward text manipulations. However, the conversion from fiscal EPS
to trailing 12 months' EPS requires quarterly EPS data.
We illustrate the conversion process with an example of the EPS of Pfizer. John
would like to obtain the trailing 12 months' EPS of Pfizer. He sends his query to the
COIN system. Under normal circumstances, the COIN system would match John's
context with the contexts of the sources and direct the query to Quicken which reports
trailing 12 months' EPS. Unfortunately, the Quicken server is down temporarily and the
COIN system cannot reach it. The Market Guide server is available but it reports EPS of
last fiscal year. The COIN system has to do some conversion before passing the query
result to John. The system knows that RapidResearch has quarterly EPS data. The
RapidResearch data is as follows:
Table 11: Quarterly EPS of Pfizer reported by RapidResearch.
The EPS of fiscal year 1998 reported by Market Guide is 2.55. The conversion of
fiscal 1998 EPS to trailing 12 months' EPS is as follows:
Trailing 12 months' EPS = Fiscal 1998 EPS + Mar 99 (Q) EPS - Mar 98 (Q) EPS
= 2.55 + 0.62 - 0.53
= 2.64
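The same conversion can be written as a small function. The sketch below is illustrative and the function name is hypothetical. It assumes, as in the Pfizer example, that exactly one quarter has elapsed since the end of the fiscal year, so one quarter is added and the corresponding year-ago quarter is dropped.

```python
def trailing_twelve_month_eps(fiscal_eps, latest_quarter_eps, year_ago_quarter_eps):
    """Roll a fiscal-year EPS forward by one quarter.

    Adds the most recent quarter and removes the same quarter of the
    previous year, which leaves the sum of the four most recent quarters.
    """
    return fiscal_eps + latest_quarter_eps - year_ago_quarter_eps


# Values from the Pfizer example: fiscal 1998 EPS, Mar 99 (Q) EPS, Mar 98 (Q) EPS.
ttm = trailing_twelve_month_eps(2.55, 0.62, 0.53)
print(round(ttm, 2))  # 2.64
```

If more than one quarter had elapsed, the same step would be repeated for each additional quarter.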
After the conversion, the COIN system returns the trailing 12 months' EPS
to John. During the whole mediation process, John is not aware that Quicken is down and
that the COIN system has chosen to use Market Guide and RapidResearch instead. The
query result is the same as if the query had been sent directly to Quicken. The COIN
system resolves the timeliness issue in this scenario.
6.4 Data Accuracy
6.4.1 Resolving Data Discrepancies
In section 5.2.6, we present a scenario in which Jessica, a financial analyst, notices
the data discrepancies in the EPS of a French pharmaceutical company reported by
various sources. There are no apparent semantic conflicts among the data. She believes
the data discrepancies are the result of data inaccuracy.
Wang and Reddy [1992] propose to resolve accuracy issues by asking for the user's
judgement, which is likely to be based upon past experiences. Under this approach,
Jessica will have to select a source that she believes is accurate. There are two potential
problems in this approach. First, if Jessica has no prior experience with any of the
sources, then she will have a hard time selecting a source. The source selection process
will be no better than a randomized algorithm. Second, even if Jessica perceives that one
source is better than the other, there is no guarantee that the data reported by the selected
source is accurate this time.
We propose to resolve accuracy issues by using other high quality data to derive the
correct values for inaccurate data. We can extend the COIN system with the appropriate
ontology. An ontology is a specification of a conceptualization. In other words, an
ontology is a description of the concepts and relationships that can exist for an agent or a
community of agents [Gruber 1993].
6.4.2 Derived Data
The following figure outlines the COIN approach in handling data accuracy issues:
[Figure 8 diagram not recovered; surviving labels: Net Income, Shares Outstanding, Context Mediation.]
Figure 8: COIN approach in handling data accuracy issues.
In the above example, the EPS values reported by the various Internet sources have
lots of discrepancies. Instead of asking the user explicitly to select a source, the COIN
system implicitly handles the conflicts by deriving an accurate EPS value from the
supposedly accurate data of Net Income and Shares Outstanding from the company's
online annual report. The mapping relation is as follows:
EPS = Net Income ÷ Shares Outstanding
One can argue that the COIN approach does not fully solve the data accuracy
problem. How do we know that the data in the annual report is accurate? Or how can we
be sure that the contents of the annual report are accurately presented online? These
questions can go on infinitely. The accuracy issue can be traced from network reliability
all the way back to the professionalism of the company's auditor.
Again, we want to point out that data accuracy is a relative issue. It depends very
much on the contexts of the data provider and the data consumer. In the scenario of
Jessica, she considers the EPS values provided by the Internet sources inaccurate but the
derived value from the annual report accurate enough for her analysis. And this may only
be true for large, reputable companies. If Jessica is analyzing a company that is notorious
for earnings manipulation, then she would not consider the data from the annual report to
be accurate.
The COIN approach towards data accuracy issues can be viewed as a best-effort
approach. The system will try to derive the data that is accurate in the user's context.
6.4.3 Ontology View
Jessica is quite satisfied with the COIN system in its ability to resolve data accuracy
problems. Upon a closer inspection of the annual report of Rhone-Poulenc, she notices
that the numbers are reported in French Francs. Since all her clients are US-based
investors, she would like to have the EPS number reported in US Dollars. Also, due to
accounting differences between the US and European companies, she would like to
obtain the EPS Before Goodwill Amortization.
We propose to create an ontology view within the COIN system that encompasses all
the relations. The ontology mapping is as follows:
[Figure 9 diagram: f(Net Income Before Goodwill Amortization in FF, Shares Outstanding, USD/FF Exchange Rate) -> EPS Before Goodwill Amortization in USD]
Figure 9: Ontology mapping for Rhone-Poulenc scenario.
In Figure 9, the ontology is a result of the mappings of the surrounding objects. Each
arrow represents a relation between two or more objects. The objects need not be data
objects from the same source.
In the scenario of Rhone-Poulenc, we can obtain Net Income, Goodwill Amortization
and Shares Outstanding data from its online annual report. The problem is that all these
figures are reported in French Francs. In order to convert the financial figures from
French Francs to US Dollars, we need the currency exchange rate. We can obtain the
historical USD/FF exchange rate from the OANDA currency database on the Internet.
After we have identified the sources, we need to set up the mapping functions. Two
of the most important mapping functions in this scenario are as follows:
f(Net Income in FF, Goodwill Amortization in FF)
= Net Income in FF + Goodwill Amortization in FF
f(Net Income Before Goodwill Amortization in FF, Shares Outstanding,
USD/FF Exchange Rate)
= Net Income Before Goodwill Amortization in FF ÷ Shares Outstanding
x USD/FF Exchange Rate
The first function derives the Net Income Before Goodwill Amortization from Net
Income and Goodwill Amortization. The second function derives the EPS Before
Goodwill Amortization in USD from Net Income Before Goodwill Amortization in FF, Shares
Outstanding and USD/FF Exchange Rate.
The data from Rhone-Poulenc's Annual Report is as follows:
(in Millions of French Francs, except number of shares)
Net Income                              4,224
Goodwill Amortization                   1,400
Average Number of Shares Outstanding    367,752,291
Table 12: Consolidated Financials of Rhone-Poulenc in fiscal year 1998 (Source: Rhone-Poulenc 1998
Annual Report).
The USD/FF exchange rate data from the OANDA currency database is as follows:
French Franc (FF)    1
US Dollar (USD)      0.1779
Table 13: USD/FF exchange rate as of 12/31/1998 (Source: OANDA Currency Database).
Upon receiving the query from Jessica, the COIN system performs the following
calculations:
Net Income Before Goodwill Amortization in FF = 4,224 + 1,400
= 5,624 million
EPS Before Goodwill Amortization in USD = (5,624 million ÷ 367,752,291) x 0.1779
= 2.72
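The two mapping functions can be checked with a short Python sketch. The function names are hypothetical and the code is an illustration of the arithmetic, not the COIN ontology mechanism itself. The exchange rate is taken to be the USD value of one French Franc, which is how it is applied in the calculation above.

```python
def net_income_before_goodwill(net_income_ff, goodwill_amortization_ff):
    """First mapping: add goodwill amortization back to net income (both in FF)."""
    return net_income_ff + goodwill_amortization_ff


def eps_before_goodwill_usd(income_before_goodwill_ff, shares_outstanding, usd_per_ff):
    """Second mapping: per-share income in FF, converted to USD."""
    return income_before_goodwill_ff / shares_outstanding * usd_per_ff


# Figures used in the Rhone-Poulenc calculation (FF amounts in francs, not millions).
income_ff = net_income_before_goodwill(4_224_000_000, 1_400_000_000)
eps_usd = eps_before_goodwill_usd(income_ff, 367_752_291, 0.1779)

print(income_ff)          # 5624000000
print(round(eps_usd, 2))  # 2.72
```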
Another possible conversion required is the company name. Jessica may just use the
stock symbol of Rhone-Poulenc, RP, in her query. However, the web-site of Rhone-Poulenc may not recognize the stock symbol, RP. The full name of the company may be
needed. The COIN system has to convert the stock symbol to the company name.
The COIN system then returns the mediated result of EPS Before Goodwill
Amortization in USD to Jessica. During the whole process, Jessica is not aware of the
source selection and the currency conversion operations.
6.5 Data Completeness
In section 5.2.5, we point out that data completeness is a relative concept. A source
that is complete to one user may be incomplete to another. In this scenario, John, from the
context of an individual investor, only wants to get the total sales in the last fiscal year of
Johnson & Johnson. He considers the Internet financial sources to be complete. On the
other hand, Jessica, from the context of a financial analyst, needs the geographical
breakdown of sales in the last fiscal year of Johnson & Johnson. She cannot obtain this
information from any of the Internet financial sources and she considers these sources to
be incomplete.
We propose to solve the data completeness issue by bridging the gap between
sources with an ontology mapping. The ontology mapping for this scenario is similar to
that in section 6.4.3.
[Figure 10 diagram (caption not recovered); surviving labels: US Sales; Western Hemisphere excluding US Sales; Asia-Pacific, Africa Sales; Total Sales.]
The mapping from Geographical Sales to Total Sales is a one-way mapping. This means
the COIN system can derive Total Sales given the
Geographical Sales but not the other way round. When Jessica queries for the
Geographical Sales, the COIN system will answer the query with the corresponding
Geographical Sales object. When John queries for the Total Sales, the COIN system can
either obtain the total sales from a financial web-site or perform the conversion on the
Geographical Sales data with the ontology. With this ontology, the COIN system looks
complete to both John and Jessica.
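The one-way nature of this mapping can be illustrated with a minimal Python sketch. The function names and the sales figures are hypothetical placeholders and are not Johnson & Johnson data. The region names follow the labels that survive from the ontology mapping figure and may not be the complete set.

```python
def total_sales(geographical_sales):
    """Derive Total Sales by summing the regional breakdown."""
    return sum(geographical_sales.values())


def answer_total_sales_query(reported_total=None, geographical_sales=None):
    """Answer a Total Sales query from whichever data is available.

    Prefers a total reported directly by a financial web-site and falls
    back to deriving it from the geographical breakdown.
    """
    if reported_total is not None:
        return reported_total
    if geographical_sales is not None:
        return total_sales(geographical_sales)
    raise LookupError("no source can answer the Total Sales query")


# Placeholder figures for illustration only.
breakdown = {
    "US": 10.0,
    "Western Hemisphere excluding US": 2.0,
    "Asia-Pacific, Africa": 3.0,
}
print(answer_total_sales_query(geographical_sales=breakdown))  # 15.0
```

No inverse function is defined: a single total does not determine the regional figures, which is why the mapping runs in one direction only.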
However, the COIN approach does not completely solve the data completeness
problem. What if there is a third user querying for the sales of Johnson & Johnson in
Massachusetts? Or for the sales of Johnson & Johnson in Boston? These queries can get
finer and finer infinitely. The COIN system cannot answer all these queries with the
given ontology. Similar to the data accuracy issue, the COIN approach towards data
completeness issues can be viewed as a best-effort approach.
6.6 Potential Problems and Possible Solutions
6.6.1 Identifying the Sources
A potential problem for the COIN derived data approach is to correctly identify the
sources that can provide accurate data for creating the ontology view. For example, how
does the COIN system know the URL for the online annual report of Rhone-Poulenc? Or
how does the COIN system know the existence of the online annual report? We propose
three solutions to this problem. The first solution is to query an Internet search
engine with the company name and try to obtain the URL of the company's web-site. The
second solution is to identify sources that have integrated information on a set of
companies. The third solution is to maintain a mapping between company names and the
URLs of their web-sites internally within the COIN system.
The first solution of querying against a search engine requires a very complicated
algorithm to identify the correct source. Most search engines will perform key word
search and return a list of potential matches. The COIN system then has to identify the
company web-site. The story does not end here. The financial records are probably
hidden somewhere within the company web-site. And the structures are different for
web-sites of different companies. The COIN system has to handle lots of context issues
in this solution.
The second solution involves identifying sources that have integrated information on
a set of companies. In the example of financial reports, we may be able to download
company financials from Edgar Online. Edgar Online has quite a complete set of SEC
filings for US companies. However, due to different reporting requirements for non-US
companies, Edgar Online does not provide much information on non-US companies.
Take Rhone-Poulenc as an example. The only SEC filing for Rhone-Poulenc on Edgar
Online is a list of "Amended Ownership Statements". We cannot locate its annual or
quarterly financial reports.
The third solution involves maintaining a mapping within the COIN system. For
example, the mapping for the four pharmaceutical companies mentioned in the previous
scenarios is as follows:
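[Table 14 is only partially recoverable from the source. Columns: Company Name, Stock Symbol, URL for Financial Information in Fiscal Year 1998. Surviving row: Johnson & Johnson, JNJ, URL truncated in the source (ends in "1.html").]
Table 14: Mapping of company names to the URL of their online financial information.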
We can notice that the structure of each company's web-site is very different. Some
companies, in this case Merck, do not provide financials online at all. If the COIN system
has to incorporate this mapping, the scale of the system will increase linearly with the
number of companies. Also, it is insufficient to have the URL alone. The representation
of the financial information within each web-site is different. The COIN system has to
know the data provider's context in order to correctly wrap the web-sites. Thus this
solution will limit the scalability of the system.
6.6.2 Defining the Ontology View
The problem with defining the ontology view is who should be responsible for
defining the mappings. We can let the users define their own views or let the ontology
provider define the view.
In the first solution, the users set their own views explicitly by stating what mappings
to use. For example, in the scenario in section 6.5, Jessica can set her own view by
querying for the percentage of US Sales to Total Sales of Johnson & Johnson in fiscal
year 1998. The mapping will be
Percentage of US Sales to Total Sales = US Sales ÷ Total Sales x 100%
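A user-defined view of this kind can be pictured as a named mapping over existing data objects. The Python sketch below is a hypothetical illustration: the registry USER_VIEWS, the view name and the function names are assumptions, and the input figures are placeholders rather than Johnson & Johnson data.

```python
def percentage_of_total(part, total):
    """Return part as a percentage of total."""
    if total == 0:
        raise ValueError("total must be non-zero")
    return part / total * 100


# Hypothetical registry: view name -> (names of input objects, mapping function).
USER_VIEWS = {
    "Percentage of US Sales to Total Sales": (
        ("US Sales", "Total Sales"),
        percentage_of_total,
    ),
}


def evaluate_view(view_name, data):
    """Look up a user-defined view and apply its mapping to the input data."""
    input_names, mapping = USER_VIEWS[view_name]
    return mapping(*(data[name] for name in input_names))


# Placeholder figures for illustration only.
sample = {"US Sales": 50.0, "Total Sales": 200.0}
print(evaluate_view("Percentage of US Sales to Total Sales", sample))  # 25.0
```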
There will be a view maintenance layer between the query interface and the COIN
engine. The SQL-to-Datalog module needs to be modified as well. The following
diagram presents the extension of a view maintenance layer in the COIN system:
[Figure 11 diagram not recovered; surviving labels: View Maintenance, COIN Engine.]
Figure 11: View Maintenance layer in the COIN model.
In the second solution, the mappings are defined by the ontology provider. Users can
only pose queries with a limited number of mappings. For example, in the scenario in
section 6.5, the relations will be limited to the one-way mapping of total sales.
6.7 Further Research
6.7.1 Source Mapping for Derived Data
We point out in section 6.6.1 that it is difficult for the COIN system to correctly
identify sources in the derived data approach. We propose two solutions but both of them
involve lots of context issues and greatly hinder the scalability of the system.
Different companies have web-sites with different structures. We believe that further
research can be done on the web wrapper module so as to more effectively wrap the web-sites.
Also, research can be done on how to map the company names to their web-sites
efficiently. The proposed solution involves a one-to-one mapping for each company and
the scale of the system increases linearly with the number of companies. We can consider
querying against web-sites that integrate the annual reports of different companies, such
as Edgar Online. However, the filings on Edgar Online may not be complete, especially
for non-US companies. For example, we try to search the Edgar Online database for SEC
filings of Rhone-Poulenc. To our disappointment, we only find an "Amended Ownership
Statement" filed on February 10, 1999. There is no filing of quarterly or annual report.
6.7.2 Other Source Selection Criteria
Wang and Strong [1996] have identified fifteen data quality criteria. In this thesis,
we discuss four criteria that are most relevant to context mediation: accuracy,
completeness, consistency and timeliness. Further research can be done on other quality
criteria.
We believe that the accessibility criterion will be interesting to the COIN system.
Accessibility of a source can depend on lots of factors: network traffic, server speed,
server load etc. In the scenario of data consistency in section 6.2, the user is trying to
obtain quarterly dividend of a company. The current COIN system does not have the
functionality to select a particular source from a set of sources that are in the same
context as that of the user, i.e. reporting quarterly dividend. How should the COIN
system decide which source to query against? Accessibility can play a part in the solution.
The COIN system may want to query against the fastest server. But how does it know
which server is faster? How about the network traffic condition? Lots of interesting
issues will arise in this area.
In this thesis, we have discussed the meanings of data quality and have shown that
it is a relative concept. We have identified four data quality criteria that are most relevant
to context mediation: accuracy, completeness, consistency and timeliness. Through the
scenarios of two users who are in different contexts, we show how data discrepancies
arise due to each of the four data quality issues mentioned above.
We then introduce a solution with context mediation. We show how context
mediation can help resolve the issues in each of the four data quality criteria. For
semantic conflicts such as consistency and timeliness issues, the context mediation
process is fairly straightforward. For data discrepancies such as accuracy and
completeness issues, we propose to use derived data in context mediation. The COIN
approach can only partially solve the problems of data accuracy and completeness
because the problems are relative to the user's context.
In conclusion, we believe that context mediation is a novel approach in handling data
quality problems. Most of the data quality problems depend very much on the users'
contexts. The COIN framework, with a set of well-defined contexts of the data providers,
can query against the source that best fits the user's context. In this thesis, we present the
scenarios in the financial domain. However, we believe that the COIN model can be
extended to handle data quality issues in other domains as well.
Arens, Y. and Knoblock, C. (1993). SIMS: Retrieving and Integrating Information
From Multiple Sources. Proceedings of the 1993 ACM SIGMOD International
Conference on Management of Data, pp. 562-563.
Ballou, D. P. and Pazer, H. L. (1985). Modeling Data and Process Quality in Multi-input, Multi-output Information Systems. Management Science 31, 2 (1985), pp. 150-162.
Bressan, S. and Bonnet, P. (1997). Extraction and Integration of Data from Semi-structured Documents into Business Applications. Conference on Industrial Applications
of Prolog (1997).
Bressan, S., Fynn, K., Goh, C., Madnick, S., Pena, T., and Siegel, M. (1997).
Overview of a Prolog Implementation of the Context Interchange Mediator. Proceedings
of the Fifth International Conference and Exhibition on The Practical Applications of
Prolog (Mar. 1997).
Bressan, S., Goh, C., Fynn, K., Jakobisiak, M., Hussein, K., Lee, T., Madnick, S.,
Pena, T., Qu, J., Shum, A., and Siegel, M. (1997). Context Interchange Mediator
Prototype. ACM SIGMOD International Conference on Management of Data (1997).
Florescu, D., Koller, D., and Levy, A. (1997). Using Probabilistic Information in Data
Integration. Proceedings of the 23rd VLDB Conference, Athens, Greece (1997).
Goh, C. (1997). Representing and Reasoning about Semantic Conflicts in Heterogeneous
Information Systems. MIT Sloan School of Management CISL Working Paper #97-01.
Gravano, L., Chang, C.-C. K., and Garcia-Molina, H. (1997). STARTS: Stanford
Proposal for Internet Meta-searching. Proceedings of the ACM SIGMOD Conference
(1997).
Gravano, L., Garcia-Molina, H., and Tomasic, A. (1994). The Effectiveness of
GlOSS for the Text Database Discovery Problem. Proceedings of the ACM SIGMOD
Conference (1994).
Gruber, T. (1993). A Translation Approach to Portable Ontologies. Knowledge
Acquisition 5, 2 (1993), pp. 199-220.
Kakas, A. C., Kowalski, R. A., and Toni, F. (1993). Abductive Logic Programming.
Journal of Logic and Computation 2, 6 (1993), pp. 719-770.
Kifer, M., Lausen, G., and Wu, J. (1995). Logical Foundations of Object-oriented and
Frame-based Languages. JACM 4 (1995), pp. 741-843.
Kriebel, C. H. (1979). Evaluating the Quality of Information Systems. Design and
Implementation of Computer Based Information Systems. N. Szyperski and E. Grochla,
Ed. Sijthoff & Noordhoff, Germantown.
Liu, L. and Pu, C. (1997). A Metadata Based Approach to Improving Query
Responsiveness. Proceedings of the 2nd IEEE Metadata Conference (1997).
McCarthy, J. (1987). Generality in Artificial Intelligence. Communications of the ACM
30, 12 (1987), pp. 1030-1035.
Naumann, F., Freytag, J. C., and Spiliopoulou, M. (1998). Quality-driven Source
Selection Using Data Envelopment Analysis. Proceedings of the 1998 Conference on
Information Quality (1998), pp. 137-152.
Shah, S. (1998). Design and Architecture of the Context Interchange System. MIT Sloan
School of Management CISL Working Paper #98-05 (May 1998).
Strong, D. M., Lee, Y. W., and Wang, R. Y. (1997). Data Quality in Context.
Communications of the ACM 40, 5 (May 1997), pp. 103-110.
Tayi, G. K. and Ballou, D. P. (1998). Examining Data Quality. Communications of the
ACM 41, 2 (Feb. 1998), pp. 54-57.
Wand, Y. and Wang, R. Y. (1996). Anchoring Data Quality Dimensions in Ontological
Foundations. Communications of the ACM 39, 11 (Nov. 1996), pp. 86-95.
Wang, R. Y. and Reddy, M. P. (1992). Quality Data Objects. MIT Sloan School of
Management Working Paper #3517 (1993).
Wang, R. Y. and Strong, D. M. (1996). Beyond Accuracy: What Data Quality Means
to Data Consumers. Journal of Management Information Systems 12, 4 (1996).
