Intelligent Information Source Selection for The Context Interchange System

by Kenneth C. Ng

Submitted to the Department of Electrical Engineering and Computer Science in Partial Fulfillment of the Requirements of the Degrees of Bachelor of Science in Computer Science and Engineering and Master of Engineering in Electrical Engineering and Computer Science at the Massachusetts Institute of Technology

May 21, 1999

Copyright 1999 Kenneth C. Ng. All rights reserved.

The author hereby grants to M.I.T. permission to reproduce and distribute publicly paper and electronic copies of this thesis and to grant others the right to do so.

Author: Department of Electrical Engineering and Computer Science, May 21, 1999
Certified by: Prof. Stuart Madnick, Thesis Supervisor
Certified by: Dr. Michael Siegel, Thesis Supervisor
Accepted by: Arthur C. Smith, Chairman, Department Committee on Graduate Theses

Intelligent Information Source Selection for The Context Interchange System

by Kenneth C. Ng

Submitted to the Department of Electrical Engineering and Computer Science on May 21, 1999, in Partial Fulfillment of the Requirements of the Degrees of Bachelor of Science in Computer Science and Engineering and Master of Engineering in Electrical Engineering and Computer Science

ABSTRACT

The Context Interchange (COIN) project aims to develop tools and technologies for supporting access to heterogeneous information sources on the Internet. Under the current COIN framework, a user has to specify the sources when issuing a query. As the number of information sources on the Internet increases rapidly, a user can choose from many similar information sources. Users would like to send their queries to the sources that are most relevant to their contexts. This thesis examines the problem of data quality and source selection. We present an approach to source selection based on context mediation.
We show that context mediation can handle semantic conflicts, such as consistency and timeliness issues, among various sources. For data discrepancies, we propose to adopt the strategy of derived data, i.e. ontology mapping.

Thesis Supervisor: Stuart Madnick
Title: John Norris Maguire Professor of Information Technology, Sloan School of Management

Thesis Supervisor: Michael Siegel
Title: Principal Research Scientist, Sloan School of Management

Acknowledgements

I would like to express my heartfelt gratitude to my thesis supervisors, Prof. Stuart Madnick and Dr. Michael Siegel, for offering me the unique opportunity to work on the Context Interchange Project. Their invaluable advice and ardent support guided me towards new ideas for my thesis. Also, I would like to thank the Context Interchange team for maintaining an open, supportive working atmosphere. Special thanks go to Robert Frankus for ideas on business scenarios from his project, and to Aykut Firat for insights and suggestions. My days at MIT were the most memorable chapters of my life. I want to say thank you to all my friends at MIT who have shared my happiness and faced tough challenges with me. Last but not least, I am proud to enjoy the love and encouragement of my parents, Vincent and Ann, and my sister, Karen. Without them, I would not have accomplished so much over the past 21 years.

Contents

1 Introduction
  1.1 Motivation
    1.1.1 Data Quality
    1.1.2 Source Selection Problem
  1.2 Goal of Thesis
  1.3 Structure of Thesis

2 Background
  2.1 Context Interchange System
    2.1.1 The COIN Framework
    2.1.2 Context Mediator
    2.1.3 System Perspective
    2.1.4 Application Domains
  2.2 COIN Architecture
    2.2.1 Client Processes
    2.2.2 Server Processes
    2.2.3 Mediator Processes

3 Context Mediation
  3.1 Scenario
  3.2 Domain Model
  3.3 Elevation Axioms
  3.4 Mediation System
    3.4.1 SQL to Datalog Query Compiler
    3.4.2 Mediation Engine
    3.4.3 Query Planner and Optimizer
    3.4.4 Runtime Engine
  3.5 Web Wrapper
  3.6 Source Selection in Context Mediation

4 Data Quality and Source Selection
  4.1 Data Quality
    4.1.1 Defining Data Quality
    4.1.2 Quality Criteria for Information Sources
  4.2 Source Selection
    4.2.1 Information Retrieval Approach
    4.2.2 Database Approach

5 Source Selection Problem
  5.1 Defining the Problem
  5.2 Scenario Analysis
    5.2.1 Background
    5.2.2 Financial Ratios
    5.2.3 Timeliness Issues
    5.2.4 Consistency Issues
    5.2.5 Completeness Issues
    5.2.6 Accuracy Issues
  5.3 COIN Approach

6 Solution with Context Mediation
  6.1 Overview
  6.2 Data Consistency
    6.2.1 Defining the Domain Model
    6.2.2 Defining the Context
    6.2.3 Defining the Conversion Function
  6.3 Data Timeliness
    6.3.1 Defining the Domain Model
    6.3.2 Defining the Context
    6.3.3 Defining the Conversion Function
  6.4 Data Accuracy
    6.4.1 Resolving Data Discrepancies
    6.4.2 Derived Data
    6.4.3 Ontology View
  6.5 Data Completeness
  6.6 Potential Problems and Possible Solutions
    6.6.1 Identifying the Sources
    6.6.2 Defining the Ontology View
  6.7 Further Research
    6.7.1 Source Mapping for Derived Data
    6.7.2 Other Source Selection Criteria

7 Conclusion

8 References

9 Appendices
  Appendix A
  Appendix B
  Appendix C
  Appendix D
  Appendix E
  Appendix F
1 Introduction

1.1 Motivation

Corporate information integration is essential for business analyses in many fields such as consulting and finance. Management consultants need to understand the competition and the dynamics of an industry. Equity research analysts have to value companies based on fundamentals and market information. An integrated approach to obtaining corporate information would result in tremendous time and cost savings.

The advent of the Internet extended the frontier of information integration. Information sources in different parts of the world are now just a few mouse clicks away. However, the inherent semantic heterogeneity in human languages is a major obstacle to the correct interpretation of information. Different sources operate in different contexts. For example, a US customer visiting the web site of an online bookshop in the UK might misinterpret prices in pounds as prices in dollars. When the currency is not stated clearly, people naturally assume the prices are in their local currencies. In other words, people assume they are within the same context as the information source. Context differences could lead to large-scale semantic heterogeneity and hinder the information integration process.

The COntext INterchange (COIN) project aims to develop tools and technologies for supporting access to heterogeneous information sources. The heart of the COIN framework is the notion of context, which determines the underlying meaning and interpretation of information. Contexts are collections of statements defining how data should be interpreted and how potential conflicts might be resolved. The COIN framework is designed to work with a wide range of distributed information systems. A natural extension of the COIN framework is the integration of information on the Internet with context mediation. Nowadays, the number of information sources on the Internet increases exponentially.
People can obtain detailed information on virtually every topic: from microbiology to astrophysics, and from Chinese history to American technologies.

1.1.1 Data Quality

As the number of information sources on the Internet increases, a user can typically choose from many similar information sources. Due to time and cost considerations, people only want to query the most appropriate sources. This can be achieved either by the user or by an automated process. Hence, we assume multiple information sources with partially overlapping content that can be uniformly queried by some system. The sources can be document collections or Internet web sites. The information sources show a varying quality of information and a varying quality of access, i.e. varying query cost. Querying information from Internet sources is usually divided into three tasks [Gravano, Chang, and Garcia-Molina 1997]:

* Source selection, i.e. choosing the best possible information sources to evaluate a query
* Query evaluation at the sources
* Merging the query results

Given a query and a set of information sources that are capable of answering the query to some extent, we address the problem of deciding which of the sources to issue the query upon. This decision should be based on data quality. Without considering quality, a system might return an answer that is useless, inaccurate, or incomplete. Among the data quality dimensions suggested by Wang and Strong [1996], we choose the four criteria that are most relevant to context mediation: accuracy, completeness, consistency, and timeliness. There is no exact definition of accuracy; in the literature, accuracy is generally viewed as equivalent to correctness. We consider completeness to mean that all values required by the user for a certain variable are recorded. Consistency is related to the values of data: a data value can only be expected to be the same for the same situation.
If different sources report different values for the same piece of data under the same situation, then these sources are inconsistent. Timeliness has been defined in terms of whether the data is out of date [Ballou and Pazer 1985] and the availability of output on time [Kriebel 1979].

1.1.2 Source Selection Problem

Source selection is usually handled in a straightforward manner with a selector component that analyzes source capabilities and source contents. Matching the query against the capabilities of the sources determines which combinations of sources are capable of answering the query. Matching the query against the source contents determines the sources that will likely provide the most, and the most relevant, information. This technique relies on statistical information giving the total number of appearances of each distinct word, the document frequency of each word, and the total number of documents in the source. With this information, the appropriateness of each source for evaluating the query can be estimated, and the sources with the highest estimates are then chosen to be queried. An information source is thus considered appropriate if the keywords of a query appear often and in many documents.

We believe there is more to finding the best sources than just counting the appearances of certain keywords. Consider a source containing not only text documents but also explanatory graphics on the subject. Should this source be valued higher than one without graphics? Consider a source that matches the query well but is very concise, hardly containing anything but the keywords of the query. Should its documents be downloaded? Finally, consider a source containing matching but outdated information. Is such a source worth exploring at all? The quality of a source and of the documents it contains must be measured with more than one criterion or quality dimension. Another concern is the context of the user.
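The keyword-statistics ranking described above can be sketched in a few lines. This is a simplified illustration only: the scoring formula and the sample data are invented for exposition (loosely in the spirit of word-frequency source ranking), not the selector component's actual algorithm.

```python
# Illustrative sketch: rank sources by keyword statistics.
# Scoring formula and sample data are hypothetical, for exposition only.

def score_source(query_words, word_counts, doc_freq, num_docs):
    """Estimate how well a source matches a query from its word statistics."""
    if num_docs == 0:
        return 0.0
    score = 0.0
    for w in query_words:
        appearances = word_counts.get(w, 0)   # total appearances of w in the source
        docs_with_w = doc_freq.get(w, 0)      # number of documents containing w
        # a word contributes more when it appears often and in many documents
        score += appearances * (docs_with_w / num_docs)
    return score

# each source: (word counts, document frequencies, total number of documents)
sources = {
    "A": ({"daimler": 40, "benz": 35}, {"daimler": 20, "benz": 18}, 100),
    "B": ({"daimler": 5}, {"daimler": 2}, 500),
}
query = ["daimler", "benz"]
ranked = sorted(sources, key=lambda s: score_source(query, *sources[s]), reverse=True)
print(ranked)  # ['A', 'B'] -- A's keywords appear often and in many documents
```

As the surrounding discussion argues, such a ranking sees only keyword counts; it cannot tell whether the highest-scoring source is outdated, inconsistent, or inappropriate for the user's context.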
Data with quality considered appropriate for one use may not possess sufficient quality for another use. For example, a financial analyst would like to obtain the geographical breakdown of sales of a company, while a day-trader only wants to know the last quoted price of the company's stock. In this case, any one of the stock quote servers on the Internet would probably satisfy the day-trader. But the financial analyst might have to dig up the annual report or SEC filings of the company for the geographical breakdown of sales.

Two main problems arise when the COIN model is applied to sources of unknown quality. First, there is the problem of semantic conflicts: different sources may have different representations for the same piece of data. Second, there is the problem of data discrepancies: sources sometimes provide conflicting values for the same piece of data with no apparent semantic conflicts.

We will illustrate the problem of source selection with a scenario of personal investing. Researching equity stocks is a very tedious process. Investors might want to obtain key financial ratios of various companies. For example, they would need to compare valuations with the Price/Earnings ratios (P/E ratios). However, different Internet sources might give different P/E ratios for the same company. In this case, we are interested in knowing the reason behind the inconsistency. Is it due to semantic conflicts? Or is it due to data discrepancies? We will show how context mediation can help the investor obtain the information from the source that best fits his context.

1.2 Goal of Thesis

The goal of this thesis is to present an approach to source selection based on context mediation. We show that context mediation can handle semantic conflicts, such as consistency and timeliness issues, among various sources. For data discrepancies, we propose to adopt the strategy of derived data, i.e. ontology mapping.
1.3 Structure of Thesis

In chapter 2, we first give a brief introduction to the COIN model, on which further extensions will be based. Our focus is on the context mediation process. In chapter 3, we give details of the context mediation process, defining the domain model, elevation axioms, and the mediation system. We also discuss the limitations of the current context mediator when dealing with data conflicts.

In chapter 4, we discuss the meaning of data quality based on previous research in this area. There are four main quality criteria that we focus on: accuracy, completeness, consistency, and timeliness. Then, we move on to present a literature review of source selection approaches. In chapter 5, we define the problem of our thesis by pointing out that most source selection processes have ignored the contexts of data providers and data consumers. In order to illustrate source selection concretely, we present two scenarios that highlight the problem. There are two main types of conflicts among information sources: semantic conflicts and data discrepancies.

In chapter 6, we introduce the approach of source selection based on context mediation. We show that context mediation can handle semantic conflicts, such as consistency issues and timeliness issues, among various sources. For data discrepancies, we propose to adopt the strategy of derived data, i.e. ontology mapping. At the end of the chapter, we leave open-ended questions for further research. In chapter 7, we conclude the thesis with our research findings and proposed solution.

2 Background

2.1 Context Interchange System

The COntext INterchange (COIN) strategy seeks to address the problem of semantic interoperability by consolidating distributed data sources and providing a unified view of them. COIN technology presents all data sources as SQL databases by providing generic wrappers for them.
The underlying integration strategy, the COIN model, defines a novel approach for mediated data access in which semantic conflicts among heterogeneous systems are automatically detected and reconciled by the Context Mediator. As a result, the COIN approach integrates disparate data sources by providing semantic interoperability (the ability to exchange data meaningfully) among them [Bressan et al. 1997].

2.1.1 The COIN Framework

The COIN framework is composed of both a data model and a logical language, COINL, derived from the family of F-Logic [Kifer, Lausen, and Wu 1995]. The data model and language define the domain model of the receiver, the data sources, and the contexts [McCarthy 1987] associated with them. The data model contains the definitions for the "types" of information units (called semantic types) that constitute a common vocabulary for capturing the semantics of data in disparate systems. Contexts, associated with both information sources and receivers, are collections of statements defining how data should be interpreted and how potential conflicts (differences in interpretation) should be resolved. Concepts such as semantic objects, attributes, modifiers, and conversion functions define the semantics of data inside and across contexts. Together with the deductive and object-oriented features inherited from F-Logic, the COIN data model and COINL constitute an expressive framework for representing semantic knowledge and reasoning about semantic heterogeneity.

2.1.2 Context Mediator

The Context Mediator is the heart of the COIN project. It is the unit that provides mediation for user queries. Mediation is the process of rewriting queries posed in the receiver's context into a set of mediated queries in which all potential conflicts are explicitly resolved. This process is based on an abduction procedure that determines what information is needed to answer the query and how conflicts should be resolved by using the axioms in the different contexts involved.
Answers generated by the mediation unit can be both extensional and intensional. Extensional answers correspond to the actual data retrieved from the various sources involved. Intensional answers, on the other hand, provide only a characterization of the extensional answer without actually retrieving data from the data sources. Furthermore, the mediation process supports queries on the semantics of data that are implicit in the different systems. These are referred to as knowledge-level queries, as opposed to data-level queries, which are inquiries on the factual data present in data sources. Finally, integrity knowledge on one source or across sources can be naturally involved in the mediation process to improve the quality and information content of the mediated queries and ultimately aid in the optimization of data access.

2.1.3 System Perspective

From a systems perspective, the COIN strategy combines the best features of the loose- and tight-coupling approaches to semantic interoperability among autonomous and heterogeneous systems. Its modular design and implementation funnels the complexity of the system into manageable chunks, enables sources and receivers to remain loosely coupled to one another, and sustains an infrastructure for data integration. This modularity, both in the components and in the protocol, also keeps the infrastructure scalable, extensible, and accessible. By scalability, we mean that the complexity of creating and administering the mediation services does not increase exponentially with the number of participating sources and receivers. Extensibility refers to the ability to incorporate changes into the system in a graceful manner; in particular, local changes do not have adverse effects on other parts of the system. Finally, accessibility refers to how the user perceives the system in terms of its ease of use and its flexibility in supporting a variety of queries.
2.1.4 Application Domains

The COIN technology can be applied to a variety of scenarios where information needs to be shared amongst heterogeneous services and receivers. The need for this novel technology in the integration of disparate data sources can be readily seen in the following examples.

A useful application of the COIN technology is in the financial domain. The COIN model could assist financial analysts in conducting research and valuing companies. The technology is particularly valuable when comparing companies across borders. Different countries have different reporting requirements and accounting standards. Furthermore, each country has its own financial information providers. All this information might be presented in vastly different formats and be of different quality. Some major discrepancies are due to scale factors and currency representations. The COIN framework could help resolve the semantic heterogeneity among various sources and aid in financial decision making.

In the domain of manufacturing inventory control, the ability to access design, engineering, manufacturing, and inventory data pertaining to all parts, components, and assemblies is vital to any large manufacturing process. Typically, thousands of contractors play roles, and each contractor tends to set up its data in its own individualistic manner. COIN technology can play an important role in standardizing the various formats the contractors follow in their bids. This would help managers optimize inventory levels and ensure overall productivity and effectiveness.

Finally, the modern health care enterprise lies at the nexus of several different industries and institutions. Within a single hospital, different departments (e.g. internal medicine, medical records, pharmacy, admitting, billing) maintain separate information systems yet must share data in order to ensure high levels of care.
Medical centers and local clinics collaborate not only with one another but also with state and federal regulators, insurance companies, and other payer institutions. This sharing requires reconciling differences such as those in procedure codes, medical supplies, classification schemes, and patient records.

2.2 COIN Architecture

The feasibility and features of the proposed strategy of using context mediation to solve semantic differences between various heterogeneous data sources are demonstrated in a working system that provides mediated access to both on-line structured databases and semi-structured data sources such as web sites. This demonstration system implements most of the important concepts of the context interchange strategy and is called the COIN system. This section introduces the COIN system and its high-level architecture. The infrastructure leverages the World Wide Web in two ways. First, we rely on the hypertext transfer protocol for the physical connectivity among sources, receivers, and the different mediation components and services. Second, we employ the hypertext markup language and Java for the development of portable user interfaces.

[Figure 1: COIN architecture overview, showing the server, mediator, and client processes (Bressan et al. 1997).]

2.2.1 Client Processes

Client processes provide the interaction with receivers and route all database requests to the Context Mediator. An example of a client process is the multi-database browser, which provides a point-and-click interface for formulating queries to multiple sources and for displaying the answers obtained. More generally, any application program that posts queries to one or more sources can be considered a client process. This includes programs (e.g. spreadsheet software such as Excel or Access) that can communicate using the ODBC bridge to send SQL queries and receive results.
2.2.2 Server Processes

Server processes refer to database gateways and wrappers. Database gateways provide physical connectivity to databases on a network. The goal is to insulate the mediator processes from the idiosyncrasies of different database management systems by providing a uniform protocol for database access as well as a canonical query language (and data model) for formulating queries. Wrappers, on the other hand, provide richer functionality by allowing semi-structured documents on the World Wide Web to be queried as if they were relational databases. This is accomplished by defining an export schema for each of these web sites and describing how attribute values can be extracted from a web site using pattern matching.

2.2.3 Mediator Processes

Mediator processes refer to the system components that collectively provide the mediation services. These include the SQL-to-Datalog compiler, the context mediator, the query planner/optimizer, and the multi-database executioner. The SQL-to-Datalog compiler translates a SQL query into its corresponding Datalog format. The context mediator rewrites the user-provided query into a mediated query with all the conflicts resolved. The planner/optimizer produces a query evaluation plan based on the mediated query. The multi-database executioner executes the query plan generated by the planner: it dispatches subqueries to the server processes, collates the intermediary results, and returns the final answer to the client processes.

In our discussion, we focus on the mediator processes. In the next chapter, we introduce context mediation with a scenario of an equity research analyst gathering information from a variety of sources. Then, we elaborate on that example and explain the subsystems within the context mediator in detail. Finally, we identify the potential problems of poor information quality in context mediation.
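The four mediator stages described above form a pipeline: compile, mediate, plan, execute. The sketch below is purely illustrative; every function here is a stub with an invented name, and the real components operate on Datalog terms and context axioms rather than on plain strings and dictionaries.

```python
# Hypothetical sketch of the mediator-process pipeline described above.
# All names and data shapes are invented for illustration.

def sql_to_datalog(sql_query):
    # translate the SQL query into its corresponding Datalog form (stubbed)
    return {"datalog": sql_query}

def mediate(datalog_query, contexts):
    # rewrite the query so that all context conflicts are explicitly resolved (stubbed)
    return {**datalog_query, "conflicts_resolved": True, "contexts": contexts}

def plan(mediated_query):
    # produce a query evaluation plan: one subquery per participating source (stubbed)
    return [("worldscope", mediated_query), ("datastream", mediated_query)]

def execute(plan_steps):
    # dispatch subqueries to the server processes and collate the results (stubbed)
    return [source for source, _ in plan_steps]

query = "select total_assets from worldscope"
answer = execute(plan(mediate(sql_to_datalog(query), ["worldscope_ctx"])))
print(answer)  # ['worldscope', 'datastream']
```

The point of the sketch is the division of labor: each stage consumes the previous stage's output, so the executioner never needs to know how conflicts were detected or resolved.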
3 Context Mediation

3.1 Scenario

We introduce context mediation with an example of an equity research analyst researching Daimler Benz. He would like to find the net income, net sales, and total assets of Daimler Benz for the fiscal year ending 1993. He normally uses the financial data in the Worldscope database. However, for this particular research exercise, Worldscope does not have all the information he needs. He learns from his colleagues about two new databases, Datastream and Disclosure. Hopefully, he can obtain all the necessary information from the three databases.

He starts off with the Worldscope database, which has total assets for all the companies. He logs into the Oracle server containing the Worldscope data and issues a query:

    select company_name, total_assets
    from worldscope
    where company_name = "DAIMLER-BENZ AG";

The result returned is:

    DAIMLER-BENZ AG  5659478

Figure 2: Scenario for context mediation. Each source reports a different attribute under a different company name:

    Worldscope  (Company Name, Total Assets):  DAIMLER-BENZ AG    5659478
    Disclosure  (Company Name, Net Income):    DAIMLER-BENZ CORP  615000000
    Datastream  (Company Name, Net Sales):     DAIMLER-BENZ       97736992

The analyst continues to look for the net income and net sales figures for Daimler Benz. He realizes that Disclosure has net income data whereas Datastream has net sales data. For net income, he issues the query:

    select company_name, net_income
    from disclosure
    where company_name = "DAIMLER-BENZ AG";

The query does not return any records. He checks for typos and tries again, as he is pretty sure that Disclosure has the information. He refines the query by entering a partial name for Daimler Benz:

    select company_name, net_income
    from disclosure
    where company_name like "DAIMLER%";

The result returned is:

    DAIMLER BENZ CORP  615000000

He then realizes that the data sources do not conform to the same standards. The failure of the initial query is due to the different representations of company names in the two databases.
Finally, he issues the query for net sales:

select name, total_sales
from datastream
where name like "DAIMLER%";

The result returned is:

DAIMLER-BENZ  9773092

After obtaining all the information needed, he begins to analyze the numbers. However, a number of things are unusual about this data set. First, the total sales are twice as much as the total assets of the company, which is quite unlikely for a company like Daimler Benz. Another disturbing phenomenon is that net income is almost ten times as much as total assets. He immediately notices that something is wrong and tries to solve the mystery by digging into the fact sheets of the databases. Upon a detailed examination, he finds some interesting facts about the data. First, there is a difference in scale factors. Datastream has a scale factor of 1000 for all the financial amounts, while Disclosure uses a scale factor of one. Second, there is a difference in currency denominations. Both Datastream and Disclosure use the currency of the country of incorporation, while Worldscope uses a scale factor of 1000 but reports every number in USD. After recognizing the semantic differences of the data sources, he would need a data source of historical currency exchange rates in order to reconcile the differences. With context mediation, the system automatically detects and resolves the semantic conflicts between all the data sources. The results are presented in the format that the user is familiar with. In the above scenario, if the equity research analyst were using a context mediation system instead, all he would have to do is formulate and issue a single query, despite the underlying semantic differences among the various data sources.
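The reconciliation the analyst would otherwise perform by hand can be sketched as follows. The exchange rate and the treatment of DEM amounts below are illustrative assumptions, not values taken from the sources.

```python
# Sketch of reconciling scale-factor and currency differences into the
# Worldscope context (scale factor 1000, amounts in USD). The exchange
# rate of 0.58 USD/DEM is an assumed, illustrative figure.

def to_worldscope_context(value, scale_factor, usd_per_unit):
    absolute = value * scale_factor      # undo the source's scaling
    in_usd = absolute * usd_per_unit     # convert to USD
    return in_usd / 1000                 # re-apply Worldscope's scale of 1000

# Disclosure: scale factor 1, amounts in the currency of incorporation (DEM)
net_income_ws = to_worldscope_context(615_000_000, 1, 0.58)
```

A context mediation system applies exactly this kind of conversion function automatically, pulling the scale factor and currency from each source's context rather than from hand-collected fact sheets.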
For example, if he wants the result returned in the Worldscope context, then the context mediation system would issue the query:

select worldscope.total_assets, datastream.total_sales, disclosure.net_income
from worldscope, datastream, disclosure
where worldscope.company_name = "DAIMLER-BENZ AG"
and datastream.as_of_date = "01/05/94"
and worldscope.company_name = datastream.name
and worldscope.company_name = disclosure.company_name;

The system detects all the conflicts that the analyst would otherwise have to deal with, and resolves these conflicts without the analyst's explicit intervention. In the next section, we will discuss the concept of a domain model based on the above scenario.

3.2 Domain Model

A domain model specifies the semantics of the "types" of information units, which constitute a common vocabulary used in capturing the semantics of data in disparate sources. In other words, it defines the ontology that will be used. The various semantic types, the type hierarchy, and the type signatures (for attributes and modifiers) are all defined in the domain model. Types in the generalized hierarchy are rooted in system types, i.e. types native to the underlying system such as integers, strings, real numbers, etc. [Shah 1998].

Figure 3: Financial Domain Model [Shah 1998]. The diagram shows the semantic types of the financial domain, linked by inheritance, attribute, and modifier relationships.

Inheritance: This is the classic type of inheritance relationship. All semantic types inherit from basic system types. In the domain model, the type companyFinancials inherits from the basic type string. Attributes: In the COIN framework, objects have two forms of properties: those which are structural properties of the underlying data source and those that encapsulate the underlying assumptions about a particular piece of data. Attributes access structural properties of the semantic object in question. For instance, the semantic type companyFinancials has two attributes, company and fyEnding.
Intuitively, these attributes define a relationship between objects of the corresponding semantic types. The relationship formed by the company attribute states that for any company financial in question, there must be a corresponding company to which that company financial belongs. Similarly, the fyEnding attribute states that every company financial object has a date when it was recorded. Modifiers: Modifiers also define a relationship between semantic objects of the corresponding semantic types. The difference, though, is that the values of the semantic objects defined by the modifiers have varying interpretations depending on the context. Referring to the domain model, the semantic type companyFinancials defines two modifiers, scaleFactor and currency. The value of the object returned by the modifier scaleFactor depends on a given context.

3.3 Elevation Axioms

The mapping of data and data relationships from the sources to the domain model is accomplished through the elevation axioms. Three distinct operations define the elevation axioms:

- Define a virtual semantic relation corresponding to each extensional relation
- Assign to each semantic object defined its value in the context of the source
- Map the semantic objects in the semantic relation to semantic types defined in the domain model and make explicit any implicit links (attribute initialization) represented by the semantic relation

We illustrate how a relation is elevated with the Worldscope example. The Worldscope relation is a table in an Oracle database and has the following columns:

Name                          Type
COMPANY_NAME                  VARCHAR2(80)
LATEST_ANNUAL_FINANCIAL_DATE  VARCHAR2(10)
CURRENT_OUTSTANDING_SHARES    NUMBER
NET_INCOME                    NUMBER
SALES                         NUMBER
COUNTRY_OF_INCORP             VARCHAR2(40)
TOTAL_ASSETS                  NUMBER

3.4 Mediation System

In the following sections, we will describe the workings of the mediation system by means of the financial analyst application scenario.
We will begin with the query, and then we will describe the compilation process. In addition, we will discuss the domain and the specification of information context. The query is as follows:

select worldscope.total_assets, datastream.total_sales, disclosure.net_income, quotes.Last
from worldscope, datastream, disclosure, quotes
where worldscope.company_name = "DAIMLER-BENZ AG"
and datastream.as_of_date = "01/05/94"
and worldscope.company_name = datastream.name
and worldscope.company_name = disclosure.company_name
and worldscope.company_name = quotes.cname;

The above query is asked in the Worldscope context. Once the mediation system has processed all parts of the query, it converts the result into the Worldscope context and returns the converted results to the user.

3.4.1 SQL to Datalog Query Compiler

The SQL query is fed to the SQL query compiler when it is entered into the system. The query compiler parses the SQL query into its corresponding datalog form. At the same time, by means of the elevation axioms, the compiler elevates the data sources into their corresponding elevated data objects. The corresponding datalog query for the SQL query above is as follows:

Answer(total_assets, total_sales, net_income, last) :-
  WorldcAF_p(V27, V26, V25, V24, V23, V22, V21),
  DiscAF_p(V20, V19, V18, V17, V16, V15, V14),
  DstreamAF_p(V13, V12, V11, V10, V9, V8),
  Quotes_p(V7, qlast),
  Value(V27, c_ws, V5), V5 = "DAIMLER-BENZ AG",
  Value(V13, c_ws, V4), V4 = "01/05/94",
  Value(V12, c_ws, V3), V5 = V3,
  Value(V20, c_ws, V2), V5 = V2,
  Value(V7, c_ws, V1), V5 = V1,
  Value(V22, c_ws, total_assets),
  Value(V17, c_ws, total_sales),
  Value(V11, c_ws, net_income),
  Value(qlast, c_ws, last).

The query now contains the elevated data sources along with a set of predicates that map each attribute to its value in the corresponding context.
Since the user asked the query in the Worldscope context (denoted by c_ws), the last four predicates in the translated query ensure that the actual values returned as the solution of the query are in the Worldscope context. The resulting unmediated datalog query is then fed to the mediation engine.

3.4.2 Mediation Engine

The mediation engine is the core of the COIN system. It detects and resolves possible semantic conflicts. In short, mediation is a query rewriting process. The actual mechanism of mediation is based upon an abduction engine [Kakas, Kowalski, and Toni 1993]. The engine takes a datalog query and a set of domain model axioms and computes a set of abducted queries such that the abducted queries have all the semantic differences resolved. The system incrementally tests for potential semantic conflicts and introduces conversion functions to resolve the conflicts. The mediation engine outputs a set of queries that accounts for all possible cases of conflicts. Shah [1998] presents detailed examples of abducted queries.

3.4.3 Query Planner and Optimizer

The query planner module takes the set of datalog queries produced by the mediation engine and produces a query plan. It ensures that an executable plan exists which will produce a result that satisfies the initial query. A query planner is necessary because some sources restrict the types of queries they can service. Another limitation is the set of operators a source can handle. For example, some web sources do not export an interface that supports all the SQL operators. Once the planner verifies that an executable plan exists, it generates a set of constraints on the order in which the different sub-queries can be executed. Under these constraints, the optimizer applies standard optimization heuristics to generate the query execution plan. The query execution plan is an algebraic operator tree in which each operation is represented by a node.
There are two types of nodes:

Access Nodes: Access nodes represent access to remote data sources. The two subtypes of access nodes are:

- sfw nodes: These nodes represent access to data sources that do not require input bindings from other sources in the query
- join-sfw nodes: These nodes require input from other data sources in the query. Thus these nodes have to come after the nodes that they depend on while traversing the query plan tree

Local Nodes: These nodes represent local operations in the local execution engine. The four subtypes of local nodes are:

- join nodes: These nodes join two trees
- select nodes: These nodes apply conditions to intermediate results
- CVT nodes: These nodes apply conversion functions to intermediate query results
- union nodes: These nodes represent unions of results obtained by executing the sub-nodes

3.4.4 Runtime Engine

The runtime execution engine executes the query plan. For a given query plan, the execution engine traverses the query plan tree in a depth-first manner starting from the root node. At each node, it computes the sub-trees for that node and then applies the operation specified for that node. For each sub-tree, the engine recursively goes down the tree until it encounters an access node. At an access node, it composes and sends a SQL query to the remote source. The results of that query are then maintained in local storage. The operation at a node will not be carried out until all the sub-trees have been executed and all their results are available in local storage. This whole operation propagates all the way up to the root node. Upon reaching the root node, the execution engine has the required set of results corresponding to the original query. The results are then presented to the user and the whole context mediation process is completed.

3.5 Web Wrapper

Web wrapping is the technology that enables users to treat web sites as relational databases.
With this technology, users can issue SQL queries to web sources just as they would to any relation in a relational database. The implementation of this technology is the web wrapping engine [Bressan and Bonnet 1997]. With the web wrapper engine, application developers can rapidly wrap a structured or semi-structured web site and export the schema for users' queries.

Figure 4: Generic Web Wrapper Architecture [Bressan and Bonnet 1997]. The wrapper takes a query and the site specifications as input and returns the query results.

Figure 4 shows the architecture of the COIN web wrapper. The system takes a SQL query as input. It parses the query along with the specifications for the given web site. A query plan is then constructed. The query plan contains a detailed list of the web sites to send HTTP requests to, the order of those requests, and also the list of documents that will be fetched from those web sites. The executioner then executes the plan. Once the pages are fetched, the executioner extracts the required information from the pages and presents the collated results to the user.

3.6 Source Selection in Context Mediation

Context mediation is a powerful tool for resolving semantic differences among various data sources. However, the extensive use of the Internet poses the problem of information quality. As the number of information sources increases, users need only query the most appropriate sources. The data quality offered by these sources can and must be a criterion for source selection. When two different sources give conflicting information on the same subject, which one should we choose? In the next chapter, we will discuss the issue of data quality and some quality criteria. In chapter 5, we will define the problem of source selection with a scenario in which an investor is trying to make an investment decision by comparing the financial ratios of several companies in an industry.
He tries to leverage the resources on the Internet but faces the problem of semantic conflicts and data discrepancies among the sources.

4 Data Quality and Source Selection

4.1 Data Quality

4.1.1 Defining Data Quality

There is much database research showing how important data quality is to businesses and users. This research usually aims at ensuring the quality of data in databases. With few exceptions, however, data quality is treated as an intrinsic concept, independent of the context in which data is produced and used. This focus on intrinsic data quality problems fails to recognize the role of the data consumer in data quality management. In contrast to this intrinsic view, Strong, Lee and Wang [1997] propose that quality cannot be assessed independent of the consumers who choose and use products. Similarly, the quality of data cannot be assessed independent of the people who use data - data consumers. Data consumers' assessments of data quality are important because consumers now have more choices and control over their computing environment and the data they use. Tayi and Ballou [1998] define data quality as "fitness for use", which implies that the concept of data quality is relative. Thus data with quality considered appropriate for one use may not possess sufficient quality for another use. The trend toward multiple uses of data has highlighted the need to address data quality concerns. In addition, fitness for use implies that we need to look beyond traditional concerns with the accuracy of the data. Data in stock-portfolio management systems may be accurate but unfit for use if that data is not sufficiently timely. Also, personnel databases situated in different divisions of a company may be correct but unfit for use if the desire is to combine the two and they have incompatible formats.
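The "fitness for use" view above can be made concrete: whether one record is fit depends on who is asking. The field names, dates, and thresholds below are invented for illustration.

```python
import datetime

# "Fitness for use" sketch: the same record can be fit for one consumer
# and unfit for another. All thresholds and dates here are illustrative.

def fit_for_use(record, today, max_age_days, required_fields):
    # Fit only if the record is fresh enough for this consumer AND
    # carries every field this consumer requires.
    age = (today - record["as_of"]).days
    return age <= max_age_days and required_fields <= record.keys()

today = datetime.date(1999, 3, 16)
quote = {"as_of": datetime.date(1999, 3, 10), "price": 90.25}

# A portfolio manager needs day-old prices; a historian does not care.
manager_view = fit_for_use(quote, today, 1, {"price"})
historian_view = fit_for_use(quote, today, 365, {"price"})
```

The same stored value yields opposite quality judgments, which is exactly why data quality cannot be assessed independently of the data consumer.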
4.1.2 Quality Criteria for Information Sources

Wang and Strong [1996] have identified fifteen quality criteria and have classified them into four categories: "intrinsic quality", "accessibility", "contextual quality", and "representational quality".

Data Quality Category   Data Quality Dimensions
Intrinsic               Accuracy, Objectivity, Believability, Reputation
Accessibility           Accessibility, Access security
Contextual              Relevancy, Value-Added, Timeliness, Completeness, Amount of data
Representational        Interpretability, Ease of understanding, Concise representation, Consistent representation

Table 1: Data quality categories and dimensions.

Among the data quality dimensions suggested by Wang and Strong [1996], we choose four criteria that are most relevant to context mediation: accuracy, completeness, consistency, and timeliness.

Accuracy

There is no exact definition for accuracy. For example, Kriebel [1979] characterizes accuracy as "the correctness of the output information". Ballou and Pazer [1985] describe accuracy as "the recorded value is in conformity with the actual value." Thus it appears that accuracy is viewed as equivalent to correctness. Instead of defining accuracy, Wand and Wang [1996] try to define inaccuracy. Inaccuracy implies that an information system represents a real-world state different from the one that should have been represented.

Completeness

Completeness is the ability of an information system to represent every meaningful state of the represented real-world system. Ballou and Pazer [1985] view a set of data as complete if all necessary values are included: "All values for a certain variable are recorded". This definition, however, fails to consider the context of the user. Data considered complete by one user may not be sufficient for another. We define completeness as: all values required by the user for a certain variable are recorded.

Consistency

Consistency is related to the values of data.
A data value can only be expected to be the same for the same situation. If different values for the same piece of data under the same situation are reported by different sources, then these sources are inconsistent. Inconsistency occurs in two types of data conflicts: semantic conflicts and data discrepancies. Semantic conflicts may be due to different representations of the same piece of data or to timeliness issues. Data discrepancies may be a direct result of inaccuracy.

Timeliness

Timeliness has been defined in terms of whether the data is out of date [Ballou and Pazer 1985] and the availability of output on time [Kriebel 1979]. Timeliness is affected by three factors: how fast the information system state is updated after the real-world system changes; the rate of change of the real-world system; and the time the data is actually used. Lack of timeliness may lead to a state of the information system that reflects a past state of the real world.

4.2 Source Selection

Source selection approaches can be broadly divided into two categories: the information retrieval (IR) approach and the database (DB) approach.

4.2.1 Information Retrieval Approach

The information retrieval (IR) approach handles source selection in a straightforward manner, with a selector component analyzing source capabilities and source contents. Matching the query against the capabilities of the sources determines which combinations of sources are capable of answering the query. Matching the query against the source contents determines the sources that will likely provide the most, and the most relevant, information. Under the IR approach, retrieval of the desired information dispersed in multiple sources requires general familiarity with their contents and structure, query languages, location on existing networks, etc. The user must break down a given retrieval task into a sequence of actual queries to databases.
In the GlOSS system [Gravano, Garcia-Molina, and Tomasic 1994], the authors assume that each participating source provides information on the total number of documents in the source and, for each word, the number of documents it appears in. These values are used to estimate the percentage of query-matching documents in a source. The source with the highest percentage is chosen for querying. Florescu, Koller, and Levy [1997] try to describe the contents of information sources quantitatively using probabilistic measures. In their model two values are calculated: the coverage of an information source, i.e. the probability that a matching document is found in the source, and the overlap between two information sources, i.e. the probability that an arbitrary document is found in both sources. These probabilities are calculated with the help of word-count statistics. This information is then used to determine the order in which to query the sources.

4.2.2 Database Approach

The database (DB) approach involves creating a knowledge server that forms the interface between information sources and the applications in need of that information. A user queries the knowledge server in a manner that is independent of the distribution of information over the various sources. It is up to the knowledge server to determine how to obtain the desired information and which data sources to use. The Services and Information Management for decision Systems (SIMS) approach proposed by Arens and Knoblock [1993] accepts queries in the form of a description of a class of objects about which information is desired. A complete semantic model of the application domain is created and used to provide a collection of terms with which to describe the contents of the available information sources. The user is not presumed to know how information is distributed over the data and knowledge bases to which SIMS has access.
SIMS then proceeds to reformulate the user's query as a collection of more elementary statements that refer to data stored in the available information sources. A plan is created for retrieving the desired information, establishing the order and content of the various plan steps/subqueries. The resulting plan is then executed by performing the local data manipulation that generates the final translation into database queries in the appropriate languages. Liu and Pu [1997] propose a metadata approach to identify relevant and capable information sources. For each query, the query scope and the query capacity are determined. The query scope describes synonyms for each part of the query; the query capacity describes the information source capability requirements for each part of the query. This metadata is matched with the source capability profiles of the information sources, which describe the category, content, and capabilities of a source. Naumann, Freytag, and Spiliopoulou [1998] propose to use Data Envelopment Analysis (DEA) for quality-driven source selection. This method does not directly compare sources but focuses on individual sources, determining their efficiency in terms of information quality and cost. A set of efficient sources is then identified. Finer source selection can then be based upon further criteria imposed on the efficient set.

5 Source Selection Problem

5.1 Defining the Problem

In chapter 4, we discussed the meaning of data quality and several approaches to source selection in the previous literature. Most of the IR approaches rely on statistical information giving the total number of appearances of each distinct word, the document frequency of each word, and the total number of documents in the source. The DB approaches are slightly different. They set up a knowledge base that forms the interface between information sources and the applications in need of that information.
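The word-count statistics on which the IR approaches rely can be turned into a source ranking along the lines of GlOSS. The document counts below are invented for illustration, and the independence assumption is the one GlOSS itself makes.

```python
# GlOSS-style sketch: each source exports its document count and a
# per-word document-frequency table; all figures below are invented.

def match_fraction(source, query_words):
    # Estimate the fraction of documents matching all query words,
    # assuming words occur independently of one another.
    frac = 1.0
    for w in query_words:
        frac *= source["df"].get(w, 0) / source["docs"]
    return frac

s1 = {"docs": 1000, "df": {"merck": 100, "earnings": 500}}
s2 = {"docs": 1000, "df": {"merck": 10, "earnings": 900}}

# s1: 0.1 * 0.5 = 0.05 vs s2: 0.01 * 0.9 = 0.009, so s1 ranks first.
best = max([s1, s2], key=lambda s: match_fraction(s, ["merck", "earnings"]))
```

Note that the ranking says nothing about the timeliness or consistency of what those matching documents contain, which is precisely the gap the following discussion addresses.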
We believe there is more to finding the best sources than just counting the appearances of certain key words. Consider a source containing not only text documents but also explanatory graphics on the subject. Should this source be valued higher than one without graphics? Consider a source containing matching but outdated information. Is such a source worth exploring at all? The quality of a source is a relative measure. It depends very much on the requirements of the user. Two main types of conflicts may arise in the source selection process: semantic conflicts and data discrepancies. In the first case, different sources may have different representations for the same piece of data. In the second case, sources sometimes provide conflicting values for the same piece of data with no apparent semantic conflicts. In the next section, we will present two scenarios, one for an individual investor and the other for a professional financial analyst. Both of them are trying to obtain financial ratios and corporate information for several companies in the same industry. However, they are essentially in different contexts and will have different source requirements. Another complication is the discrepancies among different sources. When they try to integrate information from the Internet, they realize that different sources present conflicting data on the same subject. With the two scenarios, we will show how semantic conflicts and data discrepancies affect the source selection process and how the user's context plays a part.

5.2 Scenario Analysis

5.2.1 Background

John is an individual investor who does not want to miss the recent rally in the US stock market. Currently, he has a concentrated holding of technology stocks and would like to diversify his portfolio by investing in pharmaceutical companies. Before making any investment decisions, he needs to obtain the financials and analyst recommendations of several pharmaceutical companies.
Jessica, on the other hand, is a financial analyst with a prominent investment bank. She covers major pharmaceutical companies in the US and Europe. In compiling her research reports, she needs to obtain detailed information on each company's operations. The following is a list of dominant drug companies that both John and Jessica are interested in:

- Johnson & Johnson (NYSE: JNJ), a US health-care product manufacturer
- Merck (NYSE: MRK), a US pharmaceutical company
- Pfizer (NYSE: PFE), a US pharmaceutical company
- Rhone-Poulenc S.A. (NYSE: RP), a French chemical company

5.2.2 Financial Ratios

With a working knowledge of the Internet, John looks up several financial web-sites for the desired information. He is most interested in obtaining the key ratios of a company so that he can compare it with its competitors and the industry average. The key ratios that he is looking at are the P/E (price/earnings) ratio, EPS (earnings per share), estimated earnings growth in the next 5 years, dividend, and dividend yield (Appendix A). He obtains the numbers from several sources (Appendix B). The results of his research are as follows:

Q: Quarterly  Y: Yearly  TTM: Trailing 12 months

JNJ                  Bloomberg  DBC       Market Guide  Quicken   RapidResearch  Yahoo! Finance
Share Price          90.2500    90 1/4    90.25         90 1/4    90 1/4         90 1/4
P/E ratio            39.76      40.8      39.73 (TTM)   40.5      40.8           40.78
EPS                  2.27       2.229     2.23 (TTM)    2.23      2.23           2.23
5-yr EPS growth (%)  14.37      -         -             13.00     11.94          12.8
Dividend             0.25       0.25 (Q)  1.00 (Y)      0.25      0.97           1.00
Dividend yield (%)   1.11       1.1       1.13          1.0       1.1            1.10
Shares (mm)          1,344.67   1,344.67  1,344.67      1,344.67  1,344.67       1,340

Table 2: The key ratios of Johnson & Johnson as of market close on 3/16/99.
Q: Quarterly  Y: Yearly  TTM: Trailing 12 months

MRK                  Bloomberg  DBC       Market Guide  Quicken   RapidResearch  Yahoo! Finance
Share Price          85.8750    85 7/8    85.875        85 7/8    85 7/8         85 7/8
P/E ratio            38.95      39.15     38.98 (TTM)   39.9      39.2           39.19
EPS                  2.21       2.152     2.15 (TTM)    2.15      2.15           2.15
5-yr EPS growth (%)  -          -         12.03         14.00     17.89          13.4
Dividend             0.27       0.27 (Q)  1.08 (Y)      0.27      0.95           1.08
Dividend yield (%)   1.21       1.282     1.29          1.2       1.3            1.28
Shares (mm)          2,382.12   2,382.12  2,382.13      2,382.13  2,382.13      2,380

Table 3: The key ratios of Merck as of market close on 3/16/99.

PFE                  Bloomberg  DBC       Market Guide  Quicken   RapidResearch  Yahoo! Finance
Share Price          142.0000   142       142           142       142            142
P/E ratio            84.02      61.41     94.98 (TTM)   55.7      55.2           55.22
EPS                  2.12       2.64      2.64 (TTM)    2.55      2.55           2.55
5-yr EPS growth (%)  -          -         15.90         19.00     32.43          19.8
Dividend             0.22       0.22 (Q)  0.88 (Y)      0.22      0.76           0.88
Dividend yield (%)   0.56       0.625     0.63          0.6       0.6            0.62
Shares (mm)          1,297.79   1,297.79  1,297.79      1,297.79  1,297.79       1,300

Table 4: The key ratios of Pfizer as of market close on 3/16/99.

RP                   Bloomberg  DBC       Market Guide  Quicken   RapidResearch  Yahoo! Finance
Share Price          47.6250    47 5/8    47.625        47 5/8    47 5/8         47 5/8
P/E ratio            0.00       0         N/M           24.5      N/A            23.29
EPS                  0.00       -0.426    -0.42         1.94      -2.47          1.94
5-yr EPS growth (%)  -          -         -             N/M       N/A            16.0
Dividend             0.48       0         0.47 (Y)      0.47 (Y)  0.61           0.62
Dividend yield (%)   1.34       1.036     1.03          1.0       0              1.38
Shares (mm)          372.15     1,439.60  1,439.60      358.758   457.65         1,440

Table 5: The key ratios of Rhone-Poulenc S.A. as of market close on 3/16/99.

To John's surprise, there are lots of discrepancies among the various sources. In each field, he highlights the numbers that are significantly different from the rest. The P/E ratios, EPS, and estimated 5-year EPS growth rate have the most discrepancies. However, what really disturbs John is that there are even differences in dividend and shares outstanding. Dissatisfied with the results, John decides to revisit the web sites and look for explanations.

5.2.3 Timeliness Issues

John notices the difference in the reported EPS values of Pfizer between Market Guide ($2.64 per share) and other sources ($2.55 per share). He revisits the Market Guide web-site and quickly realizes that it reports trailing 12 months' earnings.
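A quick sanity check John might script is to recompute each P/E from the reported price and EPS. The Pfizer figures below are the ones reported by the sources; the helper function is illustrative.

```python
# Cross-check: does price / EPS reproduce a source's reported P/E?
# Pfizer figures as reported: price $142, EPS $2.55 (fiscal 1998) vs
# $2.64 (trailing 12 months).

def implied_pe(price, eps):
    return round(price / eps, 2)

pe_fy98 = implied_pe(142, 2.55)   # about 55.7, close to Quicken's figure
pe_ttm = implied_pe(142, 2.64)    # about 53.8
```

The check shows that some reported P/E figures follow directly from the same source's price and EPS, so a difference in P/E is often just the EPS-period difference in disguise rather than an independent error.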
The EPS number provided by DBC is the same as that of Market Guide. John suspects that DBC is reporting trailing 12 months' earnings as well. In order to verify his hypothesis, he visits the DBC web-site and looks for a definition of EPS. The following is the definition he gets from DBC:

"Earnings per share (EPS) EPS, as it is called, is a company's profit divided by its number of shares. If a company earned $2 million in one year and had 2 million shares of stock outstanding, its EPS would be $1 per share."

From DBC's explanation, it is unclear how "one year" is defined. It could be the "fiscal year", the "trailing 12 months", or just a general definition of "January to December". But based on his findings at the Market Guide web-site, John assumes that DBC defines "one year" as "trailing 12 months". John re-visits the Quicken web-site and finds that Quicken reports EPS for fiscal year 1998. The fiscal year of Pfizer ends in December. Since both RapidResearch and Yahoo! Finance report the same EPS values as Quicken, John suspects that RapidResearch and Yahoo! Finance report EPS for fiscal year 1998 as well. He visits the RapidResearch web-site and confirms that it does report EPS for fiscal year 1998. However, he cannot find any explanation on the Yahoo! Finance web-site. The EPS value reported by Bloomberg ($2.12 per share) is different from all the other sources. John visits the Bloomberg web-site and only finds an explanation of "Earnings: 12 months". As with DBC's explanation, it is unclear how "12 months" is defined. But since the value reported by Bloomberg is very close to those reported by Market Guide and DBC, John believes that Bloomberg reports trailing 12 months' earnings. He suspects the difference in reported EPS is due to dilution. A reduction in common EPS may occur if convertible securities are converted, stock options and warrants are exercised, or other shares are issued.
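The dilution effect described above is mechanical: potential shares enlarge the denominator. The numbers below are invented for illustration; the sources do not publish Pfizer's share detail here.

```python
# Why diluted EPS is lower than primary EPS; all figures are invented.

def primary_eps(net_income, common_shares):
    return net_income / common_shares

def diluted_eps(net_income, common_shares, potential_shares):
    # Convertibles, options, and warrants add shares to the denominator,
    # so for any positive potential_shares the diluted figure is lower.
    return net_income / (common_shares + potential_shares)

basic = primary_eps(3_300, 1_250)          # $ millions / million shares
diluted = diluted_eps(3_300, 1_250, 100)
```

This makes the hypothesis testable: if one source's EPS is below another's by roughly the fraction of potential shares, dilution alone can explain the gap without either source being wrong.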
In this case, it is possible that Bloomberg reports diluted EPS, which is lower than the primary EPS reported by Market Guide and DBC. In the above example, we notice that timeliness can be a source of data conflicts. Data conflicts do not necessarily mean that some or all of the data are inaccurate. All the EPS values reported by the sources can be correct. The discrepancies are due to different information service providers having different contexts.

5.2.4 Consistency Issues

John finds that the dividends of Merck reported by Yahoo! Finance and Market Guide ($1.08 per share) are four times those reported by Bloomberg, DBC and Quicken ($0.27 per share). At first glance, there seem to be large discrepancies in the dividends reported by the various sources. He looks up the DBC web-site again and discovers that DBC explicitly reports the quarterly dividend. Then, he tries Quicken and notices the following explanation concerning the dividend:

"Dividend A dividend is an amount of money or stock that a corporation pays to its shareholders quarterly..."

From the explanation offered by Quicken, it is clear that Quicken is reporting the value on a quarterly basis. Finally, John visits the Bloomberg web-site, but it offers no explanation of the dividend reported. Based on the fact that Bloomberg, DBC and Quicken are reporting the quarterly dividend and their values are one quarter of those of Yahoo! Finance and Market Guide, John suspects that Yahoo! Finance and Market Guide are reporting the dividend on an annual basis. He re-visits the Market Guide web-site and finds that Market Guide explicitly reports the annual dividend. He tries to look for an explanation at Yahoo! Finance but finds nothing. At this point, John is confident that the fourfold difference is caused by different representations of the same piece of data. John further notices that RapidResearch consistently reports a lower yearly dividend than the other sources.
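The fourfold difference dissolves once both figures are normalized into one representation, which is exactly the conversion a context mediation system would insert. The representation labels below are illustrative.

```python
# Normalizing dividend figures into an annual context; the quarterly
# and annual labels stand in for the sources' (implicit) contexts.

def annual_dividend(value, representation):
    if representation == "quarterly":
        return value * 4
    if representation == "annual":
        return value
    raise ValueError("unknown representation: " + representation)

# Merck: $0.27 quarterly (Bloomberg, DBC, Quicken) vs $1.08 annual
# (Market Guide, Yahoo! Finance) - the same payout in two contexts.
from_quarterly = annual_dividend(0.27, "quarterly")
from_annual = annual_dividend(1.08, "annual")
```

After normalization the five sources agree, confirming that this was a semantic conflict (different representations), not a data discrepancy.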
He visits the RapidResearch web-site, and it reports the dividend of fiscal year 1998 ($0.95 per share). John wants to understand the reason for the inconsistency, and he looks for explanations of dividend in other sources. Finally, he discovers the following at the Market Guide web-site:

"Dividend Rate ($ per share) This value is the total of the expected dividend payments over the next twelve months. It is generally the most recent cash dividend paid or declared multiplied by the dividend payment frequency, plus any recurring extra dividends."

After reading the explanation from Market Guide, John now understands why RapidResearch reports an annual dividend that is different from that of Market Guide. The value reported by RapidResearch is the dividend paid out during fiscal year 1998. On the other hand, the value reported by Market Guide is a 12-month projection based upon the current quarterly dividend.

In the above example, we notice that inconsistencies arise when values of the same piece of data are reported in different representations. Again, we believe that all the sources report the dividend values correctly. The user's context determines the source that is most appropriate to query against.

5.2.5 Completeness Issues

Despite some of the timeliness and consistency issues, John has been quite satisfied with the quality and extent of the information provided by the financial web-sites. As an individual investor, he can basically get all the information he needs from these sources. John finds these sources to have very high completeness. He goes on to recommend these financial web-sites to his friend Jessica, who works as a financial analyst with a prominent investment bank. Jessica is compiling a research report to investigate the effect of the global economic meltdown on US pharmaceutical companies. She is particularly interested in analyzing the geographical breakdown of sales of Johnson & Johnson.
As a professional financial analyst, she knows she can get the sales figures from Johnson & Johnson's annual report. However, since John highly recommends the Internet financial web-sites, she decides to use the Internet this time and save the trouble of flipping through the annual report. She looks up the Market Guide web-site for the sales information. To her disappointment, she only finds the following information:

Revenue ($ mm), 12 months ending 01/03/99    23,657

Table 6: Sales of Johnson & Johnson in fiscal year 1998 reported by Market Guide.

Jessica tries several other financial web-sites but still has no luck in obtaining the geographical breakdown of sales. Finally, she resorts to the 1998 annual report of Johnson & Johnson and obtains the following information:

(Dollars in Millions)              Sales to Customers in 1998
United States                      12,562
Europe                              6,317
Western Hemisphere excluding US     2,090
Asia-Pacific, Africa                2,688
Segments Total                     23,657

Table 7: Geographical breakdown of sales of Johnson & Johnson in fiscal year 1998 (Source: Johnson & Johnson 1998 Annual Report).

Jessica will not consider the Internet financial sources to be complete. They do not have the sales information that she needs. She only finds the Johnson & Johnson Annual Report to have high completeness. This example illustrates that completeness is a quality that depends very much on users' contexts.

5.2.6 Accuracy Issues

Jessica is looking at Rhone-Poulenc, a French chemical company. She is puzzled by the differences in the EPS values reported by the various sources. Unlike the example in section 5.2.3, the data discrepancies are quite widespread in this case. Quicken and Yahoo! Finance both report a gain for Rhone-Poulenc ($1.94 per share) while DBC and Market Guide both report a loss (-$0.42 per share). RapidResearch reports an even deeper loss (-$2.47 per share). Bloomberg does not have a value for the EPS.
Rhone-Poulenc is a French company, and there is probably limited financial information on it. Jessica understands that Rhone-Poulenc reported a loss in fiscal year 1997. So the discrepancies in gain/loss among the sources are most likely timeliness issues. However, the disturbing fact is that there are discrepancies even among sources that are reporting a loss. The only reliable way to obtain the EPS is to derive it from net earnings and the number of shares outstanding. She looks up the annual report of Rhone-Poulenc and obtains the following information:

(in Millions of French Francs)            1998
Net Income                                4,224
Earnings per Share                        11.48
Average Number of Shares Outstanding      367,752,291

Table 8: Consolidated Financials of Rhone-Poulenc in fiscal year 1998 (Source: Rhone-Poulenc 1998 Annual Report).

From the annual report of Rhone-Poulenc, Jessica finds out that the EPS is 11.48 FF per share. However, since the audience of her report is US-based investors, she has to convert the EPS figure into US dollars. With the help of the Oanada currency database, she obtains the historical FF/USD exchange rate of 1 FF = $0.1779 on 12/31/1998. Therefore, the EPS of Rhone-Poulenc in fiscal year 1998 is $2.04 per share.

The EPS value derived by Jessica from net earnings and the number of shares outstanding is different from any of the values reported by the Internet sources. There are no apparent reasons for the discrepancies. A possible reason is the use of different exchange rates. For example, some sources may be mistakenly using the current exchange rate rather than the exchange rate on 12/31/1998. In this example, data discrepancies without apparent semantic conflicts are considered to be accuracy issues. Wang and Reddy [1992] propose to resolve accuracy issues by asking for the user's judgement, which is likely to be based upon past experiences. We show that an alternative solution is to use other high quality data to derive the correct values for inaccurate data.
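The derivation Jessica performs can be sketched as a short calculation. The function name and structure below are illustrative only; the figures are the ones from Table 8 and the historical exchange rate quoted above:

```python
# Sketch of deriving EPS in USD from figures reported in French Francs.
# Inputs: net income in millions of FF, shares outstanding, and the
# historical FF-to-USD rate (all from the Rhone-Poulenc scenario).

def eps_in_usd(net_income_mm_ff, shares_outstanding, ff_to_usd):
    """Derive EPS in USD: (net income / shares) converted at the given rate."""
    eps_ff = net_income_mm_ff * 1_000_000 / shares_outstanding
    return eps_ff * ff_to_usd

eps = eps_in_usd(4224, 367_752_291, 0.1779)
print(round(eps, 2))  # 2.04
```

This reproduces the $2.04 per share figure: 4,224 million FF over 367,752,291 shares gives 11.48 FF per share, which converts to $2.04 at the 12/31/1998 rate.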
5.3 COIN Approach

The COIN system is an elegant system for resolving semantic conflicts among various data sources. We propose to leverage the COIN system to resolve the semantic conflicts and data discrepancies presented in the scenarios in section 5.2. In chapter 6, we introduce the approach of source selection based on context mediation. Our goal is to select the source that best fits the user's context. We show that context mediation can handle semantic conflicts, such as consistency and timeliness issues, among various sources in a straightforward manner. For data discrepancies, such as accuracy and completeness issues, we propose to adopt the strategy of derived data. With appropriate ontology mapping, high quality data can be used to derive the correct values for inaccurate data.

6 Solution with Context Mediation

6.1 Overview

In this chapter we propose a solution to source selection based upon context mediation. With the COIN approach, the user is unaware of the data discrepancies among the data sources. The system accesses the data sources implicitly, without the user ever specifying them in the query. In order to resolve the semantic conflicts and data discrepancies, the COIN system should be aware of the context of the data sources and that of the user.

Figure 5: Overview of the COIN approach in source selection. (The user query and context, together with the contexts of the data sources, feed into source selection with context mediation, which queries the web sources and returns the query result.)

For example, a user wants to obtain the annual dividend of a company. The COIN system should be aware that some sources report dividends on a quarterly basis while others report dividends on an annual basis. In the user's context, he expects the query result to be the annual dividend. The COIN system should query against a source that reports the annual dividend and return the query result to the user.
In case the source that reports the annual dividend is unavailable due to a network or server error, the COIN system should still be able to detect the semantic relation between quarterly dividend and annual dividend. It will query against a source that reports the quarterly dividend. Upon receiving the query result, the COIN system should perform the necessary conversion before passing the result to the user. In this case, the quarterly dividend is multiplied by four to yield the annual dividend. In the following sections, we will continue to use the two scenarios of an individual investor and a financial analyst to illustrate how context mediation can be applied to source selection.

6.2 Data Consistency

In section 5.2.4, we present a scenario in which John, an individual investor, notices the inconsistencies in reported dividend values among various sources. The reason behind the inconsistencies is that different sources have different contexts. Some sources report dividends on a quarterly basis while others report them on an annual basis. We will illustrate how context mediation can resolve the data inconsistency in this case. Shah [1998] describes the steps of building a COIN application in detail. In order to leverage the COIN system in resolving semantic conflicts, we need to carefully set up the domain model, the context definitions and the conversion functions.

6.2.1 Defining the Domain Model

In the domain model, we have two types of objects: basic objects (denoted by squares) and semantic objects (denoted by ellipses). Basic objects have scalar types; examples are numbers and strings. The semantic objects are the ones that are used in the system, as they carry rich semantic information about the objects in a given domain. Furthermore, there are two types of relations specified between the objects: the attribute relation and the modifier relation. Details of these relations are described in chapter 3.
Figure 6: Dividend Domain Model. (Dashed arrows denote attribute relations and solid arrows denote modifier relations. The semantic type Dividend has attributes company, field and frequency, and modifier unit; the related semantic types FieldName, FreqRepresentation and Stock carry format modifiers over basic objects.)

We define the domain model for the COIN system with the help of Figure 6. We use the example of the semantic type Dividend to demonstrate how to define an object and create the relations for that object. The object Dividend has three attributes, company, field and frequency, and one modifier, unit. We first define the Dividend semantic type. The construct to define a semantic object is:

semanticType(Dividend)

After we have defined the object, we then define the attributes and modifiers:

attributes(Dividend, [company, field, frequency])
modifiers(Dividend, [unit])

Definitions for other objects in the domain model are listed in Appendix C.

6.2.2 Defining the Context

We have to identify all the semantic differences among the data sources before we can create their contexts in COIN. The following is a context table for the dividend field of the six sources:

Data Source      Field Name          Frequency         Unit                    Stock Symbol
Bloomberg        "Dividend"          Quarterly         Number                  String
DBC              "Dividend Amount"   Quarterly         Number + "Quarterly"    String
Market Guide     "Annual Dividend"   Annual            Number                  String
Quicken          "Div/Shr"           Quarterly         Number                  String
RapidResearch    "Dividend"          Last Fiscal Year  Number                  String
Yahoo! Finance   "Div/Shr"           Annual            Number                  String

Table 9: Context Table for Dividend Domain Model.

In Table 9, each row corresponds to a data source for which we have to specify a context, and lists all the characteristics of that data source. We also need to give a unique name to each context that we are going to create for each of the data sources. For example, if we are defining the context of Market Guide, we will use cmg as the name of the context for Market Guide.
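Before the contexts are encoded, the context table can be sketched as a plain data structure. This is only an illustration of Table 9: the context names other than cmg are hypothetical, and the actual COIN system stores contexts as Prolog clauses rather than Python dictionaries:

```python
# Illustrative sketch of Table 9. Only "cmg" (Market Guide) is a context
# name used in the text; the other names are invented for this example.

contexts = {
    "cbloomberg": {"field_name": "Dividend",        "frequency": "quarterly"},
    "cdbc":       {"field_name": "Dividend Amount", "frequency": "quarterly"},
    "cmg":        {"field_name": "Annual Dividend", "frequency": "annual"},
    "cquicken":   {"field_name": "Div/Shr",         "frequency": "quarterly"},
    "crapid":     {"field_name": "Dividend",        "frequency": "last fiscal year"},
    "cyahoo":     {"field_name": "Div/Shr",         "frequency": "annual"},
}

def matching_sources(user_context):
    """Return the contexts whose frequency modifier agrees with the user's,
    i.e. the sources that need no conversion for this query."""
    return [name for name, ctx in contexts.items()
            if ctx["frequency"] == user_context["frequency"]]

print(matching_sources({"frequency": "annual"}))  # ['cmg', 'cyahoo']
```

A user in an annual-dividend context can be served directly by Market Guide or Yahoo! Finance; any other source would require a frequency conversion of the kind described in section 6.2.3.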
Once we have specified and named the contexts, we will create the actual contexts. Referring to the context table, we look at the row for the Market Guide data source. Each column of the table refers to a semantic difference present in the source that needs to be resolved. The column Field Name refers to the string identifying the key to the dividend value in a source. For example, in order to successfully query against Market Guide for the dividend value, we have to specify the field as "Annual Dividend". If we just type the field as "Dividend", the query will likely return no result. On the other hand, when we are querying against Yahoo! Finance for the dividend, we have to specify the field as "Div/Shr". The column Frequency refers to the frequency of the reported dividend value in a source. Market Guide reports the annual dividend whereas DBC reports the quarterly dividend. The column Unit refers to the format of the dividend value in a source. For example, in DBC the dividend is reported as a number followed by the string "Quarterly". In all the other sources, the dividend is just reported as a number. The last column Stock Symbol refers to the key used by the data source to identify the companies. There is no discrepancy among the sources in this aspect. All the sources use the stock symbol to identify the companies.

After we have recognized all the potential semantic conflicts, we can create the contexts. We will use Market Guide as an example to show the context creation process. The context for Market Guide in Prolog is as follows:

modifier(dividend, O, unit, cmg, M) :- cste(basic, M, cmg, "Number").
modifier(fieldName, O, fieldFormat, cmg, M) :- cste(basic, M, cmg, "Annual Dividend").
modifier(freqRepresentation, O, freqFormat, cmg, M) :- cste(basic, M, cmg, "Annual").
modifier(stockSymbol, O, format, cmg, M) :- cste(basic, M, cmg, "String").

Each statement refers to a potential conflict that needs to be resolved by the COIN system.
In other words, each statement corresponds to a modifier relation in the actual domain model. From the domain model in Figure 6, we notice that the object FieldName has a modifier fieldFormat. The second statement in the above example corresponds to this modifier. Referring to Table 9, we notice that the value of fieldFormat in the Market Guide context is "Annual Dividend". The statement represents this fact. It states that the modifier fieldFormat for the object O of type fieldName in the context cmg is the object M, where the object M is a constant (cste) of type basic and has a value of "Annual Dividend" in the context cmg. The type basic refers to all the scalar types: numbers, reals and strings. Context definitions for other objects are listed in Appendix D.

6.2.3 Defining the Conversion Function

We need to specify conversion functions for the contexts after we have defined them. The conversion functions are used during context mediation when the system needs to convert objects between different contexts. We have to provide conversion functions for all the modifiers of all the objects defined in the domain. For the modifier fieldFormat, the conversion between different contexts is straightforward text manipulation. For example, if we have to convert from the DBC context to the Market Guide context, we will change the FieldName from "Dividend Amount" to "Annual Dividend". Similarly, for the modifier unit, the conversion is also text manipulation. We assume the user would expect the query result of dividend to be a number. For example, if we have to convert from the DBC context to the user's context, we will strip the string "Quarterly" from the query result and just return the number to the user. Finally, for the modifier freqFormat, the conversion is a bit trickier. Converting between quarterly and annual dividends requires an arithmetic operation. For example, if we have to convert from the Quicken context to the Yahoo!
Finance context, we will multiply the quarterly dividend by four to get the annual dividend.

6.3 Data Timeliness

In section 5.2.3, John notices there are differences in the reported EPS values for the same company. He realizes the conflict is a timeliness issue. Some sources report trailing 12 months' earnings while others report earnings in the last fiscal year. We will show that context mediation can help resolve the timeliness issue in this scenario by reconciling the EPS in different time frames. In our following discussion, we will focus on three sources: Market Guide, Yahoo! Finance and RapidResearch. Market Guide reports EPS in the last fiscal year. Yahoo! Finance reports EPS on a trailing 12 months' basis, i.e. the total EPS of the most recent 4 quarters. RapidResearch reports quarterly EPS. We intend to use RapidResearch's quarterly EPS data to resolve the difference between fiscal EPS and trailing 12 months' EPS. The process in this example is similar to that in the data consistency case. We have to define a domain model, a context table and the conversion functions.

6.3.1 Defining the Domain Model

Figure 7: EPS Domain Model.

Figure 7 presents the EPS domain model. The definitions for the attributes and modifiers are similar to those in section 6.2.1. A complete list of definitions for the objects in the domain model is in Appendix E.

6.3.2 Defining the Context

We identify all the semantic differences among the three data sources that we are interested in. The following is a context table for the EPS field of the sources:

Data Source      Field Name           Date Format   Time Format   Unit     Stock Symbol
Market Guide     "Earnings (TTM) $"   MM/DD/YY      Fiscal        Number   String
Yahoo!           "Earn/Shr"           Month/Day     Trailing      Number   String
RapidResearch    "EPS"                YYMM          Quarterly     Number   String

Table 10: Context Table for EPS Domain Model.

In Table 10, the column Field Name refers to the string identifying the key to the EPS value in a source.
For example, in order to obtain the EPS value from Market Guide, we have to specify the field as "Earnings (TTM) $". The column Date Format refers to the way dates are represented in the data source. For example, in Market Guide, December 31, 1998 is represented as 12/31/98; in Yahoo!, it is represented as Dec/31; and in RapidResearch, it is represented as 9812. The column Time Format refers to the time frame of the reported EPS value in a source. For example, Market Guide reports the EPS in the last fiscal year while Yahoo! reports the trailing 12 months' EPS. The column Unit refers to the numerical format of the EPS values. There is no discrepancy in this field. The last column Stock Symbol refers to the key used by the data source to identify the companies. There is no discrepancy among the sources in this field either. The context creation process is similar to that in section 6.2.2. A list of context definitions for the EPS model is in Appendix F.

6.3.3 Defining the Conversion Function

The conversion functions in this scenario are more complicated than those in section 6.2.3. For the modifiers fieldFormat and dateFormat, the conversions between different contexts are straightforward text manipulations. However, the conversion from fiscal EPS to trailing 12 months' EPS requires quarterly EPS data. We illustrate the conversion process with an example of the EPS of Pfizer. John would like to obtain the trailing 12 months' EPS of Pfizer. He sends his query to the COIN system. Under normal circumstances, the COIN system would match John's context with the contexts of the sources and direct the query to Quicken, which reports trailing 12 months' EPS. Unfortunately, the Quicken server is temporarily down and the COIN system cannot reach it. The Market Guide server is available, but it reports the EPS of the last fiscal year. The COIN system has to do some conversion before passing the query result to John. The system knows that RapidResearch has quarterly EPS data.
The RapidResearch data is as follows:

Quarter    9903    9812    9809    9806    9803
EPS        0.62    0.49    1.06    0.47    0.53

Table 11: Quarterly EPS of Pfizer reported by RapidResearch.

The EPS of fiscal year 1998 reported by Market Guide is 2.55. The conversion of the fiscal 1998 EPS to the trailing 12 months' EPS is as follows:

Trailing 12 months' EPS = Fiscal 1998 EPS + Mar 99 (Q) EPS - Mar 98 (Q) EPS
                        = 2.55 + 0.62 - 0.53
                        = 2.64

After the conversion, the COIN system returns the result of the trailing 12 months' EPS to John. During the whole mediation process, John is not aware that Quicken is down and that the COIN system has selected to use Market Guide and RapidResearch instead. The query result is the same as if the query had been sent directly to Quicken. The COIN system resolves the timeliness issue in this scenario.

6.4 Data Accuracy

6.4.1 Resolving Data Discrepancies

In section 5.2.6, we present a scenario in which Jessica, a financial analyst, notices the data discrepancies in the EPS of a French pharmaceutical company reported by various sources. There are no apparent semantic conflicts among the data. She believes the data discrepancies are the result of data inaccuracy. Wang and Reddy [1992] propose to resolve accuracy issues by asking for the user's judgement, which is likely to be based upon past experiences. Under this approach, Jessica would have to select a source that she believes is accurate. There are two potential problems with this approach. First, if Jessica has no prior experience with any of the sources, then she will have a hard time selecting a source. The source selection process will be no better than a randomized algorithm. Second, even if Jessica perceives that one source is better than another, there is no guarantee that the data reported by the selected source is accurate this time. We propose to resolve accuracy issues by using other high quality data to derive the correct values for inaccurate data. We can extend the COIN system with the appropriate ontology.
An ontology is a specification of a conceptualization. In other words, an ontology is a description of the concepts and relationships that can exist for an agent or a community of agents [Gruber 1993].

6.4.2 Derived Data

Figure 8 outlines the COIN approach in handling data accuracy issues: the accurate Net Income and Shares Outstanding figures from the annual report are combined through context mediation to replace the inaccurate EPS values from the Internet sources with an accurate EPS.

Figure 8: COIN approach in handling data accuracy issues.

In the above example, the EPS values reported by the various Internet sources have lots of discrepancies. Instead of asking the user explicitly to select a source, the COIN system implicitly handles the conflicts by deriving an accurate EPS value from the supposedly accurate Net Income and Shares Outstanding data in the company's online annual report. The mapping relation is as follows:

EPS = Net Income ÷ Shares Outstanding

One can argue that the COIN approach does not fully solve the data accuracy problem. How do we know that the data in the annual report is accurate? Or how can we be sure that the contents of the annual report are accurately presented online? These questions can go on infinitely. The accuracy issue can be traced from network reliability all the way back to the professionalism of the company's auditor. Again, we want to point out that data accuracy is a relative issue. It depends very much on the contexts of the data provider and the data consumer. In the scenario of Jessica, she considers the EPS values provided by the Internet sources inaccurate but the value derived from the annual report accurate enough for her analysis. And this may only be true for large, reputable companies. If Jessica were analyzing a company that is notorious for earnings manipulation, then she would not consider the data from the annual report to be accurate. The COIN approach towards data accuracy issues can be viewed as a best-effort approach.
The system will try to derive the data that is accurate in the user's context.

6.4.3 Ontology View

Jessica is quite satisfied with the COIN system and its ability to resolve data accuracy problems. Upon a closer inspection of the annual report of Rhone-Poulenc, she notices that the numbers are reported in French Francs. Since all her clients are US-based investors, she would like to have the EPS number reported in US Dollars. Also, due to accounting differences between US and European companies, she would like to obtain the EPS Before Goodwill Amortization. We propose to create an ontology view within the COIN system that encompasses all the relations. The ontology mapping is as follows:

f(Net Income Before Goodwill Amortization in FF, Shares Outstanding, USD/FF Exchange Rate) = EPS Before Goodwill Amortization in USD

Figure 9: Ontology mapping for the Rhone-Poulenc scenario.

In Figure 9, the ontology is a result of the mappings of the surrounding objects. Each arrow represents a relation between two or more objects. The objects need not be data objects from the same source. In the scenario of Rhone-Poulenc, we can obtain the Net Income, Goodwill Amortization and Shares Outstanding data from its online annual report. The problem is that all these figures are reported in French Francs. In order to convert the financial figures from French Francs to US Dollars, we need the currency exchange rate. We can obtain the historical USD/FF exchange rate from the Oanada currency database on the Internet. After we have identified the sources, we need to set up the mapping functions.
Two of the most important mapping functions in this scenario are as follows:

f(Net Income in FF, Goodwill Amortization in FF) = Net Income in FF + Goodwill Amortization in FF

f(Net Income Before Goodwill Amortization in FF, Shares Outstanding, USD/FF Exchange Rate) = Net Income Before Goodwill Amortization in FF ÷ Shares Outstanding × USD/FF Exchange Rate

The first function derives the Net Income Before Goodwill Amortization from the Net Income and the Goodwill Amortization. The second function derives the EPS Before Goodwill Amortization from the Net Income Before Goodwill Amortization in FF, the Shares Outstanding and the USD/FF Exchange Rate. The data from Rhone-Poulenc's Annual Report is as follows:

(in Millions of French Francs)            1998
Net Income                                4,224
Goodwill Amortization                     1,400
Average Number of Shares Outstanding      367,752,291

Table 12: Consolidated Financials of Rhone-Poulenc in fiscal year 1998 (Source: Rhone-Poulenc 1998 Annual Report).

The USD/FF exchange rate data from the Oanada currency database is as follows:

French Franc (FF)    1.0000
US Dollar (USD)      0.1779

Table 13: USD/FF exchange rate as of 12/31/1998 (Source: Oanada Currency Database).

Upon receiving the query from Jessica, the COIN system performs the following conversions:

Net Income Before Goodwill Amortization in FF = 4,224 + 1,400 = 5,624 million

EPS Before Goodwill Amortization in USD = 5,624 million ÷ 367,752,291 × 0.1779 = 2.72

Another possible conversion required is the company name. Jessica may just use the stock symbol of Rhone-Poulenc, RP, in her query. However, the web-site of Rhone-Poulenc may not recognize the stock symbol RP. The full name of the company may be needed. The COIN system has to convert the stock symbol to the company name. The COIN system then returns the mediated result of the EPS Before Goodwill Amortization in USD to Jessica. During the whole process, Jessica is not aware of the source selection and the currency conversion operations.
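The two mapping functions and the conversions above can be sketched as follows. The function names are illustrative, not part of the COIN implementation; the figures come from Tables 12 and 13:

```python
# Sketch of the ontology-view mapping functions for the Rhone-Poulenc
# scenario. Figures: Table 12 (annual report) and Table 13 (exchange rate).

def net_income_before_goodwill(net_income_ff, goodwill_amortization_ff):
    """First mapping: add goodwill amortization back to net income."""
    return net_income_ff + goodwill_amortization_ff

def eps_before_goodwill_usd(ni_before_goodwill_mm_ff, shares, ff_to_usd):
    """Second mapping: divide by shares outstanding, then convert FF to USD."""
    return ni_before_goodwill_mm_ff * 1_000_000 / shares * ff_to_usd

ni_bg = net_income_before_goodwill(4224, 1400)  # 5,624 million FF
eps = eps_before_goodwill_usd(ni_bg, 367_752_291, 0.1779)
print(round(eps, 2))  # 2.72
```

Chaining the two functions reproduces the mediated result of $2.72 per share that the COIN system returns to Jessica.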
6.5 Data Completeness

In section 5.2.5, we point out that data completeness is a relative concept. A source that is complete to one user may be incomplete to another. In this scenario, John, from the context of an individual investor, only wants to get the total sales of Johnson & Johnson in the last fiscal year. He considers the Internet financial sources to be complete. On the other hand, Jessica, from the context of a financial analyst, needs the geographical breakdown of sales of Johnson & Johnson in the last fiscal year. She cannot obtain this information from any of the Internet financial sources, and she considers these sources to be incomplete. We propose to solve the data completeness issue by bridging the gap between sources with an ontology mapping. The ontology mapping for this scenario is similar to that in section 6.4.3.

Figure 10: Ontology mapping for the Johnson & Johnson scenario. (The Geographical Sales objects, including US Sales, Western Hemisphere excluding US Sales, and Asia-Pacific, Africa Sales, are mapped one-way into the Total Sales object.)

In Figure 10, the Geographical Sales objects are mapped to the Total Sales object by a one-way mapping. This means the COIN system can derive the Total Sales given the Geographical Sales, but not the other way round. When Jessica queries for the Geographical Sales, the COIN system will answer the query with the corresponding Geographical Sales object. When John queries for the Total Sales, the COIN system can either obtain the total sales from a financial web-site or perform the conversion on the Geographical Sales data with the ontology. With this ontology, the COIN system looks complete to both John and Jessica. However, the COIN approach does not completely solve the data completeness problem. What if there is a third user querying for the sales of Johnson & Johnson in Massachusetts? Or for the sales of Johnson & Johnson in Boston?
These queries can get finer and finer, infinitely. The COIN system cannot answer all these queries with the given ontology. Similar to the data accuracy issue, the COIN approach towards data completeness issues can be viewed as a best-effort approach.

6.6 Potential Problems and Possible Solutions

6.6.1 Identifying the Sources

A potential problem for the COIN derived data approach is to correctly identify the sources that can provide accurate data for creating the ontology view. For example, how does the COIN system know the URL for the online annual report of Rhone-Poulenc? Or how does the COIN system know of the existence of the online annual report? We propose three solutions to this problem. The first solution is to query against an Internet search engine with the company name and try to get the URL of the company's web-site. The second solution is to identify sources that have integrated information on a set of companies. The third solution is to maintain a mapping between company names and the URLs of their web-sites internally within the COIN system.

The first solution of querying against a search engine requires a very complicated algorithm to identify the correct source. Most search engines will perform a keyword search and return a list of potential matches. The COIN system then has to identify the company web-site. The story does not end here. The financial records are probably hidden somewhere within the company web-site. And the structures are different for the web-sites of different companies. The COIN system has to handle lots of context issues in this solution.

The second solution involves identifying sources that have integrated information on a set of companies. In the example of financial reports, we may be able to download company financials from Edgar Online. Edgar Online has quite a complete set of SEC filings for US companies.
However, due to different reporting requirements for non-US companies, Edgar Online does not provide much information on them. Take Rhone-Poulenc as an example. The only SEC filing for Rhone-Poulenc on Edgar Online is a list of "Amended Ownership Statements". We cannot locate its annual or quarterly financial reports.

The third solution involves maintaining a mapping within the COIN system. For example, the mappings for the four pharmaceutical companies mentioned in the previous scenarios are as follows:

Company Name        Stock Symbol    URL for Financial Information in Fiscal Year 1998
Johnson & Johnson   JNJ             http://www.jnj.com/news-finance/98_AnnualReport/31.html
Merck               MRK             NULL
Pfizer              PFE             http://www.pfizer.com/pfizerinc/investing/annual/1998/financials/financials.html
Rhone-Poulenc       RP              http://www.rhone-poulenc.com/framu/re984002.htm

Table 14: Mapping of company names to the URLs of their online financial information.

We can notice that the structure of each company's web-site is very different. Some companies, in this case Merck, do not provide financials online at all. If the COIN system has to incorporate this mapping, the scale of the system will increase linearly with the number of companies. Also, it is insufficient to have the URL alone. The representation of the financial information within each web-site is different. The COIN system has to know the data provider's context in order to correctly wrap the web-sites. Thus this solution will limit the scalability of the system.

6.6.2 Defining the Ontology View

The problem with defining the ontology view is who should be responsible for defining the mappings. We can let the users define their own views or let the ontology provider define the view. In the first solution, the users set their own views explicitly by stating what mappings to use.
For example, in the scenario in section 6.5, Jessica can set her own view by querying for the percentage of US Sales to Total Sales of Johnson & Johnson in fiscal year 1998. The mapping will be

Percentage of US Sales to Total Sales = (US Sales / Total Sales) x 100%

There will be a view maintenance layer between the query interface and the COIN engine, and the SQL-to-Datalog module needs to be modified as well. The following diagram presents the extension of a view maintenance layer in the COIN system:

Figure 11: View Maintenance layer in the COIN model. Queries flow from the user interface through the View Maintenance layer to the COIN Engine, and results flow back along the same path.

In the second solution, the mappings are defined by the ontology provider, and users can only pose queries with a limited number of mappings. For example, in the scenario in section 6.5, the relations will be limited to the one-way mapping of total sales.

6.7 Further Research

6.7.1 Source Mapping for Derived Data

We point out in section 6.6.1 that it is difficult for the COIN system to correctly identify sources in the derived data approach. We propose two solutions, but both involve many context issues and greatly hinder the scalability of the system. Different companies have web-sites with different structures. We believe that further research can be done on the web wrapper module so as to wrap the web-sites more effectively. Research can also be done on how to map company names to their web-sites efficiently. The proposed solution involves a one-to-one mapping for each company, so the scale of the system increases linearly with the number of companies. We can consider querying web-sites that integrate the annual reports of different companies, such as Edgar Online. However, the filings on Edgar Online may not be complete, especially for non-US companies. For example, we try to search the Edgar Online database for SEC filings of Rhone-Poulenc. To our disappointment, we only find an "Amended Ownership Statement" filed on February 10, 1999.
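The user-defined view mapping from section 6.6.2 above (percentage of US Sales to Total Sales) is a simple piece of derived data. A minimal sketch, with a function name of our own choosing:

```python
def percentage_of_us_sales(us_sales, total_sales):
    """Derived-data mapping from section 6.6.2:
    Percentage of US Sales to Total Sales = (US Sales / Total Sales) x 100%.
    Illustrative only; field names are ours."""
    if total_sales == 0:
        raise ValueError("total sales must be non-zero")
    return us_sales / total_sales * 100.0
```

A view maintenance layer would evaluate mappings like this one on top of the values that the COIN engine retrieves from the underlying sources.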
There is no filing of a quarterly or annual report.

6.7.2 Other Source Selection Criteria

Wang and Strong [1996] have identified fifteen data quality criteria. In this thesis, we discuss the four criteria that are most relevant to context mediation: accuracy, completeness, consistency and timeliness. Further research can be done on the other quality criteria. We believe that the accessibility criterion will be interesting to the COIN system. Accessibility of a source can depend on many factors: network traffic, server speed, server load, etc. In the scenario of data consistency in section 6.2, the user is trying to obtain the quarterly dividend of a company. The current COIN system does not have the functionality to select a particular source from a set of sources that are in the same context as that of the user, i.e. reporting quarterly dividends. How should the COIN system decide which source to query? Accessibility can play a part in the solution. The COIN system may want to query the fastest server, but how does it know which server is faster? What about network traffic conditions? Many interesting issues will arise in this area.

7 Conclusion

In this thesis, we have discussed the meaning of data quality and have shown that it is a relative concept. We have identified four data quality criteria that are most relevant to context mediation: accuracy, completeness, consistency and timeliness. Through the scenarios of two users who are in different contexts, we show how data discrepancies arise from each of the four data quality issues mentioned above. We then introduce a solution based on context mediation and show how it can help resolve the issues in each of the four data quality criteria. For semantic conflicts such as consistency and timeliness issues, the context mediation process is fairly straightforward. For data discrepancies such as accuracy and completeness issues, we propose to use derived data in context mediation.
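One way the COIN system might act on the accessibility criterion is to probe each candidate server and pick the one that responds fastest. The sketch below is purely illustrative: the function name is ours, and we assume the response times have already been measured by some probing step.

```python
def fastest_source(latencies):
    """Given measured response times (in seconds) for each candidate
    source URL, return the fastest reachable one, or None if all are
    down.  A latency of None marks an unreachable server."""
    reachable = {url: t for url, t in latencies.items() if t is not None}
    if not reachable:
        return None
    # min over the dictionary keys, ordered by their measured latency
    return min(reachable, key=reachable.get)
```

In practice the latencies would have to come from live probes and would fluctuate with network traffic and server load, which is exactly the open research question raised above.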
The COIN approach can only partially solve the problems of data accuracy and completeness because these problems are relative to the user's context. In conclusion, we believe that context mediation is a novel approach to handling data quality problems. Most data quality problems depend very much on the users' contexts. The COIN framework, with a set of well-defined contexts for the data providers, can query the source that best fits the user's context. In this thesis, we present scenarios in the financial domain; however, we believe that the COIN model can be extended to handle data quality issues in other domains as well.

8 References

Arens, Y. and Knoblock, C. (1993). SIMS: Retrieving and Integrating Information From Multiple Sources. Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pp. 562-563.

Ballou, D. P. and Pazer, H. L. (1985). Modeling Data and Process Quality in Multi-input, Multi-output Information Systems. Management Science 31, 2 (1985), pp. 150-162.

Bressan, S. and Bonnet, P. (1997). Extraction and Integration of Data from Semi-structured Documents into Business Applications. Conference on Industrial Applications of Prolog (1997).

Bressan, S., Fynn, K., Goh, C., Madnick, S., Pena, T., and Siegel, M. (1997). Overview of a Prolog Implementation of the Context Interchange Mediator. Proceedings of the Fifth International Conference and Exhibition on The Practical Applications of Prolog (Mar. 1997).

Bressan, S., Goh, C., Fynn, K., Jakobisiak, M., Hussein, K., Lee, T., Madnick, S., Pena, T., Qu, J., Shum, A., and Siegel, M. (1997). Context Interchange Mediator Prototype. ACM SIGMOD International Conference on Management of Data (1997).

Florescu, D., Koller, D., and Levy, A. (1997). Using Probabilistic Information in Data Integration. Proceedings of the 23rd VLDB Conference, Athens, Greece (1997).

Goh, C. (1997). Representing and Reasoning about Semantic Conflicts in Heterogeneous Information Systems.
MIT Sloan School of Management CISL Working Paper #97-01.

Gravano, L., Chang, C.-C. K., and Garcia-Molina, H. (1997). STARTS: Stanford Proposal for Internet Meta-searching. Proceedings of the ACM SIGMOD Conference (1997).

Gravano, L., Garcia-Molina, H., and Tomasic, A. (1994). The Effectiveness of GlOSS for the Text Database Discovery Problem. Proceedings of the ACM SIGMOD Conference (1994).

Gruber, T. (1993). A Translation Approach to Portable Ontologies. Knowledge Acquisition 5, 2 (1993), pp. 199-220.

Kakas, A. C., Kowalski, R. A., and Toni, F. (1993). Abductive Logic Programming. Journal of Logic and Computation 2, 6 (1993), pp. 719-770.

Kifer, M., Lausen, G., and Wu, J. (1995). Logical Foundations of Object-oriented and Frame-based Languages. JACM 42, 4 (1995), pp. 741-843.

Kriebel, C. H. (1979). Evaluating the Quality of Information Systems. In Design and Implementation of Computer Based Information Systems, N. Szysperski and E. Grochla, Eds. Sijthoff & Noordhoff, Germantown.

Liu, L. and Pu, C. (1997). A Metadata Based Approach to Improving Query Responsiveness. Proceedings of the 2nd IEEE Metadata Conference (1997).

McCarthy, J. (1987). Generality in Artificial Intelligence. Communications of the ACM 30, 12 (1987), pp. 1030-1035.

Naumann, F., Freytag, J. C., and Spiliopoulou, M. (1998). Quality-driven Source Selection Using Data Envelopment Analysis. Proceedings of the 1998 Conference on Information Quality (1998), pp. 137-152.

Shah, S. (1998). Design and Architecture of the Context Interchange System. MIT Sloan School of Management CISL Working Paper #98-05 (May 1998).

Strong, D. M., Lee, Y. W., and Wang, R. Y. (1997). Data Quality in Context. Communications of the ACM 40, 5 (May 1997), pp. 103-110.

Tayi, G. K. and Ballou, D. P. (1998). Examining Data Quality. Communications of the ACM 41, 2 (Feb. 1998), pp. 54-57.

Wand, Y. and Wang, R. Y. (1996). Anchoring Data Quality Dimensions in Ontological Foundations. Communications of the ACM 39, 11 (Nov. 1996), pp. 86-95.
Wang, R. Y. and Reddy, M. P. (1992). Quality Data Objects. MIT Sloan School of Management Working Paper #3517 (1993).

Wang, R. Y. and Strong, D. M. (1996). Beyond Accuracy: What Data Quality Means to Data Consumers. Journal of Management Information Systems 12, 4 (1996).

9 Appendices

Appendix A

A list of financial terms with explanations.

Share price: This is the Closing or Last Bid Price. It is also referred to as the Current Price. For NYSE, AMEX, and NASDAQ traded companies, the Price is the previous Friday's closing price. For companies traded on the National Quotation Bureau's "Pink Sheets" and OTC bulletin boards, it is the bid price obtained at the time the report is updated.

P/E ratio: This ratio is calculated by dividing the current Price by the sum of the Diluted Earnings Per Share from continuing operations BEFORE Extraordinary Items and Accounting Changes in the previous fiscal period.

EPS: This is the sum of the Diluted Earnings Per Share from continuing operations BEFORE Extraordinary Items and Accounting Changes in the previous fiscal period.

5-yr EPS growth: This is the estimated compound annual growth rate of Earnings Per Share Excluding Extraordinary Items and Discontinued Operations over the next 5 years.

Dividend rate: This value is the total of the expected dividend payments over the next twelve months. It is generally the most recent cash dividend paid or declared multiplied by the dividend payment frequency, plus any recurring extra dividends.

Dividend yield: This value is the current percentage dividend yield based on the present cash dividend rate. It is calculated as the Indicated Annual Dividend divided by the current Price, multiplied by 100.

Shares outstanding: This is the number of shares of common stock currently outstanding. It equals the number of shares issued minus the shares held in treasury. This field reflects all offerings and acquisitions for stock made after the end of the previous fiscal period.
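The two ratio definitions above can be made concrete with a short sketch; the function and parameter names are ours, chosen to match the wording of the definitions.

```python
def pe_ratio(price, diluted_eps):
    """P/E ratio: current Price divided by the Diluted EPS from
    continuing operations before extraordinary items and accounting
    changes in the previous fiscal period."""
    return price / diluted_eps

def dividend_yield(indicated_annual_dividend, price):
    """Dividend yield: Indicated Annual Dividend divided by the current
    Price, multiplied by 100."""
    return indicated_annual_dividend / price * 100.0
```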
Appendix B

A list of Internet sources that provide information on companies publicly traded in US stock exchanges.

Source Name: Bloomberg
URL: http://www.bloomberg.com
Query Input: Ticker symbol
Query Result: Summary of financial ratios. Charts and earnings estimates available.
Entity Domain: US
Usage Fee: Free

Source Name: Data Broadcasting Corporation (DBC)
URL: http://www.dbc.com
Query Input: Ticker symbol
Query Result: Summary of financial ratios. Charts available.
Entity Domain: US
Usage Fee: Free

Source Name: Market Guide
URL: http://www.marketguide.com
Query Input: Ticker symbol
Query Result: Summary of financial ratios. Charts and earnings estimates available.
Entity Domain: US
Usage Fee: Free

Source Name: Quicken
URL: http://www.quicken.com
Query Input: Ticker symbol
Query Result: Summary of financial ratios. Charts and analyst recommendations available.
Entity Domain: US
Usage Fee: Free

Source Name: RapidResearch
URL: http://www.rapidresearch.com
Query Input: Ticker symbol
Query Result: Summary of financial ratios. Charts available.
Entity Domain: US
Usage Fee: Free

Source Name: Yahoo! Finance
URL: http://quote.yahoo.com
Query Input: Ticker symbol
Query Result: Summary of financial ratios. Charts, company profile and earnings estimates available.
Entity Domain: US
Usage Fee: Free

Source Name: Edgar Online
URL: http://www.edgar-online.com
Query Input: Ticker symbol or company name
Query Result: SEC filings
Entity Domain: US
Usage Fee: Free for HTML viewing. Subscription required for downloading PDF format documents.

Appendix C

List of definitions for objects in the Dividend Domain Model.
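Appendix B is effectively a small source registry. A hypothetical code rendering is sketched below; the data structure, field names, and filter function are ours, and only a few of the listed sources are included for brevity.

```python
# Hypothetical registry mirroring part of Appendix B: each source is
# described by its URL, accepted query input, and usage fee.
SOURCES = [
    {"name": "Bloomberg", "url": "http://www.bloomberg.com",
     "input": "ticker", "fee": "free"},
    {"name": "Yahoo! Finance", "url": "http://quote.yahoo.com",
     "input": "ticker", "fee": "free"},
    {"name": "Edgar Online", "url": "http://www.edgar-online.com",
     "input": "ticker or company name", "fee": "free HTML / paid PDF"},
]

def sources_accepting(query_input):
    """Return the names of sources whose accepted input includes the
    given kind of query (e.g. 'ticker' or 'company name')."""
    return [s["name"] for s in SOURCES if query_input in s["input"]]
```

A registry like this is the natural place for a source selection component to start: filter by accepted input and fee, then rank the remainder by context fit and accessibility.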
semanticType(Dividend)
semanticType(FieldName)
semanticType(FrequencyRepresentation)
semanticType(StockSymbol)
semanticType(basic)

modifiers(Dividend, [unit])
modifiers(FieldName, [fieldFormat])
modifiers(FrequencyRepresentation, [freqFormat])
modifiers(StockSymbol, [format])
modifiers(basic, [])

attributes(Dividend, [company, field, frequency])
attributes(FieldName, [])
attributes(FrequencyRepresentation, [])
attributes(StockSymbol, [])
attributes(basic, [])

Appendix D

List of context definitions for objects in the Dividend Domain Model.

Bloomberg context:
modifier(dividend, O, unit, cbb, M) :- cste(basic, M, cbb, "Number")
modifier(fieldName, O, fieldFormat, cbb, M) :- cste(basic, M, cbb, "Dividend")
modifier(freqRepresentation, O, freqFormat, cbb, M) :- cste(basic, M, cbb, "Quarterly")
modifier(stockSymbol, O, format, cbb, M) :- cste(basic, M, cbb, "String")

DBC context:
modifier(dividend, O, unit, cdb, M) :- cste(basic, M, cdb, "Quarterly Number")
modifier(fieldName, O, fieldFormat, cdb, M) :- cste(basic, M, cdb, "Dividend Amount")
modifier(freqRepresentation, O, freqFormat, cdb, M) :- cste(basic, M, cdb, "Annual")
modifier(stockSymbol, O, format, cdb, M) :- cste(basic, M, cdb, "String")

Market Guide context:
modifier(dividend, O, unit, cmg, M) :- cste(basic, M, cmg, "Number")
modifier(fieldName, O, fieldFormat, cmg, M) :- cste(basic, M, cmg, "Annual Dividend")
modifier(freqRepresentation, O, freqFormat, cmg, M) :- cste(basic, M, cmg, "Annual")
modifier(stockSymbol, O, format, cmg, M) :- cste(basic, M, cmg, "String")

Quicken context:
modifier(dividend, O, unit, cqk, M) :- cste(basic, M, cqk, "Number")
modifier(fieldName, O, fieldFormat, cqk, M) :- cste(basic, M, cqk, "Div/Shr")
modifier(freqRepresentation, O, freqFormat, cqk, M) :- cste(basic, M, cqk, "Quarterly")
modifier(stockSymbol, O, format, cqk, M) :- cste(basic, M, cqk, "String")

RapidSearch context:
modifier(dividend, O, unit, crs, M) :- cste(basic, M, crs, "Number")
modifier(fieldName, O, fieldFormat, crs, M) :-
cste(basic, M, crs, "Dividend")
modifier(freqRepresentation, O, freqFormat, crs, M) :- cste(basic, M, crs, "Last Fiscal Year")
modifier(stockSymbol, O, format, crs, M) :- cste(basic, M, crs, "String")

Yahoo! Finance context:
modifier(dividend, O, unit, cyh, M) :- cste(basic, M, cyh, "Number")
modifier(fieldName, O, fieldFormat, cyh, M) :- cste(basic, M, cyh, "Div/Shr")
modifier(freqRepresentation, O, freqFormat, cyh, M) :- cste(basic, M, cyh, "Annual")
modifier(stockSymbol, O, format, cyh, M) :- cste(basic, M, cyh, "String")

Appendix E

List of definitions for objects in the EPS Domain Model.

semanticType(EPS)
semanticType(FieldName)
semanticType(Date)
semanticType(Time)
semanticType(StockSymbol)
semanticType(basic)

modifiers(EPS, [unit])
modifiers(FieldName, [fieldFormat])
modifiers(Date, [dateFormat])
modifiers(Time, [timeFormat])
modifiers(StockSymbol, [format])
modifiers(basic, [])

attributes(EPS, [company, field, date, time])
attributes(FieldName, [])
attributes(Date, [])
attributes(Time, [])
attributes(StockSymbol, [])
attributes(basic, [])

Appendix F

List of context definitions for objects in the EPS Domain Model.

Market Guide context:
modifier(EPS, O, unit, cmg, M) :- cste(basic, M, cmg, "Number")
modifier(fieldName, O, fieldFormat, cmg, M) :- cste(basic, M, cmg, "Earnings (TTM)$")
modifier(date, O, dateFormat, cmg, M) :- cste(basic, M, cmg, "MM/DD/YY")
modifier(time, O, timeFormat, cmg, M) :- cste(basic, M, cmg, "Fiscal")
modifier(stockSymbol, O, format, cmg, M) :- cste(basic, M, cmg, "String")

Yahoo!
Finance context:
modifier(EPS, O, unit, cyh, M) :- cste(basic, M, cyh, "Number")
modifier(fieldName, O, fieldFormat, cyh, M) :- cste(basic, M, cyh, "Earn/Shr")
modifier(date, O, dateFormat, cyh, M) :- cste(basic, M, cyh, "Month/Day")
modifier(time, O, timeFormat, cyh, M) :- cste(basic, M, cyh, "Trailing")
modifier(stockSymbol, O, format, cyh, M) :- cste(basic, M, cyh, "String")

RapidSearch context:
modifier(EPS, O, unit, crs, M) :- cste(basic, M, crs, "Number")
modifier(fieldName, O, fieldFormat, crs, M) :- cste(basic, M, crs, "EPS")
modifier(date, O, dateFormat, crs, M) :- cste(basic, M, crs, "YYMM")
modifier(time, O, timeFormat, crs, M) :- cste(basic, M, crs, "Quarterly")
modifier(stockSymbol, O, format, crs, M) :- cste(basic, M, crs, "String")
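The freqFormat modifiers in the dividend contexts above are what allow a mediator to convert a dividend figure between contexts, e.g. from a source that reports "Quarterly" to a user context that expects "Annual". A minimal sketch of such a conversion function follows; the function name and the assumption of four regular quarterly payments per year are ours, not part of the COIN prototype.

```python
# Hypothetical frequency-conversion rule: assumes regular payments,
# with four quarters per year.
PERIODS_PER_YEAR = {"Quarterly": 4, "Annual": 1}

def convert_dividend(value, from_freq, to_freq):
    """Convert a dividend figure between frequency contexts by scaling
    through its annualized value."""
    annualized = value * PERIODS_PER_YEAR[from_freq]
    return annualized / PERIODS_PER_YEAR[to_freq]
```

In the COIN framework this kind of rule would be attached to the mediation step that reconciles a source's freqFormat with the user's, rather than hard-coded as above.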