AI Magazine Volume 26 Number 1 (2005) (© AAAI)

Automatically Utilizing Secondary Sources to Align Information Across Sources

Martin Michalowski, Snehal Thakkar, and Craig A. Knoblock

■ XML, web services, and the semantic web have opened the door for new and exciting information-integration applications. Information sources on the web are controlled by different organizations or people, utilize different text formats, and have varying inconsistencies. Therefore, any system that integrates information from different data sources must identify common entities from these sources. Data from many data sources on the web does not contain enough information to link the records accurately using state-of-the-art record-linkage systems. However, it is possible to exploit secondary data sources on the web to improve the record-linkage process. We present an approach to accurately and automatically match entities from various data sources by utilizing a state-of-the-art record-linkage system in conjunction with a data-integration system. The data-integration system is able to automatically determine which secondary sources need to be queried when linking records from various data sources. In turn, the record-linkage system is then able to utilize this additional information to improve the accuracy of the linkage between datasets.

In the recent past, researchers have developed various machine-learning techniques such as SoftMealy (Hsu and Dung 1998) and Stalker (Muslea, Minton, and Knoblock 2001) to easily extract structured data from various web sources. Using these techniques, users can build wrappers that allow them to easily query web sources much like databases. Web-based information-integration systems such as Information Manifold (Levy, Rajaraman, and Ordille 1996b), InfoMaster (Genesereth, Keller, and Duschka 1997), and Ariadne (Knoblock et al. 2001) can provide a uniform query interface for users to query information from various data sources. Furthermore, schema-matching techniques (Dhamankar et al. 2004; Madhavan and Halevy 2003; Miller, Haas, and Hernández 2000) allow users to align different schemas from various data sources. While schema-matching techniques are useful for aligning attributes from different schemas, they cannot be utilized to determine whether two records obtained from different sources refer to the same entity. Therefore, building an application that integrates data from various web sources requires a record-linkage component in addition to wrapper-generation, information-integration, and schema-matching components. For example, two restaurant web sites may refer to the same address using different textual information. Accurate record linkage is thus an essential component of any information-integration system used to integrate data accurately from various data sources.

Figure 1. Textual Differences in Restaurant Records.
Matched:
  Dinesite: Broadway Bar and Grill, 1460 3rdstreet Promenade, Santa Monica, CA 904012322, 310.393.4211
  Zagat's: Broadway Bar&Grill, 1460 Third St. Promenade, Santa Monica, CA, 90401-2322, (310) 393-4211
INCORRECT:
  Dinesite: Beverly Hills Café, 14 N. LaCienaga Boulvard, Beverly Hills, CA 90211-2205, 310.652.1529
  Zagat's: Beverly Hills Samurai, 270 S. La Cienaga, Beverly Hills, CA 90211, (310) 360-9688
There has been some work done on consolidating data objects from various web sites using textual similarities and transformations (Bilenko and Mooney 2003; Chaudhuri et al. 2003; Dhamankar et al. 2004; Doan et al. 2003; Tejada, Knoblock, and Minton 2002). These approaches provide better consolidation results than exact text-matching techniques in different application domains. However, in some application domains it may be extremely difficult to consolidate records. For example, when matching names of people, it would be hard for the aforementioned techniques to determine whether "Robert Smith" and "Bob Smith" refer to the same individual. This problem can often be solved by utilizing information from secondary sources on the web. For example, a web site that lists common abbreviations and nicknames for first names may provide the information that "Bob" and "Robert" are interchangeable as first names. There are many other application areas where information from secondary data sources can improve the performance of a record-linkage system. Additional examples include utilizing a geocoder to determine if two addresses are the same, utilizing historical area code changes to determine if two telephone numbers are the same, and utilizing the location and officers' information for different companies to determine if two companies are the same.

In this article, we describe our approach to exploiting secondary sources automatically for record linkage. The goal of the research is to link records more accurately across data sources. We have built a record-linkage system termed Apollo that can utilize information from secondary sources (Michalowski, Thakkar, and Knoblock 2003). In presenting our approach, we describe how Apollo can be combined with an information mediator to automatically identify and utilize information from secondary sources. We provide a motivating example, followed by some background work on record linkage. Subsequently, we present our approach to utilizing secondary sources during the record-linkage process and an evaluation of our approach using real-world restaurant and company datasets. Finally, we discuss related work and put forward our conclusions and planned future work.

Motivating Example

To clarify the concepts presented in this article, we define the following terms: (1) record linkage, (2) primary data sources, and (3) secondary data sources. Record linkage is the process of determining if two records should be linked across some common property. In this article, this common property will be whether the two records refer to the same real-world entity. A primary data source is one of the two initial data sources used for record linkage. A secondary data source is any source, other than a primary data source, that can provide additional information about entities in the primary data sources.

Consider the following primary data sources: (1) the Zagat and Dinesite data sources that provide information about various restaurants; (2) the Travelocity and Orbitz data sources that provide information about various hotels; and (3) the Yahoo and Moviefone data sources that provide information about various theaters. When the user sends a request to obtain information pertaining to restaurants within a given city, the record-linkage system needs to link records that refer to the same restaurant from the Zagat and Dinesite data sources. However, due to the textual inconsistencies present in both data sources, determining which records refer to a common entity is a nontrivial task. A similar situation arises when attempting to combine information about hotels from Travelocity and Orbitz or about movies from Yahoo and Moviefone. Figure 1 shows the varying textual inconsistencies found in the restaurant data sources.
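To make the difficulty concrete, the following small Python illustration (ours, not part of the Apollo system) compares the two records from the matched pair in figure 1 field by field. Exact textual comparison links almost nothing, even though both records describe the same restaurant, which is why learned transformations, and in harder cases secondary sources, are needed.

```python
# Minimal illustration (not Apollo code): exact field comparison on the
# "Matched" pair from figure 1. The field split follows the attributes the
# restaurant sources provide (name, address, city, state, zip, phone).
dinesite = {"name": "Broadway Bar and Grill", "street": "1460 3rdstreet Promenade",
            "city": "Santa Monica", "state": "CA", "zip": "904012322",
            "phone": "310.393.4211"}
zagat = {"name": "Broadway Bar&Grill", "street": "1460 Third St. Promenade",
         "city": "Santa Monica", "state": "CA", "zip": "90401-2322",
         "phone": "(310) 393-4211"}

def exact_match(a, b):
    """Link two records only if every field is textually identical."""
    return all(a[field] == b[field] for field in a)

print(exact_match(dinesite, zagat))  # False: only city and state agree exactly
```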
Background Work

In this section, we present the Active Atlas (Tejada, Knoblock, and Minton 2001; Tejada, Knoblock, and Minton 2002) record-linkage system. Its robust and extendable framework makes it an ideal candidate for a base system upon which to build a record-linkage system that utilizes secondary sources; for this reason, Active Atlas is used as the base system upon which Apollo is built.

System Overview

Active Atlas's architecture consists of two separate components: a candidate generator and a mapping learner. The overall architecture of the system can be seen in figure 2. Its goal is to find common entities among two record sets from the same domain. The candidate generator proposes a set of potential matches based on the transformations available to the system. A transformation may be one of a number of string comparison types, such as EQUALITY, SUBSTRING, PREFIX, SUFFIX, or STEMMING, and the transformations are weighted equally when computing similarity scores for potential matches.

Figure 2. Active Atlas System Architecture. (The candidate generator computes attribute similarity scores of the form Sn(an, bn) = Σt∈T (ant · bnt) / (||a|| · ||b||) for each proposed pair of records, and the mapping learner uses these scores to learn mapping rules and transformation weights that classify pairs as mapped or not mapped.)

Once the candidate generator has finished proposing potential matches, Active Atlas moves on to the second stage and uses the potential matches as the basis for learning mapping rules and transformation weights. The mapping learner establishes which of the potential matches are correct by adapting the mapping rules and transformation weights to the specific domain. Because the initial similarity scores are very inaccurate, the system uses an active-learning approach to refine and improve the transformation weights and mapping rules. This approach uses a decision tree (Quinlan 1996) committee model with three committee members. The key idea behind the committee learning approach is to divide the training data set into three parts and have each committee member learn a decision tree based on that member's respective part. During active learning, the mapping learner then selects the most informative potential example, defined as the potential match with the highest disagreement among the committee members. It then asks the user to label this example as either a match or a nonmatch. The user's response is used to refine and recalculate the transformation weights, learn new mapping rules, and reclassify record pairs. This process continues until (1) the committee learners converge and agree on one decision tree or (2) the user has been asked a predefined number of questions.
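The following sketch (our illustration, not the Active Atlas implementation) shows the essence of this committee-based selection step, assuming each candidate pair has already been turned into a vector of per-attribute similarity scores and that user labels are encoded as 0/1. Three decision trees are trained on different partitions of the labeled pairs, and the unlabeled candidate on which the trees disagree most is the next pair presented to the user.

```python
# Illustrative sketch of committee-based active learning over candidate pairs.
# Feature vectors are per-attribute similarity scores; labels are 0 (nonmatch)
# or 1 (match). Each partition is assumed to contain at least one example.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_committee(X_labeled, y_labeled, n_members=3, seed=0):
    """Train one decision tree per (roughly equal) partition of the labeled pairs."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X_labeled))
    members = []
    for part in np.array_split(order, n_members):
        tree = DecisionTreeClassifier(random_state=seed)
        tree.fit(X_labeled[part], y_labeled[part])
        members.append(tree)
    return members

def most_informative(committee, X_unlabeled):
    """Index of the candidate pair the committee members disagree on most."""
    votes = np.stack([m.predict(X_unlabeled) for m in committee])  # members x pairs
    positives = votes.sum(axis=0)
    disagreement = np.minimum(positives, len(committee) - positives)  # 0 = unanimous
    return int(np.argmax(disagreement))

# One active-learning round: ask the user to label the selected pair, add the answer
# to the labeled set, retrain the committee, and reclassify all candidate pairs.
```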
Once the mapping rules and transformation weights have been learned, Active Atlas uses them to classify all the potential matches in the system as matched or not matched. The results are then made available to the user.

Open Research Problem

A difficult problem encountered when performing record linkage is the degree of certainty with which matches are proposed and rejected. A record-linkage system is only as good as the labeled data it has received and is therefore limited in the accuracy of its classification of matches. A record-linkage system is able to classify obvious matches and nonmatches; however, in our research on record linkage, we have found that there exists a "gray" area in the classification of potential matches. Classifying the matches in this "gray" area with a high degree of confidence often requires additional knowledge or information, because primary sources often lack sufficient information to resolve all ambiguous matches. These ambiguous matches cannot be classified with full confidence as a match, yet they are similar enough to be considered as potentially matched. This presents the need for a secondary source to help resolve the ambiguity by providing the system with additional information that it can use in the classification of the match.

The following example helps illustrate the need for secondary sources. Consider the restaurant domain, where record linkage is performed on two different data sources, each composed of records referring to particular restaurants. The system returns all matches; however, one returned match is similar enough to be classified as matched by the system but also contains enough inconsistencies across attributes to raise doubt that it is a true match. Looking closer at the record manually, it is determined that the telephone number field is the major source of the inconsistency. With the availability of a secondary source that contains a telephone area code's history, it could be determined that the telephone numbers in question are in fact the same, since one area code is the successor of the other. This information could then be used by a system to confidently classify the record as a true match.
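As a concrete sketch of how such a secondary source could be used, the snippet below checks whether two telephone numbers that differ only in their area code can be reconciled through an area-code-history table. The table and the helper function are ours for illustration only; in Apollo the history information would come from an actual secondary source queried through the mediator.

```python
# Sketch only (not Apollo code): reconciling a phone-number mismatch through an
# area-code-history secondary source. The history table below is illustrative;
# in Apollo it would be retrieved from a real source through the mediator.
AREA_CODE_SUCCESSORS = {
    "213": {"323"},  # illustrative entry: area code 323 was split off from 213
}

def same_phone(phone_a, phone_b, history=AREA_CODE_SUCCESSORS):
    """Treat two 10-digit numbers as equal if they differ only by a
    predecessor/successor area code recorded in the history table."""
    area_a, local_a = phone_a[:3], phone_a[3:]
    area_b, local_b = phone_b[:3], phone_b[3:]
    if local_a != local_b:
        return False
    return (area_a == area_b
            or area_b in history.get(area_a, set())
            or area_a in history.get(area_b, set()))

print(same_phone("2135551234", "3235551234"))  # True under the illustrative table
```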
Exploiting Secondary Sources for Record Linkage

In this section, we describe our approach to performing more accurate record linkage by utilizing information from secondary sources. The intuition is to use the domain knowledge of a data-integration system to automatically obtain information about available secondary sources and then utilize those secondary sources to improve the record-linkage process. When Apollo needs to link records from two data sources, it first obtains information about available secondary sources. Next, it queries the secondary sources and utilizes the additional information in the record-linkage process.

Which Secondary Sources to Query?

As we described earlier, primary data sources often do not contain enough information to distinguish between two entities purely on the basis of textual similarities. Therefore, it may be valuable to obtain information from secondary data sources. Apollo utilizes the Prometheus mediator (Thakkar, Ambite, and Knoblock 2003) to determine which secondary data sources should be queried. In this section, we provide a brief overview of the Prometheus mediator and describe how it is utilized in Apollo.

Overview of Data-integration Systems. Various data-integration systems such as TSIMMIS (Garcia-Molina et al. 1995), Information Manifold (Levy, Rajaraman, and Ordille 1996a), InfoMaster (Genesereth, Keller, and Duschka 1997), InfoSleuth (Bayardo et al. 1997), and Ariadne (Knoblock et al. 2001) have been used to provide an integrated view of information from heterogeneous data sources. Traditionally, these systems model data sources in the form of relations. These systems also contain a set of virtual domain relations that the user utilizes to specify queries to the mediator system. Every data-integration system must specify a set of domain rules to relate the source relations to the domain relations. The user sends queries to the data-integration system using these domain relations; the data-integration system then reformulates the user query into a set of source queries, executes the source queries, and provides the answer to the user's query.

Utilization in Apollo. Apollo has access to a data-integration system (the Prometheus mediator) which models various data sources. These data sources include all primary data sources as well as all available secondary data sources. When Apollo receives a request to link records from two data sources, it sends a request to the Prometheus mediator to obtain related data sources. To further illustrate how additional information is obtained, consider the domain model shown in figure 3.

Figure 3. Example Domain Model.
Domain relations: Locations(name, streetaddress, city, state, zip, oldareacode, phone, lat, lon, areacode); Restaurants(name, streetaddress, city, state, zip, areacode, phone); Hotels(name, streetaddress, city, state, zip, areacode, phone, hoteltype, review); Theaters(name, streetaddress, city, state, zip, areacode, phone).
Source relations: Zagat(zname, zstreetaddress, zcity, zstate, zzip, zphone); Dinesite(dname, dstreetaddress, dcity, dstate, dzip, dphone); Travelocity(tname, tstreetaddress, tcity, tstate, tzip, tphone, thoteltype); Orbitz(oname, ostreetaddress, ocity, ostate, ozip, ophone, ohoteltype); Yahoo(yname, ystreetaddress, ycity, ystate, yzip, yphone); Moviefone(mname, mstreetaddress, mcity, mstate, mzip); Geocoder($streetaddress, $city, $state, $zip, lat, lon); AreaCodeUpdates($oldareacode, areacode); HotelReviews($name, $hoteltype, review).

Our example domain model contains various data sources to obtain hotel, restaurant, and theater information. The Zagat and Dinesite data sources provide information about restaurants, the Travelocity and Orbitz data sources provide information about hotels, and the Yahoo and Moviefone data sources provide information about theaters. The data-integration system also has access to a Geocoder data source that provides geographic coordinates for a given address, an area code updates data source that provides information about area code changes, and a hotel review data source that accepts the name of a hotel and a hotel type and provides the review for the hotel. The data-integration system models the available data sources as source relations. In addition to these source relations, it has a set of domain relations, which are defined as views over the source relations. In our example, Locations, Theaters, Hotels, and Restaurants represent domain relations.

Consider the situation where Apollo is attempting to link hotels obtained from the Orbitz and Travelocity data sources. Apollo sends a request to the Prometheus mediator to obtain all possible secondary sources that are related to the Orbitz or Travelocity source relations.
Upon receiving the request, Prometheus first finds all domain relations that are defined as views over the given sources. In our example, it finds that the domain relation Hotels is defined as a join between the HotelReviews source relation and the union of the Orbitz and Travelocity source relations. In the second step, the mediator analyzes the view definitions of the domain relations found in the first step to obtain all source relations that participate in those view definitions. In the given example, Prometheus finds the HotelReviews, Orbitz, and Travelocity source relations. Next, it picks the source relations other than the primary source relations as available secondary data sources; from the three source relations, Prometheus picks HotelReviews as the available secondary source, as both Orbitz and Travelocity are primary sources.

Next, the mediator repeats the above-mentioned steps with the domain relation found in the first step. In the given example, Prometheus first finds that the view definition for the Locations domain relation includes the Hotels domain relation. Moreover, Prometheus finds that the Locations domain relation is defined as a join of the Hotels, AreaCodeUpdates, and Geocoder relations. Therefore, it adds the AreaCodeUpdates and Geocoder sources to the set of available secondary sources. Prometheus repeats the process with Locations as input and finds no qualifying domain relations. Therefore, for the given example, Prometheus returns the HotelReviews, AreaCodeUpdates, and Geocoder source relations as available secondary sources. It is important to note that the goal of the mediator is to find all available secondary sources, even the ones that may not be useful. In the next section, we describe how Apollo utilizes the information found in these secondary sources and determines which secondary sources are useful.
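The sketch below (our simplification, not Prometheus's actual rule language) captures this traversal. Each domain relation is reduced to the set of relations mentioned in its view definition, and the search walks upward from the primary sources, collecting every other participating relation as a candidate secondary source.

```python
# Simplified sketch (ours) of the secondary-source discovery just described.
# Each domain relation is reduced to the set of relations that appear in its
# view definition; Prometheus's actual domain rules are richer than this.
DOMAIN_VIEWS = {
    "Hotels":      {"HotelReviews", "Orbitz", "Travelocity"},
    "Restaurants": {"Zagat", "Dinesite"},
    "Theaters":    {"Yahoo", "Moviefone"},
    "Locations":   {"Hotels", "AreaCodeUpdates", "Geocoder"},
}

def find_secondary_sources(primaries, views=DOMAIN_VIEWS):
    """Walk up the view definitions from the primary sources, collecting every
    other participating relation as a candidate secondary source."""
    secondary, frontier, seen = set(), set(primaries), set()
    while frontier:
        # Domain relations whose view definition mentions something in the frontier.
        qualifying = {d for d, body in views.items() if body & frontier and d not in seen}
        seen |= qualifying
        for d in qualifying:
            secondary |= views[d] - set(primaries)
        frontier = qualifying  # repeat the steps with the domain relations just found
    return secondary - set(views)  # drop domain relations, keep source relations only

# For the hotels example this yields HotelReviews, AreaCodeUpdates, and Geocoder.
print(find_secondary_sources({"Orbitz", "Travelocity"}))
```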
How to Utilize Information from Secondary Sources

Current record-linkage systems do a good job of learning how to weigh the attributes of different records across data sources. Using machine-learning techniques such as decision trees (Quinlan 1996) and bagging and boosting (Abe and Mamitsuka 1998; Breiman 1996), they are able to determine which attributes are most relevant to consider when trying to match records across different data sources. Apollo takes advantage of this process by augmenting data sources with additional attributes. The machine-learning component of the record-linkage system is then able to use these attributes in learning the correct mapping rules used in the linkage process. Using the labeled examples provided, the system is able to learn whether the newly added attributes are informative enough to incorporate into the mapping rules. With the flexible nature of the record-linkage framework used in Apollo, incorporating the additional information from secondary sources is an easy and efficient process.

A key point needs to be kept in mind when utilizing the information from the secondary sources: when augmenting primary data sources with additional information, it is important not to overload and confuse the record-linkage system. We address this point using the systematic approach to adding additional information discussed below.

Secondary data sources fall into one of two possible categories: (1) one-to-one mapping sources and (2) one-to-N mapping sources. A one-to-one mapping source is a secondary data source that takes an input and produces only one possible answer for each unique data record passed in. The geocoder data source, which takes as input a street address and produces the corresponding latitude and longitude for that address, is a one-to-one mapping source: each address has only one latitude (lat) / longitude (lon) combination associated with it. For example, passing the address "4676 Admiralty Way, Marina del Rey, CA 90292" to the source yields the following output: <lat>33.980304877551021</lat>, <lon>-118.44027018504094</lon>. A one-to-N mapping source is a secondary data source that takes an input and produces multiple possible answers for each unique data record passed in. The area code updates secondary source is an example of such a source because one unique area code may yield multiple past area codes (area codes could have been merged to produce the area code in question). For example, inputting 619 into this source yields the following past area codes: 760, 858, and 935.

It is important to differentiate between the two categories when deciding how to use the data from a secondary source. Take as an example a one-to-N mapping source that returns five unique values for a single input. If each primary source contains 100 records and both are augmented with data from this secondary source, the total number of unique records in each primary source would increase to 500. This increase clouds the real record-linkage problem at hand and would in fact make it harder to solve, because of the increased complexity and the lack of a guarantee that the additional information is beneficial. In this work, we address this issue by using one-to-N mapping sources only to augment one of the primary sources (our approach to choosing which primary source to augment is discussed later). This approach avoids large increases in complexity and the problem introduced by extraneous records in each primary data source. We plan to conduct research on the problem of needing to augment both primary data sources with a one-to-N mapping source; this is discussed in the "Conclusion and Future Work" section. The problem does not exist if there are multiple one-to-one data sources available for a given primary data source, as each one-to-one secondary data source adds one or more attributes to each record from the primary data sources; therefore, the number of records stays the same.

Another issue that needs to be dealt with is the question of which primary source or sources should be augmented with additional information. Semantic types and attribute bindings are used to make this determination. The Apollo record-linkage system looks at the attribute types found in each primary source and compares them to the output types of the secondary source. If any of the primary data sources already contains an attribute of the same type as the output type of the secondary source, that primary source does not need to be augmented with information from the secondary source. We do not want to add preexisting attributes, because doing so may cause inconsistencies in the data for each record in the primary source. These inconsistencies could lead to inaccurate classification of matches between the primary data sources, defeating the purpose of using secondary sources to improve record linkage.

Once the additional attributes have been added to the primary sources, a binding is made between the added attribute types in each dataset. For example, if one primary data source contains latitude and longitude attributes and the second contains a street address but no latitude and longitude attributes, the Apollo system queries the geocoder secondary source using the street address from the second primary source and augments this primary source with the returned data. Once the primary source has been augmented with this data, a binding is made between the latitude and longitude attributes in the two primary data sources and the record-linkage process is run. This step is necessary because the record-linkage system needs an explicit declaration of the mappings of attributes from one primary data source to the other.

As discussed previously, the data-integration component of the Apollo system handles the querying of secondary data sources. Once the data from the secondary source or sources has been queried, the additional data is incorporated into one or both of the primary data sources. The Apollo system currently does this by appending the additional information to the record as an additional attribute. As mentioned earlier in this section, once an attribute is appended to a primary data source, a binding is created between this attribute and the corresponding attribute in the other primary data source.
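The following sketch (ours, with a stubbed geocoder standing in for the real secondary source) pulls these pieces together for the latitude/longitude example: only the primary source that lacks attributes of the geocoder's output types is augmented, and a binding then tells the record-linkage step which attributes to compare across the two sources.

```python
# Sketch (ours, not Apollo code) of the augmentation-and-binding step above.
# geocode() is a stub for the one-to-one geocoder secondary source; in Apollo
# the query would go through the Prometheus mediator and return real coordinates.
def geocode(street, city, state, zipcode):
    """Stub: address -> (lat, lon); placeholder values only."""
    return ("34.0", "-118.5")

def augment_with_latlon(records):
    """Append lat/lon attributes to records that do not already carry them."""
    for r in records:
        if "lat" not in r or "lon" not in r:
            r["lat"], r["lon"] = geocode(r["street"], r["city"], r["state"], r["zip"])
    return records

# Source 1 already carries coordinates; source 2 does not, so only source 2 is augmented.
source1 = [{"name": "Broadway Bar&Grill", "street": "1460 Third St. Promenade",
            "city": "Santa Monica", "state": "CA", "zip": "90401-2322",
            "lat": "34.0", "lon": "-118.5"}]   # lat/lon values here are illustrative
source2 = [{"name": "Broadway Bar and Grill", "street": "1460 3rdstreet Promenade",
            "city": "Santa Monica", "state": "CA", "zip": "904012322"}]
augment_with_latlon(source2)

# The explicit binding the record-linkage step needs: compare these attribute
# pairs across the two primary sources.
attribute_bindings = [("lat", "lat"), ("lon", "lon")]
```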
Experimental Evaluation

We tested the Apollo system in two real-world domains, restaurants and companies. In the restaurant domain, we used the wrapper technology discussed by Muslea, Minton, and Knoblock (2001) to extract restaurant records from the Zagat and Dinesite web sources (see notes 1 and 2). Each web source provided a restaurant's name, address, city, state, telephone number, and cuisine type. Because of the inconsistencies between the two sources, a record-linkage system is required to find common restaurants. Furthermore, we used the geocoder web service (see note 3) as a secondary source. This source takes an address as input and outputs the corresponding latitude and longitude coordinates.

The Zagat data source contained 897 records, while the Dinesite data source contained 1,257 records. There were 136 matching records in the two datasets. As a first step we ran the candidate generator from the Active Atlas system to obtain about 9,000 candidate matches. From the candidate matches, we randomly selected 50, 100, 150, 200, and 250 record pairs, labeled them as matches or nonmatches, and used them as input to the Apollo system. The goal of the experiments was to show that by utilizing the geocoder secondary source, the system was able to link records more accurately across the two primary data sources using fewer labeled examples.

The restaurant domain experimental results are shown in figures 4 and 5. As shown in the figures, the addition of a secondary source led to a significant improvement in precision and recall. With the use of a secondary source, Apollo reached 100 percent precision and 76 percent recall with only 150 total labeled examples. Out of the 150 labeled examples, there were, on average, only 6 positive examples. In the cases of 50 and 100 labeled examples, there were only 2 and 3 positive examples, respectively; therefore, there was not enough information from positive-labeled examples for the decision tree learner to utilize information from secondary sources.
Without the secondary source, precision and recall levels were lower, even with 250 total labeled examples, than the levels seen in Apollo with just 150 total labeled examples. The improvement brought about by the secondary source is due to the secondary source's ability to handle inconsistencies for the given attribute better than string transformations can. With 250 training examples, a decrease in recall is caused by the fact that the decision tree learner begins to overfit the training data. In general, decision trees do not have the problem of overfitting when provided with more training examples; however, in our case, the ratio of positive to negative examples is very small. Therefore, when we increase the number of training examples, the decision trees learn rules specific to the few positive-labeled examples, causing the decrease in recall. However, we can see that the precision (100 percent) and recall values are still better when using a secondary source.

Figure 4. Restaurant Domain Precision Results (precision versus number of training examples, with and without the secondary source).

Figure 5. Restaurant Domain Recall Results (recall versus number of training examples, with and without the secondary source).

In the company domain, we extracted company records from two primary data sources: (1) a company database containing the company name and ticker symbol for 21,281 companies, and (2) company information (company name, person name, and position) extracted from various news articles for 8,800 companies using the wrapper technology discussed by Muslea, Minton, and Knoblock (2001). The key challenge with the company domain data was that the company name was the only common attribute, and companies were referred to using different names in different articles. We used Yahoo Finance as a secondary source; it provided the company name, ticker symbol, and three top officials for a given company.

Figure 6. Business Domain Precision Results (precision versus number of training examples, with and without the secondary source).

Figure 7. Business Domain Recall Results (recall versus number of training examples, with and without the secondary source).

As shown in figures 6 and 7, record linkage using a secondary source achieves better precision and recall values, as it is able to utilize person and company names in the linkage process. It is worth noting that company articles may mention people other than the top three officials, in which case our secondary source is less useful.

Related Work

There has been significant work done on solving the record-linkage problem. We divide the related work into three categories: (1) record linkage (Bilenko and Mooney 2003; Doan et al. 2003; Hernández and Stolfo 1995; Jin, Li, and Mehrotra 2003; Monge and Elkan 1996; Sarawagi and Bhamidipaty 2002), (2) record linkage for data cleaning (Chaudhuri et al. 2003; Raman and Hellerstein 2001), and (3) schema matching (Dhamankar et al. 2004; Madhavan and Halevy 2003; Miller, Haas, and Hernández 2000). We describe the most closely related works in each of these areas next.

Research in the record-linkage area includes research on entity matching (Doan et al.
2003; Jin, Li, and Mehrotra 2003), object consolidation (Jin, Li, and Mehrotra 2003; Tejada, Knoblock, and Minton 2002), and deduplication (Bilenko and Mooney 2003; Hernández and Stolfo 1995; Monge and Elkan 1996; Sarawagi and Bhamidipaty 2002). The problem of entity matching is to determine whether two given records refer to the same real-world entity. Object consolidation is defined as the process of merging records from two data sets such that there is only one record per real-world entity. Record linkage is defined as the process of linking two records based on a set of common attributes. All of these systems utilize some form of textual similarity measure to determine whether two records should be linked. However, none of them incorporates the idea of utilizing secondary sources to obtain relevant information and using this information to improve the record-linkage process.

Doan et al. (2003) describe a profiler-based approach to improving entity matching. The key idea in the article is to design profilers by mining large amounts of data from different web sources, obtaining input from domain experts, or examining previously matched entities. The profilers generate rules that determine relationships between various attributes of entities; for example, someone with age 9 is not likely to have a salary of $200,000. This idea is complementary to our approach of utilizing secondary sources to provide additional attributes.

Record linkage has been used for data cleaning by linking records from an unclean dataset to a data source that contains clean records. Chaudhuri et al. (2003) describe a fuzzy matching approach for data cleaning in online data catalogs. The goal of their work is to determine the records in the reference relation that are closest to a given record from a relation containing erroneous records. This work relies on a reference relation to map records from a data source containing inconsistencies or errors to a clean data source; such an approach is limited by the availability of reference relations. The Potter's Wheel system (Raman and Hellerstein 2001) provides a graphical user interface for linking records across two data sources. Potter's Wheel generates candidate matches between records from two data sets and presents them to the user, who can then determine which records are matches or nonmatches. In these systems, Apollo could be utilized to provide additional information about the records, which would assist the user in interactively labeling the records in Potter's Wheel or provide additional information when cleaning records in the system developed by Chaudhuri et al. (2003).

Identifying how one or more attributes of a dataset are related to one or more attributes of another dataset is often referred to as the schema-mapping problem (Dhamankar et al. 2004; Madhavan and Halevy 2003; Miller, Haas, and Hernández 2000). This is quite different from the record-linkage problem addressed in this article. However, data-integration or data-sharing applications may require modules to solve both schema-mapping and record-linkage problems, and the work done in this area could be used in conjunction with the work presented in this article to create a robust and efficient data-integration application.

Conclusion and Future Work

In this article, we presented our approach to utilizing secondary sources to improve the accuracy of record linkage. We showed how our Apollo record-linkage system discovers and utilizes secondary sources.
Our experimental evaluation, in two different real-world domains, shows that Apollo reduces the number of labeled examples required while improving the precision and recall values in each domain. In the future we plan to address the issues associated with augmenting both primary data sources with information from one-to-N mapping secondary sources. Furthermore, we are researching ways in which we can reduce the number of queries sent to secondary sources, since queries may be expensive or time consuming. Finally, even though the transformations used in Apollo are quite comprehensive, they do not cover all possible sets of transformations. To address this problem, we are working on improving the field (attribute) level matching process. This work applies specific sets of transformations depending on the semantic types of different attributes and leads to more accurate confidence measures for the given attributes.

Acknowledgements

This material is based upon work supported in part by the National Science Foundation under Award No. IIS-0324955, in part by the Defense Advanced Research Projects Agency (DARPA), through the Department of the Interior, NBC, Acquisition Services Division, under Contract No. NBCHD030010, in part by the Air Force Office of Scientific Research under grant numbers F49620-01-1-0053 and FA9550-04-1-0105, in part by the United States Air Force under contract number F49620-02-C-0103, and in part by a gift from the Microsoft Corporation. The U.S. government is authorized to reproduce and distribute the authors' original reports for governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of any of the above organizations or any person connected with them.

Notes

1. www.zagats.com
2. www.dinesite.com
3. terraworld.isi.edu/Geocode/Service1.asmx

References

Abe, N.; and Mamitsuka, H. 1998. Query Learning Strategies Using Boosting and Bagging. In Proceedings of the Fifteenth International Conference on Machine Learning. San Francisco: Morgan Kaufmann Publishers.

Bayardo Jr., R. J.; Bohrer, W.; Brice, R.; Cichocki, A.; Fowler, J.; Helal, A.; Kashyap, V.; Ksiezyk, T.; Martin, G.; Nodine, M.; Rashid, M.; Rusinkiewicz, M.; Shea, R.; Unnikrishnan, C.; Unruh, A.; and Woelk, D. 1997. InfoSleuth: Agent-Based Semantic Integration of Information in Open and Dynamic Environments. In Proceedings of the ACM SIGMOD International Conference on Management of Data. New York: Association for Computing Machinery.

Bilenko, M.; and Mooney, R. J. 2003. Adaptive Duplicate Detection Using Learnable String Similarity Measures. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: Association for Computing Machinery.

Breiman, L. 1996. Bagging Predictors. Machine Learning 24(2): 123–140.

Chaudhuri, S.; Ganjam, K.; Ganti, V.; and Motwani, R. 2003. Robust and Efficient Fuzzy Match for Online Data Cleaning. In Proceedings of the ACM SIGMOD International Conference on Management of Data. New York: Association for Computing Machinery.

Dhamankar, R.; Lee, Y.; Doan, A.; Halevy, A.; and Domingos, P. 2004. iMAP: Discovering Complex Semantic Mappings between Database Schemas. In Proceedings of the ACM SIGMOD International Conference on Management of Data. New York: Association for Computing Machinery.
Doan, A.; Lu, Y.; Lee, Y.; and Han, J. 2003. Object Matching for Data Integration: A Profile-Based Approach. IEEE Intelligent Systems 18(5): 54–59.

Garcia-Molina, H.; Hammer, J.; Ireland, K.; Papakonstantinou, Y.; Ullman, J.; and Widom, J. 1995. Integrating and Accessing Heterogeneous Information Sources in TSIMMIS. In Information Gathering from Heterogeneous, Distributed Environments: Papers from the AAAI Spring Symposium, Technical Report SS-95-08. Menlo Park, CA: American Association for Artificial Intelligence.

Genesereth, M. R.; Keller, A. M.; and Duschka, O. M. 1997. InfoMaster: An Information Integration System. In Proceedings of the ACM SIGMOD International Conference on Management of Data. New York: Association for Computing Machinery.

Hernández, M. A.; and Stolfo, S. J. 1995. The Merge/Purge Problem for Large Databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data. New York: Association for Computing Machinery.

Hsu, C.-N.; and Dung, M.-T. 1998. Generating Finite-State Transducers for Semistructured Data Extraction from the Web. Information Systems Journal 23(8): 521–538.

Jin, L.; Li, C.; and Mehrotra, S. 2003. Efficient Record Linkage in Large Data Sets. In Proceedings of the Eighth International Conference on Database Systems for Advanced Applications (DASFAA 2003). Piscataway, NJ: Institute of Electrical and Electronics Engineers.

Knoblock, C. A.; Minton, S.; Ambite, J. L.; Ashish, N.; Muslea, I.; Philpot, A.; and Tejada, S. 2001. The ARIADNE Approach to Web-Based Information Integration. International Journal on Intelligent Cooperative Information Systems 10(1-2): 145–169.

Levy, A. Y.; Rajaraman, A.; and Ordille, J. J. 1996a. Query-Answering Algorithms for Information Agents. In Proceedings of the Thirteenth National Conference on Artificial Intelligence. Menlo Park, CA: AAAI Press.

Levy, A. Y.; Rajaraman, A.; and Ordille, J. J. 1996b. Querying Heterogeneous Information Sources Using Source Descriptions. In Proceedings of the Twenty-Second International Conference on Very Large Databases. San Francisco: Morgan Kaufmann Publishers.

Madhavan, J.; and Halevy, A. 2003. Composing Mappings among Data Sources. In Proceedings of the Twenty-Ninth International Conference on Very Large Databases. San Francisco: Morgan Kaufmann Publishers.

Michalowski, M.; Thakkar, S.; and Knoblock, C. A. 2003. Exploiting Secondary Sources for Automatic Object Consolidation. Paper presented at the First KDD Workshop on Data Cleaning, Record Linkage, and Object Consolidation, Washington, D.C., 27 August.

Miller, R. J.; Haas, L. M.; and Hernández, M. A. 2000. Schema Mapping as Query Discovery. In Proceedings of the Twenty-Sixth International Conference on Very Large Data Bases. San Francisco: Morgan Kaufmann Publishers.

Monge, A. E.; and Elkan, C. 1996. The Field Matching Problem: Algorithms and Applications. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining. Menlo Park, CA: AAAI Press.

Muslea, I.; Minton, S.; and Knoblock, C. A. 2001. Hierarchical Wrapper Induction for Semistructured Information Sources. Autonomous Agents and Multi-Agent Systems 4(1): 93–114.

Quinlan, J. R. 1996. Improved Use of Continuous Attributes in C4.5. Journal of Artificial Intelligence Research 4: 77–90.

Raman, V.; and Hellerstein, J. M. 2001. Potter's Wheel: An Interactive Data Cleaning System. In Proceedings of the Twenty-Seventh International Conference on Very Large Data Bases, 381–390. San Francisco: Morgan Kaufmann Publishers.
Sarawagi, S.; and Bhamidipaty, A. 2002. Interactive Deduplication Using Active Learning. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: Association for Computing Machinery.

Tejada, S.; Knoblock, C. A.; and Minton, S. 2001. Learning Object Identification Rules for Information Integration. Information Systems 26(8): 607–633.

Tejada, S.; Knoblock, C. A.; and Minton, S. 2002. Learning Domain-Independent String Transformation Weights for High Accuracy Object Identification. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: Association for Computing Machinery.

Thakkar, S.; Ambite, J. L.; and Knoblock, C. A. 2003. A View Integration Approach to Dynamic Composition of Web Services. Paper presented at the 2003 ICAPS Workshop on Planning for Web Services, Trento, Italy, 10 June.

Martin Michalowski is a doctoral student in the Computer Science Department at the University of Southern California. His research interests include information integration, record linkage, constraint-based integration, heuristic-based planning, and machine learning. He received his B.CompE. in computer engineering from the University of Minnesota in 2001 and his M.S. in computer science from the University of Southern California in 2003. He can be reached at martinm@isi.edu or www.isi.edu/~martinm.

Craig A. Knoblock is a senior project leader at the Information Sciences Institute and a research associate professor in computer science at the University of Southern California. His current research interests include information integration, automated planning, machine learning, and constraint reasoning, and the application of these techniques to geospatial and biological data integration. He received his Ph.D.
in computer science from Carnegie Mellon University. He is a member of AAAI and ACM. He can be reached at knoblock@isi.edu or www.isi.edu/~knoblock.

Snehal Thakkar is a Ph.D. student in computer science at the University of Southern California. His research interests include information integration, automatic web service composition, and geographical information systems. He received his M.S. in computer science from the University of Southern California. He can be reached at thakkar@isi.edu or www.isi.edu/~thakkar.