From: AAAI Technical Report SS-96-05. Compilation copyright © 1996, AAAI (www.aaai.org). All rights reserved. Learning Models for Multi-Source Integration* Sheila Tejada Craig A. KnoblockSteven Minton Universityof Southern CaJifomia/ISI 4676 AdmiraltyWay Marinadel Rey, California 90292 { tej ada,knoblock;rninton} @isi.edu Abstract One issue involved in accessing multiple heterogeneous information somces is howto integrate the retrieved dam. SIMS,an information mediator, handles this problem by mapping the data in each information source into a commoninformation or domain model. This model defines the relationships betweenthe data in the different sources, so that the ~,,t. can be integrated properly. Humanexpexts who are familiar with the contents of the information sources are required to generate the model and sourcos need to be accessed. In this situation it is necessary to have information about the telat/onships of the d~t~ within each source, as well as between sources, to properly access and integrate the d AtJ, reCcieved. The SIMS (Ambitee~ a/. 1995)informationbrokercapuu~this type of information in the form of a model, c~l/ed a domain model. Because all the information sources mapinto the domain model, the domain model serves aa an interface which can provide the user access to mnltiple sources. The user specifies transactions to be performed using the tenm of the domain model, and is not required to have performthe mapping of the sources.Automating this knowledge aboutspeci~csomees. process of constructing the model of the infonnadon sources would be mote convenient and efficient. In the next section we will descn’be the domainmodelin more detail, and then discuss our basic approach to address this problem of learning models for multi-source integration. Next, we will present our prel/minary work that we have conductedon creating abstract descz/ptions of the d,tA in order to mine for existing relationships. We will then provide a discussion of related work, and conclude by descn’bi~ the future directions of our work. Assnming that the ~ of the infomlatiou is a set of tables, the ftrst step in this processis mining the tables to discover relationships in the data. These relationships are then used to further develop the modelby helping to determine the necessary classes, sulmdess/subclassrelationships, and associations. Introduction Because of the growing number of infonnation sources available through the interact there are nowmanycases in which infonnalion needed to solve a problem or answer a question is slnead across several infom,ttion sources. In order to obtain the desinui informAtlon, multiple sources need to be accessed, and the retrieved &,t.+ then has to be integrated. Anexampleto illustrate this problemof multisource integration is whengiven two infonnation sources, one source containing information about comic books and the other about super heroes, and you want to ask the question "Does Spiderman appear in a Marvel comic book?" To answer this question both of the information *’t~ zesem~ n~led hae wmmmmm~ in pm by Rome Laboratory of the Air Force Systems Commandand the Advanced Rmearch Pmjecte Agency under ConUact Number F30602-94-C-0210,and in pert by in Augmentm/oa Awardfor Science and Engineet~q Research Training funded by the Advmr.edRmesurchProjects Agencyunder Contract Number F49620-93-1-0594. "l’ne views and conduiomcontained in this are throe of the autbea and should not be interlnted -~ tbe o/fi~l opinioa or policy of RL, ARPA,tbe U.S. Oovenmnema, or any pemonor Nlencycomnectedwith th*m- Model Description The diagram shown in Figure 1 is a papal model of both the ComicBook and Super Hero information sources. ComicBookSourceModel Artist Contains ~ II SuperHeroSourceM odel This modeldefines the relationships of the data within the twosomcesand also showstheir interaction. Theovals in the diagramre--at classes and the lines descn’bethe relationships betweenthe classes. Thereare two types of classes - a basic class and a compositeclass. Members of a basic class are single-v~Ineddements,such as strings or numbers; while the membersof a composite class are objects wh/chcan haveseveral atm~mtes. In this example CoIlc BookIllustrator and Super HeroCreatormeboth basic classes whichcontain artists’ names. Comic k~, Marvel Comic Book, and Super amall representedas cumpositeclasses. Tuearrows n~n~ntthe attKoutesof the objects, e.g. Artist, Title, for a superclass/subclassrelation~ip, but nowonly those classes that have ~ descriptions are tested. Oneof the mainreasonsto generatethese abstract descriptionsof the data me so that they can be used to help determine composite classes andtheir relationah’ms. Generafl~ Abslraet Descript/ons An example of a data table is shownin Figure 2. A abstract description is determinedfor each columnor atmlmteof the table. In this table the set of atm~vutes are Name,Artist, and Super Power. A set of features is computedfor each of these atm~outeswhichcharacterizes the basic classes to whicheach atm’butebelongs. Theset of feaUtresthat descn’besthe basic classes also helps in discoveringthe relationships betweenbasic classes. The relationships betweenbasic classes should reflect the relation,h’mS that exist between the atln~utes. Nm~e. TheArtistarrowbetween SuperHeroandSeqper Hero Creator meansthat the values of that atm’bute belong to the set of namesof Super Hero Creator. The thick fines in the modelrepresent superclass/subclass relationships,e.g the nsmasof Super HeroCreator me a subset of Comic Book Illustrator. The Contg/ns relationship betweenCemleBookaud Super Hero is aa association I|nk; whichlinks a ComicBookobject with a SUl~erHeroobject.Thisis like an attribute link, exceptthat it connectstwocougositeclasses. Nmne Artist SuperPower Spiderman Stan Lee Spawn ToddlVlcFarlane Stan Lee Cyclops Learningthe Model SpiderSense Immortality X-ray Power Super Hero Table’ Presently, domain models me manually generated by humanexperts whoare familiar with the data stored in the sources. Autom~on of this task wouldimproveefficiency and the quality of the model. Wehave conducted prefiminary work in automating the process of model construction, ~e basic idea for our approachis that we e~n~the damin the information sources to extract the relationsh/~ neededto create the domainmodeL Learnins the model is an iterative process which involves user interaction. The user can browse the generated model to makecorrections or specify an information source to be added or deleted. The model proposedto the user is generatedby heuristics wehave developedto derive the necessaryrelationships fromthe data. The user’s input is incorporated in ge~g this model. The heuristics that we have spplled involve cresting abstract descr/ptimsof the d-tn whichis descn’bed hutberia the nextsection. r~mz Shownin Figure 3 is the table that contains the computedset of featmesfor the atmlmtesgiven in Figure 2. Theset of featuresusedto characterizeeachattribute in this exampleis Unique,Type,Subtype,Length,and Set of VAlues. unique Setd Values {Spawn, Yes stag Artist No stag Ah~ Super Yes S~g Ah~ befit ,,,} {Sin ~...} betic Power AbstractDescriptions Wehaveassumedthat the d~t_- is ~nedas tables, andthat valueswithin a columnof a table merelated to eachother. Anabstract description of a columnof data defines the basic class for that collection of data. Properties or features of the data, such as type midrange, mecontained in the abstract descriptions. Thesedescriptions methen used to help determine the superclass/ subclass relationsh/ps that exist betweenthe basic classes. The descriptions can constrain the numberof poss~le related ckases. Potentially, All classes wouldneedto be checked 123 Other I1~=I.~12 {Spider Seal .1 FIgme 3 The Unique feature determines whether there me duplicate elements, e.g. the Nmueattribute does not contain duplicate values. The Type of an attribute is either str~g or number. The values for the feature Subtypeare subtypes of string aud number.The subtypes for mm~erme hm~ermdnil. For string there are four basic subtypes: alphabeOc,alphanumeric, numer/c, and othm’.Thevalue other refers to a string that has a least one character whichis not a letter, a space or a number. Thecollection of the values of an attn’bute maybe of one subtype or any combinationof the four, as shownby the SuperPower mribute. The Lengthor Rangefeature determines the range of the length of the set of strings or the rangeof the set of nnmhellin the attribute. For eachof the attributes a set of their valuesare calculatedby the Set of Valuesfeature; tbe~ore, duplicate values are removed,as is donefor the Artist attribute. Thereare manyother features whichcan be usedto fred regularities in tbe data, suchas detemfining if the alphabetic strings of an atmbuteare wonk.These featmesrepresentonly a subset. Relatione]dFe between Bask Clama The set of features shownin Figure 3 are the abstract descriptions of the attribute,. The~de~iptionsdefine the basic class of the attn’bnte. TheSet of Valuesfeature providesall of the instancesbelongingto that basic class. TheTypeand Lengthfeatures can be used to restrict the set of attr/butes that are related to a givenattribute. For the attribute Nmm, the value for Typeis string, and the value for Subtypeis Alplmbe~;the~ore, Nm~can only be related to other attnlzaes whosebasic classes also have Typs as stri~ and a S~ which ineledes All2mbe//c. Basedon the table in Figure3 the set of mla~am’butes whosebasic classes overlap w/th Nmne’sTypefeature is {Artlst, Super Power}. The same is nowdone for the Lengthfesmreof Name.There are no basic classes whose Length featme contains or is contained in the Name~ Lengthfeature, as shownby the comparisonsof Length feattm~depictedin Figure4; therefore, the set of related atropinesis angry. Ha’o.Arest attribute and the ComicBookIllustrator class of the ComicBook.Artist. The attributes are mappedinto the model using this process which is repeatedfor eachattribute in the table. For the attributes that have an overlapping or non-overlapping class relationship, a superelassof the twoattributes needsto be created to properly represent their interaction. Further infom~onneeds to be computedin order to showthat the set of these types of proposedrelationshipsare valid. Related Work There has been similar work conducted in this area (Perkowitz&Etzioul 1995), but it has focusedon learning only the attributes of the information sources. Their approachalso assumesthat it has instances as well as an initial model of the information to be learned from perspective information sources. Related work using mining techniques to learn a model of the information sources has been researched by Kern et al (Kero et al. 1995). Their work uses data minlna to determine which attributes are related, while our workdescribed in this paper is primarily interested in the types of the relationth’msthat exist between the attributes. Discussion Tn/s is preliminary workon the antomationof modeling information sources, which has focused mainly on discovering the relationships between basic classes. Further work still needs to be conducted to devise mechanismsfor deciding whento generate a primitive superclass for the overlapping and the non-overlapping related basic classes. Wemephumingto apply statistical methodsto assist in addressing these cases, as well as integrating a natural languageknowledgebase into this process to help determinewhetherchmses are semantically related. Therelatio~ between the basicclasses will be useful in the later steps of the modelingprocesswhichare to determine the associations and superclaas/subclass relatiomhips betweencompositeelms. Fi gm’e4 Next,the Set of Value,feemreis evaln!t__~_ usingthe set of attributes producedby the intersection of the two previoussets of related attributes. Thisfeature will decide the type of relationship that exists betweenthe basic classes. The possible types of relationships are superclass/subclus, overlapping classes, and nonoverlappingrelated classes. TheSet of Valuesfeature of Nmm is evaluatedby testing the overlapof its values with thoseof an attributein the relatedset. Whenthe set of values of an attn’bute is a subset (or supereet) of anotherattn’bute, then a superclass/subclass ~tionship exists betweentheir basic clan~. Anexm~le of this type of relation~ can be seen in FigureI between the basic class Super Hero Creator of the Super 124 References Kern, B., Russell, L, Tsur, S., & ShenW.AnOverviewof Database Minlno Techniques. The KDDworkshopin the 4th International Conferenceon Deductive and ObjectOriented Data~, Singapore, 1995. Ambite, J., Y. Arens, N. Ashish, C. Chce, C. Hsu, C. Knoblock, and S. Tejada. The SIMSManual, Technical RelxmISFIM-95-428,1995. Perkowitz, M. & Etzioni, O. Category Translation: Learning to UndemUmd Information on the Interact. The International Joint Conference on Artificial Intelligence, Monlzeal, 1995.