Learning Models for Multi-Source Integration* Steven Minton University

advertisement
From: AAAI Technical Report SS-96-05. Compilation copyright © 1996, AAAI (www.aaai.org). All rights reserved.
Learning Models for Multi-Source Integration*
Sheila Tejada Craig A. KnoblockSteven Minton
Universityof Southern
CaJifomia/ISI
4676 AdmiraltyWay
Marinadel Rey, California 90292
{ tej ada,knoblock;rninton} @isi.edu
Abstract
One issue
involved
in accessing
multiple
heterogeneous
information
somces
is howto integrate
the retrieved dam. SIMS,an information mediator,
handles this problem by mapping the data in each
information source into a commoninformation or
domain model. This model defines the relationships
betweenthe data in the different sources, so that the
~,,t. can be integrated properly. Humanexpexts who
are familiar with the contents of the information
sources are required to generate the model and
sourcos need to be accessed. In this situation it is
necessary to have information about the telat/onships of the
d~t~ within each source, as well as between sources, to
properly access and integrate the d AtJ, reCcieved. The SIMS
(Ambitee~ a/. 1995)informationbrokercapuu~this type
of information in the form of a model, c~l/ed a domain
model. Because all the information sources mapinto the
domain model, the domain model serves aa an interface
which can provide the user access to mnltiple sources. The
user specifies transactions to be performed using the tenm
of the domain model, and is not required to have
performthe mapping
of the sources.Automating
this
knowledge
aboutspeci~csomees.
process of constructing the model of the infonnadon
sources would be mote convenient and efficient.
In the next section we will descn’be the domainmodelin
more detail, and then discuss our basic approach to address
this problem of learning models for multi-source
integration. Next, we will present our prel/minary work
that we have conductedon creating abstract descz/ptions of
the d,tA in order to mine for existing relationships. We
will then provide a discussion of related work, and
conclude by descn’bi~ the future directions of our work.
Assnming
that
the
~ of the
infomlatiou
is a set of tables, the ftrst step in this processis mining
the tables to discover relationships in the data. These
relationships are then used to further develop the
modelby helping to determine the necessary classes,
sulmdess/subclassrelationships, and associations.
Introduction
Because of the growing number of infonnation sources
available through the interact there are nowmanycases in
which infonnalion needed to solve a problem or answer a
question is slnead across several infom,ttion sources. In
order to obtain the desinui informAtlon, multiple sources
need to be accessed, and the retrieved &,t.+ then has to be
integrated. Anexampleto illustrate this problemof multisource integration is whengiven two infonnation sources,
one source containing information about comic books and
the other about super heroes, and you want to ask the
question "Does Spiderman appear in a Marvel comic
book?" To answer this question both of the information
*’t~ zesem~ n~led hae wmmmmm~
in pm by Rome
Laboratory of the Air Force Systems Commandand the
Advanced Rmearch Pmjecte Agency under ConUact Number
F30602-94-C-0210,and in pert by in Augmentm/oa
Awardfor
Science and Engineet~q Research Training funded by the
Advmr.edRmesurchProjects Agencyunder Contract Number
F49620-93-1-0594.
"l’ne views and conduiomcontained in this
are throe of the autbea and should not be interlnted -~ tbe o/fi~l opinioa or policy of RL, ARPA,tbe U.S.
Oovenmnema,
or any pemonor Nlencycomnectedwith th*m-
Model Description
The diagram shown in Figure 1 is a papal model of both
the ComicBook and Super Hero information sources.
ComicBookSourceModel
Artist
Contains
~
II
SuperHeroSourceM odel
This modeldefines the relationships of the data within
the twosomcesand also showstheir interaction. Theovals
in the diagramre--at classes and the lines descn’bethe
relationships betweenthe classes. Thereare two types of
classes - a basic class and a compositeclass. Members
of
a basic class are single-v~Ineddements,such as strings or
numbers; while the membersof a composite class are
objects wh/chcan haveseveral atm~mtes.
In this example CoIlc BookIllustrator and Super
HeroCreatormeboth basic classes whichcontain artists’
names. Comic k~, Marvel Comic Book, and Super
amall representedas cumpositeclasses. Tuearrows
n~n~ntthe attKoutesof the objects, e.g. Artist, Title,
for a superclass/subclassrelation~ip, but nowonly those
classes that have ~ descriptions are tested. Oneof the
mainreasonsto generatethese abstract descriptionsof the
data me so that they can be used to help determine
composite
classes andtheir relationah’ms.
Generafl~ Abslraet Descript/ons
An example of a data table is shownin Figure 2. A
abstract description is determinedfor each columnor
atmlmteof the table. In this table the set of atm~vutes
are
Name,Artist, and Super Power. A set of features is
computedfor each of these atm~outeswhichcharacterizes
the basic classes to whicheach atm’butebelongs. Theset
of feaUtresthat descn’besthe basic classes also helps in
discoveringthe relationships betweenbasic classes. The
relationships betweenbasic classes should reflect the
relation,h’mS
that exist between
the atln~utes.
Nm~e.
TheArtistarrowbetween
SuperHeroandSeqper
Hero Creator meansthat the values of that atm’bute
belong to the set of namesof Super Hero Creator. The
thick fines in the modelrepresent superclass/subclass
relationships,e.g the nsmasof Super HeroCreator me a
subset of Comic Book Illustrator.
The Contg/ns
relationship betweenCemleBookaud Super Hero is aa
association I|nk; whichlinks a ComicBookobject with a
SUl~erHeroobject.Thisis like an attribute link, exceptthat
it connectstwocougositeclasses.
Nmne
Artist
SuperPower
Spiderman
Stan Lee
Spawn
ToddlVlcFarlane
Stan Lee
Cyclops
Learningthe Model
SpiderSense
Immortality
X-ray Power
Super Hero Table’
Presently, domain models me manually generated by
humanexperts whoare familiar with the data stored in the
sources. Autom~on
of this task wouldimproveefficiency
and the quality of the model. Wehave conducted
prefiminary work in automating the process of model
construction, ~e basic idea for our approachis that we
e~n~the damin the information sources to extract the
relationsh/~ neededto create the domainmodeL
Learnins the model is an iterative process which
involves user interaction. The user can browse the
generated model to makecorrections or specify an
information source to be added or deleted. The model
proposedto the user is generatedby heuristics wehave
developedto derive the necessaryrelationships fromthe
data. The user’s input is incorporated in ge~g this
model. The heuristics that we have spplled involve
cresting abstract descr/ptimsof the d-tn whichis descn’bed
hutberia the nextsection.
r~mz
Shownin Figure 3 is the table that contains the
computedset of featmesfor the atmlmtesgiven in Figure
2. Theset of featuresusedto characterizeeachattribute in
this exampleis Unique,Type,Subtype,Length,and Set of
VAlues.
unique
Setd
Values
{Spawn,
Yes
stag
Artist
No
stag
Ah~
Super
Yes
S~g
Ah~
befit
,,,}
{Sin
~...}
betic
Power
AbstractDescriptions
Wehaveassumedthat the d~t_- is ~nedas tables, andthat
valueswithin a columnof a table merelated to eachother.
Anabstract description of a columnof data defines the
basic class for that collection of data. Properties or
features of the data, such as type midrange, mecontained
in the abstract descriptions. Thesedescriptions methen
used to help determine the superclass/ subclass
relationsh/ps that exist betweenthe basic classes. The
descriptions can constrain the numberof poss~le related
ckases. Potentially, All classes wouldneedto be checked
123
Other
I1~=I.~12
{Spider
Seal .1
FIgme 3
The Unique feature determines whether there me
duplicate elements, e.g. the Nmueattribute does not
contain duplicate values. The Type of an attribute is
either str~g or number. The values for the feature
Subtypeare subtypes of string aud number.The subtypes
for mm~erme hm~ermdnil. For string there are four
basic subtypes: alphabeOc,alphanumeric, numer/c, and
othm’.Thevalue other refers to a string that has a least
one character whichis not a letter, a space or a number.
Thecollection of the values of an attn’bute maybe of one
subtype or any combinationof the four, as shownby the
SuperPower
mribute.
The Lengthor Rangefeature determines the range of
the length of the set of strings or the rangeof the set of
nnmhellin the attribute. For eachof the attributes a set of
their valuesare calculatedby the Set of Valuesfeature;
tbe~ore, duplicate values are removed,as is donefor the
Artist attribute. Thereare manyother features whichcan
be usedto fred regularities in tbe data, suchas detemfining
if the alphabetic strings of an atmbuteare wonk.These
featmesrepresentonly a subset.
Relatione]dFe between Bask Clama
The set of features shownin Figure 3 are the abstract
descriptions of the attribute,. The~de~iptionsdefine the
basic class of the attn’bnte. TheSet of Valuesfeature
providesall of the instancesbelongingto that basic class.
TheTypeand Lengthfeatures can be used to restrict the
set of attr/butes that are related to a givenattribute. For
the attribute Nmm,
the value for Typeis string, and the
value for Subtypeis Alplmbe~;the~ore, Nm~can only
be related to other attnlzaes whosebasic classes also have
Typs as stri~ and a S~ which ineledes All2mbe//c.
Basedon the table in Figure3 the set of mla~am’butes
whosebasic classes overlap w/th Nmne’sTypefeature is
{Artlst, Super Power}. The same is nowdone for the
Lengthfesmreof Name.There are no basic classes whose
Length featme contains or is contained in the Name~
Lengthfeature, as shownby the comparisonsof Length
feattm~depictedin Figure4; therefore, the set of related
atropinesis angry.
Ha’o.Arest attribute and the ComicBookIllustrator
class of the ComicBook.Artist. The attributes are
mappedinto the model using this process which is
repeatedfor eachattribute in the table. For the attributes
that have an overlapping or non-overlapping class
relationship, a superelassof the twoattributes needsto be
created to properly represent their interaction. Further
infom~onneeds to be computedin order to showthat the
set of these types of proposedrelationshipsare valid.
Related
Work
There has been similar work conducted in this area
(Perkowitz&Etzioul 1995), but it has focusedon learning
only the attributes of the information sources. Their
approachalso assumesthat it has instances as well as an
initial model of the information to be learned from
perspective information sources. Related work using
mining techniques to learn a model of the information
sources has been researched by Kern et al (Kero et al.
1995). Their work uses data minlna to determine which
attributes are related, while our workdescribed in this
paper is primarily interested in the types of the
relationth’msthat exist between
the attributes.
Discussion
Tn/s is preliminary workon the antomationof modeling
information sources, which has focused mainly on
discovering the relationships between basic classes.
Further work still needs to be conducted to devise
mechanismsfor deciding whento generate a primitive
superclass for the overlapping and the non-overlapping
related basic classes. Wemephumingto apply statistical
methodsto assist in addressing these cases, as well as
integrating a natural languageknowledgebase into this
process to help determinewhetherchmses
are semantically
related. Therelatio~ between
the basicclasses will be
useful in the later steps of the modelingprocesswhichare
to determine the associations and superclaas/subclass
relatiomhips betweencompositeelms.
Fi gm’e4
Next,the Set of Value,feemreis evaln!t__~_
usingthe set
of attributes producedby the intersection of the two
previoussets of related attributes. Thisfeature will decide
the type of relationship that exists betweenthe basic
classes. The possible types of relationships are
superclass/subclus, overlapping classes, and nonoverlappingrelated classes. TheSet of Valuesfeature of
Nmm
is evaluatedby testing the overlapof its values with
thoseof an attributein the relatedset.
Whenthe set of values of an attn’bute is a subset (or
supereet) of anotherattn’bute, then a superclass/subclass
~tionship exists betweentheir basic clan~. Anexm~le
of this type of relation~ can be seen in FigureI between
the basic class Super Hero Creator of the Super
124
References
Kern, B., Russell, L, Tsur, S., & ShenW.AnOverviewof
Database Minlno Techniques. The KDDworkshopin the
4th International Conferenceon Deductive and ObjectOriented Data~, Singapore, 1995.
Ambite, J., Y. Arens, N. Ashish, C. Chce, C. Hsu, C.
Knoblock, and S. Tejada. The SIMSManual, Technical
RelxmISFIM-95-428,1995.
Perkowitz, M. & Etzioni, O. Category Translation:
Learning to UndemUmd
Information on the Interact. The
International Joint Conference
on Artificial Intelligence,
Monlzeal,
1995.
Download