
From: AAAI Technical Report WS-94-03. Compilation copyright © 1994, AAAI (www.aaai.org). All rights reserved.
Homogeneous discoveries contain no surprises:
Inferring risk-profiles from large databases

Arno Siebes (arno@cwi.nl)
CWI
P.O. Box 94079, 1090 GB Amsterdam, The Netherlands
Abstract

Many models of reality are probabilistic. For example, not everyone orders crisps with their beer, but a certain percentage does. Inferring such probabilistic knowledge from databases is one of the major challenges for data mining.

Recently Agrawal et al. [1] investigated a class of such problems. In this paper a new class of such problems is investigated, viz., inferring risk-profiles. The prototypical example of this class is: "what is the probability that a given policy-holder will file a claim with the insurance company in the next year". A risk-profile is then a description of a group of insurants that have the same probability of filing a claim.

It is shown in this paper that homogeneous descriptions are the most plausible risk-profiles. Moreover, under modest assumptions it is shown that covers of such homogeneous descriptions are essentially unique. A direct consequence of this result is that it suffices to search for the homogeneous description with the highest associated probability.

The main result of this paper is thus that we show that the inference problem for risk-profiles reduces to the well-studied problem of maximising a quality function.

Keywords & Phrases: Data Mining, Probabilistic Knowledge, Probabilistic Search, Probability Theory
1. INTRODUCTION
Many models of reality are probabilistic rather than deterministic, either by lack of current understanding or by nature. For example, it is unrealistic to expect a model that will accurately predict whether or not a customer will order crisps with his beer. It is, however, very well possible to have a model that yields the probability that a customer orders crisps with his beer. In fact, Agrawal et al. have recently shown that such a model can be inferred efficiently from a database [1].
In this paper we introduce a new class of such models, called risk-profiles, and show how to infer them from a database. The prototypical example of this class of problems is derived from the insurance business. Take, e.g., the insurance company Save or Sorry (SOS), which handles car insurance. To stay in business, SOS has to satisfy two almost contradictory requirements. First, the total of premiums received in a given year should be at least as high as the total of costs caused by claims, i.e., the higher the premiums the better. Secondly, the premiums it charges should not be higher than those of the competitors, i.e., the lower the premiums the better.

Hence, SOS should be able to predict the total claims of an insurant in a given year as accurately as possible. Clearly, predicting this amount with 100% accuracy for each and every client is impossible. What insurance companies do is identify groups of clients with the same expected claim amount per year. Such groups are described by so-called risk-profiles, in other words, by what is called a (set-)description in data mining. Clearly, the sharper these descriptions are, the more competitive SOS can be. Since insurance companies have large databases with insurance and client data, inferring risk-profiles offers a major and profitable challenge to data mining.
Another example of the use of risk-profiles is in the analysis of medical data. Both doctors and patients would like to know the chance that a patient dies given the symptoms she has. It is reassuring to know that, say, only 0.1% of the patients with flu will die, but perhaps less so if one knows that the mortality rate for those with both the flu and a heart condition is much higher.
In this paper we study a simple risk inference problem, viz., we only try to infer the probability that a client will file a claim in a given year. More complicated cases are simple extensions of the solution proposed in this paper.
This solution is reached as follows. After recapitulating some basic probability theory, Section 2 presents the formal statement of this problem in terms of (set-)descriptions. This formal problem is analysed in the third section, where we indicate that good descriptions are homogeneous. Roughly, a description is called homogeneous if all extensions of this description yield nothing surprising. The notions of homogeneity and surprises are formalised in Section 4. In Section 5, it is shown that certain maximal homogeneous descriptions are in a certain sense unique. This uniqueness result indicates that these maximal descriptions are plausible solutions for our problem. The next section shows that finding these maximal descriptions is an instantiation of the well-studied problem: maximise this function. It is then argued that, due to the nature of our problem, probabilistic maximisation heuristics are the most adequate. In the final section, the conclusions are stated together with a brief description of current research. Moreover, a brief comparison with related research is given.

Note that this paper is only an extended abstract. All definitions, results and proof-sketches are given in the running text.
2. PRELIMINARIES
In the first subsection we recall some basic facts from probability theory. For a more detailed treatment, the reader is referred to textbooks such as [3]. The second subsection presents the formal statement of the problem. In the third and final subsection some notational conventions are introduced.
2.1 Bernoulli experiments

To determine the risk-profiles for SOS, it is assumed that its clients undertake a Bernoulli, or 0/1, experiment each year. Recall that a Bernoulli experiment is an experiment with two possible outcomes, denoted by 0 and 1. The outcome 1 is often called a success. In our example, a client succeeds in a trial of the experiment if he files a claim in that year.
For a Bernoulli experiment, it is assumed that there is a constant probability, say p, of success for each trial. The probability that we get exactly k successes in n trials is then given by the binomial distribution:

    P(\#S = k \mid n, p) = \binom{n}{k} p^k (1-p)^{n-k}

If n trials result in k successes, the best estimate of p is given by k/n. However, it is conceivable that p differs from k/n. The (100 - δ)% confidence interval for p, given n and k, is the largest interval CI(n, k, δ) ⊆ [0, 1] such that P(p ∉ CI(n, k, δ)) ≤ δ%, i.e., the probability that p lies outside CI(n, k, δ) is smaller than δ%.

CI(n, k, δ) can be computed using the binomial distribution. In fact, using the Chernoff bound

    P(|k/n - p| > \varepsilon) = \sum_{j : |j/n - p| > \varepsilon} \binom{n}{j} p^j (1-p)^{n-j} \le 2 e^{-\varepsilon^2 n / (4p(1-p))}

one sees that, with ε chosen such that this bound is at most δ%, the interval [k/n - ε, k/n + ε] is an easily computed conservative approximation of CI(n, k, δ).
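Both intervals can be computed directly from the binomial distribution. The following sketch is an illustration only (it is not part of the original paper); it treats δ as a fraction rather than a percentage, realises CI(n, k, δ) by a Clopper-Pearson style grid scan, and uses the worst-case bound p(1-p) ≤ 1/4 for the Chernoff-style interval.

    from math import comb, log, sqrt

    def binom_ci(n, k, delta, grid=2000):
        # Clopper-Pearson style scan: keep every p for which the observed
        # count k is not in the extreme delta/2 tail on either side.
        lo, hi = 1.0, 0.0
        for i in range(grid + 1):
            p = i / grid
            upper = sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))
            lower = sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(0, k + 1))
            if upper > delta / 2 and lower > delta / 2:
                lo, hi = min(lo, p), max(hi, p)
        return lo, hi

    def chernoff_ci(n, k, delta):
        # Conservative interval [k/n - eps, k/n + eps]: choose eps so that
        # 2 * exp(-eps^2 * n) <= delta, using the worst case p(1-p) <= 1/4.
        eps = sqrt(log(2 / delta) / n)
        return max(0.0, k / n - eps), min(1.0, k / n + eps)

    # 18 claims among 200 client-years, 95% confidence (delta = 0.05):
    print(binom_ci(200, 18, 0.05))     # exact-style interval around 0.09
    print(chernoff_ci(200, 18, 0.05))  # wider, but trivial to compute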
2.2 Descriptions: the problem

If a client has been insured with SOS for a long time, and if it is reasonable to assume that the client's probability of success has not changed over the years, the confidence intervals of the previous subsection can be used to determine the chance of success for each client individually. Alas, these assumptions are not very often met, if only because, e.g., changes in traffic density ensure that the probability of success for each client will vary over the years.

With only one or two trials per client, the method sketched above will give [0, 1] as the 95% confidence interval for our individual clients, which is pretty useless. The assumption we make is that there are a few groups of clients, such that clients in the same group have the same probability of success. That is, we assume that there is a cover {G_1, ..., G_l} of disjoint subsets of SOS's clients, such that:

    \forall cl_1, cl_2 : [p(cl_1) = p(cl_2)] \leftrightarrow [\exists! i \in \{1, \ldots, l\} : cl_1 \in G_i \wedge cl_2 \in G_i]

where p(cl_i) denotes the probability of success of client cl_i. The assumption that l is small is made to ensure that each group has enough clients to allow for an accurate estimation of the associated probability of success.
The second assumption is that membership of one of the G_i depends on only a few properties of the client. That is, we assume:

1. a set of attributes 𝒜 = {A_1, ..., A_n} with domains D_1, ..., D_n, where dom(A_i) = D_i;
2. a function p : D_1 × ... × D_n → [0, 1];

such that, if A_i(t, cl) denotes the A_i-value client cl has at time t, p(A_1(t, cl), ..., A_n(t, cl)) denotes cl's probability of success at time t.
For SOS, we can think of the attributes age, gender, and miles per year, with their obvious domains.
Using the second assumption, we can rephrase the first as saying that we assume there exists a cover, C = {C_1, ..., C_l}, of disjoint subsets of D_1 × ... × D_n such that:

    \forall v_1, v_2 \in D_1 \times \ldots \times D_n : [p(v_1) = p(v_2)] \leftrightarrow [\forall i \in \{1, \ldots, l\} : v_1 \in C_i \leftrightarrow v_2 \in C_i]

The chance of success that all the v ∈ C_i share is called the associated probability of C_i and is denoted by p_{C_i}.

The problem discussed in this paper can now be stated as: "Find the C_i and their associated probabilities p_{C_i}".
In data mining terminology this means that we try to find a (set-)description for each of the C_i. That is, find logical formulae φ_i such that a value v ∈ D_1 × ... × D_n satisfies φ_i, denoted by φ_i(v), iff v ∈ C_i. The formulae φ_i are expressions in some description language Φ over 𝒜. Which description language Φ is chosen is unimportant for this paper, as long as one can expect that the C_i can be described in Φ and the following four assumptions are met:

1. Φ is closed under negation, i.e., φ ∈ Φ → ¬φ ∈ Φ;
2. Φ is closed under conjunction, i.e., φ_1, φ_2 ∈ Φ → φ_1 ∧ φ_2 ∈ Φ;
3. implication is decidable for Φ;
4. Φ is sparse with regard to P(D_1 × ... × D_n), that is, the number of subsets described by Φ is small compared to the number of all possible subsets.

The necessity of these requirements will become clear in the next sections.

For SOS, an example of such a sparse description language is given by the disjunctive and conjunctive closure of the following set of elementary descriptions:

1. gender = female, gender = male;
2. age = young, age = middle aged, age = old;
3. miles per year = low, miles per year = average, miles per year = high.
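As an illustration of such a language (not taken from the paper; the representation is a hypothetical one of our own), elementary descriptions and their negation, conjunction and disjunction can be modelled as predicates over a tuple of attribute values:

    def eq(attr, value):
        # Elementary description "attr = value" as a predicate over a tuple,
        # where a tuple is represented as a dict of attribute values.
        return lambda t: t[attr] == value

    def neg(phi):
        return lambda t: not phi(t)

    def conj(*phis):
        return lambda t: all(phi(t) for phi in phis)

    def disj(*phis):
        return lambda t: any(phi(t) for phi in phis)

    # The elementary descriptions of the SOS example
    male, female = eq("gender", "male"), eq("gender", "female")
    young = eq("age", "young")
    high_mileage = eq("miles per year", "high")

    # A compound description from the conjunctive closure
    risky = conj(male, young, high_mileage)
    print(risky({"gender": "male", "age": "young", "miles per year": "high"}))  # True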
Of course, the most important assumption on Φ is that the C_i can be described in Φ. Hence, choosing an appropriate Φ is an important task in deriving the risk-profiles.

To state our problem in terms of descriptions, define a set {φ_1, ..., φ_k} of descriptions to be a disjunctive cover, abbreviated to discovery, if:

1. ∀ i, j ∈ {1, ..., k} : i ≠ j → [φ_i ∧ φ_j → ⊥];
2. each φ_i ∈ Φ, and φ_1 ∨ ... ∨ φ_k ↔ ⊤.

Note that if Φ contains ⊥, we assume that discoveries do not contain ⊥. The set {gender = female, gender = male} is a discovery for SOS.

The problem can then be restated as: find a discovery {φ_1, ..., φ_k} such that

    \forall v_1, v_2 \in D_1 \times \ldots \times D_n : [p(v_1) = p(v_2)] \leftrightarrow [\forall i \in \{1, \ldots, k\} : \varphi_i(v_1) \leftrightarrow \varphi_i(v_2)]
2.3 Notational conventions

The database is seen as a recording of trials of our experiment. Hence, it is a table R with schema 𝒜 ∪ {S}, where S has domain {0, 1}. Each time t an object o ∈ P takes experiment E, we insert, regardless of duplicates, the tuple (A_1(t, o), ..., A_n(t, o), S(t, o)) in the table, where:

1. A_i(t, o) denotes the value v_i ∈ D_i that o has for A_i at time t;
2. S(t, o) = 1 if o succeeded in this experiment and S(t, o) = 0 otherwise.

Three notations used for the elements of Φ with respect to R are: ⟨φ⟩ to denote the subtable of R of all tuples that satisfy φ, ⟨φ⟩_S to denote the elements of ⟨φ⟩ that are successes, and ‖φ‖ to denote all v ∈ D_1 × ... × D_n that satisfy φ.
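A sketch of how these notations translate to code (again illustrative only; the column name "S" follows the paper, the row representation is our own assumption):

    # R: each row is one trial, i.e. one client-year, with the outcome in "S"
    # (1 = a claim was filed, 0 = no claim).
    R = [
        {"gender": "male",   "age": "young",       "miles per year": "high", "S": 1},
        {"gender": "male",   "age": "old",         "miles per year": "low",  "S": 0},
        {"gender": "female", "age": "middle aged", "miles per year": "low",  "S": 0},
        # ... in practice many thousands of rows
    ]

    def subtable(phi, rows):
        # <phi>: all tuples of R that satisfy phi
        return [t for t in rows if phi(t)]

    def successes(phi, rows):
        # <phi>_S: the tuples of <phi> that are successes
        return [t for t in subtable(phi, rows) if t["S"] == 1]

    def p_hat(phi, rows):
        # estimated associated probability p_phi = |<phi>_S| / |<phi>|
        sub = subtable(phi, rows)
        return len(successes(phi, rows)) / len(sub) if sub else None

    male = lambda t: t["gender"] == "male"   # an elementary description, as above
    print(p_hat(male, R))                    # fraction of claims among male rows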
3. PROBLEM ANALYSIS
The problem as stated in the previous section consists of two parts. We have to find a discovery {φ_1, ..., φ_k} and we have to prove that for this discovery

    \forall v_1, v_2 \in D_1 \times \ldots \times D_n : [p(v_1) = p(v_2)] \leftrightarrow [\forall i \in \{1, \ldots, k\} : \varphi_i(v_1) \leftrightarrow \varphi_i(v_2)]

Finding discoveries is easy: each sequence φ_1, ..., φ_n ∈ Φ of descriptions generates the potential discovery Ψ = {φ_1, ¬φ_1 ∧ φ_2, ¬φ_1 ∧ ¬φ_2 ∧ φ_3, ..., ¬φ_1 ∧ ... ∧ ¬φ_n}. After removal of duplicates (φ is a duplicate of ψ iff φ ↔ ψ) and of ⊥, Ψ is a discovery. So, we may expect that discoveries are abundant.
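This construction can be written down directly; a sketch using the hypothetical predicate representation of the earlier examples:

    def sequence_to_discovery(phis):
        # Turn a sequence phi_1, ..., phi_n of descriptions (predicates) into the
        # potential discovery {phi_1, ~phi_1 & phi_2, ..., ~phi_1 & ... & ~phi_n}.
        # Logically equivalent duplicates and contradictory elements still have
        # to be pruned afterwards, as in the text.
        discovery = []
        for i, phi in enumerate(phis):
            earlier = phis[:i]
            discovery.append(
                lambda t, phi=phi, earlier=earlier:
                    phi(t) and not any(e(t) for e in earlier))
        # the residue: tuples matched by none of the phi_i
        discovery.append(lambda t: not any(phi(t) for phi in phis))
        return discovery

By construction the elements are pairwise disjoint and jointly cover all tuples, which is exactly the discovery property.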
However, the discoveries are not all equally good, in that some will not satisfy the second requirement. To discuss this requirement, we should first determine what probability should be associated with a description φ. If we assume that the database is a random selection of the potential clients, this is simple, since for each φ ∈ Φ, ⟨φ⟩ can be seen as the record of some Bernoulli experiment E_φ. Hence, the probability p_φ of success associated with φ is simply the fraction of successes in ⟨φ⟩, and φ has the (100 - δ)% confidence interval CI_φ = CI(|⟨φ⟩|, |⟨φ⟩_S|, δ).¹
If the database is not a random selection of the set of all potential clients, say because the number of young clients is far less than could be expected, the above observation is only true for the "real" φ_i and the logical combinations thereof. This is, however, unimportant for the purposes of this paper.
The second requirement in turn consists of two sub-requirements. The first is that two values that satisfy different elements of the discovery have different chances of success; that is, the different elements are distinct. The second states that two values that satisfy the same element of the discovery have the same chance of success; that is, the elements are homogeneous.
Using the (100 - δ)% confidence intervals, we can test distinction. For if CI_{φ_1} ∩ CI_{φ_2} = ∅, we know that for all v, w ∈ D_1 × ... × D_n with φ_1(v) ∧ φ_2(w), P(p(v) = p(w)) ≤ δ, provided both φ_i are homogeneous.

So, provided we can test homogeneity of descriptions, we can for some fixed δ discard all those discoveries that contain at least two elements that cannot be distinguished with a confidence of at least (100 - δ)%.
¹ In this paper we assume that all descriptions we consider are sufficiently large, such that reasonably accurate estimates of the p_φ can be found. See also footnotes and remarks later in this paper.
Can we test homogeneity of descriptions? If a description φ is not homogeneous, ⟨φ⟩ contains elements v and w such that p(v) ≠ p(w). More precisely, if φ is not homogeneous, there exists a cover C = {C_1, ..., C_l} of disjoint subsets of ‖φ‖ such that:

    \forall v_1, v_2 \in \|\varphi\| : [p(v_1) = p(v_2)] \leftrightarrow [\forall i \in \{1, \ldots, l\} : v_1 \in C_i \leftrightarrow v_2 \in C_i]

But this is exactly the question we are trying to answer!

So, let us first try to answer the question when the empty description, i.e., ⊤, is homogeneous. Clearly, if every subset of the database has the same associated probability of success, then the database is homogeneous. However, this is obviously impossible. In fact, the database is the record of a weighted Bernoulli experiment.
the database is the record of a weighted Bernouilly ezperiment.
Let gi denote the relative size of group Gi in the universe of potential clients. Under the
assumption that the actual clients are a random subset of the potential clients the database
is the recording of a Bernouilly experiment E with its probability on success PE given by:
i--1
Notethatif JR)is large,
then,as onewouldexpect:
PE --- ~1 g,PG, ~ ~-1 ~7~’JP~, ~ ~-Z ~.J x ~ = the number oi~uccesses
in R
Hence, if we consider all possible subsets of the database, the numberof successes in the
subsets of a given size follow a binomial distribution. In other words, if weinspect all possible
subsets we can conclude nothing.
However, we are not interested in all possible subsets, but only in those subsets we can describe. Since Φ is sparse, one would expect that all descriptions have more or less the same associated probability. If we think that ⊤ is homogeneous and it turns out that two descriptions have distinct associated probabilities, then we are surprised; we would not expect this to happen. If female clients appear to drive much more safely than male clients, we are no longer sure that all clients have the same associated probability. That is, we are no longer convinced that ⊤ is homogeneous.

In other words, it is plausible that a description is homogeneous if all its covers have the same associated probability. This can be stated more succinctly in the slogan: homogeneous descriptions contain no surprises.
4. HOMOGENEOUS DISCOVERIES

To simplify the discussion, define a set {φ_1, ..., φ_k} of descriptions to be a discovery for a description ψ, if:

1. ∀ i, j ∈ {1, ..., k} : i ≠ j → [φ_i ∧ φ_j → ⊥];
2. each φ_i ∈ Φ, and φ_1 ∨ ... ∨ φ_k ↔ ψ.

If we assume that ⊤ ∈ Φ, the discoveries of the previous sections are discoveries for ⊤; we will continue to use this abbreviation.
If ψ is a homogeneous description, so is ψ ∧ φ. In fact, p_ψ and p_{ψ∧φ} should then be almost identical. In the previous section, we have seen that p_ψ and p_{ψ∧φ} are identical with a probability of at most δ if CI_ψ and CI_{ψ∧φ} are disjoint. Therefore, we define a description φ to be a (100 - δ)% surprising description for a description ψ if CI_ψ and CI_{ψ∧φ} are disjoint.

For example, if CI_⊤ = [0.19, 0.21] and CI_{gender = male} = [0.25, 0.27], then gender = male is a 95% surprise for ⊤.
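The surprise test itself only compares two confidence intervals. A minimal sketch, assuming a helper ci(rows, delta) (for instance the binom_ci sketch of Section 2.1, applied to the row counts) that returns an interval for the success probability of a set of rows:

    def is_surprise(phi, psi, rows, delta, ci):
        # phi is a (100 - delta)% surprise for psi iff the confidence intervals
        # of psi and of psi & phi are disjoint.
        psi_rows = [t for t in rows if psi(t)]
        both_rows = [t for t in psi_rows if phi(t)]
        lo1, hi1 = ci(psi_rows, delta)
        lo2, hi2 = ci(both_rows, delta)
        return hi1 < lo2 or hi2 < lo1   # the intervals do not overlap

For the example above, the intervals [0.19, 0.21] and [0.25, 0.27] do not overlap, so the test fires.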
Since discoveries can contain many elements, it is not impossible that a discovery for a homogeneous description ψ contains some surprising descriptions for ψ. To quantify the number of surprising descriptions in a discovery that is itself surprising, note that a description has a chance δ of being surprising for ψ. That is, being a surprise is a Bernoulli experiment with probability δ. We would be surprised if the number of successes in this experiment is high.

For a Bernoulli experiment with probability p of success, define Lwb(n, p, δ) to be the lowest integer k such that P(#S ≥ k | n, p) ≤ δ. For a description ψ, define Lwb(ψ, δ) = Lwb(|⟨ψ⟩|, p_ψ, δ).
Hence, we define a discovery {φ_1, ..., φ_l} for a rule ψ to be a (100 - δ)% surprising discovery for ψ if:

    |\{\, i \in \{1, \ldots, l\} \mid \varphi_i \text{ is a } (100-\delta)\% \text{ surprising description for } \psi \,\}| \ge Lwb(\psi, \delta)

A description for which there are no (100 - δ)% surprising discoveries is called a (100 - δ)% homogeneous description².
One might wonder why we bother with Bernoulli experiments for covers, since Φ is assumed to be sparse. However, even if Φ is sparse, it can still happen that Φ distinguishes all values of an attribute with a large domain, say age. For such a dense cover, one can expect some surprises. However, only if such surprises are considerable, say all clients younger than 25, does it count as a real surprise.

Note that if Φ is never dense, such as in our example, one surprise in a cover is enough to make the cover surprising.

Let ψ be a description; a discovery {φ_1, ..., φ_l} for ψ is a (100 - δ)% homogeneous discovery for ψ iff all of the φ_i are (100 - δ)% homogeneous descriptions.
5. SPLIT (100-δ)% HOMOGENEOUS DISCOVERIES ARE UNIQUE

Clearly, the real discovery we are searching for will be (100 - δ)% homogeneous. But are all (100 - δ)% homogeneous discoveries plausible candidate solutions? If there are many (100 - δ)% homogeneous discoveries, one might wonder which of these discoveries is the most plausible answer. In this section, we show that all split discoveries are more or less the same.

A (100 - δ)% homogeneous discovery {φ_1, ..., φ_l} is split iff

    \forall \psi \in \Phi \; \forall i, j \in \{1, \ldots, l\} : i \neq j \rightarrow [\, (|\langle \varphi_i \wedge \psi \rangle| \gg 0 \;\wedge\; |\langle \varphi_j \wedge \psi \rangle| \gg 0) \rightarrow CI_{\varphi_i \wedge \psi} \cap CI_{\varphi_j \wedge \psi} = \emptyset \,]
² If ψ is a description for which |⟨ψ⟩| is already so small that for almost all non-contradictory φ, |⟨ψ ∧ φ⟩| is too small for an accurate estimation of p_{ψ∧φ}, then ψ is not considered to be homogeneous.
In other words, a discovery is split if its descriptions are distinct on all but trivial subsets.

Let {φ_1, ..., φ_k} and {ψ_1, ..., ψ_l} be two split (100 - δ)% homogeneous discoveries. Then, since both k and l are assumed to be small (since the number of groups is assumed to be small), there is for each φ_i at least one ψ_j such that |⟨φ_i ∧ ψ_j⟩| ≫ 0. But since both discoveries are homogeneous and split, there can be at most one such ψ_j. So, ⟨φ_i⟩ and ⟨ψ_j⟩ essentially coincide, and CI_{φ_i} and CI_{ψ_j} overlap.
Define a pre-order on the (100-δ)% homogeneous discoveries by {φ_1, ..., φ_k} ⪯ {ψ_1, ..., ψ_l} iff there exists an injective mapping f : {1, ..., k} → {1, ..., l} such that CI_{φ_i} ⊆ CI_{ψ_{f(i)}} for all i. Then, by the observation above, we can identify all split (100 - δ)% homogeneous discoveries and turn this relation into a partial order.
Clearly, the split (100 - δ)% homogeneous discoveries are minimal with regard to this order. Note, however, that they need not exist. Hence, we cannot identify the split (100 - δ)% homogeneous discoveries with the minimal discoveries.

Of all the (100 - δ)% homogeneous discoveries, the minimal discoveries are preferable, e.g., by the minimal description length principle. If a split discovery exists, it is the most plausible of all minimal discoveries, since it makes the sharpest distinction between its descriptions. In other words, it is the discovery with the highest information gain.
The fact that the split discoveries are not unique is simply caused by the fact that a set of tuples can have more than one description. For example, it could happen that almost all young clients are male and vice versa. In that case the descriptions age = young and gender = male are equally good from a theoretical point of view, but not necessarily from a practical point of view. For it is very well possible that the description age = young makes sense to a domain expert while gender = male does not. Hence, both options should be presented to the domain expert.
6. FINDING MINIMAL HOMOGENEOUS DISCOVERIES

In the first subsection we show that finding minimal homogeneous discoveries reduces to the maximisation of a function. In the second we briefly discuss various maximisation algorithms.
6.1 Maximisation

Define the partial order ≺_δ on the set of (100 - δ)% homogeneous descriptions as follows:

- φ ≺_δ ψ iff CI_φ lies entirely below CI_ψ, i.e., ψ has a significantly higher associated probability;
- if neither φ ≺_δ ψ nor ψ ≺_δ φ holds, then we say that φ ≈_δ ψ.

From the definition of (100-δ)% homogeneous discoveries it is immediately clear that such a discovery necessarily contains a description φ such that for all (100 - δ)% homogeneous descriptions ψ, either ψ ≺_δ φ or φ ≈_δ ψ.
Minimal (100-δ)% homogeneous discoveries differ from the ordinary ones in that their "maximal" element φ also has a maximal |⟨φ⟩|. Therefore, define the following pre-order ⊑_δ on the (100 - δ)% homogeneous descriptions: φ ⊑_δ ψ if:

1. either φ ≺_δ ψ;
2. or φ ≈_δ ψ and |⟨φ⟩| ≤ |⟨ψ⟩|.

Then minimal (100-δ)% homogeneous discoveries contain an element φ such that for all (100-δ)% homogeneous descriptions ψ, ψ ⊑_δ φ.
From this observation, we see that the following pseudo-algorithm yields a minimal (100 - δ)% homogeneous cover:

1. Make a list of (100 - δ)% homogeneous descriptions as follows:
   (a) find a φ that is maximal with regard to ⊑_δ for the current R;
   (b) remove ⟨φ⟩ from R and add φ to the list;
   (c) continue with this process until:
       - either ⊤ is homogeneous on the remainder of R;
       - or R is too small for further accurate estimation of probabilities. In this case make the list empty.
2. If the list is non-empty, turn it into a potential discovery as indicated at the beginning of Section 3. If the list is empty, return nothing.
If we are returned a potential discovery, it is a minimal (100 - δ)% homogeneous discovery by virtue of our remarks above. If the algorithm fails, no solution exists in Φ. Given such a minimal discovery, it is straightforward to check whether it is split.

As an aside, note that the list returned by the pseudo-algorithm is similar to a decision list [10].
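A sketch of this pseudo-algorithm as a greedy loop. Everything here is illustrative: find_maximal stands for the search procedure of Section 6.2, top_is_homogeneous for a homogeneity test of ⊤ on the remaining rows, and min_rows for the point at which probability estimates are no longer considered accurate.

    def minimal_homogeneous_cover(rows, delta, find_maximal, top_is_homogeneous,
                                  min_rows=30):
        # Repeatedly take a description that is maximal w.r.t. the pre-order,
        # peel its tuples off R, and stop when the remainder is homogeneous
        # (success) or too small for accurate estimates (failure).
        remaining = list(rows)
        found = []
        while not top_is_homogeneous(remaining, delta):
            if len(remaining) < min_rows:
                return None                       # no solution found in Phi
            phi = find_maximal(remaining, delta)
            found.append(phi)
            remaining = [t for t in remaining if not phi(t)]
        return found   # still to be turned into a discovery as in Section 3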
6.2 Finding maximal descriptions

To turn the pseudo-algorithm of the previous subsection into an algorithm, it is sufficient to give a procedure that returns a maximal (100-δ)% homogeneous description φ. For, one can check whether ⊤ is homogeneous by checking whether φ and ⊤ have the same associated chance. Finding such a maximal (100 - δ)% homogeneous description φ is a simple application of maximising a function. In the first phase of the search this function is the associated probability of a rule, and in the second phase it is the cover of the rule (while retaining the maximal probability found).
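As an illustration of the first phase, a generic hill-climbing maximiser of a quality function (our own sketch; the paper ultimately advocates a genetic search instead, and the second phase would maximise |⟨φ⟩| while retaining the probability found here):

    import random

    def hill_climb(start, neighbours, quality, steps=1000):
        # Generic maximisation of a quality function by randomised hill-climbing:
        # repeatedly propose a neighbouring description and keep it if it scores
        # higher.  For phase one, quality(phi) would be the estimated p_phi.
        current, best = start, quality(start)
        for _ in range(steps):
            candidates = neighbours(current)
            if not candidates:
                break
            cand = random.choice(candidates)
            q = quality(cand)
            if q > best:
                current, best = cand, q
        return current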
This problem is well documented in the machine learning literature. Older solutions are ID3, AQ15 and CN2 [9, 6, 2]. More modern, non-deterministic approaches use genetic algorithms or simulated annealing. See [4] for an overview of some of these systems.
For our problem, a non-deterministic approach is best suited. For, there can be many different (100 - δ)% homogeneous discoveries that describe the same sets but with different descriptions. Some of these descriptions will be pure coincidence, while others have a sensible interpretation. If we use a deterministic algorithm, the chance exists that we miss the sensible descriptions completely; this would render the result virtually useless.
Of the two predominant non-deterministic search methods, the genetic approach seems to offer the best possibilities for our problem since, as described in [5], its genetic operators can adapt the proven heuristics of the older deterministic algorithms.

In this particular case, we think in fact of two sets of operators: one for the first phase and another for the second. In the first phase we search for a maximal probability, so one can think of a hill-climber type of approach. In the second phase generalising a description is far more important. Hence, we need a different set of operators.

Note that it is not our intention to turn genetic programming into something deterministic; rather, it is an attempt to lift genetic programming to the level of symbolic inference.
7. CONCLUSIONS AND FUTURE RESEARCH

The main result of this paper is that we have shown that a new class of problems, viz., finding risk-profiles, can be reduced to a well-known class of problems, viz., maximisation problems. These results have been achieved by introducing the notions of a surprise and of homogeneity. In particular the slogan "homogeneous discoveries contain no surprises" has proven to be fruitful.

The important implication of these results is that we can build efficient algorithms to solve this new class of real-world problems. In fact, at the moment we are building and testing a system based on the ideas presented in this paper for an insurance company that wants to find such risk-profiles, for obvious commercial reasons.

Next to the choice of an adequate maximisation algorithm as discussed in Section 6, the other main problem is the choice of a good description language. As we have seen in this paper, this language should not be too rich. Obviously, it should not be too poor either, if the profiles are to be expressible in the language. The design of a language that meets these two conflicting requirements is currently one of our main topics.
7.1 Related work

As already indicated in the introduction, as far as the author is aware, inferring risk-profiles is a new problem area in data mining research. It is probably most closely connected to diagnostic problem solving as reported in [7, 8]. A diagnostic problem is a problem in which one is given a set of symptoms and must explain why they are present.

The authors of [7, 8] introduce a solution that integrates symbolic causal inference with numeric probabilistic inference. A crucial point in this integration is the use of a-priori probabilities. The authors imply that these probabilities should be supplied by an expert. However, using the technique presented in this paper, these a-priori probabilities can be derived from a database.
ACKNOWLEDGEMENTS

Marcel Holsheimer has been the sparring partner in many stimulating discussions. Moreover, he pointed out an error in a previous version of this paper. My heartfelt thanks.

Also thanks to the anonymous referee who pointed out many improvements, and who brought references [7] and [8] to my attention.
REFERENCES

1. Rakesh Agrawal, Tomasz Imielinski, and Arun Swami. Database mining: a performance perspective. IEEE Transactions on Knowledge and Data Engineering, Vol. 5, No. 6, pages 914-925, December 1993.

2. Peter Clark and Tim Niblett. The CN2 induction algorithm. Machine Learning, 3:261-283, 1989.

3. William Feller. An Introduction to Probability Theory and Its Applications, Vol. 1. Wiley, 1950.

4. Marcel Holsheimer and Arno P.J.M. Siebes. Data mining: the search for knowledge in databases. Technical Report CS-R9406, CWI, January 1994.

5. Zbigniew Michalewicz. Genetic Algorithms + Data Structures = Evolution Programs. Artificial Intelligence. Springer-Verlag, 1993.

6. Ryszard S. Michalski, Igor Mozetic, Jiarong Hong, and Nada Lavrac. The multi-purpose incremental learning system AQ15 and its testing application to three medical domains. In Proceedings of the 5th National Conference on Artificial Intelligence, pages 1041-1045, Philadelphia, 1986.

7. Yun Peng and James A. Reggia. A probabilistic causal model for diagnostic problem solving - part I: Integrating symbolic causal inference with numeric probabilistic inference. IEEE Transactions on Systems, Man, and Cybernetics, Vol. 17, No. 2, pages 146-162, March/April 1987.

8. Yun Peng and James A. Reggia. A probabilistic causal model for diagnostic problem solving - part II: Diagnostic strategy. IEEE Transactions on Systems, Man, and Cybernetics, Vol. 17, No. 3, pages 395-406, May/June 1987.

9. J. Ross Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.

10. Ronald L. Rivest. Learning decision lists. Machine Learning, 2:229-246, 1987.