Database: Mining ,m the Architecture of a ... : /Pre,processor for State-Aware Query ~.~ Optimization

From: AAAI Technical Report WS-94-03. Compilation copyright © 1994, AAAI (www.aaai.org). All rights reserved.
Database: Mining,m the Architecture of a Semantic
: /Pre,processor for State-AwareQuery
~.~ Optimization
¯ Smabjot S. Anm~tl~vid A. Bell
"i)¯ Department
oflnformation
Systems
John G. Hughes
i i Fac~tty
of~tomu~s
~, Universi~ of Ulster
(Jmdanstown)
Northern
Ireland
’
F.,-Mafl:
ssanand,dbell~gh~uk.ac.ulster.ujvax
Abstract
: " . . Most sppl/~ of in database mining at present seemto, be ~centmled~d the discovery of knowledgefor
" -. . app~on~eriented
~onsupport.
In thislmPer We descn~’a
~I for.enlmncing me performance
of the
¯ "~ .:/ ’ " ~ ~entsy.~ itselfusing~anti~captured.by ~~g techniques.
We investigate
the use
.::
of~
mtningin
~portlng
State-aware
qU~
opdmizaace
where
improved
performance
is
sought
by
¯ d~tu~,Mlytlgl~_~ngthe u~on of available ges~.
¯.
Weconzkler, in pertic.n!m’, theuse of seman,tic knowledge
in enlmncingthe systemperformanceand the u~on
. .:-i of ~ hardware
such as l~’s SC~by. ,senq~ntJc
qUe~y,~fonn*1!~fio~The
~ is to discover
semantic
fmfotinatlon
about.th.e’data
stored
in~ed,tA~.using
.~mining tec~ues
~ useitto reformulate
queries
i:
.....
use
~m,~tabteresoui’CeS
atme:mn.e
ofexecufion
oftheq~,:Th~/the
execution
strategy
of the
~:. m ~ better
"
~ is~ Onth.ecunent
staie
ofthesystem
.(and
so’stat~’nware~).
~ ach/eve
tldswecreate
a rewr/te
system,
¯~_,~tng
ofrules
and
equations
mined
~mthe
,~,,Jh_~,
that
rewrites
q~es
made
tothe
database
for
state"awm’#
exlg;lt/on.
¯
Wepro.vide m architecture for this querypre-processoranddiscuss performance
comddeml~ns
involvedin its
kup~on.
Keywmds
: Database Mining, SemanticQueryReformulaann,Rewrite Systems, Associative Memories,
Specialized ~ Hardware. Resource UaU,~am.
1. Introduction
.Developersof !arge database applications are often f~ed with performanceboulen~.andam~,qui~l mdevelop
i~/~,,.~,ov~,~
m~to,~s
~em.
x.~s~
we~m~
=~h~.ofaquery
~-pro~.or
empioys~ mining.lec!miques
m help
distribute
me query
load ovm"
meavailable resources
(’mte-awm’~’
¯ : ’-.query
optimizat/on)reducingproblemsof resource boundeAness.
Databasemininguw.Jmiqum
can be employedfor achievingstate-aware QueryOpttml-mlonfor three separate
iwpom
¯ Dme~on
of~om~
beingun~k=-utnised
i " , ~m’muktion
of heuristics affecting the utilisation
oftheresomce
¯ Dim:overy
ofndesfrom,~,,, require, to refonn,,~-,~ queries m increase u "ulismJonof meresomce
Fog.thefirsttwopmposes,
Le.thedetection
of under-u~
~ andme formulation
of heurist/cs
about
the
~m of the re~,~ mining algori~sne~l to be ~ for detecting aaaodation patterna in dn,~
about queries executed, their ~ ,~At~dcs and their execution plans.
KDD-94
AAAI-94 Workshop on Knowledge Discovery
in Databases
Page 287
¯
t
wemmua~
in ~ p,x~onthea=dp~,weinVesaSa~
~e po~ble
mR
of ,a~mb~..
mU~g
to d~,ov=
rules fromdata required tO reformulatequeries, This is doneby using the STRIPulgorsdun[ANAN93b].
The
STRIP
alg .mimm
canbeemployed
in twoaiff~ntways
D~scovery
of allruies implicitly presentin the database,independently
of whatis requiredfor query
¯
~’
reforll~n!ndolrl
¯,.’. * Goal ~ dlscovery of rules
In seclton 5 wedescribe howthe S~miningalgorithmcan be employedfor the discovery of rules independently
of mequ~ret’m~n,_,!.,~cm
wUi~
Secu0n
6 describesme
roleof STUn’
in megoul~iimc~
dlscov~of nae~
¯ , Themoelvmlon
for ~ work~nted~ this paper~ fromperformance
re~ workun.dermken
bythe anmors
: a meNo,~
~..u~ ~u~nsm~ua~
pro;~y~mcm
damb~
sys~m
is oneof me
e cram).TheNUm
largeit relational database systems under .devel~t ~ E~I ~ ~e/is aPProximately 20 GBin size,
~buttd over five regions ofthe h0nsingex~utive.It ~ an expec~ workload of 3700on,line transactions per
: regIOmper day and ~ on.line uitem, his implementedm 11 ICL’s DRS6000mac~,~g INORES/Star.
The slze~of meNmB.a,,.._~ase appncmlon
~ medevel~,wima umber0fl~rommnce
problemsand
-... ~.u~a~or~nmova~.
ve soluaonswas~ ~ l~mme~mms
un~aninv~saaon
of the xns.~
::.! .....~h Acce|m~r developed by !~ in c~~onwithl ~i ~. [INOR]~ ~ehean of the Ingres Search
’if: i " ...Ac~~ is SCAPS;the ~nt g~on Of I~ s C0n~ Addre~l¢ File $~ (CAFS) t’BABB79,
’:" ’" ~
BABBg.5,COUL72,LEUN85].
SCAFS
tsa ~nd filter 4m~basemachine ~ can perform associative searches
" ". Ou~Jadonai~ tables. The advanmSes’of the Ingres ~h Ac~lm’m~are:
" " * i .it off-!~ thocPUwhichWouldoth~ be using most of its fimein trivial comparisons
¯ + it reducesthe I/O traffic as only:thosetuples that satisfy the queryarepassedinto mainmemory
.. ~ ~-_~stuaimpeaormed
on sc,~s [ANAN~
93a] for the Nm~
a~caaonshowed
mat me~ngresQueryOpamizer
¯ (1~). ufili~s
SCAFStO. a~ ~um. Tbe~clusion W~~in
l ~sentslme S CAFS ,w~uld not b e o f
m makeSCAFS
: much~p mfldsapplir~on. This ledmase~.h for meth~~could
’usable [ANAN93]
for the
.... NI~ appnuuion in pm~ular, andlarge~ appli~ wire high ~don taxes in genmL
. Thestudy of SC.3kFS[ANAN93a]
resulted in a set of heuristics ~ Could~ct whenthe IQOwoulddecide on
:. using S~AFSand whenit would decide against it. ~ suggested ~wayforward was to design a ’Query
" Pre~so~ fi~t could Iransform the query input into a form that Wouldmakethe IQOdecide to use SCAFS.
Refcmulm!on
of queries using knowledgehasbeenstudied in ~ detail in the topic of SemanticQuery
¯ : from S~ because it Ibm a diffezeut ~,W~eSQO~ ~edat ~g the cost of query proce~.g, the
mfic q~ry pre-~ssor mmsformsqueries m ~e be~ use of the#v~le resources. For example, if the
--~ . CPUis v~:bnsy,reformuladnga querym~it ~ SCAFS
nm~:l~.~beneficial f.~mesystemas a whole
, "eve,n if the ~ timefor this querywouldbe~r tlian the re .s~.n~.. time of ~ Origlnulquery.In this sense
-: . mopa~,.-~on
is ’m-awm’,.So~
ovmm
w~execua~i~
e~ a~thou~
the response
ameof me
.-.::~: nd’o~,,,almedquery maybegreaU~r
I ~dmtfor :meori~ query, ~us, ~e 8oulof ~que~ypre-processor is to
/:-~epthnlN me ufll!zsflon lind throughput peffmice indices ~I]~ ~ the response time which is
. : .. of~n~ n~ of ~tereSt in I~fm ~.Fora ~m intensive applie~ion like that of me housing
" execu~ve,opinionsmemmughput
and~~ofresourcesare of primaryimpmance.
TI~. reles
~ ~mn dam baseS .using ~ mi~8 [AN.a, JqgSb,PL~T91,1PIAT93,AGRA93a
, AGI~,93b]
ca be¢~ mfcmnrides ~ a mvri~systemC1~901.In our ~ the, u~nsof ~ rewritesystem would
be the semr~ ~dons ~ the q~ (see :secaon 3for It de~~uss!on
on ~e !~write
system used in our query
pm.~). ~ang mms~mi0n
~s in Seanana~ ~ Opamizaxiouas rewrite rules has been
ITm limk ld~ otto smoctmt,m ,mm~ ts m spp/y sumelmrdws~ sm~ 1o~ dtn~lym ~ ~ m a k ~ ~ ~, ~ ~ ~ d~ ~ ~
M’dm
IMml m I~ mmsmim~sm~u d~, dmmml m dmfims* rumpuS, d~m~by no* anly mddc~ ~ ~ ~ ~ ~g ~e ~ ~
~
to Im,--.~a maby tim lmmUw.
Page 288
AAAI.94 Workshopon KnowledgeDiscovery in Databases
KDD-94
¯ ~ byO,~t.~var~y
et.aL
[CHAgS4].
~write~mshave~o bee, u~in the contextof representing
2
~ "" ~ tra~formations in Rule Based Qu’ery Op~on[OR.AE87]. The Semantic Query Preoprocessor
"~ (SQP)regards the rules discove~dfrom the databa~ as rules in a rewrite ~sWm.
rest of the paperts in the followingformat..Fu~,
wehavetworeviewsectionsfor oursystem
q~plic~on.
In
~- ./i I~ .x, I~fl~w gjve a ~ewof workbe, ing done in/semi!(:°q~ryop~on,In partlc-lnr wefocus on
"iii~}:i~~m
e
I
~sgestedby =-~L,~hers
t. SO3.~0"13reviews~w,~~S ~d inu~e.~
the re.re system
(~ k~.wledge to g~d~ tile ~h for ~owledgeby ~ mdning a!g(~’thm, for que~ reform.,1,tion.
~w.ehe~=ethe ~ of ~mesperrorm~
ear, ero. SC~Sb~
the ~thm~XNg~a].Theseheuristics
- m em~-re~, ~ to ~ w~ ~ow. ~cs in ~. 2: Se~ ~ ~bes the~o~re of the semantic
¯,:.:
p=-p
.or,
inau a o,howSoaU,e d r.overy
ofrate,
’ ’:m!t~databaSe
mining
techniquescan
beused
inthe
query
pre-proceum,.
2. SemanticQueryOptimization
¯.- . King[KING79.,KINOSI] and, independently, ~er et. aL ~80] ~troduced a technique of query
:’.: ’i" opdm!-Jdoa that used knowledge aboutthe ~f~ in the ~, i.e. do~ ~ndent strategies, for query
: ~g. This techn!que came to be ~bwn. Semantic Qu~ ~liml-e~on ($(~))i S(~K:) uses knowledge
: : ~ the ~;e Stored in the ~ Wmmsform
queries into Other queries ~are sem~tically equivalent i.e. ~,At
/, " ~ thesmeamW~".the ofi~4nal qu~esbuthaveslower exec~onCosL
The ~owledgeon which the
i" Ya el. al [NU891Suggasted the ~ of Dy~cCo~aint$. These ate ~ ~ frequently satisfied by the
::i-. &.,~but ~ not always be true. These constraints would n~ to be ~ whenupdates are madeto the
-~i ~ as~~huemayafro-tthe ~ruth-value’
of the contr.Yh~se
seman~¢
Query
optimize,havethe
" advantageof being relatively easy to change.Newdomainfactstrotbe easily addedas can newoptlm;-Jtlon
¯
’
~Biq~,
¯ . ge~: Query ~on uses two ~ of knowledge:
’:¯ .Semantic Integri~. . and ~c Constraints or Trons/ormationKnowledge,used to Iransformgiven queries
s Heuristics’from ConventionalQueryOpl~t-~aonor Domain£~owledgethat help to guide the transformations.
¯
~ [KINGS
l] Wovides
a mtof heuristics for controllingthe uansformalions
of queries. Theheuristics are sub.. ~livtaed
intotwoSnmps:
¯ " s Hetwisdcsfor.considering a transformationand
¯ "~ for rejecting a uansformaeon
2.1 Heuristicsfor considering
transformations
¯ . ’ For a gtvea query there are p0m’blya ~ numberof ca~. ’date semantictransformationsavailable
¯ . in therewdto~ste~ Thus,.a me, of ~~g the numberof candJ_d~t_e semantic
:~~On,is req~~ ~cs pm~ in ~,_’~ ~no.. ~ a ~f~ senwe¢
....
t=mfemmiOn
i.e.
atransfonnat/on
that
would
reduce
the
rest
ofexecution
ofthe
query.
¯ 2.1,1 ladex la~ : Ira table involved in the query is indexedon an attribute O~t is not
~~.".possible
i.e;
ifittankeep
semantic
equivalence.
KDD-94
AAAI-94Workshopon KnowledgeDiscovery in Databases
Page289
example,co~ider the followingquery
, ~electsurnmne,~dd,
ess
....
from employees ..... Q1
¯ where income > I0000
:- $~ the table employeesis indexed on the field empnoand we have the following
’~". Tran~ormatlonKnowledge:
. : ¥1ncome
¯ 9000thenemp_nO
> 300.
:, T~ ~ my be ~uf~ned m
select surname,address
~ from employees..... Q2
where income > 10000
¯ aademp_no>300
~, .Tt~ querycan nowu~ll,e the indexon crop_no
andthereforehasa lowerexecutioncost.
IPl~
2.1.2Joinlntngin~lon:
If a relation
notinvolved
in theque~hasaclusteringlinks intoa much
larger
" : " . file thatisinvolved
in thequery,
therelation
should
beadded
to theque~
eventhough
this
. ~--..
iintzoduces
anextrajoin.
" ’ l~0¢example,cormiderthe following SQLquery
select surname,address
....."
" from employees ..... Q3
~ whereincome
¯ 10000
¯Ifthe~re ~ a clusu~’ing!ink fromtableemperksinto table employseaand emperksis muchsmaller
tlum employees,~c followingquerywill havea lower executioncosteventhougha join has been
¯
¯
:
select employees.vurname
~mployees~Idress
.from employees,emperks
Where employees.income > 10000 .... Q4
and employees:crop_no
= emperks.emp_no
Z!3Scan~tion : If one of the tab.los involvedin a join is unrestricted whilethe other
table
is
¯ :: ~.. in~gldy
restricted,restrictionsshouldbe added,if possible,onthe unrestrictedtable.
¯
" Pot’ example,Considerthe SQLquery
" ff
l select eaurname,el.car
’ from employees e,emperks el ..... Q$
wheree.income ¯ 10000
:
l
~
e’~
1
el.crop_no
¯ Ifwe’havethefollowing Tran~ormation
Knowledge
:~ii¢~ncome
>I000o
t~n
here_allowance~0
-.Thefollowing.queryis semantic~yequivalent to the query above
~ l : ~ " . se!ecte-y.urname,el.car
.....
fromemployeese,emperks el
:,~, "w~re t.income ¯ 10000
.iese Q6
,, . and el ~heuse_allowance= 10
and e.emp_no= el.crop_no
2,1AJa[neltm.~on:If atable is involvedin a join but noneof it’s attributes axein the outputlist of
¯ i the~,tm~form~/ontoeliminate
the table fromthe query should be
.... .: FOg~mple,consider the SQLquerybelow:
" ’ l" ~ ’l~l :~ ’ l . ~ct e.vtoz~e,e.address
": i from employees
e,emperks
el
where
e.lncome
¯ 10000
.....
Q7
Page 29O
AAAI-94 Workshop on Knowledge Discovery
in Databases
KDD-94
and e.emp_noffi el.emp no
¯ ThefollowingSQLqueryis semanticallyequivalentandhas no join andtherefore has a lowerexecutioncost :
select surname,address
from employees ..... Q8
wheree.income > 10000
(Note: ThequeryQ7wouldhavea lowerexecutioncost ff there wasa clustering link betweenthe tables
employeesand emperks)
2.2 Heuristics for rejecting a transformation
Givena particularquery,the heuri~csprovided
in this sectiondefinesemantic
transformations
that
wouldnotbeuseful in the attainment
of a lowercost of executionof the query.Suchtransformations
shouldnot be appliedto the query.
2,2.1 Donot introduce uniinkedjoins
¯ Theexamplegivenin 2.1.2 wouldnot be a cost reducingtransformationif there wasno clustering
link fromemperksinto employees.
2.2.2 Donot removetables with highlyrestricted attributes whichhaveclustering links on them.In the
examplein section 2.1Aif there existed a clustering link fromemperksandemployees,the
transfmmattonfromQ7to Q8wouldincrease the executioncost and shouldtherefore not be
performed.
2.2.3 Arelation highlyrestricted on an attribute with a clusteredindexshouldnot be transformed.
select surname,address
from employees
....Q8
where income > 10000
and emp_no> 300
If wehave the following
TransformationKnowledge
¯
~fincome
>
10000
then emp_no> 300
¯
A-transformation
Of Q9to the querybelowshouldbe rejected if there is a clustering indexon
emp_no.
select surname,address
from employees ..... Q9
where income> 10000
TheSemanticQueryPre-processorproposedin this paper differs fromqueryoptimizers (both conventionaland
semantiC)in that it takes into accountthe presentstateof the runningd~_!~base
sYstem.Considera large database
application with.a large workload.Useof ConventionalandSemanticQueryOptimizerscan reducethe cost of
proceminga query, but whathappenswhenthe CPUcannot handle any morequeries or the I/O channelsare too
busy to support any newqueries ? ThaWoposed
SQPtakes these environmentallimitations into accountand using
teclmiquesborrowedfromSQOreformulatesthe queSTto bemoresuitable for processingusing specialized
lumiware.In the environment
of our application the Specializedhardwareis SCAFS.
Usingit increasingthe load
on the I/O channelsandCPUonly ma~gina!lyallowingalarger nUmber
of queries to be processedconcurrently.
TheTrart~formation
Knowledge
in the SQpiS the sameas that in tit e SQO
but the SQPdrifters in its Domain
gnow/edg~,
TheDomain
Knowledge
has the samerole, i.e. to guidethelquery transformation,but it differs dueto
the changein goal Thegoal of the SQPis notreducingthe cost of processingor reducingresponsetime but to
makethe query more suitable for execution using SCAFS
i.e. llrtakir~g
more efficient use of ~.
The Domain
Knowledge
for the SQpconsists of a set of heuristics errived at by a previousstudy of SCAFS
undertakenby the antfiors [ANAN93a].
At pre~nt workis being undertakenin the authors’ laboratory to find ways
of discoveringthese heuristics themselvesautomaticaliyusing databaseminingtechniques.
KDD-94
AAAI-94 Workshop on Knowledge Discovery
in Databases
Page 291
3. Rewrite Systems
In this section wefirst give somegeneraldefinitionsof rewrite systemsalongwith a brief reviewof researchwork
¯ in progressin this field. Wethendefine the rewrite systemusedby us for our semanticpre-processor.
RewriteSystems[DERSg0]
are a set of ~ted equations usedto c0mputethe simplest form possible for a flmnula
by ~dly replacing sub-~ of the given formulawith equal terms. The final result of an unextendible
sequenceof rule applicationsis called a norma/form.
In rewrite systemswehavea set of termscalled constructortermsbuilt usingconstructors.Constructorsare the set
of irrednc~le symbols~
Theobjecfiveof a rewrite systemistoreplace sub-termsconsisting of non-construcwr
or
¯xlector terms with constructor sub,terms, This process is ~ until a constructor term is obtained. A systemis
to’be $~y[cientlycompleteif accordingto the semanticS,everytermis equalto (i.e. can be rewrittenas) a term
built only fromconsm~tors.
Anequationis an tmordered
pair of terms{s,t} in ~ whichis denotedby s ~ t. AtermT in ~ rewrites to Yin ~,
l +4a r, if I has a sub-termthat is an instanceof oneside of an equationin Eandr is the result of replacingthat subtenn with the conesponding
instanceof the other side of the equation.
Rewriterules imposedirectionality on the use ofequationsin proofs. Arule overa set of terms, ~, is an ordered
pair el,r> of terms, whichwewrite as I ~ r. Likeeq~tlous,rules are usedto replaceinstancesof T by
’
’ ~.’ Aset
......of suchrules forma term-rewritingsystem.For
¢¢nespondinginstancesof meright-hand
side
completeness
werestate the formaldefinition of a rewrite given in [DERS90]
:
Defn: Fora giventerm-rewritesystem,R, a terms in ~ rewritesto a termt in Y, writtens -’~Rt if
s~ = 1oandt = s[rO]pfor somerule 1 ~ r in R, positionp in s andsubstitution¢7.
Asubsl/tution, o, is a special kind of replacement
operationuniquelydefinedby a mapping
fromvariables to
termsv~rinenas {xI --~ Sl,X2 --, s2..,.,Xm--~Sm}~
Atermt in ~r matchesa terms in ~ if so = t for some
substitutiono. t is said to be an instanceof s or wesay that s subsumes
t.
Thesub-term,s~ at whichthe rewritecan take place is called the redex.If no redexexists the terms, is said to be
irreducibleor inrnonnalform.
Let us nowconsider howwemightuse this techniquein our querypre-processor.Considera relation ~ with
auributeset ~ = {AI,A2,.....,An}.In accordancewith a set of heuristiCSfor pre-processing
whichwill be given
in the nextsection,the attribute set d canbe partitionedinto twosetse~and~, where~ is the set of attributes in
that c~anbe searchedUsingSCAFS
and~is thesetof attributes in Athat Cannotbesearchedusing SCAFS
(i.e. the
Ingles queryoptimizerwouldprefer not to use SCAFS
for searchingon these attributes). Therefore,for a queryto
be processedusingSCAFS-the
search condition Of thequerymustonly consist of terms involvingauributes in e.
Tetrasinvolvingattributes in. f are called constructortermswhilethoseinvolvingattributes in d are called selector
terms. Thus,the goal ofthe rewrite systemis to rewd~query.searchconditionsthat include terms involving
attributes in ~ into querysearchconditionswith termsinvolvingonlyattributes in ~.
, Therewrite systemaboveconsists of eq,~t!ons and rules. In the case of relational databases,eq-~onswould
~epresem
l~U, tial ~denu’rydependende~
whilerules woulddefine l~rtialfunc~onaldependencies.Bell [B~I J 93]
defines a functional dependency
asa statementdenotedX--~Y, wherebothX and Yare sets of attributes in
relation ~, whichholdsin ¯ if and¯onlyif, for everypalrof tuples, u andv, if u[X]= u[Y]thenv[X]= v[Y].
Also, an!dent#ydependency,denotedX t-~Y, holds whenX ~Yand Y --~ X. Wedefine a par~’alfunctional
dependency
as a functionaldependency
that is not true for the wholedomainof Xi.e. for somevaluesof Xthere is
Page 292
AAAI-94 Workshop on Knowledge Discovery
in Databases
KDD-94
a uniquevalue for Y. Associatedwith a partial functional dependency
is a dependency
measure.The dependency
measure,din, is calculatedas :
(no. of uniquevaluesof Xthat havea uniquevalueof Yassoc~ttedwith it)
MI
din
(total number
of uniquevaluesof X).
Wesay that a~da/Men~ty
dependency
exists betweenXand.Yff for certain values of X there is a uniquevalue .
Of Y andfor those values of Ythere are uniquevaluesof X, As in ~e case.of a partial functional dependency,a
partial identity dependency
has a dependency
measureassociatedvath it, whichis measuredas :
(no. of uniquevaluesof X that havea uniquevalueof Yassociatedwith i0
dm
(total number
of uniquevalues of X).
(no. of uniquevaluesof Ythat havea uniquevalueof Xassociatedwith i0
m
.(total number
of uniquevaluesof Y).
or an identity dependency,
the value of dm is 1, Thusthe dependency
.. Clearly, for a functional dependency
measureindicates as to howclose the partial functionaler identity dependency
betweenXandY is to a functional
or identity dependency
betweenthe twofields.
Theequationsand rules required for the rewrite systemcan be automaticallygeneratedusingalgorithmsdeveloped
in the field of database mininglike STRIP[ANAN93b]
and KID3[PLAT91].
Weuse the STRIPalgorithm.
4..DomainKnowledge
for the SQP
Tests carried out on SCAFS
[ANAN93a]
for the NIHEapplication revealed that the Ingles QueryOptimizerwas
not using SCAFS
Wits full potential. Themeof ~nmryand Secondaryaccess:methodsis preferred most of the
lime to’using SCAFS.
TheIngres SearchAceel"era/or only employsSCAFS
after it has decidedto scan the whole
,fable being queried without taking into’accountthe advantagesto be ~ed by using SCAFS.
Thus, wecameto
the conclusionthatthe Ingres SearchAcceleratorandthe Ingres QueryOptimizerare very loosely coupledand
somemethodneededtO be developedtobridge this gap betweenthem.
We,therefore, soughta set of heurisfi’c~ whichwouldindicate siuuuionsunderwhichSCAFS
wouldbe used for a
particular query. Theseheuristics are ~ to guide the databaseminingalgorithmin its search for knowledge
¯ whichcouldbe usedfor queryreformulation.Thestudyof SCAFS
resulted in the followingheuristics :
ILllf the searchconditionin the querycontainsa restriction on a primaryattribute of the relation SCAFS
is no
used.
is not used
H.2 If the searchconditioncontains a restriction on an attributewith a secondaryaccessmethod,SCAFS
unless the expectednumber
of tuples satisfyingthe queryis greater than approx,s%of the relation size where,
, " .S = 13.83. for hash file orga,17Atlon
¯s = 9.75 for iSAMfile organization
s = 8.74 for BTree/Lieorganization
IL3 If the searchconditioncontainsan attribute with.anassociatedaccessmethodbut the accessmethodis not
nsabledue tO the unture 0f the predicate, SCAFS
is used (e.g. a conditionof the type account_no
< 100where
the relation is hashedon account_no
cannotuse the hashfunctionfor the search).
HAIf the search condition!~, a disjunctionin it, SCAFS
is alwaysused.
KDD-94
AAAI-94 Workshop on Knowledge Discovery
in Databases
Page 293
Keepingthe aboveheuristics in mindweshaUnowdiscuss the applicability of the heuristics for semanticquery
optimizationgivenbyKing[KING81]
as listed in section 2.1.
A. Index Introduction: Clearly fromthe heuristics ILl-H.4 introducingindexesin the queryis not recommended
aS this wouldensurethe use of the indexrather than SCAFS.
Infact, rather than indexinlzoduction,for the
SQPIndex £1i~nation is recommended.
B. Join Introduction:Thereasonfor introducingnewjoins in SQOis to take advantageof indexes(see 2.1.2).
Therefore,Join Inlroductionis not useful for the SQP.
C. ScanReduction: ScanReductionis useful for the SQPas the ~eater the restrictions on the tables the greater is
¯ the advantageOf using SCAFS.The join ~on has to be performed/n~ehost machineitself and so by
.restricting relations morewemereducingtlie numberof tuples being sent to the host machine,by SCAFS,
for
thejoinopmuion.
D. Join Elimination; Join e~’mationis useful for the SQPas a lowernumberof join operationimplies less work
for thei host
n~,
ff~o!nscan be eliminatedthey shouldbe.
i
~., chine. Therefore
i
5. The SemanticQueryIre-processor and its Usage
In section 4, weenumeratedthe DomainKnowledge
that wouldbe used by the SemanticQueryPre-processor. As
can be seen fromthe heuristics abovei the SQPuses someof ~e h~tics used by SemanticQueryOptimizers.
Thus, lhe SQPperformsa cena~ amountof query 0pfim~onthough this is moreOf a by-productrather than the
maingoal. Themainobjective of theSQPiS to use SCAFS
whererequired. Thebasic idea behindthe SQPis
givenin Table.1.
if (SCAFS
is not busy)
{
if (CPUis busy)
or (I/OIraffic is heavy)
or(cost_of_query(SCAFS)
- cost_of..query(~SCAFS)
< threshold)
{
,Reformulate Query using DomainKnowledge
and Transformation Knowledge
to use SCAFS
}
else
{
Reformulate Query using DomainKnowledge
and TransformationKnowledge
to reduce the
cost of executionof Query
Table 1 : TheSemanticQueryPre-processor(SQP)
Figure1 showsthe role of the SQP.h acts as a filter betweenthe queryinput andthe conventionalquery
opemi.
The Trat~ormation
Knowledge
used by the SQPconsists of two types of knowledge
: Static (Integrity
Conslraints)andDynamic
(knowledge
about the dAtAstored in the databasethat is frequendyIrue but not always
Page 294
AAM.94 Workshop on Knowledge Discovery
in Databases
KDD-94
Irue) Knowledge.The STRIPdatabaseminingalgorith m workingin depen~dyof the query preprocessor
¯ discovers
eq,,n!!ons
andrules.thatformtheDynamic
TvanM’o~tion
KnOwledge
ofthepre.processcf
(wediscuss
°.
the Soal-dlrecteddiscoveryroie of database, miningin the nexisecdon)~
Theseequationsand rules needto be
upda.ted along with any updatesm~to the dn!~ in.the database.Toreduce this overheadof updating the Dyno~c
Xn~ledse,
mes~e of the knowledgebase maybe restricted ma ~,predef’med.sizeusing a methodsimilar to
duuU~tby Siegel eL al [SIEG92]for discoveringrules for SQO.We
define our methodin the next secaon.
The
TranM’ormation
Knowledge
formsthe rewrite systemiamt rewrites queries as describedin section 3 above.
STrong
Rule
Induction
in.Parallel
Transformation
Host
,¯¯ .
Machine
Semantic
Query
Pre- processor
(SQP)
Query
" I Conventional.
(CQo)Query
Optimizer
¯
fig. 1 : Therole of Database
Miningin the Semandc
QueryPre-processor
(SQP)
The Domain
Knowledse
usedby the pre-processorhas already beendescribedin section 4 as a set of
heuri tics usedm guide the semanticwansformafions
(Therole of the STRIPalgorithmin goal-directeddatabase
miningis discussedin the next section).
The MachineUsagelqformationis providedby the UNIX
sys~mactivity report (mr) utility whichis
itandardUNIX
performance
moni~ringutility. Theestimationof querycosts is carried out using a
methodsimilar to that given by King[KING81].
6. PerformanceConsiderations
In this section wewill discuss howthe performance
of the SQPcan be improvedi.e. howthe cost of the SQPcan be
reduced.Themainidea is to reslrict the s~e Of the l ransfonnafionknowledge
and use databaseminingtechniquesin
a goal~lirectedrule discoveryrole (see fig. 2).
Theonly variable in.the architecture of the SQPis the size of the mmsformation
knowledge.There is a ~ff
be(weenthe cost ofthe SQPandthe. completenessof the IransformAtlon
knowledge.
Theeffect of the size of the
Inmsfafmation
knowledge
on ,the c0St Of the SQPis twofold :
¯* Thelarger the size of the,transformationknowledge
base, the greater is the cost of reformulAHon
of queriesas therewould~ a larger.set Of candidatetransformationsfor anygiven query.
: ~"wThelarger the transformationknowledge
base, the greater is the cost of up0~dng
the dynamic
" ummfommtion
knowledgewhenev~updatesare
madem the database.
KDD-94
AAAI-94 Workshop on Knowledge Discovery
in Databases
Page 295
Tiros tbe cost of the SQPis directly proportionalto the size of the knowledge
baseandrestricting the size of the
knowledgebase can be benefic/al froma performanceviewpoint.
Weconcenwateonly on the ~’ansaction queries hereandignore MISqueries, as they use SCAFS
anyway.
Themmsoc:fion
queries are ~efinedand are knownat the designstage of the application. Thus,the static
. .. !reJ~f. " ormationknowledge(Integr/ty Cons~,aboknOwn
at the d~gnstage) can be applied to these queries
¯ i ~theimplementation
stage of.. the i applicationandthetwoversims
~ ~ query., stored. Asthe static transformation
knowledge
never ~ges,lthe~ copies ofthe querieswili: alwaysbe semanticallyexlu/valent. Thus,weare left to
deal only with the dynamicuandorma~on
knowledgeand a set otparfiASSyu’ansformedqueries.
NoW
supposewehaveto limit the size of the dynmnictransformationknowledge
to k rules / equationsdueto
perforiannce considerati0ns. Using the STRIP~ese miningalgo~ [ANAN93b
] we can generate from the
. databa~ as above,an imti~ randomsetofk rules / eq~ns. To~hofthe~ rules we associate an interest
¯ ime~,e, v~o~form~for ~ m~of n~s have ~n suggested in DAtAbase Mining literature
:¯ ’ [PIATgI,PIAT93]
btlt wereckonl/utt the nieasuregivenby Siegel et: al. [SiEG92]
is the mostappropriatefor the
$
~
.$ie
!g
el
.
.
ft.
¯
alsu
gg
......
esteda
m~ure
based
oh
dA~AhA~eusa
e
~
Durin
a
trandonn
" "ifaparucular
"
at/on
¯
. g p . . . g
. . .... _.......
rule is not foundin the Iransf~i~ation knowledge
base, ~ ~e is ~nsideredto be a ca~dl.do~erule and gets an
interest measureassocia~ with it as well. Whenever
the interest meas~~iated with any of the candidate
¯ rules becomes
greater, than ~heinterest measureof anyof the rules in the transformationknowledge
base, STRIPis
¯ invokedto verify tbe candidateride which,if verified, replacesthe ~e with slowerinterest measurein the
transformationknowledgebase.
STrong
Rule
Induction
in Parallel
Transformation
Mining
Algorithm
Host=
Machine
Semantic
QueryPre - processor
(saP)
Query
Conventional
QueryOptimizer
¯
(CQO)
fig. 2 : Therole of DatabaseMiningin the SQPas a goal-directedrule discoverer
[ANAN93a].
. "the fmmalafor interest measuret~ed.in the SQPisbased on the scheduling methodused by SCAPS
Inthis fmmula,apart fromdatabase?~.eepatxems,the i~Ctivity of rules present in the lnmsformation
knowledge
. ~ asW~
ascandidate
rulesaffectsthein=restmeasureoftherule. Theinterest measure
of a rule diminishes
. withtheamo=t
of timethatit hasbeen_
inactive.
~ ~emeas=~ewould
bemore
accurate
fortheSQp
thanthe
one8iveii bySiegel et. aL, as the followingexamplewill show:
Considera rule, R1, in the lransformationknowledge
base that is used200 timesa day withinan hour. Thenour
Page296
AAA1-94Workshopon KnowledgeDiscovery in Databases
KDD-94
interest measurewouldreplacethis rule by anotherrule, R2, onceRI has beeninactive for a while, as the interest
measureof R1will have ~hed. This wouldnot be the case if weuse the formulaprovidedby Siegel et. al.
Thedatabaseusagepatterns and ~tivity of rides informationis stored as domainknowledge
and formspart of the
input to the STRIP
databaseminingalgorithmchangingits role to a goal-directedrule discoveryalgorithm(see fig.
2).
7. Conclusionand Future Work
In this paper wehaveshownhowdatabaseminingcan be used in a Systemperformanceproblemas opposedto its
usual application usage. Optim’m~ienof ~ of ~ and throughput is carded out by transforming
¯ quezies into $emanti~flyequivaient queri¢~ ~t u~e remureesw~availAhili~ varies. Wehave discussed how
databaseminin8
leclmiquekferman m~gralpart of the architec!meof such a querypreprocessorin obtainingrules
for qu~ reformulation, el~er ~dependentof the goal or goal~ted. Webeh’evetl~ the use of the Semantic
Preprocessor in large, Iransection in~ve data~ management
systems promisesadditional performance
enhancement,
particularly if wehavea possibility of onder-utilizafionof somevaluableresources.
As the heuristics for ~e utilisatj’on of SC~wereknownto the antho~fi-oma previous study of SCAFS
[ANAN93a],
wehavenot used da!~e miningtechniquesfor discevefingthe heuristics in the presented work.
¯ Howeverweare currently:stud~g methodsof confinn~gthe he,tics induced during the empirical study of
SCAFS[ANAN93a]
using d~t~b~ mining techniques.
8. References
[AGRA93a]
Agrawal,
IL, Imielinski, T., Swami,A. DatabaseMining:A PerformancePerspective
¯ m~Transactionson
Knowledgeand Data Engineering, Special Issue on ~amingand
Discoveryin Knowledge
- BasedSystems,VoL5, No. 6, Dec. 1993.
[AGRA93b]
Agrawal,R., Imielinski, T., Swami,A. Mi’ningAssociationsbetweenSets of Items
¯ in Massive Databases ACMSIGMOD,Washington DC., May 1993.
JAN91] An, H. C., Henschen,L.J. Knowledge-Based Semantic QueryOptimization
Lecture Notesin AI, 199I, Vol. 542, Pg. 82-91.
[ANAN93]
$. S. Ammd,
D. A. Bell, J. G. HughesUsing SemanticKnowledgeto Enhancethe
PerformanCeof Specialized Database HardwareWorkshopof the FsTrcs’93 Conference,
¯ IIT Bombay,December,1993.
[ANAN93a
] Anand,S. S., Hughes,J. G., Bell, D.A. An Empirical Study of ICL’s SCAFS
Internal Reportof the Depamnent
of Infommtion
Systems,Universityof Ulster at
Jordenstown
[ANAN93b]Anand,.S. S,. Bell, D,A, Hughes, J.O.. An A~h to Database Mining Using
Evidential Reasoning,internal Reportof the Department
of InformationSystems,Universityof
Ulster at Jordanstown
[BABB79]Babb~dImplementinga Relational Database by meansof Specialised Hardware
ACM
transactions of databasesystems, Vol. 4, No. 1, March1979.
[BABB85]
Babb,EdCAFSfile - correlation unit ICLTechnical Journal, Nov.1985.
[BELL93] Bell, DA.FromData Properties to Evidence,Wl~E Transactions on Knowledge
and Data
Engineering,Special Issue on Learningand Discoveryin Knowledge
- BasedDatabases,VoL5, No. 6,
KDD-94
AAAI-94Workshopon KnowledgeDiscovery in Databases
Page297
Dec. 1993
AboutData Semantics
[BERT92] Bertino, E., Musto, D. QueryOptimizationby using Knowledge
Data &Knowledge
Engineering9, 1992, Pg. 121 - 155.
[CHAK84]
Chakravarthy,U., Fishman,D., ’M.inker, J. SemanticQueryOptimizationin expert
¯ ’ systems and database systems ~gs from the Ist International Workshopon Expert
¯
DatabaseSystemsed. L. KerschbergPg. 659 - 674.
[COUL72]Coularis,G.F., Evans,J.M.,M/tchelI,R.W.Towardscontent addressing in databases
TheComputer
Journal, Vol. 15, No. 2, 1972.
[’DERsg0]Dershowitz, N., Jouannaud, J.P. Rewrite Systems Formal ModelsAndSemantics
¯ ed. Jan Van Lceuwen,ELsevier, 1990.
[FERRgl] Ferrari,D., Serazzi,G., Zeigner,A. Measurement
and Tuningof ComputerSystems
Prentice-Hall,1981.
[GRAE87]Graefe, G., DeWitt, D. The EXODUS
optimizer generator Proceedings of the 1987 ACM
¯ SIGMOD
Conferenceon Management
of Data (San Francisco, May1987) Pg. 160 - 171.
[HAMMg0]
Hammer,M., Zdonik, S. B, Jr. KnowledgeBased Query Processing proc. of the 5th
International Conferenceon VeryLarge Databases,MOntreal,October1980, Pg. 137 - 147.
[INGR]TheIngres Search Accelerator Users Guide
for QueryProcessing Efficiency
[KING79] King, J. J Exploringthe u~e of DomainKnowledge
Stanford University ComputerScience DepartmentHeuristic Programming
Project
Technical Report HPP-79-30,December1979.
[KINGgl]Kingt J. J QUIST: A SystemFor SemanticQueryOptimizationIn Relational Databases
- Proc. of the 6th International Conferenceon VeryLargeDatabases,1980,Pg. 510 - 517.
[I.EUN8$]Lean8,C.H.C,Wong,K.S. File processingefficiency on the content addressablefile store
Prec. VLDB
conference, 1985.
[PIATgl] Piatetsky-Shapiro, G., Frawley, W.J (ed.) KnowledgeDiscoveryin Databases AAAI
Press / MITPress, 1991.
[PLAT93]Piatetsky-Shapim, G. (WorkshopChair) WorkingNotes of Workshopon Knowledge
Discoveryin Databases, AAAI-93,
1993.
[SIF.G92]Siegel, M., Sciore, E., Salveter, S, A Me~edFor Rule Derivation to Support
..... SePtic Query Optimization ACMTransactions on D~t~be~eSystems Vol. 17, No. 4,
1992,Pg. 563- 600.
[YU89]Yu, C., Sun, W, Automatic KnowledgeAcquisition and Maintenanceof Semantic Query
Optimization IEEETrans. Know.Data Eng. 1,3 (Sept. 1989) Pg. 362 - 375.
Page298
AAAI-94Workshopon KnowledgeDiscovery in Databases
KDD-94