From: AAAI Technical Report WS-94-03. Compilation copyright © 1994, AAAI (www.aaai.org). All rights reserved. Database: Mining,m the Architecture of a Semantic : /Pre,processor for State-AwareQuery ~.~ Optimization ¯ Smabjot S. Anm~tl~vid A. Bell "i)¯ Department oflnformation Systems John G. Hughes i i Fac~tty of~tomu~s ~, Universi~ of Ulster (Jmdanstown) Northern Ireland ’ F.,-Mafl: ssanand,dbell~gh~uk.ac.ulster.ujvax Abstract : " . . Most sppl/~ of in database mining at present seemto, be ~centmled~d the discovery of knowledgefor " -. . app~on~eriented ~onsupport. In thislmPer We descn~’a ~I for.enlmncing me performance of the ¯ "~ .:/ ’ " ~ ~entsy.~ itselfusing~anti~captured.by ~~g techniques. We investigate the use .:: of~ mtningin ~portlng State-aware qU~ opdmizaace where improved performance is sought by ¯ d~tu~,Mlytlgl~_~ngthe u~on of available ges~. ¯. Weconzkler, in pertic.n!m’, theuse of seman,tic knowledge in enlmncingthe systemperformanceand the u~on . .:-i of ~ hardware such as l~’s SC~by. ,senq~ntJc qUe~y,~fonn*1!~fio~The ~ is to discover semantic fmfotinatlon about.th.e’data stored in~ed,tA~.using .~mining tec~ues ~ useitto reformulate queries i: ..... use ~m,~tabteresoui’CeS atme:mn.e ofexecufion oftheq~,:Th~/the execution strategy of the ~:. m ~ better " ~ is~ Onth.ecunent staie ofthesystem .(and so’stat~’nware~). ~ ach/eve tldswecreate a rewr/te system, ¯~_,~tng ofrules and equations mined ~mthe ,~,,Jh_~, that rewrites q~es made tothe database for state"awm’# exlg;lt/on. ¯ Wepro.vide m architecture for this querypre-processoranddiscuss performance comddeml~ns involvedin its kup~on. Keywmds : Database Mining, SemanticQueryReformulaann,Rewrite Systems, Associative Memories, Specialized ~ Hardware. Resource UaU,~am. 1. Introduction .Developersof !arge database applications are often f~ed with performanceboulen~.andam~,qui~l mdevelop i~/~,,.~,ov~,~ m~to,~s ~em. x.~s~ we~m~ =~h~.ofaquery ~-pro~.or empioys~ mining.lec!miques m help distribute me query load ovm" meavailable resources (’mte-awm’~’ ¯ : ’-.query optimizat/on)reducingproblemsof resource boundeAness. Databasemininguw.Jmiqum can be employedfor achievingstate-aware QueryOpttml-mlonfor three separate iwpom ¯ Dme~on of~om~ beingun~k=-utnised i " , ~m’muktion of heuristics affecting the utilisation oftheresomce ¯ Dim:overy ofndesfrom,~,,, require, to refonn,,~-,~ queries m increase u "ulismJonof meresomce Fog.thefirsttwopmposes, Le.thedetection of under-u~ ~ andme formulation of heurist/cs about the ~m of the re~,~ mining algori~sne~l to be ~ for detecting aaaodation patterna in dn,~ about queries executed, their ~ ,~At~dcs and their execution plans. KDD-94 AAAI-94 Workshop on Knowledge Discovery in Databases Page 287 ¯ t wemmua~ in ~ p,x~onthea=dp~,weinVesaSa~ ~e po~ble mR of ,a~mb~.. mU~g to d~,ov= rules fromdata required tO reformulatequeries, This is doneby using the STRIPulgorsdun[ANAN93b]. The STRIP alg .mimm canbeemployed in twoaiff~ntways D~scovery of allruies implicitly presentin the database,independently of whatis requiredfor query ¯ ~’ reforll~n!ndolrl ¯,.’. * Goal ~ dlscovery of rules In seclton 5 wedescribe howthe S~miningalgorithmcan be employedfor the discovery of rules independently of mequ~ret’m~n,_,!.,~cm wUi~ Secu0n 6 describesme roleof STUn’ in megoul~iimc~ dlscov~of nae~ ¯ , Themoelvmlon for ~ work~nted~ this paper~ fromperformance re~ workun.dermken bythe anmors : a meNo,~ ~..u~ ~u~nsm~ua~ pro;~y~mcm damb~ sys~m is oneof me e cram).TheNUm largeit relational database systems under .devel~t ~ E~I ~ ~e/is aPProximately 20 GBin size, ~buttd over five regions ofthe h0nsingex~utive.It ~ an expec~ workload of 3700on,line transactions per : regIOmper day and ~ on.line uitem, his implementedm 11 ICL’s DRS6000mac~,~g INORES/Star. The slze~of meNmB.a,,.._~ase appncmlon ~ medevel~,wima umber0fl~rommnce problemsand -... ~.u~a~or~nmova~. ve soluaonswas~ ~ l~mme~mms un~aninv~saaon of the xns.~ ::.! .....~h Acce|m~r developed by !~ in c~~onwithl ~i ~. [INOR]~ ~ehean of the Ingres Search ’if: i " ...Ac~~ is SCAPS;the ~nt g~on Of I~ s C0n~ Addre~l¢ File $~ (CAFS) t’BABB79, ’:" ’" ~ BABBg.5,COUL72,LEUN85]. SCAFS tsa ~nd filter 4m~basemachine ~ can perform associative searches " ". Ou~Jadonai~ tables. The advanmSes’of the Ingres ~h Ac~lm’m~are: " " * i .it off-!~ thocPUwhichWouldoth~ be using most of its fimein trivial comparisons ¯ + it reducesthe I/O traffic as only:thosetuples that satisfy the queryarepassedinto mainmemory .. ~ ~-_~stuaimpeaormed on sc,~s [ANAN~ 93a] for the Nm~ a~caaonshowed mat me~ngresQueryOpamizer ¯ (1~). ufili~s SCAFStO. a~ ~um. Tbe~clusion W~~in l ~sentslme S CAFS ,w~uld not b e o f m makeSCAFS : much~p mfldsapplir~on. This ledmase~.h for meth~~could ’usable [ANAN93] for the .... NI~ appnuuion in pm~ular, andlarge~ appli~ wire high ~don taxes in genmL . Thestudy of SC.3kFS[ANAN93a] resulted in a set of heuristics ~ Could~ct whenthe IQOwoulddecide on :. using S~AFSand whenit would decide against it. ~ suggested ~wayforward was to design a ’Query " Pre~so~ fi~t could Iransform the query input into a form that Wouldmakethe IQOdecide to use SCAFS. Refcmulm!on of queries using knowledgehasbeenstudied in ~ detail in the topic of SemanticQuery ¯ : from S~ because it Ibm a diffezeut ~,W~eSQO~ ~edat ~g the cost of query proce~.g, the mfic q~ry pre-~ssor mmsformsqueries m ~e be~ use of the#v~le resources. For example, if the --~ . CPUis v~:bnsy,reformuladnga querym~it ~ SCAFS nm~:l~.~beneficial f.~mesystemas a whole , "eve,n if the ~ timefor this querywouldbe~r tlian the re .s~.n~.. time of ~ Origlnulquery.In this sense -: . mopa~,.-~on is ’m-awm’,.So~ ovmm w~execua~i~ e~ a~thou~ the response ameof me .-.::~: nd’o~,,,almedquery maybegreaU~r I ~dmtfor :meori~ query, ~us, ~e 8oulof ~que~ypre-processor is to /:-~epthnlN me ufll!zsflon lind throughput peffmice indices ~I]~ ~ the response time which is . : .. of~n~ n~ of ~tereSt in I~fm ~.Fora ~m intensive applie~ion like that of me housing " execu~ve,opinionsmemmughput and~~ofresourcesare of primaryimpmance. TI~. reles ~ ~mn dam baseS .using ~ mi~8 [AN.a, JqgSb,PL~T91,1PIAT93,AGRA93a , AGI~,93b] ca be¢~ mfcmnrides ~ a mvri~systemC1~901.In our ~ the, u~nsof ~ rewritesystem would be the semr~ ~dons ~ the q~ (see :secaon 3for It de~~uss!on on ~e !~write system used in our query pm.~). ~ang mms~mi0n ~s in Seanana~ ~ Opamizaxiouas rewrite rules has been ITm limk ld~ otto smoctmt,m ,mm~ ts m spp/y sumelmrdws~ sm~ 1o~ dtn~lym ~ ~ m a k ~ ~ ~, ~ ~ ~ d~ ~ ~ M’dm IMml m I~ mmsmim~sm~u d~, dmmml m dmfims* rumpuS, d~m~by no* anly mddc~ ~ ~ ~ ~ ~g ~e ~ ~ ~ to Im,--.~a maby tim lmmUw. Page 288 AAAI.94 Workshopon KnowledgeDiscovery in Databases KDD-94 ¯ ~ byO,~t.~var~y et.aL [CHAgS4]. ~write~mshave~o bee, u~in the contextof representing 2 ~ "" ~ tra~formations in Rule Based Qu’ery Op~on[OR.AE87]. The Semantic Query Preoprocessor "~ (SQP)regards the rules discove~dfrom the databa~ as rules in a rewrite ~sWm. rest of the paperts in the followingformat..Fu~, wehavetworeviewsectionsfor oursystem q~plic~on. In ~- ./i I~ .x, I~fl~w gjve a ~ewof workbe, ing done in/semi!(:°q~ryop~on,In partlc-lnr wefocus on "iii~}:i~~m e I ~sgestedby =-~L,~hers t. SO3.~0"13reviews~w,~~S ~d inu~e.~ the re.re system (~ k~.wledge to g~d~ tile ~h for ~owledgeby ~ mdning a!g(~’thm, for que~ reform.,1,tion. ~w.ehe~=ethe ~ of ~mesperrorm~ ear, ero. SC~Sb~ the ~thm~XNg~a].Theseheuristics - m em~-re~, ~ to ~ w~ ~ow. ~cs in ~. 2: Se~ ~ ~bes the~o~re of the semantic ¯,:.: p=-p .or, inau a o,howSoaU,e d r.overy ofrate, ’ ’:m!t~databaSe mining techniquescan beused inthe query pre-proceum,. 2. SemanticQueryOptimization ¯.- . King[KING79.,KINOSI] and, independently, ~er et. aL ~80] ~troduced a technique of query :’.: ’i" opdm!-Jdoa that used knowledge aboutthe ~f~ in the ~, i.e. do~ ~ndent strategies, for query : ~g. This techn!que came to be ~bwn. Semantic Qu~ ~liml-e~on ($(~))i S(~K:) uses knowledge : : ~ the ~;e Stored in the ~ Wmmsform queries into Other queries ~are sem~tically equivalent i.e. ~,At /, " ~ thesmeamW~".the ofi~4nal qu~esbuthaveslower exec~onCosL The ~owledgeon which the i" Ya el. al [NU891Suggasted the ~ of Dy~cCo~aint$. These ate ~ ~ frequently satisfied by the ::i-. &.,~but ~ not always be true. These constraints would n~ to be ~ whenupdates are madeto the -~i ~ as~~huemayafro-tthe ~ruth-value’ of the contr.Yh~se seman~¢ Query optimize,havethe " advantageof being relatively easy to change.Newdomainfactstrotbe easily addedas can newoptlm;-Jtlon ¯ ’ ~Biq~, ¯ . ge~: Query ~on uses two ~ of knowledge: ’:¯ .Semantic Integri~. . and ~c Constraints or Trons/ormationKnowledge,used to Iransformgiven queries s Heuristics’from ConventionalQueryOpl~t-~aonor Domain£~owledgethat help to guide the transformations. ¯ ~ [KINGS l] Wovides a mtof heuristics for controllingthe uansformalions of queries. Theheuristics are sub.. ~livtaed intotwoSnmps: ¯ " s Hetwisdcsfor.considering a transformationand ¯ "~ for rejecting a uansformaeon 2.1 Heuristicsfor considering transformations ¯ . ’ For a gtvea query there are p0m’blya ~ numberof ca~. ’date semantictransformationsavailable ¯ . in therewdto~ste~ Thus,.a me, of ~~g the numberof candJ_d~t_e semantic :~~On,is req~~ ~cs pm~ in ~,_’~ ~no.. ~ a ~f~ senwe¢ .... t=mfemmiOn i.e. atransfonnat/on that would reduce the rest ofexecution ofthe query. ¯ 2.1,1 ladex la~ : Ira table involved in the query is indexedon an attribute O~t is not ~~.".possible i.e; ifittankeep semantic equivalence. KDD-94 AAAI-94Workshopon KnowledgeDiscovery in Databases Page289 example,co~ider the followingquery , ~electsurnmne,~dd, ess .... from employees ..... Q1 ¯ where income > I0000 :- $~ the table employeesis indexed on the field empnoand we have the following ’~". Tran~ormatlonKnowledge: . : ¥1ncome ¯ 9000thenemp_nO > 300. :, T~ ~ my be ~uf~ned m select surname,address ~ from employees..... Q2 where income > 10000 ¯ aademp_no>300 ~, .Tt~ querycan nowu~ll,e the indexon crop_no andthereforehasa lowerexecutioncost. IPl~ 2.1.2Joinlntngin~lon: If a relation notinvolved in theque~hasaclusteringlinks intoa much larger " : " . file thatisinvolved in thequery, therelation should beadded to theque~ eventhough this . ~--.. iintzoduces anextrajoin. " ’ l~0¢example,cormiderthe following SQLquery select surname,address ....." " from employees ..... Q3 ~ whereincome ¯ 10000 ¯Ifthe~re ~ a clusu~’ing!ink fromtableemperksinto table employseaand emperksis muchsmaller tlum employees,~c followingquerywill havea lower executioncosteventhougha join has been ¯ ¯ : select employees.vurname ~mployees~Idress .from employees,emperks Where employees.income > 10000 .... Q4 and employees:crop_no = emperks.emp_no Z!3Scan~tion : If one of the tab.los involvedin a join is unrestricted whilethe other table is ¯ :: ~.. in~gldy restricted,restrictionsshouldbe added,if possible,onthe unrestrictedtable. ¯ " Pot’ example,Considerthe SQLquery " ff l select eaurname,el.car ’ from employees e,emperks el ..... Q$ wheree.income ¯ 10000 : l ~ e’~ 1 el.crop_no ¯ Ifwe’havethefollowing Tran~ormation Knowledge :~ii¢~ncome >I000o t~n here_allowance~0 -.Thefollowing.queryis semantic~yequivalent to the query above ~ l : ~ " . se!ecte-y.urname,el.car ..... fromemployeese,emperks el :,~, "w~re t.income ¯ 10000 .iese Q6 ,, . and el ~heuse_allowance= 10 and e.emp_no= el.crop_no 2,1AJa[neltm.~on:If atable is involvedin a join but noneof it’s attributes axein the outputlist of ¯ i the~,tm~form~/ontoeliminate the table fromthe query should be .... .: FOg~mple,consider the SQLquerybelow: " ’ l" ~ ’l~l :~ ’ l . ~ct e.vtoz~e,e.address ": i from employees e,emperks el where e.lncome ¯ 10000 ..... Q7 Page 29O AAAI-94 Workshop on Knowledge Discovery in Databases KDD-94 and e.emp_noffi el.emp no ¯ ThefollowingSQLqueryis semanticallyequivalentandhas no join andtherefore has a lowerexecutioncost : select surname,address from employees ..... Q8 wheree.income > 10000 (Note: ThequeryQ7wouldhavea lowerexecutioncost ff there wasa clustering link betweenthe tables employeesand emperks) 2.2 Heuristics for rejecting a transformation Givena particularquery,the heuri~csprovided in this sectiondefinesemantic transformations that wouldnotbeuseful in the attainment of a lowercost of executionof the query.Suchtransformations shouldnot be appliedto the query. 2,2.1 Donot introduce uniinkedjoins ¯ Theexamplegivenin 2.1.2 wouldnot be a cost reducingtransformationif there wasno clustering link fromemperksinto employees. 2.2.2 Donot removetables with highlyrestricted attributes whichhaveclustering links on them.In the examplein section 2.1Aif there existed a clustering link fromemperksandemployees,the transfmmattonfromQ7to Q8wouldincrease the executioncost and shouldtherefore not be performed. 2.2.3 Arelation highlyrestricted on an attribute with a clusteredindexshouldnot be transformed. select surname,address from employees ....Q8 where income > 10000 and emp_no> 300 If wehave the following TransformationKnowledge ¯ ~fincome > 10000 then emp_no> 300 ¯ A-transformation Of Q9to the querybelowshouldbe rejected if there is a clustering indexon emp_no. select surname,address from employees ..... Q9 where income> 10000 TheSemanticQueryPre-processorproposedin this paper differs fromqueryoptimizers (both conventionaland semantiC)in that it takes into accountthe presentstateof the runningd~_!~base sYstem.Considera large database application with.a large workload.Useof ConventionalandSemanticQueryOptimizerscan reducethe cost of proceminga query, but whathappenswhenthe CPUcannot handle any morequeries or the I/O channelsare too busy to support any newqueries ? ThaWoposed SQPtakes these environmentallimitations into accountand using teclmiquesborrowedfromSQOreformulatesthe queSTto bemoresuitable for processingusing specialized lumiware.In the environment of our application the Specializedhardwareis SCAFS. Usingit increasingthe load on the I/O channelsandCPUonly ma~gina!lyallowingalarger nUmber of queries to be processedconcurrently. TheTrart~formation Knowledge in the SQpiS the sameas that in tit e SQO but the SQPdrifters in its Domain gnow/edg~, TheDomain Knowledge has the samerole, i.e. to guidethelquery transformation,but it differs dueto the changein goal Thegoal of the SQPis notreducingthe cost of processingor reducingresponsetime but to makethe query more suitable for execution using SCAFS i.e. llrtakir~g more efficient use of ~. The Domain Knowledge for the SQpconsists of a set of heuristics errived at by a previousstudy of SCAFS undertakenby the antfiors [ANAN93a]. At pre~nt workis being undertakenin the authors’ laboratory to find ways of discoveringthese heuristics themselvesautomaticaliyusing databaseminingtechniques. KDD-94 AAAI-94 Workshop on Knowledge Discovery in Databases Page 291 3. Rewrite Systems In this section wefirst give somegeneraldefinitionsof rewrite systemsalongwith a brief reviewof researchwork ¯ in progressin this field. Wethendefine the rewrite systemusedby us for our semanticpre-processor. RewriteSystems[DERSg0] are a set of ~ted equations usedto c0mputethe simplest form possible for a flmnula by ~dly replacing sub-~ of the given formulawith equal terms. The final result of an unextendible sequenceof rule applicationsis called a norma/form. In rewrite systemswehavea set of termscalled constructortermsbuilt usingconstructors.Constructorsare the set of irrednc~le symbols~ Theobjecfiveof a rewrite systemistoreplace sub-termsconsisting of non-construcwr or ¯xlector terms with constructor sub,terms, This process is ~ until a constructor term is obtained. A systemis to’be $~y[cientlycompleteif accordingto the semanticS,everytermis equalto (i.e. can be rewrittenas) a term built only fromconsm~tors. Anequationis an tmordered pair of terms{s,t} in ~ whichis denotedby s ~ t. AtermT in ~ rewrites to Yin ~, l +4a r, if I has a sub-termthat is an instanceof oneside of an equationin Eandr is the result of replacingthat subtenn with the conesponding instanceof the other side of the equation. Rewriterules imposedirectionality on the use ofequationsin proofs. Arule overa set of terms, ~, is an ordered pair el,r> of terms, whichwewrite as I ~ r. Likeeq~tlous,rules are usedto replaceinstancesof T by ’ ’ ~.’ Aset ......of suchrules forma term-rewritingsystem.For ¢¢nespondinginstancesof meright-hand side completeness werestate the formaldefinition of a rewrite given in [DERS90] : Defn: Fora giventerm-rewritesystem,R, a terms in ~ rewritesto a termt in Y, writtens -’~Rt if s~ = 1oandt = s[rO]pfor somerule 1 ~ r in R, positionp in s andsubstitution¢7. Asubsl/tution, o, is a special kind of replacement operationuniquelydefinedby a mapping fromvariables to termsv~rinenas {xI --~ Sl,X2 --, s2..,.,Xm--~Sm}~ Atermt in ~r matchesa terms in ~ if so = t for some substitutiono. t is said to be an instanceof s or wesay that s subsumes t. Thesub-term,s~ at whichthe rewritecan take place is called the redex.If no redexexists the terms, is said to be irreducibleor inrnonnalform. Let us nowconsider howwemightuse this techniquein our querypre-processor.Considera relation ~ with auributeset ~ = {AI,A2,.....,An}.In accordancewith a set of heuristiCSfor pre-processing whichwill be given in the nextsection,the attribute set d canbe partitionedinto twosetse~and~, where~ is the set of attributes in that c~anbe searchedUsingSCAFS and~is thesetof attributes in Athat Cannotbesearchedusing SCAFS (i.e. the Ingles queryoptimizerwouldprefer not to use SCAFS for searchingon these attributes). Therefore,for a queryto be processedusingSCAFS-the search condition Of thequerymustonly consist of terms involvingauributes in e. Tetrasinvolvingattributes in. f are called constructortermswhilethoseinvolvingattributes in d are called selector terms. Thus,the goal ofthe rewrite systemis to rewd~query.searchconditionsthat include terms involving attributes in ~ into querysearchconditionswith termsinvolvingonlyattributes in ~. , Therewrite systemaboveconsists of eq,~t!ons and rules. In the case of relational databases,eq-~onswould ~epresem l~U, tial ~denu’rydependende~ whilerules woulddefine l~rtialfunc~onaldependencies.Bell [B~I J 93] defines a functional dependency asa statementdenotedX--~Y, wherebothX and Yare sets of attributes in relation ~, whichholdsin ¯ if and¯onlyif, for everypalrof tuples, u andv, if u[X]= u[Y]thenv[X]= v[Y]. Also, an!dent#ydependency,denotedX t-~Y, holds whenX ~Yand Y --~ X. Wedefine a par~’alfunctional dependency as a functionaldependency that is not true for the wholedomainof Xi.e. for somevaluesof Xthere is Page 292 AAAI-94 Workshop on Knowledge Discovery in Databases KDD-94 a uniquevalue for Y. Associatedwith a partial functional dependency is a dependency measure.The dependency measure,din, is calculatedas : (no. of uniquevaluesof Xthat havea uniquevalueof Yassoc~ttedwith it) MI din (total number of uniquevaluesof X). Wesay that a~da/Men~ty dependency exists betweenXand.Yff for certain values of X there is a uniquevalue . Of Y andfor those values of Ythere are uniquevaluesof X, As in ~e case.of a partial functional dependency,a partial identity dependency has a dependency measureassociatedvath it, whichis measuredas : (no. of uniquevaluesof X that havea uniquevalueof Yassociatedwith i0 dm (total number of uniquevalues of X). (no. of uniquevaluesof Ythat havea uniquevalueof Xassociatedwith i0 m .(total number of uniquevaluesof Y). or an identity dependency, the value of dm is 1, Thusthe dependency .. Clearly, for a functional dependency measureindicates as to howclose the partial functionaler identity dependency betweenXandY is to a functional or identity dependency betweenthe twofields. Theequationsand rules required for the rewrite systemcan be automaticallygeneratedusingalgorithmsdeveloped in the field of database mininglike STRIP[ANAN93b] and KID3[PLAT91]. Weuse the STRIPalgorithm. 4..DomainKnowledge for the SQP Tests carried out on SCAFS [ANAN93a] for the NIHEapplication revealed that the Ingles QueryOptimizerwas not using SCAFS Wits full potential. Themeof ~nmryand Secondaryaccess:methodsis preferred most of the lime to’using SCAFS. TheIngres SearchAceel"era/or only employsSCAFS after it has decidedto scan the whole ,fable being queried without taking into’accountthe advantagesto be ~ed by using SCAFS. Thus, wecameto the conclusionthatthe Ingres SearchAcceleratorandthe Ingres QueryOptimizerare very loosely coupledand somemethodneededtO be developedtobridge this gap betweenthem. We,therefore, soughta set of heurisfi’c~ whichwouldindicate siuuuionsunderwhichSCAFS wouldbe used for a particular query. Theseheuristics are ~ to guide the databaseminingalgorithmin its search for knowledge ¯ whichcouldbe usedfor queryreformulation.Thestudyof SCAFS resulted in the followingheuristics : ILllf the searchconditionin the querycontainsa restriction on a primaryattribute of the relation SCAFS is no used. is not used H.2 If the searchconditioncontains a restriction on an attributewith a secondaryaccessmethod,SCAFS unless the expectednumber of tuples satisfyingthe queryis greater than approx,s%of the relation size where, , " .S = 13.83. for hash file orga,17Atlon ¯s = 9.75 for iSAMfile organization s = 8.74 for BTree/Lieorganization IL3 If the searchconditioncontainsan attribute with.anassociatedaccessmethodbut the accessmethodis not nsabledue tO the unture 0f the predicate, SCAFS is used (e.g. a conditionof the type account_no < 100where the relation is hashedon account_no cannotuse the hashfunctionfor the search). HAIf the search condition!~, a disjunctionin it, SCAFS is alwaysused. KDD-94 AAAI-94 Workshop on Knowledge Discovery in Databases Page 293 Keepingthe aboveheuristics in mindweshaUnowdiscuss the applicability of the heuristics for semanticquery optimizationgivenbyKing[KING81] as listed in section 2.1. A. Index Introduction: Clearly fromthe heuristics ILl-H.4 introducingindexesin the queryis not recommended aS this wouldensurethe use of the indexrather than SCAFS. Infact, rather than indexinlzoduction,for the SQPIndex £1i~nation is recommended. B. Join Introduction:Thereasonfor introducingnewjoins in SQOis to take advantageof indexes(see 2.1.2). Therefore,Join Inlroductionis not useful for the SQP. C. ScanReduction: ScanReductionis useful for the SQPas the ~eater the restrictions on the tables the greater is ¯ the advantageOf using SCAFS.The join ~on has to be performed/n~ehost machineitself and so by .restricting relations morewemereducingtlie numberof tuples being sent to the host machine,by SCAFS, for thejoinopmuion. D. Join Elimination; Join e~’mationis useful for the SQPas a lowernumberof join operationimplies less work for thei host n~, ff~o!nscan be eliminatedthey shouldbe. i ~., chine. Therefore i 5. The SemanticQueryIre-processor and its Usage In section 4, weenumeratedthe DomainKnowledge that wouldbe used by the SemanticQueryPre-processor. As can be seen fromthe heuristics abovei the SQPuses someof ~e h~tics used by SemanticQueryOptimizers. Thus, lhe SQPperformsa cena~ amountof query 0pfim~onthough this is moreOf a by-productrather than the maingoal. Themainobjective of theSQPiS to use SCAFS whererequired. Thebasic idea behindthe SQPis givenin Table.1. if (SCAFS is not busy) { if (CPUis busy) or (I/OIraffic is heavy) or(cost_of_query(SCAFS) - cost_of..query(~SCAFS) < threshold) { ,Reformulate Query using DomainKnowledge and Transformation Knowledge to use SCAFS } else { Reformulate Query using DomainKnowledge and TransformationKnowledge to reduce the cost of executionof Query Table 1 : TheSemanticQueryPre-processor(SQP) Figure1 showsthe role of the SQP.h acts as a filter betweenthe queryinput andthe conventionalquery opemi. The Trat~ormation Knowledge used by the SQPconsists of two types of knowledge : Static (Integrity Conslraints)andDynamic (knowledge about the dAtAstored in the databasethat is frequendyIrue but not always Page 294 AAM.94 Workshop on Knowledge Discovery in Databases KDD-94 Irue) Knowledge.The STRIPdatabaseminingalgorith m workingin depen~dyof the query preprocessor ¯ discovers eq,,n!!ons andrules.thatformtheDynamic TvanM’o~tion KnOwledge ofthepre.processcf (wediscuss °. the Soal-dlrecteddiscoveryroie of database, miningin the nexisecdon)~ Theseequationsand rules needto be upda.ted along with any updatesm~to the dn!~ in.the database.Toreduce this overheadof updating the Dyno~c Xn~ledse, mes~e of the knowledgebase maybe restricted ma ~,predef’med.sizeusing a methodsimilar to duuU~tby Siegel eL al [SIEG92]for discoveringrules for SQO.We define our methodin the next secaon. The TranM’ormation Knowledge formsthe rewrite systemiamt rewrites queries as describedin section 3 above. STrong Rule Induction in.Parallel Transformation Host ,¯¯ . Machine Semantic Query Pre- processor (SQP) Query " I Conventional. (CQo)Query Optimizer ¯ fig. 1 : Therole of Database Miningin the Semandc QueryPre-processor (SQP) The Domain Knowledse usedby the pre-processorhas already beendescribedin section 4 as a set of heuri tics usedm guide the semanticwansformafions (Therole of the STRIPalgorithmin goal-directeddatabase miningis discussedin the next section). The MachineUsagelqformationis providedby the UNIX sys~mactivity report (mr) utility whichis itandardUNIX performance moni~ringutility. Theestimationof querycosts is carried out using a methodsimilar to that given by King[KING81]. 6. PerformanceConsiderations In this section wewill discuss howthe performance of the SQPcan be improvedi.e. howthe cost of the SQPcan be reduced.Themainidea is to reslrict the s~e Of the l ransfonnafionknowledge and use databaseminingtechniquesin a goal~lirectedrule discoveryrole (see fig. 2). Theonly variable in.the architecture of the SQPis the size of the mmsformation knowledge.There is a ~ff be(weenthe cost ofthe SQPandthe. completenessof the IransformAtlon knowledge. Theeffect of the size of the Inmsfafmation knowledge on ,the c0St Of the SQPis twofold : ¯* Thelarger the size of the,transformationknowledge base, the greater is the cost of reformulAHon of queriesas therewould~ a larger.set Of candidatetransformationsfor anygiven query. : ~"wThelarger the transformationknowledge base, the greater is the cost of up0~dng the dynamic " ummfommtion knowledgewhenev~updatesare madem the database. KDD-94 AAAI-94 Workshop on Knowledge Discovery in Databases Page 295 Tiros tbe cost of the SQPis directly proportionalto the size of the knowledge baseandrestricting the size of the knowledgebase can be benefic/al froma performanceviewpoint. Weconcenwateonly on the ~’ansaction queries hereandignore MISqueries, as they use SCAFS anyway. Themmsoc:fion queries are ~efinedand are knownat the designstage of the application. Thus,the static . .. !reJ~f. " ormationknowledge(Integr/ty Cons~,aboknOwn at the d~gnstage) can be applied to these queries ¯ i ~theimplementation stage of.. the i applicationandthetwoversims ~ ~ query., stored. Asthe static transformation knowledge never ~ges,lthe~ copies ofthe querieswili: alwaysbe semanticallyexlu/valent. Thus,weare left to deal only with the dynamicuandorma~on knowledgeand a set otparfiASSyu’ansformedqueries. NoW supposewehaveto limit the size of the dynmnictransformationknowledge to k rules / equationsdueto perforiannce considerati0ns. Using the STRIP~ese miningalgo~ [ANAN93b ] we can generate from the . databa~ as above,an imti~ randomsetofk rules / eq~ns. To~hofthe~ rules we associate an interest ¯ ime~,e, v~o~form~for ~ m~of n~s have ~n suggested in DAtAbase Mining literature :¯ ’ [PIATgI,PIAT93] btlt wereckonl/utt the nieasuregivenby Siegel et: al. [SiEG92] is the mostappropriatefor the $ ~ .$ie !g el . . ft. ¯ alsu gg ...... esteda m~ure based oh dA~AhA~eusa e ~ Durin a trandonn " "ifaparucular " at/on ¯ . g p . . . g . . .... _....... rule is not foundin the Iransf~i~ation knowledge base, ~ ~e is ~nsideredto be a ca~dl.do~erule and gets an interest measureassocia~ with it as well. Whenever the interest meas~~iated with any of the candidate ¯ rules becomes greater, than ~heinterest measureof anyof the rules in the transformationknowledge base, STRIPis ¯ invokedto verify tbe candidateride which,if verified, replacesthe ~e with slowerinterest measurein the transformationknowledgebase. STrong Rule Induction in Parallel Transformation Mining Algorithm Host= Machine Semantic QueryPre - processor (saP) Query Conventional QueryOptimizer ¯ (CQO) fig. 2 : Therole of DatabaseMiningin the SQPas a goal-directedrule discoverer [ANAN93a]. . "the fmmalafor interest measuret~ed.in the SQPisbased on the scheduling methodused by SCAPS Inthis fmmula,apart fromdatabase?~.eepatxems,the i~Ctivity of rules present in the lnmsformation knowledge . ~ asW~ ascandidate rulesaffectsthein=restmeasureoftherule. Theinterest measure of a rule diminishes . withtheamo=t of timethatit hasbeen_ inactive. ~ ~emeas=~ewould bemore accurate fortheSQp thanthe one8iveii bySiegel et. aL, as the followingexamplewill show: Considera rule, R1, in the lransformationknowledge base that is used200 timesa day withinan hour. Thenour Page296 AAA1-94Workshopon KnowledgeDiscovery in Databases KDD-94 interest measurewouldreplacethis rule by anotherrule, R2, onceRI has beeninactive for a while, as the interest measureof R1will have ~hed. This wouldnot be the case if weuse the formulaprovidedby Siegel et. al. Thedatabaseusagepatterns and ~tivity of rides informationis stored as domainknowledge and formspart of the input to the STRIP databaseminingalgorithmchangingits role to a goal-directedrule discoveryalgorithm(see fig. 2). 7. Conclusionand Future Work In this paper wehaveshownhowdatabaseminingcan be used in a Systemperformanceproblemas opposedto its usual application usage. Optim’m~ienof ~ of ~ and throughput is carded out by transforming ¯ quezies into $emanti~flyequivaient queri¢~ ~t u~e remureesw~availAhili~ varies. Wehave discussed how databaseminin8 leclmiquekferman m~gralpart of the architec!meof such a querypreprocessorin obtainingrules for qu~ reformulation, el~er ~dependentof the goal or goal~ted. Webeh’evetl~ the use of the Semantic Preprocessor in large, Iransection in~ve data~ management systems promisesadditional performance enhancement, particularly if wehavea possibility of onder-utilizafionof somevaluableresources. As the heuristics for ~e utilisatj’on of SC~wereknownto the antho~fi-oma previous study of SCAFS [ANAN93a], wehavenot used da!~e miningtechniquesfor discevefingthe heuristics in the presented work. ¯ Howeverweare currently:stud~g methodsof confinn~gthe he,tics induced during the empirical study of SCAFS[ANAN93a] using d~t~b~ mining techniques. 8. References [AGRA93a] Agrawal, IL, Imielinski, T., Swami,A. DatabaseMining:A PerformancePerspective ¯ m~Transactionson Knowledgeand Data Engineering, Special Issue on ~amingand Discoveryin Knowledge - BasedSystems,VoL5, No. 6, Dec. 1993. [AGRA93b] Agrawal,R., Imielinski, T., Swami,A. Mi’ningAssociationsbetweenSets of Items ¯ in Massive Databases ACMSIGMOD,Washington DC., May 1993. JAN91] An, H. C., Henschen,L.J. Knowledge-Based Semantic QueryOptimization Lecture Notesin AI, 199I, Vol. 542, Pg. 82-91. [ANAN93] $. S. Ammd, D. A. Bell, J. G. HughesUsing SemanticKnowledgeto Enhancethe PerformanCeof Specialized Database HardwareWorkshopof the FsTrcs’93 Conference, ¯ IIT Bombay,December,1993. [ANAN93a ] Anand,S. S., Hughes,J. G., Bell, D.A. An Empirical Study of ICL’s SCAFS Internal Reportof the Depamnent of Infommtion Systems,Universityof Ulster at Jordenstown [ANAN93b]Anand,.S. S,. Bell, D,A, Hughes, J.O.. An A~h to Database Mining Using Evidential Reasoning,internal Reportof the Department of InformationSystems,Universityof Ulster at Jordanstown [BABB79]Babb~dImplementinga Relational Database by meansof Specialised Hardware ACM transactions of databasesystems, Vol. 4, No. 1, March1979. [BABB85] Babb,EdCAFSfile - correlation unit ICLTechnical Journal, Nov.1985. [BELL93] Bell, DA.FromData Properties to Evidence,Wl~E Transactions on Knowledge and Data Engineering,Special Issue on Learningand Discoveryin Knowledge - BasedDatabases,VoL5, No. 6, KDD-94 AAAI-94Workshopon KnowledgeDiscovery in Databases Page297 Dec. 1993 AboutData Semantics [BERT92] Bertino, E., Musto, D. QueryOptimizationby using Knowledge Data &Knowledge Engineering9, 1992, Pg. 121 - 155. [CHAK84] Chakravarthy,U., Fishman,D., ’M.inker, J. SemanticQueryOptimizationin expert ¯ ’ systems and database systems ~gs from the Ist International Workshopon Expert ¯ DatabaseSystemsed. L. KerschbergPg. 659 - 674. [COUL72]Coularis,G.F., Evans,J.M.,M/tchelI,R.W.Towardscontent addressing in databases TheComputer Journal, Vol. 15, No. 2, 1972. [’DERsg0]Dershowitz, N., Jouannaud, J.P. Rewrite Systems Formal ModelsAndSemantics ¯ ed. Jan Van Lceuwen,ELsevier, 1990. [FERRgl] Ferrari,D., Serazzi,G., Zeigner,A. Measurement and Tuningof ComputerSystems Prentice-Hall,1981. [GRAE87]Graefe, G., DeWitt, D. The EXODUS optimizer generator Proceedings of the 1987 ACM ¯ SIGMOD Conferenceon Management of Data (San Francisco, May1987) Pg. 160 - 171. [HAMMg0] Hammer,M., Zdonik, S. B, Jr. KnowledgeBased Query Processing proc. of the 5th International Conferenceon VeryLarge Databases,MOntreal,October1980, Pg. 137 - 147. [INGR]TheIngres Search Accelerator Users Guide for QueryProcessing Efficiency [KING79] King, J. J Exploringthe u~e of DomainKnowledge Stanford University ComputerScience DepartmentHeuristic Programming Project Technical Report HPP-79-30,December1979. [KINGgl]Kingt J. J QUIST: A SystemFor SemanticQueryOptimizationIn Relational Databases - Proc. of the 6th International Conferenceon VeryLargeDatabases,1980,Pg. 510 - 517. [I.EUN8$]Lean8,C.H.C,Wong,K.S. File processingefficiency on the content addressablefile store Prec. VLDB conference, 1985. [PIATgl] Piatetsky-Shapiro, G., Frawley, W.J (ed.) KnowledgeDiscoveryin Databases AAAI Press / MITPress, 1991. [PLAT93]Piatetsky-Shapim, G. (WorkshopChair) WorkingNotes of Workshopon Knowledge Discoveryin Databases, AAAI-93, 1993. [SIF.G92]Siegel, M., Sciore, E., Salveter, S, A Me~edFor Rule Derivation to Support ..... SePtic Query Optimization ACMTransactions on D~t~be~eSystems Vol. 17, No. 4, 1992,Pg. 563- 600. [YU89]Yu, C., Sun, W, Automatic KnowledgeAcquisition and Maintenanceof Semantic Query Optimization IEEETrans. Know.Data Eng. 1,3 (Sept. 1989) Pg. 362 - 375. Page298 AAAI-94Workshopon KnowledgeDiscovery in Databases KDD-94