Dr. K. Raghava Rao Professor of CSE Dept. of MCA,KL University krraocse@gmail.com http://advdbms.blog.com Generalized Model for Active Databases and Oracle Triggers Triggers are executed when a specified condition occurs during insert/delete/update Triggers are action that fire automatically based on these conditions Triggers follow an Event-condition-action (ECA) model • Event: Database modification E.g., insert, delete, update, • Condition: Any true/false expression Optional: If no condition is specified then condition is always true • Action: Sequence of SQL statements that will be automatically executed When a new employees is added to a department, modify the Total_sal of the Department to include the new employees salary; CREATE TRIGGER Total_sal1 AFTER INSERT ON Employee FOR EACH ROW WHEN (NEW.Dno is NOT NULL) UPDATE DEPARTMENT SET Total_sal = Total_sal + NEW. Salary WHERE Dno = NEW.Dno; From the previous example we can find that: • Logically this means that we will CREATE a TRIGGER, let us call the trigger Total_sal1 This trigger will execute AFTER INSERT ON Employee table (EVENT) It will do the following FOR EACH ROW WHEN NEW.Dno is NOT NULL (CONDITION) The trigger will UPDATE DEPARTMENT (ACTION) By SETting the new Total_sal to be the sum of old Total_sal and NEW. Salary WHERE the Dno matches the NEW.Dno; CREATE TRIGGER <name> • Creates a trigger if one does not exist ALTER TRIGGER <name> • Alters a trigger if one does exist Works in both cases, whether a trigger exists or not Conditions can be one of the following: AFTER • Executes after the event BEFORE • Executes before the event INSTEAD OF • Executes instead of the event Note that event does not execute in this case INSTEAD OF triggers are used for modifying views Triggers can be • Row-level FOR EACH ROW specifies a row-level trigger Executed separately for each affected row NEW and OLD keywords can be used here • Statement-level Default (when FOR EACH ROW is not specified) Execute once for the SQL statement, Any true/false condition to control whether a trigger is activated or not • Absence of condition means that the trigger will always execute for the even (True) • Otherwise, condition is evaluated before the event for BEFORE trigger after the event for AFTER trigger Action can be • One SQL statement • A sequence of SQL statements enclosed between a BEGIN and an END Action specifies the relevant modifications Design and Implementation Issues for Active Databases An active database allows users to make the following changes to triggers (rules) • Activate • Deactivate • Drop An event can be considered in 3 ways: Immediate consideration: • Part of the same transaction and can be one of the following depending on the situation Before After Instead of Deferred consideration: • Condition is evaluated at the end of the transaction Detached consideration: • Condition is evaluated in a separate transaction Potential Applications for Active Databases Notification • Automatic notification when certain condition occurs • Enforcing integrity constraints Triggers are smarter and more powerful than constraints • Maintenance of derived data Automatically update derived data and avoid anomalies due to redundancy E.g., trigger to update the Total_sal in the earlier example Triggers in SQL-99 Can alias variables inside the REFERENCING clause Trigger examples in SQL-99 Time Representation, Calendars, and Time Dimensions Time is considered ordered sequence of points in some granularity • Use the term choronon instead of point to describe minimum granularity A calendar organizes time into different time units for convenience. • Accommodates various calendars Gregorian (western), Chinese, Islamic, Etc. Point events • Single time point event E.g., bank deposit • Series of point events can form a time series data Duration events • Associated with specific time period Time period is represented by start time and end time Transaction time • The time when the information from a certain transaction becomes valid Bitemporal database • Databases dealing with two time dimensions Incorporating Time in Relational Databases Using Tuple Versioning Add to every tuple • Valid start time • Valid end time Incorporating Time in Object-Oriented Databases Using Attribute Versioning A single complex object stores all temporal changes of the object Time varying attribute • An attribute that changes over time E.g., age Non-Time varying attribute • An attribute that does not changes over time E.g., date of birth Declarative Language • Language to specify rules Inference Engine (Deduction Mechanism) • Can deduce new facts by interpreting the rules • Related to logic programming Prolog language (Prolog => Programming in logic) Uses backward chaining to evaluate Top-down application of the rules Specification consists of: • Facts Similar to relation specification without the necessity of including attribute names • Rules Similar to relational views (virtual relations that are not stored) Predicate has • a name • a fixed number of arguments Convention: Constants are numeric or character strings Variables start with upper case letters • E.g., SUPERVISE(Supervisor, Supervisee) States that Supervisor SUPERVISE(s) Supervisee Rule • Is of the form head :- body where “:-” is read as “if and only if” • E.g., SUPERIOR(X,Y) :- SUPERVISE(X,Y) • E.g., SUBORDINATE(Y,X) :- SUPERVISE(X,Y) Query • Involves a predicate symbol followed by some variable arguments to answer the question • E.g., SUPERIOR(james,Y)? • E.g., SUBORDINATE(james,X)? (a) Prolog notation (b) Supervisory tree • Program is built from atomic formulas Literals of the form p(a1, a2, … an) where p predicate name n is the number of arguments Built-in predicates are included E.g., <, <=, =, etc. • A literal is either An atomic formula An atomic formula preceded by not A formula can have quantifiers • Universal • Existential In clausal form, a formula must be transformed into another formula with the following characteristics • All variables are universally quantified • Formula is made of a number of clauses where each clause is made up of literals connected by logical ORs only • Clauses themselves are connected by logical ANDs only. Any formula can be converted into a clausal form • A specialized case of clausal form are horn clauses that can contain no more than one positive literal Datalog program are made up of horn clauses There are two main alternatives for interpreting rules: • Proof-theoretic • Model-theoretic Proof-theoretic • Facts and rules are axioms • Ground axioms contain no variables • Rules are deductive axioms • Deductive axioms can be used to construct new facts from existing facts • This process is known as theorem proving Model-theoretic • Given a finite or infinite domain of constant values, we assign the predicate every combination of values as arguments • If this is done for every predicated, it is called interpretation Model • An interpretation for a specific set of rules Model-theoretic proofs • Whenever a particular substitution to the variables in the rules is applied, if all the predicated are true under the interpretation, the predicate at the head of the rule must also be true Minimal model • Cannot change any fact from true to false and still get a model for these rules A program is safe if it generates a finite set of facts Two main methods of defining truth values • Fact-defined predicates (or relations) Listing all combination of values that make a predicate true • Rule-defined predicates (or views) Head (LHS) of one or more Datalog rules, for example: Many operations of relational algebra can be defined in the for of Datalog rules that defined the result of applying these operations on database relations (fact predicates) • Relational queries and views can be easily specified in Datalog Define an inference mechanism based on relational database query processing concepts • Example: predicate dependencies Knowledge databases 42 KDD: A Definition KDD is the automatic extraction of non-obvious, hidden knowledge from large volumes of data. 106-1012 bytes: we never see the whole Then run Data data set, so will put it in Mining algorithms the memory of computers What is the knowledge? How to represent and use it? 43 Data, Information, Knowledge We often see data as a string of bits, or numbers and symbols, or “objects” which we collect daily. Information is data stripped of redundancy, and reduced to the minimum necessary to characterize the data. Knowledge is integrated information, including facts and their relations, which have been perceived, discovered, or learned as our “mental pictures”. Knowledge can be considered data at a high level of abstraction and generalization. 44 From Data to Knowledge Medical Data by Dr. Tsumoto, Tokyo Med. & Dent. Univ., 38 attributes ... 10, M, 0, 10, 10, 0, 0, 0, SUBACUTE, 37, 2, 1, 0,15,-,-, 6000, 2, 0, abnormal, abnormal,-, 2852, 2148, 712, 97, 49, F,-,multiple,,2137, negative, n, n, ABSCESS,VIRUS 12, M, 0, 5, 5, 0, 0, 0, ACUTE, 38.5, 2, 1, 0,15, -,-, 10700,4,0,normal, abnormal, +, 1080, 680, 400, 71, 59, F,-,ABPC+CZX,, 70, negative, n, n, n, BACTERIA, BACTERIA 15, M, 0, 3, 2, 3, 0, 0, ACUTE, 39.3, 3, 1, 0,15, -, -, 6000, 0,0, normal, abnormal, +, 1124, 622, 502, 47, 63, F, -,FMOX+AMK, , 48, negative, n, n, n, BACTE(E), BACTERIA 16, M, 0, 32, 32, 0, 0, 0, SUBACUTE, 38, 2, 0, 0, 15, -, +, 12600, 4, 0,abnormal, abnormal, +, 41, 39, 2, 44, 57, F, -, ABPC+CZX, ?, ? ,negative, ?, n, n, ABSCESS, VIRUS ... Numerical attribute categorical attribute missing values class labels IF cell_poly <= 220 AND Risk = n AND Loc_dat = + AND Nausea > 15 THEN Prediction = VIRUS [87,5%] [confidence, predictive accuracy] 45 Data Rich Knowledge Poor People gathered and stored so much data because they think some valuable assets are implicitly coded within it. Raw data is rarely of direct benefit. Its true value depends on the ability to extract information useful for decision support. Impractical Manual Data Analysis How to acquire knowledge for knowledge-based systems remains as the main difficult and crucial problem. ? knowledge base inference engine Tradition: via knowledge engineers New trend: via automatic programs 46 Benefits of Knowledge Discovery Value Disseminate DSS Generate Volume MIS Rapid Response EDP EDP: Electronic Data Processing MIS: Management Information Systems DSS: Decision Support Systems 47 Lecture 1: Overview of KDD 1. What is KDD and Why ? 2. The KDD Process 3. KDD Applications 4. Data Mining Methods 5. Challenges for KDD 48 The KDD process The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data - Fayyad, Platetsky-Shapiro, Smyth (1996) Multiple process non-trivial process valid novel useful understandable Justified patterns/models Previously unknown Can be used by human and machine 49 The Knowledge Discovery Process a step in the KDD process consisting of methods that produce useful patterns or models from the data, under some acceptable computational efficiency limitations 5 4 3 Putting the results in practical use Interpret and Evaluate discovered knowledge Data Mining 2 1 Extract Patterns/Models Collect and Preprocess Data Understand the domain and Define problems KDD is inherently interactive and iterative 50 The KDD Process Data organized by function Create/select target database Data warehousing 1 Select sampling technique and sample data Supply missing values Eliminate noisy data Normalize values Transform values 2 Create derived attributes Find important attributes & value ranges 4 3 Select DM task (s) Transform to different representation Select DM method (s) Extract knowledge Test knowledge Refine knowledge Query & report generation Aggregation & sequences Advanced methods 5 51 Main Contributing Areas of KDD [data warehouses: integrated data] Statistics [OLAP: On-Line Analytical Processing] Databases Store, access, search, update data (deduction) Infer info from data (deduction & induction, mainly numeric data) KDD Machine Learning Computer algorithms that improve automatically through experience (mainly induction, symbolic data) 52 Potential Applications Business information Manufacturing information - Marketing and sales data analysis - Investment analysis - Loan approval - Controlling and scheduling - Fraud detection - Network management - etc. - Experiment result analysis - etc. Scientific information - Personal information Sky survey cataloging Biosequence Databases Geosciences: Quakefinder etc. 53 KDD: Opportunity and Challenges Competitive Pressure Data Rich Knowledge Poor (the resource) KDD Data Mining Technology Mature Enabling Technology (Interactive MIS, OLAP, parallel computing, Web, etc.) 54 KDD: A New and Fast Growing Area KDD workshops: since 1989. Inter. Conferences: KDD (USA), first in 1995; PAKDD (Asia), first in 1997; PKDD (Europe), first in 1997. ML’04/PKDD’04 (in Pisa, Italy) Industry interests and competition: IBM, Microsoft, Silicon Graphics, Sun, Boeing, NASA, SAS, SPSS, … About 80% of the Fortune 500 companies are involved in data mining projects or using data mining systems. JAPAN: FGCS Project (logic programming and reasoning). “Knowledge Discovery is the most desirable end-product of computing”. Wiederhold, Standford Univ. 55 Primary Tasks of Data Mining finding the description of several predefined classes and classify a data item into one of them. Classification ? maps a data item to a real-valued prediction variable. Regression discovering the most significant changes in the data Deviation and change detection identifying a finite set of categories or clusters to describe the data. Clustering finding a model which describes significant dependencies between variables. Dependency Modeling finding a compact description for a subset of data Summarization 56 Classification “What factors determine cancerous cells?” Examples Data Cancerous Cell Data Mining Algorithm General patterns Classification Algorithm - Rule Induction - Decision tree - Neural Network 57 Classification: Rule Induction “What factors determine a cell is cancerous?” If and and Then Color = light Tails = 1 Nuclei = 2 Healthy Cell If and and Then Color = dark Tails = 2 Nuclei = 2 Cancerous Cell (certainty = 92%) (certainty = 87%) 58 Classification: Decision Trees Color = dark #nuclei=1 #tails=1 healthy #tails=2 cancerous #nuclei=2 cancerous Color = light #nuclei=1 #nuclei=2 healthy #tails=1 #tails=2 healthy cancerous 59 Classification: Neural Networks “What factors determine a cell is cancerous?” Color = dark # nuclei = 1 … Healthy Cancerous # tails = 2 60 End of Unit-iv