SI 654 Database Application Design
Winter 2003
Dragomir R. Radev

Database Processing, Eighth Edition
Chapter 12: Managing Databases with Oracle
David M. Kroenke
© 2002 by Prentice Hall

What is Oracle?
• Oracle is the world’s most popular DBMS that…
  – Is extremely powerful and robust
  – Runs on many different operating systems
  – Can be configured and tailored
  – Operates with most, if not all, add-on products

Oracle Complexity
• The power and flexibility of Oracle make it very complex:
  – Installations are difficult
  – The configuration options are numerous
  – System requirements are high
  – System maintenance is complex

The Language of Oracle… SQL Plus
• SQL Plus is used in Oracle to:
  – Define the structure of a database and the definition of the data
  – Insert, delete, and modify data
  – Define the behavior of the system through stored procedures and triggers
  – Retrieve data and generate reports

Gaining Access to SQL Plus
• To gain access to SQL Plus, you will need a username and password and possibly a host string (depending on your system configuration)
• When Oracle is first installed, it establishes several default accounts, namely...
  – internal/oracle (a privileged account)
  – sys/change_on_install (a privileged account)
  – system/manager (a privileged account)
  – scott/tiger (a non-privileged account)

Creating the Database
• Ways to create an Oracle database:
  – Using SQL Plus
    • Start button –> Programs –> Oracle – OraHome81 –> Applications Development –> SQL Plus
  – Using Oracle’s Database Configuration Assistant
    • Start button –> Programs –> Oracle – OraHome81 –> Database Administration –> Database Configuration Assistant

Entering SQL Plus Commands
• The SQL Plus Buffer
  – As a user types commands, the commands are saved into the SQL Plus buffer.
• The SQL Plus Editor
  – Users may edit or alter SQL Plus commands using a text editor.

SQL Plus Buffer Commands
• SQL Plus is not case sensitive (except within quotation marks).
• List – displays the content of the SQL Plus buffer
• List n – displays line number n and changes the current line number to n
• Change – performs a search/replace operation for the current line number
• Semi-colon (;) or slash (/) – executes the statement

SQL Plus Editor
• The SQL Plus Edit command will launch the SQL Plus text editor
• After the SQL statement is complete and correct, exit the editor
• To execute the statement, type the slash key (/) at the SQL prompt
• To retrieve an existing SQL file:
  – SQL> Edit file1.sql

SQL Plus Commands
• Desc – lists the fields in the specified table
• Select – retrieve data
• Create – create objects
• Drop – delete objects
• Alter – change objects
• Insert – input data
• Delete – delete data
• Update – change data

Select Syntax
    Select field1, field2
    From table_a, table_b;

Create Syntax
    Create Table tablename (
      field1 data_type (size) NOT NULL,
      field2 data_type (size) NULL);

    Create Sequence tableID Increment by 1 Start with 1000;
  (this command creates a counter that automatically increments for each new record – does not ensure uniqueness)

Alter Table Syntax
    Alter Table tablename1
      Add Constraint FieldPK Primary Key (Field1, Field2);

    Alter Table tablename2
      Add Constraint FieldFK Foreign Key (Field1, Field2)
      References tablename1 On Delete Cascade;

Insert Syntax
    Insert into tablename (fieldID, field2)
    Values (fieldID.NextVal, 'data content');

Drop Syntax
• Drop Table tablename;
• Drop Sequence fieldID;

Indexes
• Indexes are used to enforce uniqueness and to enable fast retrieval of data.
• Create Unique Index fieldIndex on tablename (field1, field2);

Changing the Table Structures
• Alter Table tablename Add field4 datatype (size);
• Alter Table tablename Drop Column field2;
  – you will permanently lose the data in field2
• Alter Table tablename Modify field3 not null;

Changing the Data… Update Syntax
    Update tablename
    Set field1 = 'value_a'
    Where field3 = value;

Check Constraint
Provide a list of valid values or a valid range…
    Create Table tablename (
      Field1 datatype (size) Not Null,
      Field2 datatype (size) Null Check (field2 in ('value_a', 'value_b')));

Check Constraint (cont’d)
• Alter Table tablename Add Constraint DateChk Check (DateField1 <= DateField2);
• Alter Table tablename Add Constraint NumRange Check (Field1 Between 180 and 400);
• Alter Table tablename Drop Constraint constraintname;

Views
• Displaying the data from the database just the way a user wants it…
    Create View View1 As
    Select * From Tablename
    With Read Only;

PL/SQL
• Allowing SQL to act more like a programming language.
• Row-at-a-time versus set-at-a-time.
• PL/SQL permits cursors
• A stored procedure is a PL/SQL (or other) program stored in the database. A stored procedure may have parameters.
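
The row-at-a-time versus set-at-a-time distinction can be sketched outside Oracle. The example below is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for Oracle and PL/SQL, and the table `item` and its columns are hypothetical. One function changes every price with a single set-oriented UPDATE; the other fetches the rows and updates them one at a time, roughly as a PL/SQL cursor loop would.

```python
import sqlite3

def make_db():
    # In-memory SQLite database standing in for an Oracle schema (assumption).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, price INTEGER NOT NULL)")
    conn.executemany("INSERT INTO item (id, price) VALUES (?, ?)",
                     [(1, 100), (2, 250), (3, 400)])
    conn.commit()
    return conn

def raise_set_at_a_time(conn, amount):
    # Set-at-a-time: one SQL statement operates on the whole set of rows.
    conn.execute("UPDATE item SET price = price + ?", (amount,))
    conn.commit()

def raise_row_at_a_time(conn, amount):
    # Row-at-a-time: fetch the rows, then handle each one individually.
    rows = conn.execute("SELECT id, price FROM item ORDER BY id").fetchall()
    for item_id, price in rows:
        conn.execute("UPDATE item SET price = ? WHERE id = ?",
                     (price + amount, item_id))
    conn.commit()

def prices(conn):
    # Current prices in id order.
    return [p for (p,) in conn.execute("SELECT price FROM item ORDER BY id")]

a, b = make_db(), make_db()
raise_set_at_a_time(a, 10)
raise_row_at_a_time(b, 10)
print(prices(a), prices(b))
```

Both approaches end in the same state here. The row-at-a-time form is mainly useful when each row needs logic that is awkward to express in a single statement.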

PL/SQL Parameter Types
• IN – specifies the input parameters
• OUT – specifies the output parameters
• IN OUT – a parameter that may be an input or an output

PL/SQL Code
• Variables are declared following the AS keyword
• The assignment operator is :=, as follows:
    variable1 := 'value';
• Comments in PL/SQL are enclosed between /* and */, as follows:
    /* This is a comment */

PL/SQL Control Structures
    FOR variable IN list_of_values LOOP
      Instructions
    END LOOP;

    IF condition THEN
      BEGIN
        Instructions
      END;
    END IF;

Saving, Compiling, and Executing PL/SQL Code
• The last line in the PL/SQL procedure should be a slash (/).
• The procedure must be saved to a file
• To compile the procedure, type the keyword Start, followed by the procedure filename
  – START MyProg.SQL
• To see any reported errors, type SHOW ERRORS;
• To execute the procedure, type EXEC MyProg ('parameter1', 'parameter2');

Triggers
• A trigger is a stored procedure that is automatically invoked by Oracle when a specified activity occurs
• A trigger is defined relative to the activity that invokes it
  – BEFORE – execute the stored procedure prior to the activity
  – AFTER – execute the stored procedure after the activity
  – INSTEAD OF – execute the stored procedure in lieu of the activity

Trigger Example
    Create or Replace Trigger triggername
    Before Insert or Update of fieldname on tablename
    For Each Row
    Begin
      /* instructions */
    End;

A Trigger Knows the Old and New Values for Fields
• The variable :new.fieldname1 stores the new information for fieldname1 as entered by the user.
• The variable :old.fieldname1 stores the information in fieldname1 prior to the user’s request.
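
The old and new values seen by a row-level trigger can be demonstrated with a small, runnable sketch. It assumes SQLite (through Python's sqlite3 module) rather than Oracle: SQLite writes OLD.column and NEW.column where Oracle writes :old.column and :new.column, and has no Create or Replace form. The tables `item` and `price_audit` and the trigger name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, price INTEGER NOT NULL)")
conn.execute("CREATE TABLE price_audit "
             "(item_id INTEGER, old_price INTEGER, new_price INTEGER)")

# Row-level AFTER trigger: records the value before and after each price update.
conn.execute("""
    CREATE TRIGGER item_price_audit
    AFTER UPDATE OF price ON item
    FOR EACH ROW
    BEGIN
        INSERT INTO price_audit (item_id, old_price, new_price)
        VALUES (OLD.id, OLD.price, NEW.price);
    END
""")

conn.execute("INSERT INTO item (id, price) VALUES (1, 100)")   # does not fire the trigger
conn.execute("UPDATE item SET price = 120 WHERE id = 1")       # fires the trigger once
conn.commit()

audit_rows = conn.execute(
    "SELECT item_id, old_price, new_price FROM price_audit").fetchall()
print(audit_rows)
```

The audit table ends up with one row holding the item id, the price before the update, and the price after it.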

Activating a Trigger
• The trigger must be saved to a file
• To compile the trigger, type the keyword Start, followed by the trigger filename
  – START MyTrigger.SQL
• To see any reported errors, type SHOW ERRORS;
• If no errors were encountered, the trigger is automatically activated

Data Dictionary
• The data dictionary contains information that Oracle knows about itself… the metadata.
• It includes information regarding just about everything in the database, including the structure and definition of tables, sequences, triggers, indexes, views, stored procedures, etc.
• The data dictionary table names are stored in the DICT table.

Concurrency Control
• Since Oracle only reads committed changes, dirty reads and lost updates are avoided
• Transaction isolation levels:
  – Read Committed
  – Serializable
  – Read-only
  – Explicit Locks

Read Committed Transaction Isolation
• Reads may not be repeatable (two reads of the same data may return different values, depending on the timing of updates and reads)
• Phantoms are possible (rows returned by one read may be deleted, or new matching rows inserted, before the next read)
• Uses exclusive locks
• Deadlocks are possible and are resolved by rolling back one of the transactions

Serializable Transaction Isolation
• Reads are always repeatable
• Phantoms are avoided
• Must issue one of the following commands:
    Set Transaction Isolation Level Serializable;
  or
    Alter Session Set Isolation_Level = Serializable;
• Coordinates activities in submission order. When this coordination detects difficulties, the application program(s) must intervene.

Read-only Transaction Isolation
• An Oracle-only isolation level
• No inserting, updating, or deleting is permitted

Explicit Locking
• Not recommended
• Oracle does not promote (escalate) locks. As a result, a table may have many, many locks within it. Oracle manages these locks transparently. Issuing explicit locks may interfere with these transparent locks.

Oracle Security
• Usernames and passwords are used to manage DBMS access
• Users may be assigned to one or more profiles
• Oracle provides extensive resource limitations and access rights. These restrictions may be applied to users or profiles.
• The SQL Grant operator provides additional access rights
• The SQL Revoke operator removes access rights

Backup/Recovery
• Committed changes are saved to destination Tablespaces.
• Uncommitted changes are saved in the Rollback Tablespace.
• Redo Log files save all changes made in the Tablespaces.
• To start and/or recover from a system failure, the Control Files are read.

Archivelog
• If Oracle is running in ARCHIVELOG mode, backup copies are made of the redo log files.
• Otherwise, the redo log files are periodically overwritten with new information.

Types of Failures
• Application Failure – When a program bug is encountered or when a program does not correctly respond to current system conditions.
• Instance Failure – When Oracle is unable to do what it needs to do.
• Media Failure – When a disk becomes inaccessible to Oracle.

Recovery of an Application Failure
• Oracle rolls back uncommitted changes.
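
Rolling back uncommitted changes after an application failure can be illustrated with a short sketch. It assumes SQLite via Python's sqlite3 module in place of Oracle; the `account` table and the `transfer` function are hypothetical. A simulated program failure in the middle of a transfer causes the already-applied debit to be undone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)")
conn.executemany("INSERT INTO account (id, balance) VALUES (?, ?)",
                 [(1, 100), (2, 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move amount from src to dst; undo all changes if the program fails midway."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes the block
            conn.execute("UPDATE account SET balance = balance - ? WHERE id = ?",
                         (amount, src))
            (balance,) = conn.execute("SELECT balance FROM account WHERE id = ?",
                                      (src,)).fetchone()
            if balance < 0:
                # Simulated application failure after the debit was applied.
                raise ValueError("insufficient funds")
            conn.execute("UPDATE account SET balance = balance + ? WHERE id = ?",
                         (amount, dst))
        return True
    except ValueError:
        return False

def balances(conn):
    # Current (id, balance) pairs in id order.
    return conn.execute("SELECT id, balance FROM account ORDER BY id").fetchall()

ok = transfer(conn, 1, 2, 30)       # succeeds and is committed
failed = transfer(conn, 1, 2, 500)  # fails; the uncommitted debit is rolled back
print(ok, failed, balances(conn))
```

After the failed call the balances are the same as after the first, committed transfer, which is the behavior the slide describes for uncommitted work.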

Recovery of an Instance Failure
• Oracle would be restarted using the following sequence…
  – Read the Control File
  – Restore system to last known valid state
  – Roll forward changes not in system (replay the Redo Log Files)

Recovery from a Media Failure
• Restore system from Backup
• Read the Control File
• Roll forward changes using the Archive Log Files (from the ARCHIVELOG)
• Roll forward changes from the on-line Log Files (the most recent versions of the Logs)

Types of Recoveries
• Consistent Backup
  – After the restoration, delete all uncommitted activities
  – This ensures consistency, but may lose recent changes
• Inconsistent Backup
  – After the restoration, all uncommitted activities remain

Database Processing, Eighth Edition
Chapter 14: Networks, Multi-Tier Architectures, and XML
David M. Kroenke

Networks
• A network is a collection of computers that communicate with one another using standard sets of rules, called protocols
• Common Network Environments:
  – Internet
  – Intranet
  – Wireless Network Access

The Internet
• Internet - a publicly accessible network of networks spanning the globe
• Uses a communications protocol called Transmission Control Protocol/Internet Protocol (TCP/IP)

Key Dates for the Internet
• The Internet originated in the 1960s as ARPANET, a network funded by the US Department of Defense
• HTTP: HyperText Transfer Protocol (used to transfer Web pages) was created in 1989 at CERN
  – Key HTTP characteristics:
    • Request-based (waits for user action)
    • Stateless (does not sequence or remember activities)

HTTP: Stateless Property
• In applications development, you may often wish to save the application state.
• Several Internet tools exist to help accomplish this:
  – Microsoft Internet Information Server (IIS)
  – Microsoft Active Server Pages (ASP)
  – Java Servlets with Java Server Pages (JSP)

The Intranet
• Some organizations use Internet technologies to create their own privately accessible network, called an intranet.
• If a connection to the Internet does exist, it does so through a firewall
• An intranet is almost always faster than the Internet

Firewall
• Firewall - a security gateway that protects an organization from unauthorized access via the Internet
• Consists of software and sometimes hardware components

Wireless Network Access
• Because of lower reliability, inferior screen displays, and slower transfer rates, the traditional wired protocols are not appropriate for wireless environments.
• A few protocols have been developed that allow wireless devices to communicate via the Internet:
  – Wireless Application Protocol (WAP)
  – Wireless Markup Language (WML)

Multi-tier Architectures
• Multi-tier Architectures
  – Tiers are the number of computers (serving a like function) that a user must use to satisfy his/her request.
  – Common tiers include Web server and database server.

A Three-Tier Architecture
[Figure: a three-tier architecture]

Functions of Tiers
[Figure: functions of tiers]

Processing at the Different Tiers
• Since each tier serves a different function, each tier may have a different operating system and different application software offerings.

Processing
• Client Processing
  – Using the browser (e.g., Netscape Navigator)
• Server Processing
  – Using Server Software (e.g., ASP)

Windows 2000 Web Server Languages
• JavaScript
• VBScript
• Perl
• ActiveX Control
• Java

Standards & Languages Common With MS Web Server
[Figure: standards and languages common with MS Web server]

Unix/Linux Web Server Environment
• JavaScript
• Java Applets
• Java Servlets
• Java Server Pages
• Perl
• Java
• CGI

N-Tier Processing
• The 3-Tier architecture may be extended to include additional tiers.
• This produces a distributed processing model using various servers on the Internet

Markup Languages
• Markup languages are used to specify the appearance and behavior of Web pages
• Markup language flavors:
  – HTML – an application of SGML
  – DHTML
  – RDS/ADO
  – XML

HTML
• HyperText Markup Language
  – PROS
    • Simple
    • Standardized
  – CONS
    • Static content
    • Limited connectivity
    • Mixed structure/content

DHTML
• Dynamic HyperText Markup Language
• Encapsulates the entire HTML command set
• Provides access to objects on the page using the Document Object Model (DOM)
• Allows for Cascading Style Sheets (CSS)

Data Services
• Data services allow Web pages to exchange data with databases
• RDS (Remote Data Services) is a set of ActiveX controls
  – The data exchanges must be relatively simple
• ADO (ActiveX Data Objects) is a set of ActiveX data objects
  – These data exchanges may be more complex

Extensible Markup Language – XML
• XML clearly separates content from structure and allows developers to easily define their own elements.
• Rather than hard-coding Web pages, you create rules that govern how the document should look, then merge the structure and the content files. So, the very nature of XML is dynamic.
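
The separation of content from presentation can be sketched in a few lines. This is a minimal illustration using Python's standard xml.etree.ElementTree module; the sample document and the two rendering functions are hypothetical and stand in for the style rules the slide mentions. The same content is rendered once as an HTML list and once as plain text.

```python
import xml.etree.ElementTree as ET

# Hypothetical content file: data only, no presentation.
CONTENT = """
<bib>
  <book year="1995"><title>An Introduction to Database Systems</title><author>Date</author></book>
  <book year="1998"><title>The Third Manifesto</title><author>Darwen</author></book>
</bib>
""".strip()

def as_html_list(root):
    # Presentation rule 1: an HTML bulleted list of titles with years.
    items = "".join(
        f"<li>{book.findtext('title')} ({book.get('year')})</li>"
        for book in root.iter("book")
    )
    return f"<ul>{items}</ul>"

def as_plain_text(root):
    # Presentation rule 2: one "author: title" line per book.
    return "\n".join(
        f"{book.findtext('author')}: {book.findtext('title')}"
        for book in root.iter("book")
    )

root = ET.fromstring(CONTENT)
print(as_html_list(root))
print(as_plain_text(root))
```

Changing how the data looks means changing only a rendering function. The content file stays untouched.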

Document Type Declaration – DTD
• A DTD defines the data content and may provide the data values
• While a DTD is desirable, it is not mandatory
  – XML documents using DTDs are termed type-valid documents
  – XML documents not using DTDs are termed not-type-valid documents

XML & CSS
• Similar to DHTML, Cascading Style Sheets (CSS) may be used with XML documents to present a consistent, standardized Web site.

Extensible Style Language Transformation – XSLT
• XSLT is used to transform one document into another document

XML Schema
• XML Schema is the next generation of DTD
• The schema itself is an XML document
• A W3C standard is currently being developed
• A document that conforms to an XML Schema is termed schema-valid.

XML Schema Concepts
• Simple Elements
  – Consist of a single content value
• Complex Elements
  – Consist of multiple content values

XML Namespaces
• Namespaces define where to look for files
• An XML document may have:
  – Up to one default namespace
  – Many labeled namespaces
• Naming conventions:
  – Must be unique within all schemas
  – Typically resembles a URL, but is not a URL

Wireless Application Protocol (WAP)
• WAP has been developed to facilitate Web development for wireless devices such as Personal Digital Assistants (PDAs) or cellular phones

WAP Server
• A WAP Server transforms XML documents into Wireless Markup Language (WML)
  – WML is an XML-based language
• A WML scripting language also exists

XML and Database Applications
• Any program that can process a DTD or XML Schema document can correctly interpret any arbitrary database view
• XML can easily process multiple multi-valued paths (several SQL statements would be required)

XML and Database Applications (cont’d)
• The separation of structure and content allows for:
  – The same data to be displayed in many different ways
  – The same structure (report) to be regenerated many times with different/updated data
  – Document validation checking

OASIS
• Document structures may be published and made publicly available
• Organization for the Advancement of Structured Information Standards (OASIS):
  – A clearinghouse for XML publications and schema standards

DBMS Integration of XML
• Oracle
  – XML DOM parser
  – XPath
  – XSQL
• SQL Server
  – XML DOM parser
  – XPath
  – ADO
  – ASP

Web-based databases

Types of databases
• Textual databases
• Semi-structured databases

Indexing textual data
• Inverted files
• Boolean queries
• Signature files
• Signature S1 matches signature S2 if S2&S1=S2

XML-QL
Two slides from Johannes Gehrke, Cornell University

    <IMG SRC="xysq.gif" ALT="(x+y)^2">

    <apply>
      <power/>
      <apply> <plus/> <ci>x</ci> <ci>y</ci> </apply>
      <cn>2</cn>
    </apply>

    WHERE <BOOK>
            <NAME><LAST>$l</LAST></NAME>
          </BOOK> IN "www.booklist.com/books.xml"
    CONSTRUCT <RESULT> $l </RESULT>

XML-QL (continued)
    WHERE <BOOK> $b </BOOK> IN "www.booklist.com/books.xml",
          <AUTHOR> $n </AUTHOR>
          <PUBLISHED> $p </PUBLISHED> IN $b
    CONSTRUCT <RESULT>
                <PUBLISHED> $p </PUBLISHED>
                WHERE <LAST> $l </LAST> IN $n
                CONSTRUCT <LAST> $l </LAST>
              </RESULT>

XML-QL (continued)
    <!ELEMENT book (author+, title, publisher)>
    <!ATTLIST book year CDATA>
    <!ELEMENT article (author+, title, year?, (shortversion|longversion))>
    <!ATTLIST article type CDATA>
    <!ELEMENT publisher (name, address)>
    <!ELEMENT author (firstname?, lastname)>

XML-QL (continued)
    WHERE <book>
            <publisher><name>Addison-Wesley</name></publisher>
            <title> $t</title>
            <author> $a</author>
          </book> IN "www.a.b.c/bib.xml"
    CONSTRUCT $a

XML-QL (continued)
    WHERE <book>
            <publisher><name>Addison-Wesley</></>
            <title> $t</>
            <author> $a</>
          </> IN "www.a.b.c/bib.xml"
    CONSTRUCT $a

XML-QL (continued)
    WHERE <book>
            <publisher><name>Addison-Wesley</></>
            <title> $t</>
            <author> $a</>
          </> IN "www.a.b.c/bib.xml"
    CONSTRUCT <result>
                <author> $a</>
                <title> $t</>
              </>

XML-QL (continued)
    <bib>
      <book year="1995">
        <!-- A good introductory text -->
        <title> An Introduction to Database Systems </title>
        <author> <lastname> Date </lastname> </author>
        <publisher> <name> Addison-Wesley </name> </publisher>
      </book>
      <book year="1998">
        <title> Foundation for Object/Relational Databases: The Third Manifesto </title>
        <author> <lastname> Date </lastname> </author>
        <author> <lastname> Darwen </lastname> </author>
        <publisher> <name> Addison-Wesley </name> </publisher>
      </book>
    </bib>

XML-QL (continued)
    <result>
      <author> <lastname> Date </lastname> </author>
      <title> An Introduction to Database Systems </title>
    </result>
    <result>
      <author> <lastname> Date </lastname> </author>
      <title> Foundation for Object/Relational Databases: The Third Manifesto </title>
    </result>
    <result>
      <author> <lastname> Darwen </lastname> </author>
      <title> Foundation for Object/Relational Databases: The Third Manifesto </title>
    </result>

XML-QL (continued)
    WHERE <book> $p</> IN "www.a.b.c/bib.xml",
          <title> $t</>,
          <publisher><name>Addison-Wesley</></> IN $p
    CONSTRUCT <result>
                <title> $t </>
                WHERE <author> $a </> IN $p
                CONSTRUCT <author> $a</>
              </>

XML-QL (continued)
    <result>
      <title> An Introduction to Database Systems </title>
      <author> <lastname> Date </lastname> </author>
    </result>
    <result>
      <title> Foundation for Object/Relational Databases: The Third Manifesto </title>
      <author> <lastname> Date </lastname> </author>
      <author> <lastname> Darwen </lastname> </author>
    </result>

XML-QL (continued)
    WHERE <article>
            <author>
              <firstname> $f </>   // firstname $f
              <lastname> $l </>    // lastname $l
            </>
          </> CONTENT_AS $a IN "www.a.b.c/bib.xml",
          <book year=$y>
            <author>
              <firstname> $f </>   // join on same firstname $f
              <lastname> $l </>    // join on same lastname $l
            </>
          </> IN "www.a.b.c/bib.xml",
          $y > 1995
    CONSTRUCT <article> $a </>

XML-QL (continued)
    <!ATTLIST person ID ID #REQUIRED>
    <!ATTLIST article author IDREFS #IMPLIED>

XML-QL (continued)
    <person ID="o123">
      <firstname>John</firstname>
      <lastname>Smith</lastname>
    </person>
    <person ID="o234">
      ...
    </person>
    <article author="o123 o234">
      <title> ... </title>
      <year> 1995 </year>
    </article>

XML-QL (continued)
    WHERE <article><author><lastname> $n</></></> IN "abc.xml"

    WHERE <article author=$i>
            <title> </> ELEMENT_AS $t
          </>,
          <person ID=$i>
            <lastname> </> ELEMENT_AS $l
          </>
    CONSTRUCT <result> $t $l</>

Scalar values
NOT!
    <title>A Trip to <titlepart> the Moon </titlepart></title>
YES
    <title><CDATA> A Trip to </CDATA><titlepart><CDATA> the Moon</CDATA></titlepart></title>

Tag variables
    WHERE <$p>
            <title> $t </title>
            <year>1995</>
            <$e> Smith </>
          </> IN "www.a.b.c/bib.xml",
          $e IN {author, editor}
    CONSTRUCT <$p>
                <title> $t </title>
                <$e> Smith </>
              </>

Transforming data
    <!ELEMENT book (author+, title, publisher)>
    <!ATTLIST book year CDATA>
    <!ELEMENT article (author+, title, year?, (shortversion|longversion))>
    <!ATTLIST article type CDATA>
    <!ELEMENT publisher (name, address)>
    <!ELEMENT author (firstname?, lastname)>

    <!ELEMENT person (lastname, firstname, address?, phone?, publicationtitle*)>

Transforming data (cont’d)
    WHERE <$>
            <author>
              <firstname> $fn </>
              <lastname> $ln </>
            </>
            <title> $t </>
          </> IN "www.a.b.c/bib.xml"
    CONSTRUCT <person ID=PersonID($fn, $ln)>
                <firstname> $fn </>
                <lastname> $ln </>
                <publicationtitle> $t </>
              </>

Integrating data from different sources
    WHERE <person>
            <name></> ELEMENT_AS $n
            <ssn> $ssn</>
          </> IN "www.a.b.c/data.xml",
          <taxpayer>
            <ssn> $ssn</>
            <income></> ELEMENT_AS $i
          </> IN "www.irs.gov/taxpayers.xml"
    CONSTRUCT <result> $n $i </>

Query blocks
    WHERE <$e>
            <title> $t </>
            <year> 1995 </>
          </> CONTENT_AS $p IN "www.a.b.c/bib.xml"
    CONSTRUCT <result ID=ResultID($p)> <title> $t </> </>
    { WHERE $e = "journal-paper", <month> $m </> IN $p
      CONSTRUCT <result ID=ResultID($p)> <month> $m </> </> }
    { WHERE $e = "book", <publisher>$q </> IN $p
      CONSTRUCT <result ID=ResultID($p)> <publisher>$q </> </> }

WSQ

Web-supported queries
SIGMOD 2000 (Goldman and Widom)
    WebPages (SearchExp, T1, T2, …, Tn, URL, Rank, Date)

    SELECT NAME, COUNT
    FROM STATES, WEBCOUNT
    WHERE NAME = T1
    ORDER BY COUNT DESC

KDD: Data Mining

The big problem
• Billions of records
• A small number of interesting patterns
• “Data rich but information poor”

Data mining
• Knowledge discovery
• Knowledge extraction
• Data/pattern analysis

Types of source data
• Relational databases
• Transactional databases
• Web logs
• Textual databases

Association rules
• 65% of all customers who buy beer and tomato sauce also buy pasta and chicken wings
• Association rules: X ⇒ Y

Association analysis
• IF 20 < age < 30 AND 20K < income < 30K
• THEN
  – Buys (“CD player”)
• SUPPORT = 2%, CONFIDENCE = 60%

Basic concepts
• Minimum support threshold
• Minimum confidence threshold
• Itemsets
• Occurrence frequency of an itemset

Association rule mining
• Find all frequent itemsets
• Generate strong association rules from the frequent itemsets

Support and confidence
• Support (X)
• $\mathrm{Confidence}(X \Rightarrow Y) = \mathrm{Support}(X \cup Y) / \mathrm{Support}(X)$

Example
    TID   List of item IDs
    T100  I1, I2, I5
    T200  I2, I4
    T300  I2, I3
    T400  I1, I2, I4
    T500  I1, I3
    T600  I2, I3
    T700  I1, I3
    T800  I1, I2, I3, I5
    T900  I1, I2, I3

Example (cont’d)
• Frequent itemset l = {I1, I2, I5}
• I1 AND I2 ⇒ I5   C = 2/4 = 50%
• I1 AND I5 ⇒ I2
• I2 AND I5 ⇒ I1
• I1 ⇒ I2 AND I5
• I2 ⇒ I1 AND I5
• I5 ⇒ I1 AND I2

Example 2
    TID   date      items
    T100  10/15/99  {K, A, D, B}
    T200  10/15/99  {D, A, C, E, B}
    T300  10/19/99  {C, A, B, E}
    T400  10/22/99  {B, A, D}
min_sup = 60%, min_conf = 80%

Correlations
• $\mathrm{Corr}(A, B) = P(A \wedge B) / (P(A)\,P(B))$
• If Corr < 1: A discourages B (negative correlation)
• (lift of the association rule A ⇒ B)

Contingency table
             Game    ^Game   Sum
    Video    4,000   3,500   7,500
    ^Video   2,000     500   2,500
    Sum      6,000   4,000  10,000

Example
• P({game}) = 0.60
• P({video}) = 0.75
• P({game,video}) = 0.40
• P({game,video}) / (P({game}) × P({video})) = 0.40 / (0.60 × 0.75) = 0.89

Example 2
                   hotdogs   ^hotdogs   Sum
    hamburgers     2000      500        2500
    ^hamburgers    1000      1500       2500
    Sum            3000      2000       5000

Classification using decision trees
• Expected information needed to classify a sample:
  $I(s_1, s_2, \dots, s_m) = -\sum_{i=1}^{m} p_i \log_2 p_i$, where $p_i = s_i / s$
• s = data samples
• m = number of classes

    RID  Age       Income  Student  Credit     Buys?
    1    <= 30     High    No       Fair       No
    2    <= 30     High    No       Excellent  No
    3    31 .. 40  High    No       Fair       Yes
    4    > 40      Medium  No       Fair       Yes
    5    > 40      Low     Yes      Fair       Yes
    6    > 40      Low     Yes      Excellent  No
    7    31 .. 40  Low     Yes      Excellent  Yes
    8    <= 30     Medium  No       Fair       No
    9    <= 30     Low     Yes      Fair       Yes
    10   > 40      Medium  Yes      Fair       Yes
    11   <= 30     Medium  Yes      Excellent  Yes
    12   31 .. 40  Medium  No       Excellent  Yes
    13   31 .. 40  High    Yes      Fair       Yes
    14   > 40      Medium  No       Excellent  No

Decision tree induction
• $I(s_1, s_2) = I(9, 5) = -\frac{9}{14} \log_2 \frac{9}{14} - \frac{5}{14} \log_2 \frac{5}{14} = 0.940$

Entropy and information gain
• $E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \dots + s_{mj}}{s} \, I(s_{1j}, \dots, s_{mj})$, where v is the number of subsets produced by A
• Entropy = expected information based on the partitioning into subsets by A
• $\mathrm{Gain}(A) = I(s_1, s_2, \dots, s_m) - E(A)$

Entropy
• Age <= 30: s11 = 2, s21 = 3, I(s11, s21) = 0.971
• Age in 31 .. 40: s12 = 4, s22 = 0, I(s12, s22) = 0
• Age > 40: s13 = 3, s23 = 2, I(s13, s23) = 0.971

Entropy (cont’d)
• E(age) = 5/14 I(s11, s21) + 4/14 I(s12, s22) + 5/14 I(s13, s23) = 0.694
• Gain(age) = I(s1, s2) - E(age) = 0.246
• Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit) = 0.048

Final decision tree
[Figure: decision tree with root "age". age <= 30: test "student" (no: no; yes: yes). age 31 .. 40: yes. age > 40: test "credit" (excellent: no; fair: yes).]

Other techniques
• Bayesian classifiers
• X: age <= 30, income = medium, student = yes, credit = fair
• P(yes) = 9/14 = 0.643
• P(no) = 5/14 = 0.357

Example
• P(age <= 30 | yes) = 2/9 = 0.222
• P(age <= 30 | no) = 3/5 = 0.600
• P(income = medium | yes) = 4/9 = 0.444
• P(income = medium | no) = 2/5 = 0.400
• P(student = yes | yes) = 6/9 = 0.667
• P(student = yes | no) = 1/5 = 0.200
• P(credit = fair | yes) = 6/9 = 0.667
• P(credit = fair | no) = 2/5 = 0.400

Example (cont’d)
• P(X | yes) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
• P(X | no) = 0.600 x 0.400 x 0.200 x 0.400 = 0.019
• P(X | yes) P(yes) = 0.044 x 0.643 = 0.028
• P(X | no) P(no) = 0.019 x 0.357 = 0.007
• Answer: yes/no?

More types of data mining
• Classification and prediction
• Cluster analysis
• Outlier analysis
• Evolution analysis
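
The decision-tree and Bayesian calculations above can be checked with a short script. This is a sketch, not the method of any particular tool: it assumes base-2 logarithms, encodes the 14 training samples from the table by hand, and uses hypothetical function names (`info`, `gain`, `naive_bayes_score`). It computes the information gain of each attribute and the two naive Bayes scores P(X | class) P(class) for the sample X.

```python
from collections import Counter
from math import log2

# The 14 training samples from the table: (age, income, student, credit, buys).
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]
ATTRS = {"age": 0, "income": 1, "student": 2, "credit": 3}

def info(labels):
    """Expected information I(s1, ..., sm) of a list of class labels, in bits."""
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())

def gain(attr):
    """Information gain obtained by partitioning ROWS on one attribute."""
    col = ATTRS[attr]
    subsets = {}
    for row in ROWS:
        subsets.setdefault(row[col], []).append(row[-1])
    expected = sum(len(s) / len(ROWS) * info(s) for s in subsets.values())
    return info([row[-1] for row in ROWS]) - expected

def naive_bayes_score(sample, label):
    """P(X | label) * P(label), treating attributes as conditionally independent."""
    in_class = [row for row in ROWS if row[-1] == label]
    score = len(in_class) / len(ROWS)
    for attr, value in sample.items():
        col = ATTRS[attr]
        score *= sum(1 for row in in_class if row[col] == value) / len(in_class)
    return score

X = {"age": "<=30", "income": "medium", "student": "yes", "credit": "fair"}
gains = {a: round(gain(a), 3) for a in ATTRS}
scores = {label: round(naive_bayes_score(X, label), 3) for label in ("yes", "no")}
print(gains)
print(scores)
```

The script rounds where the slides appear to truncate, so a gain may differ from the slide value in the third decimal place. The attribute with the largest gain is the one chosen as the root of the tree, and the class with the larger score is the naive Bayes prediction for X.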