COP5725 Advanced Database Systems, Spring 2016
Data Mining
Tallahassee, Florida, 2016

Why Data Mining?
• The explosive growth of data: from terabytes to petabytes
– Data collection and data availability
• Automated data collection tools, database systems, the Web, a computerized society
– Major sources of abundant data
• Business: Web, e-commerce, transactions, stocks, …
• Science: remote sensing, bioinformatics, scientific simulation, …
• Society and everyone: news, digital cameras, YouTube
• We are drowning in data, but starving for knowledge!

What is Data Mining?
• Data mining (knowledge discovery from data)
– Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
– Data mining: a misnomer?
• Alternative names
– Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
• Watch out: is everything "data mining"?
– Simple search and query processing
– (Deductive) expert systems

Knowledge Discovery (KDD) Process
• This is a view from typical database systems and data warehousing communities
• Data mining plays an essential role in the knowledge discovery process
• [Figure: the KDD pipeline: databases → data cleaning and data integration → data warehouse → selection of task-relevant data → data mining → pattern evaluation]

KDD Process: A Typical View from ML and Statistics
• Input data → data pre-processing (data integration, normalization, feature selection, dimension reduction) → data mining (pattern discovery, association and correlation, classification, clustering, outlier analysis, …) → post-processing (pattern evaluation, pattern selection, pattern interpretation, pattern visualization)

Data Mining in Business Intelligence
• [Figure: a pyramid of increasing potential to support business decisions: data sources (paper, files, Web documents, scientific experiments, database systems; the DBA) → data preprocessing/integration and data warehouses → data exploration (statistical summary, querying, and reporting; the data analyst) → data mining (information discovery) → data presentation (visualization techniques; the business analyst) → decision making (the end user)]

Data Mining: Confluence of Multiple Disciplines
• Data mining draws on machine learning, pattern recognition, statistics, visualization, algorithms, database technology, high-performance computing, and applications

Data Mining: On What Kinds of Data?
• Database-oriented data sets and applications
– Relational databases, data warehouses, transactional databases
• Advanced data sets and advanced applications
– Data streams and sensor data
– Time-series data, temporal data, sequence data
– Structured data, graphs, and social networks
– Spatial data and spatiotemporal data
– Multimedia databases
– Text databases
– The World-Wide Web

Association and Correlation Analysis
• Frequent pattern (or frequent itemset) mining
– What items are frequently purchased together in your Walmart shopping cart?
• Association and correlation vs. causality
– E.g., Diaper → Beer [0.5%, 75%] (support, confidence)
– Are strongly associated items also strongly correlated?
• How can such patterns and rules be mined efficiently in large datasets?
• How can such patterns be used for classification, clustering, and other applications?
Classification
• Classification and label prediction
– Construct models (functions) based on some training examples
– Describe and distinguish classes or concepts for future prediction
• E.g., classify countries based on climate, or classify cars based on gas mileage
– Predict some unknown class labels
• Typical methods
– Decision trees, naïve Bayesian classification, SVM, neural networks, rule-based classification, pattern-based classification, logistic regression, …
• Typical applications
– Credit card fraud detection, direct marketing, classifying stars, diseases, web pages, …

Example
• Deep Blue beat Kasparov at chess in 1997. Watson beat the brightest trivia minds at Jeopardy! in 2011. Can you tell Fido from Mittens in 2013? https://www.kaggle.com/c/dogs-vs-cats
• [Figures: example cat and dog photos: Cat, Dog, Cat, Dog. What about this?]

Clustering and Outlier Analysis
• Unsupervised learning (i.e., the class label is unknown)
– Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns
– Principle: maximize intra-class similarity and minimize inter-class similarity
• Outlier analysis
– An outlier is a data object that does not comply with the general behavior of the data
– Noise or exception? One person's garbage could be another person's treasure

Structure and Network Analysis
• Graph mining
– Finding frequent subgraphs (e.g., chemical compounds), trees (XML), substructures (web fragments)
• Network analysis
– Social networks: actors (objects, nodes) and relationships (edges)
• e.g., author networks in CS, terrorist networks
– Multiple heterogeneous networks
• A person can appear in multiple information networks: friends, family, classmates, …
– Links carry a lot of semantic information: link mining
• Web mining
– The Web is a big information network: from PageRank to Google
• Web community discovery, opinion mining, usage mining, …

Top 10 Data Mining Algorithms
1. C4.5: decision-tree-based classification
2. K-Means: clustering
3. SVM: classification and regression
4. Apriori: frequent pattern mining
5. EM: MLE/MAP estimation, parameter estimation
6. PageRank: link analysis and ranking
7. AdaBoost: classification
8. kNN: classification and regression
9. Naive Bayes: classification
10. CART: classification and regression
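To make two of these algorithms concrete, here is a minimal sketch using scikit-learn (an assumption; the course does not prescribe a library, and a hand-rolled implementation would work equally well): a decision tree for the classification task above and k-means for clustering.

    # Minimal sketch, assuming scikit-learn is available; illustrative only.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.cluster import KMeans

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Classification: build a model from training examples, predict unseen labels
    clf = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
    print("decision-tree accuracy:", clf.score(X_test, y_test))

    # Clustering: group the data into k clusters without using the labels
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    print("k-means cluster sizes:", [list(km.labels_).count(c) for c in range(3)])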
What Is Frequent Pattern Analysis?
• Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
• Motivation: finding inherent regularities in data
– What products were often purchased together? Beer and diapers?!
– What are the subsequent purchases after buying a PC?
– What kinds of DNA are sensitive to this new drug?
– Can we automatically classify web documents?
• Applications
– Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis

Basic Concepts: Frequent Patterns
• Itemset: a set of one or more items
• k-itemset: X = {x1, …, xk}
• (Absolute) support, or support count, of X: the frequency (number of occurrences) of itemset X
• (Relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
• An itemset X is frequent if X's support is no less than a minsup threshold
• [Figure: Venn diagram of customers who buy diapers, customers who buy beer, and customers who buy both]

Basic Concepts: Association Rules

Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

• Find all the rules X → Y with minimum support and confidence
– Support, s: the probability that a transaction contains X ∪ Y
– Confidence, c: the conditional probability that a transaction containing X also contains Y
• Let minsup = 50%, minconf = 50%
– Frequent patterns: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
– Association rules (among many more): Beer → Diaper (60%, 100%), Diaper → Beer (60%, 75%)

Association Rule Mining
• Given a set of transactions T, the goal of association rule mining is to find all rules having
– support ≥ the minsup threshold
– confidence ≥ the minconf threshold
• Brute-force approach
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
– Computationally prohibitive!

Mining Association Rules

TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke

Example rules:
{Milk, Diaper} → {Beer}   (s=0.4, c=0.67)
{Milk, Beer} → {Diaper}   (s=0.4, c=1.0)
{Diaper, Beer} → {Milk}   (s=0.4, c=0.67)
{Beer} → {Milk, Diaper}   (s=0.4, c=0.67)
{Diaper} → {Milk, Beer}   (s=0.4, c=0.5)
{Milk} → {Diaper, Beer}   (s=0.4, c=0.5)

• Observations
– All of the rules above are binary partitions of the same itemset: {Milk, Diaper, Beer}
– Rules originating from the same itemset have identical support but can have different confidence
– Thus, we may decouple the support and confidence requirements

Frequent Itemset Mining
• Given d items, there are 2^d possible candidate itemsets
• [Figure: the itemset lattice over {A, B, C, D, E}, from the empty (null) itemset at the top down to ABCDE at the bottom]
• Brute-force approach
1. Each itemset in the lattice is a candidate frequent itemset
2. Count the support of each candidate by scanning the database, matching each transaction against every candidate
– Time complexity: O(N · M · w), where N is the number of transactions, M the number of candidates, and w the average transaction width; expensive, since M = 2^d

Computational Complexity of Association Rule Mining
• Given d unique items:
– Total number of itemsets = 2^d
– Total number of possible association rules:
  R = \sum_{k=1}^{d-1} \binom{d}{k} \sum_{j=1}^{d-k} \binom{d-k}{j} = 3^d - 2^{d+1} + 1
– If d = 6, R = 602 rules

Closed Patterns and Max-Patterns
• A long pattern contains a combinatorial number of sub-patterns
– e.g., {a1, …, a100} contains \binom{100}{1} + \binom{100}{2} + … + \binom{100}{100} = 2^{100} − 1 ≈ 1.27 × 10^{30} sub-patterns!
• Solution: mine closed patterns and max-patterns instead
• An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X
• An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X
• A closed pattern is a lossless compression of the frequent patterns
– It reduces the number of patterns and rules
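The support and confidence definitions above are easy to compute directly on a small database. The sketch below (illustrative, standard library only) brute-forces every candidate itemset for the five-transaction table above and checks the rule {Milk, Diaper} → {Beer}; it is exactly the O(N·M·w) approach whose cost motivates Apriori.

    # Minimal sketch: brute-force support counting on the 5-transaction example.
    from itertools import combinations

    transactions = [
        {"Bread", "Milk"},
        {"Bread", "Diaper", "Beer", "Eggs"},
        {"Milk", "Diaper", "Beer", "Coke"},
        {"Bread", "Milk", "Diaper", "Beer"},
        {"Bread", "Milk", "Diaper", "Coke"},
    ]

    def support(itemset):
        """Relative support: fraction of transactions containing the itemset."""
        return sum(itemset <= t for t in transactions) / len(transactions)

    # Enumerate all 2^d - 1 non-empty candidate itemsets (feasible only for tiny d)
    items = sorted(set().union(*transactions))
    frequent = {}
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            s = support(set(cand))
            if s >= 0.4:                              # minsup = 40%
                frequent[frozenset(cand)] = s

    s_rule = support({"Milk", "Diaper", "Beer"})      # support of X union Y
    conf = s_rule / support({"Milk", "Diaper"})       # sup(X union Y) / sup(X)
    print(len(frequent), "frequent itemsets")
    print(f"{{Milk, Diaper}} -> {{Beer}}: s={s_rule}, c={conf:.2f}")
    # Matches the slide: s = 0.4, c is approximately 0.67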
Closed Patterns and Max-Patterns: An Exercise
• DB = {<a1, …, a100>, <a1, …, a50>}, min_sup = 1
• What is the set of closed itemsets?
– <a1, …, a100>: 1
– <a1, …, a50>: 2
• What is the set of max-patterns?
– <a1, …, a100>: 1
• What is the set of all frequent patterns?
– Every non-empty subset of {a1, …, a100}: 2^100 − 1 patterns, far too many to enumerate!

Maximal Frequent Itemset
• An itemset is maximal frequent if none of its immediate supersets is frequent
• [Figure: the itemset lattice with the border separating frequent from infrequent itemsets; the maximal itemsets are the frequent itemsets lying just inside the border]

Closed Itemset
• An itemset is closed if none of its immediate supersets has the same support as the itemset
• Example:

TID   Items
1     {A, B}
2     {B, C, D}
3     {A, B, C, D}
4     {A, B, D}
5     {A, B, C, D}

Itemset   Support      Itemset        Support
{A}       4            {A, B, C}      2
{B}       5            {A, B, D}      3
{C}       3            {A, C, D}      2
{D}       4            {B, C, D}      3
{A, B}    4            {A, B, C, D}   2
{A, C}    2
{A, D}    3
{B, C}    3
{B, D}    4
{C, D}    3

Maximal vs. Closed Itemsets
• [Figure: the itemset lattice over {A, B, C, D, E} for the transaction database TID 1: ABC, 2: ABCD, 3: BCE, 4: ACDE, 5: DE, with each node annotated by the ids of the transactions that contain it; itemsets supported by no transaction are marked]
• With minimum support = 2 on this database, an itemset may be closed but not maximal, or both closed and maximal
– Number of closed frequent itemsets = 9; number of maximal frequent itemsets = 4
• In general: maximal frequent itemsets ⊆ closed frequent itemsets ⊆ frequent itemsets

Mining Frequent Itemsets
• The downward closure property of frequent patterns
– Any subset of a frequent itemset must be frequent
• If {beer, diaper, nuts} is frequent, so is {beer, diaper}
• i.e., every transaction containing {beer, diaper, nuts} also contains {beer, diaper}
• Apriori pruning principle: if any itemset is infrequent, its supersets need not be generated or tested!
• A candidate generation-and-test approach:
1. Initially, scan the DB once to get the frequent 1-itemsets
2. Generate length-(k+1) candidate itemsets from the length-k frequent itemsets
3. Test the candidates against the DB
4. Terminate when no frequent or candidate set can be generated

Illustrating the Apriori Principle
• [Figure: the itemset lattice; once an itemset (e.g., {A, B}) is found to be infrequent, all of its supersets are pruned]

The Apriori Algorithm: An Example
• [Figure: with sup_min = 2, the first scan of the database yields C1 and L1, the second scan yields C2 and L2, and the third scan yields C3 and L3]

The Apriori Algorithm
• Ck: candidate itemsets of size k; Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk ≠ ∅; k++) do begin
    Ck+1 = candidates generated from Lk;         // candidate generation
    for each transaction t in the database do    // frequency counting
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
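The sketch below (illustrative, standard library only; not the course's reference code) follows this pseudocode directly: candidates are generated by self-joining Lk and pruned with the Apriori principle, then counted in one database pass per level. The self-join and pruning steps are the ones detailed in the next slide.

    # Minimal sketch of Apriori, following the pseudocode above (illustrative).
    from itertools import combinations

    def apriori(transactions, min_support):
        """Return {frozenset: absolute support count} for all frequent itemsets."""
        def count(candidates):
            counts = {c: 0 for c in candidates}
            for t in transactions:                       # one DB pass per level
                for c in candidates:
                    if c <= t:
                        counts[c] += 1
            return {c: n for c, n in counts.items() if n >= min_support}

        items = set().union(*transactions)
        frequent = count({frozenset([i]) for i in items})   # L1
        result, k = dict(frequent), 1
        while frequent:
            prev = list(frequent)
            # Candidate generation: self-join Lk, then Apriori pruning
            candidates = set()
            for a in prev:
                for b in prev:
                    union = a | b
                    if len(union) == k + 1 and all(
                            frozenset(sub) in frequent
                            for sub in combinations(union, k)):
                        candidates.add(union)
            frequent = count(candidates)                 # Lk+1
            result.update(frequent)
            k += 1
        return result

    db = [{"Bread", "Milk"}, {"Bread", "Diaper", "Beer", "Eggs"},
          {"Milk", "Diaper", "Beer", "Coke"}, {"Bread", "Milk", "Diaper", "Beer"},
          {"Bread", "Milk", "Diaper", "Coke"}]
    freq = apriori([frozenset(t) for t in db], min_support=2)
    for itemset, sup in sorted(freq.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
        print(set(itemset), sup)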
Implementation of Apriori
• How to generate candidates?
– Step 1: self-join Lk
– Step 2: pruning
• Example of candidate generation
– L3 = {abc, abd, acd, ace, bcd}
– Self-joining L3 * L3:
• abcd from abc and abd
• acde from acd and ace
– Pruning:
• acde is removed because ade is not in L3
– C4 = {abcd}

Candidate Generation in SQL
• Suppose the items in Lk-1 are listed in a fixed order
• Step 1: self-join Lk-1

insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1 and … and p.itemk-2 = q.itemk-2 and p.itemk-1 < q.itemk-1

• Step 2: pruning

forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
        if (s is not in Lk-1) then delete c from Ck

Factors Affecting Complexity
• Choice of the minimum support threshold
– Lowering the support threshold results in more frequent itemsets
– This may increase the number of candidates and the maximum length of frequent itemsets
• Dimensionality (number of items) of the data set
– More space is needed to store the support count of each item
– If the number of frequent items also increases, both computation and I/O costs may increase
• Size of the database
– Since Apriori makes multiple passes, run time may grow with the number of transactions
• Average transaction width
– Transaction width increases with denser data sets
– This may increase the maximum length of frequent itemsets, and the number of subsets in a transaction grows with its width

Further Improvement of Apriori
• Major computational challenges
– Multiple scans of the transaction database
– A huge number of candidates
– Tedious support counting for the candidates
• Improving Apriori: general ideas
– Reduce the number of transaction-database scans
– Shrink the number of candidates
– Facilitate support counting of candidates

Partition: Scan the Database Only Twice
• Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
– Scan 1: partition the database and find the local frequent patterns
– Scan 2: consolidate the global frequent patterns
• Contrapositive: if sup1(i) < σ|DB1|, sup2(i) < σ|DB2|, …, and supk(i) < σ|DBk|, then sup(i) < σ|DB|; an itemset that is infrequent in every partition is infrequent globally

Generating Association Rules
• How can rules be generated efficiently from frequent itemsets?
– In general, confidence does not have an anti-monotone property
• c(ABC → D) can be larger or smaller than c(AB → D)
– But the confidence of rules generated from the same itemset does have an anti-monotone property
• e.g., for L = {A, B, C, D}: c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
– Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule

Example
• If {A, B, C, D} is a frequent itemset, the candidate rules are:
– A → BCD, B → ACD, C → ABD, D → ABC, AB → CD, AC → BD, AD → BC, BC → AD, BD → AC, CD → AB, ABC → D, ABD → C, ACD → B, BCD → A
• If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L → ∅ and ∅ → L)

Generating Association Rules
• Given a frequent itemset Z, we look at all proper subsets X ⊂ Z to compute rules of the form X → Y, where Y = Z \ X
• The rule is automatically frequent: s = sup(X ∪ Y) = sup(Z) ≥ minsup
• We compute the confidence as c = sup(X ∪ Y) / sup(X) = sup(Z) / sup(X)
– If c ≥ minconf, the rule is a strong association rule
– Otherwise, conf(W → Z \ W) ≤ c < minconf for every subset W ⊂ X as well, because sup(W) ≥ sup(X); we can thus avoid checking the subsets of X
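This pruning rule translates into a short procedure, sketched below (illustrative; it assumes a dictionary like the freq returned by the apriori() sketch above, mapping each frequent itemset to its support count). Antecedents are examined from largest to smallest, and once an antecedent fails the confidence test, all of its subsets are skipped.

    # Minimal sketch of rule generation with confidence-based pruning (illustrative).
    from itertools import combinations

    def generate_rules(freq, min_conf):
        """freq: {frozenset: support count}; returns (antecedent, consequent, conf) triples."""
        rules = []
        for Z, sup_Z in freq.items():
            if len(Z) < 2:
                continue
            failed = set()                                # antecedents that failed
            for size in range(len(Z) - 1, 0, -1):         # largest antecedents first
                for X in map(frozenset, combinations(Z, size)):
                    if any(X <= bigger for bigger in failed):
                        continue                          # a superset of X already failed
                    conf = sup_Z / freq[X]                # sup(Z) / sup(X)
                    if conf >= min_conf:
                        rules.append((set(X), set(Z - X), conf))
                    else:
                        failed.add(X)                     # its subsets will fail too
        return rules

    for lhs, rhs, c in generate_rules(freq, min_conf=0.6):
        print(lhs, "->", rhs, f"(conf={c:.2f})")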
Example
• [Figure: the lattice of rules generated from the frequent itemset {A, B, C, D}, from ABCD → {} at the top down to A → BCD, B → ACD, … at the bottom. Once a rule such as BCD → A is found to have low confidence, all rules below it with smaller antecedents (CD → AB, BD → AC, D → ABC, …) are pruned]

Association Rule Mining Algorithm
• F: the set of frequent itemsets
• For each Z in F, A: all proper subsets of Z, considered as candidate antecedents
• X: start from the largest subset; output X → Z \ X if it is a strong association rule
• If X fails the confidence test, all of its subsets fail as well and can be skipped

Finding Similar Patterns
• Many data mining problems can be expressed as finding "similar" patterns:
1. Web pages with similar words, e.g., for classification by topic
2. Netflix users with similar tastes in movies, for recommender systems
– Dual: movies with similar sets of fans
3. Images of related things
• The best techniques depend on whether you are looking for items that are somewhat similar (shingling and minhashing) or very similar (locality-sensitive hashing)

Example
• Problem: comparing documents
– Goal: common text, not common topic
– Special case: identical documents, or one document contained character-by-character in another
– General case: many small pieces of one document appear out of order in another
– Applications:
• Mirror sites, or approximate mirrors: we don't want to show both in a search
• Plagiarism, including large quotations
• Similar news articles at many news sites: cluster articles by "same story"

Three Essential Techniques for Similar-Pattern Discovery
• Shingling: convert documents, emails, etc., to sets (the set of strings of length k that appear in the document)
• Minhashing: convert large sets to short signatures (short integer vectors that represent the sets and reflect their similarity), while preserving similarity
• Locality-sensitive hashing: focus on candidate pairs, the pairs of signatures that we actually need to test for similarity
• Pipeline: document → shingling → sets → minhashing → signatures → locality-sensitive hashing → candidate pairs

Shingles
• A k-shingle (or k-gram) for a document is a sequence of k characters that appears in the document
– Example: k = 2; doc = abcab. Set of 2-shingles = {ab, bc, ca}
– Option: regard the shingles as a bag, and count ab twice
– Represent a document by its set of k-shingles
• Assumption
– Documents that have lots of shingles in common have similar text, even if the text appears in a different order
• You must pick k large enough, or most documents will share most shingles
– k = 5 is OK for short documents; k = 10 is better for long documents

Jaccard Similarity
• The Jaccard similarity of two sets is the size of their intersection divided by the size of their union
– Sim(C1, C2) = |C1 ∩ C2| / |C1 ∪ C2|
• Boolean matrices
– Rows = elements of the universal set
– Columns = sets
– 1 in row e and column S if and only if e is a member of S
– Column similarity is the Jaccard similarity of the sets of rows in which the columns have 1

Example

      C1   C2
A     0    1
B     1    0
C     1    1
D     0    0
E     1    1
F     0    1

Sim(C1, C2) = 2/5 = 0.4 (two rows where both columns have 1, five rows where at least one does)

Signatures
• Problems:
1. The sets may be so large, or so many, that they cannot fit in main memory
2. There may be so many sets that comparing all pairs takes too much time
• Key idea: "hash" each column C to a small signature Sig(C), such that:
1. Sig(C) is small enough that a signature for every column fits in main memory
2. Sim(C1, C2) is (approximately) the same as the "similarity" of Sig(C1) and Sig(C2)

MinHashing
1. Imagine the rows permuted randomly
2. Define the "hash" function h(C) = the number of the first row (in the permuted order) in which column C has a 1
3. Use several (e.g., 100) independent permutations (hash functions) to create a signature
• Surprising property
– The probability (over all permutations of the rows) that h(C1) = h(C2) is exactly Sim(C1, C2)
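A minimal sketch of these ingredients on actual strings (illustrative; it approximates true row permutations with salted built-in hashes, a standard shortcut, and none of the names below come from the course materials):

    # Minimal sketch: k-shingles, Jaccard similarity, and minhash signatures.
    import random

    def shingles(doc, k=5):
        """Set of k-character shingles of a document."""
        return {doc[i:i + k] for i in range(len(doc) - k + 1)}

    def jaccard(s1, s2):
        return len(s1 & s2) / len(s1 | s2)

    def minhash_signature(shingle_set, num_hashes=100, seed=0):
        rng = random.Random(seed)
        salts = [rng.getrandbits(32) for _ in range(num_hashes)]
        # The minimum of a salted hash stands in for "first row with a 1"
        # under the i-th random permutation of the rows
        return [min(hash((s, salt)) for s in shingle_set) for salt in salts]

    def signature_similarity(sig1, sig2):
        """Fraction of minhash positions on which the signatures agree."""
        return sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)

    d1 = "the quick brown fox jumps over the lazy dog"
    d2 = "the quick brown fox jumped over a lazy dog"
    s1, s2 = shingles(d1), shingles(d2)
    print("Jaccard similarity of shingle sets:", round(jaccard(s1, s2), 3))
    print("Minhash estimate:",
          signature_similarity(minhash_signature(s1), minhash_signature(s2)))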
Example

Input matrix:
      C1  C2  C3
R1    1   0   1
R2    0   1   1
R3    1   0   0
R4    1   0   1
R5    0   1   0

Three permutations of the rows and the resulting signature matrix:
                        S1  S2  S3
Perm 1 = (1,2,3,4,5):   1   2   1
Perm 2 = (5,4,3,2,1):   4   5   4
Perm 3 = (3,4,5,1,2):   3   5   4

Similarities:
            1-2    1-3    2-3
Col-Col     0      0.5    0.25
Sig-Sig     0      0.67   0

Locality-Sensitive Hashing
• Checking all pairs is hard
– While the signatures of all columns may fit in main memory, comparing the signatures of all pairs of columns is quadratic in the number of columns
• General idea
– Use a function f(x, y) that tells whether or not x and y are a candidate pair
– For minhash matrices: hash columns into many buckets, and make the elements of the same bucket candidate pairs

Partition into Bands
• [Figure: the signature matrix M divided into b bands of r rows each; each column of M is one signature]
• Divide matrix M into b bands of r rows
• For each band, hash its portion of each column to a hash table with k buckets
– Make k as large as possible
• Candidate column pairs are those that hash to the same bucket for ≥ 1 band
• Tune b and r to catch most similar pairs, but few non-similar pairs

Example
• [Figure: one band of matrix M hashed into buckets; columns 2 and 6 land in the same bucket, so they are probably identical, while columns 6 and 7 land in different buckets, so they are surely different in this band]

Example
• Suppose there are 100,000 columns and signatures of 100 integers, so the signatures take 40 MB. We want all 80%-similar pairs
– 5,000,000,000 pairs of signatures can take a while to compare
– Choose 20 bands of 5 integers per band, and suppose C1 and C2 are 80% similar
• Probability that C1 and C2 are identical in one particular band: (0.8)^5 ≈ 0.328
• Probability that C1 and C2 are not identical in any of the 20 bands: (1 − 0.328)^20 ≈ 0.00035
• i.e., only about 1/3000th of the 80%-similar column pairs are false negatives

Parameter Setting
• For two sets with similarity s, the probability of sharing a bucket in at least one band is
  1 − (1 − s^r)^b
– s^r: the probability that all r rows of a given band are equal
– (1 − s^r)^b: the probability that no band is identical, i.e., every band has some unequal row
• Plotted against s, this is an S-curve; the threshold where it rises steeply is approximately (1/b)^{1/r}
• Example: b = 20, r = 5 gives a threshold s = (1/20)^{1/5} ≈ 0.5493
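The banding step is little more than one dictionary per band, as in the sketch below (illustrative; it reuses the minhash_signature() helper from the earlier sketch, and the function names are not from the course materials). The last lines evaluate the 1 − (1 − s^r)^b curve for the b = 20, r = 5 setting above.

    # Minimal sketch of LSH banding over minhash signatures (illustrative).
    from collections import defaultdict
    from itertools import combinations

    def lsh_candidate_pairs(signatures, b=20, r=5):
        """signatures: {doc_id: list of b*r minhash values}.
        Returns pairs of ids that share a bucket in at least one band."""
        candidates = set()
        for band in range(b):
            buckets = defaultdict(list)              # one hash table per band
            for doc_id, sig in signatures.items():
                portion = tuple(sig[band * r:(band + 1) * r])
                buckets[portion].append(doc_id)      # hash the band's portion
            for ids in buckets.values():
                candidates.update(combinations(sorted(ids), 2))
        return candidates

    def candidate_probability(s, b=20, r=5):
        """Probability that a pair with similarity s becomes a candidate pair."""
        return 1 - (1 - s ** r) ** b

    for s in (0.2, 0.4, 0.55, 0.8):
        print(f"s = {s:.2f}: P(candidate) = {candidate_probability(s):.3f}")
    # With b = 20 and r = 5 the curve rises steeply near s = (1/20)^(1/5), about 0.55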