Tutorial Document of ITM638 Data Warehousing and Data Mining
Dr. Chutima Beokhaimook, 24th March 2012

DATA WAREHOUSES AND OLAP TECHNOLOGY

What is a Data Warehouse?
Data warehouses have been defined in many ways. "A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process" – W.H. Inmon. The four keywords are: subject-oriented, integrated, time-variant and non-volatile.

So, What is Data Warehousing?
Data warehousing is the process of constructing and using data warehouses. Constructing a data warehouse involves data integration, data cleaning and data consolidation. The utilization of a data warehouse requires a collection of decision support technologies. These allow knowledge workers (e.g. managers, analysts and executives) to use the data warehouse to obtain an overview of the data and to make decisions based on the information in the warehouse. The term "warehouse DBMS" refers to the management and utilization of a data warehouse.

Operational Databases vs. Data Warehouses
Operational DBMSs support OLTP (on-line transaction processing): the day-to-day operations of an organization such as purchasing, inventory, manufacturing, banking, etc. Data warehouses support OLAP (on-line analytical processing): they serve users or knowledge workers in the role of data analysis and decision making, and can organize and present data in various formats.

OLTP vs. OLAP
Feature              OLTP                              OLAP
Characteristic       operational processing            informational processing
Users                clerk, IT professional            knowledge worker
Orientation          transaction                       analysis
Function             day-to-day operations             long-term informational requirements, DSS
DB design            ER based, application-oriented    star/snowflake, subject-oriented
Data                 current, guaranteed up-to-date    historical; accuracy maintained over time
Summarization        primitive, highly detailed        summarized, consolidated
# of records accessed  tens                            millions
# of users           thousands                         hundreds
DB size              100 MB to GB                      100 GB to TB

Why Have a Separate Data Warehouse?
High performance for both systems: an operational database is tuned for OLTP (access methods, indexing, concurrency control, recovery), while a data warehouse is tuned for OLAP (complex OLAP queries, multidimensional views, consolidation). They also serve different functions on different data: DSS requires historical data, whereas operational databases do not maintain historical data; DSS requires consolidation (such as aggregation and summarization) of data from heterogeneous sources, resulting in high-quality, clean and integrated data, whereas operational databases contain only detailed raw data, which needs to be consolidated before analysis.

A Multidimensional Data Model (1)
Data warehouses and OLAP tools are based on a multidimensional data model, which views data in the form of a data cube. A data cube allows data to be modeled and viewed in multiple dimensions. Dimensions are the perspectives or entities with respect to which an organization wants to keep records. For example, a sales data warehouse keeps records of the store's sales with respect to the dimensions time, item, branch and location. Each dimension may have a table associated with it, called a dimension table, which further describes the dimension.
For example, a dimension table for item may contain the attributes item_name, brand and type. Dimension tables can be specified by users or experts, or automatically adjusted based on data distributions.

A Multidimensional Data Model (2)
A multidimensional model is organized around a central theme, for instance sales, which is represented by a fact table. Facts are numerical measures, such as dollars_sold, units_sold and amount_budgeted.

Example: A 2-D View
Table 2.1: a 2-D view of sales data according to the dimensions time and item, where the sales are from branches located in Vancouver. The measure shown is dollars_sold (in thousands).

Example: A 3-D View
Table 2.2: a 3-D view of sales data according to the dimensions time, item and location. The measure shown is dollars_sold (in thousands).

Example: A 3-D Data Cube
A 3-D data cube represents the data in Table 2.2 according to the dimensions time, item and location. The measure shown is dollars_sold (in thousands).

Star Schema
The most common modeling paradigm, in which the data warehouse contains:
1. a large central table (fact table) containing the bulk of the data, with no redundancy, and
2. a set of smaller attendant tables (dimension tables), one for each dimension.

Example: star schema of a data warehouse for sales. The central fact table is sales; it contains keys to each of the four dimensions, along with two measures: dollars_sold and units_sold.

Example: snowflake schema of a data warehouse for sales.

Example: fact constellation schema of a data warehouse for sales and shipping, with two fact tables.

Concept Hierarchies
A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts (the example below is location).

Concept Hierarchies (2)
Many concept hierarchies are implicit within the database schema. For example, location may be described by the attributes number, street, city, province_or_state, zipcode and country, giving the total order hierarchy street < city < province_or_state < country. Time may be described by the attributes day, week, month, quarter and year, giving the partial order hierarchy day < {month < quarter; week} < year.

Typical OLAP Operations for Multidimensional Data (1)
Roll-up (drill-up): climbs up a concept hierarchy or reduces dimensions, summarizing the data.
Drill-down (roll-down): steps down a concept hierarchy or introduces additional dimensions; the reverse of roll-up, navigating from less detailed data to more detailed data.
Slice and dice: the slice operation performs a selection on one dimension of the given cube, resulting in a subcube.
The dice operation defines a subcube by performing a selection on two or more dimensions.

Typical OLAP Operations for Multidimensional Data (2)
Pivot (rotate): a visualization operation that rotates the data axes in the view in order to provide an alternative presentation of the data.
Other OLAP operations include drill-across, which executes queries involving more than one fact table, and drill-through.
(A small code sketch of these operations is given below, after the association-mining introduction.)

[Figure: roll-up on location (from cities to countries) - the sales cube over Chicago, New York, Toronto and Vancouver is summarized into a cube over USA and Canada, with dimensions time (quarter) and item (type).]
[Figure: drill-down on time (from quarters to months) - the quarterly sales cube is expanded into monthly values (Jan through Dec), with dimensions location (city) and item (type).]
[Figure: dice for (location = "Toronto" or "Vancouver") and (time = "Q1" or "Q2") and (item = "home entertainment" or "computer").]
[Figure: slice for time = "Q1", producing a 2-D plane over location (city) and item (type).]
[Figure: pivot - the 2-D slice is rotated so that the item (type) and location (city) axes are exchanged.]

MINING FREQUENT PATTERNS AND ASSOCIATIONS

What is Association Mining?
Association rule mining finds frequent patterns, associations, correlations or causal structures among sets of items or objects in transaction databases, relational databases and other information repositories.
Applications: basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
Rule form: "Body => Head [support, confidence]". Examples:
buys(x, "diapers") => buys(x, "beers") [0.5%, 60%]
major(x, "CS") and takes(x, "DB") => grade(x, "A") [1%, 75%]

A typical example of association rule mining is market basket analysis.
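Before turning to association rules, here is a minimal sketch of the OLAP operations described in the previous section (roll-up, slice, dice and pivot) expressed on ordinary tabular data. It assumes pandas is available; the column names and the sales values are made up for illustration and are not the exact figures from the cube diagrams.

```python
import pandas as pd

# Hypothetical sales records shaped like the cube in the figures above:
# one row per (city, quarter, item type) with the dollars_sold measure.
sales = pd.DataFrame({
    "city":    ["Vancouver", "Vancouver", "Toronto", "Toronto", "New York", "Chicago"],
    "country": ["Canada",    "Canada",    "Canada",  "Canada",  "USA",      "USA"],
    "quarter": ["Q1",        "Q2",        "Q1",      "Q2",      "Q1",       "Q1"],
    "item":    ["computer",  "phone",     "computer","security","computer", "phone"],
    "dollars_sold": [400, 300, 605, 220, 825, 440],
})

# Roll-up on location (city -> country): aggregate the measure at a higher level.
rollup = sales.groupby(["country", "quarter", "item"])["dollars_sold"].sum()

# Slice: select a single value on one dimension (time = "Q1").
slice_q1 = sales[sales["quarter"] == "Q1"]

# Dice: select on two or more dimensions.
dice = sales[sales["city"].isin(["Toronto", "Vancouver"])
             & sales["quarter"].isin(["Q1", "Q2"])
             & sales["item"].isin(["computer", "home entertainment"])]

# Pivot (rotate): present the same data with item and city as the two axes.
pivot = slice_q1.pivot_table(index="item", columns="city",
                             values="dollars_sold", aggfunc="sum")

print(rollup, slice_q1, dice, pivot, sep="\n\n")
```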
The information that customers who purchase a computer also tend to buy antivirus software at the same time is represented in the association rule:
computer => antivirus_software [support = 2%, confidence = 60%]
Rule support and confidence are two measures of rule interestingness. Support = 2% means that 2% of all transactions under analysis show that computer and antivirus software are purchased together. Confidence = 60% means that 60% of the customers who purchased a computer also bought the software. Typically, association rules are considered interesting if they satisfy both a minimum support threshold and a minimum confidence threshold; such thresholds can be set by users or domain experts.

Rule Measures: Support and Confidence
TransID  Items Bought
T001     A, B, C
T002     A, C
T003     A, D
T004     B, E, F
Find all rules of the form A and B => C with minimum confidence and support; let min_sup = 50% and min_conf = 50%.
Support: the probability that a transaction contains {A, B, C}.
Confidence: the conditional probability that a transaction containing {A, B} also contains C.

Rules that satisfy both a minimum support threshold (min_sup) and a minimum confidence threshold (min_conf) are called strong. A set of items is referred to as an itemset; an itemset that contains k items is a k-itemset. The occurrence frequency of an itemset is the number of transactions that contain the itemset. An itemset satisfies minimum support if its occurrence frequency >= min_sup * total number of transactions; an itemset that satisfies minimum support is a frequent itemset.

Two Steps in Mining Association Rules
Step 1: find all frequent itemsets. A subset of a frequent itemset must also be a frequent itemset, i.e. if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets. Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets).
Step 2: generate strong association rules from the frequent itemsets.

Mining Single-Dimensional Boolean Association Rules from Transaction Databases
A method for mining the simplest form of association rules - single-dimensional, single-level, boolean association rules - is the Apriori algorithm. The Apriori algorithm finds frequent itemsets for boolean association rules:
1. Lk, the set of frequent k-itemsets, is used to explore Lk+1.
2. Each pass consists of a join step and a prune step.
The join step: a set of candidate k-itemsets (Ck) is generated by joining Lk-1 with itself.
The prune step: determine Lk using the property that any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.

The Apriori Algorithm
Pseudo-code (Ck: candidate itemsets of size k; Lk: frequent itemsets of size k):
L1 = {frequent 1-itemsets};
for (k = 1; Lk != empty; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database D do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return the union of all Lk;

Example: Finding Frequent Itemsets in D
Transaction database D, with |D| = 9.
1. Each item is a member of the set of candidate 1-itemsets (C1); count the number of occurrences of each item.
2. Suppose the minimum transaction support count is 2; L1 is the set of candidate 1-itemsets that satisfy minimum support.
3. Generate C2 = L1 join L1.
4. Continue the algorithm in this way until C4 = ∅ (the empty set).
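The join-and-prune loop above can be sketched directly in code. The following is a minimal, illustrative implementation, not an optimized one; the transaction database is assumed for illustration (its support counts are chosen to be consistent with the rule-generation example later in this section).

```python
from itertools import combinations

# Hypothetical transaction database (each transaction is a set of items).
D = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
min_support_count = 2

def apriori(transactions, min_count):
    # L1: frequent 1-itemsets, found by simple counting.
    counts = {}
    for t in transactions:
        for item in t:
            counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
    L = {s for s, c in counts.items() if c >= min_count}
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    k = 1
    while L:
        # Join step: candidate (k+1)-itemsets from pairs of frequent k-itemsets.
        candidates = {a | b for a in L for b in L if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in L for sub in combinations(c, k))}
        # Scan the database to count each remaining candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        L = {c for c, n in counts.items() if n >= min_count}
        frequent.update((c, n) for c, n in counts.items() if n >= min_count)
        k += 1
    return frequent

for itemset, count in sorted(apriori(D, min_support_count).items(),
                             key=lambda x: (len(x[0]), sorted(x[0]))):
    print(sorted(itemset), count)
```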
Example of Generating Candidates
L3 = {abc, abd, acd, ace, bcd}
Self-joining L3 with L3 gives C4 = {abcd, acde}.
Pruning: acde is removed because ade is not in L3. Therefore C4 = {abcd}.

Generating Association Rules from Frequent Itemsets
confidence(A => B) = P(B|A) = support_count(A ∪ B) / support_count(A)
where support_count(A ∪ B) is the number of transactions containing the itemset A ∪ B, and support_count(A) is the number of transactions containing the itemset A.
Association rules can be generated as follows: for each frequent itemset l, generate all nonempty subsets of l; for every nonempty subset s of l, output the rule s => (l - s) if support_count(l) / support_count(s) >= min_conf, where min_conf is the minimum confidence threshold.

Example
Suppose the data contain the frequent itemset l = {I1, I2, I5}. What are the association rules that can be generated from l?
The nonempty subsets of l are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2} and {I5}. The resulting association rules are:
1. I1 and I2 => I5, confidence = 2/4 = 50%
2. I1 and I5 => I2, confidence = 2/2 = 100%
3. I2 and I5 => I1, confidence = 2/2 = 100%
4. I1 => I2 and I5, confidence = 2/6 = 33%
5. I2 => I1 and I5, confidence = 2/7 = 29%
6. I5 => I1 and I2, confidence = 2/2 = 100%
If the minimum confidence threshold is 70%, the output consists of rules 2, 3 and 6.

CLASSIFICATION AND PREDICTION

976-451 Data Warehousing and Data Mining, Lecture 5: Classification and Prediction - Chutima Pisarn, Faculty of Technology and Environment, Prince of Songkla University

What Is Classification?
Cases: a bank loan officer needs an analysis of her data in order to learn which loan applicants are "safe" and which are "risky" for the bank. A marketing manager at AllElectronics needs data analysis to help guess whether a customer with a given profile will buy a new computer. A medical researcher wants to analyze breast cancer data in order to predict which one of three specific treatments a patient should receive.
In each case the data analysis task is classification, where a model or classifier is constructed to predict categorical labels, such as "safe" or "risky" for the loan application data, "yes" or "no" for the marketing data, and "treatment A", "treatment B" or "treatment C" for the medical data.

What Is Prediction?
Suppose that the marketing manager would like to predict how much a given customer will spend during a sale at AllElectronics. This data analysis task is numeric prediction, where the model constructed predicts a continuous value or ordered values, as opposed to a categorical label; such a model is a predictor. Regression analysis is the statistical methodology most often used for numeric prediction.

How Does Classification Work?
Data classification is a two-step process. In the first step - the learning step or training phase - a model is built describing a predetermined set of data classes or concepts. The model is constructed by analyzing database tuples described by attributes. Each tuple is assumed to belong to a predefined class, as determined by the class label attribute. The data tuples used to build the model form the training data set; the individual tuples in the training set are referred to as training samples. If the class label is provided, this step is known as supervised learning; otherwise it is called unsupervised learning (or clustering). The learned model is represented in the form of classification rules, decision trees or mathematical formulae.

How Does Classification Work? (cont.)
In the second step, the model is used for classification. First, the predictive accuracy of the model is estimated. The holdout method is a technique that uses a test set of class-labeled samples which are randomly selected and are independent of the training samples. The accuracy of a model on a given test set is the percentage of test set samples correctly classified by the model. If the accuracy of the model were estimated based on the training data set, the model would tend to overfit the data. If the accuracy of the model is considered acceptable, the model can be used to classify future data tuples or objects for which the class label is unknown.

How Is Prediction Different from Classification?
Data prediction is a two-step process, similar to that of data classification. For prediction, the attribute for which values are being predicted is continuous-valued (ordered) rather than categorical (discrete-valued and unordered). Prediction can also be viewed as a mapping or function y = f(X).

Classification by Decision Tree Induction
A decision tree is a flow-chart-like tree structure: each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node represents a class. The top-most node in a tree is the root node. The decision tree below represents the concept buys_computer:
age?
  <=30  -> student? (no -> no; yes -> yes)
  31…40 -> yes
  >40   -> credit_rating? (excellent -> no; fair -> yes)

Attribute Selection Measure
The information gain measure is used to select the test attribute at each node in the tree. Information gain is referred to as an attribute selection measure, or a measure of the goodness of split; the attribute with the highest information gain is chosen as the test attribute for the current node.
Let S be a set consisting of s data samples, and suppose the class label attribute has m distinct values defining m distinct classes Ci (for i = 1, …, m). Let si be the number of samples of S in class Ci. The expected information needed to classify a given sample is
I(s1, s2, …, sm) = - Σ_{i=1..m} pi log2(pi),
where pi is the probability that a sample belongs to class Ci, estimated by pi = si/s.

Attribute Selection Measure (cont.)
Next, find the entropy of attribute A. Let A have v distinct values {a1, a2, …, av}, which partition S into subsets {S1, S2, …, Sv}. For each Sj, let sij be the number of samples of class Ci in Sj. The entropy, or expected information based on the partitioning by A, is
E(A) = Σ_{j=1..v} ((s1j + … + smj) / s) * I(s1j, …, smj),
and the information gain is
Gain(A) = I(s1, s2, …, sm) - E(A).
The algorithm computes the information gain of each attribute; the attribute with the highest information gain is chosen as the test attribute for the given set S.
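A minimal sketch of how this computation can be carried out in code is shown below. It uses the 14 buys_computer training samples from the worked example that follows; the function names are just illustrative.

```python
import math

# The 14 buys_computer training samples from the example below:
# (age, income, student, credit_rating, class).
data = [
    ("<=30", "high", "no", "fair", "no"),      ("<=30", "high", "no", "excellent", "no"),
    ("31…40", "high", "no", "fair", "yes"),    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),      (">40", "low", "yes", "excellent", "no"),
    ("31…40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),     (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31…40", "medium", "no", "excellent", "yes"),
    ("31…40", "high", "yes", "fair", "yes"),   (">40", "medium", "no", "excellent", "no"),
]
attributes = ["age", "income", "student", "credit_rating"]

def expected_info(labels):
    """I(s1,...,sm) = -sum pi log2(pi) over the class distribution."""
    info = 0.0
    for c in set(labels):
        p = labels.count(c) / len(labels)
        info -= p * math.log2(p)
    return info

def gain(rows, attr_index):
    """Gain(A) = I(S) - E(A), where E(A) is the weighted entropy of the partition by A."""
    labels = [r[-1] for r in rows]
    entropy_a = 0.0
    for value in set(r[attr_index] for r in rows):
        subset = [r[-1] for r in rows if r[attr_index] == value]
        entropy_a += len(subset) / len(rows) * expected_info(subset)
    return expected_info(labels) - entropy_a

for i, name in enumerate(attributes):
    print(f"Gain({name}) = {gain(data, i):.3f}")
# Expected output (matching the worked example): age 0.246, income 0.029,
# student 0.151, credit_rating 0.048 -> age is chosen as the root test attribute.
```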
Example
The class label attribute buys_computer has two classes. The training data:
RID  age     income   student  credit_rating  class: buys_computer
1    <=30    high     no       fair           no
2    <=30    high     no       excellent      no
3    31…40   high     no       fair           yes
4    >40     medium   no       fair           yes
5    >40     low      yes      fair           yes
6    >40     low      yes      excellent      no
7    31…40   low      yes      excellent      yes
8    <=30    medium   no       fair           no
9    <=30    low      yes      fair           yes
10   >40     medium   yes      fair           yes
11   <=30    medium   yes      excellent      yes
12   31…40   medium   no       excellent      yes
13   31…40   high     yes      fair           yes
14   >40     medium   no       excellent      no

I(s1, s2) = I(9, 5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

Compute the entropy of each attribute.
For attribute age: for age = "<=30", s11 = 2, s21 = 3; for age = "31…40", s12 = 4, s22 = 0; for age = ">40", s13 = 3, s23 = 2.
Gain(age) = I(s1, s2) - E(age) = 0.940 - [(5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2)] = 0.246
For attribute income: for income = "high", s11 = 2, s21 = 2; for income = "medium", s12 = 4, s22 = 2; for income = "low", s13 = 3, s23 = 1.
Gain(income) = I(s1, s2) - E(income) = 0.940 - [(4/14) I(2,2) + (6/14) I(4,2) + (4/14) I(3,1)] = 0.029
For attribute student: for student = "yes", s11 = 6, s21 = 1; for student = "no", s12 = 3, s22 = 4.
Gain(student) = I(s1, s2) - E(student) = 0.940 - [(7/14) I(6,1) + (7/14) I(3,4)] = 0.151
For attribute credit_rating: for credit_rating = "fair", s11 = 6, s21 = 2; for credit_rating = "excellent", s12 = 3, s22 = 3.
Gain(credit_rating) = I(s1, s2) - E(credit_rating) = 0.940 - [(8/14) I(6,2) + (6/14) I(3,3)] = 0.048
Since age has the highest information gain, age is selected as the test attribute: a node is created and labeled with age, and branches are grown for each of the attribute's values.

[Figure: the samples are partitioned by age into three subsets - S1 (age <= 30), S2 (age 31…40) and S3 (age > 40) - each listing the remaining attributes income, student, credit_rating and the class.]

For the partition age = "<=30", find the information gain of each attribute in this partition, then select the attribute with the highest information gain as the test node (i.e. call generate_decision_tree(S1, {income, student, credit_rating})). Student has the highest information gain. For student = "yes" all samples belong to class "yes", so a leaf node labeled "yes" is created; for student = "no" all samples belong to class "no", so a leaf node labeled "no" is created.

For the partition age = "31…40", all samples belong to class "yes", so a leaf node labeled "yes" is created. For the partition age = ">40", consider credit_rating and income; credit_rating has the higher information gain. For credit_rating = "excellent" the class is "no", and for credit_rating = "fair" the class is "yes". The only attribute left is income, but the remaining sample set is empty, so generate_decision_tree terminates.

Assignment 1: Show the construction of this decision tree in detail, including the calculations.

Example: Generating Rules from the Decision Tree
From the final tree (age at the root; the "<=30" branch tests student; the "31…40" branch is a "yes" leaf; the ">40" branch tests credit_rating), the following rules are extracted:
1. IF age = "<=30" AND student = "no" THEN buys_computer = "no"
2. IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
3. IF age = "31…40" THEN buys_computer = "yes"
4. IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"
5. IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "yes"

Naïve Bayesian Classification
The naïve Bayesian classifier, also called the simple Bayesian classifier, works as follows:
1. Each data sample is represented by an n-dimensional feature vector X = (x1, x2, …, xn), drawn from n attributes A1, A2, …, An respectively. For example, with attributes Outlook, Temperature, Humidity and Windy and class label Play, the training samples are:
Outlook   Temperature  Humidity  Windy  Play
Rainy     Mild         Normal    False  Y
Overcast  Cool         Normal    True   Y
Sunny     Hot          High      True   N
Overcast  Hot          High      False  Y
and X = (Sunny, Hot, High, False) is a sample of unknown class.
2. Suppose that there are m classes C1, C2, …, Cm. Given an unknown data sample X, the classifier will predict that X belongs to the class having the highest posterior probability conditioned on X. That is, the naïve Bayesian classifier assigns an unknown sample X to class Ci if and only if P(Ci|X) > P(Cj|X) for 1 <= j <= m, j != i. It therefore finds the maximum posterior probability among P(C1|X), P(C2|X), …, P(Cm|X); the class Ci for which P(Ci|X) is maximized is called the maximum posteriori hypothesis.
In the weather example there are m = 2 classes, C1: Play = "Y" and C2: Play = "N", and X is assigned to "Y" if P(Play = "Y" | X) > P(Play = "N" | X).
3. By Bayes' theorem, P(Ci|X) = P(X|Ci) P(Ci) / P(X). As P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be maximized. If the P(Ci) are not known, it is commonly assumed that P(C1) = P(C2) = … = P(Cm), and therefore only P(X|Ci) needs to be maximized. Otherwise, we maximize P(X|Ci) P(Ci), where P(Ci) = si / s, the number of training samples of class Ci divided by the total number of training samples.
For the weather example, P(Play = "Y" | X) is evaluated as P(X | Play = "Y") P(Play = "Y") = P(X | Play = "Y") (3/4), and P(Play = "N" | X) as P(X | Play = "N") P(Play = "N") = P(X | Play = "N") (1/4).
4. Given a data set with many attributes, it is expensive to compute P(X|Ci). To reduce the computation, the naïve assumption of class conditional independence is made (there are no dependence relationships among the attributes):
P(X|Ci) = Π_{k=1..n} P(xk|Ci) = P(x1|Ci) * P(x2|Ci) * … * P(xn|Ci)
If Ak is categorical, then P(xk|Ci) = sik / si, where sik is the number of training samples of class Ci having the value xk for Ak and si is the total number of training samples belonging to class Ci. If Ak is continuous-valued, a Gaussian distribution is used (not covered in this class).
5. In order to classify an unknown sample X, P(X|Ci) P(Ci) is evaluated for each class Ci; X is assigned to the class Ci for which P(X|Ci) P(Ci) is the maximum.
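A minimal sketch of this procedure is shown below. It handles only categorical attributes, does no smoothing of zero counts, and uses the buys_computer training tuples and the unknown sample from the worked example that follows; variable names are illustrative.

```python
from collections import Counter, defaultdict

# The 14 buys_computer training samples used in the worked example below:
# (age, income, student, credit_rating) -> buys_computer
train = [
    (("<=30", "high", "no", "fair"), "no"),       (("<=30", "high", "no", "excellent"), "no"),
    (("31…40", "high", "no", "fair"), "yes"),     ((">40", "medium", "no", "fair"), "yes"),
    ((">40", "low", "yes", "fair"), "yes"),       ((">40", "low", "yes", "excellent"), "no"),
    (("31…40", "low", "yes", "excellent"), "yes"), (("<=30", "medium", "no", "fair"), "no"),
    (("<=30", "low", "yes", "fair"), "yes"),      ((">40", "medium", "yes", "fair"), "yes"),
    (("<=30", "medium", "yes", "excellent"), "yes"), (("31…40", "medium", "no", "excellent"), "yes"),
    (("31…40", "high", "yes", "fair"), "yes"),    ((">40", "medium", "no", "excellent"), "no"),
]
X = ("<=30", "medium", "yes", "fair")   # unknown sample

class_counts = Counter(label for _, label in train)   # si per class
cond_counts = defaultdict(Counter)                    # sik per (class, attribute position)
for features, label in train:
    for k, value in enumerate(features):
        cond_counts[(label, k)][value] += 1

def score(label):
    """P(X|Ci) P(Ci) under the class-conditional independence assumption."""
    p = class_counts[label] / len(train)              # prior P(Ci)
    for k, value in enumerate(X):
        p *= cond_counts[(label, k)][value] / class_counts[label]   # P(xk|Ci)
    return p

scores = {label: score(label) for label in class_counts}
print(scores)   # roughly {'yes': 0.028, 'no': 0.007}, as in the worked example
print("predicted class:", max(scores, key=scores.get))
```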
Example: Predicting a Class Label Using Naïve Bayesian Classification
The training data are the 14 buys_computer tuples (RID 1-14) from the decision tree example above. The unknown sample (RID 15) is X = (age = "<=30", income = "medium", student = "yes", credit_rating = "fair").
Let C1: buys_computer = "yes" and C2: buys_computer = "no". We need to maximize P(X|Ci) P(Ci) for i = 1, 2.
For i = 1:
P(buys_computer = "yes") = 9/14 = 0.64
P(X|buys_computer = "yes") = P(age = "<=30" | yes) * P(income = "medium" | yes) * P(student = "yes" | yes) * P(credit_rating = "fair" | yes) = 2/9 * 4/9 * 6/9 * 6/9 = 0.044
P(X|buys_computer = "yes") P(buys_computer = "yes") = 0.044 * 0.64 = 0.028
For i = 2:
P(buys_computer = "no") = 5/14 = 0.36
P(X|buys_computer = "no") = P(age = "<=30" | no) * P(income = "medium" | no) * P(student = "yes" | no) * P(credit_rating = "fair" | no) = 3/5 * 2/5 * 1/5 * 2/5 = 0.019
P(X|buys_computer = "no") P(buys_computer = "no") = 0.019 * 0.36 = 0.007
Therefore X = (age = "<=30", income = "medium", student = "yes", credit_rating = "fair") should be in class buys_computer = "yes".

Assignment 2: Using the naïve Bayesian classifier, predict the class of the unknown data samples (the last two rows below).
Outlook   Temperature  Humidity  Windy  Play
Sunny     Hot          High      False  N
Sunny     Hot          High      True   N
Overcast  Hot          High      False  Y
Rainy     Mild         High      False  Y
Rainy     Cool         Normal    False  Y
Rainy     Cool         Normal    True   N
Overcast  Cool         Normal    True   Y
Sunny     Mild         High      False  N
Sunny     Cool         Normal    False  Y
Rainy     Mild         Normal    False  Y
Sunny     Mild         Normal    True   Y
Overcast  Hot          Normal    False  Y
Overcast  Mild         High      True   Y
Rainy     Mild         High      True   N
Sunny     Cool         Normal    False  ?
Rainy     Mild         High      False  ?

Prediction: Linear Regression
The prediction of continuous values can be modeled by the statistical technique of regression. Linear regression is the simplest form of regression:
Y = α + βX
Y is called the response variable and X the predictor variable; α and β are regression coefficients specifying the Y-intercept and the slope of the line. These coefficients can be solved for by the method of least squares, which minimizes the error between the actual data and the estimate of the line.

Example: Find the Linear Regression of the Salary Data
X (years of experience): 3, 8, 9, 13, 3, 6, 11, 21, 1, 16
Y (salary, in $1000s):   30, 57, 64, 72, 36, 43, 59, 90, 20, 83
Here x̄ = 9.1 and ȳ = 55.4.
β = Σ_{i=1..s} (xi - x̄)(yi - ȳ) / Σ_{i=1..s} (xi - x̄)² = 3.5
α = ȳ - βx̄ = 23.6
The predicted line is estimated by Y = 23.6 + 3.5X; for example, a person with X = 10 years of experience has a predicted salary of 23.6 + 3.5(10) = 58.6 (thousand dollars).

Classifier Accuracy Measures
The accuracy of a classifier on a given test set is the percentage of test set tuples that are correctly classified by the classifier; in the pattern recognition literature this is also called the recognition rate. The error rate or misclassification rate of a classifier M is simply 1 - Acc(M), where Acc(M) is the accuracy of M. If we were to use the training set to estimate the error rate of the model, the result is called the resubstitution error.
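Returning briefly to the linear-regression example above, the following is a minimal least-squares sketch in plain Python, using the salary data from that example.

```python
# Least-squares fit of Y = alpha + beta * X, using the salary data
# from the regression example above (X in years, Y in $1000s).
X = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
Y = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]

x_bar = sum(X) / len(X)                 # 9.1
y_bar = sum(Y) / len(Y)                 # 55.4

# beta = sum (xi - x_bar)(yi - y_bar) / sum (xi - x_bar)^2
beta = (sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))
        / sum((x - x_bar) ** 2 for x in X))
alpha = y_bar - beta * x_bar            # Y-intercept

print(f"alpha = {alpha:.2f}, beta = {beta:.2f}")
# beta is about 3.5; the text rounds beta to 3.5 before computing alpha,
# which gives alpha = 23.6 and the line Y = 23.6 + 3.5 X.
print("predicted salary for X = 10 years:", round(alpha + beta * 10, 1))   # about 58.6
```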
A confusion matrix is a useful tool for analyzing how well the classifier can recognize tuples of different classes.

Confusion Matrix: Example
Actual class          Predicted: buys_computer = yes   Predicted: buys_computer = no   Total    Recognition (%)
buys_computer = yes   6,954                            46                              7,000    99.34
buys_computer = no    412                              2,588                           3,000    86.27
Total                 7,366                            2,634                           10,000   95.52
The entry 6,954 is the number of tuples of class buys_computer = yes that were labeled by the classifier as class buys_computer = yes. In general, for actual classes C1 (positive) and C2 (negative), predictions fall into true positives and false negatives (actual C1) and false positives and true negatives (actual C2).

Are There Alternatives to the Accuracy Measure?
Sensitivity refers to the true positive (recognition) rate, the proportion of positive tuples that are correctly identified; specificity is the true negative rate, the proportion of negative tuples that are correctly identified. With pos the number of positive tuples and neg the number of negative tuples:
sensitivity = t_pos / pos
specificity = t_neg / neg
precision = t_pos / (t_pos + f_pos)
accuracy = sensitivity * pos / (pos + neg) + specificity * neg / (pos + neg)

Predictor Error Measures
Loss functions measure the error between yi and the predicted value yi'. The most common loss functions are the absolute error |yi - yi'| and the squared error (yi - yi')². Based on these, the test error (rate), or generalization error, is the average loss over the test set. Thus we get the following error rates:
Mean absolute error = Σ_{i=1..d} |yi - yi'| / d
Mean squared error = Σ_{i=1..d} (yi - yi')² / d

Evaluating the Accuracy of a Classifier or Predictor
How can we use these measures to obtain a reliable estimate of classifier accuracy (or predictor accuracy)? Accuracy estimates can help in the comparison of different classifiers. Common techniques for assessing accuracy, based on randomly sampled partitions of the given data, are the holdout method, random subsampling, cross-validation and the bootstrap.

Holdout method: the given data are randomly partitioned into two independent sets, a training set and a test set; typically 2/3 of the data form the training set and 1/3 the test set. The training set is used to derive the classifier, and the test set is used to estimate the accuracy of the derived classifier.

Random subsampling: a variation of the holdout method in which the holdout method is repeated k times; the overall accuracy estimate is the average of the accuracies obtained from each iteration.

k-fold cross-validation: the initial data are randomly partitioned into k equal-sized subsets ("folds") S1, S2, …, Sk. Training and testing are performed k times; in iteration i, the subset Si is the test set and the remaining subsets are collectively used to train the classifier.
Accuracy = (overall number of correct classifications from the k iterations) / (total number of samples in the initial data)
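A minimal sketch of k-fold cross-validation as described above is shown below. The data and the "classifier" (a majority-class predictor) are stand-ins assumed purely for illustration.

```python
import random

# Hypothetical labeled samples: (feature vector, class label).
data = [((i % 7, i % 3), "yes" if i % 2 == 0 else "no") for i in range(30)]
k = 5

def train_classifier(training_set):
    # Placeholder "model": predict the majority class of the training set.
    labels = [label for _, label in training_set]
    return max(set(labels), key=labels.count)

def classify(model, sample_features):
    return model   # the majority-class model ignores the features

random.seed(0)
random.shuffle(data)
folds = [data[i::k] for i in range(k)]          # k roughly equal-sized folds

correct = 0
for i in range(k):
    test_set = folds[i]
    training_set = [row for j, fold in enumerate(folds) if j != i for row in fold]
    model = train_classifier(training_set)
    correct += sum(1 for features, label in test_set
                   if classify(model, features) == label)

# Accuracy = correct classifications over the k iterations / total number of samples.
print("k-fold accuracy estimate:", correct / len(data))
```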
Bootstrap method: the training tuples are sampled uniformly with replacement; each time a tuple is selected, it is equally likely to be selected again and re-added to the training set. There are several bootstrap methods; the commonly used one is the .632 bootstrap, which works as follows. Given a data set of d tuples, the data set is sampled d times, with replacement, resulting in a bootstrap sample (training set) of d samples. It is very likely that some of the original data tuples will occur more than once in this sample. The data tuples that did not make it into the training set end up forming the test set. If we try this out several times, on average 63.2% of the original data tuples will end up in the bootstrap sample and the remaining 36.8% will form the test set.

CLUSTER ANALYSIS

What is Cluster Analysis?
Clustering is the process of grouping data into classes or clusters such that the objects within a cluster have high similarity to one another but are very dissimilar to objects in other clusters. Typical applications of clustering: in business, discovering distinct groups in a customer base and characterizing customer groups based on purchasing patterns; in biology, deriving plant and animal taxonomies and categorizing genes; and so on. Clustering is also called data segmentation in some applications, because clustering partitions large data sets into groups according to their similarity.

Clustering can also be used for outlier detection, where outliers (values that are "far away" from any cluster) may be more interesting than common cases; applications include the detection of credit card fraud and the monitoring of criminal activities in electronic commerce. In machine learning, clustering is an example of unsupervised learning (it does not rely on predefined classes).

How to Compute the Dissimilarity Between Objects
The dissimilarity (or similarity) between objects described by interval-scaled variables is typically computed based on the distance between each pair of objects.
Euclidean distance (q = 2): d(i,j) = sqrt(|xi1 - xj1|² + |xi2 - xj2|² + … + |xip - xjp|²)
Manhattan (city block) distance (q = 1): d(i,j) = |xi1 - xj1| + |xi2 - xj2| + … + |xip - xjp|
Minkowski distance, a generalization of both Euclidean and Manhattan distance: d(i,j) = (|xi1 - xj1|^q + |xi2 - xj2|^q + … + |xip - xjp|^q)^(1/q)
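These distance functions are straightforward to express in code; a small sketch with hypothetical objects follows.

```python
# Minkowski distance between two p-dimensional objects; q = 1 gives the
# Manhattan distance and q = 2 the Euclidean distance, as defined above.
def minkowski(x, y, q):
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

# Two hypothetical objects described by three interval-scaled variables.
i = (1.0, 2.0, 10.0)
j = (4.0, 6.0, 10.0)

print("Manhattan:", minkowski(i, j, 1))      # 7.0
print("Euclidean:", minkowski(i, j, 2))      # 5.0
print("Minkowski, q = 3:", minkowski(i, j, 3))
```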
Centroid-Based Technique: The K-Means Method
Cluster similarity is measured with respect to the mean value of the objects in a cluster, which can be viewed as the cluster's centroid or center of gravity.

The k-means algorithm
Input: the number of clusters k and a database containing n objects.
Output: a set of k clusters that minimizes the squared-error criterion.
1. Randomly select k of the objects, each of which initially represents a cluster mean or center.
2. For each remaining object, assign the object to the cluster to which it is the most similar, based on the distance between the object and the cluster mean.
3. Compute the new mean for each cluster.
4. The process iterates until the criterion function converges.

The criterion used is called the squared-error criterion:
E = Σ_{i=1..k} Σ_{p ∈ Ci} |p - mi|²
where E is the sum of squared error for all objects in the database, p is the point representing a given object, and mi is the mean of cluster Ci.

Assignment 3: Suppose that the data mining task is to cluster the following eight points into three clusters: A1(2,10), A2(2,5), A3(8,4), B1(5,8), B2(7,5), B3(6,4), C1(1,2), C2(4,9). The distance function is Euclidean distance. Suppose A1, B1 and C1 are assigned as the initial centers of the clusters.
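Finally, a minimal k-means sketch following the steps above, in plain Python. The points are hypothetical (they are not the assignment data), and the function names are illustrative.

```python
import math, random

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def k_means(points, k, max_iter=100, seed=0):
    random.seed(seed)
    centers = random.sample(points, k)              # step 1: pick k initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Step 2: assign each object to the nearest cluster center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: euclidean(p, centers[i]))
            clusters[nearest].append(p)
        # Step 3: recompute each cluster mean.
        new_centers = [
            tuple(sum(coords) / len(c) for coords in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:                  # step 4: stop when the means converge
            break
        centers = new_centers
    return centers, clusters

# Hypothetical 2-D points (not the assignment data).
pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (4, 5)]
centers, clusters = k_means(pts, k=2)
print("centers:", centers)
print("clusters:", clusters)
```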