UGBA104: Intro to Business Analytics
R04 Pre: Unsupervised Learning, Frequent Items and Association Rules
Professors Terry Hendershott and Thomas Lee
Operations and Information Technology Management

Overview: Week 04
• Monday
  – Why: Associations as Opportunities
  – What: Descriptive Models (unsupervised learning, Camm 4.3): Frequent Item Sets, Association Rules, Evaluating Rules
• Wednesday
  – How: Transforming Observations; Association Rules (Camm 4.3)
Next Week
• What: Evaluating Predictive Models
• Why: Discrimination and Diagnosis; Relevance and Recommendation
• How: Classification
Hendershott and Lee, UC Berkeley, Haas School, Slide 2

Association Rules
Review Grocery Receipts:
1. Bread, Milk
2. Bread, Diapers, Beer, Eggs
3. Milk, Diapers, Beer, Cola
4. Bread, Milk, Diapers, Beer
5. Bread, Milk, Diapers, Cola

Define Support Count of A: the number of transactions (receipts) in which item (set) A appears.
Define Support of A: $P(A) = \frac{\text{Support Count}(A)}{\text{Total Count of Transactions}}$
For item set(s) A and B, the set AB:
Define Lift of AB: $\text{Lift} = \frac{P(A \text{ and } B)}{P(A) \cdot P(B)}$
Define Confidence in Rule A → B: $\text{Conf} = \frac{P(A \text{ and } B)}{P(A)}$

Item Set (AB) | A → B | Support A = P(A) | Support B = P(B) | Support = P(A & B) | Lift AB | Conf
Cola & Diapers | Cola → Diapers | 2/5 | 4/5 | 2/5 | 10/8 | 1
Cola & Diapers | Diapers → Cola | 4/5 | 2/5 | 2/5 | 10/8 | 1/2
Cola, Diapers, Milk | Diapers Milk → Cola | 3/5 | 2/5 | 2/5 | 10/6 | 2/3
Cola, Diapers, Milk | Cola Diapers → Milk | 2/5 | 4/5 | 2/5 | 10/8 | 1
Cola, Diapers, Milk | Cola → Diapers Milk | 2/5 | 3/5 | 2/5 | 10/6 | 1
Slide 3

Association Rules: XL Miner
1. Copy-and-paste the receipts into Excel. In XL Miner’s “List format”:
• Each row corresponds to one observation.
• Each column entry corresponds to one item in the observation.
Notice: The order of items does not matter in List format. In row 3, Milk is the first item in the observation. In row 4, Milk is the second item in the observation. Although Beer always follows Diapers in this example, XL Miner does not require it.
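The support, confidence, and lift definitions can be checked against the hand-counted table with a few lines of code. The following is a minimal Python sketch, assuming the five grocery receipts are stored as sets (which mirrors list format, where item order does not matter). The helper names `support_count`, `support`, `confidence`, and `lift` are illustrative only and are not part of XL Miner.

```python
from fractions import Fraction

# The five grocery receipts from the slide; sets ignore item order,
# just as XL Miner's list format does.
RECEIPTS = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]


def support_count(items, receipts=RECEIPTS):
    """Number of receipts that contain every item in `items`."""
    items = set(items)
    return sum(1 for receipt in receipts if items <= receipt)


def support(items, receipts=RECEIPTS):
    """P(items) = support count / total count of transactions."""
    return Fraction(support_count(items, receipts), len(receipts))


def confidence(antecedent, consequent, receipts=RECEIPTS):
    """Conf(A -> B) = P(A and B) / P(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, receipts) / support(antecedent, receipts)


def lift(antecedent, consequent, receipts=RECEIPTS):
    """Lift(AB) = P(A and B) / (P(A) * P(B))."""
    both = set(antecedent) | set(consequent)
    return support(both, receipts) / (
        support(antecedent, receipts) * support(consequent, receipts)
    )


if __name__ == "__main__":
    # Rule: Diapers Milk -> Cola (third row of the hand-counted table).
    print(support({"Diapers", "Milk"}), support({"Cola"}))
    print(confidence({"Diapers", "Milk"}, {"Cola"}))
    print(lift({"Diapers", "Milk"}, {"Cola"}))
```

Exact fractions are used instead of floats so the results can be compared directly with the table (note that 10/8 and 10/6 reduce to 5/4 and 5/3).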
Slide 4

Association Rules: XL Miner
2. Check the model inputs:
• Where is the data table?
• Do you have all of the rows and columns?
• In this example, the first row does not contain headers.
3. Set the format of the input data: “List format” in this example.
4. Set the parameters for Association Rule Mining:
• Set the minimum support threshold so that XL Miner shows only rules made of item sets that appear in a “meaningful” number of transactions.
• Recall that “confidence” refers to a conditional probability: the probability of the consequent conditioned on the antecedent.
Notice: XL Miner calls it “Minimum support” but is asking for the “support count” as defined and used in the text.
Notice: In this example, we deliberately lowered the thresholds of support and confidence so that we could compare XL Miner’s rules to some that we counted by hand. In general, 20% support and 80% confidence is a reasonable starting point for analysis.
Slide 5

Association Rules: XL Miner
Confirm model settings: which variables did you use, how many are there, etc.
• “First Row Contains Headers” was unchecked.
• Confirm: Min Support = 2, Min Confidence = 50
Slide 6

Association Rules: XL Miner
XL Miner generates the Association Rules in a new worksheet.

Item Set (AB) | A → B | Support A = P(A) | Support B = P(B) | Support = P(A & B) | Lift AB | Conf
Cola, Diapers, Milk | Diapers Milk → Cola | 3/5 | 2/5 | 2/5 | 10/6 | 2/3
Cola, Diapers, Milk | Cola Diapers → Milk | 2/5 | 4/5 | 2/5 | 10/8 | 1
Cola, Diapers, Milk | Cola → Diapers Milk | 2/5 | 3/5 | 2/5 | 10/6 | 1
Slide 7

Association Rules: XL Miner
PMML: Predictive Model Markup Language
An XML-based markup language for specifying the parameter settings and design of a model configuration. Enables portability across different platforms (e.g. from ASP in Excel to R or Python).
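To illustrate what the same analysis might look like on another platform, here is a brute-force Python sketch that applies the settings confirmed in XL Miner (Min Support = 2 as a support count, Min Confidence = 50) to the five grocery receipts. It is not XL Miner's algorithm and does not read PMML. The function name `mine_rules` is hypothetical, and treating the confidence threshold as "at least 50%" is an assumption.

```python
from fractions import Fraction
from itertools import combinations

# The five grocery receipts used throughout the example.
RECEIPTS = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]


def mine_rules(receipts, min_support_count=2, min_confidence=Fraction(1, 2)):
    """Enumerate every rule A -> B that meets both thresholds.

    Brute force over all item sets: fine for a handful of items,
    far too slow for real data.
    Assumption: a rule passes when confidence >= min_confidence.
    """
    n = len(receipts)
    all_items = sorted(set().union(*receipts))

    def count(items):
        # Support count: receipts containing every item in `items`.
        return sum(1 for receipt in receipts if items <= receipt)

    rules = {}
    for size in range(2, len(all_items) + 1):
        for combo in combinations(all_items, size):
            item_set = frozenset(combo)
            n_ab = count(item_set)
            if n_ab < min_support_count:
                continue  # item set is not frequent enough
            # Split the frequent item set into antecedent and consequent.
            for k in range(1, size):
                for ante in combinations(combo, k):
                    a = frozenset(ante)
                    b = item_set - a
                    conf = Fraction(n_ab, count(a))
                    if conf < min_confidence:
                        continue
                    lift = conf / Fraction(count(b), n)
                    key = (tuple(sorted(a)), tuple(sorted(b)))
                    rules[key] = {"support_count": n_ab, "conf": conf, "lift": lift}
    return rules


if __name__ == "__main__":
    for (a, b), stats in sorted(mine_rules(RECEIPTS).items()):
        print(", ".join(a), "->", ", ".join(b), stats)
```

The three Cola, Diapers, Milk rules in the hand-counted table should appear in the output with the same confidence and lift.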
Slide 8

Transforming Data*: From Text to Data in List-Format
1. Start with data in list form (from text); our grocery receipts came from a slide in the file L05-Pre.ppt. Copy the data from the source file.
2. Select a single cell (A1) and paste into the worksheet. Notice that the data pastes one Excel row for each line of text.
* You are not responsible for this in UGBA104, but this is a useful text- and data-processing feature of Excel.
Slide 9

Transforming Data: From Text to Data in List-Format
3. Select the column of data (where you pasted the list values).
4. Data Tab | Data Tools Group | Text to Columns
Slide 10

Transforming Data: From Text to Data in List-Format
5. In this example, the term “delimited” means that each “column” (or variable value) is separated by punctuation: a period or a comma. Alternatively, fixed-width columns are either truncated (if a data value is too long) or padded with white space (if a data value is too short) so that each column is always the same width.
Slide 11

Transforming Data: From Text to Data in List-Format
6a. In Step 2 of 3 of the Text to Columns Wizard, we specify:
• “Treat consecutive delimiters as one.” This means that a period and a white space are grouped together as a single delimiter (separator).
• Delimiters: Comma, Space, and Other: Period (for the period after the row number).
As an error check, scan the Data Preview to ensure that the data is separated into columns as you expect; pay especially close attention to additional delimiters leading or trailing each data value in a column.
Slide 12

Transforming Data: From Text to Data in List-Format
6b. If we had done the Text to Columns transformation but OMITTED the space and period as delimiters, look at what the data would look like.
• Notice that the first column still retains the number and period in front of the text entry.
• Notice the subtle difference in columns two through four: the leading white space.
Slide 13

Transforming Data: From Text to Data in List-Format
7. In Step 3 of 3 in the Text to Columns Wizard, we set the Excel data type for each newly converted column.
• Highlight each column in turn and use the radio button to specify the data type.
• The column heading indicates the selected data type.
Slide 14

Transforming Data: From Text to Data in List-Format
Starting: What the spreadsheet looked like when we first copied the list into Excel.
Ending: What the spreadsheet looks like after we finish the Text to Columns conversion.
Slide 15

Transforming Data*: From Text to Data in Binary-Matrix Format
1. Suppose instead our data looks like a series of receipts: Receipt 1, Receipt 2, Receipt 3, Receipt 4, Receipt 5.
2. Each receipt is a separate “Transaction ID (TxnID)” and every item on the receipt is a separate row.
* You are not responsible for this in UGBA104, but this is a useful text- and data-processing feature of Excel.
Slide 16

Transforming Data: From Text to Data in Binary-Matrix Format
3. Construct a pivot table.
Slide 17

Transforming Data: From Text to Data in Binary-Matrix Format
4. Add the zeros to the rows:
4.1. PivotTable Tools
4.2. Analyze Tab
4.3. Pivot Table “Options”
4.4. Set empty cells in the Pivot Table to zero.
Slide 18

Association Rules: XL Miner
2. Check the model inputs:
• Where is the data table?
• Update the Data range to reflect the pivot table.
• In this example, the first row DOES contain headers.
3.
Set the format of the input data: “Data in binary matrix format”
Slide 19

Recitation Exercises
NOTE 1: All worksheets for the recitation assignment are already available to you in bCourses. From bCourses, use the Files Tool (left-hand navigation bar). In the folder named “Week-by-Week”, in the subfolder named “Week 5”, see the file named “05-Excel-Pre.” This Excel workbook contains separate worksheets for every example shown in class and the data worksheets for the recitation assignment.
NOTE 2: All recitation assignment solutions discussed in class will be posted at 5 PM of the same day. From bCourses, use the Files Tool (left-hand navigation bar). In the folder named “Week-by-Week”, in the subfolder named “Week 5”, see the file named “R05-Post.” This PowerPoint file includes the complete lecture/recitation, including examples from the whiteboard as well as solutions.
Slide 20

Recitation Exercises: CookieMonster
Problem 1. (Adapted from Camm Problem 4.20) Cookie Monster Inc. is a company that specializes in the development of software that tracks the web browsing history of individuals. A sample of browser histories is provided in the worksheet CookieMonster in the workbook Wk05-Excel-Pre. Using binary matrix format, the entry in row i and column j indicates whether web site j was visited by user i. With a minimum support of 800 transactions and a minimum confidence of 50%, use XLMiner to generate a list of association rules. Review the top 14 rules.
Question 1.1. Consider only users who visit CNN. What is the probability that a CNN viewer visits the Weather Channel? What is the probability that a randomly selected person from the population visits the Weather Channel? How much more likely is the first situation than the second? Write your answer as an integer percentage (round-up to the nearest integer so no decimal answers) and do not include the percent sign (e.g. 12.25% is 12).
Handwritten notes for Question 1.1 (partly illegible): we only want people who visit CNN, so the first probability is the confidence of the rule, $P(W \mid \text{CNN}) = 51.2\%$. Dividing the confidence by $P(W)$ gives the lift, about 4.573458. Total number of transactions: 20,000.
From Camm Problem 4.20
Slide 21

Recitation Exercises: CookieMonster
Question 1.2. Consider Rule 5 relating Amazon and Pinterest visits. What is the joint probability of visiting both Amazon and Pinterest? Round your answer to five decimal places.
Handwritten note: support of {Amazon, Pinterest} = support count / total # of transactions = 891 / 20,000.
Question 1.3. Continuing with Rule 5, assume independence between visits to Amazon and visits to Pinterest. Under the expectation of independence, what would you expect the joint probability of Amazon and Pinterest to be? Round your answer to five decimal places.
Handwritten note: $P(\text{Amazon}) \cdot P(\text{Pinterest}) = \frac{1840}{20{,}000} \cdot \frac{1690}{20{,}000}$.
Question 1.4. What is your confidence in the rule that Amazon users visit Pinterest (as opposed to the rule that Pinterest users visit Amazon)? Round your answer to five decimal places.
Handwritten note: $P(\text{Pinterest} \mid \text{Amazon}) = \frac{P(\text{Pinterest and Amazon})}{P(\text{Amazon})} = \frac{891 / 20{,}000}{1840 / 20{,}000} \approx 0.4842$.
From Camm Problem 4.20
Slide 22

Recitation Exercises: CookieMonster
Question 1.5. Suppose that we discover an error in our counting and realize that there are actually more users who visited Pinterest (more 1’s in the Pinterest column) than we originally thought. Without knowing anything else, which of the following measures will definitely change?
a. The support for Pinterest
b. The support for Amazon
c. The support for the frequent item set (Pinterest and Amazon)
d. Lift for the rule Pinterest → Amazon
Question 1.6. What is the lift for the rule (which is not generated by ASP) that NBA → Deadspin? Answer to five decimal places.
Handwritten note: $\text{Lift} = \frac{P(\text{NBA and Deadspin})}{P(\text{NBA}) \cdot P(\text{Deadspin})}$. This is the same as the lift for Deadspin → NBA: lift has NO DIRECTIONALITY.
From Camm Problem 4.20
Slide 23

Recitation Exercises: Unsupervised Learning
Problem 2.
2.1. An analysis of items frequently co-occurring in transactions is known as:
a. market segmentation.
b. market basket analysis.
c. cluster analysis.
d. regression analysis.
2.2. A ___________ refers to the number of times a collection of items occurs together in a transaction data set.
a. support count
b. validation count
c. antecedent
d. consequent
2.3. The lift ratio of an association rule with a confidence value of 0.45 and in which the consequent occurs in 6 out of 10 cases is
a. 1.00
b. 0.75
c. 1.40
d. 0.54
Handwritten note: $\text{lift} = \frac{\text{confidence}}{P(B)} = \frac{0.45}{0.6} = 0.75$.
2.4. Suppose that the confidence of an association rule is 0.75 and the total number of transactions is 250. How many of those transactions support the consequent if the lift ratio is 1.875?
a. 150
b. 125
c. 175
d. 100
Handwritten note: $\text{lift} = \frac{\text{confidence}}{P(B)}$, so $P(B) = \frac{0.75}{1.875} = 0.4$ and $0.4 \times 250 = 100$.
Slide 24

Recitation Exercises: Unsupervised Learning
2.5. Data preparation includes all of the following except which task?
a. calculating the confidence ratio for all association rules
b. treating missing data
c. identifying erroneous data and outliers
d. defining the appropriate way to represent variables
2.6. Suppose we had a data set from a call center where customers were asked to choose between the following three options: hear account information, billing questions, and customer service. Using the given order of the three options, and using 0-1 dummy variables to encode the categorical variables, which of the following combinations would yield an entry "customer service"?
a. 000
b. 010
c. 001
d. 100
2.7. Which of the following is true for Euclidean distances? It is:
a.
Not affected by the scale on which variables are measured.
b. Used to measure dissimilarity between categorical variable observations.
c. Commonly used as a method of measuring dissimilarity between quantitative observations.
d. It increases with the increase in similarity between variable values.
Slide 25

The next few slides contain optional problems based on this week’s material, followed by the solutions. If time permits, we will begin to work these problems during Wednesday’s Recitation.
The first question is an association rule mining exercise that we will not otherwise cover in recitation but that you are still responsible for. The second question analyzes a dendrogram from hierarchical clustering. Be sure you are comfortable with this idea.
On Wednesday at 5:00 PM, the solution to this problem will post to the bCourses Files Tool in the folder named “Week-by-Week”, in the subfolder named “Week 5”, in the file named “R05-Post.”
If you do not understand the solutions at the end of this deck, then attend Friday’s Discussion Section, where a GSI will work this problem at the start of the section (before discussing last week’s Homework Problems).
Slide 26

Friday GSI Discussion Section: AppleCart
Apple Inc. tracks online transactions at its iStore and is interested in learning about the purchase patterns of its customers in order to provide recommendations as a customer browses its web site. A sample of the “shopping cart” data in binary matrix format resides in the worksheet AppleCart in the workbook Wk05-Excel-Pre. Each row indicates which iPad features and accessories a customer selected. Using a minimum support of 10% of the total number of transactions and a minimum confidence of 50%, use XLMiner to generate a list of association rules.
a.
Express, in words, what the rule with the largest lift ratio is saying about the relationship between the antecedent item set and consequent item set.
b. Explain, in words, the meaning of the support count for the item set involved in the rule with the largest lift ratio.
c. Apply the definition of confidence to explain, in words and numbers from the table, how to calculate the confidence of the rule with the largest lift ratio.
d. For the rule with the largest lift ratio, use words to explain the meaning of the lift ratio for a customer who purchases the basket of items in the antecedent.
e. Review the top 15 rules. As a businessperson, what would these rules suggest to you about either pricing or inventory management?
From Camm Problem 4.19
SOLUTION
Slide 27
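The binary matrix format used by the AppleCart and CookieMonster worksheets can also be produced outside Excel. The following is a minimal Python sketch of the pivot-table step, assuming the input is a list of (TxnID, item) rows with one row per item on a receipt. The function name `to_binary_matrix` is hypothetical, and the sample rows are the first two grocery receipts from the earlier example, not data from either worksheet.

```python
def to_binary_matrix(rows):
    """Turn (txn_id, item) rows into a 0/1 matrix, like the pivot table
    with empty cells set to zero.

    Returns (items, matrix): the sorted item names (column headers) and a
    dict mapping each txn_id to its list of 0/1 flags, one per item.
    """
    rows = list(rows)  # allow any iterable; we pass over it twice
    items = sorted({item for _, item in rows})

    # Group the items on each receipt by transaction ID.
    baskets = {}
    for txn_id, item in rows:
        baskets.setdefault(txn_id, set()).add(item)

    # 1 if the item is on the receipt, 0 otherwise.
    matrix = {
        txn_id: [1 if item in basket else 0 for item in items]
        for txn_id, basket in sorted(baskets.items())
    }
    return items, matrix


if __name__ == "__main__":
    sample_rows = [
        (1, "Bread"), (1, "Milk"),
        (2, "Bread"), (2, "Diapers"), (2, "Beer"), (2, "Eggs"),
    ]
    header, matrix = to_binary_matrix(sample_rows)
    print(["TxnID"] + header)
    for txn_id, flags in matrix.items():
        print([txn_id] + flags)
```

In this layout the first row contains headers, which matches the XL Miner setting used for binary matrix input.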