UGBA104: Intro to Business Analytics
R04 Pre Unsupervised Learning
Frequent Items and Association Rules
Professors Terry Hendershott and Thomas Lee
Operations and Information Technology Management
Overview
Week 04
• Monday
– Why: Associations as Opportunities
– What: Descriptive Models (unsupervised learning, Camm 4.3)
    Frequent Item Sets
    Association Rules
    Evaluating Rules
• Wednesday
– How: Transforming Observations
    Association Rules (Camm 4.3)
Next Week
• What: Evaluating Predictive Models
• Why: Discrimination and Diagnosis; Relevance and Recommendation
• How: Classification
Hendershott and Lee
UC Berkeley, Haas School
Slide 2
Association Rules
Review
Define Support Count of A: The number of transactions (receipts) in which item (set) A appears.
Define Support of A: P(A) = Support Count of A / Total Count of Transactions
Define Confidence in Rule A → B: Conf(A → B) = P(A and B) / P(A)
For Item Set(s) A and B, the set AB: Lift(AB) = P(A and B) / (P(A) * P(B))
Grocery Receipts:
1. Bread, Milk
2. Bread, Diapers, Beer, Eggs
3. Milk, Diapers, Beer, Cola
4. Bread, Milk, Diapers, Beer
5. Bread, Milk, Diapers, Cola
Item Set (AB)       | A → B               | Support A = P(A) | Support B = P(B) | Support = P(A & B) | Lift AB | Conf
Cola & Diapers      | Cola → Diapers      | 2/5              | 4/5              | 2/5                | 10/8    | 1
                    | Diapers → Cola      | 4/5              | 2/5              | 2/5                | 10/8    | 1/2
Cola, Diapers, Milk | Diapers Milk → Cola | 3/5              | 2/5              | 2/5                | 10/6    | 2/3
                    | Cola Diapers → Milk | 2/5              | 4/5              | 2/5                | 10/8    | 1
                    | Cola → Diapers Milk | 2/5              | 3/5              | 2/5                | 10/6    | 1
Slide 3
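The hand counts for the grocery receipts can be checked with a short script. The following is a minimal Python sketch, not part of the course materials: the function names (support_count, support, confidence, lift) are illustrative assumptions, and exact fractions are used so the results can be compared directly with the hand-counted values.

```python
from fractions import Fraction

# The five grocery receipts from the Review slide, one set of items per receipt.
RECEIPTS = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]


def support_count(items, receipts=RECEIPTS):
    """Number of receipts that contain every item in `items`."""
    items = set(items)
    return sum(1 for receipt in receipts if items <= receipt)


def support(items, receipts=RECEIPTS):
    """Support count divided by the total number of receipts, i.e. P(items)."""
    return Fraction(support_count(items, receipts), len(receipts))


def confidence(a, b, receipts=RECEIPTS):
    """Confidence of the rule A -> B: P(A and B) / P(A)."""
    return support(set(a) | set(b), receipts) / support(a, receipts)


def lift(a, b, receipts=RECEIPTS):
    """Lift of the item set AB: P(A and B) / (P(A) * P(B))."""
    joint = support(set(a) | set(b), receipts)
    return joint / (support(a, receipts) * support(b, receipts))


print(confidence({"Diapers"}, {"Cola"}))  # rule Diapers -> Cola
print(lift({"Cola"}, {"Diapers"}))        # item set Cola & Diapers
```

Because lift divides the joint support by both marginal supports, swapping the antecedent and consequent leaves it unchanged, while confidence depends on which side is the antecedent.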
Association Rules
XL Miner
1. Copy-and-Paste the receipts into Excel. In XL Miner’s “List format”:
• Each row corresponds to one observation.
• Each column entry corresponds to one item in the observation.
Notice: The order of items does not matter in List format. In row 3, Milk is the first item in the observation. In row 4, Milk is the second item in the observation. Although Beer always follows Diapers in this example, XL Miner does not require it.
Slide 4
Association Rules
XL Miner
2. Check the model inputs:
• Where is the data table?
• Do you have all of the rows and columns?
• In this example, the first row does not contain headers.
3. Set the format of the input data: “List format” in this example.
4. Set the parameters for Association Rule Mining:
• Set the minimum support threshold so that XL Miner will only show rules made of item sets that appear in a “meaningful” number of transactions.
• Recall that “confidence” refers to a conditional probability: the probability of the consequent conditioned on the antecedent.
Notice: XL Miner calls it “Minimum support” but is asking for the “support count” as defined and used in the text.
Notice: In this example, we deliberately lowered the thresholds of support and confidence so that we could compare XL Miner’s rules to some that we counted by hand. In general, 20% Support and 80% Confidence is a reasonable starting point for analysis.
Slide 5
Association Rules
XL Miner
Confirm model settings – which variables did you use, how many are there, etc.
“First Row Contains Headers” was unchecked.
Confirm:
Min Support = 2
Min Confidence = 50
Slide 6
Association Rules
XL Miner
XL Miner generates the Association Rules in a new worksheet.
Item Set (AB)       | A → B               | Support A = P(A) | Support B = P(B) | Support = P(A & B) | Lift AB | Conf
Cola, Diapers, Milk | Diapers Milk → Cola | 3/5              | 2/5              | 2/5                | 10/6    | 2/3
                    | Cola Diapers → Milk | 2/5              | 4/5              | 2/5                | 10/8    | 1
                    | Cola → Diapers Milk | 2/5              | 3/5              | 2/5                | 10/6    | 1
Slide 7
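To see where rules like these come from, the settings confirmed on Slide 6 (minimum support count of 2, minimum confidence of 50%) can be applied by brute force to the five grocery receipts. The sketch below is an illustration only and is not XL Miner's algorithm: the function name mine_rules and the dictionary keys are assumptions, and treating the thresholds as "greater than or equal to" is also an assumption.

```python
from itertools import combinations

RECEIPTS = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]


def mine_rules(receipts, min_support_count=2, min_confidence=0.5):
    """Enumerate rules A -> B that meet both thresholds (brute force)."""
    n = len(receipts)
    items = sorted(set().union(*receipts))

    # Step 1: count every item set that meets the minimum support count.
    counts = {}
    for size in range(1, len(items) + 1):
        found_any = False
        for combo in combinations(items, size):
            count = sum(1 for receipt in receipts if set(combo) <= receipt)
            if count >= min_support_count:
                counts[frozenset(combo)] = count
                found_any = True
        if not found_any:
            break  # no frequent set of this size, so none of a larger size

    # Step 2: split each frequent item set into antecedent and consequent.
    rules = []
    for itemset, count_ab in counts.items():
        for k in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), k):
                a = frozenset(antecedent)
                b = itemset - a
                conf = count_ab / counts[a]
                if conf >= min_confidence:
                    rules.append({
                        "antecedent": tuple(sorted(a)),
                        "consequent": tuple(sorted(b)),
                        "support_count": count_ab,
                        "confidence": conf,
                        "lift": conf / (counts[b] / n),
                    })
    return rules


for rule in mine_rules(RECEIPTS):
    print(rule["antecedent"], "->", rule["consequent"],
          rule["support_count"], round(rule["confidence"], 3), round(rule["lift"], 3))
```

Every subset of a frequent item set is itself frequent, which is why the lookups counts[a] and counts[b] in Step 2 always succeed.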
Association Rules
XL Miner
PMML: Predictive Model Markup Language
An XML-based markup language for specifying the parameter settings and design of a model configuration.
Enables portability across different platforms (e.g. from ASP in Excel to R or Python).
Slide 8
Transforming Data*
From Text to Data in List-Format
1. Start with data in list form (from text); our grocery receipts came from a slide in the file L05-Pre.ppt. Copy data from the source file.
2. Select a single cell (A1) and paste into the worksheet. Notice that the data pastes, one Excel row for each line of text.
* You are not responsible for this in UGBA104, but this is a useful text- and data-processing feature of Excel.
Slide 9
Transforming Data
From Text to Data in List-Format
3. Select the column of data (where you pasted the list values).
4. Data Tab | Data Tools Group | Text to Columns
Slide 10
Transforming Data
From Text to Data in List-Format
5. In this example, the term “delimited” means that each “column” (or variable value) is separated by punctuation: a period or a comma. Alternatively, fixed-width columns are either truncated (if a data value is too long) or padded with white space (if a data value is too short) so that each column is always the same width.
Slide 11
Transforming Data
From Text to Data in List-Format
6a. In Step 2 of 3 of the Text to Columns Wizard, we specify that:
• “Treat consecutive delimiters as one.” This means that a period and a white-space are grouped together as a single delimiter (separator).
• Comma, Space, Other – Period (for after the row number).
As an error check, scan the Data Preview to ensure that the data is separated into columns as you expect; pay especially close attention to additional delimiters leading or trailing each data value in a column.
Slide 12
Transforming Data
From Text to Data in List-Format
6b. If we had done the Text to Columns transformation but OMITTED the space and period as delimiters, look at what the data would look like.
• Notice that the first column still retains the number and period in front of the text entry.
• Notice the subtle difference in columns 2 through 4: the leading white space.
Slide 13
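For readers working outside Excel, the delimiter choices in steps 6a and 6b can be mimicked in a few lines of Python. This is a sketch under stated assumptions, not a description of how Excel implements Text to Columns: the function names text_to_columns and to_list_format are hypothetical, and dropping the leading row number to obtain list format is an assumption.

```python
import re

# The pasted text, one line per receipt, as in step 2.
RAW_LINES = [
    "1. Bread, Milk",
    "2. Bread, Diapers, Beer, Eggs",
    "3. Milk, Diapers, Beer, Cola",
    "4. Bread, Milk, Diapers, Beer",
    "5. Bread, Milk, Diapers, Cola",
]


def text_to_columns(line, delimiters="., "):
    """Split a line on any of the delimiter characters.

    The trailing "+" in the pattern treats consecutive delimiters as one,
    so ". " or ", " acts as a single separator.
    """
    pattern = "[" + re.escape(delimiters) + "]+"
    return [field for field in re.split(pattern, line) if field]


def to_list_format(lines):
    """One row per receipt, one entry per item (row number dropped)."""
    return [text_to_columns(line)[1:] for line in lines]


# Step 6a: period, comma, and space all act as delimiters.
print(text_to_columns(RAW_LINES[0]))

# Step 6b: comma only. The number stays attached to the first item and
# later items keep their leading white space.
print(text_to_columns(RAW_LINES[0], delimiters=","))
```

One caveat of this sketch: because the space is a delimiter, an item name that itself contains a space would be split into two columns.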
Transforming Data
From Text to Data in List-Format
7. In Step 3 of 3 in the Text to Columns Wizard, we set the Excel data type for each newly-converted column.
• Highlight each column in turn and use the radio button to specify the data type.
• The column heading indicates the selected data type.
Slide 14
Transforming Data
From Text to Data in List-Format
Starting: What the spreadsheet looked like
when we first copied the list into Excel.
Ending: What the spreadsheet looks like after
we finish the Text to Columns conversion.
Slide 15
Transforming Data*
From Text to Data in Binary-Matrix Format
1. Suppose instead our data looks like a series of receipts:
Receipt 1.
Receipt 2.
Receipt 3.
Receipt 4.
Receipt 5.
2. Each receipt is a separate “Transaction ID (TxnID)” and every item on the receipt is a separate row.
* You are not responsible for this in UGBA104, but this is a useful text- and data-processing feature of Excel.
Slide 16
Transforming Data
From Text to Data in Binary-Matrix Format
3. Construct a pivot table.
Slide 17
Transforming Data
From Text to Data in Binary-Matrix Format
4. Add the zeros to the rows:
4.1. PivotTable Tools
4.2. Analyze Tab
4.3. Pivot Table “Options”
4.4. Set empty cells in the Pivot Table to zero.
Slide 18
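The pivot-table steps above turn one-row-per-item data into a binary matrix with explicit zeros. A rough Python equivalent is sketched below; it is an illustration rather than the Excel procedure, the name to_binary_matrix is hypothetical, and it records presence (0 or 1) rather than a count, which is an assumption about how duplicate items on a receipt should be handled.

```python
# One (TxnID, Item) pair per row, as in the receipt layout of step 2.
LONG_ROWS = [
    (1, "Bread"), (1, "Milk"),
    (2, "Bread"), (2, "Diapers"), (2, "Beer"), (2, "Eggs"),
    (3, "Milk"), (3, "Diapers"), (3, "Beer"), (3, "Cola"),
    (4, "Bread"), (4, "Milk"), (4, "Diapers"), (4, "Beer"),
    (5, "Bread"), (5, "Milk"), (5, "Diapers"), (5, "Cola"),
]


def to_binary_matrix(long_rows):
    """Return (header, matrix): one row per TxnID, one 0/1 column per item."""
    txn_ids = sorted({txn for txn, _ in long_rows})
    items = sorted({item for _, item in long_rows})
    present = set(long_rows)

    header = ["TxnID"] + items
    matrix = [
        # 1 if the item appears on the receipt; otherwise the "empty cell" is 0.
        [txn] + [1 if (txn, item) in present else 0 for item in items]
        for txn in txn_ids
    ]
    return header, matrix


header, matrix = to_binary_matrix(LONG_ROWS)
print(header)
for row in matrix:
    print(row)
```

The first row of the result holds the item names, which matches the binary-matrix input where the first row does contain headers.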
Association Rules
XL Miner
2. Check the model inputs:
• Where is the data table?
• Update the Data range to reflect the pivot table.
• In this example, the first row DOES contain headers.
3. Set the format of the input data: “Data in binary matrix format”
Slide 19
Recitation Exercises
NOTE 1: All worksheets for the recitation assignment are already
available to you in bCourses. From bCourses, use the Files Tool (left
hand navigation bar). In the folder named “Week-by-Week” in the
subfolder named “Week 5” see the file named “05-Excel-Pre.” This
Excel workbook contains separate worksheets for every example
shown in class and the data worksheets for the recitation assignment.
NOTE 2: All recitation assignment solutions discussed in class will be
posted at 5PM of the same day. From bCourses, use the Files Tool (left
hand navigation bar). In the folder named “Week-by-Week” in the
subfolder named “Week 5” see the file named “R05-Post.” This
PowerPoint file includes the complete lecture/recitation including
examples from the whiteboard as well as solutions.
Slide 20
Recitation Exercises: CookieMonster
Problem 1. (Adapted from Camm Problem 4.20) Cookie Monster Inc. is a
company that specializes in the development of software that tracks web
browsing history of individuals. A sample of browser histories is provided in the
worksheet CookieMonster in the workbook Wk05-Excel-Pre. Using binary
matrix format, the entry in row i and column j indicates whether web site j was
visited by user i.
With a minimum support of 800 transactions and a minimum confidence of
50%, use XLMiner to generate a list of association rules. Review the top 14
rules.
Question 1.1. Consider only users who visit CNN. What is the probability that
a CNN viewer visits the Weather Channel? What is the probability that a
randomly selected person from the population visits the Weather Channel?
How much more likely is the first situation than the second? Write your answer
as an integer percentage (round up to the nearest integer so no decimal
answers) and do not include the percent sign (e.g. 12.25% is 12).
(Handwritten annotation, partly legible: lift 5.575; confidence 51.2; 20,000 transactions.)
From Camm Problem 4.20
Slide 21
Recitation Exercises: CookieMonster
Question 1.2. Consider Rule 5 relating Amazon and Pinterest visits. What is
the joint probability of visiting both Amazon and Pinterest? Round your answer
to five decimal places.
(Handwritten annotation: support count 891; support of {Amazon, Pinterest} = support count / total # of transactions = 891 / 20,000.)
Question 1.3. Continuing with Rule 5, assume independence between visits to
Amazon and visits to Pinterest. Under the expectation of independence, what
would you expect the joint probability of Amazon and Pinterest to be? Round
your answer to five decimal places.
(Handwritten annotation: P(Amazon) * P(Pinterest), using the counts 1840 and 1690 out of 20,000.)
Question 1.4. What is your confidence in the rule that Amazon users visit
Pinterest (as opposed to the rule that Pinterest users visit Amazon)? Round
your answer to five decimal places.
(Handwritten annotation: P(Pinterest | Amazon) = P(Pinterest and Amazon) / P(Amazon) = (891 / 20,000) / (1840 / 20,000) = 0.4842.)
From Camm Problem 4.20
Slide 22
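The three Rule 5 quantities in Questions 1.2 through 1.4 follow from four counts. The sketch below is a worked illustration only: the counts (891, 1,840, 1,690 and 20,000) are read from the handwritten annotations on the slide rather than from the CookieMonster worksheet, the assignment of 1,840 to Amazon and 1,690 to Pinterest is an assumption based on those notes, and the variable names are hypothetical.

```python
# Counts as read from the handwritten notes (assumptions, not worksheet values).
TOTAL_TRANSACTIONS = 20_000
COUNT_AMAZON = 1_840
COUNT_PINTEREST = 1_690
COUNT_BOTH = 891

# Question 1.2: joint probability = support of {Amazon, Pinterest}.
joint = COUNT_BOTH / TOTAL_TRANSACTIONS

# Question 1.3: joint probability expected if the two visits were independent.
p_amazon = COUNT_AMAZON / TOTAL_TRANSACTIONS
p_pinterest = COUNT_PINTEREST / TOTAL_TRANSACTIONS
expected_if_independent = p_amazon * p_pinterest

# Question 1.4: confidence of Amazon -> Pinterest = P(Amazon and Pinterest) / P(Amazon).
confidence_amazon_to_pinterest = joint / p_amazon

# Lift compares the observed joint probability with the independent expectation.
lift = joint / expected_if_independent

print(round(joint, 5))
print(round(expected_if_independent, 5))
print(round(confidence_amazon_to_pinterest, 5))
print(round(lift, 5))
```

The same lift results from dividing the confidence by P(Pinterest), which is a useful cross-check when only the rule table is available.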
Recitation Exercises: CookieMonster
Question 1.5. Suppose that we discover an error in our counting and realize
that there are actually more users who visited Pinterest (more 1’s in the
Pinterest column) than we originally thought. Without knowing anything else,
which of the following measures will definitely change?
a. The support for Pinterest
b. The support for Amazon
c. The support for the frequent item set (Pinterest and Amazon)
d. Lift for the rule Pinterest → Amazon
Question 1.6. What is the lift for the rule (which is not generated by ASP) that
NBA → Deadspin? Answer to five decimal places.
(Handwritten annotation: Lift = P(NBA and Deadspin) / (P(NBA) * P(Deadspin)); the lift is the same in either direction: no directionality.)
From Camm Problem 4.20
Slide 23
Recitation Exercises: Unsupervised Learning
Problem 2.
2.1. An analysis of items frequently co-occurring in transactions is known as:
a. market segmentation.
b. market basket analysis.
c. cluster analysis.
d. regression analysis.
2.2. A ___________ refers to the number of times a collection of items occur
together in a transaction data set.
a. support count
b. validation count
c. antecedent
d. consequent
2.3. The lift ratio of an association rule with a confidence value of 0.45 and in
which the consequent occurs in 6 out of 10 cases is
a. 1.00
b. 0.75
c. 1.40
d. 0.54
(Handwritten annotation, partly legible: lift = confidence / P(B); 0.45; 0.75.)
2.4. Suppose that the confidence of an association rule is 0.75 and the total
number of transactions is 250. How many of those transactions support the
consequent if the lift ratio is 1.875?
a. 150
b. 125
c. 175
d. 100
(Handwritten annotation, partly legible: lift = confidence / P(B); 0.75 / P(B) = 1.875; P(B) = 0.4.)
Slide 24
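Questions 2.3 and 2.4 both rest on the identity lift = confidence / P(B), which follows from dividing the lift definition P(A and B) / (P(A) * P(B)) through by P(A). A small Python sketch of that arithmetic is given below; the function names are hypothetical and the code only restates the numbers given in the two questions.

```python
def lift_from_confidence(confidence, consequent_support):
    """lift = confidence / P(B), where P(B) is the support of the consequent."""
    return confidence / consequent_support


def consequent_count(confidence, lift, total_transactions):
    """Solve lift = confidence / P(B) for P(B), then scale to a count."""
    consequent_support = confidence / lift
    return consequent_support * total_transactions


# Question 2.3: confidence 0.45, consequent occurs in 6 out of 10 cases.
print(round(lift_from_confidence(0.45, 6 / 10), 2))

# Question 2.4: confidence 0.75, lift ratio 1.875, 250 transactions.
print(round(consequent_count(0.75, 1.875, 250)))
```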
Recitation Exercises: Unsupervised Learning
2.5. Data preparation includes all of the following except which task?
a. calculating the confidence ratio for all association rules
b. treating missing data
c. identifying erroneous data and outliers
d. defining the appropriate way to represent variables
2.6. Suppose we had a data set from a call center where customers were
asked to choose between the following three options: hear account
information, billing questions, and customer service. Using the given order
of the three options, and using 0-1 dummy variables to encode the
categorical variables, which of the following combinations would yield an
entry "customer service"?
a. 000
b. 010
c. 001
d. 100
2.7. Which of the following is true of Euclidean distance? It:
a. Is not affected by the scale on which variables are measured.
b. Is used to measure dissimilarity between categorical variable
observations.
c. Is commonly used as a method of measuring dissimilarity between
quantitative observations.
d. Increases with the increase in similarity between variable values.
Slide 25
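Two ideas from this slide, 0-1 dummy encoding and Euclidean distance, are easy to experiment with in code. The sketch below is illustrative only: it assumes one dummy variable per option in the order listed in Question 2.6, and the two customers (age in years, income in dollars) are made-up values used to show how the distance responds to a change of units.

```python
import math

# Options in the order given in Question 2.6.
OPTIONS = ["hear account information", "billing questions", "customer service"]


def dummy_encode(choice, options=OPTIONS):
    """One 0-1 dummy per option, in the given order (an assumed encoding)."""
    return "".join("1" if option == choice else "0" for option in options)


def euclidean(u, v):
    """Square root of the sum of squared differences between two observations."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


print(dummy_encode("billing questions"))

# Hypothetical customers: (age in years, income in dollars).
customer_x = (25, 50_000)
customer_y = (45, 52_000)
distance_dollars = euclidean(customer_x, customer_y)

# Same customers with income expressed in thousands of dollars.
customer_x_k = (25, 50.0)
customer_y_k = (45, 52.0)
distance_thousands = euclidean(customer_x_k, customer_y_k)

print(round(distance_dollars, 1), round(distance_thousands, 1))
```

With income in dollars the income difference dominates the distance; with income in thousands the age difference dominates. Comparing the two printed distances is a quick way to see what measurement scale does to the result.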
The next few slides contain optional problems based on this week’s material
followed by the solutions.
If time permits, we will begin to work these problems during Wednesday’s
Recitation.
The first question is an association rule mining exercise that we will not
otherwise cover in recitation but you are still responsible for.
The second question analyzes a dendrogram from hierarchical clustering.
Be sure you are comfortable with this idea.
On Wednesday at 5:00 PM, the solution to this problem will post to the
bCourses Files Tool in the folder named “Week-by-Week” in the subfolder
named “Week 5” in the file named “R05-Post.”
If you do not understand the solutions at the end of this deck, then attend
Friday’s Discussion Session, where a GSI will work this problem at the start
of the Discussion Session (before discussing last week’s Homework
Problems).
Slide 26
Friday GSI Discussion Section: AppleCart
Apple Inc. tracks online transactions at its iStore and is interested in learning
about the purchase patterns of its customers in order to provide
recommendations as a customer browses its web site. A sample of the
“shopping cart” data in binary matrix format resides in the worksheet AppleCart
in the workbook Wk05-Excel-Pre. Each row indicates which iPad features and
accessories a customer selected.
Using a minimum support of 10% of the total number of transactions and a
minimum confidence of 50%, use XLMiner to generate a list of association
rules.
a. Express, in words, what the rule with the largest lift ratio is saying about the
relationship between the antecedent item set and consequent item set.
b. Explain, in words, the meaning of the support count for the item set involved in the
rule with the largest lift ratio.
c. Apply the definition of confidence to explain, in words and numbers from the table,
how to calculate the confidence of the rule with the largest lift ratio.
d. For the rule with the largest lift ratio, use words to explain the meaning of the lift ratio
for a customer who purchases the basket of items in the antecedent.
e. Review the top 15 rules. As a business-person, what would these rules suggest to
you about either pricing or inventory management?
From Camm Problem 4.19
SOLUTION
Slide 27