
L1-L11 Notes

Lecture 1
Introduction to Data Analytics and Descriptive Analytics
Corporate Gurukul – Data Analytics using Deep Learning
June 2023
Lecturer: A/P TAN Wee Kek
Email: tanwk@comp.nus.edu.sg :: Tel: 6516 6731 :: Office: COM3-02-35
Learning Objectives
At the end of this lecture, you should understand:
- What is Data Analytics?
- What is Descriptive Analytics?
- Exploratory Data Analysis
- Descriptive Statistical Measures:
  - Measures of Location
  - Measures of Dispersion
  - Measures of Shape
  - Measures of Association
What is Data Analytics?
Data analytics:
- Uses data, computing technology, statistical analysis, quantitative methods, and mathematical or computer-based models.
- Aims to gain improved insights about business operations and make better, fact-based decisions.
Scope of data analytics:
- Descriptive analytics – Uses data to understand past and present performance.
- Predictive analytics – Uses data to predict future performance.
- Prescriptive analytics – Builds on what we know from descriptive and predictive analytics to recommend better decisions.
What is Data Analytics? (cont.)
Hypothetical example of data analytics in a typical business scenario:
- Retail markdown decisions – Most department stores clear seasonal inventory by reducing prices.
- The question is – when to reduce the price and by how much?
- Descriptive analytics – Examine historical data for similar products (prices, units sold, advertising, …).
- Predictive analytics – Predict sales based on price.
- Prescriptive analytics – Find the best combination of pricing and advertising to maximise sales revenue.
What is Data Analytics? (cont.)
Real-world example of data analytics:
- Harrah’s Entertainment – Harrah’s owns numerous hotels and casinos.
- Uses predictive analytics to:
  - Forecast demand for rooms.
  - Segment customers by gaming activities.
- Uses prescriptive analytics to:
  - Set room rates.
  - Allocate rooms.
  - Offer perks and rewards to customers.
What is Data Mining?
Data mining:
- Refers to the process of finding anomalies, patterns and correlations within large data sets to predict outcomes.
- The most prevalent form of predictive analytics in use in the modern business environment.
Data in data mining:
- A collection of facts usually obtained as the result of experiences, observations or experiments.
- May consist of numbers, words and images.
- The lowest level of abstraction (from which information and knowledge are derived).
What is Data Mining? (cont.)
Taxonomy of data:
[Figure: taxonomy of data – data divides into categorical (nominal, ordinal) and numerical (interval, ratio) types.]
Categorical Data
Categorical data:
- Represent the labels of multiple classes used to divide a variable into specific groups.
- Examples:
  - Race, sex, age group and educational level.
  - Certain variables such as age group and educational level may be represented in numerical format using their exact values.
  - It is often more informative to categorise them into a small number of ordered classes.

  Age Group                  | Age Range
  Primary School Students    | 7 – 12
  Secondary School Students  | 13 – 16
  Pre-tertiary Students      | 17 – 19
  Tertiary Students          | >= 20
Categorical Data (cont.)
- Discrete in nature since there are a finite number of values with no continuum between them.
Nominal data:
- Categorical variables without natural ordering.
- E.g., marital status can be categorised into (1) single, (2) married and (3) divorced.
- Nominal data can be represented as either:
  - Binomial values with two possible values, e.g., yes/no or true/false.
  - Multinomial values with three or more possible values, e.g., marital status or ethnicity (white/black/Latino/Asian).
Categorical Data (cont.)
Ordinal data:
- Categorical variables that lend themselves to natural ordering.
  - I.e., the codes assigned to objects represent the rank order among them.
  - But it makes no sense to calculate the differences or ratios between values.
- Examples:
  - Credit score can be categorised as (1) low, (2) medium and (3) high.
  - Education level can be categorised as (1) high school, (2) college and (3) graduate school.
  - Age group.
- The additional rank-order information is useful in certain data mining algorithms for building a better model.
Numerical Data
Numerical data:
- Represent the numerical values of specific variables.
- Examples:
  - Age, number of children, total household income (in SGD), travel distance (in kilometres) and temperature (in Celsius).
- Continuous in nature since the variable contains continuous measures on a specific scale that allows insertion of interim values.
- Numerical values can be integer (whole numbers only) or real (including fractional numbers).
  - A discrete variable represents finite, countable data.
  - A continuous variable represents scalable measures and may contain an infinite number of fractional values.
Numerical Data (cont.)
Interval data:
- Numerical variables that are measured on interval scales.
- Example:
  - Temperature on the Celsius scale, where one degree is 1/100 of the difference between the melting temperature and the boiling temperature of water.
  - There is no notion of an absolute zero value.
Ratio data:
- Numerical variables that are measured on ratio scales.
- A ratio scale has a non-arbitrary zero value defined.
- Examples:
  - Common measurements in physical science and engineering such as mass, length and time.
  - The Kelvin temperature scale has a non-arbitrary zero value of absolute zero, which is equal to -273.15 degrees Celsius.
Other Data Types
Data mining may involve many other data types:
- Unstructured text, image and audio.
- These data types need to be converted into some form of categorical or numerical representation before they can be processed by data mining algorithms.
Data may also be classified as:
- Static.
- Dynamic, i.e., temporal or time series.
Data Type and Data Mining
- Some data mining methods are particular about the data types that they can handle.
- Providing incompatible data types may lead to:
  - Incorrect models.
  - Halting of the model development process.
- Examples:
  - Neural networks, support vector machines and logistic regression require numerical variables:
    - Categorical variables are converted into numerical representations using 1-of-N pseudo variables.
    - Both nominal and ordinal variables with N unique values can be transformed into N pseudo variables with binary values – 1 or 0.
Data Type and Data Mining (cont.)
- ID3 (a classic decision tree algorithm) and rough sets (a relatively new rule induction algorithm) require categorical variables:
  - Numeric variables are discretised into categorical representations.
- Multiple regression with categorical predictor variables:
  - Binomial categorical variable:
    - Simply represent the variable with a binary value – 1 or 0.
  - Multinomial categorical variable:
    - Need to perform the process of dummy coding (see the sketch below).
    - A multinomial variable with k possible values may be represented by k-1 binomial variables, each coded with a binary value – 1 or 0.
    - Similar concept to 1-of-N pseudo variables, but one value must be excluded.
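As an illustration of 1-of-N pseudo variables versus dummy coding, here is a minimal sketch using pandas.get_dummies; the marital_status column and its values are hypothetical and not taken from the course's sample scripts.

```python
import pandas as pd

# Hypothetical multinomial categorical variable with k = 3 values.
df = pd.DataFrame({"marital_status": ["single", "married", "divorced", "married"]})

# 1-of-N pseudo variables: N binary columns, one per unique value.
one_of_n = pd.get_dummies(df["marital_status"], prefix="status", dtype=int)

# Dummy coding for regression: k-1 binary columns (the excluded value is the baseline).
dummy = pd.get_dummies(df["marital_status"], prefix="status", drop_first=True, dtype=int)

print(one_of_n)
print(dummy)
```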
Data Type and Data Mining (cont.)
The good news is:
- Some, but not all, software tools accept a mix of numeric and nominal variables.
- Such tools internally make the necessary conversion before processing the data.
- For learning purposes, you need to know when and how to use the correct data types.
Tukey’s Exploratory Data Analysis
- Exploratory Data Analysis (EDA) was first proposed by John Tukey.
- EDA is more of a mindset for analysis than an explicit set of techniques and models:
  - EDA emphasises data observation, visualisation and the careful application of technique to make sense of data.
  - EDA is NOT about fitting data into an analytical model BUT rather about fitting a model to the data.
- Two main EDA approaches:
  - Descriptive analytics – Use of descriptive statistical measures.
  - Data visualisation.
Why Study Exploratory Data Analysis?
- Visually examining data to understand patterns and trends:
  - Raw data should be examined to learn the trends and patterns over time, and between dimensions in the data.
  - Visual examination can help to frame which analytical methods can be applied to the data.
- Using the best possible methods to gain insights into not just the data, but what the data is saying:
  - Going beyond the details of the data to understand what the data is saying in the context of answering our question.
Why Study Exploratory Data Analysis? (cont.)
- Identifying the best performing variables and model for the data:
  - With so much (big) data available, how do we decide what is useful?
  - EDA helps ascertain which variables are influential and important.
- Detecting anomalous and suspicious outlier data:
  - Outliers and anomalies in the data may be important, highly relevant, and meaningful to the business.
  - They can also be just random noise that can be ignored and excluded from analytical consideration.
Why Study Exploratory Data Analysis? (cont.)
- Testing hypotheses and assumptions:
  - Use insights derived from data to create hypotheses.
  - Test hypothesis-driven changes.
  - The gist is to use data to test hypotheses.
- Finding and applying the best possible model to fit the data:
  - Predictive modelling and analysis requires an EDA approach that is more focused on the data rather than the model.
General Process of Exploratory Data Analysis
1. Identify the key dimensions and metrics of the dataset.
2. Use analytical software to visualise the data using the techniques that we will review in this lecture.
3. Apply the appropriate statistical model to generate value.
The general idea is to:
- Use tools and pattern recognition abilities to observe the data relationships and unusual data movements.
- By visually examining and exploring the data, it becomes possible to focus the approach to the analysis work.
Data Quality Report
- To ascertain how good a dataset is, we can evaluate the dataset with a data quality report.
- Use a Pandas DataFrame to calculate the required statistics, as sketched below.
- See src01 for a sample script.
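The slides point to src01 for the actual script; the following is only a minimal sketch of the kind of per-column statistics a data quality report typically collects with pandas (the students.csv file name and column choices are assumptions).

```python
import pandas as pd

# Hypothetical dataset; the course's src01 may differ.
df = pd.read_csv("students.csv")

report = pd.DataFrame({
    "dtype": df.dtypes,                      # data type of each column
    "count": df.count(),                     # non-missing values
    "missing": df.isna().sum(),              # number of missing values
    "missing_pct": df.isna().mean() * 100,   # percentage of missing values
    "unique": df.nunique(),                  # cardinality
})

# Add basic descriptive statistics for the numeric columns only.
numeric_stats = df.describe().T[["mean", "std", "min", "25%", "50%", "75%", "max"]]
report = report.join(numeric_stats)

print(report)
```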
More About Exploratory Data Analysis and Descriptive Analytics
Three main phases of descriptive analytics:
- Univariate analysis – Investigate the properties of each single variable in the dataset.
- Bivariate analysis:
  - Measure the intensity of the relationship existing between pairs of variables.
  - For supervised learning models, the focus is on the relationships between the predictive variables and the target variable.
- Multivariate analysis – Investigate the relationships holding within a subset of variables.
More About Exploratory Data Analysis and Descriptive Analytics (cont.)
Visual illustration of the differences among the three phases:
[Figure: univariate analysis examines each of Variable A, B, C and the target variable on its own; bivariate analysis examines pairs of variables, in particular each variable paired with the target variable (important for supervised learning models); multivariate analysis examines Variables A, B, C and the target variable together.]
Univariate Analysis
- Study the behaviour of each variable as an entity independent of the other variables in the dataset.
- Three characteristics that are of interest:
  - Location – Tendency of the values of a given variable to arrange themselves around a specific central value.
  - Dispersion – Propensity of the variable to assume a more or less wide range of values.
  - The underlying probability distribution.
- Objectives of univariate analysis:
  - Verify the validity of assumptions regarding the variable's distribution:
    - E.g., values of a variable are exactly or approximately normal.
    - Regression assumes that variables have normal distributions.
Univariate Analysis (cont.)
- Verify the information content of each variable:
  - E.g., a variable that assumes the same value for 95% of the available records may be considered fairly constant for practical purposes.
  - The information it provides is of little value.
- Identify outliers – anomalies and non-standard values.
For the remaining discussion, suppose the dataset D contains m records and denote by a_j the generic variable being analysed:
- The subscript j will be suppressed for clarity.
- The vector (x_{1j}, x_{2j}, ..., x_{mj}) of m records corresponding to a_j will simply be denoted by:
  a = (x_1, x_2, ..., x_m)
Measures of Location
Mean:
- The sample arithmetic mean is defined as:
  µ = (x_1 + x_2 + ... + x_m) / m = (1/m) Σ_{i=1}^{m} x_i
- The sum of the differences between each value and the sample mean, i.e., the deviation or spread, is equal to 0:
  Σ_{i=1}^{m} (x_i − µ) = 0
- It is also the value c that minimises the sum of squared deviations:
  Σ_{i=1}^{m} (x_i − µ)² = min_c Σ_{i=1}^{m} (x_i − c)²
Measures of Location (cont.)
Weighted sample mean:
- At times, each value x_i is associated with a numerical coefficient w_i, i.e., the weight of that value:
  µ = (w_1 x_1 + w_2 x_2 + ... + w_m x_m) / (w_1 + w_2 + ... + w_m) = Σ_{i=1}^{m} w_i x_i / Σ_{i=1}^{m} w_i
- Example:
  - x_i is the unit sale price of the i-th shipping lot consisting of w_i product units.
  - The average sale price would be the weighted sample mean.
Measures of Location (cont.)
Median:
- Suppose x_1, x_2, ..., x_m are m observations arranged in a non-decreasing way.
- If m is an odd number, the median is the observation occupying position (m + 1)/2, i.e., x_med = x_{(m+1)/2}.
- If m is an even number, the median is the middle point of the interval between the observations at positions m/2 and (m + 2)/2.
- Example of m as an odd number:
  - -2, 0, 5, 7, 10
  - Median is 5.
- Example of m as an even number:
  - 50, 85, 100, 102, 110, 200
  - Median is (100 + 102)/2 = 101, i.e., using the 3rd and 4th observations.
Measures of Location (cont.)
Mode:
- The value that corresponds to the peak of the empirical density curve for the variable a.
  - I.e., the value that appears most often in the dataset for variable a.
- If the empirical density curve has been calculated by partitioning continuous values into intervals, each value of the interval that corresponds to the maximum empirical frequency is the mode.
(Source: Science Buddies)
Measures of Location (cont.)
Midrange:
- The midpoint of the interval between the minimum value and the maximum value:
  x_midr = (x_max + x_min) / 2, where x_max = max_i x_i and x_min = min_i x_i
- Not robust with respect to extreme values, i.e., the presence of outliers (just like the mean).
- Example:
  - 50, 85, 100, 102, 110, 200
  - x_midr = (200 + 50) / 2 = 125
  - µ = 107.8333
Measures of Location (cont.)
Geometric mean:
- The m-th root of the product of the m observations of the variable a:
  µ_geom = (x_1 · x_2 · ... · x_m)^(1/m) = (Π_{i=1}^{m} x_i)^(1/m)
- Example (Source: University of Toronto Mathematics Network):
  - Suppose you have an investment which earns 10% the first year, 60% the second year, and 20% the third year. What is its average rate of return?
    - After one year, a $1 investment becomes $1.10.
    - After two years, this investment becomes (1.10 + 0.6 × 1.10) = $1.76.
    - After three years, we have (1.76 + 0.2 × 1.76) = $2.112.
  - By what constant factor would your investment need to be multiplied each year in order to achieve the same effect as multiplying by 1.10 one year, 1.60 the next, and 1.20 the third?
Measures of Location (cont.)
- (1.1 × 1.6 × 1.2)^(1/3) ≈ 1.2833, and 1.2833³ ≈ 2.112.
- The average rate of return is about 28% (not 30%, which is what the arithmetic mean of 10%, 60% and 20% would give).
- The arithmetic mean is calculated for observations which are independent of each other.
- The geometric mean is used to calculate the mean for observations that are dependent on each other.
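To make these location measures concrete, here is a small sketch (not one of the course scripts) computing them with pandas, NumPy and SciPy; the weights are made up, and scipy.stats.gmean is used for the geometric mean.

```python
import numpy as np
import pandas as pd
from scipy import stats

x = pd.Series([50, 85, 100, 102, 110, 200])   # sample used on the slides
w = np.array([10, 20, 30, 25, 10, 5])         # hypothetical weights

mean = x.mean()                                # sample arithmetic mean
weighted_mean = np.average(x, weights=w)       # weighted sample mean
median = x.median()                            # (100 + 102) / 2 = 101
mode = x.mode().iloc[0]                        # most frequent value (all unique here, so the smallest)
midrange = (x.max() + x.min()) / 2             # (200 + 50) / 2 = 125

growth = [1.10, 1.60, 1.20]                    # yearly growth factors from the example
geometric_mean = stats.gmean(growth)           # ≈ 1.2833

print(mean, weighted_mean, median, mode, midrange, geometric_mean)
```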
Measures of Dispersion
- Location measures provide an indication of the central part of the observed values for a numerical variable.
- We need measures of dispersion to represent the variability expressed by the observations w.r.t. the central value.
- The two empirical density curves below both have mean, median and mode equal to 50.
[Figure: two empirical density curves centred at 50 – one narrow (low dispersion) and one wide (high dispersion).]
Measures of Dispersion (cont.)
- In most applications, it is desirable to have data with small dispersion (the narrow curve on the left).
  - E.g., in a manufacturing process, a wide variation in a critical measure might point to an undesirable level of defective items.
- In some applications, a higher dispersion may be desired, for example, for the purpose of classification or discriminating between classes.
  - E.g., for test results intended to discriminate between the abilities of candidates, it is desirable to have a fairly wide spectrum of values.
Measures of Dispersion (cont.)
Range:
- The simplest measure of dispersion:
  x_range = x_max − x_min, where x_max = max_i x_i and x_min = min_i x_i
- Useful to identify the interval in which the values of a variable fall.
- Unable to capture the actual dispersion of the values:
  - The densities below have the same range (5 to 10), but the dispersion of the density on the right is greater than that on the left.
[Figure: two density curves over the same range 5–10 with different dispersion.]
Measures of Dispersion (cont.)
Mean absolute deviation (MAD):
- The deviation, or spread, of a value is defined as its signed difference from the sample arithmetic mean:
  s_i = x_i − µ, i ∈ M, and Σ_{i=1}^{m} s_i = 0
- MAD measures the dispersion of the observations around their sample mean through the sum of the absolute values of the spreads:
  MAD = (1/m) Σ_{i=1}^{m} |s_i| = (1/m) Σ_{i=1}^{m} |x_i − µ|
- The lower the MAD, the more the values fall in proximity of their sample mean and the lower the dispersion is.
Measures of Dispersion (cont.)
Variance:
- More widely used than MAD.
- The sample variance is defined as:
  σ² = (1/(m − 1)) Σ_{i=1}^{m} s_i² = (1/(m − 1)) Σ_{i=1}^{m} (x_i − µ)²
- A lower sample variance implies a lower dispersion of the values around the sample mean.
- As the size of the sample increases:
  - The sample mean µ approximates the population mean.
  - The sample variance σ² approximates the population variance.
Measures of Dispersion (cont.)
- To bring the measure of dispersion back to the original scale in which the observations are expressed, the sample standard deviation is defined as:
  σ = √(σ²)
Measures of Dispersion (cont.)
Normal distribution:
- If the distribution of variable a is normal or approximately normal:
  - The interval (µ ± σ) contains approximately 68% of the observed values.
  - The interval (µ ± 2σ) contains approximately 95% of the observed values.
  - The interval (µ ± 3σ) contains approximately 99.7% (almost 100%) of the observed values.
- Values that fall outside (µ ± 3σ) can be considered suspicious outliers.
Measures of Dispersion (cont.)
Coefficient of variation:
- Defined as the ratio between the sample standard deviation and the sample mean, expressed in percentage terms:
  CV = 100 · σ / µ
- Provides a relative measure of dispersion.
- Used to compare two or more groups of data, usually obtained from different distributions.
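A minimal sketch (not one of the course scripts) of these dispersion measures on the same assumed sample; note that pandas uses the m − 1 denominator for var() and std() by default, matching the sample variance above.

```python
import pandas as pd

x = pd.Series([50, 85, 100, 102, 110, 200])

value_range = x.max() - x.min()            # range
mad = (x - x.mean()).abs().mean()          # mean absolute deviation
variance = x.var()                         # sample variance (denominator m - 1)
std_dev = x.std()                          # sample standard deviation
cv = 100 * std_dev / x.mean()              # coefficient of variation (%)

print(value_range, mad, variance, std_dev, cv)
```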
Relative Measures of Dispersion
- Relative measures of dispersion are used to examine the localisation of a value w.r.t. the other values in the sample.
- Quantiles:
  - Suppose we arrange the m values {x_1, x_2, ..., x_m} of a variable a in non-decreasing order.
  - Given any value p, with 0 ≤ p ≤ 1, the p-order quantile is the value q_p such that pm observations fall to the left of q_p and the remaining (1 − p)m to its right.
  - Sometimes, p-order quantiles are called 100p-th percentiles.
  - The 0.5-order quantile coincides with the median.
  - q_L = 0.25-order quantile, also called the lower quartile.
  - q_U = 0.75-order quantile, also called the upper quartile.
Relative Measures of Dispersion (cont.)
- The interquartile range (IQR) is defined as the difference between the upper and lower quartiles:
  D_q = q_U − q_L = q_{0.75} − q_{0.25}
(Source: http://www.cdc.gov/osels/scientific_edu/ss1978/Lesson2/Section7.html)
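A short sketch (assumed sample, not one of the course scripts) of quantiles and the IQR with pandas:

```python
import pandas as pd

x = pd.Series([50, 85, 100, 102, 110, 200])

q_l = x.quantile(0.25)     # lower quartile
median = x.quantile(0.50)  # median
q_u = x.quantile(0.75)     # upper quartile
iqr = q_u - q_l            # interquartile range D_q

print(q_l, median, q_u, iqr)
```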
Identification of Outliers for Numerical Variables
z-index:
- Calculated as z_i = (x_i − x̄) / σ.
- Identifies outliers in most cases.
- Data values which are more than 3 standard deviations away from the sample mean are considered suspicious:
  - |z_i| > 3 – Suspicious.
  - |z_i| >> 3 – Highly suspicious.
Box plots or box-and-whisker plots:
- The median and the lower and upper quartiles are represented on the axis where the observations are placed:
  - The length of the box represents the interquartile range (the distance between the 25th and the 75th percentiles).
Identification of Outliers for Numerical Variables (cont.)
- A dot in the box interior represents the mean.
- A horizontal line in the box interior represents the median.
- Vertical lines issuing from the box extend to the minimum and maximum values of the analysis variable.
- At the end of these lines are the whiskers.
- An observation is identified as an outlier if it falls outside four thresholds (values from the slide's example):
  - External lower edge = q_L − 3·D_q = −101.5
  - Internal lower edge = q_L − 1.5·D_q = −34.75
  - Internal upper edge = q_U + 1.5·D_q = 143.25
  - External upper edge = q_U + 3·D_q = 210
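Below is a small sketch (assumed data, not one of the course's srcNN scripts) of both outlier rules: the z-index and the box-plot fences built from the quartiles.

```python
import pandas as pd

x = pd.Series([50, 85, 100, 102, 110, 200])

# z-index rule: flag values more than 3 standard deviations from the mean.
z = (x - x.mean()) / x.std()
z_outliers = x[z.abs() > 3]

# Box-plot rule: flag values outside the internal edges q_L - 1.5*IQR and q_U + 1.5*IQR.
q_l, q_u = x.quantile(0.25), x.quantile(0.75)
iqr = q_u - q_l
fence_outliers = x[(x < q_l - 1.5 * iqr) | (x > q_u + 1.5 * iqr)]

print(z_outliers)
print(fence_outliers)
```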
Measures of Shape
- Skewness refers to the lack of symmetry in the distribution of the data values.
  - The distribution here is the empirical density of relative frequencies, calculated by dividing each raw frequency by m, i.e., the number of records.
- Negatively (left) skewed: Mean < Median < Mode.
- Positively (right) skewed: Mode < Median < Mean.
Measures of Shape (cont.)
An index of asymmetry, the sample skewness, based on the third sample moment, may be used to measure shape:
- The third sample moment is defined as:
  µ₃ = (1/m) Σ_{i=1}^{m} (x_i − µ)³
- The sample skewness is defined as:
  I_skew = µ₃ / σ³
- Sample skewness is interpreted as follows:
  - If the density curve is symmetric, then I_skew = 0.
  - If the density curve is skewed to the right, then I_skew > 0.
  - If the density curve is skewed to the left, then I_skew < 0.
Measures of Shape (cont.)
Kurtosis:
- The kurtosis index expresses the level of approximation of an empirical density to the normal curve, using the fourth sample moment defined as:
  µ₄ = (1/m) Σ_{i=1}^{m} (x_i − µ)⁴
- The kurtosis is defined as:
  I_kurt = µ₄ / σ⁴ − 3
  (the −3 term makes a perfectly normal density give I_kurt = 0, consistent with the interpretation on the next slide).
Analysis of the Empirical Density (cont.)
Kurtosis is interpreted as follows:
- If the empirical density perfectly fits a normal density, then I_kurt = 0.
- If the empirical density is hyponormal, it shows greater dispersion than the normal density, i.e., it assigns lower frequencies to values close to the mean: I_kurt < 0.
- If the empirical density is hypernormal, it shows lower dispersion than the normal density, i.e., it assigns higher frequencies to values close to the mean: I_kurt > 0.
(Source: Vercellis 2009, p. 133)
Descriptive Statistics
- DataFrame.describe() can generate the descriptive statistics for each Series (i.e., column) in the DataFrame:
  - This works only for numeric columns.
  - It summarises the central tendency, dispersion and shape of a dataset's distribution, excluding missing values.
- DataFrame.skew() can generate the skewness index.
- DataFrame.kurtosis() can generate the kurtosis.
- We have discussed the histogram earlier; the empirical density can be generated with DataFrame.plot.density().
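A minimal sketch of these calls; the grades.csv file name and the hours column are assumptions, and src02 on the next slide is the actual course script.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("grades.csv")          # hypothetical dataset with an 'hours' column

print(df.describe())                    # count, mean, std, min, quartiles, max per numeric column
print(df.skew(numeric_only=True))       # sample skewness per numeric column
print(df.kurtosis(numeric_only=True))   # kurtosis per numeric column

df["hours"].plot.density()              # empirical density curve for hours
plt.show()
```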
Descriptive Statistics (cont.)
See src02 for the sample script.
[Figure: empirical density curve for hours.]
Measures of Association
- Summary indicators may be used to express the nature and intensity of the relationship between numerical variables.
- Covariance:
  - Given a pair of variables a_j and a_k, let µ_j and µ_k be the corresponding sample means.
  - The sample covariance is defined as:
    v_jk = cov(a_j, a_k) = (1/(m − 2)) Σ_{i=1}^{m} (x_ij − µ_j)(x_ik − µ_k)
- Concordance:
  - Values of a_j greater (lower) than the mean µ_j are associated with values of a_k also greater (lower) than the mean µ_k.
Measures of Association (cont.)
- Discordance:
  - Values of a_j greater (lower) than the mean µ_j are associated with values of a_k lower (greater) than the mean µ_k.
- In the concordant case, the two factors of each product in the summation agree in sign and provide a positive contribution to the sum; in the discordant case they disagree in sign and provide a negative contribution to the sum.
- Positive (negative) covariance values indicate that the variables a_j and a_k are concordant (discordant).
- Limitation of covariance:
  - Covariance is a number on a scale that depends on the variables, and thus it is inadequate to assess the intensity of the relationship.
  - Correlation is more useful.
Measures of Association (cont.)
Correlation:
- Correlation refers to the strength and direction of the relationship between two numerical variables.
- Interpreting the correlation value:
  - 1 means a strong correlation in the positive direction.
  - -1 means a strong correlation in the negative direction.
  - 0 means no correlation.
- Some machine learning algorithms do not work optimally due to multicollinearity:
  - Multicollinearity refers to the presence of correlations among the independent variables.
  - It is important to review all pairwise correlations among the independent variables.
Measures of Association (cont.)
Pearson R correlation:
- The most common measure of the relationship between two numerical variables which are linearly related and appear to be normally distributed.
- Defined as:
  r_jk = corr(a_j, a_k) = v_jk / (σ_j σ_k), where σ_j and σ_k are the sample standard deviations of a_j and a_k.
- r_jk always lies in the interval [-1, 1].
- r_jk represents a relative index expressing the intensity of a possible linear relationship between the variables a_j and a_k.
Measures of Association (cont.)
The main properties of the Pearson coefficient:
- If r_jk > 0, the attributes are concordant and the pairs of observations will tend to lie on a line with positive slope.
- If r_jk < 0, the attributes are discordant and the pairs of observations will tend to lie on a line with negative slope.
- If r_jk = 0 or r_jk ≈ 0, no linear relationship exists.
Measures of Association (cont.)
Kendall rank correlation:
- A non-parametric test to determine the strength and direction of the relationship between two numeric variables that follow a distribution other than the normal distribution.
Spearman rank correlation:
- Also a non-parametric test.
- Does not make any assumptions about the distribution of the data.
- Most suitable for ordinal data (i.e., categorical).
Measures of Association (cont.)
- DataFrame.corr() can be used to compute the pairwise correlation of columns, excluding null values.
- The following three methods are supported:
  - pearson – Standard correlation coefficient.
  - kendall – Kendall correlation coefficient.
  - spearman – Spearman rank correlation.
- See src03 for the scatter plot of grade against hours.
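A minimal sketch of DataFrame.corr() with the three supported methods; the grades.csv file and its hours/grade columns are assumptions, and src03 is the actual course script.

```python
import pandas as pd

df = pd.read_csv("grades.csv")                         # hypothetical dataset

print(df.corr(method="pearson", numeric_only=True))    # linear correlation
print(df.corr(method="kendall", numeric_only=True))    # Kendall rank correlation
print(df.corr(method="spearman", numeric_only=True))   # Spearman rank correlation

# Scatter plot of grade against hours.
df.plot.scatter(x="hours", y="grade")
```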
Lecture 2
Data Visualization and Predictive Analytics
Corporate Gurukul – Data Analytics using Deep Learning
June 2023
Lecturer: A/P TAN Wee Kek
Email: tanwk@comp.nus.edu.sg :: Tel: 6516 6731 :: Office: COM3-02-35
Learning Objectives
At the end of this lecture, you should understand:
- Data visualisation.
- More predictive analytics and machine learning.
- Different types of data mining patterns.
Matplotlib Architecture
- Matplotlib provides a set of functions and tools that allow the representation and manipulation of a Figure (the main object) and its associated internal objects.
- Other than graphics, Matplotlib also handles events and graphics animation.
- Thus, Matplotlib can produce interactive charts that respond to events triggered by the keyboard or mouse movement.
- The architecture of Matplotlib is structured into three layers with unidirectional communication:
  - Each layer can communicate with the underlying layer only.
Matplotlib Architecture (cont.)
- Backend layer – The Matplotlib API and a set of classes to represent the graphic elements.
- Artist layer – An intermediate layer representing all the elements that make up a chart, e.g., title, axis, labels, markers, etc.
- Scripting layer – Consists of the pyplot interface for actual data calculation, analysis and visualisation.
Matplotlib Architecture (cont.)
[Figure: the three main artist objects in the hierarchy of the Artist layer.]

Matplotlib Architecture (cont.)
[Figure: each instance of a chart corresponds to an instance of Artist structured in a hierarchy.]
pyplot
- The pyplot module is a collection of command-style functions that allow the data analyst to operate on or make changes to the Figure.
  - E.g., create a new Figure.
- A simple interactive chart can be created using the pyplot object's plot() function.
- The plot() function uses a default configuration that does not have a title, axis labels, legend, etc.
- See src01 for the sample script.
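A minimal pyplot sketch along these lines (the data values are made up; src01 is the actual course script):

```python
import matplotlib.pyplot as plt

# Plot made-up y-values against their indices with the default configuration.
plt.plot([1, 4, 9, 16, 25])
plt.show()
```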
pyplot (cont.)
- The default configuration can be changed to obtain the desired chart, e.g., by adding a title, axis labels and a legend.
- See src02 for the sample script.
Data Visualisation – Line Plot
- To create a simple line plot from data in a Pandas DataFrame, we can use DataFrame.plot(), which relies on matplotlib:
  - By default, you will get a line plot.
- See src03 for the sample script.
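A one-line sketch of the default DataFrame line plot (the DataFrame contents are made up; src03 is the actual course script):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"hours": [1, 2, 3, 4, 5], "grade": [50, 60, 70, 80, 85]})
df.plot()          # line plot of each numeric column against the index
plt.show()
```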
Data Visualisation – Line Plot (cont.)
- To customise the line plot, we need to use matplotlib directly.
- See src04 for the sample script.
Data Visualisation – Bar Plot
- To create a bar plot, we need to set the kind attribute in DataFrame.plot() to "bar".
- By default, the index of the DataFrame is a zero-based number, and thus the x-axis shows the numbers 0 to 4.
- We can change the index of the DataFrame to the Name column and replot the bar plot.
- See the sample script in src05.
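A sketch of both variants (the names and grades are made up; src05 is the actual course script):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"Name": ["Ann", "Ben", "Cat", "Dan", "Eve"],
                   "Grade": [72, 85, 64, 90, 78]})

df.plot(kind="bar", y="Grade")            # x-axis shows the default index 0 to 4
df.set_index("Name").plot(kind="bar")     # x-axis shows the names instead
plt.show()
```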
Data Visualisation – Box Plot
- A box plot can be created using DataFrame.boxplot() and specifying the required column, in this case "Grade".
- See src06 for the sample script.
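A sketch of the call (made-up grades; src06 is the actual course script):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"Grade": [72, 85, 64, 90, 78, 88, 95, 70, 81, 99]})
df.boxplot(column="Grade")    # box extends from Q1 to Q3 with a line at the median
plt.show()
```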
Data Visualisation – Box Plot (cont.)
A box plot is a method for graphically depicting groups of numerical data through their quartiles:
- The box extends from the Q1 to Q3 quartile values of the data, with a line at the median (Q2).
- The whiskers extend from the edges of the box to show the range of the data.
- The position of the whiskers can represent several values:
  - If there are no outliers, the whiskers show the minimum and maximum values of the data.
  - If there are outliers, the whiskers show the:
    - Lowest datum still within 1.5 IQR of the lower quartile.
    - Highest datum still within 1.5 IQR of the upper quartile.
  - IQR is the interquartile range, IQR = Q3 − Q1.
  - Outlier points are those past the end of the whiskers.
Data Visualisation – Box Plot (cont.)
Example from src06:
- IQR = 18.
- Outliers would fall outside ±1.5 × IQR from Q1 and Q3, i.e., below 50 or above 122.
- The min and max values are within this range.
- In this example, the whiskers show the minimum and maximum values since there is no outlier.
Data Visualisation – Box Plot (cont.)
Example from src07 – adding one extra observation with a Grade of 130 introduces an outlier:
- IQR = 20.75.
- Outliers would fall outside ±1.5 × IQR from Q1 and Q3, i.e., below 46.125 or above 129.125.
- The max value (130) is outside this range.
- In this example, the whiskers show the lowest datum still within 1.5 IQR of the lower quartile (i.e., 76), and the highest datum still within 1.5 IQR of the upper quartile (i.e., 99).
Data Visualisation – Box Plot (cont.)
- We can create a box plot with categorisation, in this case by Gender.
- We can also adjust the y-axis so that it runs from 0 to 100.
- See src08 for the sample script.
[Figure: box plots with categorisation by Gender; the one on the right has the y-axis limit set to 0–100.]
Data Visualisation – Histogram
- A histogram can be created using DataFrame.hist().
- By default, histograms for all columns with numeric values will be generated.
- To generate a histogram for a particular column, you can specify the name of the required column in the column attribute, e.g., df.hist(column='hours').
- See src09 for the sample script.
Data Visualisation – Histogram (cont.)
- We can also generate a histogram for a numerical column grouped by another column, typically a categorical column, using the by attribute.
- See src09 for a sample script grouped by "gender".
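A sketch of both calls (assumed DataFrame with hours and gender columns; src09 is the actual course script):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"hours": [2, 5, 7, 3, 8, 6, 4, 9],
                   "gender": ["F", "M", "F", "M", "F", "M", "F", "M"]})

df.hist(column="hours")                 # histogram for a single column
df.hist(column="hours", by="gender")    # one histogram per gender group
plt.show()
```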
Data Visualisation – Pie Chart
- A pie chart can be created using pyplot.pie().
- Customisations can be made to the pie chart to make it more meaningful.
- See src10 for the sample script.
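A sketch of pyplot.pie() with a few common customisations (the category names and counts are made up; src10 is the actual course script):

```python
import matplotlib.pyplot as plt

counts = [45, 30, 25]                  # made-up category sizes
labels = ["mammal", "bird", "fish"]

plt.pie(counts, labels=labels, autopct="%1.1f%%", startangle=90)
plt.title("Share of animal types")
plt.show()
```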
Data Visualisation – Scatter Plot
- A scatter plot can be created using pyplot.scatter().
- See src11 for the sample script.
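A sketch of pyplot.scatter() (made-up hours/grade values; src11 is the actual course script):

```python
import matplotlib.pyplot as plt

hours = [1, 2, 3, 4, 5, 6]
grade = [52, 58, 65, 74, 80, 88]

plt.scatter(hours, grade)
plt.xlabel("Hours")
plt.ylabel("Grade")
plt.show()
```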
Data Visualisation – Scatter Plots of the Iris Flower Dataset
- This dataset contains data from three different species of iris (Iris setosa, Iris virginica and Iris versicolor).
- The variables include the length and width of the sepals, and the length and width of the petals.
- This dataset is widely used for classification problems.
- 150 observations with 4 independent attributes and one target attribute.
- See src12 for the sample script.
Data Visualisation – Scatter Plots of the Iris Flower Dataset (cont.)
Which variables are better for predicting iris species?
- src13 – Scatter plot of sepal sizes.
- src14 – Scatter plot of petal sizes.
Cross Tabulation
What is a crosstab?
- A crosstab is a table showing the relationship between two or more categorical variables.
- It compares the results for one or more variables with the results of another variable.
Cross Tabulation (cont.)
In Pandas:
- By default, a crosstab computes a frequency table of the variables.
- To override the default, we can pass in an array of values and provide a suitable NumPy aggregation function.
- Refer to src15 for some examples with the automobiles dataset.
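A sketch of pd.crosstab() in both modes; the automobiles column names used here (make, body_style, price) are assumptions, and src15 is the actual course script.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"make": ["toyota", "toyota", "honda", "honda", "bmw"],
                   "body_style": ["sedan", "hatchback", "sedan", "sedan", "sedan"],
                   "price": [20000, 18000, 22000, 21000, 45000]})

# Default: frequency table of make versus body style.
print(pd.crosstab(df["make"], df["body_style"]))

# Override: aggregate a value column with a NumPy function instead of counting.
print(pd.crosstab(df["make"], df["body_style"], values=df["price"], aggfunc=np.mean))
```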
Pivot Table
- A pivot table is generally similar to a crosstab, but there are important differences.
- Data type:
  - A crosstab is used for categorical variables only, whereas a pivot table works with both categorical and numerical variables.
- Display format:
  - A crosstab analyses the relationship between two variables, whereas a pivot table works with more than two variables.
  - A pivot table is generally more flexible than a crosstab.
- Aggregation:
  - A pivot table goes beyond raw frequency and can perform statistical calculations.
Pivot Table (cont.)
- In Pandas, pivot table and crosstab are very similar.
- Refer to src16 for some examples with the automobiles dataset.
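A sketch of pd.pivot_table() on the same assumed automobile columns (src16 is the actual course script):

```python
import pandas as pd

df = pd.DataFrame({"make": ["toyota", "toyota", "honda", "honda", "bmw"],
                   "body_style": ["sedan", "hatchback", "sedan", "sedan", "sedan"],
                   "price": [20000, 18000, 22000, 21000, 45000]})

# Mean price by make and body style; aggfunc could also be "median", "max", etc.
print(pd.pivot_table(df, values="price", index="make",
                     columns="body_style", aggfunc="mean"))
```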
More About Machine Learning
- Recall that machine learning employs statistical and mathematical techniques to build a model based on sample data.
- The objective is to identify patterns among variables in the data, i.e., data mining.
- Data mining involves three main patterns:
  - Prediction: classification and regression.
  - Segmentation.
  - Association.
More About Data Mining
Prediction:
- Forecast the outcome of a future event or unknown phenomenon.
- Intuitively, we can think of prediction as learning an A → B mapping, where A is the input and B is the output.
- Classification – Predict the weather outlook:
  - A are weather data such as temperature and humidity.
  - B is a class label representing the weather outlook such as "Rain" or "No Rain".
- Regression – Predict rainfall:
  - A are weather data such as temperature and humidity.
  - B is a real number representing the amount of rainfall in millimetres.
More About Data Mining (cont.)
Segmentation:
- Partition a collection of things (e.g., objects, events) in a dataset into natural groupings.
- Clustering:
  - Create groups so that the members within each group have maximum similarity.
  - Members across groups have minimum similarity.
- Examples:
  - Segment daily temperature data into hot days or cold days.
  - Segment customers based on their demographics and past purchase behaviours.
More About Data Mining (cont.)
Association:
- Discover interesting relationships among variables in a large database.
- Market basket analysis:
  - Discover regularities among products in large-scale transactions recorded by point-of-sale systems in supermarkets:
    - I.e., each product purchased in a transaction being a variable.
    - Identify products that are commonly purchased together, e.g., beer and diapers.
More About Data Mining (cont.)
- Data mining tasks to extract the different types of patterns rely on learning algorithms.
- Learning algorithms can be classified according to the way patterns are extracted from historical data.
- Supervised learning methods:
  - Training data include both the independent variables and the dependent variable.
- Unsupervised learning methods:
  - Training data include only the independent variables.
More About Data Mining (cont.)
- Prediction involves an A → B mapping in which we know the B to help determine the pattern.
  - This is known as supervised learning.
- Association and segmentation involve just the A without the B.
  - We determine the pattern without the help of the B.
  - This is known as unsupervised learning.
(Source: Sharda et al. (2020) – Analytics, Data Science, & Artificial Intelligence: Systems for Decision Support, p. 206, Figure 4.2)
Hands-on with Supervised Learning versus Unsupervised Learning (cont.)
The Zoo animals dataset consists of 101 animals and 18 variables:

  Variable                                        | Variable Type                     | Data Type   | Description
  animal                                          | Identifier                        | Text        | Name of the zoo animal
  type                                            | Dependent variable or class label | Categorical | Type of the zoo animal; this is the B; 7 types – amphibian, bird, fish, insect, invertebrate, mammal, and reptile
  hair, feathers, eggs, milk, airborne, aquatic, predator, toothed, backbone, breathes, venomous, fins, tail, domestic, catsize | Independent variables | Boolean | Various characteristics of the zoo animal; these are the A; 15 such attributes
  legs                                            | Independent variable              | Numeric     | Number of legs; this is also part of the A
Hands-on with Supervised Learning versus Unsupervised Learning (cont.)
[Figure: sample rows of the Zoo dataset with the identifier column, the independent variables and the dependent variable highlighted.]
Classifying Zoo Animals
- Use a decision tree classifier to learn the function or mapping between the characteristics of animals and their membership of each type.
- This is a supervised learning process.
[Figure: fragment of a decision tree – "Produces milk?" Yes → mammal; No → "Has feathers?" Yes → bird; No → further tests (???).]
Classifying Zoo Animals (cont.)
Use a common two-step methodology known as split validation:
- Model training followed by model testing.
- In this case, we are using a simple split validation.
- 70% of the sample is used to train the model and derive a decision tree (72 of the 101 animals).
- 30% of the sample is used to test the model (29 of the 101 animals).
Classifying Zoo Animals (cont.)
Compute the predictive accuracy, i.e., the model's ability to correctly predict the class label of new or previously unseen data:

  Actual Animal Type | Predicted Animal Type | Number Correct
  amphibian          | amphibian             | a
  bird               | bird                  | b
  ......             | ......                | ......
  reptile            | reptile               | g

- Testing accuracy = (a + b + ...... + g) / 101.
- In this example, the predictive accuracy is about 90%.
Classifying Zoo Animals (cont.)
Actual testing accuracy may vary as we are using random stratified sampling to perform the split validation:
- Example:
  - 10% of all animals are invertebrates.
  - About 10% of the animals in each of the training dataset and the testing dataset would be invertebrates.
  - Of the 101 animals: 72 in training (7 invertebrates) and 29 in testing (3 invertebrates).
Classifying Zoo Animals (cont.)
- Refer to sample source file src17 for the example.
[Figure: confusion matrix for training and testing, and the decision tree for classifying zoo animals.]
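The course example lives in src17; the following is only a sketch of how such a split validation could look with scikit-learn (the zoo.csv file name and column handling are assumptions).

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

df = pd.read_csv("zoo.csv")                  # hypothetical file with the Zoo dataset
X = df.drop(columns=["animal", "type"])      # the A: 16 descriptive variables
y = df["type"]                               # the B: class label

# 70/30 split validation with stratified sampling on the class label.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(accuracy_score(y_test, y_pred))        # testing accuracy
print(confusion_matrix(y_test, y_pred))      # confusion matrix on the test set
```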
Clustering Zoo Animals
- Use the k-means clustering algorithm to segment the zoo animals into natural groupings.
- This is an unsupervised learning process.
- The class attribute, i.e., animal type, is NOT used by the algorithm.
- The clustering process groups animals together using some statistical measure of distance:
  - The distance between animals is calculated using the descriptive variables, i.e., the A.
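A sketch of the same idea with scikit-learn's KMeans; src18 and src19 are the actual course scripts, and the zoo.csv file and the choice of k = 7 are assumptions.

```python
import pandas as pd
from sklearn.cluster import KMeans

df = pd.read_csv("zoo.csv")                      # hypothetical file with the Zoo dataset
X = df.drop(columns=["animal", "type"])          # only the descriptive variables (the A)

kmeans = KMeans(n_clusters=7, n_init=10, random_state=42)
df["cluster"] = kmeans.fit_predict(X)

# The type column is not used for clustering, only to inspect the groupings afterwards.
print(pd.crosstab(df["cluster"], df["type"]))
```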
Clustering Zoo Animals (cont.)
- Animals that are closer to each other (i.e., shorter distance) are grouped together.
- Therefore, animals in different groups are further apart (i.e., greater distance).
- Example clusters:
  - Cluster 1 – Fish: bass, carp, catfish.
  - Cluster 4 – Amphibian: frog, newt, toad.
  - Cluster 5 – Bird: flamingo, duck, crow, dove.
Clustering Zoo Animals (cont.)
- We cannot compute the predictive accuracy because we are NOT supposed to know the actual class attribute.
- The quality of a clustering model is evaluated using other statistical measures.
- Refer to sample code src18 and src19 for the example.
- Observe that zoo animals of the same type are generally segmented into the same cluster.
- We do NOT observe a case in which zoo animals of the same type are randomly segmented into different clusters.
Making Sense of Zoo Animals
- Our general conclusion is that unsupervised learning is useful in the real world.
- Without knowing the actual animal type, our clustering model is able to segment the zoo animals into natural groupings that are close to the actual types.
- In real-world ML problem solving, we can usefully apply unsupervised learning if:
  - We do not have a B.
  - It is not possible to obtain a labelled training dataset that contains a B.
Data Mining Process
- A systematic approach to carrying out a data mining project maximises the chance of success.
- Various standardised processes based on best practices:
  - CRISP-DM (Cross-Industry Standard Process for Data Mining)
  - SEMMA (Sample, Explore, Modify, Model, and Assess)
  - KDD (Knowledge Discovery in Databases)
(Source: KDNuggets.com, August 2007)
CRISP-DM
- Proposed by a consortium of European companies in the mid-1990s as a nonproprietary standard.
- Consists of a sequence of six steps:
  - The steps are sequential in nature but usually involve backtracking.
  - The whole process is iterative and can be time consuming.
  - The outcome from each step feeds into the next step, and thus each step must be carefully conducted.
- Data mining is largely driven by experience and experimentation – is it an art or a science?

CRISP-DM (cont.)
[Figure: the six-step CRISP-DM cycle.]
Step 1 – Business Understanding
Know the purpose of the data mining study:
- Thorough understanding of the managerial need for new knowledge.
- Explicit specification of the business objective.
- Examples of specific business questions:
  - What are the common characteristics of the customers we have lost to our competitors recently?
  - What are the typical profiles of our customers, and how much value does each of them provide to us?
Step 2 – Data Understanding
- Identify the relevant data from the many available data sources – internal databases and external sources.
- Key consideration factors in identifying and selecting data:
  - A clear and concise description of the data mining task.
  - Develop an in-depth understanding of the data and variables:
    - Are there any synonymous and/or homonymous variables?
    - Are the variables independent of each other?
- Use different techniques to help understand the data:
  - Statistical techniques: statistical summaries and correlation analysis.
  - Graphical techniques: scatter plots, histograms and box plots.
Step 2 – Data Understanding (cont.)
Common data sources:
- Demographic data – E.g., income, education, number of households and age.
- Sociographic data – E.g., hobby, club membership and entertainment.
- Transactional data – E.g., sales records, credit card spending and issued cheques.
Step 3 – Data Preparation
[Figure: overview of the data preparation steps.]

Step 3 – Data Preparation (cont.)
- Prepare the data identified in Step 2 for data mining analysis, i.e., data preprocessing:
  - Consumes the most time and effort, roughly 80% of the total.
  - Four main steps to convert raw, real-world data into minable datasets.
- Data consolidation:
  - Relevant data are collected from the identified sources.
  - Required records and variables are selected and integrated.
- Data cleaning:
  - Missing data are imputed or ignored.
  - Noisy data, i.e., outliers, are identified and smoothed out.
  - Inconsistent data are handled with domain knowledge or expert opinion.
Step 3 – Data Preparation (cont.)
- Data transformation:
  - Data are transformed for better processing.
  - Normalization – Data may be normalized between certain minimum and maximum values to mitigate potential bias.
  - Discretization – Numeric variables are converted to categorical values.
  - Aggregation – A nominal variable's unique value range may be reduced to a smaller set using concept hierarchies.
- Construct new variables:
  - Derive new and more informative variables from existing ones.
  - E.g., use a single blood-type match variable instead of separate multinomial values for the blood types of both donor and recipient.
Step 3 – Data Preparation (cont.)
Data reduction:
- A dataset that is too large can also cause problems.
- Too many variables or dimensions:
  - Reduce the number of variables to a more manageable and most relevant subset.
  - Use findings from the extant literature or consult domain experts.
  - Run appropriate statistical tests such as principal component analysis.
- Too many records:
  - Processing a large number of records may not be practical or feasible.
  - Use sampling to obtain a subset of the data that reflects all relevant patterns of the complete dataset.
Step 3 – Data Preparation (cont.)
Skewed data:
- Potential bias in the analysis output.
- Oversample the less represented class or undersample the more represented class.
Step 4 – Model Building
- Various modelling techniques are selected and applied to the dataset prepared in Step 3.
- Assessment and comparative analysis of the various models:
  - There is no universally known "best" method or algorithm for a data mining task.
  - Use a variety of viable model types with a well-defined experimentation and assessment strategy to identify the "best" method for a given purpose.
- Different methods may have specific requirements for the data format, and it may be necessary to go back to Step 3.
Step 5 – Testing and Evaluation
- The models developed in Step 4 need to be evaluated for their accuracy and generality.
- Two options in general:
  - Assess the degree to which the selected model(s) meet the business objectives.
  - Test the developed model(s) in a real-world scenario if time and budget constraints permit.
- The bottom line is that business value should be obtained from the discovered knowledge patterns.
- Close interaction between data analysts, business analysts and decision makers is required.
Step 6 – Deployment
- End users need to be able to understand the knowledge gained from the data mining study and benefit from it.
- Knowledge needs to be properly organised and presented:
  - A simple approach involves generating a report.
  - A complex approach involves implementing a repeatable data mining process across the organisation.
- Deployment also includes maintenance activities:
  - Data reflecting business activities may change.
  - Models built on old data may become obsolete, irrelevant or misleading.
Data Mining Goes to Hollywood
- Decision situation – Predicting the box-office receipts (i.e., financial success) of a particular movie.
- Problem:
  - Traditional approach:
    - Frames it as a forecasting (or regression) problem.
    - Attempts to predict the point estimate of a movie's box-office receipts.
  - Sharda and Delen's approach:
    - Converts it into a multinomial classification problem.
    - Classifies a movie based on its box-office receipts into one of nine categories, ranging from "flop" to "blockbuster".
    - Uses variables representing different characteristics of a movie to train various classification models.
Data Mining Goes to Hollywood (cont.)
Proposed solution:
- Dataset consists of 2,632 movies released between 1998 and 2006:
  - Training set – 1998 to 2005.
  - Test set – 2006.
- Movie classification based on box-office receipts:
[Table of the nine box-office receipt categories not reproduced here.]
Data Mining Goes to Hollywood (cont.)
- Independent variables: [table of movie characteristics not reproduced here.]
- A variety of data mining methods were used – neural networks, decision trees, support vector machines (SVM) and three types of ensembles.

Data Mining Goes to Hollywood (cont.)
[Figure: data mining process map in IBM SPSS Modeler.]
Data Mining Goes to Hollywood (cont.)

Results:
62
CG DADL (June 2023) Lecture 2 – Data Visualization and Predictive Analytics
Data Mining Goes to Hollywood (cont.)

Performance measures:
Bingo – Percent correct classification rate.
1-Away correct classification rate – Within one category.
Among individual prediction models – SVM > ANN > CART.
Ensemble models performed better than the individual
prediction models:
Fusion algorithm was the best.
Ensemble models also have a significantly lower standard deviation
compared to individual prediction models.
63
CG DADL (June 2023) Lecture 2 – Data Visualization and Predictive Analytics
Lecture
3
Simple Linear Regression
Corporate Gurukul – Data Analytics using Deep Learning
June 2023
Lecturer: A/P TAN Wee Kek
Email: tanwk@comp.nus.edu.sg :: Tel: 6516 6731 :: Office: COM3-02-35
Learning Objectives
At the end of this lecture, you should understand:




1
Structure of regression models.
Simple linear regression.
Validation of regression models.
CG DADL (June 2023) Lecture 3 – Simple Linear Regression
Overview of Regression Analysis
In data mining, we are interested in predicting the value of a
target variable from the value of one or more
explanatory variables.


Example – Predict a child’s weight based on his/her height.


Weight is the target variable.
Height is the explanatory variable.
Regression analysis builds statistical models that
characterize relationships among numerical variables.
Two broad categories of regression models:




Cross-sectional data – Focus of this lecture.
Time-series data – Focus of subsequent lecture:

2
Independent variables are time or some function of time.
CG DADL (June 2023) Lecture 3 – Simple Linear Regression
Structure of Regression Models
Purpose of regression models is to identify functional
relationship between the target variable and a subset of
the remaining variables in the data.
Goal of regression models is twofold:




Highlight and interpret dependency of the target variable on
other variables.
Predict the future value of the target variable based upon the
functional relationship identified and future value of the
explanatory variables.
Target variable is also known as the dependent, response
or output variable.
Explanatory variables are also known as independent or
predictor variables.


3
CG DADL (June 2023) Lecture 3 – Simple Linear Regression
Structure of Regression Models (cont.)
Suppose dataset D is composed of m observations, a target
variable and n explanatory variables:






4
Explanatory variables of each observation may be represented
by a vector x_i, i ∈ M, in the n-dimensional space R^n.
Target variable is denoted by yi .
The m vectors of observations are written as a matrix X having
dimension m x n.
Target variable is written as y = ( y1 , y2 ,..., ym )
Let Y be the random variable representing the target attribute
and X j , j ∈ N , the random variables associated with the
explanatory variables.
CG DADL (June 2023) Lecture 3 – Simple Linear Regression
Structure of Regression Models (cont.)

Regression models conjecture the existence of a function
f: R^n → R that expresses the relationship between the target
variable Y and the n explanatory variables X_j:
Y = f(X_1, X_2, ..., X_n)
5
CG DADL (June 2023) Lecture 3 – Simple Linear Regression
Linear Regression Models
If we assume that the functional relationship f: R^n → R is
linear, we have linear regression models.
This assumption may be restrictive but most nonlinear
relationships may be reduced to a linear one by applying
appropriate preliminary transformation:



6
A quadratic relationship of the form
Y = b + wX + dX^2
can be linearized through the transformation Z = X^2 into a
linear relationship with two explanatory variables:
Y = b + wX + dZ
CG DADL (June 2023) Lecture 3 – Simple Linear Regression
Linear Regression Models (cont.)


7
An exponential relationship of the form
Y = e^(b + wX)
can be linearized through a logarithmic transformation Z = log Y ,
which converts it into the linear relationship:
Z = b + wX
A simple linear regression model with one explanatory variable
is of the form:
Y = α + βX + ε
CG DADL (June 2023) Lecture 3 – Simple Linear Regression
Simple Linear Regression
Bivariate linear regression models a random variable Y as
linear function of another random variable X:



Y = α + βX + ε
ε is a random variable, referred to as error, which indicates the
discrepancy between the response Y and the prediction
f ( X ) = α + βX .
When the regression coefficients are determined by minimizing
the sum of squared errors SSE, ε must follow a normal
distribution with 0 mean and standard deviation σ :
E(ε_i | X_i) = 0
var(ε_i | X_i) = σ^2
Note: Standard deviation is the square root of variance and variance is the average of the
squared differences from the mean.
8
CG DADL (June 2023) Lecture 3 – Simple Linear Regression
Simple Linear Regression (cont.)
The preceding model is known as simple linear regression
where there is only one explanatory variable.
When there are multiple explanatory variables, the model
would be a multiple linear regression model of the form:


Y = α + β1 X 1 + β 2 X 2 + ... + β n X n + ε
9
CG DADL (June 2023) Lecture 3 – Simple Linear Regression
Simple Linear Regression (cont.)

For a simple linear regression model of the form:
Y = α + βX + ε


given the data samples (x1 , y1 ), (x2 , y2 ),..., (xs , ys )
The error for the prediction is:
ε_i = y_i − f(x_i) = y_i − α − βx_i = y_i − ŷ_i
The regression coefficients α and β can be computed by
the method of least squares which minimizes the sum of
the squared errors SSE:
SSE = Σ_{i=1}^{s} ε_i^2
10
CG DADL (June 2023) Lecture 3 – Simple Linear Regression
Simple Linear Regression (cont.)

To find the regression coefficients α and β that minimize
SSE:
β = Σ_{i=1}^{S} (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^{S} (x_i − x̄)^2
α = ȳ − βx̄
x̄ = (Σ_{i=1}^{S} x_i) / S
ȳ = (Σ_{i=1}^{S} y_i) / S
11
CG DADL (June 2023) Lecture 3 – Simple Linear Regression
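A minimal numerical sketch of the least-squares formulas above, using NumPy on hypothetical sample data (the x and y values are illustrative only, not the child dataset):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

x_bar, y_bar = x.mean(), y.mean()
beta = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
alpha = y_bar - beta * x_bar
sse = np.sum((y - (alpha + beta * x)) ** 2)   # sum of squared errors of the fitted line
print(alpha, beta, sse)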
Simple Linear Regression (cont.)

Suppose we have a linear equation y = 2 + 3x in which
SSE = 0:
12
CG DADL (June 2023) Lecture 3 – Simple Linear Regression
Example of Simple Linear Regression

Predict a child’s weight based on height:





The dataset contains 19 observations.
There are four variables altogether – Name, Weight, Height
and Age.
Note that for linear regression, we are using both Scikit
Learn and StatsModels.
StatsModels provides more summary statistics as
compared to Scikit Learn.
We could also manually calculate the required statistics...
13
CG DADL (June 2023) Lecture 3 – Simple Linear Regression
Example of Simple Linear Regression
(cont.)
src01
14
CG DADL (June 2023) Lecture 3 – Simple Linear Regression
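A minimal sketch of fitting the height/weight regression with both Scikit Learn and StatsModels (not the actual src01 script; the column names and the ten-row data frame below are hypothetical stand-ins for the 19-observation dataset):

import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for the child dataset (column names assumed).
df = pd.DataFrame({
    'Height': [54, 57, 60, 62, 63, 65, 66, 68, 70, 72],
    'Weight': [53.2, 53.8, 54.5, 54.8, 55.0, 55.4, 55.6, 56.0, 56.4, 56.8],
})
X, y = df[['Height']], df['Weight']

# Scikit Learn: returns the fitted coefficients only.
sk_model = LinearRegression().fit(X, y)
print(sk_model.intercept_, sk_model.coef_)

# StatsModels: add the constant explicitly, then print the richer summary
# (R-squared, F-statistic, t-values, p-values, confidence intervals, ...).
sm_model = sm.OLS(y, sm.add_constant(X)).fit()
print(sm_model.summary())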
Example of Simple Linear Regression
(cont.)
Scikit Learn’s output
15
StatsModels’ output
CG DADL (June 2023) Lecture 3 – Simple Linear Regression
Example of Simple Linear Regression
(cont.)




The regression equation is y = 42.5701 + 0.1976x + ε
β = 0.1976: A one unit increase in height leads to an expected increase of 0.1976 unit in
weight.
α = 42.5701: When x = 0, the expected y value is 42.5701 (danger of extrapolation)
N = 19, number of observations: Most of the dots, i.e., actual (x_i, y_i) values, are close to the
fitted line.
16
CG DADL (June 2023) Lecture 3 – Simple Linear Regression
Validation of Model – Coefficient of
Determination

R-Square = R^2 = 0.7705: 77.05% of the variation in y_i is
explained by the model:
This value is the proportion of total variance explained by
the predictive variable(s):
R^2 = Model Sum of Squares / Corrected Total Sum of Squares
    = 1 − Error Sum of Squares / Corrected Total Sum of Squares
Model Sum of Squares = Σ_{i=1}^{S} (ŷ_i − ȳ)^2
Error Sum of Squares = Σ_{i=1}^{S} (y_i − ŷ_i)^2
Corrected Total Sum of Squares = Σ_{i=1}^{S} (y_i − ȳ)^2
17
CG DADL (June 2023) Lecture 3 – Simple Linear Regression
Validation of Model – Coefficient of
Determination (cont.)




R^2 near zero indicates very little of the variability in y_i is
explained by the linear relationship of Y with X.
R^2 near 1 indicates almost all of the variability in y_i is explained
by the linear relationship of Y with X.
R^2 is known as the coefficient of determination or multiple R-Squared.
Root Mean Squared Error (RMSE):
RMSE = √MSE = √(SSE / s) = √(Σ_{i=1}^{s} e_i^2 / s) = √(Σ_{i=1}^{s} (y_i − ŷ_i)^2 / s)
Recall that in linear regression, the goal is to minimize SSE.
So smaller value of RMSE, i.e., close to 0.0, is better.
Smaller RMSE indicates a model with better fit.
18
CG DADL (June 2023) Lecture 3 – Simple Linear Regression
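A minimal sketch of the goodness-of-fit measures above, computed directly from hypothetical actual and fitted values:

import numpy as np

y = np.array([50.0, 55.0, 60.0, 62.0, 70.0])      # actual values (illustrative)
y_hat = np.array([52.0, 54.0, 61.0, 63.0, 68.0])  # fitted values (illustrative)

sse = np.sum((y - y_hat) ** 2)              # error sum of squares
sst = np.sum((y - y.mean()) ** 2)           # corrected total sum of squares
r_squared = 1 - sse / sst
rmse = np.sqrt(sse / len(y))                # root mean squared error
print(r_squared, rmse)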
Validation of Model – Coefficient of
Determination (cont.)
• R-Square = Model MS/Corrected Total SS
= 0.771
• 77.1% of the variance in weight can be explained by
the simple linear model with height as the
independent variable.
• Adjusted R-Square = 0.757
= 1 – (1 – R2) (m – 1)/(m – n – 1)
= 1 – (1 – 0.771)(19-1)/
(19 – 1 – 1)
= 0.757
• R-Square always increases when a new term is added
to a model, but adjusted R-Square increases only if the
new term improves the model more than would be
expected by chance.
Root MSE = √(MSE_Error) = 2.3906
This is an estimate for the standard error of the
residuals σ.
19
CG DADL (June 2023) Lecture 3 – Simple Linear Regression
Validation of Model – Coefficient of
Determination (cont.)
• DF Model = 1, only one independent
variable in this model
• DF Corrected Total = S - 1 = 18, because
Σ_{i=1}^{19} (y_i − ȳ) = 0
• Knowing 18 of them, we will know the
value of the 19th difference.
Analysis of Variance
• F-value = MSE_Model / MSE_Error = 57.08
• F-value has n and m-n-1 DF
• The corresponding p-value is < .0001,
indicating that at least one of the
independent variables is useful for predicting
the dependent variable.
• In this case, there is only 1 independent
variable: the value of height is useful for
predicting the value of weight.
20
CG DADL (June 2023) Lecture 3 – Simple Linear Regression
Validation of Model – Significance of
Coefficient of Determination
• Is the value of β significantly different from zero?
• Hypothesis test: H 0 : β = 0
versus
Ha : β ≠ 0
• The t-value for the test is (0.1976/0.026) = 7.555 with corresponding p-value of < 0.0001.
• Since the p-value is lower than 5%, we may conclude with 95% confidence to reject H 0 that
β is 0.
• Note for simple linear regression models: t-value of the β parameter is the square root of
the F-value. In this example, 7.555× 7.555 ≈ 57.08
21
CG DADL (June 2023) Lecture 3 – Simple Linear Regression
Validation of Model – Significance of
Coefficient of Determination (cont.)
• The area under the curve to the left of -7.555 and to the right of +7.555 is less than 0.0001.
• We reject the null hypothesis and conclude that the slope β is not 0, i.e. the variable height is
useful for predicting the dependent variable weight.
22
CG DADL (June 2023) Lecture 3 – Simple Linear Regression
Validation of Model – Coefficient of
Linear Correlation
• In a simple linear regression model, the coefficient of determination = the square of the
coefficient of linear correlation between X and Y.
• In our example: X = Height; Y = Weight
• r = 0.877785
• R2 = 0.7705 = 0.877785 x 0.877785
23
CG DADL (June 2023) Lecture 3 – Simple Linear Regression
Assumptions of Linear Regression






Linear regression has five key assumptions.
Linear relationship – Relationship between the
independent and dependent variables is linear.
Homoscedasticity – Residuals have equal variance across the
regression line.
No auto-correlation – Residuals must be independent
from each other.
Multivariate normality – All variables being normally
distributed on a univariate level.
No or little multicollinearity – Independent variables are
not correlated with each other.
24
CG DADL (June 2023) Lecture 3 – Simple Linear Regression
Evaluating the Assumptions of Linear
Regression

Linearity:



Relationship between the independent and dependent variables
is linear.
The linearity assumption can best be tested with scatter plots.
Recall the scatter and line plot of the linear regression line that
we have created earlier.
The 𝑥𝑥𝑖𝑖 , 𝑦𝑦𝑖𝑖 values appear to be linear.
25
CG DADL (June 2023) Lecture 3 – Simple Linear Regression
Evaluating the Assumptions of Linear
Regression (cont.)

Homoscedasticity:




Residuals have equal variance across the regression line.
Scatter plots between residuals and predicted values are used
to confirm this assumption.
Any pattern would result in a violation of this assumption and
point toward a poor fitting model.
See the sample script in src02.
In the child’s weight example, no regular pattern/trend is observed.
26
CG DADL (June 2023) Lecture 3 – Simple Linear Regression
Evaluating the Assumptions of Linear
Regression (cont.)


We can also check the normality of the residuals using a Q-Q
plot.
See the sample script in src03.
• QQ plot on the left shows the residuals in the child’s weight example.
• Data points must fall (approximately) on a straight line for normal distribution.
27
CG DADL (June 2023) Lecture 3 – Simple Linear Regression
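A minimal sketch of the residual checks described above (not the actual src02/src03 scripts); the height/weight arrays below are hypothetical stand-ins for the child's weight example:

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

height = np.array([50.0, 52.0, 55.0, 57.0, 60.0, 63.0, 65.0, 68.0])
weight = np.array([52.0, 53.5, 54.0, 54.5, 55.0, 55.5, 56.0, 56.5])

results = sm.OLS(weight, sm.add_constant(height)).fit()
residuals, fitted = results.resid, results.fittedvalues

# Homoscedasticity check: residuals versus fitted values should show no pattern.
plt.scatter(fitted, residuals)
plt.axhline(0, linestyle='--')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.show()

# Normality check: residuals should fall roughly on the 45-degree line.
sm.qqplot(residuals, line='45', fit=True)
plt.show()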
Evaluating the Assumptions of Linear
Regression (cont.)

Auto-correlation:



Residuals must be independent from each other.
Residuals are randomly distributed with no pattern in the
scatter plot from src02.
We can also use the Durbin-Watson test to test the null
hypothesis that the residuals are not linearly auto-correlated:



28
While d can assume values between 0 and 4, values around 2 indicate
no autocorrelation.
As a rule of thumb, values of 1.5 < d < 2.5 show that there is no autocorrelation in the data.
StatsModels will report the Durbin-Watson d value, which is
2.643 in the child's weight example.
CG DADL (June 2023) Lecture 3 – Simple Linear Regression
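A minimal sketch of the Durbin-Watson check with StatsModels; durbin_watson() simply computes the d statistic from the model residuals (the data below are hypothetical):

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

x = np.array([50.0, 52.0, 55.0, 57.0, 60.0, 63.0, 65.0, 68.0])
y = np.array([52.0, 53.5, 54.0, 54.5, 55.0, 55.5, 56.0, 56.5])
results = sm.OLS(y, sm.add_constant(x)).fit()

d = durbin_watson(results.resid)
print(d)   # values around 2 (roughly 1.5 to 2.5) suggest no autocorrelation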
Evaluating the Assumptions of Linear
Regression (cont.)

Multivariate normality:



All variables in the linear regression model are normally
distributed on a univariate level.
We can perform visual/graphical test to check for normality of
the data using Q-Q plot and also histogram.
See the sample script in src04.
• Top (left to right) – Q-Q plot and histogram of
Height.
• Bottom (left to right) – Q-Q plot and histogram of
Weight.
29
CG DADL (June 2023) Lecture 3 – Simple Linear Regression
Evaluating the Assumptions of Linear
Regression (cont.)

No or little multicollinearity:




Independent variables are not correlated with each other.
For simple linear regression, this is obviously not a problem 
We will revisit the multicollinearity assumption in the multiple
linear regression model.
Child’s weight example:


30
We may conclude that the residuals are normal and
independent.
The linear regression model fits the data well.
CG DADL (June 2023) Lecture 3 – Simple Linear Regression
Confidence in the Linear Regression
Model

Linear regression is considered as a low variance/high bias
model:




Under repeated sampling, the line will stay roughly in the same
place (low variance).
But the average of those models will not do a great job in
capturing the true relationship (high bias).
Note that low variance is a useful characteristic when you do
not have a lot of training data.
A closely related concept is
confidence intervals:

31
StatsModels calculates 95% confidence
intervals for our model coefficients.
CG DADL (June 2023) Lecture 3 – Simple Linear Regression
Confidence in the Linear Regression
Model (cont.)

We can interpret the confidence intervals as follows:


32
If the population from which this sample was drawn were sampled 100
times,
approximately 95 of the resulting confidence intervals would contain the
"true" coefficient.
CG DADL (June 2023) Lecture 3 – Simple Linear Regression
Confidence in the Linear Regression
Model (cont.)

We can compare the true relationship to the predictions
by using StatsModels to calculate the confidence intervals
of the predictions:

33
See the sample script in src05.
CG DADL (June 2023) Lecture 3 – Simple Linear Regression
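A minimal sketch (not the actual src05 script) of obtaining confidence intervals for the coefficients and for the predictions with StatsModels, using the same hypothetical stand-in data as above:

import numpy as np
import statsmodels.api as sm

x = np.array([50.0, 52.0, 55.0, 57.0, 60.0, 63.0, 65.0, 68.0])
y = np.array([52.0, 53.5, 54.0, 54.5, 55.0, 55.5, 56.0, 56.5])
results = sm.OLS(y, sm.add_constant(x)).fit()

print(results.conf_int(alpha=0.05))              # 95% CIs for the model coefficients
pred = results.get_prediction(sm.add_constant(x))
print(pred.conf_int(alpha=0.05))                 # 95% CIs for the predictions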
Can We Assess the Accuracy of a Linear
Regression Model?

A linear regression model is intended to perform point
predictions:




It is difficult to make an exact point prediction of the actual
continuous numerical value.
Thus, it is not viable to assess accuracy in a conventional way.
Other than the various measures of goodness, a model
with a tight 95% confidence interval is preferred.
But we can perform split validation to assess model
overfitting:

34
The model from the testing data should return comparable
values for the measures of goodness.
CG DADL (June 2023) Lecture 3 – Simple Linear Regression
Lecture
4
Multiple Linear Regression
Corporate Gurukul – Data Analytics using Deep Learning
June 2023
Lecturer: A/P TAN Wee Kek
Email: tanwk@comp.nus.edu.sg :: Tel: 6516 6731 :: Office: COM3-02-35
Learning Objectives
At the end of this lecture, you should understand:





1
Multiple linear regression.
Standardization and normalization of data.
Selection of predictive variables.
Treatment of categorical variables.
CG DADL (June 2023) Lecture 4 – Multiple Linear Regression
Limitations of Simple Linear Regression
Simple linear regression only allows us to examine the
relationship between two variables.
In many real-world scenarios, there is likely more than
one independent variable that is correlated with the
dependent variable.
Multiple linear regression provides a tool that allows us
to examine the relationship between two or more
explanatory variables and a response variable.
Multiple linear regression is especially useful when trying
to account for potential confounding factors in
observational studies.




2
CG DADL (June 2023) Lecture 4 – Multiple Linear Regression
Multiple Linear Regression
A multiple linear regression model is of the form:

Y = α + β1 X 1 + β 2 X 2 + ... + β n X n + ε


The regression coefficient β j expresses the marginal effect of
the variable X j on the target, conditioned on the current value
of the remaining predictive variables.
Scale of the values influences the value of the corresponding
regression coefficient and thus it might be useful to standardize
the predictive variables.
There are different techniques to standardize or
normalize data:


3
E.g., decimal scaling, min-max normalization and z-index
normalization.
CG DADL (June 2023) Lecture 4 – Multiple Linear Regression
Multiple Linear Regression (cont.)
MLR scenarios:


Product Sales and Advertisement Channels:



Relationship between product sales and investments made in
advertisement through several media communication channels.
The regression coefficients indicate the relative advantage afforded by
the different channels.
Colleges and Universities Graduation Rate:

Predict the percentage of students who eventually graduate
(GraduationPercent) using a set of 4 explanatory variables:




4
MedianSAT – Median SAT score of students.
AcceptanceRate – Acceptance rate of applicants.
ExpendituresPerStudent – Education budget per student.
Top10PercentHS – % of students in the top 10% of their high school.
CG DADL (June 2023) Lecture 4 – Multiple Linear Regression
Data Transformation
Data transformation is a general approach to
improving the accuracy of data analytics models such as
multiple linear regression.
Preventive standardization or normalization of data:







5
Expressing a variable in smaller units will lead to a larger value
range for that variable.
Tendency to give such a variable greater effect or “weight”.
To help avoid dependence on the choice of measurement units,
the data should be normalized or standardized.
This involves transforming the data to fall within a smaller or
common range such as [-1, 1] or [0.0, 1.0].
May be done in three ways.
CG DADL (June 2023) Lecture 4 – Multiple Linear Regression
Data Transformation (cont.)
Decimal scaling:
Based on the transformation: x'_ij = x_ij / 10^h
where h is a given parameter to shift the decimal point by h
positions toward the left.
h is fixed at a value that gives transformed values in the range
[-1,1].
Example:
Recorded values of variable A range from -986 to 917.
Maximum absolute value of A is 986.
Divide each value by 1000 (i.e., h=3):
-986 normalizes to -0.986
917 normalizes to 0.917.
6
CG DADL (June 2023) Lecture 4 – Multiple Linear Regression
Data Transformation (cont.)
Min-max normalization:
Based on the following transformation:
x'_ij = (x_ij − x_min,j) / (x_max,j − x_min,j) × (x'_max,j − x'_min,j) + x'_min,j
where
x_min,j = min_i x_ij and x_max,j = max_i x_ij
are the minimum and maximum values of the attribute a_j
before transformation, while x'_min,j and x'_max,j are the minimum
and maximum values that we wish to obtain after
transformation.
In general, we use either [-1,1] or [0,1].
7
CG DADL (June 2023) Lecture 4 – Multiple Linear Regression
Data Transformation (cont.)

Example:



Suppose that the minimum and maximum values for the variable
income are $12,000 and $98,000, respectively.
We would like to map income to the range [0.0, 1.0].
A value of $73,600 for income is transformed to:
(73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0.0) + 0.0 = 0.716
8
CG DADL (June 2023) Lecture 4 – Multiple Linear Regression
Data Transformation (cont.)
z-index normalization:
Based on the transformation: x'_ij = (x_ij − μ_j) / σ_j
where μ_j and σ_j are the sample mean and sample standard
deviation of attribute a_j.
If the distribution of the values is approximately normal, the
transformed values will fall within the range (-3,3).
Useful if actual minimum and maximum values are unknown, or
there are outliers dominating the min-max normalization.
Example:
Suppose the mean and standard deviation of the values for the
variable income are $54,000 and $16,000, respectively.
With z-score normalization, a value of $73,600 for income is
transformed to: (73,600 − 54,000) / 16,000 = 1.225
9
CG DADL (June 2023) Lecture 4 – Multiple Linear Regression
Data Transformation (cont.)

Standardization techniques replace original values of a
variable with transformed values:


This will affect the way the multiple linear regression model is
interpreted.
In Python, the Scikit Learn library provides support for
data transformation via the sklearn.preprocessing
package:


10
For example, z-index normalization can be performed using the
preprocessing.scale() function.
See the sample script src01 for an example.
CG DADL (June 2023) Lecture 4 – Multiple Linear Regression
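A minimal sketch of the three transformations described above (not the actual src01 script), applied to a hypothetical income column with Scikit Learn and plain NumPy:

import numpy as np
from sklearn import preprocessing

income = np.array([[12000.0], [54000.0], [73600.0], [98000.0]])   # illustrative values

z_scaled = preprocessing.scale(income)                             # z-index normalization
minmax = preprocessing.MinMaxScaler(feature_range=(0.0, 1.0)).fit_transform(income)
decimal = income / 10 ** 5                                         # decimal scaling with h = 5
print(z_scaled, minmax, decimal, sep='\n')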
Example of Multiple Linear Regression

Predict colleges and universities graduation rate:





11
The source of the dataset is attributed to Evans (2013).
Predicting the percentage of students accepted into a college
program who would eventually graduate:
The dataset contains 49 observations.
There are five possible independent variables altogether – Type,
MedianSAT, AcceptanceRate, ExpendituresPerStudent,
Top10PercentHS.
For now, we will only use the latter four numerical variables,
i.e., excluding Type.
CG DADL (June 2023) Lecture 4 – Multiple Linear Regression
Example of Multiple Linear Regression
(cont.)
src02
12
CG DADL (June 2023) Lecture 4 – Multiple Linear Regression
Example of Multiple Linear Regression
(cont.)
• Scikit-Learn’s output
• Top: Correlation matrix.
• Bottom: Regression model
13
CG DADL (June 2023) Lecture 4 – Multiple Linear Regression
Example of Multiple Linear Regression
(cont.)
StatsModels’ output
14
CG DADL (June 2023) Lecture 4 – Multiple Linear Regression
Example of Multiple Linear Regression
(cont.)


The regression equation is:
GraduationPercent = 17.9210 +
0.0720×MedianSAT – 24.8592×AcceptanceRate
– 0.0001×ExpendituresPerStudent –
0.1628×Top10PercentHS
Interpreting the regression coefficients:



15
Higher median SAT scores and lower acceptance rates suggest
higher graduation rates.
A 1 unit increase in median SAT scores increases graduation rates
by 0.0720 unit, all other things being equal.
A 1 unit increase in acceptance rates decreases graduation rates
by 24.8592 units, all other things being equal.
CG DADL (June 2023) Lecture 4 – Multiple Linear Regression
Example of Multiple Linear Regression
(cont.)

Overall:




16
Do the regression coefficients make sense?
R2 is only 0.53.
R2 and Adjusted R2 are quite similar.
RMSE is 5.03
CG DADL (June 2023) Lecture 4 – Multiple Linear Regression
Evaluating the Assumptions of Linear
Regression

Residuals analysis:


17
The scatter plot of residuals against predicted values and the
Q-Q plot of the residuals shows that the residuals are normal
and independent.
See the sample script src03.
CG DADL (June 2023) Lecture 4 – Multiple Linear Regression
Evaluating the Assumptions of Linear
Regression (cont.)

Multicollinearity:


Occurs when significant linear correlation exists between two
or more predictive variables.
Potential problems:




18
Regression coefficients are inaccurate.
Compromises overall significance of the model.
Possible that the coefficient of determination is close to 1 while the
regression coefficients are not significantly different from 0.
Pairwise linear correlation coefficients may be calculated.
CG DADL (June 2023) Lecture 4 – Multiple Linear Regression
Evaluating the Assumptions of Linear
Regression (cont.)


19
We can observe mild correlation among the four independent
variables.
Consequently, you would see that even though 𝑅𝑅2 is
moderately high, two of the independent variables, i.e.,
ExpenditurePerStudent and Top10PercentHS, are just below
the 0.05 (or 95%) significance threshold.
CG DADL (June 2023) Lecture 4 – Multiple Linear Regression
Evaluating the Assumptions of Linear
Regression (cont.)

To identify multiple linear relationships or
multicollinearity among predictive variables:



20
Calculate the variance inflation factor (VIF) for each
predictor X_j as:
VIF_j = 1 / (1 − R_j^2)
where R_j^2 is the coefficient of determination for the
model that explains X_j, treated as a response, through
the remaining independent variables.
VIF j > 5 indicates multicollinearity.
VIF can be calculated in StatsModels –
See the sample script src04.
CG DADL (June 2023) Lecture 4 – Multiple Linear Regression
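A minimal sketch (not the actual src04 script) of computing VIF with StatsModels; the data frame below is a hypothetical stand-in for the colleges dataset, with the column names assumed from the slides:

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.DataFrame({
    'MedianSAT': [1315, 1220, 1240, 1176, 1300, 1245],
    'AcceptanceRate': [0.22, 0.53, 0.36, 0.60, 0.25, 0.40],
    'ExpendituresPerStudent': [26636, 17653, 19691, 15140, 24000, 18000],
    'Top10PercentHS': [85, 69, 76, 53, 80, 70],
})
X = sm.add_constant(df)

# VIF for each predictor (column 0 is the constant, so it is skipped).
vif = pd.Series([variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])],
                index=df.columns)
print(vif)   # VIF > 5 for a predictor suggests multicollinearity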
Selection of Predictive Variables

In multiple linear regression model:




We typically select a subset of all predictive variables that are
most effective in explaining the response variable.
Relevance analysis identifies variables that do not contribute to
the prediction process.
This is known as feature selection.
Rationales:


21
Model based on all predictive variables may not be significant
due to multicollinearity.
Model with too many variables tend to overfit the training data
samples and they may not predict new/unseen data with good
accuracy.
CG DADL (June 2023) Lecture 4 – Multiple Linear Regression
Selection of Predictive Variables (cont.)

General methods for variables selection:



Forward Selection – Starts with no variables in the model
and adds variables using the F statistics.
Backward Elimination – Starts with all variables in the
model and deletes variables using the F statistics.
Stepwise:



22
Similar to forward selection except the variables in the model do not
necessarily stay there.
A hybrid of forward selection and backward elimination.
A variable that was initially added could be deleted if the addition of a
new variable negatively affects its level of significance.
CG DADL (June 2023) Lecture 4 – Multiple Linear Regression
Selection of Predictive Variables (cont.)

However, variables selection suffers from some potential
problems:




May introduce bias in the selection of variables.
May lack theoretical grounding on the final set of selected
variables.
May require significant computing power, and thus time, if the
dataset is large.
Variables selection with Scikit Learn:


23
Does NOT support variables selection in its linear regression
routine.
Supports a standalone routine to calculate the F-score and
p-values corresponding to each of the regressors
(sklearn.feature_selection.f_regression).
CG DADL (June 2023) Lecture 4 – Multiple Linear Regression
Selection of Predictive Variables (cont.)

Supports a standalone recursive feature elimination
(sklearn.feature_selection.RFE).
src05
24
CG DADL (June 2023) Lecture 4 – Multiple Linear Regression
Selection of Predictive Variables (cont.)
• Based on the F-scores, we would select MedianSAT and AcceptanceRate.
• Based on the recursive elimination ranking, we would select AcceptanceRate and
Top10PercentHS
25
CG DADL (June 2023) Lecture 4 – Multiple Linear Regression
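A minimal sketch (not the actual src05 script) of the two Scikit Learn routines mentioned above; X and y below are synthetic stand-ins for the four candidate predictors and the graduation-percentage target:

import numpy as np
from sklearn.feature_selection import f_regression, RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(49, 4))                       # 4 candidate predictors (synthetic)
y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(size=49)

f_scores, p_values = f_regression(X, y)            # univariate F-score and p-value per predictor
print(f_scores, p_values)

rfe = RFE(LinearRegression(), n_features_to_select=2).fit(X, y)
print(rfe.ranking_)                                # rank 1 marks the selected predictors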
Example of Variables Selection

Predict the newsprint consumption in US cities:


39 observations altogether.
Six independent variables:







X1: Number of newspapers in the city.
X2: Proportion of the city population under the age of 18.
X3: Median school year completed (city resident).
X4: Proportion of city population employed in white collar
occupation.
X5: Logarithm of the number of families in the city.
X6: Logarithm of total retail sales.
See the sample script src06 (similar to src05).
26
CG DADL (June 2023) Lecture 4 – Multiple Linear Regression
Example of Variables Selection (cont.)
• Based on the F-scores, we would select X5 and X6. (Note that the F-scores of X1 to
X4 are not significant at the 0.95 level.)
• Based on the recursive elimination ranking, we would also select X5 and X6.
27
CG DADL (June 2023) Lecture 4 – Multiple Linear Regression
Treatment of Categorical Predictive
Variables


Categorical variables may be included as predictors in a
regression model using dummy variables.
A nominal categorical variable X_j with H distinct values,
denoted by V = {v_1, v_2, ..., v_H}, may be represented in two
ways:
Using arbitrary numerical values:
Regression coefficients will be affected by the chosen scale.
Compromises significance of the model.
Using H − 1 binary variables D_j1, D_j2, ..., D_j,H−1, called dummy
variables:
Each binary variable D_jh is associated with level v_h of X_j.
D_jh takes the value of 1 if x_ij = v_h.
28
CG DADL (June 2023) Lecture 4 – Multiple Linear Regression
Data Encoding Examples with
Multiple Linear Regression

Nominal categorical data:




29
Housing estate – Multinomial variable with 5 different possible
values represented using 4 binomial variables I1 to I4:

Housing Estate   I1   I2   I3   I4
Ang Mo Kio        1    0    0    0
Bishan            0    1    0    0
Clementi          0    0    1    0
Dover             0    0    0    1
East Coast        0    0    0    0

Predict price of HDB flats (y) based on the size (x1), floor level
(x2) and distance to MRT station (x3).
Model: y = β0 + β1x1 + β2x2 + β3x3 + β4I1 + β5I2 + β6I3 + β7I4
In East Coast, the model is actually: y = β0 + β1x1 + β2x2 + β3x3
CG DADL (June 2023) Lecture 4 – Multiple Linear Regression
Data Encoding Examples with
Multiple Linear Regression (cont.)



30
In Dover, the model is: y = β 0 + β1 x1 + β 2 x2 + β 3 x3 + β 7
Suppose β 7 = −12000 : The price of a HDB flat in Dover is
expected to be $12000 less than another HDB flat in East
Coast if both HDB flats have the same size, are on the same
level and have the same distance to the nearest MRT station.
Interpretation of the dummy variables always with respect to
East Coast.
CG DADL (June 2023) Lecture 4 – Multiple Linear Regression
Data Encoding Examples with
Multiple Linear Regression (cont.)

Ordinal categorical data:




31
Education level – Multinomial variable with 4 different possible
values represented using 3 binomial variables I1 to I3:

Education Level      I1   I2   I3
Elementary School     0    0    0
High School           0    0    1
College               0    1    1
Graduate School       1    1    1

Predict starting salary (y) based on education level.
Model: y = β0 + β1I1 + β2I2 + β3I3
Suppose β1 = β2 = β3 = 0: No difference in starting salary.
CG DADL (June 2023) Lecture 4 – Multiple Linear Regression
Data Encoding Examples with
Multiple Linear Regression (cont.)

Suppose β1 = 0, β 2 > 0, β 3 = 0 :



32
Model: y = β 0 + β 2 I 2
When education level is high school or lower, y = β 0
When education level is college or higher, y = β 0 + β 2
CG DADL (June 2023) Lecture 4 – Multiple Linear Regression
Example of Categorical Predictive
Variables

Predict colleges and universities graduation rate:






33
We will now include Type as one of the independent variables.
Type is a nominal categorical variable with two values.
Values of the variable Type are Lib Arts and University, i.e., H = 2.
So we use one dummy variable University with a numeric 1
representing University and a numeric 0 representing Lib Arts.
A University has a 1.43632 unit higher graduation rate
compared to Lib Arts, all other things being equal.
But the regression coefficient of University is not significant in
this case (p = 0.524).
CG DADL (June 2023) Lecture 4 – Multiple Linear Regression
Example of Categorical Predictive
Variables (cont.)
Dummy coding of “University”
from line 11 to 15
src07
34
CG DADL (June 2023) Lecture 4 – Multiple Linear Regression
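A minimal sketch of the dummy coding described above (not the actual src07 script); the column name 'Type' and its two values follow the slides, while the four example rows are hypothetical:

import pandas as pd

df = pd.DataFrame({'Type': ['Lib Arts', 'University', 'University', 'Lib Arts']})

# drop_first=True keeps H-1 = 1 dummy: 1 for University, 0 for Lib Arts.
dummies = pd.get_dummies(df['Type'], drop_first=True, dtype=int)
print(dummies)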
Example of Categorical Predictive
Variables (cont.)
35
CG DADL (June 2023) Lecture 4 – Multiple Linear Regression
Lecture
5
Introduction to Classification
Corporate Gurukul – Data Analytics using Deep Learning
June 2023
Lecturer: A/P TAN Wee Kek
Email: tanwk@comp.nus.edu.sg :: Tel: 6516 6731 :: Office: COM3-02-35
Learning Objectives

At the end of this lecture, you should understand:





Overview of Machine Learning
Limitations of linear regression models.
Definitions of classification problem and classification models.
Evaluation of classification models.
Usefulness of classification and clustering.
CG DADL (June 2023) Lecture 5 – Introduction to Classification
Overview of Classification
CG DADL (June 2023) Lecture 5 – Introduction to Classification
Limitation of Linear Regression Models


Regression analysis is useful but suffers from an important
limitation.
In linear regression models, the numerical dependent
variable must be continuous:




The dependent variable can take on any value, or at least close
to continuous.
In some data analytics scenarios, the dependent variable may
not be continuous.
In other scenarios, it may be unnecessary to make a point
prediction.
It is possible to convert a regression problem into a
classification problem.
CG DADL (June 2023) Lecture 5 – Introduction to Classification
Limitations of Linear Regression Models
(cont.)

Linear regression requires a linear relationship between
the dependent and independent variables:


The assumption that there is a straight-line relationship
between them does not always hold.
Linear regression models only look at the mean of the
dependent variable:

E.g., in the relationship between the birth weight of infants and
maternal characteristics such as age:


Linear regression will look at the average weight of babies born to
mothers of different ages.
But sometimes we need to look at the extremes of the dependent
variable, e.g., babies are at risk when their weights are low.
CG DADL (June 2023) Lecture 5 – Introduction to Classification
Limitations of Linear Regression Models
(cont.)

Linear regression is sensitive to outliers:


Outliers can have huge effects on the regression.
Data must be independent:

Linear regression assumes that the data are independent:


I.e., the scores of one subject (such as a person) have nothing to do
with those of another.
This assumption does not always make sense:


E.g., students in the same class tend to be similar in many ways such as
coming from the same neighborhoods, taught by the same teachers,
etc.
In the above example, the students are not independent.
CG DADL (June 2023) Lecture 5 – Introduction to Classification
Parametric versus Non-parametric

Linear regression is parametric:



Parametric ML algorithms:


Assumes that sample data comes from a population that can
be adequately modelled by a probability distribution that has a
fixed set of parameters.
Assumptions can greatly simplify the learning process, but can
also limit what can be learned.
Algorithms that simplify the function to a known form.
Non-parametric ML algorithms:


Algorithms that do not make strong assumptions about the
form of the mapping function.
Free to learn any functional form from the training data.
CG DADL (June 2023) Lecture 5 – Introduction to Classification
Parametric versus Non-parametric
(cont.)

Non-parametric ML methods are good when:



You have a lot of data and no prior knowledge.
You do not want to worry too much about choosing just the
right features.
Classification algorithms include both parametric and
non-parametric:


Parametric – Logistic Regression, Linear Discriminant Analysis,
Perceptron, Naive Bayes, Simple Neural Networks
Non-parametric – k-Nearest Neighbors, Decision Trees,
Support Vector Machines
CG DADL (June 2023) Lecture 5 – Introduction to Classification
Data Mining Goes to Hollywood


Data mining scenario – Predicting the box-office receipt
(i.e., financial success) of a particular movie.
Problem:

Traditional approach:



Frames it as a forecasting (or regression) problem.
Attempts to predict the point estimate of a movie’s box-office receipt.
Sharda and Delen’s (2006) approach:



Convert the regression problem into a multinomial classification
problem.
Classify a movie based on its box-office receipts into one of nine
categories, ranging from “flop” to “blockbuster”.
Use variables representing different characteristics of a movie to train
various classification models.
CG DADL (June 2023) Lecture 5 – Introduction to Classification
Overview of Classification

Classification models:



Aim of classification models:



Supervised learning methods for predicting value of a
categorical target variable.
In contrast, regression models deal with numerical (or
continuous) target variable.
Generate a set of rules from past observations with known
target class.
Rules are used to predict the target class of future
observations.
Classification holds a prominent position in learning
theory.
CG DADL (June 2023) Lecture 5 – Introduction to Classification
Overview of Classification (cont.)

From a theoretical viewpoint:


Classification algorithm development represents a fundamental
step in emulating inductive capabilities of the human brain.
From a practical viewpoint:


Classification is applicable in many different domains.
Examples:






Selection of target customers for a marketing campaign.
Fraud detection.
Image recognition.
Early diagnosis of disease.
Text cataloguing.
Spam email recognition.
CG DADL (June 2023) Lecture 5 – Introduction to Classification
Classification Problems



We have a dataset D containing m observations
described in terms of n explanatory variables and a
categorical target variable (a class or label).
The observations are also termed as examples or
instances.
The target variable takes a finite number of values:


Binary classification – The instances belong to two classes
only.
Multiclass or multicategory classification – There are
more than two classes in the dataset.
CG DADL (June 2023) Lecture 5 – Introduction to Classification
Classification Problems (cont.)




A classification problem consists of defining an
appropriate hypothesis space F and an algorithm AF that
identifies a function f * ∈ F that can optimally describe the
relationship between the predictive variables and the
target class.
F is a class of functions f(x): R^n ⇒ H called hypotheses
that represent hypothetical relationships of dependence
between y_i and x_i.
R^n contains the vectors of values taken by the predictive variables
for an instance.
H could be {0,1} or {−1,1} for a binary classification
problem.
CG DADL (June 2023) Lecture 5 – Introduction to Classification
Components of a Classification Problem



Generator – Extract random vectors Χ of data
instances.
Supervisor – For each vector Χ , return the value of the
target class.
Classification algorithm (or classifier) – Choose a
function f * ∈ F in the hypothesis space so as to minimize
a suitably defined loss function.
[Diagram: the Generator produces instances x, the Supervisor returns the target class y for each x, and the Classification Algorithm learns the function f(x) from the (x, y) pairs.]
CG DADL (June 2023) Lecture 5 – Introduction to Classification
Development of a Classification Model


Development of a classification model consists of three
main phases.
Training phase:




The classification algorithm is applied to the instances
belonging to a subset T of the dataset D .
T is called the training data set.
Classification rules are derived to allow users to predict a class
to each observation Χ.
Test phase:


The rules generated in the training phase are used to classify
observations in D but not in T .
Accuracy is checked by comparing the actual target class with
the predicted class for all instances in V = D − T .
CG DADL (June 2023) Lecture 5 – Introduction to Classification
Development of a Classification Model
(cont.)



Observations in V form the test set.
The training and test sets are disjoint: V ∩T = ∅ .
Prediction phase:


The actual use of the classification model to assign target class
to completely new observations.
This is done by applying the rules generated during the training
phase to the variables of the new instances.
CG DADL (June 2023) Lecture 5 – Introduction to Classification
Development of a Classification Model
(cont.)
[Diagram: Training Data feed the training and tuning steps that produce the classification Rules; Test Data are used in the test step for Accuracy Assessment; New Data go through the prediction step to yield Knowledge.]
CG DADL (June 2023) Lecture 5 – Introduction to Classification
Taxonomy of Classification Models

Heuristic models:


Classification is achieved by applying simple and intuitive
algorithms.
Examples:

Classification trees – Apply divide-and-conquer technique to obtain
groups of observations that are as homogenous as possible with
respect to the target variable.



Also known as decision trees.
Nearest neighbor methods – Based on the concept of distance
between observations.
Separation models:


Divide the variable space into H distinct regions.
All observations in a region are assigned the same class.
CG DADL (June 2023) Lecture 5 – Introduction to Classification
Taxonomy of Classification Models
(cont.)




How to determine these regions? Neither too complex or
many, nor too simple or few.
Define a loss function to take into account the misclassified
observations and apply an optimization algorithm to derive a
subdivision into regions that minimizes the total loss.
Examples – Discriminant analysis, perceptron methods, neural
networks (multi-layer perceptron) and support vector
machines (SVM).
Regression model:


Logistic regression is an extension of linear regression suited
to handling binary classification problems.
Main idea – Convert binary classification problem via a proper
transformation into a linear regression problem.
CG DADL (June 2023) Lecture 5 – Introduction to Classification
Taxonomy of Classification Models
(cont.)

Probabilistic models:




A hypothesis is formulated regarding the functional form of the
conditional probabilities PΧ| y (Χ | y ) of the observations given
the target class.
This is known as class-conditional probabilities.
Based on an estimate of the prior probabilities Py ( y ) and using
Bayes’ theorem, calculate the posterior probabilities Py|Χ ( y | Χ )
of the target class.
Examples – Naive Bayes classifiers and Bayesian networks.
CG DADL (June 2023) Lecture 5 – Introduction to Classification
Evaluation of Classification Models

In a classification analysis:



It is advisable to develop alternative classification models.
The model that affords the best prediction accuracy is then
selected.
To obtain alternative models:



Different classification methods may be used.
The values of the parameters may also be modified.
Accuracy:


The proportion of the observations that are correctly
classified by the model.
Usually, one is more interested in the accuracy of the model on
the test data set V .
CG DADL (June 2023) Lecture 5 – Introduction to Classification
Evaluation of Classification Models
(cont.)

If y_i denotes the class of the generic observation X_i ∈ V and
f(X_i) the class predicted through the function f ∈ F identified
by the learning algorithm A = A_F, the following loss function
can be defined:
L(y_i, f(X_i)) = 0 if y_i = f(X_i), and 1 if y_i ≠ f(X_i)
The accuracy of model A can be evaluated as:
acc_A(V) = acc_AF(V) = 1 − (1/v) Σ_{i=1}^{v} L(y_i, f(X_i))
where v is the number of observations.
The proportion of errors made is defined as:
err_A(V) = err_AF(V) = 1 − acc_AF(V) = (1/v) Σ_{i=1}^{v} L(y_i, f(X_i))
CG DADL (June 2023) Lecture 5 – Introduction to Classification
Evaluation of Classification Models
(cont.)

Speed:


Robustness:



The method is robust if the classification rules generated and
the corresponding accuracy do not vary significantly as the
choice of training data and test datasets varies.
It must also be able to handle missing data and outliers well.
Scalability:


Long computation time on large datasets can be reduced by
means of random sampling scheme.
Able to learn from large datasets.
Interpretability:

Generated rules should be simple and easily understood by
knowledge workers and domain experts.
CG DADL (June 2023) Lecture 5 – Introduction to Classification
Holdout Method





Divide the available m observations in the dataset D into
training dataset T and test dataset V .
The t observations in T is usually obtained by random
selection.
The number of observations in T is suggested to be
between one half and two thirds of the total number of
observations in D .
The accuracy of the classification algorithm via the
holdout method depends on the test set V .
In order to better estimate accuracy, different strategies
have been recommended.
CG DADL (June 2023) Lecture 5 – Introduction to Classification
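A minimal sketch of the holdout method with Scikit Learn; here two thirds of a small hypothetical dataset go to the training set T and the rest to the test set V:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(30).reshape(15, 2)        # m = 15 observations, n = 2 variables (synthetic)
y = np.array([0, 1] * 7 + [0])          # binary target class

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=2/3, random_state=42)
print(len(X_train), len(X_test))        # 10 training and 5 test observations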
Repeated Random Sampling


Simply replicate the holdout method r times.
For each repetition k = 1,2,..., r :



A random training dataset Tk having t observations is
generated.
Compute acc AF (Vk ) , the accuracy of the classifier on the
corresponding test set Vk , where Vk = D − Tk .
Compute the average accuracy as:
acc_A = acc_AF = (1/r) Σ_{k=1}^{r} acc_AF(V_k)
Drawback – No control over the number of times each
observation may appear, outliers may cause undesired
effects on the rules generated and the accuracy.
CG DADL (June 2023) Lecture 5 – Introduction to Classification
Cross-validation


Divide the data into r disjoint subsets, L1 , L2 ,..., Lr of
(almost) equal size.
For iterations k = 1,2,..., r :




Let the test set be Vk = Lk
And the training set be Tk = D − Lk
Compute acc AF (Vk )
Compute the average accuracy:
acc_A = acc_AF = (1/r) Σ_{k=1}^{r} acc_AF(V_k)
Usual value for r is r = 10 (i.e., ten-fold cross-validation).
CG DADL (June 2023) Lecture 5 – Introduction to Classification
Cross-validation (cont.)

Also known as k-fold cross-validation or rotation
estimation.
[Figure: the dataset is divided into ten folds L1 to L10; in each of the ten iterations a different fold is held out as the test set while the remaining nine folds form the training set.]
Illustration of ten-fold cross-validation
CG DADL (June 2023) Lecture 5 – Introduction to Classification
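A minimal sketch of ten-fold cross-validation with Scikit Learn on a synthetic classification dataset (the classifier and data are illustrative stand-ins):

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(scores.mean())    # average accuracy acc_A over the r = 10 folds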
Leave-One-Out




Cross-validation method with the number of iterations r
being set to m.
This means each of the m test sets consists only of 1
sample and the corresponding training data set consists of
m − 1 samples.
Intuitively, every observation is used for testing once on
as many models developed as there are number of data
points.
Time consuming methodology but a viable option for
small dataset.
CG DADL (June 2023) Lecture 5 – Introduction to Classification
Stratified Random Sampling


Instead of random sampling to partition the dataset D
into training set T and test set V, stratified random
sampling could be used to ensure that the proportion of
observations belonging to each target class is the same in
both the training set T and the test set V.
In cross-validation, each subset L_k should also contain the
same proportion of observations belonging to
each target class.
In this example:
• Purple/Blue – Class 0
• Red – Class 1
[Figure: the ten folds L1 to L10 with their class composition; purple/blue marks Class 0 and red marks Class 1.]
CG DADL (June 2023) Lecture 5 – Introduction to Classification
Confusion Matrices

In many situations, just computing the accuracy of the
classifier may not be sufficient:

Example 1 – Medical Domain:



The value of 1 means the patient has a given medical condition and -1
means the patient does not.
If only 2% of all patients in the database have the condition, then we
achieve an accuracy rate of 98% by having the trivial rule that “the
patient does not have the condition”.
Example 2 – Customer Retention:


The value of 1 means the customer has cancelled the service, 0 means
the customer is still active.
If only 2% of the available data correspond to customers who have
cancelled the service, the trivial rule “the customer is still active” has
an accuracy rate of 98%.
CG DADL (June 2023) Lecture 5 – Introduction to Classification
Confusion Matrices (cont.)
Confusion matrix for a binary target variable encoded
with the class values {−1, +1}:

Accuracy – Among all instances, what is the proportion that
are correctly predicted?
acc = (p + v) / m = (p + v) / (p + q + u + v)

                        Predictions
Instances         -1 (Negative)   +1 (Positive)   Total
-1 (Negative)           p               q          p + q
+1 (Positive)           u               v          u + v
Total                 p + u           q + v          m
CG DADL (June 2023) Lecture 5 – Introduction to Classification
Confusion Matrices (cont.)

True negative rate – Among all negative instances,
proportion of correct predictions: tn = p / (p + q)
False negative rate – Among all positive instances,
proportion of incorrect predictions: fn = u / (u + v)
False positive rate – Among all negative instances,
proportion of incorrect predictions: fp = q / (p + q)
True positive rate – Among all positive instances, proportion
of correct predictions (also known as recall): tp = v / (u + v)
CG DADL (June 2023) Lecture 5 – Introduction to Classification
Confusion Matrices (cont.)

Precision – Among all positive predictions, the proportion of
actual positive instances: prc = v / (q + v)
Geometric mean is defined as: gm = √(tp × tn)
and sometimes also as: gm = √(tp × prc)
F-measure is defined as:
F = ((β^2 + 1) × tp × prc) / (β^2 × prc + tp)
where β ∈ [0, ∞] regulates the relative importance of the
precision w.r.t. the true positive rate. The F-measure is also
equal to 0 if all the predictions are incorrect.
CG DADL (June 2023) Lecture 5 – Introduction to Classification
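A minimal sketch of the confusion-matrix measures above using Scikit Learn on hypothetical actual and predicted labels encoded as {-1, +1}:

from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

y_true = [-1, -1, -1, -1, +1, +1, +1, +1]
y_pred = [-1, -1, +1, -1, +1, +1, -1, +1]

# Rows are actual classes, columns are predictions: [[p, q], [u, v]].
print(confusion_matrix(y_true, y_pred, labels=[-1, +1]))
print(accuracy_score(y_true, y_pred))               # (p + v) / m
print(recall_score(y_true, y_pred, pos_label=1))    # true positive rate tp
print(precision_score(y_true, y_pred, pos_label=1)) # prc
print(f1_score(y_true, y_pred, pos_label=1))        # F-measure with beta = 1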
ROC Curve Charts

Receiver operating characteristic (ROC) curve
charts:



Allow the user to visually evaluate the accuracy of a classifier
and to compare different classification models.
Visually express the information content of a sequence of
confusion matrices.
Allow the ideal trade-off between:


Number of correctly classified positive observations – True Positive
Rate on the y-axis.
Number of incorrectly classified negative observations to be assessed
– False Positive Rate on the x-axis.
CG DADL (June 2023) Lecture 5 – Introduction to Classification
ROC Curve Charts (cont.)
[Figure: ROC curve chart with the true positive rate on the vertical axis and the false positive rate on the horizontal axis.]
CG DADL (June 2023) Lecture 5 – Introduction to Classification
ROC Curve Charts (cont.)

ROC curve chart is a two dimensional plot:








fp on the horizontal axis and tp on the vertical axis.
The point (0,1) represents the ideal classifier.
The point (0,0) corresponds to a classifier that predicts class
{− 1} for all samples.
The point (1,1) corresponds to a classifier that predicts class
{1} for all samples.
Parameters in a classifier may be adjusted so that tp can be
increased, but at the same time increasing fp .
A classifier with no parameters to be (further) tuned yields
only 1 point on the chart.
The area beneath the ROC curve provides a means to compare the
accuracy of various classifiers.
The ROC curve with the greatest area is preferable.
CG DADL (June 2023) Lecture 5 – Introduction to Classification
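A minimal sketch of plotting a ROC curve and its area with Scikit Learn; the classifier and data are hypothetical stand-ins:

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scores = LogisticRegression().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, _ = roc_curve(y_te, scores)      # false/true positive rates per threshold
print(auc(fpr, tpr))                       # area under the ROC curve

plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], linestyle='--')   # diagonal = random classifier
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.show()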
Lecture
6
Decision Tree
Corporate Gurukul – Data Analytics using Deep Learning
June 2023
Lecturer: A/P TAN Wee Kek
Email: tanwk@comp.nus.edu.sg :: Tel: 6516 6731 :: Office: COM3-02-35
Learning Objectives
At the end of this lecture, you should understand:





1
Overview of decision tree or classification tree.
Components of a decision tree.
Splitting rules and criteria.
Using decision tree in Python with Scikit Learn
CG DADL (June 2023) Lecture 6 – Decision Tree
Decision Trees
The best known and most widely used learning methods
in data mining applications.
Reasons for its popularity include:







2
Conceptual simplicity.
Ease of usage.
Computational speed.
Robustness with respect to missing data and outliers.
Interpretability of the generated rules.
CG DADL (June 2023) Lecture 6 – Decision Tree
Decision Trees (cont.)
The development of a decision tree involves
recursive, heuristic, top-down induction:

1.
2.
3.
Initialization phase – All observations are placed in the root of
the tree. The root is placed in the active node list L .
If the list L is empty, stop the procedure. Otherwise, node J ∈ L
is selected, removed from the list and used as the node for
analysis.
The optimal rule to split the observations in J is then
determined, based on an appropriate preset criterion:



3
If J does not need to be split, node J becomes a leaf, target class is
assigned according to majority class of observations.
Otherwise, split node J , its children are added to the list.
Go to Step 2.
CG DADL (June 2023) Lecture 6 – Decision Tree
Components of Decision Trees
Components of the top-down induction of decision trees:




Splitting rules – Optimal way to split a node (i.e., assigning
observations to child nodes) and for creating child nodes.
Stopping criteria – If the node should be split or not. If not,
this node becomes a leaf of the tree.
Pruning criteria – Avoid excessive growth of the tree (pre-pruning) during the tree generation phase, and reduce the number
of nodes after the tree has been generated (post-pruning).
A simple decision tree:
• Left branch for "No"
• Right branch for "Yes"
[Figure: root node "Exam ≥ 80?"; its "No" branch leads to "Assignment ≥ 80%?" with leaves C (No) and B (Yes); its "Yes" branch leads to "Assignment ≥ 50%?" with leaves B (No) and A (Yes).]
4
CG DADL (June 2023) Lecture 6 – Decision Tree
Splitting Rules
Two splitting rules based on variable value selection:


Binary split – Each node has at most two branches:




Multi-split classification trees – Each node has an arbitrary
number of branches:


5
Example – Customers who agree to a mailing campaign are placed on
the right child node, those who do not agree on the left child node.
Example – Customers residing in areas {1,2} are on the right child
node, {3,4} on the left.
Example – Customers who are 45 years old or younger are on the
right, others on the left.
It is easier to handle multi-valued categorical variables.
For numerical variables, it is necessary to group together adjacent
values. This can be achieved by discretization.
CG DADL (June 2023) Lecture 6 – Decision Tree
Splitting Rules (cont.)

Empirical evidence suggests no significant difference in
performance of classification trees with regards to the number
of children nodes.
Two splitting rules based on the number of variables
selected:


Univariate:


Based on the value of a single explanatory variable X j .
Examples:
[Figure: three univariate splits.
• Univariate split for a binary variable – "Authorized communication" with branches No and Yes.
• Univariate split for a numerical variable – "Age" with branches Xj ≤ 45 (Young) and Xj > 45 (Old).
• Univariate split for a nominal variable – "Area of residence" with branches Xj = 1 (North), Xj = 2 (Center), Xj = 3 (South) and Xj = 4 (Island).]
6
CG DADL (June 2023) Lecture 6 – Decision Tree
Splitting Rules (cont.)

Univariate trees are also called axis-parallel trees.

Example:
[Figure: an axis-parallel tree. The root splits on X1 ≤ 5 versus X1 > 5; the X1 ≤ 5 branch then splits on X2 ≤ 2 versus X2 > 2, while the X1 > 5 branch splits on X2 ≤ 0 versus X2 > 0, and the latter on X2 ≤ 4.5 versus X2 > 4.5. The accompanying scatter plot shows the corresponding axis-parallel partition lines X1 = 5, X2 = 0, X2 = 2 and X2 = 4.5.]
7
CG DADL (June 2023) Lecture 6 – Decision Tree
Splitting Rules (cont.)

Multivariate trees – Also called oblique decision trees:

Observations are separated based on the expression:
Σ_{j=1}^{n} w_j x_j ≤ b
where the threshold value b and the coefficients w1, w2, ..., wn of the
linear combination have to be determined, for example by solving an
optimization problem for each node.
[Figure: a scatter plot in the (X1, X2) plane partitioned by an oblique split line.]
8
CG DADL (June 2023) Lecture 6 – Decision Tree
Univariate Splitting Criteria
Let p_h be the proportion of instances of target class v_h,
h ∈ H, at a given node q and let Q be the total number of
instances at q.
We have
Σ_{h=1}^{H} p_h = 1
The heterogeneity index I(q) is a function of the relative
frequencies p_h, h ∈ H, of the target class values for the
instances at the node.
The index must satisfy the following 3 criteria:
It must be maximum when the instances in the node are
distributed homogeneously among all the classes.
9
CG DADL (June 2023) Lecture 6 – Decision Tree
Univariate Splitting Criteria (cont.)
It must be minimum when all the instances at the node belong
to only one class.
It must be a symmetric function w.r.t. the relative frequencies
p_h, h ∈ H.
The heterogeneity indices of a node q that satisfy the
three impurity/inhomogeneity criteria:
Misclassification index: Miscl(q) = 1 − max_h p_h
Entropy index: Entropy(q) = −Σ_{h=1}^{H} p_h log2 p_h
Gini index: Gini(q) = 1 − Σ_{h=1}^{H} p_h^2
10
CG DADL (June 2023) Lecture 6 – Decision Tree
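The three impurity indices above can be computed directly from the class counts at a node. The following is a minimal Python sketch (the helper names are illustrative and not part of the lecture's sample scripts):

import numpy as np

def misclassification(counts):
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return 1.0 - p.max()

def entropy(counts):
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    p = p[p > 0]                       # treat 0 * log2(0) as 0
    return -np.sum(p * np.log2(p))

def gini(counts):
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)

# Example: a node with 4 "High" and 2 "Low" instances (the loan-risk root node used later).
print(misclassification([4, 2]))       # 0.3333...
print(entropy([4, 2]))                 # 0.9183...
print(gini([4, 2]))                    # 0.4444...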
Univariate Splitting Criteria (cont.)
 In a binary classification problem, the three impurity measures defined above reach:
 Their maximum value when p_1 = p_2 = 0.5.
 Their minimum value, i.e., 0, when p_1 = 0 or p_1 = 1.
11
CG DADL (June 2023) Lecture 6 – Decision Tree
Univariate Splitting Criteria (cont.)
 Selection of the variable for splitting:
[Figure: node q split according to X1 (X1 = 0 → child node q1, X1 = 1 → child node q2) versus node q split according to X2 (X2 = 0 → child node q3, X2 = 1 → child node q4).]
 Suppose node q needs to be split – Is it better to split according to the values of variable X1 or variable X2? Or any other variable?
 Choose the split that minimizes the impurity of the child nodes q1, q2, ..., qK:
   I(q1, q2, ..., qK) = Σ_{k=1}^{K} (Qk / Q) I(qk)
12
CG DADL (June 2023) Lecture 6 – Decision Tree
Univariate Splitting Criteria (cont.)
where:
 K = the number of child nodes
 Q = the number of observations in node q
 Qk = the number of observations in child node qk, with Q1 + Q2 + ... + QK = Q
 I(qk) = the impurity of the observations in child node qk
 The information gain is defined as the reduction in impurity after splitting:
   Δ(q1, q2, ..., qK) = I(q) − I(q1, q2, ..., qK)
 The best split is the one with the largest information gain.
 This is clearly equivalent to the split that results in the smallest impurity of the partition I(q1, q2, ..., qK).
13
CG DADL (June 2023) Lecture 6 – Decision Tree
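As a sketch of how the weighted impurity and the information gain of a candidate split can be computed (illustrative helper names, assuming the class counts of the parent and of each child node are known):

import numpy as np

def gini(counts):
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)

def partition_impurity(child_counts, impurity=gini):
    # Weighted impurity I(q1, ..., qK) of the child nodes.
    sizes = [np.sum(c) for c in child_counts]
    Q = float(np.sum(sizes))
    return sum((Qk / Q) * impurity(c) for Qk, c in zip(sizes, child_counts))

def information_gain(parent_counts, child_counts, impurity=gini):
    # Reduction in impurity: I(q) - I(q1, ..., qK).
    return impurity(parent_counts) - partition_impurity(child_counts, impurity)

# Example: the root node [4 High, 2 Low] split into child nodes [3, 0] and [1, 2].
print(partition_impurity([[3, 0], [1, 2]]))          # 2/9 = 0.2222...
print(information_gain([4, 2], [[3, 0], [1, 2]]))    # 4/9 - 2/9 = 0.2222...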
Example of a Decision Tree
 Given the dataset:

    Observation #   Income   Credit Rating   Loan Risk
    0               23       High            High
    1               17       Low             High
    2               43       Low             High
    3               68       High            Low
    4               32       Moderate        Low
    5               20       High            High

 The task is to predict Loan-Risk.
14
CG DADL (June 2023) Lecture 6 – Decision Tree
Example of a Decision Tree (cont.)
 Given the data set D, we start building the tree by creating a root node.
 If this node is sufficiently "pure", then we stop.
 If we do stop building the tree at this step, we use the majority class to classify/predict.
 In this example, we classify all patterns as having Loan-Risk = "High".
 We correctly classify 4 out of 6 input samples, achieving a classification accuracy of (4/6) × 100% = 66.67%.
 This node is split according to impurity measures:
 Gini Index (used by CART)
 Entropy (used by ID3, C4.5, C5)
[Root node: Loan-Risk = High, Acc = 66.67%]
15
CG DADL (June 2023) Lecture 6 – Decision Tree
Using Gini Index
 CART (Classification and Regression Trees) uses the Gini index to measure the impurity of a dataset:
 The Gini index for the observations in node q is:
   Gini(q) = 1 − Σ_{h=1}^{H} p_h²
 where q is the node that contains Q examples from H classes, and p_h is the relative frequency of class h in node q.
 In our dataset, there are 2 classes, High and Low, so H = 2:
   p_High = 4/(4 + 2) = 2/3        p_Low = 2/(4 + 2) = 1/3
   Gini(q) = 1 − (2/3)² − (1/3)² = 4/9 = 0.4444
16
CG DADL (June 2023) Lecture 6 – Decision Tree
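As a sketch, the same root-node Gini value can be obtained from the dataset itself (illustrative code, not the lecture's src file):

import pandas as pd

df = pd.DataFrame({
    "Income":       [23, 17, 43, 68, 32, 20],
    "CreditRating": ["High", "Low", "Low", "High", "Moderate", "High"],
    "LoanRisk":     ["High", "High", "High", "Low", "Low", "High"],
})

p = df["LoanRisk"].value_counts(normalize=True)   # relative class frequencies
print(1.0 - (p ** 2).sum())                       # Gini(q) = 0.4444...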
Using Gini Index (cont.)
 Should Income be used as the variable to split the root node?
 Income is a variable with continuous values.
 Sort the data according to Income values:

    Observation #   Income   Credit Rating   Loan Risk
    1               17       Low             High
    5               20       High            High
    0               23       High            High
    -------------------------------------------------- Split 1
    4               32       Moderate        Low
    -------------------------------------------------- Split 2
    2               43       Low             High
    -------------------------------------------------- Split 3
    3               68       High            Low
17
CG DADL (June 2023) Lecture 6 – Decision Tree
Using Gini Index (cont.)
 We consider 3 possible splits, located where the value of Loan-Risk changes.
 Case 1 – Split condition Income ≤ 23 versus Income > 23.
[Root node: Loan Risk = High, Acc = 66.67%, Gini(q) = 4/9.
 Left child (Income ≤ 23): 3 High Loan-Risk, 0 Low Loan-Risk, Gini(q1) = 0.
 Right child (Income > 23): 1 High Loan-Risk, 2 Low Loan-Risk, Gini(q2) = 4/9.]
 Impurity after the split:
   I_G(q1, q2) = (3/6 × 0) + (3/6 × 4/9) = 2/9 = 0.2222
18
CG DADL (June 2023) Lecture 6 – Decision Tree
Using Gini Index (cont.)
 Case 2 – Split condition Income ≤ 32 versus Income > 32:
   I_G(q1, q2) = (4/6 × 3/8) + (2/6 × 1/2) = 5/12 = 0.41667
 Case 3 – Split condition Income ≤ 43 versus Income > 43:
   I_G(q1, q2) = (5/6 × 8/25) + (1/6 × 0) = 4/15 = 0.26667
 Case 1 is the best.
 Instead of splitting exactly between Income ≤ 23 and Income > 23, the midpoint of the two adjacent values is selected as the actual splitting point: (23 + 32)/2 = 27.5.
[Root node: Loan Risk = High, Acc = 66.67%, Gini(q) = 4/9.
 Left child (Income ≤ 27.5): 3 High Loan-Risk, 0 Low Loan-Risk, Gini(q1) = 0.
 Right child (Income > 27.5): 1 High Loan-Risk, 2 Low Loan-Risk, Gini(q2) = 4/9.]
19
CG DADL (June 2023) Lecture 6 – Decision Tree
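The three candidate thresholds can also be compared programmatically. A minimal sketch (illustrative helper names) that reproduces the weighted Gini values of Cases 1–3:

import pandas as pd

df = pd.DataFrame({
    "Income":   [23, 17, 43, 68, 32, 20],
    "LoanRisk": ["High", "High", "High", "Low", "Low", "High"],
})

def gini(labels):
    p = labels.value_counts(normalize=True)
    return 1.0 - (p ** 2).sum()

def weighted_gini(data, threshold):
    left = data[data["Income"] <= threshold]["LoanRisk"]
    right = data[data["Income"] > threshold]["LoanRisk"]
    n = len(data)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

for t in [23, 32, 43]:
    print(t, round(weighted_gini(df, t), 5))
# 23 -> 0.22222 (best), 32 -> 0.41667, 43 -> 0.26667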
Using Gini Index (cont.)
 Apply the tree generating method recursively to nodes that are still not "pure".
[Current tree: root node (Loan Risk = High, Acc = 66.67%, Gini(q) = 4/9); Income ≤ 27.5 → leaf "Loan Risk = High"; Income > 27.5 → node "Loan Risk = ?" to be split further.]
 Develop a subtree by examining the variable Credit-Rating.
 Credit-Rating is a discrete variable with ordinal values, i.e., they can be ordered in a meaningful sequence.
20
20
CG DADL (June 2023) Lecture 6 – Decision Tree
Using Gini Index (cont.)
 Possible values are {Low, Moderate, High}.
 Check for the best split:
 Case 1 – Low versus (Moderate or High)
 Case 2 – (Low or Moderate) versus High
 Compute the Gini index for splitting the node "Loan Risk = ?" (the Income > 27.5 node):
21
CG DADL (June 2023) Lecture 6 – Decision Tree
Using Gini Index (cont.)
 Case 1 – Split Credit-Rating = Low versus Credit-Rating = Moderate or High:
[Left child (Credit-Rating = Low): 0 Low Loan-Risk, 1 High Loan-Risk, Gini(q1) = 0.
 Right child (Credit-Rating = Moderate or High): 2 Low Loan-Risk, 0 High Loan-Risk, Gini(q2) = 0.]
   I_G(q1, q2) = (1/3 × 0) + (2/3 × 0) = 0
22
CG DADL (June 2023) Lecture 6 – Decision Tree
Using Gini Index (cont.)
 Case 2 – Split Credit-Rating = Low or Moderate versus Credit-Rating = High:
[Left child (Credit-Rating = Low or Moderate): 1 Low Loan-Risk, 1 High Loan-Risk, Gini(q1) = 1/2.
 Right child (Credit-Rating = High): 0 Low Loan-Risk, 1 High Loan-Risk, Gini(q2) = 0.]
   I_G(q1, q2) = (2/3 × 1/2) + (1/3 × 0) = 1/3
 Case 2 split is not as good as Case 1 split.
23
CG DADL (June 2023) Lecture 6 – Decision Tree
Using Gini Index (cont.)
Complete tree:
[Figure: the root node splits on Income. Income ≤ 27.5 → leaf "Loan Risk = High". Income > 27.5 → intermediate node that splits on Credit-Rating: Credit-Rating = Low → leaf "Loan Risk = High"; Credit-Rating = Moderate or High → leaf "Loan Risk = Low".]

    Observation #   Income   Credit Rating   Loan Risk   Predicted Loan Risk
    0               23       High            High        High
    1               17       Low             High        High
    2               43       Low             High        High
    3               68       High            Low         Low
    4               32       Moderate        Low         Low
    5               20       High            High        High
24
CG DADL (June 2023) Lecture 6 – Decision Tree
Using Gini Index (cont.)
 The tree achieves 100% accuracy on the training data set.
 It may overfit the training data instances.
 Trees may be simplified by pruning:
 Removing nodes or branches to improve the accuracy on the test samples.
 Tree growing could be terminated when the number of instances in a node is less than a pre-specified number.
 Notice that we have built a binary tree in which every non-leaf node has 2 branches.
 For ordinal discrete variables with N values, check N − 1 possible splits.
25
CG DADL (June 2023) Lecture 6 – Decision Tree
Using Gini Index (cont.)
 For nominal discrete variables with N values, check 2^(N−1) − 1 possible splits.
 For example, for {Red, Green, Blue}, check 2^(3−1) − 1 = 3 possible splits:
 Red versus Green, Blue
 Green versus Red, Blue
 Blue versus Red, Green
26
CG DADL (June 2023) Lecture 6 – Decision Tree
Decision Tree in Scikit Learn

 We can perform decision tree classification using Scikit Learn's tree.DecisionTreeClassifier.
 However, this class cannot process categorical independent variables and thus we need to recode CreditRating:
 Use one-hot encoding, also known as the one-of-K scheme.
 CreditRating has three levels – Low, Moderate and High.
 So we will create three binary variables – CreditRatingLow, CreditRatingModerate and CreditRatingHigh.
 For each observation, exactly one of these three variables will be set to 1.
 This is a small dataset and so we won't do any validation.
27
CG DADL (June 2023) Lecture 6 – Decision Tree
Decision Tree in Scikit Learn (cont.)
src01
28
CG DADL (June 2023) Lecture 6 – Decision Tree
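The sample script src01 is not reproduced here; the following is a minimal sketch of how this example can be set up with scikit-learn (column and variable names are illustrative and may differ from the actual script):

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.DataFrame({
    "Income":       [23, 17, 43, 68, 32, 20],
    "CreditRating": ["High", "Low", "Low", "High", "Moderate", "High"],
    "LoanRisk":     ["High", "High", "High", "Low", "Low", "High"],
})

# One-hot encode CreditRating into CreditRating_High/_Low/_Moderate.
X = pd.concat([df[["Income"]],
               pd.get_dummies(df["CreditRating"], prefix="CreditRating")], axis=1)
y = df["LoanRisk"]

clf = DecisionTreeClassifier(criterion="gini", random_state=0)
clf.fit(X, y)

print(export_text(clf, feature_names=list(X.columns)))
print(clf.score(X, y))   # 1.0 on the training data, as noted on the next slides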
Decision Tree in Scikit Learn (cont.)
29
CG DADL (June 2023) Lecture 6 – Decision Tree
Classification Rule Generation
 Trace each path from the root node to a leaf node to generate a rule:
   If Income ≤ 27.5, then Loan-Risk = High
   Else if Income > 27.5 and Credit-Rating = Low, then Loan-Risk = High
   Else if Income > 27.5 and Credit-Rating = Moderate or High, then Loan-Risk = Low
30
CG DADL (June 2023) Lecture 6 – Decision Tree
Using Entropy Measure
 ID3 and its successors (C4.5 and C5.0) are the most widely used decision tree algorithms:
 Developed by Ross Quinlan, University of Sydney.
 ID3 uses a single best variable to test at each node of the tree; it selects the most useful variable for classifying the instances.
 The "goodness" or "usefulness" of a variable is measured according to its information gain, computed using entropy.
 In the Loan Risk example:
 4 observations have Loan-Risk = High, p1 = 4/6
 2 observations have Loan-Risk = Low, p2 = 2/6
   Entropy(q) = −(4/6) log2(4/6) − (2/6) log2(2/6) = −(2/3)(−0.58496) − (1/3)(−1.58496) = 0.91830
31
CG DADL (June 2023) Lecture 6 – Decision Tree
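A one-line check of the root-node entropy above (a minimal sketch):

import math

p1, p2 = 4/6, 2/6
print(-(p1 * math.log2(p1) + p2 * math.log2(p2)))   # 0.91830...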
Using Entropy Measure (cont.)
 Let's consider Split 1 again, Income ≤ 23 versus Income > 23:
 Subset q1 contains patterns #1, #5 and #0, all with Loan-Risk = High, so Entropy(q1) = 0.

    Obs #   Income   Credit Rating   Loan Risk
    1       17       Low             High
    5       20       High            High
    0       23       High            High
    ------------------------------------------ Split 1
    4       32       Moderate        Low
    2       43       Low             High
    3       68       High            Low
32
CG DADL (June 2023) Lecture 6 – Decision Tree
Using Entropy Measure (cont.)
 Subset q2 contains patterns #4 and #3 with Loan-Risk = Low, and pattern #2 with Loan-Risk = High:
   Entropy(q2) = −(2/3) log2(2/3) − (1/3) log2(1/3) = −(2/3)(−0.58496) − (1/3)(−1.58496) = 0.91829
 Entropy after splitting:
   I_E(q1, q2) = (3/6 × 0) + (3/6 × 0.91829) = 0.45915

    Obs #   Income   Credit Rating   Loan Risk
    1       17       Low             High
    5       20       High            High
    0       23       High            High
    ------------------------------------------ Split 1
    4       32       Moderate        Low
    2       43       Low             High
    3       68       High            Low
33
CG DADL (June 2023) Lecture 6 – Decision Tree
Using Entropy Measure (cont.)
 Suppose we split the dataset using Credit-Rating into 3 subsets:
 Subset q1 if Credit-Rating = Low – Patterns #1 and #2, both with Loan-Risk = High.
 Subset q2 if Credit-Rating = Moderate – Pattern #4 with Loan-Risk = Low.
 Subset q3 if Credit-Rating = High – Patterns #5 and #0 with Loan-Risk = High, and Pattern #3 with Loan-Risk = Low.
 Entropy after splitting:
   I_E(q1, q2, q3) = (2/6 × 0) + (1/6 × 0) + (3/6 × 0.91829) = 0.45915
 Same as before using Income.

    Obs #   Income   Credit Rating   Loan Risk
    1       17       Low             High
    5       20       High            High
    0       23       High            High
    4       32       Moderate        Low
    2       43       Low             High
    3       68       High            Low
34
CG DADL (June 2023) Lecture 6 – Decision Tree
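A minimal sketch (illustrative helper names) that reproduces both values of the entropy after splitting – by Income ≤ 23 and by the 3-way Credit-Rating split:

import math
import pandas as pd

df = pd.DataFrame({
    "Income":       [23, 17, 43, 68, 32, 20],
    "CreditRating": ["High", "Low", "Low", "High", "Moderate", "High"],
    "LoanRisk":     ["High", "High", "High", "Low", "Low", "High"],
})

def entropy(labels):
    p = labels.value_counts(normalize=True)
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def split_entropy(groups):
    n = sum(len(g) for g in groups)
    return sum((len(g) / n) * entropy(g) for g in groups)

income_split = [df[df["Income"] <= 23]["LoanRisk"], df[df["Income"] > 23]["LoanRisk"]]
rating_split = [g["LoanRisk"] for _, g in df.groupby("CreditRating")]

print(round(split_entropy(income_split), 5))   # 0.45915
print(round(split_entropy(rating_split), 5))   # 0.45915 – the same entropy after splitting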
Using Entropy Measure (cont.)

Let’s use Credit-Rating as the first variable to split the
root node:
Credit Rating =
Low
Loan Risk = High


Credit Rating =
High
Credit Rating =
Moderate
Loan Risk = Low
Loan Risk = ?
Apply the algorithm again to reduce the entropy for the
instances in q3 .
Maximizing information gain – When selecting variable for
splitting, pick the one that reduces the entropy the most.
35
CG DADL (June 2023) Lecture 6 – Decision Tree
Using Entropy Measure (cont.)
 Use Income to split q3 and obtain:
[Figure: the root splits on Credit Rating. Credit Rating = Low → leaf "Loan Risk = High"; Credit Rating = Moderate → leaf "Loan Risk = Low"; Credit Rating = High → node split on Income: Income ≤ 45.5 → leaf "Loan Risk = High", Income > 45.5 → leaf "Loan Risk = Low".]
 Note that this is not a binary tree.
36
CG DADL (June 2023) Lecture 6 – Decision Tree
The Iris Flower Dataset
 This dataset contains data from three different species of iris (Iris setosa, Iris virginica and Iris versicolor).
 The variables include the length and width of the sepals, and the length and width of the petals.
 This dataset is widely used for classification problems.
 150 observations with 4 independent attributes and one target attribute.
 See the sample script src02.
37
CG DADL (June 2023) Lecture 6 – Decision Tree
The Iris Flower Dataset (cont.)
 We can obtain a parsimonious tree with reasonable accuracy by setting the maximum depth of the tree to 2 (see src02):
38
CG DADL (June 2023) Lecture 6 – Decision Tree
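A minimal sketch of the depth-limited Iris tree (illustrative; the actual src02 script may differ):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# Accuracy on the full dataset (no validation split, as in the small examples above).
print(clf.score(iris.data, iris.target))   # typically around 0.96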
Lecture
7
Bayesian Classifier and
Logistic Regression
Corporate Gurukul – Data Analytics using Deep Learning
June 2023
Lecturer: Dr. TAN Wee Kek
Email: tanwk@comp.nus.edu.sg :: Tel: 6516 6731 :: Office: COM3-02-35
Learning Objectives
At the end of this lecture, you should understand:






1
Various advanced classification models.
Naïve Bayesian classifier.
Logistic regression.
Multinomial logistic regression.
Using these advanced classification models in Python with
Scikit Learn.
CG DADL (June 2023) Lecture 7 – Bayesian Classifier and Logistic Regression
Conditional Probability
In probability theory, conditional probability is a measure of the probability of an event occurring given that another event has occurred.
For an event of interest A and another known event B that is assumed to have occurred, the conditional probability of A given B is written as P(A|B).
Example:
 The probability that any given person has a cough on any given day may be only 5%, i.e., P(Cough) = 0.05.
 But if we know or assume that the person has a cold, then they are much more likely to be coughing.
 The conditional probability of coughing given that you have a cold might be much higher at 75%, i.e., P(Cough|Cold) = 0.75.
2
CG DADL (June 2023) Lecture 7 – Bayesian Classifier and Logistic Regression
Conditional Probability (cont.)
Kolmogorov definition:
 The conditional probability of A given B is defined as the quotient of the probability of the joint occurrence of events A and B, and the probability of B:
   P(A|B) = P(A ∩ B) / P(B)
 where P(A ∩ B) is the probability that both events A and B occur,
 and assuming that the unconditional probability of B is greater than zero, i.e., P(B) > 0.
 Conditional probability may also be written as an axiom of probability:
   P(A ∩ B) = P(A|B) P(B)
3
CG DADL (June 2023) Lecture 7 – Bayesian Classifier and Logistic Regression
Bayesian Methods
Bayesian methods belong to the family of probabilistic classification models.
Given the information about the values of the explanatory variables X, what is the probability that the instance belongs to class y?
We need to calculate the posterior probability P(y | X) by means of Bayes' theorem.
This could be done if we have the values of the prior probability P(y) and the class conditional probability P(X | y).
4
CG DADL (June 2023) Lecture 7 – Bayesian Classifier and Logistic Regression
Bayesian Methods (cont.)
Suppose there are H distinct values for the target variable y, denoted as H = {v1, v2, ..., vH}.
According to Bayes' theorem, the posterior probability P(y | X), that is, the probability of observing the target class y given the instance X, is:
   P(y | X) = P(X | y) P(y) / Σ_{l=1}^{H} P(X | v_l) P(v_l) = P(X | y) P(y) / P(X)
5
CG DADL (June 2023) Lecture 7 – Bayesian Classifier and Logistic Regression
Bayesian Methods (cont.)
An illustration:
 Suppose H = 2.
 What is the probability that y = v1 given X?
 Answer – P(y = v1 | X).
 For simplicity, let A be the event that y = v1 and B be the event that y = v2.
   P(A | X) = P(A ∩ X) / P(X)
 We know that P(A ∩ X) = P(X ∩ A) and P(X | A) = P(X ∩ A) / P(A),
 so P(A | X) = P(X | A) P(A) / P(X)
6
CG DADL (June 2023) Lecture 7 – Bayesian Classifier and Logistic Regression
Bayesian Methods (cont.)
 Also:
   P(X) = P(X ∩ A) + P(X ∩ B) = P(X | A) P(A) + P(X | B) P(B)
[Figure: the event X overlapping the events y = v1 and y = v2; event A is y = v1 and event B is y = v2.]
7
CG DADL (June 2023) Lecture 7 – Bayesian Classifier and Logistic Regression
Bayesian Methods (cont.)
Maximum A Posteriori Hypothesis (MAP):
 Calculate the posterior probabilities
   P(A | X) = P(X | A) P(A) / P(X)   and   P(B | X) = P(X | B) P(B) / P(X)
 then see which one is higher.
 Since the denominator is the same for both posterior probabilities, we need only compare P(X | A) P(A) with P(X | B) P(B).
 We conclude that y = v1 if P(X | A) P(A) is the larger of the two, otherwise we conclude y = v2.
8
CG DADL (June 2023) Lecture 7 – Bayesian Classifier and Logistic Regression
Bayesian Methods (cont.)
 To obtain P(X | A) P(A) and P(X | B) P(B):
   P(X | A) = P(X ∩ A) / P(A), so P(A ∩ X) = P(X | A) P(A) = P(x1 | A) P(x2 | A) ... P(xn | A) P(A)
   P(X | B) = P(X ∩ B) / P(B), so P(B ∩ X) = P(X | B) P(B) = P(x1 | B) P(x2 | B) ... P(xn | B) P(B)
[Figure: the event X overlapping the events y = v1 and y = v2.]
9
CG DADL (June 2023) Lecture 7 – Bayesian Classifier and Logistic Regression
Naïve Bayesian Classifier
 The classifier assumes that, given the target class, the explanatory variables are conditionally independent:
   P(X | y) = P(x1 | y) × P(x2 | y) × P(x3 | y) × ... × P(xn | y)
 The class conditional probability values are calculated from the available data:
 For a categorical or discrete numerical variable a_j:
   P(x_j | y) = P(x_j = r_jk | y = v_h) = s_jhk / m_h
 where s_jhk is the number of instances of class v_h for which the variable x_j takes the value r_jk, and m_h is the total number of instances of class v_h in dataset D.
10
CG DADL (June 2023) Lecture 7 – Bayesian Classifier and Logistic Regression
Naïve Bayesian Classifier (cont.)
 For numerical attributes:
 P(x_j | y) is estimated by making some assumption regarding its distribution.
 Often, this conditional probability is assumed to follow the Gaussian distribution (also known as the normal distribution) and we compute the Gaussian density function.
11
CG DADL (June 2023) Lecture 7 – Bayesian Classifier and Logistic Regression
An Example of Naïve Bayesian Classifier
 For tomorrow, the weather forecast is:
 Outlook – Sunny
 Temperature – Cool
 Humidity – High
 Wind – Strong
 Do we play tennis?

    DAY   OUTLOOK    TEMPERATURE   HUMIDITY   WIND     PLAYTENNIS
    D1    SUNNY      HOT           HIGH       WEAK     NO
    D2    SUNNY      HOT           HIGH       STRONG   NO
    D3    OVERCAST   HOT           HIGH       WEAK     YES
    D4    RAIN       MILD          HIGH       WEAK     YES
    D5    RAIN       COOL          NORMAL     WEAK     YES
    D6    RAIN       COOL          NORMAL     STRONG   NO
    D7    OVERCAST   COOL          NORMAL     STRONG   YES
    D8    SUNNY      MILD          HIGH       WEAK     NO
    D9    SUNNY      COOL          NORMAL     WEAK     YES
    D10   RAIN       MILD          NORMAL     WEAK     YES
    D11   SUNNY      MILD          NORMAL     STRONG   YES
    D12   OVERCAST   MILD          HIGH       STRONG   YES
    D13   OVERCAST   HOT           NORMAL     WEAK     YES
    D14   RAIN       MILD          HIGH       STRONG   NO

 Prior probabilities:
   P(PlayTennis = Yes) = 9/14
   P(PlayTennis = No) = 5/14
12
CG DADL (June 2023) Lecture 7 – Bayesian Classifier and Logistic Regression
An Example of Naïve Bayesian Classifier
(cont.)
 Estimate the class conditional probabilities, e.g., for Wind:
   P(Wind = Strong | PlayTennis = Yes) = 3/9
   P(Wind = Strong | PlayTennis = No) = 3/5
 Compute the class conditional probabilities for the other variables.
 Then compute:
   P(Sunny | Yes) P(Cool | Yes) P(High | Yes) P(Strong | Yes) P(Yes) = (2/9)(3/9)(3/9)(3/9)(9/14) = 0.00529
   P(Sunny | No) P(Cool | No) P(High | No) P(Strong | No) P(No) = (3/5)(1/5)(4/5)(3/5)(5/14) = 0.02057
 Decision: No, we do not play tennis.
 Note on posterior probabilities:
   P(Sunny, Cool, High, Strong) = 0.00529 + 0.02057 = 0.02586
   P(Yes | Sunny, Cool, High, Strong) = 0.00529 / 0.02586 = 0.2045
   P(No | Sunny, Cool, High, Strong) = 0.02057 / 0.02586 = 0.7955
13
CG DADL (June 2023) Lecture 7 – Bayesian Classifier and Logistic Regression
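A minimal sketch verifying the naive Bayes arithmetic for the query (Sunny, Cool, High, Strong), using the hand-counted conditional probabilities above:

score_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # P(X|Yes) P(Yes)
score_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # P(X|No)  P(No)

print(round(score_yes, 5), round(score_no, 5))        # 0.00529 0.02057
evidence = score_yes + score_no
print(round(score_yes / evidence, 4))                 # P(Yes | X) = 0.2045
print(round(score_no / evidence, 4))                  # P(No  | X) = 0.7955 -> do not play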
Naïve Bayesian Classifier with Scikit
Learn
 The sklearn.naive_bayes module implements various Naive Bayes algorithms:
 Supervised learning methods based on applying Bayes' theorem with strong (naive) feature independence assumptions.
 MultinomialNB:
 Naive Bayes classifier for multinomial models.
 Suitable for classification with discrete features having a distribution like word frequencies (e.g., word counts for text classification).
 The multinomial distribution normally requires integer feature counts.
 BernoulliNB:
 Naive Bayes classifier for multivariate Bernoulli models.
 Suitable for discrete data such as binary/boolean features.
14
CG DADL (June 2023) Lecture 7 – Bayesian Classifier and Logistic Regression
Naïve Bayesian Classifier with Scikit
Learn (cont.)
 GaussianNB:
 Gaussian Naive Bayes classifier.
 Features are assumed to follow a normal distribution and their values can be continuous.
 See the sample script file src01 for the tennis example with GaussianNB.
15
CG DADL (June 2023) Lecture 7 – Bayesian Classifier and Logistic Regression
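A minimal sketch of GaussianNB on the PlayTennis data (illustrative; the actual src01 script may encode the variables differently – one-hot encoding is assumed here):

import pandas as pd
from sklearn.naive_bayes import GaussianNB

df = pd.DataFrame({
    "Outlook":     ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
                    "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"],
    "Temperature": ["Hot", "Hot", "Hot", "Mild", "Cool", "Cool", "Cool", "Mild",
                    "Cool", "Mild", "Mild", "Mild", "Hot", "Mild"],
    "Humidity":    ["High", "High", "High", "High", "Normal", "Normal", "Normal", "High",
                    "Normal", "Normal", "Normal", "High", "Normal", "High"],
    "Wind":        ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong", "Weak",
                    "Weak", "Weak", "Strong", "Strong", "Weak", "Strong"],
    "PlayTennis":  ["No", "No", "Yes", "Yes", "Yes", "No", "Yes", "No", "Yes", "Yes",
                    "Yes", "Yes", "Yes", "No"],
})

X = pd.get_dummies(df.drop(columns="PlayTennis"))   # one-hot encode the categorical inputs
y = df["PlayTennis"]
clf = GaussianNB().fit(X, y)

query = pd.DataFrame([{"Outlook": "Sunny", "Temperature": "Cool",
                       "Humidity": "High", "Wind": "Strong"}])
query = pd.get_dummies(query).reindex(columns=X.columns, fill_value=0)
print(clf.predict(query))   # expected to predict "No", consistent with the hand computation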
Logistic Regression
 Logistic regression is a technique for converting a binary classification problem into a regression problem, by modeling the log-odds of the response as a linear function of the inputs.
 Values of the response variable are assumed to be 0 or 1.
 Using logistic regression, the posterior probability P(y | X) of the target variable y conditioned on the input X = (X1, X2, ..., Xn) is modeled according to the logistic function (where e = 2.718281828...):
   P(Y = 1 | X1, X2, ..., Xn) = 1 / (1 + e^−(β0 + β1 X1 + ... + βn Xn)) = e^(β0 + β1 X1 + ... + βn Xn) / (1 + e^(β0 + β1 X1 + ... + βn Xn))
   P(Y = 0 | X1, X2, ..., Xn) = 1 − P(Y = 1 | X1, X2, ..., Xn) = 1 / (1 + e^(β0 + β1 X1 + ... + βn Xn))
16
CG DADL (June 2023) Lecture 7 – Bayesian Classifier and Logistic Regression
Logistic Regression (cont.)
 Graph of the logistic function f(x) = 1 / (1 + e^−x):
[Figure: the S-shaped sigmoid curve, increasing from 0 to 1.]
 The above is known as the sigmoid function.
 Hence, 0 ≤ P(Y = 1 | X1, X2, ..., Xn) ≤ 1 and 0 ≤ P(Y = 0 | X1, X2, ..., Xn) ≤ 1.
17
CG DADL (June 2023) Lecture 7 – Bayesian Classifier and Logistic Regression
Logistic Regression (cont.)
 The ratio of the two conditional probabilities is:
   P(Y = 1 | X1, X2, ..., Xn) / P(Y = 0 | X1, X2, ..., Xn) = e^(β0 + β1 X1 + ... + βn Xn)
 This is the odds in favor of y = 1.
 And its logarithm:
   ln[ P(Y = 1 | X1, X2, ..., Xn) / P(Y = 0 | X1, X2, ..., Xn) ] = β0 + β1 X1 + ... + βn Xn
 This is the logit function, or the logarithm of the odds.
18
CG DADL (June 2023) Lecture 7 – Bayesian Classifier and Logistic Regression
Logistic Regression (cont.)
 If X_i is increased by 1 (other variables remaining constant):
   logit | (X_i + 1) = logit | X_i + β_i
   odds | (X_i + 1) = odds | X_i × e^(β_i)
 e^(β_i) is the odds ratio – the multiplicative increase in the odds when X_i increases by one:
   β_i > 0 ⇒ e^(β_i) > 1 ⇒ the odds and the probability increase with X_i
   β_i < 0 ⇒ e^(β_i) < 1 ⇒ the odds and the probability decrease with X_i
19
CG DADL (June 2023) Lecture 7 – Bayesian Classifier and Logistic Regression
Logistic Regression Example 1
 A system analyst studied the effect of computer programming experience on the ability to complete a complex programming task within a specified time.
 The persons studied had varying amounts of experience (measured in months).
 All persons were given the same programming task and their success in the task was recorded:
 Y = 1 if the task was completed successfully within the allotted time.
 Y = 0 otherwise.

    Person   Months-Experience   Success
    1        14                  0
    2        29                  0
    …        …                   …
    24       22                  1
    25       8                   1
20
Logistic Regression Example 1 (cont.)
 A standard logistic regression package was run on the data and the parameter values found are β0 = −3.0595 and β1 = 0.1615.
 The estimated mean response for person i = 1, where X1 = 14, is:
   a = β0 + β1 X1 = −3.0595 + 0.1615(14) = −0.7985
   e^a = e^−0.7985 = 0.4500
   P(Y = 1 | X1 = 14) = e^a / (1 + e^a) = 0.4500 / 1.4500 = 0.3103
 The estimated probability that a person with 14 months of experience will successfully complete the programming task is 0.3103.
 The odds in favor of completing the task = 0.3103 / (1 − 0.3103) = 0.4499.
21
CG DADL (June 2023) Lecture 7 – Bayesian Classifier and Logistic Regression
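A minimal sketch verifying the arithmetic of Example 1 (the fitted coefficients are taken from the slide; the 15-month case is worked by hand on the next slides):

import math

b0, b1 = -3.0595, 0.1615

def p_success(months):
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * months)))

p14, p15 = p_success(14), p_success(15)
odds14, odds15 = p14 / (1 - p14), p15 / (1 - p15)

print(round(p14, 4), round(p15, 4))   # 0.3103 0.3459
print(round(odds15 / odds14, 4))      # 1.1753 = e^0.1615, the odds ratio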
Logistic Regression Example 1 (cont.)
 Suppose there is another programmer with 15 months of experience, i.e., X1 = 15.
 Recall that the parameter values are β0 = −3.0595 and β1 = 0.1615:
   b = β0 + β1 X1 = −3.0595 + 0.1615(15) = −0.637
   e^b = e^−0.637 = 0.5289
   P(Y = 1 | X1 = 15) = e^b / (1 + e^b) = 0.5289 / 1.5289 = 0.3459
 The estimated probability that a person with 15 months of experience will successfully complete the programming task is 0.3459.
 The odds in favor of completing the task = 0.3459 / (1 − 0.3459) = 0.5288.
22
CG DADL (June 2023) Lecture 7 – Bayesian Classifier and Logistic Regression
Logistic Regression Example 1 (cont.)
 Comparing the two odds:
   0.5288 / 0.4499 = 1.1753 = e^0.1615
 The odds increase by 17.53% with each additional month of experience.
23
CG DADL (June 2023) Lecture 7 – Bayesian Classifier and Logistic Regression
Logistic Regression Example 2





This example originates from UCLA.
The data were collected from 200 high school students.
The variables read, write, math, science, and socst are the
results of standardized tests on reading, writing, math,
science and social studies (respectively).
The variable female is coded 1 if female, 0 if male.
The response variable is honcomp with two possible
values:


24
High writing test score if the writing score is greater than or
equal to 60 (honcomp = 1),
Low writing test score, otherwise (honcomp = 0).
CG DADL (June 2023) Lecture 7 – Bayesian Classifier and Logistic Regression
Logistic Regression Example 2 (cont.)


The predictor variables used are gender (female), reading
test score (read) and science test score (science).
See the sample script file src02.
25
CG DADL (June 2023) Lecture 7 – Bayesian Classifier and Logistic Regression
Logistic Regression Example 2 (cont.)
26
CG DADL (June 2023) Lecture 7 – Bayesian Classifier and Logistic Regression
Logistic Regression Example 2 (cont.)
 Selected output from StatsModels:
 Estimate for the intercept β0 = −12.7772.
 Estimate for β1 = 1.4825 (corresponds to variable female):
 Odds ratio point estimate corresponding to variable female = e^β1 = 4.404.
 The odds of a female student getting a high writing test score are more than 4-fold higher than those of a male student (given the same reading and science test scores).
 Estimate for β2 = 0.1035 (corresponds to variable read).
 Estimate for β3 = 0.0948 (corresponds to variable science):
 This is the estimated logistic regression coefficient for a one-unit change in science score, given that the other variables in the model are held constant.
 If a student were to increase his/her science score by one point, the difference in log-odds (logit response value) for a high writing score is expected to increase by 0.0948 units, all other variables held constant.
27
CG DADL (June 2023) Lecture 7 – Bayesian Classifier and Logistic Regression
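A minimal sketch of how such a model can be fitted with StatsModels (illustrative; the data-loading step and the actual src02 script may differ – the file name below is hypothetical, and a DataFrame with columns honcomp, female, read and science is assumed):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("hsb2.csv")                      # hypothetical file name for the UCLA data
df["honcomp"] = (df["write"] >= 60).astype(int)   # high writing test score indicator

model = smf.logit("honcomp ~ female + read + science", data=df).fit()
print(model.summary())        # coefficients approx. -12.78, 1.48, 0.10, 0.09
print(np.exp(model.params))   # odds ratios, e.g. approx. 4.40 for female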
Lecture
8
Support Vector Machines
Corporate Gurukul – Data Analytics using Deep Learning
June 2023
Lecturer: A/P TAN Wee Kek
Email: tanwk@comp.nus.edu.sg :: Tel: 6516 6731 :: Office: COM3-02-35
Learning Objectives
At the end of this lecture, you should understand:






1
Overview of Support vector machines (SVMs).
Maximum-margin hyperplane.
Non-linear SVMs.
Using SVMs in Python with Scikit Learn.
Hyperparameter tuning.
CG DADL (June 2023) Lecture 8 – Support Vector Machines
Overview of Support Vector Machines
Support vector machines (SVMs):




SVMs are a family of separation methods for classification and
regression.
SVMs are developed in the context of statistical learning
theory.
 Among the best supervised learning algorithms:
 Most theoretically motivated.
 Practically among the most effective classification algorithms in modern machine learning.
 SVMs were initially designed to fit a linear boundary between the samples of a binary problem, ensuring maximum robustness in terms of tolerance to isotropic uncertainty.
2
CG DADL (June 2023) Lecture 8 – Support Vector Machines
Overview of Support Vector Machines
(cont.)
SVMs can achieve better performance in terms of
accuracy with respect to other classifiers in several
application domains:





Text and hypertext categorization – SVMs can significantly
reduce the need for labeled training instances.
Classification of images.
Recognition of hand-written characters.
SVMs have also been widely applied in the biological and other
sciences – E.g., classify proteins with up to 90% accuracy.
Efficiently scalable for large problems:


3
If we have a big data set that needs a complicated model, the
full Bayesian framework is very computationally expensive.
CG DADL (June 2023) Lecture 8 – Support Vector Machines
What is SVM? – A Video Introduction
4
CG DADL (June 2023) Lecture 8 – Support Vector Machines
How Does SVM Work?
Given a set of labelled training examples for two
categories:






5
An SVM training algorithm builds a model that assigns new
examples to one category or the other.
This makes SVM a non-probabilistic binary linear classifier.
An SVM model is a representation of the examples as points in
space, mapped so that the examples of the separate categories
are divided by a clear gap that is as wide as possible.
New examples are then mapped into that same space and
predicted to belong to a category based on which side of the
gap they fall.
The gist of SVM is to establish an optimal hyperplane for
linearly separable patterns.
CG DADL (June 2023) Lecture 8 – Support Vector Machines
How Does SVM Work? (cont.)
The figure below depicts the SVM decision boundary and
the support vectors:





6
The boundary displayed has
the largest distance to the
closest point of both classes.
Any other separating
boundary will have a point of
a class closer to it than this one.
The figure also shows the
closest points of the classes
to the boundary.
These special points are called
support vectors.
CG DADL (June 2023) Lecture 8 – Support Vector Machines
How Does SVM Work? (cont.)



In fact, the boundary only depends on the support vectors.
If we remove any other point from the dataset, the boundary
remains intact.
However, in general, if any of these support vectors is removed,
the boundary will change.
In addition to performing linear classification, SVMs can
also efficiently perform a non-linear classification:



7
SVMs can be extended to patterns that are not linearly
separable by transformations of original data to map into new
space using Kernel functions.
This approach is known as the kernel trick – transforming data
into another dimension that has a clear dividing margin
between classes of data.
CG DADL (June 2023) Lecture 8 – Support Vector Machines
Support Vectors
Support vectors are the data points that lie closest to
the decision surface (or hyperplane):



They are the data points that are the most difficult to classify.
They have direct bearing on the optimum location of the
decision surface.
Which separating hyperplane should we use?



8
In general, there are many possible solutions.
SVM finds an optimal solution.
CG DADL (June 2023) Lecture 8 – Support Vector Machines
Support Vector Machine
SVMs maximize the margin around the
separating hyperplane:


This is known as the “street”.
This is known as the maximum-margin hyperplane.
The decision function is fully specified
by a (usually very small) subset of
training samples, i.e., the support
vectors.
This is essentially a quadratic
programming problem that is easy to
solve by standard methods.



9
CG DADL (June 2023) Lecture 8 – Support Vector Machines
Support Vector Machine (cont.)

Separation by hyperplanes:




We will assume linear separability for now and will relax this
assumption later.
In two dimensions, we can separate by a line.
In higher dimensions, we need hyperplanes.
General input/output for SVMs:


Similar to neural nets but there is one important addition.
Input:




10
Set of (input, output) training pair samples.
The input are the sample features 𝑥𝑥1 , 𝑥𝑥2 , … , 𝑥𝑥𝑛𝑛 .
The output is the result 𝑦𝑦.
Typically, there can be lots of input features 𝑥𝑥𝑖𝑖 .
CG DADL (June 2023) Lecture 8 – Support Vector Machines
Support Vector Machine (cont.)

Output:




Important difference:


11
A set of weights 𝑤𝑤𝑖𝑖 , one for each feature.
The linear combination of the weights predicts the value of 𝑦𝑦.
Thus far, this is similar to neural nets.
We use the optimization of maximizing the margin (“street width”) to
reduce the number of weights that are nonzero to just a few that
correspond to the important features that ‘matter’ in deciding the
separating line (hyperplane).
These nonzero weights correspond to the support vectors (because
they ‘support’ the separating hyperplane).
CG DADL (June 2023) Lecture 8 – Support Vector Machines
Support Vector Machine (cont.)

Two-dimensional case:



12
Find 𝑎𝑎, 𝑏𝑏 and 𝑐𝑐 such that
𝑎𝑎𝑥𝑥 + 𝑏𝑏𝑏𝑏 ≥ 𝑐𝑐 for the red points.
𝑎𝑎𝑎𝑎 + 𝑏𝑏𝑏𝑏 ≤ (or <)𝑐𝑐 for the green points.
CG DADL (June 2023) Lecture 8 – Support Vector Machines
Support Vector Machine (cont.)

Which hyperplane to choose?



There are lots of possible solutions for 𝑎𝑎, 𝑏𝑏 and 𝑐𝑐.
Some methods find a separating hyperplane, but not the
optimal one (e.g., neural net).
But the important question is which points should influence
optimality?

All points?



Or only “difficult points” close to decision
boundary?

13
Linear regression
Neural nets
Support vector machines
CG DADL (June 2023) Lecture 8 – Support Vector Machines
Support Vectors for Linearly Separable
Case



Recall that support vectors are the elements of the
training set that would change the position of the dividing
hyperplane if removed.
Support vectors are the critical elements of the training
set.
The problem of finding the optimal hyperplane is an optimization problem:


14
Can be solved by optimization techniques.
E.g., Lagrange multipliers can be used to get this problem into a
form that can be solved analytically.
CG DADL (June 2023) Lecture 8 – Support Vector Machines
Support Vectors for Linearly Separable
Case (cont.)

Support vectors are input vectors that just touch the
boundary of the margin (street):


15
There are three of them as circled in the figure below.
Think of them as the ‘tips’ of the vectors.
CG DADL (June 2023) Lecture 8 – Support Vector Machines
Support Vectors for Linearly Separable
Case (cont.)

The figure below shows the actual support vectors, 𝑣𝑣1 , 𝑣𝑣2 ,
𝑣𝑣3 , instead of just the 3 circled points at the tail ends of
the support vectors.

16
𝑑𝑑 denotes 1/2 of the street “width”.
CG DADL (June 2023) Lecture 8 – Support Vector Machines
Definitions

Define the hyperplanes 𝐻𝐻 such that:



𝐻𝐻1 and 𝐻𝐻2 are the planes:



𝑤𝑤 ⋅ 𝑥𝑥𝑖𝑖 + 𝑏𝑏 ≥ +1 when 𝑦𝑦𝑖𝑖 = +1
𝑤𝑤 ⋅ 𝑥𝑥𝑖𝑖 + 𝑏𝑏 ≤ −1 when 𝑦𝑦𝑖𝑖 = −1
𝐻𝐻1 : 𝑤𝑤 ⋅ 𝑥𝑥𝑖𝑖 + 𝑏𝑏 = +1
𝐻𝐻2 : 𝑤𝑤 ⋅ 𝑥𝑥𝑖𝑖 + 𝑏𝑏 = −1
The points on the planes
𝐻𝐻1 and 𝐻𝐻2 are the tips of
the support vectors.
17
CG DADL (June 2023) Lecture 8 – Support Vector Machines
Definitions (cont.)
 The plane H0 is the median in between, where w · x_i + b = 0:
 d+ = the shortest distance to the closest positive point.
 d− = the shortest distance to the closest negative point.
 The margin (gutter) of a separating hyperplane is d+ + d−.
 The optimization algorithm to generate the weights proceeds in such a way that only the support vectors determine the weights and thus the boundary.
18
CG DADL (June 2023) Lecture 8 – Support Vector Machines
Definitions (cont.)
19
CG DADL (June 2023) Lecture 8 – Support Vector Machines
Defining the Separating Hyperplane
 The equation defining the decision surface separating the classes is a hyperplane of the form:
   W^T X + b = 0
 W is a weight vector.
 X is the input vector.
 b is the bias.
 This allows us to write:
   W^T X + b ≥ 0 for d_i = +1
   W^T X + b < 0 for d_i = −1
20
CG DADL (June 2023) Lecture 8 – Support Vector Machines
Defining the Separating Hyperplane
(cont.)
 Margin of Separation (d) – The separation between the hyperplane and the closest data point for a given weight vector W and bias b.
 Optimal Hyperplane (maximal margin) – The particular hyperplane for which the margin of separation d is maximized.
21
CG DADL (June 2023) Lecture 8 – Support Vector Machines
Maximizing the Margin (a.k.a. Street
Width)
 We want a classifier (linear separator) with as big a margin as possible.
 The distance from a point (x0, y0) to a line Ax + By + c = 0 is:
   |A x0 + B y0 + c| / sqrt(A² + B²)
 The distance between H0 and H1 is then:
   |w · x + b| / ||w|| = 1 / ||w||
 The total distance between H1 and H2 is thus 2 / ||w||.
22
CG DADL (June 2023) Lecture 8 – Support Vector Machines
Maximizing the Margin (a.k.a. Street
Width) (cont.)
 In order to maximize the margin, we thus need to minimize ||w||, subject to the condition that there are no data points between H1 and H2:
   x_i · w + b ≥ +1 when y_i = +1
   x_i · w + b ≤ −1 when y_i = −1
 These can be combined into: y_i (x_i · w + b) ≥ +1
23
CG DADL (June 2023) Lecture 8 – Support Vector Machines
Maximizing the Margin (a.k.a. Street
Width) (cont.)
 We need to solve a quadratic programming problem:
 Minimize ||w||, subject to the constraint that the discrimination boundary is obeyed, i.e., min f(x) s.t. g(x) = 0.
 We can rewrite this as: min f = (1/2)||w||², s.t. g: y_i (w · x_i − b) = 1, i.e., y_i (w · x_i − b) − 1 = 0.
 This is a constrained optimization problem:
 It can be solved by the Lagrangian multiplier method.
 Because it is quadratic, the surface is a paraboloid, with just a single global minimum.
24
CG DADL (June 2023) Lecture 8 – Support Vector Machines
A Video Explanation
25
CG DADL (June 2023) Lecture 8 – Support Vector Machines
How about Non-Linearly Separable Data?

We would be finding a line that penalizes points on “the
wrong side”.
26
CG DADL (June 2023) Lecture 8 – Support Vector Machines
Non-Linear SVM – Continue Video
27
CG DADL (June 2023) Lecture 8 – Support Vector Machines
How about Non-Linearly Separable Data?
(cont.)

We can transform the data points such that they are
linearly separable:
28
CG DADL (June 2023) Lecture 8 – Support Vector Machines
How about Non-Linearly Separable Data?
(cont.)

Non-Linear SVMs:
 The idea is to gain linear separation by mapping the data to a higher dimensional space.
 The following set cannot be separated by a linear function, but can be separated by a quadratic one.
 So if we map x ↦ (x², x), we gain linear separation:
29
CG DADL (June 2023) Lecture 8 – Support Vector Machines
How about Non-Linearly Separable Data?
(cont.)

 But what happens if the decision function is not linear?
 What transformation would separate the data points in the figure below?
30
CG DADL (June 2023) Lecture 8 – Support Vector Machines
How about Non-Linearly Separable Data?
(cont.)

The kernel trick:
 Imagine a function φ that maps the data into another space.
 A kernel function performs the transformation and defines the inner dot products K(x_i, x_j) = φ(x_i) · φ(x_j), i.e., it also defines similarity in the transformed space.
31
CG DADL (June 2023) Lecture 8 – Support Vector Machines
How about Non-Linearly Separable Data?
(cont.)

Examples of non-linear SVMs:
Example of a Gaussian kernel
32
CG DADL (June 2023) Lecture 8 – Support Vector Machines
Performing SVM in Scikit Learn




We can apply linear SVM to the binary classification
problem of the tennis dataset.
See the sample script src01
We can also apply linear SVM to the multiclass
classification problem of the Iris flower dataset.
See the sample script src02
33
CG DADL (June 2023) Lecture 8 – Support Vector Machines
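A minimal sketch of a linear SVM on the Iris data (illustrative; the actual src01/src02 scripts may differ, e.g., in preprocessing and validation):

from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()
clf = SVC(kernel="linear", C=1.0)
clf.fit(iris.data, iris.target)

print(clf.score(iris.data, iris.target))   # training accuracy
print(clf.support_vectors_.shape)          # number of support vectors x number of features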
Hyperparameter Tuning
 In order to improve the model accuracy, there are several parameters that need to be tuned.
 The three major parameters include:

    Parameter            Description
    Kernel               The kernel takes a low-dimensional input space and transforms it into a higher-dimensional
                         space. It is mostly useful in non-linear separation problems.
    C (Regularisation)   C is the penalty parameter, which represents the misclassification or error term. C tells the
                         SVM optimisation how much error is bearable, and controls the trade-off between the decision
                         boundary and the misclassification term. When C is high, the model tries to classify all the
                         data points correctly, but there is also a chance of overfitting.
    Gamma                Gamma defines how far the influence of a single training example reaches in the calculation of
                         the plausible line of separation. When gamma is higher, nearby points have high influence; a
                         low gamma means far-away points are also considered in determining the decision boundary.
34
CG DADL (June 2023) Lecture 8 – Support Vector Machines
Hyperparameter Tuning (cont.)
C (Regularisation)
35
CG DADL (June 2023) Lecture 8 – Support Vector Machines
Hyperparameter Tuning (cont.)
Gamma
36
CG DADL (June 2023) Lecture 8 – Support Vector Machines
Hyperparameter Tuning (cont.)

Hyperparameters are parameters that are not directly
learnt within estimators:




In scikit-learn, they are passed as arguments to the constructor
of the estimator classes.
Grid search is commonly used as an approach to
hyperparameter tuning that will methodically build and evaluate
a model for each combination of algorithm parameters
specified in a grid.
GridSearchCV helps us combine an estimator with a
grid search preamble to tune hyperparameters.
See the sample script src03
37
CG DADL (June 2023) Lecture 8 – Support Vector Machines
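A minimal sketch of grid search over the three parameters above (the parameter grid is an illustrative assumption; the actual src03 script may differ):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

iris = load_iris()
param_grid = {
    "kernel": ["linear", "rbf"],
    "C":      [0.1, 1, 10, 100],
    "gamma":  [0.001, 0.01, 0.1, 1],
}

search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(iris.data, iris.target)

print(search.best_params_)
print(search.best_score_)   # mean cross-validated accuracy of the best combination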
Lecture
9
Clustering
Corporate Gurukul – Data Analytics using Deep Learning
June 2023
Lecturer: A/P TAN Wee Kek
Email: tanwk@comp.nus.edu.sg :: Tel: 6516 6731 :: Office: COM3-02-35
Learning Objectives
At the end of this lecture, you should understand:








1
What is clustering?
Clustering methods.
Affinity measures.
Partition methods.
Hierarchical methods.
Evaluation of clustering models.
Using Clustering in Python with Scikit Learn.
CG DADL (June 2023) Lecture 9 – Clustering
Overview of Clustering
Clusters are homogeneous groups of observations.
To measure similarity between pairs of observations, a
distance metric must be defined.
Clustering is an unsupervised learning process.
Focus of our discussions will be on:








2
Features of clustering models.
Two partition methods: K-means and K-medoids.
Two hierarchical methods: Agglomerative and divisive
clustering methods.
Quality indicators for clustering methods.
CG DADL (June 2023) Lecture 9 – Clustering
Clustering Methods
Aim – To subdivide the records of a dataset into
homogeneous groups of observations called clusters.
Observations in a cluster are similar to one another and
are dissimilar from observations in other clusters.


Purpose of clustering:


As a tool which could provide meaningful interpretation of the
phenomenon of interest:

3
Example – Grouping consumers based on their purchase behavior
may reveal the existence of a market niche.
CG DADL (June 2023) Lecture 9 – Clustering
Clustering Methods (cont.)

As a preliminary phase of a data mining project that will be
followed by other methodologies within each cluster:

Example:



4
Clustering is done before classification.
In retention analysis, distinct classification models may be developed for
various clusters to improve the accuracy in spotting customers with high
probability of churning.
As a way to highlight outliers and identify an observation that
might represent its own cluster.
CG DADL (June 2023) Lecture 9 – Clustering
General Requirements for
Clustering Methods
Flexibility:



Some methods can be applied to data having numerical
variables only.
Other methods can be applied to datasets containing
categorical variables as well.
Robustness:




5
The stability of the clusters with respect to small changes in
the values of variables of each observation.
Noise should not affect the clusters.
The clusters formed should not depend on the order of
appearance of the observations in the dataset.
CG DADL (June 2023) Lecture 9 – Clustering
General Requirements for
Clustering Methods (cont.)
Efficiency:


6
The method must be able to generate clusters efficiently
within reasonable computing time even for a large dataset with
many observations and large number of variables.
CG DADL (June 2023) Lecture 9 – Clustering
Taxonomy of Clustering Methods
Based on the logic used for deriving the clusters.
Partition methods:




Develop a subdivision of the given dataset into a
predetermined number K of non-empty subsets.
They are usually applied to small or medium sized data sets.
Hierarchical methods:




7
Carry out multiple subdivisions into subsets.
Based on a tree structure and characterized by different
homogeneity thresholds within each cluster and inhomogeneity
threshold between distinct clusters.
No predetermined number of clusters is required.
CG DADL (June 2023) Lecture 9 – Clustering
Taxonomy of Clustering Methods (cont.)
Density-based methods:



Derive clusters such that for each observation in a specific
cluster, a neighborhood with a specified diameter must contain
at least a pre-specified number of observations.
E.g., Defined distance (DBSCAN):
If the Minimum Features per Cluster cannot be found within the Search Distance from a particular point, then that point will
be marked as noise. In other words, if the core-distance (the distance required to reach the minimum number of features)
for a feature is greater than the Search Distance, the point is marked as noise. The Search Distance, when using DBSCAN, is
treated as a search cut-off.
8
CG DADL (June 2023) Lecture 9 – Clustering
Taxonomy of Clustering Methods (cont.)
Grid methods:



9
Derive a discretization of the space of observations, obtaining
grid structure consisting of cells.
Subsequent clustering operations are developed with respect
to the grid structure to achieve reduced computing time, but
lower accuracy.
CG DADL (June 2023) Lecture 9 – Clustering
Affinity Measures



Clustering models are typically based on a measure of
similarity between observations.
The measure can typically be obtained by defining an
appropriate notion of distance between each pair of
observations.
There are many popular metrics depending on the type of
variables being analyzed.
10
CG DADL (June 2023) Lecture 9 – Clustering
Affinity Measures (cont.)
 Given a dataset D having m observations X1, X2, X3, ..., Xm, each described by n-dimensional variables, we compute the distance matrix D:

   D = [d_ik] = | 0   d_12   ...   d_1,m−1   d_1m    |
                |     0      ...   d_2,m−1   d_2m    |
                |                  ...               |
                |                  0         d_m−1,m |
                |                            0       |

 where d_ik is the distance between observations X_i and X_k:
   d_ik = dist(X_i, X_k) = dist(X_k, X_i) for i, k = 1, 2, ..., m
 D is a symmetric m × m matrix with zero diagonal (only the upper triangle is shown).
11
CG DADL (June 2023) Lecture 9 – Clustering
Affinity Measures (cont.)
 A similarity measure can be obtained by letting:
   s_ik = 1 / (1 + d_ik)    or    s_ik = (d_max − d_ik) / d_max
 where d_max = max_{i,k} d_ik is the maximum value in D.
12
CG DADL (June 2023) Lecture 9 – Clustering
Affinity Measures for Numerical
Variables
 If all n variables of the observations X1, X2, X3, ..., Xm are numerical, the distance between X_i and X_k can be computed in four ways.
 Euclidean distance (or 2-norm):
   dist(X_i, X_k) = sqrt( Σ_{j=1}^{n} (x_ij − x_kj)² ) = sqrt( (x_i1 − x_k1)² + (x_i2 − x_k2)² + ... + (x_in − x_kn)² )
 Manhattan distance (or 1-norm):
   dist(X_i, X_k) = Σ_{j=1}^{n} |x_ij − x_kj| = |x_i1 − x_k1| + |x_i2 − x_k2| + ... + |x_in − x_kn|
13
CG DADL (June 2023) Lecture 9 – Clustering
Affinity Measures for Numerical
Variables (cont.)
 Infinity norm:
   dist(X_i, X_k) = max( |x_i1 − x_k1|, |x_i2 − x_k2|, ..., |x_in − x_kn| )
 Minkowski distance, which generalizes both the Euclidean and Manhattan metrics:
   dist(X_i, X_k) = ( Σ_{j=1}^{n} |x_ij − x_kj|^q )^(1/q) = ( |x_i1 − x_k1|^q + |x_i2 − x_k2|^q + ... + |x_in − x_kn|^q )^(1/q)
 The Minkowski distance reduces to the Manhattan distance when q = 1, and to the Euclidean distance when q = 2.
14
CG DADL (June 2023) Lecture 9 – Clustering
Affinity Measures for Numerical
Variables (cont.)
Example: X1 = (5, 0) and X2 = (1, −3)
 Euclidean distance (or 2-norm):
   dist(X1, X2) = sqrt( (5 − 1)² + (0 − (−3))² ) = sqrt(16 + 9) = 5
 Manhattan distance (or 1-norm):
   dist(X1, X2) = |5 − 1| + |0 − (−3)| = 4 + 3 = 7
 Infinity norm:
   dist(X1, X2) = max( |5 − 1|, |0 − (−3)| ) = max(4, 3) = 4
[Figure: the two points plotted in the plane.]
15
CG DADL (June 2023) Lecture 9 – Clustering
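A minimal sketch verifying the worked example with SciPy's distance functions:

import numpy as np
from scipy.spatial import distance

x1 = np.array([5, 0])
x2 = np.array([1, -3])

print(distance.euclidean(x1, x2))        # 5.0
print(distance.cityblock(x1, x2))        # 7 (Manhattan / 1-norm)
print(distance.chebyshev(x1, x2))        # 4 (infinity norm)
print(distance.minkowski(x1, x2, p=3))   # Minkowski distance with q = 3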
Affinity Measures for Numerical
Variables (cont.)
 Distance can also be measured by computing the cosine of the angle formed by the two vectors representing the observations. The similarity coefficient is defined as:
   B_cos(X_i, X_k) = cos(X_i, X_k) = Σ_{j=1}^{n} x_ij x_kj / ( sqrt(Σ_{j=1}^{n} x_ij²) · sqrt(Σ_{j=1}^{n} x_kj²) )
 The value of B_cos(X_i, X_k) lies in the interval [0, 1].
 The distance itself is the angle between the two vectors X_i and X_k, derived as follows:
   dist(X_i, X_k) = arccos( B_cos(X_i, X_k) )
16
CG DADL (June 2023) Lecture 9 – Clustering
Affinity Measures for Numerical
Variables (cont.)
 The coefficient B_cos(X_i, X_k) is closer to 1 when X_i and X_k are similar to each other (parallel).
 It is closer to 0 when they are dissimilar.
 Note that cos(90°) = 0 and cos(0°) = 1.
[Figure: two clusters of vectors from the origin; a new observation x_new forms a smaller angle with x_k (Cluster 2) than with x_i (Cluster 1), so x_new is more similar to x_k than to x_i.]
17
CG DADL (June 2023) Lecture 9 – Clustering
Affinity Measures for Binary Variables
 Suppose the variable a_j = (x_1j, x_2j, ..., x_mj) can only take the value 0 or 1.
 Even if the distance between observations can be computed, the quantity would not represent a meaningful measure.
 The values 0 and 1 are purely conventional and their meaning could be interchanged.
18
CG DADL (June 2023) Lecture 9 – Clustering
Affinity Measures for Binary Variables
(cont.)
 Assuming all n variables in the dataset D are binary, define a contingency table:

                                Observation x_k
                                0           1           total
    Observation x_i    0        p           q           p + q
                       1        u           v           u + v
                       total    p + u       q + v       n

 p – the number of variables for which X_i and X_k both assume the value 0.
 v – the number of variables for which X_i and X_k both assume the value 1.
 q – the number of variables for which X_i assumes the value 0 and X_k assumes the value 1.
 u – the number of variables for which X_i assumes the value 1 and X_k assumes the value 0.
19
CG DADL (June 2023) Lecture 9 – Clustering
Affinity Measures for Binary Variables
(cont.)
 Example:
   X_i = (0, 0, 1, 1, 0, 1, 0, 0)
   X_k = (1, 1, 1, 0, 0, 0, 0, 0)        n = 8

                                Observation x_k
                                0           1           total
    Observation x_i    0        p = 3       q = 2       p + q = 5
                       1        u = 2       v = 1       u + v = 3
                       total    p + u = 5   q + v = 3   n = 8
20
CG DADL (June 2023) Lecture 9 – Clustering
Affinity Measures for Binary Variables
(cont.)

Symmetric binary variables:



The presence of the value 0 is as interesting as the presence of
value 1.
E.g., Customer authorization of promotional mailings assumes
the value {yes, no} .
Asymmetric binary variables:


21
We are mostly interested in the presence of the value 1, which
can be found in a small number of observations.
E.g., If the customer purchased item A in a supermarket which
carries more than 100,000 items.
CG DADL (June 2023) Lecture 9 – Clustering
Affinity Measures for Binary Variables
(cont.)
 For symmetric binary variables, the distance is given by the coefficient of similarity:
   dist(X_i, X_k) = (q + u) / (p + q + u + v) = (q + u) / n
 For asymmetric binary variables, it is more interesting to match positives (coded as 1's) than to match negatives (coded as 0's). The Jaccard coefficient is used to define:
   dist(X_i, X_k) = (q + u) / (q + u + v)
22
CG DADL (June 2023) Lecture 9 – Clustering
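A minimal sketch (illustrative variable names) computing the symmetric and the Jaccard-based dissimilarities for the binary example above:

import numpy as np

xi = np.array([0, 0, 1, 1, 0, 1, 0, 0])
xk = np.array([1, 1, 1, 0, 0, 0, 0, 0])

q = np.sum((xi == 0) & (xk == 1))   # 2
u = np.sum((xi == 1) & (xk == 0))   # 2
v = np.sum((xi == 1) & (xk == 1))   # 1
n = len(xi)

print((q + u) / n)             # symmetric binary variables: 4/8 = 0.5
print((q + u) / (q + u + v))   # asymmetric (Jaccard-based) case: 4/5 = 0.8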
Affinity Measures for Nominal
Categorical Variables
 A nominal categorical variable can be interpreted as a generalization of a symmetric binary variable for which the number of distinct values is more than 2.
 The distance measure can then be computed as:
   dist(X_i, X_k) = (n − f) / n
 where f is the number of variables for which the observations X_i and X_k have the same nominal value.
 Example:
   X_i = (male, Chinese, Clementi, Full-Time Employee)
   X_k = (female, Caucasian, Clementi, Temporary Worker)
   n = 4,  f = 1,  dist(X_i, X_k) = (4 − 1)/4 = 0.75
23
CG DADL (June 2023) Lecture 9 – Clustering
Affinity Measures for Ordinal
Categorical Variables



An ordinal categorical variable can be placed on a natural
ordering scale, but the numerical values are, however,
arbitrary.
Standardization is required before affinity measures can
be computed.
For example, level of education has 4 possible values
{elementary school, high school, bachelor's degree, post-graduate},
which are coded as follows:




24
1 = elementary school
2 = high school
3 = bachelor’s degree
4 = post-graduate
CG DADL (June 2023) Lecture 9 – Clustering
Affinity Measures for Ordinal
Categorical Variables (cont.)
 If the values associated with the ordinal variable are {1, 2, 3, ..., H_j}, then standardize by the transformation:
   x'_ij = (x_ij − 1) / (H_j − 1)
 where H_j is the maximum assigned numerical value of the variable, and x'_ij lies in the interval [0, 1].
 After the transformation, the variable level of education has the following numerical values:
 0 = elementary school
 1/3 = high school
 2/3 = bachelor's degree
 1 = post-graduate
 We can then apply the measures of distance for numerical variables accordingly.
25
CG DADL (June 2023) Lecture 9 – Clustering
Affinity Measures for Mixed
Composition Variables
 Suppose the n variables in the dataset D have mixed composition – numerical, symmetric/asymmetric binary and nominal/categorical variables.
 How do we define an affinity measure between two observations X_i and X_k in D?
 Let δ_ikj be a binary indicator where δ_ikj = 0 if and only if one of the two cases below is true:
 At least one of the two values x_ij or x_kj is missing in the corresponding observations;
 The variable a_j is binary asymmetric and x_ij = x_kj = 0.
 For all other cases, δ_ikj = 1.
26
CG DADL (June 2023) Lecture 9 – Clustering
Affinity Measures for Mixed
Composition Variables (cont.)
 Example:

    variable   a_j (j = 2)   a_j (j = 5)   a_j (j = 7)
    x_i        0             3.8           −1
    x_k        0             missing       +1
    δ_ikj      0             0             1

 Define the contribution Δ_ikj of the variable a_j to the similarity between X_i and X_k as follows:
 If a_j is binary or nominal, Δ_ikj = 0 if x_ij = x_kj, and Δ_ikj = 1 otherwise.
 If the attribute a_j is numerical, we set:
   Δ_ikj = |x_ij − x_kj| / max_ι x_ιj
 where max_ι x_ιj is the maximum value of variable a_j.
27
Affinity Measures for Mixed
Composition Variables (cont.)
 If variable a_j is ordinal, its standardized value is computed as before and we set Δ_ikj as for numerical variables.
 The similarity coefficient between the observations X_i and X_k can then be computed as:
   dist(X_i, X_k) = Σ_{j=1}^{n} δ_ikj Δ_ikj / Σ_{j=1}^{n} δ_ikj
28
CG DADL (June 2023) Lecture 9 – Clustering
Partition Methods
 Given a dataset D of m observations, each represented by a vector in n-dimensional space, construct a collection of subsets C = {C1, C2, ..., CK}, where K ≤ m.
 K is the number of clusters and is generally predetermined.
 The clusters generated are usually exhaustive and mutually exclusive – each observation belongs to only one cluster.
 Partition methods are iterative:
 Assign the m observations to the K clusters.
 Then iteratively reallocate observations to improve the overall quality of the clusters.
29
CG DADL (June 2023) Lecture 9 – Clustering
Partition Methods (cont.)

Criteria for quality:



Degree of homogeneity of observations in the same clusters.
Degree of heterogeneity with respect to observations in other
clusters.
The methods terminate when during the same iteration
no reallocation occurs, i.e., clusters are stable.
30
CG DADL (June 2023) Lecture 9 – Clustering
K-means Algorithm
1. Initialize: choose K observations arbitrarily as the centroids of the clusters.
2. Assign each observation to the cluster with the nearest centroid.
3. If no observation is assigned to a different cluster with respect to the previous iteration, stop.
4. For each cluster, the new centroid is computed as the mean of the values belonging to that cluster. Go to Step 2.
31
CG DADL (June 2023) Lecture 9 – Clustering
K-means Algorithm (cont.)
Source: Vercellis (2009), pp. 304
32
CG DADL (June 2023) Lecture 9 – Clustering
K-means Algorithm (cont.)
 Given a cluster C_h, h = 1, 2, ..., K, the centroid of the cluster is the point z_h having coordinates equal to the mean value of each variable over the observations belonging to that cluster:
   z_hj = Σ_{X_i ∈ C_h} x_ij / card{C_h}
 where card{C_h} is the number of observations in cluster C_h.
33
K-means Algorithm (cont.)
 Example – Suppose we have 2-dimensional data with the variables {Weight, Height}:
 In Cluster 1, the observations are: {65, 168}, {69, 172}.
 In Cluster 2, the observations are: {50, 165}, {58, 158}, {54, 157}.
 The centroids are:
 Cluster 1:  z1 = {z11, z12} = { (65 + 69)/2, (168 + 172)/2 } = {67, 170}
 Cluster 2:  z2 = {z21, z22} = { (50 + 58 + 54)/3, (165 + 158 + 157)/3 } = {54, 160}
34
CG DADL (June 2023) Lecture 9 – Clustering
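The same centroid computation can be checked in a few lines of NumPy (the arrays simply restate the example's observations):

import numpy as np

cluster_1 = np.array([[65, 168], [69, 172]])              # {Weight, Height}
cluster_2 = np.array([[50, 165], [58, 158], [54, 157]])

print(cluster_1.mean(axis=0))   # [ 67. 170.]
print(cluster_2.mean(axis=0))   # [ 54. 160.]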
K-medoids Algorithm




The K-medoids algorithm, also known as partitioning
around medoids, is a variant of the K-means method.
Given a cluster Ch , a medoid U h is the most central
observation among those assigned to this cluster.
Instead of the means, the K-medoids algorithm assigns
observations Χ i , i = 1,2,..., m according to the distance to
the medoids.
Essentially, we find the cluster Ch such that dist (Χ i , U h ) is
minimized, h = 1,2,..., K .
35
CG DADL (June 2023) Lecture 9 – Clustering
K-medoids Algorithm (cont.)

K-medoids requires a large number of iterations to assess
the advantage afforded by an exchange of medoids:

Check all (m − K)·K pairs (Χi, Uh) composed of:
 An observation Χi ∉ U that is not a medoid.
 And a medoid Uh ∈ U.
If it is beneficial to exchange between Χ i and U h , the
observation Χ i becomes a medoid in place of U h .
Thus, K-medoids is not suited for large datasets.
36
CG DADL (June 2023) Lecture 9 – Clustering
Hierarchical Methods





Hierarchical methods are based on a tree structure.
They do not require the number of clusters to be predetermined.
Distance between two clusters is computed before
merging of clusters.
Let Ch and C f be two clusters and
Z h and Z f their corresponding centroids.
To evaluate the distance between two
clusters, there are five alternative
measures that can be used.
37
CG DADL (June 2023) Lecture 9 – Clustering
Hierarchical Methods (cont.)

Minimum distance (single linkage):

The dissimilarity between two clusters is given by the minimum
distance among all pair of observations such that one belongs
to the first cluster and the other to the second cluster:
dist(Ch, Cf) = min_{Χi ∈ Ch, Χk ∈ Cf} dist(Χi, Χk)

Maximum distance (complete linkage):

The dissimilarity between two clusters is given by the
maximum distance among all pair of observations such that
one belongs to the first cluster and the other to the second
clusters:
dist(Ch, Cf) = max_{Χi ∈ Ch, Χk ∈ Cf} dist(Χi, Χk)
38
CG DADL (June 2023) Lecture 9 – Clustering
Hierarchical Methods (cont.)

Mean distance:

The dissimilarity between two clusters is computed as the
mean of the distances between all pairs of observations
belonging to the two clusters:
dist(Ch, Cf) = Σ_{Χi ∈ Ch} Σ_{Χk ∈ Cf} dist(Χi, Χk) / (card{Ch} · card{Cf})
Distance between centroids:

The dissimilarity between two clusters is determined by the
distance between the two centroids of the clusters:
dist (Ch , C f ) = dist (Z h , Z f )
39
CG DADL (June 2023) Lecture 9 – Clustering
Hierarchical Methods (cont.)

Ward distance:


Distance is computed based on the analysis of the variance of
the Euclidean distances between the observations.
Two types of hierarchical clustering methods:


40
Agglomerative methods.
Divisive methods.
CG DADL (June 2023) Lecture 9 – Clustering
Agglomerative Hierarchical Methods

Bottom-up techniques:
1.
2.
3.

Initialization phase – Start with each observation representing
a cluster.
The minimum distance between the clusters is computed and
the two clusters with the minimum distance are merged into
a new cluster.
If all the observations have been merged into one cluster,
stop. Otherwise, go to step 2.
Dendrogram:



41
A graphical representation of the merging process.
On one axis is the value of the minimum distance
corresponding to each merger.
On the other axis are the observations.
CG DADL (June 2023) Lecture 9 – Clustering
Agglomerative Hierarchical Methods
(cont.)

A dendrogram provides a whole hierarchy of clusters
corresponding to different threshold values for the
minimum distance between clusters.


Lower threshold gives more clusters.
Example:
42
CG DADL (June 2023) Lecture 9 – Clustering
Divisive Hierarchical Methods

Top-down approach:
1.
2.
3.

Initialization – Start with all observations placed in a single
cluster.
Split clusters into smaller clusters so that the distances
between the generated new clusters are minimized.
Repeat Step 2 until every cluster contains only one
observation, or until a similar stopping condition is met.
In order to reduce the computation time, not all possible
splits are considered:


43
At any given iteration, determine for each cluster the two
observations that are furthest from each other.
Subdivide the cluster by assigning the remaining records to one
or the other based on their similarity.
CG DADL (June 2023) Lecture 9 – Clustering
Pros and Cons of Hierarchical
Clustering Methods

Hierarchical clustering is appealing because:



It does not require specification of number of clusters.
Its ability to represent the clustering process and results
through dendrograms makes it easier to understand and
interpret.
Hierarchical clustering has certain limitations:



44
Has greater computational complexity.
Has lower robustness because reordering the data or dropping a
few observations can lead to a very different solution.
Is sensitive to outliers.
CG DADL (June 2023) Lecture 9 – Clustering
Evaluation of Clustering Models



Unlike supervised learning, evaluation of unsupervised
learning is more subjective.
It is still useful to compare different algorithms/metrics
and number of clusters.
Various performance indicators can be used once the set
of K clusters C = {C1 , C2 ,..., CK } has been obtained:



45
Cohesion.
Separation.
Silhouette coefficient.
CG DADL (June 2023) Lecture 9 – Clustering
Evaluation of Clustering Models (cont.)

Cohesion:

An indicator of homogeneity of observations within each
cluster Ch is defined as:
coh(Ch) = Σ_{Χi ∈ Ch, Χk ∈ Ch} dist(Χi, Χk)
 The overall cohesion of the partition C is defined as:
coh(C) = Σ_{Ch ∈ C} coh(Ch)
 One clustering is preferable over another, in terms of homogeneity within each cluster, if it has a smaller overall cohesion.
46
CG DADL (June 2023) Lecture 9 – Clustering
Evaluation of Clustering Models (cont.)

Separation:

An indicator of inhomogeneity between a pair of clusters is
defined as:
sep(Ch, Cf) = Σ_{Χi ∈ Ch} Σ_{Χk ∈ Cf} dist(Χi, Χk)
 The overall separation of the partition C is defined as:
sep(C) = Σ_{Ch ∈ C} Σ_{Cf ∈ C} sep(Ch, Cf)
 One clustering is preferable over another, in terms of inhomogeneity among all clusters, if it has a greater overall separation.
47
CG DADL (June 2023) Lecture 9 – Clustering
Evaluation of Clustering Models (cont.)

Silhouette coefficient:


A combination of cohesion and separation.
Calculation of the silhouette coefficient for a single observation Χi:
1. The mean distance ui of Χi from all the remaining observations in the same cluster is computed.
2. For each cluster Cf other than the cluster containing Χi, the mean distance wif between Χi and all the observations in Cf is calculated.
3. The minimum vi among the distances wif is determined by varying the cluster Cf.
The silhouette coefficient of Χi is defined as:
silh(Χi) = (vi − ui) / max(ui, vi)
48
CG DADL (June 2023) Lecture 9 – Clustering
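The three steps translate directly into code; the sketch below (my own helper, assuming X is an array of observations and labels holds their cluster assignments) computes the coefficient for a single observation, which is what scikit-learn's silhouette_samples does for all observations at once:

import numpy as np

def silhouette_one(i, X, labels):
    own = labels[i]
    # Step 1 - mean distance u_i to the other observations in the same cluster.
    same = (labels == own) & (np.arange(len(X)) != i)
    u_i = np.linalg.norm(X[same] - X[i], axis=1).mean()
    # Steps 2 and 3 - mean distance w_if to every other cluster, then the minimum v_i.
    v_i = min(np.linalg.norm(X[labels == f] - X[i], axis=1).mean()
              for f in set(labels.tolist()) if f != own)
    # silh(X_i) = (v_i - u_i) / max(u_i, v_i)
    return (v_i - u_i) / max(u_i, v_i)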
Evaluation of Clustering Models (cont.)




The silhouette coefficient varies between -1 and 1.
A negative value indicates that the mean distance ui of the
observation Χi from the points of its cluster is greater than
the minimum value vi of the mean distances from the
observations of the other clusters.
A negative value is therefore undesirable since the membership
of Χ i in its cluster is not well characterized.
Ideally, silhouette coefficient should be positive and ui should
be as close as possible to 0:


49
That is, the best value is 1.
Overall silhouette coefficient of a clustering may be
computed as the mean of the silhouette coefficients for all
observations in the dataset D .
CG DADL (June 2023) Lecture 9 – Clustering
Evaluation of Clustering Models (cont.)

Silhouette diagram:




50
Used to illustrate silhouette coefficients.
The observations are placed on vertical axis, subdivided by
clusters.
The value of silhouette coefficient for each observation are
shown on the horizontal axis.
The mean value of silhouette coefficient for each cluster and
the overall mean may also be shown in the diagram.
CG DADL (June 2023) Lecture 9 – Clustering
Evaluation of Clustering Models (cont.)

Example of silhouette diagrams:


Generated from iris dataset using K-means clustering.
Source – MathWorks
K=2
51
K=3
CG DADL (June 2023) Lecture 9 – Clustering
Clustering Example 1 – K-means


Iris classification problem:

3 classes – Setosa, Versicolor and Virginica.

4 variables – Sepal length, sepal width, petal length and petal
width.
We use K-means clustering with K=3:


52
Silhouette Score = 0.5526 (positive and close to 1.0 is better)
Homogeneity Score = 0.7515 (close to 1.0 is better)
CG DADL (June 2023) Lecture 9 – Clustering
Clustering Example 1 – K-means (cont.)
src01
53
CG DADL (June 2023) Lecture 9 – Clustering
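src01 itself is not reproduced in these notes; a minimal sketch along the same lines (K-means with K=3 on the Iris data, followed by the two scores quoted above) might look like this:

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score, homogeneity_score

X, y = load_iris(return_X_y=True)

# K-means clustering with K=3 on the four Iris variables.
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

print('Silhouette Score  =', silhouette_score(X, labels))
print('Homogeneity Score =', homogeneity_score(y, labels))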
Clustering Example 1 – K-means (cont.)

We can generate the
silhouette diagrams
for K=2 and K=3 for
comparison:

54
See the sample script
src02.
CG DADL (June 2023) Lecture 9 – Clustering
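A rough sketch of how such silhouette diagrams can be generated with scikit-learn and matplotlib (src02 is not reproduced here, so the plotting details below are assumptions):

import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_samples

X, _ = load_iris(return_X_y=True)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, k in zip(axes, (2, 3)):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    silh = silhouette_samples(X, labels)
    y_pos = 0
    for c in range(k):
        values = np.sort(silh[labels == c])     # one horizontal bar per observation
        ax.barh(np.arange(y_pos, y_pos + len(values)), values, height=1.0)
        y_pos += len(values)
    ax.set_title(f'K = {k}')
    ax.set_xlabel('Silhouette coefficient')
plt.tight_layout()
plt.show()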
Clustering Example 1 – K-means (cont.)

To identify the distinguishing characteristics of
observations in each cluster:



55
We can compute the within-cluster means and standard
deviations of the independent variables.
Plot scatter plots of the observations using the required
independent variables.
See the sample script src03.
CG DADL (June 2023) Lecture 9 – Clustering
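A possible sketch of this step (src03 is not reproduced here; the column names are those of the scikit-learn Iris DataFrame):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.data.copy()
df['cluster'] = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(iris.data)

# Within-cluster means and standard deviations of the independent variables.
print(df.groupby('cluster').agg(['mean', 'std']))

# Scatter plot of two of the variables, coloured by cluster.
df.plot.scatter(x='petal length (cm)', y='petal width (cm)', c='cluster', colormap='viridis')
plt.show()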
Clustering Example 2 – Agglomerative
Hierarchical


We perform agglomerative hierarchical clustering on the
Iris dataset.
The number of clusters to find is set to 3:





Silhouette Score = 0.5541 (positive and close to 1.0 is better)
Homogeneity Score = 0.7608 (close to 1.0 is better)
The statistics are slightly better than those for K-means.
See the sample script src04.
A dendrogram is generated to visualize the merging
process – See the sample script src05.
56
CG DADL (June 2023) Lecture 9 – Clustering
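src04 and src05 are not reproduced here; a minimal sketch covering both steps (agglomerative clustering with three clusters, then a dendrogram of the merging process via SciPy) might look like this:

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score, homogeneity_score

X, y = load_iris(return_X_y=True)

# Agglomerative hierarchical clustering with the number of clusters set to 3 (cf. src04).
labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)
print('Silhouette Score  =', silhouette_score(X, labels))
print('Homogeneity Score =', homogeneity_score(y, labels))

# Dendrogram visualizing the merging process, using Ward distances (cf. src05).
dendrogram(linkage(X, method='ward'))
plt.show()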
Clustering Example 2 – Agglomerative
Hierarchical (cont.)
57
CG DADL (June 2023) Lecture 9 – Clustering
Clustering Example 2 – Agglomerative
Hierarchical (cont.)
• Dendrogram diagram showing the merging process.
• For 3 clusters:
• The first cluster would have 50 observations.
• The second cluster would have 50 observations.
• The third cluster would have 50 observations.
Sample generated with SAS
58
CG DADL (June 2023) Lecture 9 – Clustering
Lecture
10
Association
Corporate Gurukul – Data Analytics using Deep Learning
June 2023
Lecturer: A/P TAN Wee Kek
Email: tanwk@comp.nus.edu.sg :: Tel: 6516 6731 :: Office: COM3-02-35
Learning Objectives
At the end of this lecture, you should understand:






1
Motivation and structure of association rules.
Single-dimension association rules.
Apriori algorithm.
General association rules.
Using association in Python with Mlxtend.
CG DADL (June 2023) Lecture 10 – Association
Overview of Association Rules
Association rules are a class of unsupervised learning
models.
The aim of association rules is to identify regular patterns and
recurrences within a large set of transactions.
Fairly simple and intuitive.
Frequently used to investigate:






2
Sales transactions in market basket analysis.
Navigation paths within websites.
CG DADL (June 2023) Lecture 10 – Association
Motivations of Association Rules
Many real-world applications involve systematic collection
of data that gives rise to massive lists of transactions.
These transactions may be analyzed using association
rules to identify possible recurrences in the data.
Market basket analysis:






3
In typical retail purchase transactions, the list of purchased
items is stored along with the price, time and date of purchase.
Analyze the transactions to identify recurrent rules that relate
the purchase of a product, or group of products, to the
purchase of another product, or group of products.
E.g., A customer buying breakfast cereals will also buy
milk with a probability of 0.68.
CG DADL (June 2023) Lecture 10 – Association
Motivations of Association Rules (cont.)
Web mining:





4
List of web pages visited during a session is recorded as a
transaction together with a sequence number and time of visit.
Understand the pattern of navigation paths and the frequency
with which combinations of web pages are visited by a given
individual during a single session or consecutive sessions.
Identify the association of one or more web pages being
viewed with visits to other pages.
E.g., An individual visiting the website timesonline.co.uk
will also visit the website economist.com within
a week with a probability of 0.87.
CG DADL (June 2023) Lecture 10 – Association
Motivations of Association Rules (cont.)
Credit card purchase:



Each transaction consists of the purchases and payments made
by card holders.
Association rules are used to analyze the purchases in order to
direct future promotions.
Fraud detection:



5
Transactions consist of incident reports and applications for
damage compensation.
Existence of special combinations may reveal potentially
fraudulent behaviors and triggers investigation.
CG DADL (June 2023) Lecture 10 – Association
Structure of Association Rules
Given two propositions Y and Z , which may be true or
false, we can state in general terms that a rule is an
implication of the type Y ⇒ Z with the following
meaning:
 If Y is true then Z is also true.
 A rule is called probabilistic if the validity of Z is associated
with a probability p .
 That is, if Y is true then Z is also true with probability p .
 The notation ⇒ is read as “material implication”:



6
A ⇒ B means if A is true then B is also true;
if A is false then nothing is said about B .
CG DADL (June 2023) Lecture 10 – Association
Structure of Association Rules (cont.)
Rules represent a classical paradigm for knowledge
representation and are popular for their simple and
intuitive structure.
Rules should be non-trivial and interpretable so that they
can be translated into concrete action plans:




A marketing analysis may generate a rule that only reflects the
effects of historical promotional campaigns, or that merely
states the obvious.
Rules may sometime reverse the causal relationship of an
implication:


7
E.g., Buyers of an insurance policy will also buy a car with a probability
of 0.98.
This rule confuses the cause with the effect and is useless.
CG DADL (June 2023) Lecture 10 – Association
Representation of Association Rules
 Let O = {o1 , o2 ,..., on } be a set of n objects.
A generic subset L ⊆ O is called an itemset.
An itemset that contains k objects is called a k-itemset.
A transaction represents a generic itemset recorded in a
database in conjunction with an activity or cycle of
activities.
The dataset D is composed of a list of m transactions Ti ,
each associated with a unique identifier denoted by ti.





8
Market basket analysis – The objects represent items from the
retailer and each transaction corresponds to items listed in a
sales receipt.
CG DADL (June 2023) Lecture 10 – Association
Representation of Association Rules
(cont.)


Web mining – The objects represent the web pages in a
website and each transaction corresponds to the list of web
pages visited by a user during one session.
Example on market basket analysis:
• This example is for market basket
analysis.
• In this example, t1 = 001 and
T1 = {a, c} = {bread, cereals} .
• Similarly, t3 = 003 and the corresponding
T3 = {b, d } = {milk, coffee} .
Source: Vercellis (2009), pp. 279
9
CG DADL (June 2023) Lecture 10 – Association
Representation of Association Rules
(cont.)

A dataset of transactions can be represented by a two-dimensional matrix X :
 The n objects of the set O correspond to the columns of the


matrix.
The m transactions Ti are the rows.
The generic element of X is defined as:
xij = 1 if object oj belongs to transaction Ti, and 0 otherwise.
10
CG DADL (June 2023) Lecture 10 – Association
Representation of Association Rules
(cont.)

Same example on market basket analysis:
Source: Vercellis (2009), pp. 280
• Recall that T1 = {bread, cereals} = {a, c}
• And T3 = {milk, coffee} = {b, d }
11
CG DADL (June 2023) Lecture 10 – Association
Representation of Association Rules
(cont.)

The representation could be generalized:
 Assuming that each object o j appearing in a transaction Ti is
associated with a number f ij .
 fij represents the frequency with which oj appears in Ti.


Possible to fully describe multiple sales of a given item in a
single transaction.
Let L ⊆ O be a given set of objects, then transaction T is
said to contain the set L if L ⊆ T .
 In the market basket analysis example, the 2-itemset L = {a, c}
is contained in the transaction with identifier ti = 005 .
 But it is not contained in ti = 006 .
12
CG DADL (June 2023) Lecture 10 – Association
Representation of Association Rules
(cont.)


The empirical frequency f (L ) of an itemset L is
defined as the number of transactions Ti existing in the
dataset D that contain the set L :
f (L ) = card{Ti : L ⊆ Ti , i = 1,2,..., m}
For a large sample (i.e., as m increases), the ratio f(L) / m
approximates the probability Pr(L) of occurrence of
itemset L :
 That is, the probability that L is contained in a new transaction
T recorded in the database.

In the market basket analysis example:


13
The set of objects L = {a, c} has a frequency f (L ) = 4 .
Probability of occurrence is estimated as Pr (L ) = 4 / 10 = 0.4 .
CG DADL (June 2023) Lecture 10 – Association
Single-dimension Association Rules

Given two items L ⊂ O and H ⊂ O such that L ∩ H = φ
and a transaction T , the association rule is a
probabilistic implication denoted by L ⇒ H with the
following meaning:
 If L is contained in T , then H is also contained in T with a
given probability p .
 p is termed the confidence of the rule in D and defined as:
p = conf{L ⇒ H} = f(L ∪ H) / f(L)


14
The set L is called the antecedent or body of the rule.
H is the consequent or head.
CG DADL (June 2023) Lecture 10 – Association
Single-dimension Association Rules
(cont.)


The confidence of the rule indicates the proportion of
transactions containing the set H among those that include L .
This refers to the inferential reliability of the rule.

As the number m of transactions increases, the
confidence approximates the conditional probability that
H belongs to a transaction T given that L belongs
to T :
Pr{H ⊆ T | L ⊆ T} = Pr{(H ⊆ T) ∩ (L ⊆ T)} / Pr{L ⊆ T}

Higher confidence thus corresponds to greater
probability that itemset H exists in a transaction that also
contains the itemset L .
15
CG DADL (June 2023) Lecture 10 – Association
Single-dimension Association Rules
(cont.)

The rule L ⇒ H is said to have a support s in D if the
proportion of transactions containing both L and H is
equal to s :
s = supp{L ⇒ H} = f(L ∪ H) / m



16
The support of the rule expresses the proportion of
transactions containing both the body and head of the rule.
Measures the frequency with which an antecedent-consequent
pair appears together in the transactions of a dataset.
A low support suggests that a rule may have occurred only
occasionally, is of little interest to the decision maker and is typically
discarded.
CG DADL (June 2023) Lecture 10 – Association
Single-dimension Association Rules
(cont.)


As m increases, the support approximates the probability
that both L and H are contained in some future
transactions.
In the market basket analysis example:
 Given the itemsets L = {a, c} and H = {b} for the rule L ⇒ H.

We have:
p = conf{L ⇒ H} = f(L ∪ H) / f(L) = 2/4 = 1/2 = 0.5
s = supp{L ⇒ H} = f(L ∪ H) / m = 2/10 = 0.2
17
CG DADL (June 2023) Lecture 10 – Association
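Both measures are straightforward to compute directly from a transaction list; the helper functions below are my own sketch, and the 10-transaction dataset from the Vercellis example is not reproduced here, so it appears only as a placeholder:

def confidence(transactions, L, H):
    body = sum(1 for T in transactions if L <= set(T))
    both = sum(1 for T in transactions if (L | H) <= set(T))
    return both / body                 # conf{L => H} = f(L u H) / f(L)

def support(transactions, L, H):
    both = sum(1 for T in transactions if (L | H) <= set(T))
    return both / len(transactions)    # supp{L => H} = f(L u H) / m

# transactions = [...]   # the 10 market-basket transactions from the example
# print(confidence(transactions, {'a', 'c'}, {'b'}))   # 2/4 = 0.5
# print(support(transactions, {'a', 'c'}, {'b'}))      # 2/10 = 0.2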
Strong Association Rules



Once a dataset D of m transactions has been assigned:
 Determine minimum threshold value smin for the support.
 Determine minimum threshold value pmin for the confidence.
All strong association rules should be determined,
characterized by:
 A support s ≥ smin ; and
 A confidence p ≥ pmin .
For large dataset:


18
Extracting all association rules through a complete
enumeration procedure requires excessive computation time.
The number NT of possible association rules increases
exponentially as n increases: NT = 3^n − 2^(n+1) + 1.
CG DADL (June 2023) Lecture 10 – Association
Strong Association Rules (cont.)


Moreover, most rules are not strong and a complete
enumeration will likely end up discarding many unfeasible rules.
Need to devise a method capable of deriving strong rules
only, implicitly filtering out rules that do not meet the
minimum threshold requirements:



19
Generation of strong association rules may be divided into two
successive phases.
First phase – Generation of frequent itemsets.
Second phase – Generation of strong rules.
CG DADL (June 2023) Lecture 10 – Association
Generation of Frequent Itemsets


Recall that support of a rule only depends on the union
of the itemsets L and H and not the actual distribution
of objects between the body and head.
For example, all six rules that can be generated using the
objects {a, b, c} have support s = 2/10 = 0.2:
{a, b} ⇒ {c}
{b, c} ⇒ {a}
{b} ⇒ {a, c}

{a, c} ⇒ {b}
{a} ⇒ {b, c}
{c} ⇒ {a, b}
If the threshold value for support is smin = 0.25 , all six
rules can be eliminated based on the analysis of the union
set {a, b, c} .
20
CG DADL (June 2023) Lecture 10 – Association
Generation of Frequent Itemsets (cont.)


The aim of generating frequent itemsets is to extract
all sets of objects whose relative frequency is greater than
the assigned minimum support smin .
This phase is more computationally intensive than the
subsequent phase of rule generation:


21
Several algorithms have been proposed for obtaining the
frequent itemsets efficiently.
The most popular one is known as the Apriori algorithm.
CG DADL (June 2023) Lecture 10 – Association
Generation of Strong Rules



Need to separate the objects contained in each frequent
itemsets according to all possible combinations of head
and body.
Verify if the confidence of each rule exceeds the minimum
threshold pmin .
In the previous example, if the itemset {a, b, c} is frequent,
we can obtain six rules although only those with
confidence higher than pmin can be considered strong.
conf({a, b} ⇒ {c}) = 2/4 = 0.50
conf({b, c} ⇒ {a}) = 2/3 = 0.67
conf({b} ⇒ {a, c}) = 2/7 = 0.29
conf({a, c} ⇒ {b}) = 2/4 = 0.50
conf({a} ⇒ {b, c}) = 2/7 = 0.29
conf({c} ⇒ {a, b}) = 2/5 = 0.40
If we choose pmin = 0.50 , we will have three strong rules but zero if pmin = 0.70 .
22
CG DADL (June 2023) Lecture 10 – Association
Lift of a Rule


Strong rules are not always meaningful or of interest to
decision makers.
Example – Sales of Consumer Electronics:




23
Analyze a set of transactions to identify associations between
sales of color printers and sales of digital camera.
Assuming there are 1000 transactions, of which 600 include
cameras, 750 include printers and 400 include both.
If the threshold values smin = 0.3 and pmin = 0.6 are used.
The rule {camera} ⇒ {printer} would be a strong rule since it
has support s = 400/1000 = 0.4 and confidence
p = 400/600 = 0.66 that exceed the threshold values.
CG DADL (June 2023) Lecture 10 – Association
Lift of a Rule (cont.)




The rule seems to suggest that the purchase of a camera also
induces the purchase of a printer.
However, the probability of purchasing a printer is equal to
750/1000 = 0.75 and is greater than 0.66, the probability of
purchasing a printer conditioned on the purchase of a camera.
In fact, sales of printers and cameras show a negative
correlation, since the purchase of a camera reduces the
probability of buying a printer.
This example highlights a shortcoming in the evaluation of
the effectiveness of a rule based only on its support and
confidence.
24
CG DADL (June 2023) Lecture 10 – Association
Lift of a Rule (cont.)

A third measure of the significance of an association rule
is the lift, defined as:
l = lift{L ⇒ H} = conf{L ⇒ H} × m / f(H) = f(L ∪ H) × m / (f(L) × f(H))

Lift values greater than 1:



Lift is less than 1:


25
Indicate that the rule being considered is more effective than the
relative frequency of the head in predicting the probability that the
head is contained in some transaction of the dataset.
Body and head of the rule are positively associated.
The rule is less effective than the estimate obtained through the
relative frequency of the head.
Body and head of the rule are negatively associated.
CG DADL (June 2023) Lecture 10 – Association
Lift of a Rule (cont.)

In the preceding example:
l = lift{{camera} ⇒ {printer}}
  = conf{{camera} ⇒ {printer}} × m / f({printer})
  = 0.66 × 1000 / 750 = 0.88

If the lift is less than 1, the rule that negates the head,
expressed as {L ⇒ (O − H )} , is more effective than the
original rule:
 The negated rule has confidence 1 − conf {L ⇒ H } .

26
Therefore the lift of the negated rule is greater than 1.
CG DADL (June 2023) Lecture 10 – Association
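The same arithmetic in a few lines of Python (the numbers restate the consumer-electronics example; note that the exact lift is about 0.89, while the 0.88 above uses the rounded confidence 0.66):

m = 1000            # total transactions
f_printer = 750     # transactions that include a printer
f_camera = 600      # transactions that include a camera
f_both = 400        # transactions that include both

conf = f_both / f_camera        # ~0.667
lift = conf * m / f_printer     # ~0.889, i.e. below 1: a negative association
print(round(conf, 2), round(lift, 2))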
Apriori Algorithm




A dataset D of m transactions defined over a set O of n
objects may contain up to 2^n − 1 frequent itemsets.
In real-world applications, n is likely to be at least of the
order of a few dozen objects.
Complete enumeration of the itemsets is not practical.
The Apriori algorithm is a more efficient method of
extracting strong rules:


27
In the first phase, the algorithm generates the frequent
itemsets in a systematic way, without exploring the space of all
candidates.
In the second phase, it extracts the strong rule.
CG DADL (June 2023) Lecture 10 – Association
Apriori Algorithm (cont.)


Apriori principle – If an itemset is frequent, then all its
subsets are also frequent.
Example:
 Assuming that the itemset {a, b, c} is frequent.
 It is clear that each transaction containing {a, b, c} should also
contain each of its six proper subsets:



28
2-itemsets {a, b} , {a, c} and {b, c} .
1-itemsets {a} , {b} and {c} .
Therefore, these six itemsets are also frequent.
CG DADL (June 2023) Lecture 10 – Association
Apriori Algorithm (cont.)

There is also a consequence of the Apriori principle
that is fundamental in reducing the search space:
 If we suppose that the itemset {a, b, c} is not frequent, then


29
each of the itemsets containing it must in turn not be frequent.
Thus, once a non-frequent itemset has been identified in the
course of the algorithm, it is possible to eliminate all itemsets
with a greater cardinality that contain it.
This significantly increases overall efficiency.
CG DADL (June 2023) Lecture 10 – Association
Apriori Algorithm – Generation of
Frequent Itemsets
1.
2.
30
The transactions in the dataset are scanned to compute
the relative frequency of each object. Objects with a
frequency lower than the support threshold smin are
discarded.
At the end of this step, the collection of all frequent 1-itemsets has been generated.
The iteration counter is set to k = 2 .
The candidate k-itemsets are iteratively generated
starting from the (k – 1)-itemsets, determined during
the previous iteration.
CG DADL (June 2023) Lecture 10 – Association
Apriori Algorithm – Generation of
Frequent Itemsets (cont.)
3.
4.
5.
31
The support of each candidate k-itemset is computed
by scanning through all transactions included in the
dataset.
Candidates with a support lower than the threshold smin
are discarded.
The algorithm stops if no k-itemset has been generated.
Otherwise, set k = k + 1 and return to step 2.
CG DADL (June 2023) Lecture 10 – Association
Apriori Algorithm – Generation of
Frequent Itemsets (cont.)

Example – Market basket analysis:
 Minimum threshold for support smin = 0.2 .

Step 1 – Determine the relative frequency for each object in
O = {a, b, c, d , e} :
Source: Vercellis (2009), pp. 286

32
All 1-itemsets are frequent since their frequency is higher than
the threshold smin
CG DADL (June 2023) Lecture 10 – Association
Apriori Algorithm – Generation of
Frequent Itemsets (cont.)


The second iteration proceeds by generating all candidate 2-itemsets, which are obtained from the 1-itemsets.
Then determine the relative frequency for each 2-itemset and
compare with the threshold value smin :
Source: Vercellis (2009), pp. 286
33
CG DADL (June 2023) Lecture 10 – Association
Apriori Algorithm – Generation of
Frequent Itemsets (cont.)


The next iteration generates the candidate 3-itemsets that can
be obtained from the frequent 2-itemsets.
Then determine the relative frequency for each 3-itemset and
compare with the threshold value smin :
Source: Vercellis (2009), pp. 286


34
Since there are no candidate 4-itemsets, the procedure stops.
Total frequent itemsets are 5 + 6 + 2 = 13 .
CG DADL (June 2023) Lecture 10 – Association
Apriori Algorithm – Generation of Strong
Rules
1.
2.
35
The list of frequent itemsets generated during the first
phase is scanned. If the list is empty, the procedure stops.
Otherwise, let B be the next itemset to be considered,
which is then removed from the list.
The set B of objects is subdivided into two non-empty
disjoint subsets L and H = B − L , according to all
possible combinations.
There are altogether 2^k − 2 candidate association rules,
excluding the rules having the body or head as the
empty set.
CG DADL (June 2023) Lecture 10 – Association
Apriori Algorithm – Generation of Strong
Rules (cont.)
3.
4.
36
For each candidate rule L ⇒ H , the confidence is
computed as:
p = conf{L ⇒ H} = f(B) / f(L)
Note that f (B ) and f (L ) are already known at the
end of the first phase. Do not need to rescan the
dataset to compute them.
If p ≥ pmin the rule is included into the list of strong
rules, otherwise, it is discarded.
CG DADL (June 2023) Lecture 10 – Association
Apriori Algorithm – Generation of Strong
Rules (cont.)

Example – Market basket analysis:
Source: Vercellis (2009), pp. 288
37
CG DADL (June 2023) Lecture 10 – Association
Apriori Algorithm – Generation of Strong
Rules (cont.)

The number of operations required by the Apriori
algorithm grows exponentially as the number of objects
increases, i.e., n , increases.

To generate a frequent 100-itemset, need to examine at least:
100  100
30


=
−
≈
2
1
10
∑
 h 
h =1 

100
38
CG DADL (June 2023) Lecture 10 – Association
Apriori Algorithm – Generation of Strong
Rules (cont.)

Several tactics have been suggested to increase the
efficiency:


Adopt more advanced data structures for representing the
transactions, such as dictionaries and binary trees.
Partition the transactions into disjoint subsets:



39
Separately apply the Apriori algorithm to each partition to identify
local frequent items.
At a later stage, starting from the latter, the entire set of transactions is
considered to obtain global frequent itemsets and the corresponding
strong rules.
Randomly extract a significant sample of transactions and
obtain strong association rules for the extracted sample with a
greater efficiency but at the expense of possible inaccuracies.
CG DADL (June 2023) Lecture 10 – Association
Binary, Asymmetric and One-dimensional Association Rules


The association rules that we have discussed in the
preceding section are binary, asymmetric and one-dimensional.
Binary:

The rules refer to the presence or absence of each object in
the transactions that made up the dataset.
• 1 indicates an object’s presence in a
transaction.
• 0 indicates an object’s absence from a
transaction.
• Assuming that we have a large number of
objects and transactions, we should observe
more 0s than 1s since each transaction is likely
to involve a small subset of objects.
40
CG DADL (June 2023) Lecture 10 – Association
Binary, Asymmetric and One-dimensional Association Rules (cont.)

Asymmetric:



One-dimensional:



Implicit assumption that the presence of an object in a single
transaction, corresponding to a purchase, is far more relevant
than its absence, corresponding to a non-purchase.
That is, we are mostly interested in the presence of the value 1,
which can be found in a small number of observations.
The rules involve only one logical dimension of data from
multi-dimensional data warehouses and data marts.
In this example, we only look at the product dimension.
Possible to identify more general association rules
that may be useful across a range of applications.
41
CG DADL (June 2023) Lecture 10 – Association
General Association Rules

Association rules for symmetric binary variables:




Handle binary variables for which the 0 and 1 values are
equally relevant.
E.g., Online registration form seeking the activation of
newsletter service or wishing to receive a loyalty card.
Transform each symmetric binary variable to two asymmetric
binary variables, which correspond to the presence of a given
value and absence of a given value.
E.g., Activation of newsletter service:

42
Two asymmetric binary variables, consent and do not consent, with the values {0, 1}.
CG DADL (June 2023) Lecture 10 – Association
General Association Rules (cont.)

Association rules for categorical variables:



Transform each categorical variable by introducing a set of
asymmetric binary variables, equal in number to the levels of
the categorical variable.
Each binary variable takes the value 1 if in the corresponding
record the categorical variable assumes the level associated
with it.
Association rules for continuous variables:




43
E.g., Age or income of customers.
Transform the continuous variables in two sequential steps.
Transform each continuous variable into a categorical variable
using a suitable discretization technique.
Then transform each discretized variable into a set of
asymmetric binary variables.
CG DADL (June 2023) Lecture 10 – Association
General Association Rules (cont.)

Multidimensional association rules:



Data are stored in data warehouses or data marts, organized
according to several logical dimensions.
Association rules may thus involve multiple dimensions.
E.g., Market basket analysis:




44
If a customer buys a digital camera, is 30-40 years old, and has an
annual average expenditure of $300-500 on electronic equipment,
then she will also buy a color printer with probability 0.78.
This is a three-dimensional rule, since the body consists of three
dimensions – purchased item, age and expenditure amount.
Head refers to a purchase.
Notice that the continuous variables age and expenditure
amount have already been discretized.
CG DADL (June 2023) Lecture 10 – Association
General Association Rules (cont.)

Multi-level association rules:





45
In some applications, association rules do not allow strong
associations to be extracted due to rarefaction.
E.g., Each of the items for sale can be found in too small a
proportion of transactions to be included in the frequent
itemsets, thus preventing the search for association rules.
However, objects making up the transactions usually belong to
hierarchies of concepts.
E.g., Items are usually grouped by type and in turn by sales
department.
To remedy rarefaction, transfer the analysis to a higher level in
the hierarchy of concepts.
CG DADL (June 2023) Lecture 10 – Association
General Association Rules (cont.)

Sequential association rules:


Transactions are often recorded according to a specific
temporal sequence.
Examples:




Transactions for a loyalty card holder correspond to the sequence of
sale receipts.
Transactions that gather the sequence of navigation paths followed by
a user are associated with the temporal sequence of the sessions.
Analysts are interested to extract association rules that take
into account temporal dependencies.
The algorithms used to extract general association rules
usually consist of extensions of the Apriori algorithm and
its variants.
46
CG DADL (June 2023) Lecture 10 – Association
Example – Market Basket Analysis

RapidMiner Studio:





47
The dataset consisting of five binary variables, one per object, is
prepared in Excel.
This dataset is imported into RPM.
The “Numerical to Binary” operator is used to convert the
integer variables to actual binary variables.
The transformed dataset is then fed into the “FP-Growth”
operator to generate the frequent itemsets with the threshold
value smin = 0.2.
The frequent itemsets are then fed into the “Create Association
Rules” operator with the threshold value pmin = 0.55.
CG DADL (June 2023) Lecture 10 – Association
Example – Market Basket Analysis
(cont.)
48
CG DADL (June 2023) Lecture 10 – Association
Example – Market Basket Analysis
(cont.)
Frequent itemsets with smin = 0.2
Association rules with pmin = 0.55
49
CG DADL (June 2023) Lecture 10 – Association
Example – Market Basket Analysis
(cont.)


Scikit Learn does not support the Apriori algorithm.
So we will use Mlxtend (machine learning extensions):
src01
50
CG DADL (June 2023) Lecture 10 – Association
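The sample script src01 for this lecture is not reproduced here; a minimal sketch using Mlxtend's apriori and association_rules on a one-hot encoded transaction table might look as follows (the rows below are placeholders, not the actual 10-transaction dataset, while the thresholds match the ones above):

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# One-hot encoded transactions: one row per transaction, one column per object.
transactions = pd.DataFrame(
    [[1, 0, 1, 0, 0],
     [0, 1, 0, 1, 0],
     [1, 1, 1, 0, 1]],                      # placeholder rows only
    columns=['a', 'b', 'c', 'd', 'e']).astype(bool)

frequent = apriori(transactions, min_support=0.2, use_colnames=True)
rules = association_rules(frequent, metric='confidence', min_threshold=0.55)
print(frequent)
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])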
Example – Market Basket Analysis
(cont.)
Original transactions dataset
51
Frequent itemsets
CG DADL (June 2023) Lecture 10 – Association
Example – Market Basket Analysis
(cont.)
52
CG DADL (June 2023) Lecture 10 – Association
Lecture
11
Text Mining
Corporate Gurukul – Data Analytics using Deep Learning
June 2023
Lecturer: A/P TAN Wee Kek
Email: tanwk@comp.nus.edu.sg :: Tel: 6516 6731 :: Office: COM3-02-35
Learning Objectives
At the end of this lecture, you should understand:





1
Concepts and definitions of text mining.
Text mining process.
Extracting knowledge from text data.
Using text mining in Python with NLTK and Scikit Learn.
CG DADL (June 2023) Lecture 11 – Text Mining
Text Mining Concepts
85-90% of all corporate data is in some kind of
unstructured form (e.g., text).
Unstructured corporate data is doubling in size every 18
months.
Tapping into these information sources is not an option
but a necessity to stay competitive.
Text mining is the answer:






2
A semi-automated process of extracting knowledge from large
amount of unstructured data sources.
Also known as text data mining or knowledge discovery in
textual databases.
CG DADL (June 2023) Lecture 11 – Text Mining
Data Mining versus Text Mining
Both seek novel and useful patterns.
Both are semi-automated processes.
Main difference lies in the nature of the data:






Structured versus unstructured data.
Structured data – databases.
Unstructured data – Word documents, PDF files, text excerpts,
XML files, etc.
Text mining generally involves:



3
Imposing structure to the data.
Then mining the structured data.
CG DADL (June 2023) Lecture 11 – Text Mining
Benefits of Text Mining
Benefits of text mining are obvious especially in text-rich
data environments:


E.g., law (court orders), academic research (research articles),
finance (quarterly reports), medicine (discharge summaries),
biology (molecular interactions), technology (patent files),
marketing (customer comments), etc.
Electronic communication records (e.g., email):




4
Spam filtering.
Email prioritization and categorization.
Automatic response generation.
CG DADL (June 2023) Lecture 11 – Text Mining
Text Mining Application Area
Information extraction:



Identification of key phrases and relationships within text.
Look for predefined sequences in text via pattern matching.
Topic tracking:



Based on a user profile and documents that a user views.
Text mining can predict other documents of interest to the
user.
Summarization:


5
Summarizing a document to save time on the part of the
reader.
CG DADL (June 2023) Lecture 11 – Text Mining
Text Mining Application Area (cont.)
Categorization:



Identifying main themes of a document.
Place the document into a predefined set of categories based
on those themes.
Clustering:


Group similar documents without having a predefined set of
categories.
Concept linking:



6
Connect related documents by identifying their shared
concepts.
Help users find information that they perhaps would not have
found using traditional search methods.
CG DADL (June 2023) Lecture 11 – Text Mining
Text Mining Application Area (cont.)
Question answering:


Find the best answer to a given question through knowledge-driven pattern matching.
Text mining initiatives by Elsevier, a leading publisher of research journals.
(Source: http://www.elsevier.com/connect/elsevier-updates-text-mining-policy-to-improve-access-for-researchers)
7
CG DADL (June 2023) Lecture 11 – Text Mining
Text Mining Terminologies
Unstructured data (versus structured data):


Structured data:



Unstructured data:



8
Have a predetermined format.
Usually organized into records with simple data values (categorical,
ordinal and continuous variables) and stored in databases.
Do not have a predetermined format.
Stored in the form of textual documents.
Structured data are for computers to process while
unstructured data are for humans to process and understand.
CG DADL (June 2023) Lecture 11 – Text Mining
Text Mining Terminologies (cont.)
Corpus:



Corpus is a large and structured set of texts (usually stored
and processed electronically).
Corpus is prepared for the purpose of conducting knowledge
discovery.
Terms:



9
A term is a single word or multiword phrase extracted directly
from the corpus of a specific domain.
Uses natural language processing (NLP) methods.
CG DADL (June 2023) Lecture 11 – Text Mining
Text Mining Terminologies (cont.)

Concepts:




Concepts are features generated from a collection of
documents.
Uses manual, statistical, rule-based or hybrid categorization
methodology.
Compared to terms, concepts are the result of higher level
abstraction.
Stemming:


10
Process of reducing inflected words to their stem (or base or
root) form.
For example, stemmer, stemming, stemmed are all based on the
root stem.
CG DADL (June 2023) Lecture 11 – Text Mining
Text Mining Terminologies (cont.)

Stop words:




Stop words (or noise words) are words that are filtered out
prior to or after processing of natural language data, i.e., text.
No universally accepted list of stop words.
Most NLP tools use a list that includes articles (a, am, the, of,
etc.), auxiliary verbs (is, are, was, were, etc.) and context-specific
words that are deemed not to have differentiating value.
Synonyms and polysemes:

Synonyms are syntactically different words (i.e., spelled
differently) with identical or at least similar meanings.

11
E.g., movie, film and motion picture.
CG DADL (June 2023) Lecture 11 – Text Mining
Text Mining Terminologies (cont.)

Polysemes (or homonyms) are syntactically identical words
with different meanings.


E.g., Bow can mean “to bend forward”, “the front of the ship”, “the
weapon that shoots arrows,” or “a kind of tied ribbon”.
Tokenizing:





12
A token is a categorized block of text in a sentence.
The block of text corresponding to the token is categorized
according to the function it performs.
This assignment of meaning to blocks of text is known as
tokenizing.
A token can look like anything; it just needs to be a useful part
of the structured text.
E.g., email address, URL or phone number.
CG DADL (June 2023) Lecture 11 – Text Mining
Text Mining Terminologies (cont.)

Term dictionary:


Word frequency:



A collection of terms specific to a narrow field that can be
used to restrict the extracted terms within a corpus.
The number of times a word is found in a specific document.
Term frequency refers to the raw frequency of a term in a
document.
Term-by-document matrix (occurrence matrix or
term-document matrix):

13
A common representation schema of the frequency-based
relationship between the terms and documents in tabular
format.
CG DADL (June 2023) Lecture 11 – Text Mining
Text Mining Terminologies (cont.)


14
Terms are listed in rows and documents are listed in columns.
The frequency between the terms and documents is listed in
cells as integer values.
CG DADL (June 2023) Lecture 11 – Text Mining
Text Mining Terminologies (cont.)

Singular-value decomposition (latent semantic
indexing):


15
A dimensionality reduction method used to transform the
term-by-document matrix to a manageable size.
Generate an intermediate representation of the frequencies
using a matrix manipulation method similar to principal
component analysis.
CG DADL (June 2023) Lecture 11 – Text Mining
Text Mining Process
Task 1 – Establish the Corpus: Collect and organize the domain-specific unstructured data.
Task 2 – Create the Term-Document Matrix: Introduce structure to the corpus.
Task 3 – Extract Knowledge: Discover novel patterns from the T-D matrix.
(Feedback loops connect the three tasks.)
The inputs to the process include a variety of relevant unstructured (and semi-structured) data sources such as text, XML, HTML, etc.
The output of Task 1 is a collection of documents in some digitized format for computer processing.
The output of Task 2 is a flat file called the term-document matrix where the cells are populated with the term frequencies.
The output of Task 3 is a number of problem-specific classification, association and clustering models and visualizations.
The three-step text mining process
16
CG DADL (June 2023) Lecture 11 – Text Mining
Task 1 – Establish the Corpus

Collect all the documents related to the context (domain
of interest) being studied:




May include textual documents, XML files, emails, web pages
and short notes.
Voice recordings may also be transcribed using speech-recognition algorithms.
Text documents are transformed and organized into the
same representational form (e.g., ASCII text file).
Place the collection in a common place (e.g., in a flat file,
or in a directory as separate files).
17
CG DADL (June 2023) Lecture 11 – Text Mining
Task 2 – Create the Term-Document
Matrix

The digitized and organized documents (the corpus) are
used to create the term-document matrix (TDM):





Rows represent documents and columns represent terms.
The relationships between the terms and documents are
characterized by indices.
Indices is a relational measure that can be as simple as the
number of occurrences of the term in respective documents.
The goal is to convert the corpus into a TDM where the
cells are filled with the most appropriate indices.
The assumption is that the essence of a document can be
represented with a list of frequency of the terms used in
that document.
18
CG DADL (June 2023) Lecture 11 – Text Mining
Task 2 – Create the Term-Document
Matrix (cont.)
[Sample term-by-document matrix: the rows are Document 1 to Document 6, the columns are terms such as investment risk, project management, software engineering, development and SAP, and the cells hold raw term frequencies (1, 2, 3, ...); most cells are empty.]
19
CG DADL (June 2023) Lecture 11 – Text Mining
Task 2 – Create the Term-Document
Matrix (cont.)

On the one hand, not all the terms are important when
characterizing the documents:




Some terms such as articles, auxiliary verbs, and terms used in
almost all of the documents in the corpus, have no
differentiating power.
This list of terms, known as stop terms or stop words, should
be identified by domain experts.
Stop terms should be excluded from the indexing process.
On the other hand, the analyst might choose a set of
predetermined terms, i.e., term dictionary, under which
the documents are to be indexed.
20
CG DADL (June 2023) Lecture 11 – Text Mining
Task 2 – Create the Term-Document
Matrix (cont.)

To create more accurate index entries:



Synonyms and specific phrases (e.g., “Eiffel Tower”) can be
provided.
Stemming may be used such that different grammatical forms
or declinations of verb are identified and indexed as the same
word.
The first generation of the TDM:



21
Include all unique terms identified in the corpus as its columns,
excluding the stop terms.
All documents as its rows.
Occurrence count of each term for each document as its cell
values.
CG DADL (June 2023) Lecture 11 – Text Mining
Task 2 – Create the Term-Document
Matrix (cont.)

For a large corpus, the TDM will contain a very large
number of terms:



Increases processing time.
Leads to inaccurate patterns.
Need to decide:


22
What is the best representation of the indices?
How can we reduce the dimensionality of the TDM to a
manageable size?
CG DADL (June 2023) Lecture 11 – Text Mining
Representing the Indices

The raw term frequencies generally reflect how salient
or important a term is in each document:




Terms that occur with greater frequency in a document are
better descriptors of the contents of that document.
However, term counts themselves might not be proportional
to their importance as descriptors of the documents.
E.g., a term that occurs one time in document A but three
times in document B does not necessarily mean that this term
is three times as important a descriptor of B compared to A.
These raw indices need to be normalized to obtain a
more consistent TDM for further analysis.
23
CG DADL (June 2023) Lecture 11 – Text Mining
Representing the Indices (cont.)


Other than actual frequency, the numerical representation
between terms and documents can be normalized with
several methods.
Log frequencies:


This transformation would “dampen” the raw frequencies and
how they affect the results of subsequent analysis.
Given wf is the raw word (or term) frequency and f (wf ) is
the result of the log transformation:
f(wf) = 1 + log(wf) if wf > 0, and f(wf) = 0 if wf = 0

24
This transformation is applied to all raw frequencies in the
TDM that are greater than 0.
CG DADL (June 2023) Lecture 11 – Text Mining
Representing the Indices (cont.)

Binary frequencies:

This is a simple transformation to enumerate whether a term
is used in a document.
f(wf) = 1 if wf > 0, and f(wf) = 0 if wf = 0


25
The resulting TDM will contain 1s and 0s to indicate the
presence or absence of the respective term.
This transformation will also dampen the effect of the raw
frequency counts on subsequent computation and analyses.
CG DADL (June 2023) Lecture 11 – Text Mining
Representing the Indices (cont.)

Inverse document frequencies:


Sometimes, it may be useful to consider the relative document
frequencies (df) of different terms.
Example:

A term such as guess may occur frequently in all documents.


Another term such as software may appear only a few times.


26
One might make guesses in various contexts regardless of the specific
topic.
Software is a more semantically focused term that is only likely to occur in
documents that deal with computer software.
Inverse document frequency is a transformation that deals with
specificity of words (document frequencies) as well as the
overall frequencies of their occurrences (term frequencies).
CG DADL (June 2023) Lecture 11 – Text Mining
Representing the Indices (cont.)

This transformation for the ith term and jth document is
defined as:
idf(i, j) = 0 if wfij = 0, and idf(i, j) = (1 + log(wfij)) × log(N / dfi) if wfij ≥ 1
where:
N is the total number of documents.
df i is the document frequency for the ith term, i.e., the number
of documents that include this term.
This transformation includes both the dampening of the
simple-term frequencies via the log function and a weighting
factor that:


27
Evaluates to 0 if the word occurs in all documents, i.e., log(N/N) = log 1 = 0.
Evaluates to the maximum value when a word only occurs in a single
document, i.e., log(N/1) = log N.
CG DADL (June 2023) Lecture 11 – Text Mining
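A small NumPy sketch of the three weightings applied to a toy raw-frequency matrix (rows are documents, columns are terms; note that scikit-learn's TfidfTransformer uses a slightly different idf formula):

import numpy as np

tdm = np.array([[1, 0, 3],      # toy term frequencies: 3 documents x 3 terms
                [0, 2, 0],
                [1, 1, 0]])

with np.errstate(divide='ignore'):
    log_freq = np.where(tdm > 0, 1 + np.log(tdm), 0)       # log frequencies
binary = (tdm > 0).astype(int)                              # binary frequencies

N = tdm.shape[0]                  # total number of documents
df = (tdm > 0).sum(axis=0)        # document frequency of each term
idf = np.where(tdm > 0, log_freq * np.log(N / df), 0)       # inverse document frequencies
print(log_freq, binary, idf, sep='\n\n')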
Reducing the Dimensionality of the
Matrix


TDM is often very large and rather sparse (most of the
cells are filled with 0s).
There are three options to reduce the dimensionality of
this matrix to a manageable size:

Use a domain expert to go through the list of terms and
eliminates those that do not make much sense for the study’s
context.



28
This is a manual process that might be labor-intensive.
Eliminate terms with very few occurrences in very few
documents.
Transform the matrix using singular value decomposition
(SVD).
CG DADL (June 2023) Lecture 11 – Text Mining
Reducing the Dimensionality of the
Matrix (cont.)

Singular value decomposition (SVD):





29
Closely related to principal components analysis.
Reduces the overall dimensionality of the input matrix
(number of input documents by number of extracted terms) to
a lower dimensional space.
Each consecutive dimension represents the largest degree of
variability (between words and documents) possible.
Identify the two or three most salient dimensions that account
for most of the variability (differences) between the words and
documents.
Once the salient dimensions are identified, the underlying
“meaning” of what is contained in the documents is extracted.
CG DADL (June 2023) Lecture 11 – Text Mining
Task 3 – Extract the Knowledge

Using the well-structured TDM:



We can extract novel patterns within the context of the
specific problem.
The knowledge extraction process may be potentially
augmented with other structured data elements.
The main categories of knowledge extraction methods
are:



30
Classification.
Clustering.
Association.
CG DADL (June 2023) Lecture 11 – Text Mining
Classification




Classify a given data instance into a predetermined set of
categories (or classes).
In text mining, this is known as text categorization.
For a given set of categories (subjects, topics or concepts)
and a collection of text documents, the goal is to find the
correct category for each document.
Automated text classification may be applied to:





31
Indexing of text.
Spam filtering.
Web page categorization under hierarchical catalogs.
Automatic generation of metadata.
Detection of genre and many others.
CG DADL (June 2023) Lecture 11 – Text Mining
Classification (cont.)

Two main approaches:



32
Knowledge engineering – Expert’s knowledge about the
categories is encoded into the system either declaratively or in
the form of procedural classification rules.
Machine learning – A general inductive process builds a
classifier by learning from a set of pre-classified examples.
The increasing number of documents and the decreasing number of knowledge
experts result in a trend towards machine learning.
CG DADL (June 2023) Lecture 11 – Text Mining
Clustering




Unsupervised process whereby objects are placed into
“natural” groups called clusters.
In classification, descriptive features of the classes in pre-classified training objects are used to classify a new
object.
In clustering, unlabeled collection of objects are grouped
into meaningful clusters without any prior knowledge.
Clustering is useful in a wide range of applications:


33
Document retrieval.
Enabling better web content searches.
CG DADL (June 2023) Lecture 11 – Text Mining
Clustering (cont.)

Analysis and navigation of very large text collections such
as web pages is a prominent application of clustering:


34
Underlying assumption is that relevant documents tend to be
more similar to each other than to irrelevant ones.
Clustering of documents based on content similarity thus
improves search effectiveness.
CG DADL (June 2023) Lecture 11 – Text Mining
Association



Generating association rules involves identifying the
frequent sets that go together.
In text mining, association refers to the direct
relationships between concepts (terms) or sets of
concepts.
A concept association rule A ⇒ C [S , C ] relates two
frequent concept sets A and C with support S and
confidence C :


35
Support is the proportion of documents that include the
concepts in A and C .
Confidence is the proportion of documents that include all the
concepts in C within the subset of documents that include A.
CG DADL (June 2023) Lecture 11 – Text Mining
Association (cont.)

Example:



36
In a document collection, the concept “Software
Implementation Failure” may appear most often in association
with “Enterprise Resource Planning” and “Customer
Relationship Management” with support 0.04 and confidence
0.55.
This means that 4% of all documents had all three concepts
represented in the same document.
Of the documents that included “Software Implementation
Failure”, 55% of them also included “Enterprise Resource
Planning” and “Customer Relationship Management”.
CG DADL (June 2023) Lecture 11 – Text Mining
Text Mining Example – SMS Spam
Filtering

The SMS spam collection dataset is taken from the UCI
Machine Learning repository:





37
https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
The SMS messages are extracted from various sources such as
the NUS SMS Corpus (NSC).
The authors have classified each SMS as either spam or ham
(i.e., legitimate SMS message).
The pre-classification by domain experts allows us to perform
supervised learning.
In contrast, most SMS message corpora such as the
NSC might not have been pre-classified.
CG DADL (June 2023) Lecture 11 – Text Mining
Text Mining Example – SMS Spam
Filtering (cont.)


We can skip over Task 1 of establishing the corpus since it
is already provided.
Task 2 of creating the Term-Document Matrix (TDM) can
be done in Python using NLTK (Natural Language Toolkit)
and Scikit Learn:





38
Tokenize each SMS message.
Lowercase the SMS message and remove special characters
and leading/trailing spaces.
Remove stop words using NLTK list of English stop words.
Perform spelling correction and lemmatization using NLTK.
Generate a word list from all the SMS messages and
their respective raw term frequencies using Scikit Learn's
CountVectorizer.
CG DADL (June 2023) Lecture 11 – Text Mining
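A condensed sketch of this preprocessing pipeline (the file name and column names are assumptions about how the UCI file is stored locally, and spelling correction is omitted for brevity):

import re
import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

nltk.download('stopwords')
nltk.download('wordnet')

# Assumed layout: a tab-separated file with a label column and a message column.
sms = pd.read_csv('SMSSpamCollection', sep='\t', names=['label', 'message'])

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    # Lowercase, remove special characters and leading/trailing spaces.
    text = re.sub(r'[^a-z0-9\s]', ' ', text.lower()).strip()
    # Remove stop words and lemmatize the remaining tokens.
    tokens = [lemmatizer.lemmatize(t) for t in text.split() if t not in stop_words]
    return ' '.join(tokens)

# Raw term frequencies for every SMS message (the term-document matrix).
vectorizer = CountVectorizer()
tdm = vectorizer.fit_transform(sms['message'].apply(preprocess))
print(tdm.shape)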
Text Mining Example – SMS Spam
Filtering (cont.)

Actual TDM exported to a TSV file viewed in Excel:
39
CG DADL (June 2023) Lecture 11 – Text Mining
Text Mining Example – SMS Spam
Filtering (cont.)

Task 3 of extracting knowledge is based on the TDM with
binary frequencies:



Classification – Performed with classification tree, Naïve Bayes
classifier and neural network.
Association rules – Binary, asymmetric and one-dimensional
association rules were extracted.
The classification models allow us to classify future SMS
messages as spam or ham based on the presence/absence
of key words/terms.
40
CG DADL (June 2023) Lecture 11 – Text Mining
Text Mining Example – SMS Spam
Filtering (cont.)

Using a decision tree classifier with a maximum depth of
5, we can obtain fairly good prediction:





Training accuracy was 93%
Test accuracy was 93%
True positive rate (recall) for test
data was 93% (48% for spam class)
Precision was 93%
If we allow unlimited depth:



41
Test accuracy would be 96%
True positive rate would be 96%
(80% for spam class)
Precision would be 96%
CG DADL (June 2023) Lecture 11 – Text Mining
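A sketch of this classification step, reusing tdm and the labels from the preprocessing sketch above (an 80/20 split and a depth-5 tree as described; the exact scores will vary with the preprocessing choices):

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# tdm and sms come from the preprocessing sketch above.
X_train, X_test, y_train, y_test = train_test_split(
    tdm, sms['label'], test_size=0.2, random_state=42)

tree = DecisionTreeClassifier(max_depth=5, random_state=42)   # or max_depth=None for unlimited depth
tree.fit(X_train, y_train)

print('Training accuracy:', tree.score(X_train, y_train))
print('Test accuracy    :', tree.score(X_test, y_test))
print(classification_report(y_test, tree.predict(X_test)))    # per-class recall and precision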
Text Mining Example – SMS Spam
Filtering (cont.)

The decision tree also provided some fairly intuitive rules
for classifying whether an SMS message is spam:
• Full view of the decision tree with a maximum depth of 5 generated from 80%
training dataset with a prediction accuracy of 93% (true positive rate of
classifying spam is 48%).
• E.g., a new SMS message containing the words “call” and “claim” would be
classified as spam. Another SMS message containing the word “call” but not
“claim” would be classified as ham.
• A decision tree with unlimited depth would be more accurate but the rules
would be less interpretable.
42
CG DADL (June 2023) Lecture 11 – Text Mining