Uploaded by Joana Nascimento

Data Mining Introduction: Concepts and Processes

DATA MINING
Data science and advanced Analytics, with a specialization in Data Science
1. Introduction to Data Science
1.1. Data as a strategic resource:
Data is a crucial asset in today's industries:
 The A350 airplane, with 6,000 sensors, produces 2.5 Tb of data daily.
 However, most of the data created is not effectively used. This gap is expected to widen
as data production increases fivefold in the next five years.
1.2. Definitions:
Artificial Intelligence: the science of making things smart, which can be defined as “human
intelligence exhibited by machines”.
 A broad term for getting computers to perform human tasks.
 The scope of AI is disputed and constantly changing over time.
 The systems implemented today are a form of narrow AI – a system that can do just one
(or a few) defined things as well as or better than humans:
• Recognizing objects.
• Playing chess.
• Playing Jeopardy!
Machine Learning: creating algorithms (or sets of rules) that learn complex functions (or
patterns) from data and make predictions from them.
 An approach to achieving Artificial Intelligence through systems that learn from
experience to find patterns in a set of data.
 Involves teaching a computer to recognize patterns by example, rather than programming
it with specific rules.
 These patterns can be found within the data.
 Essentially about making predictions:
1. It takes some data to train the system.
2. It learns patterns from this data.
3. It classifies new data it has not seen before (it makes a guess at what the data probably
is, based on the knowledge gained in the previous step).
 It learns by itself from the data.
Big Data: Anything that Won't Fit in Excel
 Refers to datasets of such substantial size that conventional data-processing systems are
inadequate, necessitating the use of new technologies.
 Volume: exponential growth
 Variety: increase in the amount of unstructured and semi-structured data.
 Velocity: increase in the speed at which data is created and in the need for real-time analytics.
 Veracity: establishing the trustworthiness of big data sources.
 Value: it is good to have access to big data, but it is more important to turn it into value.
Data Science: the study of where information comes from, what it represents and how it can be
turned into a valuable resource in the creation of business and IT strategies.
 Data scientists: involved with gathering data, massaging it into a tractable form, making
it tell its story, and presenting that story to others.
1.3. Data Science roles and skills:
Business Analyst:
 Business analysts’ strengths lie in their business acumen.
 They can communicate well with both the data scientist and the C-suite to help drive
data-driven decisions faster.
 The best business analysts also have skills in statistics to be able to glean interesting
insights from past behaviour.
Data Scientist:
 Data science is largely rooted in statistics, data modelling, analytics and algorithms.
 They focus on conducting research, optimizing data to help companies get better at what
they do.
 The minds behind recommended products on Amazon.
Data Engineer:
 While data scientists dig into the research and visualization of data, data engineers ensure
the data is powered and flows correctly through the pipeline.
 They’re typically software engineers who can engineer a strong foundation for data
scientists or analysts to think critically about the data.
1.4. Fundamental principles of data-driven thinking:
Find/Build Features/Attributes:
 Features are fundamental to train an ML system.
 They are the properties of the things you’re trying to learn about.
 Example:
• Fruit features: weight and color (2 dimensions).
• Express each dimension numerically so that the examples can be plotted.
• The ML system learns to separate oranges from apples with a line.
• ML predicts based on known data; it cannot predict classes it has never seen.
• Asked to classify a papaya, it returns the closest match among the known apples and oranges.
The Relevance of Data:
 Features/Attributes:
• Choosing the appropriate features has a major impact on the performance of any ML
system.
• Some features will never allow the system to produce good results.
• How to choose? Through practice and knowledge of the problem.
• Do you know which examples correspond to apples and which correspond to
oranges? We need the labels (e.g. fraud).
• Do you have enough labelled examples? We need experience (labelled cases are scarce).
• Do you know what an orange is? We need clear-cut definitions (e.g. churn).
 Data-Driven Thinking:
 Supervised Learning:
• Examples (training set):
Weight | Color | Fruit
X      | O     | Orange
Y      | G     | Apple
Z      | R     | Apple
• Examples (new):
Weight | Color
X      | O
Y      | G
Z      | R
• Classification:
Label
Orange
Apple
Apple
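The fruit example can be sketched as a 1-nearest-neighbour classifier. This is only one possible way to "find the closest match"; the weights, the numeric colour score, and the helper name `classify` are invented for illustration and do not come from the notes.

```python
import math

# Hypothetical training set: (weight in grams, colour score) -> fruit label.
# The colour score (close to 1 = orange, close to 0 = green/red) is an
# assumed numeric encoding, chosen only so the examples can be compared.
training_set = [
    ((150.0, 0.9), "Orange"),
    ((170.0, 0.2), "Apple"),
    ((160.0, 0.1), "Apple"),
]


def classify(example, training):
    """Return the label of the closest training example (1-nearest neighbour)."""
    nearest = min(training, key=lambda pair: math.dist(pair[0], example))
    return nearest[1]


# New, unlabelled examples described by the same two features.
new_examples = [(152.0, 0.85), (168.0, 0.15)]
labels = [classify(x, training_set) for x in new_examples]
print(labels)
```

Note that weight dominates the distance here because the two features are on very different scales, which is one reason normalization matters for distance-based methods.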
 Building Features (ETL):
• Extract: To extract and to consolidate data from different sources.
• Transform: Select variables, create new variables, merge, etc.
• Load: Load data, periodicity, replacement, historical.
1.5. What is new about Data Mining?
Traditional statistics: typically involves small, clean, static datasets sampled in an independent
and identically distributed (iid) manner. The datasets are often collected to address specific
problems and consist solely of numeric values. These assumptions do not hold in the data mining
context.
Size of the data sets:
 In the past, statisticians often dealt with not having enough data.
 Too much data can make tests find very small effects, which might not be practically
useful.
 Instead of just statistical importance, we should also think about whether an effect is
important or valuable.
 When working with big datasets, special techniques are needed, especially in fields like
pattern recognition and machine learning.
 Datasets can be big because there are lots of records or lots of variables (deep and large).
 Data might not be stored in one flat file but in multiple connected flat files.
Data Mining Modes
 Incremental (Online):
• Examples presented one at a time, representation structure changes.
• Online learning handles each instance incrementally, the algorithm and knowledge
update with each instance in real-time.
 Non-incremental (Batch): Examples presented all at once; considered together.
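The difference between the two modes can be illustrated with something as simple as a mean. In this sketch (the class name `OnlineMean` and the data are assumptions for illustration), the batch version needs all examples at once, while the incremental version updates its "knowledge" after every instance.

```python
def batch_mean(values):
    """Batch mode: all examples are available and considered together."""
    return sum(values) / len(values)


class OnlineMean:
    """Incremental mode: the estimate is updated one example at a time."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        # Move the current estimate towards the new value by 1/count.
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean


stream = [4.0, 8.0, 6.0, 2.0]  # invented data
online = OnlineMean()
for value in stream:
    online.update(value)

print(online.mean, batch_mean(stream))
```

Both approaches reach the same value here, but only the incremental one can keep running as new records arrive without revisiting the old ones.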
Nonstationarity, Selection Bias, and Dependent Observations:
 Large datasets often deviate from being independently and identically distributed (iid).
 Population drift may occur due to changes in the underlying population; for instance, the
pool of bank loan applicants may shift with economic changes. In daily transactions like
supermarket purchases or Telco phone calls, databases are continually evolving.
 Selection bias arises when developing scoring rules, where comprehensive data is
available only for applicants previously deemed good risks by some prior rule. Those
graded as bad may have been rejected, leaving their true status undiscovered.
Spurious Relationships and Automated Data Analysis:
 Pattern searches generate numerous candidate patterns, increasing the likelihood of
identifying chance data configurations as patterns.
 Ultimately, patterns and structures flagged as potentially interesting are presented to
domain experts. Experts assess acceptance or rejection based on substantive domain
context and objectives, not solely on internal statistical structure.
Statistics versus Data Science
 Experimental Statistics – Primary:
• Purpose: Scientific
• Value: Operational
• Origin: Controlled
• Size: Small
• Hygiene: Clean
• Status: Static
 Opportunistic Data Science – Secondary:
• Purpose: Commercial
• Value: Operational
• Origin: Passively observed
• Size: Massive
• Hygiene: Dirty
• Status: Dynamic
2. The canonical tasks in Data Mining and work process
2.1. The canonical tasks in Data Mining
Different Tasks:
 Predictive modelling – Supervised Learning:
• Classification
• Regression
 Descriptive modelling – Unsupervised Learning:
• Clustering
• Visualization
• Association
Predictive Modelling:
 A real estate agency wants to estimate the price range for each customer based on their
income.
 Training examples:
• Historical data.
• Income vs sold house prices.
 Algorithm: Linear regression
 Knowledge representation: Regression line (slope and intercept)
 New examples: A customer with an income of x
 Interpretation: Use the line (prediction method) to obtain an estimate
 Regression problem: we have to consider the average deviations produced by the model.
 Classification problem: we only need to count the number of times that the model was
wrong.
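A minimal sketch of the real estate example, assuming ordinary least squares as the fitting method and the mean absolute error as the "average deviation". The income and price figures are invented, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented historical data: income vs sold house price (both in thousands).
incomes = [30.0, 40.0, 50.0, 60.0]
prices = [150.0, 210.0, 240.0, 300.0]

slope, intercept = fit_line(incomes, prices)

# Regression quality: average absolute deviation between prediction and truth.
predictions = [slope * x + intercept for x in incomes]
mae = sum(abs(p - y) for p, y in zip(predictions, prices)) / len(prices)

# New example: use the line to estimate a price for a customer with income 45.
estimate = slope * 45.0 + intercept
print(slope, intercept, mae, estimate)
```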
                      | Predicted Positives (1) | Predicted Negatives (0)
Actually Positive (1) | True Positives (TPs)    | False Negatives (FNs)
Actually Negative (0) | False Positives (FPs)   | True Negatives (TNs)
$\text{Accuracy} = \frac{tp + tn}{tp + tn + fp + fn}$
$\text{Precision} = \frac{tp}{tp + fp}$
$\text{Recall} = \frac{tp}{tp + fn}$
2.2. The Data Mining Process
Data Mining and the Knowledge Discovery in Databases (KDD) Process:
 The process involves extracting valuable patterns from large datasets.
 It begins with data selection and preparation, applies Data Mining algorithms to identify
patterns, interprets these patterns, and incorporates them into existing knowledge.
 It is an iterative cycle aimed at transforming data into useful knowledge.
CRISP-DM, Cross Industry Standard Process for Data Mining:
 Standard model to efficiently guide data mining
projects, promoting a systematic and flexible
approach.
• Business Understanding
• Data Understanding
• Data Preparation
• Modeling
• Evaluation
• Deployment
2.3. Before starting analysis
General aspects of the problem definition:
 Problem definition: emphasizes the need to shift focus from the pursuit of elegant theories
to embracing complexity and leveraging the power of data.
• It highlights the superiority of simple models with abundant data over intricate
models with limited data.
 Key directive is to follow the data:
• Opt for representations suitable for unsupervised learning on unlabelled data.
• Employ nonparametric models to let the data speak for themselves.
 When faced with insufficiently accurate classifiers:
• Design better learning algorithms.
• Pragmatically, gather more data.
 While researchers often prioritize improving algorithms, the quickest path to success
frequently involves acquiring more data.
 Guiding principle: less sophisticated algorithm with ample data outperforms a clever one
with modest amounts, emphasizing the essence of letting data drive the learning process.
Input Space: defined by the input feature vectors.
 Where the algorithms will try to find a solution to the problem
The Curse of Dimensionality:
 Number of Attributes:
• Few Attributes: Inability to distinguish classes.
• Many Attributes:
o Common in Data Mining.
o The curse of dimensionality.
o Difficulty in visualization and emergence of "weird" effects.
• Identifying important vs. redundant attributes: Determining the most crucial
attributes for the task.
 Impact of Dimensionality: As dimensionality increases, space becomes more sparse,
making it challenging to identify groups (requires even more data).
 The Curse of Dimensionality:
• Generalizing correctly becomes exponentially harder as the dimensionality (number
of features) of examples grows.
• A fixed-size training set covers a diminishing fraction of the input space.
• Example: With a dimension of 100 and a massive training set of a trillion examples,
the training set covers only a fraction of about 10^(-18) of the input space.
o This is what makes machine learning both necessary and hard.
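The 10^(-18) figure can be checked with a short calculation. The sketch below assumes that each of the 100 features is binary, so the input space holds 2^100 distinct points; that assumption is not stated in the notes.

```python
# Assumption: 100 binary features, so the input space has 2**100 points.
n_features = 100
n_examples = 10 ** 12  # a trillion training examples

input_space_size = 2 ** n_features
fraction_covered = n_examples / input_space_size

# Prints the covered fraction in scientific notation.
print(f"{fraction_covered:.1e}")
```

Under this assumption the fraction comes out just below 10^(-18), in line with the order of magnitude quoted above.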
Input Space Coverage:
 Problem definition: Good coverage of the problem space increases confidence in the
results and in its quality.
 Space coverage: If a model is developed from one set of examples but the examples it
meets in use are very different, the results will naturally be poor.
 Extrapolation vs Interpolation:
• Interpolation: involves predicting a value inside the domain and/or range of the data.
• Extrapolation: involves predicting a value outside the domain and/or range of the data.
Separability and Bayes Error:
 Separable: zero error is possible.
 Not separable: the error is always greater than zero.
• Bayes error: the lowest possible error for a classifier.
Different Types of Variables:
 Nominal: just labels with no particular order (e.g. ‘red’, ‘green’, ‘blue’)
 Ordinal: have an order (e.g. ‘satisfied’, ‘very satisfied’, ‘extremely satisfied’)
 Discrete: just counting data (e.g. 0, 1, 2, ...)
 Continuous: just measurement data (e.g. 1.23, 0.001, etc)
 Interval data: measured on a scale with constant intervals.
• Distances between values are equal, but the zero point is arbitrary.
• The zero is not meaningful: it does not mean a true absence of something.
 Ratio scaled variable: has a meaningful ratio between two values of a quantitative
variable.
• Ratio measurement assumes a zero point where there is no measurement.
 Metadata: data that provides information about other data
• Descriptive metadata: describes a resource for purposes such as discovery and
identification.
• Structural metadata: describes containers of data and indicates how compound objects
are put together (e.g. how pages are ordered to form chapters).
• Administrative metadata: provides information to help manage a resource, such as
when and how it was created, file type and other technical information, and who can
access it.
Spurious Correlations and Confounding Variables: Input variables should
be causally related to the outputs.
 It is important that there is a plausible reason to choose the input variables.
 Spurious correlations are more likely with:
• A low number of training examples.
• A large number of input variables.
• Example: Correlation between the measuring of a patient’s temperature on admission
to hospital and the probability of his survival
 Confounding variables/ factors: an extraneous variable in a statistical model that
correlates (directly or inversely) with both the dependent variable and the independent
variable.
 A spurious relationship is a perceived relationship between an independent variable and
a dependent variable that has been estimated incorrectly because the estimate fails to
account for a confounding factor.
3. Exploratory Data Analysis
Univariate:
 Categorical: Analysing the distribution and frequency of individual categorical variables.
 Numerical: Examining the distribution, central tendency, and spread of individual
numerical variables.
Bivariate:
 Categorical: Investigating relationships and patterns between two categorical variables.
 Numerical and Numerical: Assessing correlations or dependencies between two
numerical variables.
 Categorical and Numerical: Analysing the impact of a categorical variable on a numerical
variable or vice versa.
3.1. Graphics
Five influential aspects:
 Clarity and Simplicity: Prioritizing clear and simple visualizations for easy
understanding.
 Interactivity: Engaging users with interactive features for dynamic exploration of data.
 Visual Consistency: Maintaining uniformity in color, symbols, and formatting for a
cohesive visual language.
 Storytelling: Crafting a narrative within visualizations guides users through data,
providing context and emphasizing key insights.
 Responsive Design: Ensuring adaptability to different platforms and devices for wider
accessibility.
Tufte’s principles of graphical excellence:
 Give the viewer:
• Greatest number of ideas
• In the shortest time
• Least ink
• Smallest space
 Objective: tell the truth about the data
 “lie factor”: measure of the amount of distortion in a graph
$\text{Lie factor} = \frac{\text{size of effect shown in graphic}}{\text{size of effect in data}}$
• Lie factor > 1: the graph is exaggerating the size of the effect.
• Tufte requirement: 0.95 < Lie Factor < 1.05.
o A value outside this range indicates that the graph distorts the actual size of the
effect in the data.
• Example:
o Line 1: 18.0 miles per gallon in 1978, is 0.6 inches long.
o Line 2: 27.5 miles per gallon in 1985, is 5.3 inches long.
1. $\text{Size of effect shown in graphic} = \frac{5.3 - 0.6}{0.6} = 7.833$
2. $\text{Size of effect in data} = \frac{27.5 - 18.0}{18.0} = 0.528$
3. $\text{Lie factor} = \frac{7.833}{0.528} = 14.8$
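The calculation can be reproduced in a few lines. This sketch uses the figures from the example (line lengths of 0.6 and 5.3 inches, values of 18.0 and 27.5 miles per gallon); the function names are hypothetical.

```python
def relative_change(first, second):
    """Size of an effect, expressed as the change relative to the first value."""
    return (second - first) / first


def lie_factor(graphic_first, graphic_second, data_first, data_second):
    """Ratio between the effect shown in the graphic and the effect in the data."""
    shown = relative_change(graphic_first, graphic_second)
    actual = relative_change(data_first, data_second)
    return shown / actual


factor = lie_factor(0.6, 5.3, 18.0, 27.5)
within_tufte_range = 0.95 < factor < 1.05
print(round(factor, 1), within_tufte_range)
```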
Presenting information:
 Bar chart:
 Stacked Bar chart:
 Line chart:
 Scatter Plot:
 Bubble Graphics:
 Pie charts:
 Paired column chart (alternative to pie chart):
 Stacked bar chart (alternative to pie chart):
 The slope chart (alternative to pie chart):
 Radar:
 Histogram:
 Boxplot:
Correlation Matrices
Parallel Coordinates
Small Multiples:
 A series of similar graphs using the same scale + axes, allowing them to be easily
compared.
 It uses multiple views to show different partitions of a dataset.
Heat maps:
Tree maps:
Linked views:
Geo-visualization:
4. Data Preparation and Preprocessing
4.1. Data Preparation
Objective:
 Transform data sets to best expose their information content.
 The quality of the models should be better (or at least the same) after preparation.
 Good data is a prerequisite for good models.
 Some techniques are theoretically grounded; others are based only on experience.
Signal vs. Noise:
 Noise:
• Refers to fluctuations and external disturbances in the flow of information (signal),
causing undesired disturbances in relevant information.
• Is a disturbance that affects a signal, potentially distorting the carried information.
 Most databases are large and high-dimensional, posing challenges in distinguishing signal
from noise.
 Naively adding more features to a dataset might seem beneficial, but in high dimensions,
it can disrupt similarity-based reasoning used by machine learning algorithms.
 For instance, in a nearest neighbour classifier with irrelevant features, the noise from
these features can overwhelm the signal, leading to effectively random predictions.
Real data suffers from several problems:
 Incomplete: Missing values, lacking attributes of interest, levels of aggregation
 Noisy: Errors and outliers
 Inconsistent:
• E.g. Age=42 Birthday=31/07/1997
• Changes in scales
• Duplicate records with different values
Treatment of missing data:
 Dealing with Missing Data:
• Delete variables (loses information).
• Delete records (may introduce bias).
• Manually enter probable values (tedious + infeasible).
• Fill in with a measure of central tendency (mean, median, mode).
• Fill in with a measure of central tendency of a subset (e.g., men and women).
• Fill in with values from similar individuals (nearest neighbors).
• Use predictive models (linear regression, multiple linear regression).
• Code the missing data explicitly.
 The practical approach is to start with the quickest and simplest option.
 After obtaining preliminary results, compare the model's performance on complete records
with its performance on records whose missing values were estimated.
 If errors are significantly higher for the records with estimated values, consider
alternative methods to improve results.
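Two of the simpler options listed above, filling in with an overall measure of central tendency and filling in with the measure of a subset, can be sketched as follows. The records, field names, and helper functions are invented for illustration.

```python
from statistics import mean

# Invented records; None marks a missing value.
records = [
    {"sex": "F", "age": 30},
    {"sex": "F", "age": None},
    {"sex": "M", "age": 40},
    {"sex": "M", "age": 50},
    {"sex": "M", "age": None},
]


def impute_overall(rows, field, measure=mean):
    """Fill missing values with a central-tendency measure of all known values."""
    fill = measure([r[field] for r in rows if r[field] is not None])
    return [dict(r, **{field: fill if r[field] is None else r[field]}) for r in rows]


def impute_by_group(rows, field, group_field, measure=mean):
    """Fill missing values with the measure computed inside each subgroup."""
    known = {}
    for r in rows:
        if r[field] is not None:
            known.setdefault(r[group_field], []).append(r[field])
    fills = {group: measure(values) for group, values in known.items()}
    return [
        dict(r, **{field: fills[r[group_field]] if r[field] is None else r[field]})
        for r in rows
    ]


overall = impute_overall(records, "age")
by_group = impute_by_group(records, "age", "sex")
print([r["age"] for r in overall])
print([r["age"] for r in by_group])
```

Passing `statistics.median` as `measure` gives the median variant. A group with no known values would raise a `KeyError` in this sketch, so a real pipeline would need a fallback.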
Outlier treatment
 Outliers:
• Observation point significantly distant from others.
• May result from measurement variability or indicate experimental errors, sometimes
excluded from the dataset.
• Extreme cases in one or more variables can strongly impact result interpretation.
• May come from:
o Unusual but correct situations (the Bill Gates effect),
o Incorrect measurements.
o Errors in data collection.
o Lack of code for missing data.
 Remove Data Outliers:
• Automatic limitation (thresholding): Imposing maximum and minimum values for
variables (e.g., age between 0 and 100).
• Normal data distribution: treat values outside the interval average ± 3σ as outliers.
• Cluster Analysis (K-means).
• Self-Organizing Maps.
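The first two options can be sketched in a few lines. The data are invented (twenty plausible ages plus one obvious error), the helper names are hypothetical, and the population standard deviation is an assumed choice.

```python
from statistics import mean, pstdev


def within_limits(values, low, high):
    """Thresholding: keep only values inside imposed limits (e.g. age 0 to 100)."""
    return [v for v in values if low <= v <= high]


def three_sigma_outliers(values, k=3.0):
    """Flag values further than k standard deviations from the average."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumed choice)
    return [v for v in values if abs(v - mu) > k * sigma]


ages = [48, 50, 52, 49, 51, 50, 47, 53, 50, 49,
        51, 50, 48, 52, 50, 49, 51, 50, 50, 50, 500]

flagged = three_sigma_outliers(ages)
cleaned = within_limits(ages, 0, 100)
print(flagged, len(cleaned))
```

With very small samples the 3σ rule may flag nothing, because a single extreme value inflates the standard deviation it is compared against.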
Discretization
 Divide the range of a continuous variable into intervals.
• Some classification algorithms only accept discrete attributes.
• Reduce data size.
• Prepare for further analysis.
 Frequently called binning.
 Equal-width binning
• Divides the range into N intervals of equal size.
• If A and B are the minimum and the maximum values of the attribute, the width of
the intervals will be: w = (B − A)/N
• The simplest method.
• Outliers may dominate.
 Equal-depth binning
• Divides the range into N intervals, each containing approximately the same number
of samples.
• Generally preferred, as it avoids clumping.
• Gives more intuitive breakpoints.
• Should not split frequent values across bins.
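Both binning schemes can be sketched as below. The data and function names are invented, and the equal-depth version is deliberately naive: it does not check whether a frequent value is split across two bins.

```python
def equal_width_edges(values, n_bins):
    """Bin edges for equal-width binning: w = (B - A) / N."""
    a, b = min(values), max(values)
    width = (b - a) / n_bins
    return [a + i * width for i in range(n_bins + 1)]


def equal_depth_bins(values, n_bins):
    """Split the sorted values into bins holding roughly the same number of samples."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


values = [1, 2, 3, 4, 5, 6, 7, 8, 100]  # 100 plays the role of an outlier

edges = equal_width_edges(values, 3)
depth_bins = equal_depth_bins(values, 3)
print(edges)
print(depth_bins)
```

With these values the outlier stretches the equal-width edges so far that eight of the nine samples fall in the first interval, which is the "outliers may dominate" effect; the equal-depth bins stay balanced.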
Entropy (also called Expected Information) based discretization:
1. Sort the examples in increasing order.
2. Each value forms an interval (m intervals).
3. Calculate the entropy measure of each discretization.
4. Find the binary split boundary that minimizes the entropy function over all possible
partitions. The split is selected as a binary discretization.
5. Apply the process recursively until some stopping criterion is met.
 In information theory, entropy is a measure of uncertainty associated with a random
variable or a probability distribution.
• High entropy > high degree of uncertainty, randomness, or unpredictability,
• Low entropy > high degree of certainty or predictability.
 Entropy is typically measured using the formula:
• $E = -p \log_2(p)$
• p = probability that an example belongs to a specific class
 Partition entropy: $Ent(S) = -\sum_{i=1}^{\#C} p_i \log_2(p_i)$
 Gain in choosing attribute A:
• $Gain(Ent_{new}) = Ent_{initial} - Ent_{new}$
• $Gain(S, A) = Ent(S) - \sum_{v \in Values(A)} \frac{\#S_v}{\#S} Ent(S_v)$
 Example:
• $Ent(S) = -\left(\frac{13}{20} \log_2\left(\frac{13}{20}\right) + \frac{7}{20} \log_2\left(\frac{7}{20}\right)\right) = 0.934$
• $\sum_{v \in Values(A)} \frac{\#S_v}{\#S} Ent(S_v) = \frac{1}{2} Ent(Age < 25) + \frac{1}{2} Ent(Age > 25)$
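The entropy and gain formulas can be checked numerically. The 13/7 class split comes from the example; the class counts inside the two age partitions are not given in the notes, so the ones used here are hypothetical.

```python
from math import log2


def entropy(class_counts):
    """Ent(S) = -sum(p_i * log2(p_i)) over the classes present in S."""
    total = sum(class_counts)
    return -sum((c / total) * log2(c / total) for c in class_counts if c > 0)


def information_gain(parent_counts, partitions):
    """Gain(S, A) = Ent(S) - sum(#S_v / #S * Ent(S_v)) over the partitions of S."""
    total = sum(parent_counts)
    weighted = sum(sum(part) / total * entropy(part) for part in partitions)
    return entropy(parent_counts) - weighted


# 20 examples: 13 of one class and 7 of the other, as in the example.
ent_s = entropy([13, 7])

# Hypothetical split into two halves of 10 examples each (Age < 25, Age > 25).
gain = information_gain([13, 7], [[9, 1], [4, 6]])
print(round(ent_s, 3), round(gain, 3))
```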
Imbalanced Learning:
 If the minority class is 1%, it is possible to have a useless classifier with 99% accuracy.
 If the minority class is 0.1%, it is possible to achieve 99.9% accuracy with a trivial
classifier.
 Imbalanced learning problem: defined as a classification task for binary or multi-class
datasets where a significant asymmetry exists between the number of instances for the
various classes.
 Dominant class: the majority class (negative cases); the remaining classes are called
the minority classes (positive cases).
 Imbalance Ratio (IR): ratio between the size of the majority class and the size of the
minority class. It depends on the type of application; for binary problems, values between
100 and 100,000 have been observed.
 Solutions to Imbalanced Learning:
• Modification/creation of algorithms
• Application of cost sensitive methods
• Modification at the data level
Modification/creation of algorithms:
 Standard learning methods assume a balanced class distribution and are therefore biased
towards the majority class.
 Algorithms typically optimize for accuracy, which may not be suitable for imbalanced
datasets.
 Misclassification costs for minority classes are often higher, impacting certain
applications like disease screening tests.
• $\text{Accuracy} = \frac{\text{True Positives} + \text{True Negatives}}{\text{Total Population}}$
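The 99% figure can be reproduced with a trivial classifier that always predicts the majority class. The data below are synthetic (10 positives in 1,000 cases), and returning 0.0 for an undefined ratio is a convention chosen for this sketch.

```python
def confusion_counts(actual, predicted):
    """Count tp, tn, fp, fn for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return tp, tn, fp, fn


def safe_ratio(numerator, denominator):
    """Return 0.0 when the ratio is undefined (assumed convention)."""
    return numerator / denominator if denominator else 0.0


# Synthetic data: the minority (positive) class is 1% of 1,000 cases.
actual = [1] * 10 + [0] * 990
predicted = [0] * 1000  # trivial classifier: always predict the majority class

tp, tn, fp, fn = confusion_counts(actual, predicted)
accuracy = safe_ratio(tp + tn, tp + tn + fp + fn)
recall = safe_ratio(tp, tp + fn)
precision = safe_ratio(tp, tp + fp)
print(accuracy, recall, precision)
```

Accuracy looks excellent while recall shows that not a single positive case was found, which is why accuracy alone is a poor target for imbalanced problems.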
Application of cost-sensitive methods:
 Many classification algorithms assume uniform misclassification costs, but in reality,
costs for minority class misclassifications are often much higher.
 Disease screening tests are an example of a situation where false negatives involve
significantly higher costs than false positives.
Modification at the data level:
 Undersampling: Reducing the size of the majority class to balance the class distribution.
 Oversampling: Increasing the size of the minority class to balance the class distribution.
 Hybrid approaches: Combining undersampling and oversampling techniques to address
imbalanced learning challenges.
SMOTE: Synthetic Minority Over-sampling Technique
1. Randomly select a minority class instance x.
2. Define the set of its k nearest neighbors (xknn).
3. Randomly select another minority class sample x' from the xknn set.
4. Generate Xgen by linear interpolation between x and x', which can be expressed as:
• $X_{gen} = x + a \cdot (x' - x)$, where a is a random number between 0 and 1.
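The four steps can be sketched in plain Python. This is a simplified illustration rather than a reference implementation: the minority points are invented, Euclidean distance is assumed for the neighbour search, and `smote_sample` is a hypothetical helper name.

```python
import math
import random


def smote_sample(minority, k, rng):
    """Generate one synthetic minority example by interpolation."""
    # Step 1: randomly select a minority class instance x.
    x = rng.choice(minority)
    # Step 2: find its k nearest neighbours among the other minority points.
    others = [p for p in minority if p is not x]
    neighbours = sorted(others, key=lambda p: math.dist(p, x))[:k]
    # Step 3: randomly select one neighbour x'.
    x_prime = rng.choice(neighbours)
    # Step 4: X_gen = x + a * (x' - x), with a drawn uniformly from [0, 1).
    a = rng.random()
    return tuple(xi + a * (xpi - xi) for xi, xpi in zip(x, x_prime))


# Invented minority class examples with two features each.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
rng = random.Random(0)  # fixed seed so the sketch is repeatable

synthetic = [smote_sample(minority, k=2, rng=rng) for _ in range(5)]
print(synthetic)
```

Every generated point lies on the segment between two existing minority examples, so the synthetic data stay inside the region the minority class already occupies.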
General aspects of data collection - Use of artificial data:
 It is always preferable to use real data. When artificial data must be used:
• Create data that is as realistic as possible.
• Make the artificial data as representative as possible.
 The quality of the model is constrained by the quality of the data.
 Creating artificial data translates into the introduction of some noise.
4.2. Data Preprocessing
Reasons:
 Noise Reduction.
 Signal amplification.
 Size Reduction of the Input Space:
• Remove correlated variables.
• Remove irrelevant variables.
 Constructing ratios and derived variables.
 Domain-specific knowledge application.
 Normalization.
Curse of dimensionality: the input space grows exponentially with the number of input variables.
 The larger the input space, the more data and computing power we need.
 When the dimensionality increases, the space becomes sparser and it becomes more
difficult to find groups.
Dimensionality reduction of the Input Space (or Feature Selection) principles:
 Irrelevance:
• Eliminating irrelevant features that do not contribute significantly to the predictive
power of the model.
• Focus on retaining only the most meaningful and informative features for the
analysis.
 Redundancy:
• Identifying and removing redundant features with similar or duplicate information.
• Reducing redundancy helps streamline the dataset, making it more efficient for
analysis and modelling.
Feature engineering: involves creating new features or modifying existing ones to enhance the
performance of machine learning models.
Information required to identify customer behaviour:
• Transaction number
• Date and time of transaction
• Item purchased (identified by UPC code)
• Product price (per item/unit)
• Quantity purchased
• Table matching product code to product name, subgroup code to subgroup name,
and product group code to product group name
• Product taxonomy that links product code to subgroup code, and product subgroup code
to product group code
Feature engineering:
• Recency: days since last visit/purchase
• Frequency: number of transactions per customer
• Monetary value: total value of sales ( profit)
• Average purchase: average value of purchases per visit
• Most frequent store
• Average Time Between Transactions: Transaction Interval
• Standard Deviation of Transactional Interval
• Customer Stability Index: Standard Deviation of Transactional Interval / Average Time
Between Transactions
• Relative Spend on Each Product
 Normalized Relative Spend (NRS): ratio between the share (%) that product A represents in
the expenses of customer 1 and the average share it represents across the database.
 Probability of Return: Tn where n is number of purchases made during the period T.
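Several of the features listed above can be derived from a plain transaction log. In this sketch the transactions, the reference date, and the function name `customer_features` are invented, and the population standard deviation is an assumed choice for the transactional interval.

```python
from datetime import date
from statistics import mean, pstdev

# Invented transaction log: (customer id, transaction date, value of the sale).
transactions = [
    ("c1", date(2024, 1, 1), 20.0),
    ("c1", date(2024, 1, 11), 30.0),
    ("c1", date(2024, 1, 31), 10.0),
    ("c2", date(2024, 1, 20), 50.0),
]


def customer_features(rows, customer, today):
    """Derive recency, frequency, monetary and interval features for one customer."""
    own = sorted((d, v) for c, d, v in rows if c == customer)
    dates = [d for d, _ in own]
    values = [v for _, v in own]
    intervals = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]
    features = {
        "recency": (today - dates[-1]).days,   # days since last purchase
        "frequency": len(own),                 # number of transactions
        "monetary": sum(values),               # total value of sales
        "average_purchase": mean(values),      # average value per visit
    }
    if intervals and mean(intervals) > 0:
        features["avg_interval"] = mean(intervals)
        features["std_interval"] = pstdev(intervals)
        # Customer Stability Index: std of interval / average interval.
        features["stability_index"] = features["std_interval"] / features["avg_interval"]
    return features


c1 = customer_features(transactions, "c1", today=date(2024, 2, 10))
c2 = customer_features(transactions, "c2", today=date(2024, 2, 10))
print(c1)
print(c2)
```

Customers with a single transaction have no interval, so the interval-based features are simply left out for them in this sketch.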
Input Space Reduction – Relevancy:
 Heuristic feature selection methods:
• Best single features:
o Choose by information gain measures (e.g. entropy)
o A feature is interesting if it reduces uncertainty.
4.2.1. Filter methods:
Filter Method:
 Used to identify the most important features for a predictive model.
 Use statistical and mathematical techniques to rank or score individual features based on
their intrinsic properties.
 Are computationally less expensive and can be applied before any algorithm is trained.
Correlation-based Feature Selection:
1. Calculate the correlation between each feature and the target variable or between pairs of
features.
2. Select features with high correlation to the target variable or low inter-feature correlation.
 Pros:
• It's a simple and computationally efficient method.
• It's useful for identifying linear relationships between features and the target.
 Cons: may not capture more complex relationships or interactions between features
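A minimal sketch of the two steps, assuming the Pearson coefficient as the correlation measure and an arbitrary threshold of 0.8. The feature names, values, and threshold are invented for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Invented candidate features and target.
features = {
    "income": [1.0, 2.0, 3.0, 4.0, 5.0],
    "shoe_size": [3.0, 1.0, 4.0, 1.0, 5.0],
}
target = [2.1, 3.9, 6.2, 8.0, 9.8]

# Step 1: correlation of each feature with the target.
scores = {name: pearson(values, target) for name, values in features.items()}

# Step 2: keep features whose absolute correlation exceeds the chosen threshold.
threshold = 0.8  # assumed cut-off
selected = [name for name, score in scores.items() if abs(score) > threshold]
print(scores, selected)
```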
Chi-squared Test:
 Used to determine the independence between categorical features and a categorical target
variable.
 Features with a high chi-square statistic are considered more important.
 Pros:
• It's well-suited for categorical data.
• It can help select features that are most relevant to a categorical target.
 Cons: It may not be the best choice for continuous data or when relationships are not
purely categorical.
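The statistic itself can be computed directly from a contingency table of feature categories against target classes. The two tables below are invented: one where the feature and the target are clearly related and one where they are independent.

```python
def chi_squared(table):
    """Chi-squared statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the feature and the target were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Rows: categories of the feature; columns: classes of the target.
related = [[30, 10], [10, 30]]
unrelated = [[20, 20], [20, 20]]

print(chi_squared(related), chi_squared(unrelated))
```

The higher the statistic, the stronger the evidence against independence, so the first feature would be ranked above the second.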
Information Gain and Mutual Information:
 Measure the reduction in uncertainty about the target variable when you know the value
of a feature.
 Features with high information gain or mutual information are considered more
important.
 Pros:
• They are suitable for both categorical and continuous features.
• They can capture non-linear relationships.
 Cons: They may overestimate the importance of noisy features.
ANOVA (Analysis of Variance):
 Tests whether there are significant differences in the means of a numerical feature among
the different classes or categories of the target variable.
 Features with a low p-value are considered more important.
 Pros: It is useful for identifying numerical features whose means differ significantly
across the classes of the target variable.
 Cons: It is not suited to numerical target variables and may not capture non-linear
relationships.
4.2.2. Wrapper methods
Wrapper method:
 Identify the most important features for a predictive model.
 Evaluate model performance on different feature subsets and select the best one.
 Primary goal: identify the combination of features that maximizes the model's predictive power.
 Are computationally expensive and typically involve using a specific machine learning
algorithm to evaluate subsets of features.
Forward Selection:
 Starts with an empty set of features and iteratively adds one feature at a time.
 In each iteration, the algorithm selects the feature that improves the model's performance
the most.
 Pros: it is a simple and intuitive method that can quickly identify important features.
 Cons:
• It may not always find the optimal subset
• It can be time-consuming for datasets with many features.
Backward Elimination:
 Backward elimination starts with all features and iteratively removes one feature at a
time.
 In each iteration, the algorithm removes the feature that has the least impact on the
model's performance.
 Pros: It can be more efficient than forward selection for large datasets.
 Cons: Like forward selection:
• It may not always find the optimal subset
• The choice of the initial feature set can influence results.
Recursive Feature Elimination (RFE):
 More systematic approach to feature selection.
 It starts with all features and recursively eliminates the least important features based on
a specified model (e.g., using a machine learning algorithm).
 Continues until the desired number of features is reached.
 Pros:
• It provides a systematic way to reduce the feature space.
• It can be used with a variety of machine learning models.
 Cons: It can be computationally expensive for large datasets.
Variance Thresholding:
 Remove features with low variance.
 Features with very little variation across the dataset are unlikely to provide valuable
information.
 Pros: It's a simple and quick method for reducing the dimensionality of the dataset.
 Cons: It doesn't consider the relationship between features and the target variable.
4.2.3. Data Standardization
Standardization or Normalization:
 Models assume that the distances in different directions of the input space have the same
importance.
 Variables come in many different scales (percentages, euros, kilos, meters, days…)
 Normalization: adjusting values measured on different scales to a common scale
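 Min-max: $y' = \frac{y - min_1}{max_1 - min_1}\,(max_2 - min_2) + min_2$, where $[min_1, max_1]$ is the original range and $[min_2, max_2]$ is the target range.
 Z-score: $y' = \frac{y - \bar{y}}{std}$, where $\bar{y}$ is the mean and $std$ the standard deviation of the variable.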
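As a small sketch of the two rescaling formulas, assuming a non-constant variable (so that the range and the standard deviation are not zero) and using the population standard deviation, which is a choice made here and not stated in the notes:

```python
from statistics import mean, pstdev


def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values from [min(values), max(values)] to [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    span = old_max - old_min  # assumed to be non-zero
    return [(v - old_min) / span * (new_max - new_min) + new_min for v in values]


def z_score(values):
    """Centre on the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)  # sigma assumed to be non-zero
    return [(v - mu) / sigma for v in values]


incomes = [1000.0, 2000.0, 4000.0]  # hypothetical values in euros
scaled = min_max(incomes)
standardized = z_score(incomes)
print(scaled)
print(standardized)
```

After either transformation, a variable measured in euros and one measured in days contribute on comparable scales to a distance computation.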
5. Data Segmentation Strategies
Segmentation strategies:
 Cohort analysis
 Cell-based
 RFM analysis
5.1. Cohort analysis
Cohort analysis: involves grouping and analysing data based on shared characteristics or
experiences within a specific period.
 Primary objective: uncover patterns, trends, or behaviours that are unique to specific
cohorts.
 It helps in understanding how groups of individuals or entities evolve over time,
providing valuable insights into their collective actions and responses.
 Used to gain a deeper understanding of their data, enabling more targeted and effective
decision-making in various domains, especially in marketing and user behaviour analysis.
Applications:
 Customer Behaviour Analysis:
• Categorizes customers into cohorts based on characteristics like acquisition date or
first purchase.
• Enables customized marketing strategies for specific customer groups.
 Product Performance:
• Applied in product or service analysis.
• Shows how different launches or updates impact user engagement and satisfaction.
 Employee Retention:
• Tracks employee retention rates based on hiring time.
• Identifies patterns related to employee tenure and turnover.
Process:
 Segmentation: Data is grouped into distinct cohorts based on a chosen characteristic or
event.
 Analysis: Trends and patterns within each cohort are examined to understand their unique
behaviors.
 Comparison: Cohorts are compared to identify differences or similarities in their
responses or outcomes.
 Decision-Making: Insights from cohort analysis inform strategic decision-making,
enabling businesses to customize interventions or strategies based on specific cohort
characteristics.
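The segmentation, analysis and comparison steps can be illustrated with a tiny retention computation. The purchase log below is invented for illustration, and defining the cohort as the month of the first purchase is one possible choice among the characteristics mentioned above.

```python
from collections import defaultdict

# Hypothetical purchase log: (customer_id, month_index), with month 0 = first month observed.
purchases = [
    ("c1", 0), ("c1", 1), ("c1", 2),
    ("c2", 0), ("c2", 2),
    ("c3", 1), ("c3", 2),
    ("c4", 1),
]

# Segmentation: the cohort of a customer is the month of the first purchase.
first_month = {}
for customer, month in sorted(purchases, key=lambda p: p[1]):
    first_month.setdefault(customer, month)

# Analysis: customers of each cohort that are active k months after acquisition.
active = defaultdict(set)
for customer, month in purchases:
    cohort = first_month[customer]
    active[(cohort, month - cohort)].add(customer)

cohort_size = {c: len(active[(c, 0)]) for c in set(first_month.values())}

# Comparison: retention rate per (cohort, months since acquisition).
retention = {
    (cohort, period): len(customers) / cohort_size[cohort]
    for (cohort, period), customers in active.items()
}
for key in sorted(retention):
    print(key, retention[key])
```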
6. Data Clustering
6.1. Cluster analysis
Cluster Analysis: generic name for a variety of methods that are used to group entities.
 A basic conceptual activity of human beings.
 A fundamental process, essential to the development of scientific theories.
 The ability to reduce the infinite complexity of reality to sets of similar objects or phenomena is one of the most powerful tools in the service of mankind.
 Objective: to form groups of objects that are similar to each other.
 Starting from data collected about a group of entities, it seeks to organize them into homogeneous groups by assessing a "frame" of similarities and differences between units.
Classification:
 Starts out with a pre-classified training set:
• Method has a set of data which contains not only the variables to use in classification
but also the class to which each of the records belongs.
 Attempts to develop a model capable of predicting how a new record will be classified.
Clustering:
 There is no pre-classified data.
 We search for groups of records (clusters) that are similar to one another.
 Underlying is the expectation that similar customers in terms of the variables used will
behave in similar ways.
Grouping:
 Clustering:
• Hierarchical methods
• Partition methods
• DBSCAN
• Mean shift
• Self-organizing maps
 Cell-based segments:
• Variables
• Through time
• RFM analysis
Four basic stages characterize all studies involving cluster analysis:
1. Variables to use: Defining a set of variables over which we assess the similarity
/dissimilarity of the entities.
• Objective of the segmentation:
o Value/Engagement;
o Needs;
o Behaviors/Consumption;
o GeoDemographics/Socio-economic characteristics.
• Deciding which variables to use:
o The type of problem determines the variables to choose;
o If the purpose is to group objects, the choice of variables with discrimination
ability is crucial;
o The quality of any cluster analysis is heavily influenced by the variables used.
o The choice of variables should replicate a theoretical context, a reasoning;
o This process is carried out based on a set of variables that we know to be good
discriminators for the problem at hand;
o First of all, the quality of the cluster analysis reflects the discrimination ability of
the variables we decided to use in our study
2. Similarity criterion: Defining a similarity / dissimilarity criterion between entities (data
normalization)
3. Algorithm: Defining a clustering algorithm to create groups of similar entities;
4. Profiling: Analysis and validation of the resulting solution
6.2. Similarity Measurements
Metric Axioms in Mathematics:
 In mathematics, a genuine measure of distance, known as a metric, satisfies three fundamental properties (the metric axioms), where $d_{ab}$ denotes the distance between objects a and b:
• The measure is symmetric: $d_{ab} = d_{ba}$
• Distances are always positive except when the objects are identical: $d_{ab} \geq 0$, with $d_{ab} = 0$ if and only if $a = b$
• Triangle inequality: $d_{ab} \leq d_{ac} + d_{cb}$
Euclidean distance: the distance between two elements (i, j) is the square root of the sum of the squared differences between the values of i and j over all variables (v = 1, 2, ..., p):
 Euclidean (L2): $d_{ij} = \sqrt{\sum_{v=1}^{p} (x_{iv} - x_{jv})^2}$
 City block (L1): $d_{ij} = \sum_{v=1}^{p} |x_{iv} - x_{jv}|$
 Weighted Euclidean distance: a weight is assigned to each variable according to its importance for the analysis:
• $d_{ij} = \sqrt{\sum_{v=1}^{p} w_v (x_{iv} - x_{jv})^2}$
Minkowski distance:
 Is defined from the absolute differences between the values of the two elements.
 Can be considered a generalization of both the Euclidean and the Manhattan (city block) distances.
 It coincides with the Euclidean distance when r = 2 and with the Manhattan distance when r = 1:
• $d_{ij} = \left(\sum_{v=1}^{p} |x_{iv} - x_{jv}|^r\right)^{1/r}$
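The distance measures above translate almost directly into code. The following is a small sketch with hypothetical function names, in which a single Minkowski function yields the city block distance for r = 1 and the Euclidean distance for r = 2:

```python
def minkowski(x, y, r):
    """Minkowski distance between two equally long numeric vectors."""
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1 / r)


def weighted_euclidean(x, y, w):
    """Euclidean distance in which variable v is weighted by w[v]."""
    return sum(wv * (a - b) ** 2 for wv, a, b in zip(w, x, y)) ** 0.5


p, q = (0.0, 0.0), (3.0, 4.0)
manhattan = minkowski(p, q, r=1)   # city block (L1)
euclidean = minkowski(p, q, r=2)   # Euclidean (L2)
weighted = weighted_euclidean(p, q, w=(1.0, 0.0))  # second variable is ignored
print(manhattan, euclidean, weighted)
```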
Pearson correlation coefficient: measures the degree of linear correlation between two elements across the variables:
 $\rho(x_i, x_{i'}) = \frac{\sum_j (x_{ij} - \bar{x}_i)(x_{i'j} - \bar{x}_{i'})}{\sqrt{\sum_j (x_{ij} - \bar{x}_i)^2 \sum_j (x_{i'j} - \bar{x}_{i'})^2}}$
6.3. Cell-based segments
Cell-based Segments - Percentiles and Quartiles:
 Two-way: Segments data based on two specific factors or variables.
• Purpose: Identify relationships or patterns between two variables, providing insights
into their joint impact.
 Over Time: Examines changes in segments over different time intervals.
• Application: Useful for analysing trends, seasonality, or shifts in patterns over time.
 A percentile is a statistical measure indicating the value below which a given percentage
of observations in a group fall.
 Quartiles:
• 25th percentile: first quartile (Q1).
• 50th percentile: median or second quartile (Q2).
• 75th percentile: third quartile (Q3).
 Relationship: In general, percentiles and quartiles are
specific types of quantiles, providing insights into the
distribution of observations in a dataset.
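A two-way, quartile-based segmentation can be sketched with the standard library. The data are invented, `quartile_of` is a hypothetical helper, and `statistics.quantiles` is used with its default interpolation method, which is only one of several conventions for computing quartiles.

```python
from statistics import quantiles


def quartile_of(value, cuts):
    """Return 1 to 4: the quartile segment that a value falls into."""
    return 1 + sum(value > c for c in cuts)


# Hypothetical customers described by two variables.
spend = [12, 15, 18, 20, 22, 25, 30, 35, 40]
visits = [1, 1, 2, 2, 3, 4, 4, 6, 9]

spend_cuts = quantiles(spend, n=4)    # the three cut points Q1, Q2, Q3
visit_cuts = quantiles(visits, n=4)

# Two-way cells: each customer gets a (spend quartile, visits quartile) pair.
cells = [
    (quartile_of(s, spend_cuts), quartile_of(v, visit_cuts))
    for s, v in zip(spend, visits)
]
print(spend_cuts, visit_cuts)
print(cells)
```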
RFM Analysis:
 Factors Considered:
• Recency: How recently a customer made a purchase.
• Frequency: How often a customer makes a purchase.
• Monetary Value: The monetary amount spent by a customer.
 Purpose: Evaluates and categorizes customers based on their transaction history.
 Application: Commonly used in retail and e-commerce to identify and target high-value
customers for personalized marketing strategies
 Based on the following principles:
• Customers who have purchased more recently are more likely to purchase again.
• Customers who have made more purchases are more likely to purchase again.
• Customers who have made larger purchases are more likely to purchase again.
 Popular because of its simplicity, low cost, and ability to classify customers based on their behaviour.
 Provides the opportunity to conduct tests in small, representative groups for each cell.
 While more sophisticated modelling is often better, RFM Analysis remains valuable due
to its simplicity and cost-effectiveness.
 There are two methods:
• Exact quintiles:
1. Sort the database by recency and divide it into 5 quintiles (5 equal segments).
2. Repeat the sorting process for the frequency and monetary variables.
3. Result: 125 cells of equal size (5*5*5).
• Hard coding:
o Categories are defined by exact values (e.g. 0-3 months, 4-6 months, 7-9 months).
o More expensive in terms of programming, as it involves specifying exact values for each category.
o Categories tend to change over time.
o Significant variations in quantities may exist from cell to cell, impacting the uniformity of the categories.
6.4. Clustering Techniques
6.4.1. Hierarchical Clustering:
Distance Matrix: Hierarchical clustering uses a distance matrix to represent the pairwise distances between data points.
Linkage or Aggregation Rules:
 Implements linkage or aggregation rules to determine how clusters are formed.
 Include single linkage, complete linkage, and average linkage.
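To make the three linkage rules concrete, the sketch below computes the distance between two small clusters under each rule. The function name and the example points are hypothetical; Euclidean distance between points is assumed.

```python
from itertools import product
from math import dist


def linkage_distance(cluster_a, cluster_b, rule="single"):
    """Distance between two clusters under a given linkage rule."""
    pair_distances = [dist(a, b) for a, b in product(cluster_a, cluster_b)]
    if rule == "single":      # closest pair of points
        return min(pair_distances)
    if rule == "complete":    # farthest pair of points
        return max(pair_distances)
    if rule == "average":     # mean over all pairs
        return sum(pair_distances) / len(pair_distances)
    raise ValueError(f"unknown linkage rule: {rule}")


a = [(0.0, 0.0), (1.0, 0.0)]
b = [(3.0, 0.0), (5.0, 0.0)]
single = linkage_distance(a, b, "single")
complete = linkage_distance(a, b, "complete")
average = linkage_distance(a, b, "average")
print(single, complete, average)
```

An agglomerative algorithm repeatedly merges the two clusters for which this distance is smallest, so the choice of rule changes which clusters are merged first.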
Dendrogram: tree-like diagram representing the hierarchical relationships between clusters.
 Illustrates the sequence of cluster formations and their relationships.
Disadvantages:
 Each step (a merge or a separation) is irreversible.
 The strictness is beneficial in terms of computational costs, avoiding the complexity of
various combinatorial choices. However, this cost efficiency is linked to the inability to
correct wrong decisions once the clustering process is initiated.
 Due to its calculation requirements, hierarchical clustering may not perform optimally
with large datasets.
 Several operations involving large matrices may encounter limitations in terms of
efficiency and effectiveness.
Ways to Improve Hierarchical Clustering Performance:
 Analysis of Links:
• Conduct a careful analysis of the links produced in each hierarchical partition.
• Methods like CURE and Chameleon emphasize a detailed examination of the linkage
structures to enhance clustering outcomes.
 Integration with Optimization:
• Integrate hierarchical clustering with optimization techniques.
• Employ an agglomerative algorithm initially, followed by refining the results
through iterative optimization.
• The BIRCH method exemplifies this approach, combining hierarchical clustering
with iterative optimization to improve overall performance.
6.4.2. Partitioning Methods (k-means and k-medoids)
k-means Algorithm:
 Data Representation: Let the set of data points (or instances) D be {x1, x2, …, xn}
• $x_i$ is a vector in a real-valued space $X \subseteq \mathbb{R}^r$
• r is the number of attributes.
 The k-means algorithm partitions the given data into k clusters.
• Each cluster has a cluster centre, called centroid.
• k is specified by the user
• k ≪ n.
 Classifies the data into K groups, by satisfying the following requirements:
• each group contains at least one point.
• each point belongs to exactly one cluster.
 Partitioning Process:
1. Given k, the algorithm creates an initial partition (typically randomly).
2. Utilizes an iterative relocation technique to improve the partition by moving objects
between clusters.
• The criterion for a good partitioning is that objects within the same cluster should be
close or related to each other.
 Algorithm:
1. Choose the seeds.
2. Associate each individual with the nearest seed.
3. Calculate the centroids of the formed clusters.
4. Repeat steps 2 and 3.
5. End when the centroids cease to be recentred.
 Objective: Minimize intra-group variance (sum of squared error).
$SSE = \sum_{i=1}^{K} \sum_{x \in C_i} dist^2(m_i, x)$
• x is a data point in cluster $C_i$.
• $m_i$ is the representative point (centroid) of cluster $C_i$.
• One easy way to reduce SSE is to increase K (the number of clusters).
• A good clustering with smaller K can have a lower SSE than a poor clustering with higher K.
 Strengths:
• Simple: easy to understand and to implement
• Efficient: Time complexity O(tkn),
o where n is the number of data points,
o k is the number of clusters, and
o t is the number of iterations.
• Since both k and t are small, k-means is considered a linear algorithm.
• K-means is the most popular clustering algorithm.
• Note that: it terminates at a local optimum if SSE is used. The global optimum is
hard to find due to complexity.
 Weaknesses:
• Very sensitive to the existence of outliers.
• Very sensitive to the initial positions of the seeds.
• Partitioning methods work well with spherical-shaped clusters.
• Partitioning methods are not the most suitable to find clusters with complex shapes
and different densities.
• The need to set from the start the number of clusters to create.
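The k-means steps and the SSE objective described above can be sketched in plain Python. This is an illustrative version under simplifying assumptions: the seeds are passed in explicitly instead of being drawn at random (to keep the example reproducible), Euclidean distance is used, and an empty cluster simply keeps its previous centroid.

```python
from math import dist


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        for p in points
    ]


def k_means(points, seeds, max_iter=100):
    centroids = [tuple(s) for s in seeds]            # step 1: choose the seeds
    for _ in range(max_iter):
        labels = assign(points, centroids)           # step 2: nearest seed
        new_centroids = []
        for i, old in enumerate(centroids):          # step 3: recompute centroids
            members = [p for p, lab in zip(points, labels) if lab == i]
            new_centroids.append(
                tuple(sum(c) / len(members) for c in zip(*members)) if members else old
            )
        if new_centroids == centroids:               # step 5: centroids stopped moving
            break
        centroids = new_centroids                    # step 4: repeat
    labels = assign(points, centroids)
    sse = sum(dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return centroids, labels, sse


points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, labels, sse = k_means(points, seeds=[points[0], points[3]])
print(centroids, labels, sse)
```

Running the function with different seeds and keeping the solution with the lowest SSE is a simple way to reduce the sensitivity to initialization.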
K-Means Algorithm in figures:
 Individuals are measured on two variables; the goal is to group them in homogeneous sets.
 Seeds are chosen randomly.
 Individuals are allocated to the nearest seed.
 Each seed is recentred so that it stays in the centre of its cloud of points (the centroid).
 Some individuals change cluster.
 Final solution.
 Movement of the centroids during the optimization process.
K-means and k-medoids algorithms:
 Most algorithms adopt one of two very popular heuristics:
• k-means algorithm, where each cluster is represented by the average of the values of
the points in a cluster.
• k-medoids algorithm, where each cluster is represented by one of the points located
near the centre of the cluster.
The initialization problem:
 The algorithm is sensitive to initial seeds.
• Use multiple forms of initialization.
• Re-initialize several times.
• Use more than one method.
• Use a relatively large number of clusters and proceed to their regrouping by the
choice of centroids.
Weaknesses:
 Have difficulty dealing with clusters of different sizes and densities.
 Each individual either belongs or does not belong to the cluster, having no notion of
probability of belonging.
• In other words, there is no consideration of the quality of the representation of a
particular individual in a given cluster.
The number of clusters:
 Determining the appropriate number of clusters is inherently challenging with no fixed
solutions.
 Approaches to Minimize the Problem:
• Multiple Classifications:
o One approach is to create various classifications with different values of K.
o Evaluate and choose the best-suited number of clusters based on performance
metrics.
• Hierarchical Method:
o Utilize a hierarchical method to determine the number of clusters.
o Examine the dendrogram.
o Select an optimal number of clusters based on dendrogram analysis.
 The choice should be guided by four fundamental criteria:
• Intra-cluster variance (Elbow Curve Method).
• Silhouette analysis.
• Evaluation of the profile of the clusters (subjective).
• Operational considerations.
 Varying k (Number of Clusters):
• Test results by varying k.
• Allows for analyses guiding the choice of the number of clusters.
 Comparing Distances:
• Compare total distances of different solutions.
• Assess the impact of varying the number of clusters on distances.
 Operational Considerations:
• Business environment factors influence decisions:
o Choose a number small enough for specific strategy development.
o Have a sufficiently large number of individuals to justify a specific strategy.
• Use a high initial k and proceed with cluster grouping.
 Evaluation Through Profiling:
• Compare mean values of each variable in each cluster with mean values of the
population.
• Highlight significant differences between the cluster means and the population means.
• Emphasizes the importance of profiling in the evaluation process.
Intra-cluster variance (Elbow Curve Method):
 Define clusters in a way that minimizes the total intra-cluster variation, or total within-cluster sum of squares (WSS).
 Allows the comparison of different values of k (e.g., from 1 to 10).
 Total WSS: Measures the compactness of the clustering, aiming for it to be as small as
possible.
 Process:
• Plot the total WSS for different values of k.
• Look for the elbow point where the rate of decrease in WSS shifts.
• Initially, adding clusters provides substantial information about variance.
• At a certain point, the marginal gain diminishes, creating an angle in the graph.
• The number of clusters is chosen at this elbow point, leading to the "elbow criterion."
 Ambiguity: The identification of the "elbow" may not always be unambiguous and
requires subjective judgment.
Silhouette Analysis:
 The silhouette coefficient (or silhouette score) measures how similar a data point is to its own cluster (cohesion) compared with other clusters (separation).
 The silhouette coefficient of a particular data point i is:
$S(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$
• S(i) is the silhouette coefficient of the data point i.
• a(i) is the average distance between i and all the other data points in the cluster to
which i belongs.
• b(i) is the average distance from i to the points of the nearest cluster to which i does not belong.
 Process:
1. Calculate the average silhouette for every k solution.
2. Then plot the graph between average_silhouette and K.
3. The value of the silhouette coefficient is between [-1, 1].
4. A score of 1 denotes the best, meaning that the data point i is very compact within
the cluster to which it belongs and far away from the other clusters.
5. The worst value is -1. Values near 0 denote overlapping clusters.
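The definition of S(i) can be checked on a toy example. The sketch below assumes Euclidean distance, at least two clusters, and the common convention that a point alone in its cluster gets a score of 0; the function name and data are hypothetical.

```python
from math import dist


def silhouette_scores(points, labels):
    """Silhouette coefficient S(i) for every point, given cluster labels."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for single-point clusters
            continue
        # a(i): mean distance to the other points of the same cluster
        # (dist(p, p) is 0, so summing over the whole cluster is harmless).
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b(i): smallest mean distance to the points of any other cluster.
        b = min(
            sum(dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        scores.append((b - a) / max(a, b))
    return scores


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_scores(points, [0, 0, 1, 1])   # compact, well separated clusters
bad = silhouette_scores(points, [0, 1, 0, 1])    # clusters that cut across the groups
good_avg = sum(good) / len(good)
bad_avg = sum(bad) / len(bad)
print(good_avg, bad_avg)
```

Computing the average score for several candidate values of k and comparing them corresponds to steps 1 and 2 of the process above.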
6.4.3. Density-based clustering (DBSCAN)
Traditional clustering methods:
 Have limitations and assumptions that may lead to suboptimal results.
 They tend to identify only convex regions accurately and struggle with noise or outliers within clusters.
Density-based clustering algorithms: a cluster is defined as a group of density-connected points.
 Aim to identify clusters without assuming a specific shape, addressing challenges like
clusters with arbitrary shapes (e.g., ring structures).
 These algorithms model clusters as dense regions in data space, separated by sparse
regions, allowing for the identification of clusters with varying shapes and densities.
 We want to be able to cluster data like this:
Characteristics:
 Density: the number of points in the neighbourhood of a given point.
 Physical Intuition: Our physical intuition regarding density involves considering a point
as part of a dense region if there are numerous points in its vicinity.
 Measuring Density:
• Density around a point is quantified by the number of neighbouring points.
• Traditional topological definition of neighbourhood is used.
• The ε-neighbourhood of a point p is the space within a radius ε > 0 centred at p.
Density-Based Clustering Based on Connected Regions with High Density
 DBSCAN takes two input parameters:
• ε - the radius defining the neighbourhood.
• MinPts - the minimum number of points in the ε-neighbourhood.
 DBSCAN important concepts:
• A point z that has at least MinPts points in its ε-neighbourhood is called a core point.
• x is a border point if the number of its neighbours is less than MinPts but it belongs to the ε-neighbourhood of some core point z.
• If a point is neither a core nor a border point, it is called a noise point or an outlier.
• Example: Assuming MinPts = 6
o x is a core point because its ε-neighbourhood contains 6 points,
o y is a border point because its ε-neighbourhood contains fewer than MinPts points, but it belongs to the ε-neighbourhood of the core point x.
o z is a noise point.
 Key Terms:
• Directly Density Reachable:
o A point "A" is directly density reachable from another point "B" if "A" is in the ε-neighbourhood of "B" and "B" is a core point.
o Specifies the direct relationship between two points, where one is within the ε-neighbourhood of the other and the latter is a core point.
o This condition is crucial in determining direct density reachability in the DBSCAN algorithm.
• Density Reachable:
o A point "A" is density reachable from "B" if there exists a chain of core points that leads from "B" to "A", each point being directly density reachable from the previous one.
o Describes the accessibility of one point from another based on the presence of core points.
• Density Connected:
o Two points "A" and "B" are density connected if there exists a core point "C" such that both "A" and "B" are density reachable from "C".
o Indicates a connection between two points through a common core point, emphasizing the role of core points in establishing connections.
DBSCAN Algorithm:
1. Initialization:
• Begins with an unvisited arbitrary data point.
• Extracts the ε-neighbourhood of this point (points within ε distance).
2. Cluster Formation:
• Checks if there are enough points (according to MinPts) within the neighbourhood.
• If sufficient points exist, initiates the clustering process.
• The current data point becomes the first point in the new cluster.
• If the number of points is insufficient, labels the point as noise.
• Marks the point as "visited."
3. Expanding the Cluster:
• For the first point in the new cluster, includes points within its ε-distance
neighbourhood into the same cluster.
• Repeats the procedure for all newly added points to ensure their ε-neighbourhood
points also belong to the cluster.
4. Iteration:
• Repeats steps 2 and 3 until all points in the cluster are determined.
• Ensures all points within the ε-neighbourhood of the cluster have been visited and
labelled.
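The procedure above can be sketched as a short function. This is a simplified illustration rather than a reference implementation: it assumes Euclidean distance, counts the point itself as part of its own ε-neighbourhood (conventions differ on this), and uses a brute-force neighbour search.

```python
from math import dist

NOISE = -1


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbours(i):
        # ε-neighbourhood of point i (the point itself is included).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE          # may later be relabelled as a border point
            continue
        cluster += 1                   # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:   # j is a core point: keep expanding
                queue.extend(reachable)
    return labels


points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
    (5.0, 5.0), (5.0, 6.0), (6.0, 5.0), (6.0, 6.0),
    (20.0, 20.0),
]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

The isolated point receives the noise label, and the number of clusters is a result of the procedure rather than an input.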
Advantages
 It does not require a pre-set number of clusters.
 It identifies outliers as noise and is not affected by them.
 It can find arbitrarily sized and arbitrarily shaped clusters quite well.
Disadvantages
 Does not perform well when the clusters are of varying density.
 The appropriate distance threshold ε and MinPts for identifying neighbourhood points vary from cluster to cluster when the density varies.
 Does not work well in high-dimensional spaces.
6.4.4. Mean-shift algorithm
Characteristics:
 Mean-shift clustering is a sliding-window-based algorithm utilizing Kernel Density
Estimation (KDE).
 Aim: Identify dense areas within the data points.
 Centroid-Based: Seeks to locate centre points (modes) for each group or class.
• Operates by updating candidate centre points to be the mean of points within the
sliding window.
• Candidate windows undergo post-processing to eliminate near-duplicates, forming
the final set of centre points and their corresponding groups.
 Cluster: all the data points in the attraction basin of a mode
 Attraction basin: the region where all trajectories lead to the same mode
Mean-Shift Clustering Algorithm:
1. Initialization:
• Starts with a circular sliding window centred at a randomly selected point C.
• The window has a radius r, serving as the kernel.
2. Iterative Shifting:
• At each iteration, shifts the sliding window towards regions of higher density.
• The centre point of the window is shifted to the mean of the points within the
window.
• Gradually moves towards areas of higher point density.
3. Continued Shifting:
• Continues shifting the sliding window according to the mean until no direction
allows further accommodation of points inside the kernel.
4. Multiple Windows:
• Repeats the process with many sliding windows until all points lie within a window.
• When multiple sliding windows overlap, preserves the window containing the most
points.
• Clusters data points according to the sliding window in which they reside.
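A compact sketch of the sliding-window idea follows. It makes several simplifying assumptions that are not part of the description above: a flat (uniform) kernel, one window started at every data point, and a merge rule that treats converged windows closer than half the radius as the same mode instead of keeping the window with the most points.

```python
from math import dist


def mean_shift(points, radius, tol=1e-6, max_iter=100):
    """Return the modes found and, for each point, the index of its mode."""
    modes, labels = [], []
    for start in points:
        centre = start
        for _ in range(max_iter):
            window = [p for p in points if dist(p, centre) <= radius]
            if not window:
                break
            # Shift the window centre to the mean of the points inside it.
            new_centre = tuple(sum(c) / len(window) for c in zip(*window))
            shift = dist(new_centre, centre)
            centre = new_centre
            if shift < tol:
                break
        # Post-processing: merge windows that converged to (almost) the same place.
        for k, mode in enumerate(modes):
            if dist(mode, centre) < radius / 2:
                labels.append(k)
                break
        else:
            modes.append(centre)
            labels.append(len(modes) - 1)
    return modes, labels


points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
    (5.0, 5.0), (5.0, 6.0), (6.0, 5.0), (6.0, 6.0),
]
modes, labels = mean_shift(points, radius=2.0)
print(modes, labels)
```

The only tuning parameter is the window radius, which is why its selection appears both as an advantage (one parameter) and as a disadvantage (hard to choose).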
Advantages
 Does not assume shape on clusters.
 One parameter choice (window size).
 Robust to outliers.
 Generic technique.
 Find multiple modes.
Disadvantages
 Selection of window size
 Does not scale well with dimension of feature space
 Computationally (relatively) expensive
6.4.5. Self-Organizing Maps
Self-Organizing Maps (SOM):
 Unsupervised neural networks closely related to clustering.
 Inputs connect to a two-dimensional (or multidimensional) matrix of neurons.
 Each neuron is connected to its neighbours.
Use Cases:
 Multidimensional data visualization.
 Cluster detection.
 Market segmentation.
 Outlier detection.
 Solving problems like TSP, robot control, alarm detection, and more.
Training Process:
 Each neuron is a vector in the input space.
 During training, neurons are adjusted to the positions of input data, influencing their
neighbours.
 The map acts like a rubber sheet, stretched and twisted to fit (or be near) the data patterns.
Pattern Recognition:
 Input patterns are compared with all neurons.
 The closest neuron is the winner, and the input pattern is mapped to it.
 The winning neuron is updated to resemble the input pattern, and its neighbours are also
updated.
Quantization Error: There's a slight difference between the data and the neurons representing
them, known as the quantization error.
Comparisons with biology
 Biological systems have to use some kind of self-organization and adaptation.
 There is evidence of:
• Layered structure in the brain.
• Those layers seem to spatially organize the information.
• Similar “Concepts” are mapped to adjacent areas.
• Experimental work with animals points to an organization similar to SOM in the visual cortex.
SOM Algorithm:
0. Randomly initialize the weights $w_{ij}$.
• Set the neighbourhood topological parameters.
• Set the learning rate.
1. While the stop condition is false, do steps 2-7.
2. For each input vector x, do steps 3-5.
3. For each unit j, compute:
• $D(j) = \sum_i (w_{ij} - x_i)^2$
4. Find the unit J that minimizes D(j).
5. For every unit j within the predefined neighbourhood of J, and for all i:
• $w_{ij}(new) = w_{ij}(old) + \alpha [x_i - w_{ij}(old)]$
6. Update the learning rate.
7. Update (reduce) the radius of the topological neighbourhood.
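The numbered steps can be sketched for a one-dimensional map as follows. The decay schedule (learning rate multiplied by 0.9 per epoch, radius reduced by one every 10 epochs), the fixed number of epochs used as the stop condition, and all names are assumptions chosen for illustration.

```python
import random
from math import dist


def train_som(data, n_units, epochs=30, alpha=0.5, radius=2, seed=0):
    """Train a 1-D SOM; returns one weight vector per unit."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Step 0: random initial weights in [0, 1).
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_units)]
    for epoch in range(epochs):                        # step 1 (stop: fixed epochs)
        for x in data:                                 # step 2
            # Steps 3 and 4: squared distance D(j) and the winning unit.
            d = [sum((w_i - x_i) ** 2 for w_i, x_i in zip(w, x)) for w in weights]
            winner = d.index(min(d))
            # Step 5: move the winner and its map neighbours towards x.
            lo, hi = max(0, winner - radius), min(n_units - 1, winner + radius)
            for j in range(lo, hi + 1):
                weights[j] = [w_i + alpha * (x_i - w_i) for w_i, x_i in zip(weights[j], x)]
        alpha *= 0.9                                   # step 6
        if (epoch + 1) % 10 == 0:                      # step 7
            radius = max(0, radius - 1)
    return weights


def quantization_error(data, weights):
    """Average distance between each input and its closest unit."""
    return sum(min(dist(x, w) for w in weights) for x in data) / len(data)


# Hypothetical inputs already scaled to [0, 1].
data = [(0.1, 0.1), (0.15, 0.2), (0.2, 0.1), (0.8, 0.9), (0.9, 0.8), (0.85, 0.85)]
weights = train_som(data, n_units=4)
qe = quantization_error(data, weights)
print(weights, qe)
```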
 The SOM algorithm adapts the output space to represent the relationships between input
patterns in a visually meaningful way.
• Input Space:
o In the input space, data patterns are represented by vectors.
o Optimization aims to adjust neurons to the positions of input data.
• Output Space:
o If we have two variables, optimization in the input space will reflect in the output
space.
o For a 1-dimensional SOM, it organizes along a line, while for a 2-dimensional
SOM, it organizes on a grid.
• Colour Demo:
o Demonstrates grouping cells based on their RGB code (red, green, and blue).
o Each colour represents a unique combination of RGB intensities.
o The output space transforms as individuals (colours) are presented to the network.
Key outputs of the SOM:
 U-Matrices:
 Component planes:
 Hit Plots:
6.5. Analysis and validation of clustering solutions
Profiling the Solution: Analyst evaluation is a pivotal step in the clustering process, where
clustering results undergo analysis and scrutiny by domain experts or data analysts.
Several reasons underscore the importance of analyst evaluation:
 Ensuring Relevance: Analyst evaluation ensures that clusters align with the business
problem or research question, ensuring both technical accuracy and practical relevance.
 Identifying Anomalies: Analyst evaluation helps identify and analyze outliers or
anomalies that may not fit within the identified clusters, providing insights into potential
issues with the clustering algorithm or the data.
 Improving Interpretability: Clustering algorithms may produce complex or obscure
clusters. Analyst evaluation aids in identifying and explaining the underlying patterns,
making the results more interpretable and communicable to stakeholders.
 Refining the Clustering Solution: Analysts can suggest modifications to the clustering
algorithm or input data based on their evaluation, refining the solution. This may involve
using a different feature set or addressing missing or noisy data.
 Validating Results: Analyst evaluation serves as a validation mechanism for clustering
results. By comparing outcomes to prior knowledge or expert opinion, analysts ensure
the clusters are meaningful and useful.
7. Association Rules
Association Rules:
 Aims at the extraction of compact patterns that describe subsets of data.
• Events that occur together (market basket analysis).
• The main purpose is to establish relationships between fields.
• Are rules of the form if X then Y.
• Association rules provide information about things that tend to happen together.
 Are expressed in the form: "If the item A is part of an event, then the item B will also be
part of the event X percent of the time."
• The rules should be interpreted as associations rather than direct causation.
• It is not legitimate to infer rules of causality.
 1st Example:
• Table with a set of purchases from a supermarket, with 5 purchases and 5 items:
Client   Items
1        Orange juice, Soda
2        Milk, Orange juice, Glass cleaner
3        Orange juice
4        Orange juice, Detergent, Soda
5        Glass cleaner, Soda
• With these data we can create a table of co-occurrences with the number of times that any pair of products was purchased together (the diagonal holds the number of purchases containing each item).
                Orange juice   Glass cleaner   Milk   Soda   Detergent
Orange juice    4              1               1      2      1
Glass cleaner   1              2               1      1      0
Milk            1              1               1      0      0
Soda            2              1               0      3      1
Detergent       1              0               0      1      1
o Orange juice and soda are more likely to be purchased together than any other 2 items.
o Detergent is never purchased with milk or glass cleaner.
o Milk is never purchased with soda or detergent.
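The co-occurrence counts can be derived mechanically from the five purchases. The sketch below counts, for every pair of items, the number of purchases that contain both; the variable names are illustrative.

```python
# The five purchases from the example, one set of items per client.
baskets = [
    {"Orange juice", "Soda"},
    {"Milk", "Orange juice", "Glass cleaner"},
    {"Orange juice"},
    {"Orange juice", "Detergent", "Soda"},
    {"Glass cleaner", "Soda"},
]

items = sorted(set().union(*baskets))

# cooc[a][b] = number of purchases that contain both a and b.
cooc = {a: {b: 0 for b in items} for a in items}
for basket in baskets:
    for a in basket:
        for b in basket:
            cooc[a][b] += 1

for a in items:
    print(a, [cooc[a][b] for b in items])
```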
Apriori Algorithm:
 Frequent Itemset Generation: Scan the dataset to count item frequencies and generate the
list of frequent 1-itemsets (L1), i.e. those that meet a predefined minimum support threshold.
 Joining to Create Candidate Itemsets: Create candidate 2-itemsets by joining pairs of
L1 frequent itemsets (e.g., if "milk" and "bread" are each frequent individually, create the
candidate itemset {milk, bread}). Verify the frequency of the subsets of these candidate itemsets.
 Pruning Infrequent Itemsets: Eliminate candidate itemsets that have an infrequent subset,
reducing the number of itemsets to be examined. This pruning is based on the observation
that if a subset is infrequent, none of its supersets can be frequent. The remaining
candidates that meet the minimum support threshold form L2.
 Repeat the Process: Iteratively repeat steps 2 and 3 to generate candidate itemsets of
higher levels (L3, L4, etc.) by joining frequent itemsets from the previous level. Continue
until no more frequent itemsets can be generated or a specified itemset size is reached.
 Association Rule Generation: Once all frequent itemsets are identified, generate
association rules from these itemsets.
 Pruning Rules: Optionally, prune rules that do not meet a minimum confidence threshold.
This step filters out weak rules, in which the consequent does not appear often enough when
the antecedent is present.
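The steps above can be sketched in Python as follows. This is a minimal, unoptimized illustration rather than a reference implementation: the function names `apriori` and `association_rules` are hypothetical, the minimum support is assumed to be an absolute transaction count, and the baskets are the five supermarket purchases from the earlier example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: count} for itemsets found in at least min_support transactions."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # Number of transactions that contain each candidate itemset.
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    def keep_frequent(counts):
        return {c: n for c, n in counts.items() if n >= min_support}

    # Step 1: frequent 1-itemsets (L1).
    singles = {frozenset([item]) for t in transactions for item in t}
    level = keep_frequent(count(singles))
    frequent = dict(level)
    k = 2
    while level:
        previous = set(level)
        # Step 2: join frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in previous for b in previous if len(a | b) == k}
        # Step 3: prune candidates that have an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in previous for s in combinations(c, k - 1))
        }
        level = keep_frequent(count(candidates))
        frequent.update(level)
        k += 1  # Step 4: repeat at the next level.
    return frequent


def association_rules(frequent, min_confidence):
    """Return (antecedent, consequent, confidence) for rules meeting min_confidence."""
    found = []
    for itemset, n_both in frequent.items():
        for size in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, size)):
                # Every subset of a frequent itemset is itself in `frequent`.
                confidence = n_both / frequent[antecedent]
                if confidence >= min_confidence:
                    found.append((antecedent, itemset - antecedent, confidence))
    return found


baskets = [
    {"Orange juice", "Soda"},
    {"Milk", "Orange juice", "Glass cleaner"},
    {"Orange juice"},
    {"Orange juice", "Detergent", "Soda"},
    {"Glass cleaner", "Soda"},
]
frequent = apriori(baskets, min_support=2)
rules = association_rules(frequent, min_confidence=0.6)
for antecedent, consequent, confidence in rules:
    print(set(antecedent), "->", set(consequent), round(confidence, 2))
```

With a minimum support of 2 transactions, the only frequent pair is {Orange juice, Soda}, and the only rule reaching 60% confidence is Soda -> Orange juice (2 of the 3 soda purchases).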
Evaluating the quality of Association Rules:
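 Confidence: the strength of an association, i.e. the percentage of transactions containing
the antecedent in which the consequent also appears.
• $\text{Confidence} = \frac{B}{A}$, where $A$ is the number of transactions containing the
antecedent and $B$ is the number of transactions containing both the antecedent and the
consequent.
 Support: shows how frequently the combination occurs in the database.
• $\text{Support} = \frac{B}{U}$, where $U$ is the total number of transactions.
 Consequent Support (expected confidence of the consequent): equal to the number of
transactions containing the consequent, divided by the total number of transactions.
• $\text{Consequent Support} = \frac{C}{U}$, where $C$ is the number of transactions
containing the consequent.
 Lift: equal to the confidence divided by the expected confidence. Lift is the factor by
which the likelihood of the consequent increases given the antecedent.
• $\text{Lift} = \frac{\text{Confidence}}{\text{Consequent Support}}$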
• Lift = 2
o Indicates that the antecedent and the consequent occur together twice as often as would
be expected if they were independent.
o This strong association indicates a meaningful relationship, making the items suitable
candidates for bundling or recommendations.
• Lift > 1: Indicates a positive association between items.
• Lift < 1: Indicates a negative association, and the presence of one item may reduce
the likelihood of the other.
• Lift = 1: Suggests independence between items.
 A credible rule needs good confidence, high support, and lift > 1.
 High-confidence rules with low support should be interpreted cautiously due to potential
idiosyncrasies from a small number of cases.
 Example:
Transaction table:
Total transactions   1,000,000
Shoes                  200,000
Socks                   50,000
Shoes and socks         20,000
Rule:
If a customer purchases shoes, then 10% of the time he/she will also buy socks.
Evaluation criteria:
Confidence = shoes and socks / shoes = 10%
Support = shoes and socks / total transactions = 2%
Expected confidence = socks / total transactions = 5%
Lift = Confidence / Expected confidence = 2
The confidence with socks on the left-hand side and shoes on the right-hand side is:
shoes and socks / socks = 40%
Lift = 2: a customer who bought shoes is twice as likely to buy socks as a customer in
general.
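The evaluation criteria can be checked with a few lines of Python. This is a small sketch: the function name `rule_metrics` and its argument names are hypothetical, and the counts are the ones from the shoes and socks example (rule: shoes -> socks).

```python
def rule_metrics(n_antecedent, n_consequent, n_both, n_total):
    """Confidence, support, expected confidence and lift of antecedent -> consequent."""
    confidence = n_both / n_antecedent
    support = n_both / n_total
    # Expected confidence is the support of the consequent on its own.
    expected_confidence = n_consequent / n_total
    lift = confidence / expected_confidence
    return {
        "confidence": confidence,
        "support": support,
        "expected_confidence": expected_confidence,
        "lift": lift,
    }


# Rule: shoes -> socks.
metrics = rule_metrics(
    n_antecedent=200_000,  # transactions with shoes
    n_consequent=50_000,   # transactions with socks
    n_both=20_000,         # transactions with shoes and socks
    n_total=1_000_000,
)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Swapping the antecedent and consequent counts gives the reverse rule (socks -> shoes): its confidence is 40%, while support and lift stay the same because both are symmetric in the two items.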
Types of rules
 Trivial rules: “Customers who purchase maintenance agreements are very likely to
purchase large appliances.”
 Inexplicable Rules: “When a new hardware store opens, one of the most commonly sold
items is toilet bowl cleaners.”
 Actionable Rules: “Wal-Mart customers who purchase Barbie dolls have a 60 percent
likelihood of also purchasing one of three types of candy bars.”
 Rules that result from promotions that were made.
7. Semi-supervised classification
8.1.
Nearest Neighbors
Instance based classification:
 Simplest form of learning;
 Training instances are searched for the instances that most closely resemble the new instance;
 The instances themselves represent the knowledge;
 Also called instance-based learning
 Similarity function defines what’s “learned”
Requires three things:
1. The set of stored records
2. Distance Metric to compute distance between records
3. The value of k, the number of nearest neighbors to retrieve
To classify an unknown record:
1. Compute distance to other training records
2. Identify k nearest neighbors
3. Use class labels of nearest neighbors to determine the class
label of unknown record (e.g., by taking majority vote)
Compute distance between two points:
 Euclidean distance: $d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}$
 Determine the class from the nearest-neighbour list:
• Take the majority vote of class labels among the k nearest neighbours.
• Or weigh each vote according to distance, with weight $w = \frac{1}{d^2}$.
K-nn frontiers (and the number k):
 Large k
• Smooth frontiers
• Unable to detect small variations
 Small k
• Very sensitive to outliers
• Crisp frontiers
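A k-nearest-neighbours classifier following the three steps above can be sketched in plain Python. This is an illustrative sketch only: the function names, the toy records and labels are made up, and the small constant added to the squared distance (to avoid dividing by zero when a neighbour coincides with the query) is an assumption of this sketch.

```python
import math
from collections import defaultdict


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def knn_predict(records, labels, query, k=3, weighted=False):
    """Classify `query` from the stored records by (optionally weighted) vote."""
    # 1. Compute the distance to every stored record.
    distances = [(euclidean(r, query), y) for r, y in zip(records, labels)]
    # 2. Identify the k nearest neighbours.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # 3. Vote: each neighbour counts 1, or 1/d^2 when weighted.
    votes = defaultdict(float)
    for d, y in nearest:
        votes[y] += 1.0 / (d ** 2 + 1e-12) if weighted else 1.0
    return max(votes, key=votes.get)


# Hypothetical toy data: two well-separated classes in two dimensions.
records = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.0), (6.5, 8.0)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(records, labels, (2.0, 2.0), k=3))
print(knn_predict(records, labels, (6.0, 7.0), k=3, weighted=True))
```

Increasing `k` averages over more neighbours, which smooths the decision frontier; with `k=1` a single outlier can flip the prediction.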
8.2.
Decision Trees
Classification Trees:
 Classification trees are typically considered to be both classification and regression tools.
 One key advantage is the simplicity and interpretability of their results.
 Therefore, the result of a classification tree can easily be expressed in English or SQL.
 A classification tree is a decisional algorithm.
 It can be seen as a way of storing knowledge.
 The objective is to discriminate between classes.
 Obtain leaves as pure as possible.
 If possible, each leaf should represent only individuals from a specific class.
 At each level, it divides the set into alternative partitions.
• Using a measure of quality, it selects the best partition.
 The process is repeated for each element of the partition.
 Stops when a given criterion is met.
 Assumes the existence of a target variable "Class" for previously classified examples.
 Each node specifies a unique attribute used as a test.
• N represents a node, ASET is the Attribute Set, and ISET is the Instance Set.
Strengths:
 Interpretation: We can easily understand the reasons behind a specific classification
decision.
 May use different types of data: Interval, ordinal, nominal, etc.
 Insensitive to scale factors: Variables measured in different scales may be used without
any type of normalization.
 Automatically defines the most relevant variables: These are the variables used at the top
of the tree.
 Can be adapted to regression: Each leaf becomes a linear model.
Weaknesses:
 Boundaries are linear and perpendicular to the variable axes.
 Sensitive to small perturbations in the data.
Classification Tree Variants:
 ID3, C4.5 and C5 [Quinlan 86, 93]: Iterative Dichotomizer 3
 CART: Classification and regression trees [Breiman 84]
 CHAID [Hartigan 75]: Used in SPSS and SAS…
• In SAS you can choose different parameters to build your tree.
In clustering, classification trees offer insights through:
 Feature Importance: Train a classification tree using cluster assignments as the target
variable to identify critical features for distinguishing clusters.
 Interpretation: Analyse the tree to understand how features contribute to cluster
formation, gaining insights into each cluster's defining characteristics.
 Visualizations: Create visuals to enhance understanding of cluster-feature relationships.
 Validate and Refine: Use tree insights to validate and refine the clustering solution,
considering alignment with domain knowledge.
Impurity Metrics in Classification Trees:
 Entropy: measure of disorder or impurity in a set, calculated using information theory
concepts. It assesses unpredictability. Lower entropy values signify better purity.
• $E = -p \log_2(p)$
• $p$ = probability that an example belongs to a specific class
• Partition entropy: $Ent(S) = -\sum_{i=1}^{\#C} p_i \log_2(p_i)$
• Gain from choosing attribute A:
o $Gain(Ent_{new}) = Ent_{initial} - Ent_{new}$
o $Gain(S, A) = Ent(S) - \sum_{v \in Values(A)} \frac{\#S_v}{\#S} Ent(S_v)$
 Gini coefficient: measure of impurity or disorder in a set. It quantifies the likelihood of
misclassifying an element chosen randomly. A lower Gini index indicates better purity.
• Similar to entropy
• $G(I) = 1 - \sum_{k=1}^{C} p_k^2$
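The impurity measures and the gain of a split can be sketched in Python as follows. This is a minimal illustration: the function names are hypothetical, examples are assumed to be dictionaries of attribute values, and the four-row dataset (attribute "windy", class "play") is made up.

```python
import math
from collections import Counter


def entropy(labels):
    """Ent(S) = -sum_i p_i * log2(p_i) over the classes present in labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """G = 1 - sum_k p_k^2 over the classes present in labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Gain(S, A) = Ent(S) - sum_v (#S_v / #S) * Ent(S_v)."""
    n = len(labels)
    # Group the class labels by the value of the chosen attribute.
    partitions = {}
    for row, y in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(y)
    weighted = sum(len(part) / n * entropy(part) for part in partitions.values())
    return entropy(labels) - weighted


# Hypothetical examples: "windy" separates the two classes perfectly.
rows = [{"windy": True}, {"windy": True}, {"windy": False}, {"windy": False}]
play = ["no", "no", "yes", "yes"]

print(entropy(play), gini(play), information_gain(rows, play, "windy"))
```

A 50/50 class mix gives the maximum two-class impurity (entropy 1 bit, Gini 0.5); because the split on "windy" produces two pure partitions, the gain equals the full initial entropy.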
8. Visualization of Multidimensional Data
Text mining
 Machine learning algorithms operate on a numeric feature space, expecting input as a
two-dimensional array where rows are instances and columns are features.
 To perform machine learning on text, we need to transform our documents into vector
representations such that we can apply numeric machine learning.
• This process is called feature extraction or, more simply, vectorization, and is an
essential first step toward language-aware analysis.
 If we are to apply machine learning our best option is to make a shift in how we think
about language:
• From a sequence of words
• to points that occupy a high-dimensional semantic space.
 Points in space can be close together or far apart, tightly clustered, or evenly distributed.
• Semantic space is therefore mapped in such a way where documents with similar
meanings are closer together and those that are different are farther apart.
 By encoding similarity as distance, we can begin to derive the primary components of
documents and draw decision boundaries in our semantic space.
• The simplest encoding of semantic space is the bag-of-words model, whose primary
insight is that meaning and similarity are encoded in vocabulary.
One-Hot Encoding: A boolean vector encoding method that marks a particular vector index with
a value of true (1) if the token exists in the document and false (0) if it does not.
Frequency Vectors: The simplest vector encoding model is to simply fill in the vector with the
frequency of each word as it appears in the document.
Term Frequency–Inverse Document Frequency (TF-IDF): TF–IDF normalizes the frequency
of tokens in a document with respect to the rest of the corpus. This encoding approach accentuates
terms that are very relevant to a specific instance.
Dense (or Distributed) Representations (embeddings): Encode each word as a dense vector instead
of a sparse, vocabulary-sized one. Word2vec implements a word embedding model that creates these
distributed representations, trained with either a continuous bag-of-words (CBOW) or a skip-gram
model, such that words are embedded in space along with similar words based on their context.
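The first three encodings can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the three-document corpus is made up, tokenization is a plain whitespace split, and the TF-IDF weighting used here (term frequency times log(N / document frequency)) is only one of several common variants.

```python
import math
from collections import Counter

# Hypothetical toy corpus.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
tokenized = [d.split() for d in docs]
# One vector position per distinct token in the corpus.
vocab = sorted({tok for doc in tokenized for tok in doc})


def one_hot(doc):
    """1 if the token occurs in the document, 0 otherwise."""
    present = set(doc)
    return [1 if term in present else 0 for term in vocab]


def frequency(doc):
    """Number of times each token occurs in the document."""
    counts = Counter(doc)
    return [counts[term] for term in vocab]


def tf_idf(doc):
    """Term frequency scaled by log(N / document frequency)."""
    counts = Counter(doc)
    n_docs = len(tokenized)
    vector = []
    for term in vocab:
        df = sum(1 for d in tokenized if term in d)
        idf = math.log(n_docs / df)
        vector.append(counts[term] / len(doc) * idf)
    return vector


first = tokenized[0]
print(vocab)
print(one_hot(first))
print(frequency(first))
print([round(w, 3) for w in tf_idf(first)])
```

In the first document "the" is the most frequent token, yet its TF-IDF weight is lower than that of "cat", because "the" also appears in another document while "cat" is specific to this one.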
Multidimensionality Visualization Methods:
 PCA
 t-SNE
 UMAP
9.1.
Principal Component Analysis (PCA)
Principal Component Analysis (Input Space Reduction – Redundancy)
 Method that employs an orthogonal transformation to convert a set of observations of
possibly correlated variables into a set of linearly uncorrelated variables called principal
components.
 The number of principal components is equal to the number of original variables.
 These components are linear combinations of the original variables, with the first one
capturing the highest variance.
Steps in PCA:
1. Standardize the Data: If not already standardized, make the data zero mean and unit
variance.
2. Compute the Covariance Matrix: Calculate the covariance matrix C of the standardized
data (X), for m observations.
• $C = \frac{1}{m} X^T X$
3. Eigenvalue Decomposition: Perform eigenvalue decomposition on the covariance matrix
to get eigenvectors and eigenvalues.
• $C v_i = \lambda_i v_i$
4. Sort Eigenvectors and Eigenvalues: Sort eigenvectors based on corresponding
eigenvalues in descending order.
5. Select Principal Components: Decide on the number of principal components (k) to
retain, often based on explained variance.
6. Form the Projection Matrix: Create a projection matrix by selecting the top k
eigenvectors.
• $P = (v_1, \dots, v_k)$
7. Transform the Data: Multiply the standardized data matrix by the projection matrix to
obtain the new set of uncorrelated variables, the principal components.
• $\text{Transformed data} = X P$
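The seven steps can be followed almost line by line with NumPy. This is a sketch under stated assumptions: the function name `pca` is hypothetical, the synthetic data (two strongly correlated variables plus an independent one) is made up, and the population standard deviation is used so that the covariance matches the 1/m formula.

```python
import numpy as np


def pca(data, k):
    """Project data onto its top-k principal components; return scores and eigenvalues."""
    X = np.asarray(data, dtype=float)
    # 1. Standardize: zero mean and unit variance per variable.
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    m = X.shape[0]
    # 2. Covariance matrix C = (1/m) X^T X.
    C = X.T @ X / m
    # 3. Eigenvalue decomposition (eigh is meant for symmetric matrices).
    eigenvalues, eigenvectors = np.linalg.eigh(C)
    # 4. Sort eigenvalues and eigenvectors in descending order.
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    # 5-6. Keep the top k eigenvectors as the projection matrix P.
    P = eigenvectors[:, :k]
    # 7. Transform the data.
    return X @ P, eigenvalues


# Hypothetical data: the second variable is almost a multiple of the first.
rng = np.random.default_rng(0)
base = rng.normal(size=(50, 1))
data = np.hstack([base, 2 * base + rng.normal(scale=0.1, size=(50, 1)),
                  rng.normal(size=(50, 1))])

scores, eigenvalues = pca(data, k=2)
print(eigenvalues / eigenvalues.sum())  # share of variance explained by each component
```

Because the data are standardized, the eigenvalues sum to the number of variables, so dividing by that sum gives the share of variance explained, which is what step 5 typically relies on to choose k.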
Principal Component Analysis in figures:
[Figure: fitting the principal component line. Labels: "Minimize the distance of the points
to the line" and "Maximize the sum of the squared distances".]
 Eigenvector – represents the direction of the PC
 Loading Scores - describe how much each variable contributes to a particular principal
component
 Eigenvalue - the variance explained by each principal component
9.2.
t-distributed Stochastic Neighbour Embedding (t-SNE)
t-distributed Stochastic Neighbour Embedding (t-SNE):
 Statistical method for visualizing high-dimensional data in two or three dimensions.
 It serves as a nonlinear dimensionality reduction technique, capturing relationships
between data points for effective visualization.
 Objective: visualizing high-dimensional data by mapping each datapoint to a location in
a lower-dimensional space.
 Method: nonlinear dimensionality reduction, emphasizing proximity for similar objects
and distance for dissimilar ones.
 t-SNE has been used for visualization in a wide range of applications, including
genomics, computer security research, natural language processing, music analysis,
cancer research, bioinformatics, etc.
 While t-SNE plots often seem to display clusters, the visual clusters can be influenced
strongly by the chosen parameterization and therefore a good understanding of the
parameters for t-SNE is necessary.
 Interactive exploration may thus be necessary to choose parameters and validate results.
Steps in t-SNE:
1. Construct a probability distribution over pairs of high-dimensional objects, assigning
higher probability to similar objects.
2. Define a similar probability distribution over the points in the low-dimensional map, and
position those points so as to minimize the KL divergence between the two distributions.
Kullback–Leibler Divergence (KL Divergence):
 Also known as relative entropy
 A measure of divergence between two probability distributions.
 Indicates the average difference in log probabilities assigned by the two distributions to
the same events.
 $D_{KL}(P \| Q) = 0$: the distributions are identical; a larger value signifies greater
dissimilarity.
 The KL divergence measures how one probability distribution differs from a second,
reference probability distribution; t-SNE minimizes it so that the low-dimensional map
preserves the similarities found in the high-dimensional data.
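For discrete distributions the KL divergence is $D_{KL}(P \| Q) = \sum_i p_i \log \frac{p_i}{q_i}$. A small Python sketch follows; the function name `kl_divergence` and the example distributions are made up, the natural logarithm is used, and it assumes that Q assigns non-zero probability to every event that P considers possible.

```python
import math


def kl_divergence(p, q):
    """D_KL(P || Q) for two discrete distributions over the same events."""
    # Terms with p_i = 0 contribute nothing to the sum.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


uniform = [0.5, 0.5]
skewed = [0.9, 0.1]

print(kl_divergence(uniform, uniform))  # identical distributions
print(kl_divergence(skewed, uniform))
print(kl_divergence(uniform, skewed))
```

The last two values differ: the KL divergence is not symmetric, which is why it is called a divergence rather than a distance.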
“Perplexity” is the main t-SNE parameter
 It essentially defines how to balance attention between local and global aspects of your data.
 In t-SNE, the perplexity may be viewed as a knob that sets the number of effective nearest
neighbours.
 It is comparable with the number of nearest neighbours k that is employed in many
manifold learners.
 The original paper says, “The performance of SNE is fairly robust to changes in the
perplexity, and typical values are between 5 and 50.”
9.3.
Uniform Manifold Approximation and Projection (UMAP)
Uniform Manifold Approximation and Projection (UMAP):
 Is employed for dimension reduction and visualization, offering an alternative to t-SNE.
 Operates based on assumptions about data distribution and local connectivity:
 Uniform Distribution on Riemannian Manifold: Assumes data is uniformly distributed on
a Riemannian manifold.
 Local Connectivity: Assumes local connectivity within the manifold.
 Fuzzy Topological Structure: Models the manifold with a fuzzy topological structure.
 Parameters:
• Number of nearest neighbours: controls how UMAP balances local versus global
structure in the data.
• Minimum distance: controls how tightly UMAP is allowed to pack points together.
It provides the minimum distance apart that points are allowed to be in the low
dimensional representation.
Steps in UMAP:
1. Fuzzy Topological Representation Construction: Constructs a fuzzy topological
representation in the original space.
2. Optimization (Stochastic Gradient Descent): Optimizes the low-dimensional
representation to closely match the fuzzy topological structure.