
Topic2 DataToInsights 2024

T2 Data to Insights to Decisions
Data Analytics 344
Eldon Burger
Data collection from a self-driving car
Source: https://www.greenbiz.com/article/driverless-cars-wont-be-good-environment-if-they-lead-more-auto-use
Summary
As part of the business understanding phase of CRISP-DM, (i) a business problem must be converted into an analytical solution and (ii) the feasibility of the solution must be determined.
i. To develop an analytical solution, informative features must be identified, data that corresponds to the features must be collected, and the data must be structured in an analytical base table.
ii. To determine the feasibility of an analytical solution, the availability of historical data and the capacity of the business to act on the insights must be considered.
The business understanding phase
The analytic base table
Feature selection
The business understanding phase
An analytic project starts with understanding the business problem
An analytic problem is not handed to a data analytics practitioner fully defined; rather, a business problem or need is provided.
Given a business problem, it is the job of a data analytics practitioner to decide how to address the problem using analytics.
A key step in any data analytics project is to understand the business problem that the organisation wants to solve and then to determine how an analytical model can help the business to solve the problem.
[Figure: The six phases of the CRISP-DM process and the key relationships between the six phases: 1 Business understanding, 2 Data understanding, 3 Data preparation, 4 Modelling, 5 Evaluation, 6 Deployment]
Source: Chapter 2 Fundamentals of Machine Learning for Predictive Data Analytics, 2nd Edition
A business problem can be converted into an analytical solution by answering three key questions
1. What is the business problem and what are the goals that the business wants to achieve?
2. How does the business currently work?
3. In what ways could an analytical model address the business problem? Consider:
- what type of model will be created,
- where and when the model will be used by the business, and
- how the model will help to address the business problem
Source: Chapter 2 Fundamentals of Machine Learning for Predictive Data Analytics, 2nd Edition
Illustrative example 1
A predictive model for motor insurance fraud
Despite having a fraud investigation team that investigates 30% of all motor insurance claims, a motor insurance company is losing money due to fraudulent claims. Investigating claims takes time and has a cost associated with it.
Decision | Fraud | Not fraud
Investigate | True positive | False positive
Do not investigate | False negative | True negative
- False positive: when the outcome of an investigation is that the claim is not fraudulent, the company has wasted time and money
- False negative: when a fraudulent claim is not investigated, the company loses money due to fraud
Source: Chapter 2 Fundamentals of Machine Learning for Predictive Data Analytics, 2nd Edition
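The trade-off in the matrix above can be made concrete with a small sketch. Only the four outcome categories come from the slide; the claim records, the flat investigation cost and the claim amounts are invented for illustration.

```python
from collections import Counter

def outcome(investigated: bool, is_fraud: bool) -> str:
    """Map one claim to a cell of the investigate/fraud matrix."""
    if investigated:
        return "true positive" if is_fraud else "false positive"
    return "false negative" if is_fraud else "true negative"

# Hypothetical claims: (investigated?, actually fraudulent?, claim amount)
claims = [
    (True, True, 12_000),
    (True, False, 8_000),
    (False, True, 20_000),
    (False, False, 5_000),
    (False, False, 9_500),
]

INVESTIGATION_COST = 1_500  # assumed flat cost per investigation

counts = Counter(outcome(inv, fraud) for inv, fraud, _ in claims)

# Money wasted on investigating claims that turned out to be honest.
wasted = counts["false positive"] * INVESTIGATION_COST
# Money paid out on fraudulent claims that were never investigated.
fraud_loss = sum(amount for inv, fraud, amount in claims if fraud and not inv)

print(dict(counts))
print("wasted on false positives:", wasted)
print("lost to false negatives:", fraud_loss)
```

A model that ranks claims by likelihood of fraud aims to shrink both numbers at once: fewer honest claims investigated and fewer fraudulent claims paid out.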
Illustrative example 2
A predictive model for motor insurance fraud (1/2)
Currently, the business uses a fraud checklist to decide if a claim should be investigated for fraud or not.
[Figure: Simplified illustration of a motor claim process. Flowchart nodes: Lodge claim; Claim valid? (Yes/No); Investigate? (Yes/No); Investigate (Fraud / Not fraud); Reject claim; Pay-out claim]
Source: Chapter 2 Fundamentals of Machine Learning for Predictive Data Analytics, 2nd Edition
Illustrative example 2
A predictive model for motor insurance fraud (2/2)
Risk rating | Low | Medium | High
Inconsistencies in claimant’s statement | Consistent statement | Minor inconsistencies | Significant contradictions
Support e.g., witness statement | Supportive | Not available | Conflicting
Previous claim history | Clean claim history | Frequent claims | Previous fraudulent claim
Source: Chapter 2 Fundamentals of Machine Learning for Predictive Data Analytics, 2nd Edition
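As a rough sketch, the checklist above can be written as a lookup from each observed value to its risk rating. The slide does not say how the three ratings are combined into a decision, so the rule in `should_investigate` (investigate when any criterion is rated High) is purely an assumption, as are the key names.

```python
# Each checklist criterion maps an observed value to a risk rating.
# The values and ratings are taken from the checklist; the key names are illustrative.
CHECKLIST = {
    "statement": {
        "Consistent statement": "Low",
        "Minor inconsistencies": "Medium",
        "Significant contradictions": "High",
    },
    "support": {
        "Supportive": "Low",
        "Not available": "Medium",
        "Conflicting": "High",
    },
    "claim_history": {
        "Clean claim history": "Low",
        "Frequent claims": "Medium",
        "Previous fraudulent claim": "High",
    },
}

def rate_claim(claim: dict) -> dict:
    """Look up the risk rating of every checklist criterion for one claim."""
    return {criterion: CHECKLIST[criterion][value] for criterion, value in claim.items()}

def should_investigate(claim: dict) -> bool:
    """ASSUMED rule (not from the source): investigate if any rating is High."""
    return "High" in rate_claim(claim).values()

# Hypothetical claim assessed with the checklist.
example = {
    "statement": "Minor inconsistencies",
    "support": "Not available",
    "claim_history": "Previous fraudulent claim",
}
print(rate_claim(example))
print(should_investigate(example))
```

A hand-written rule set like this is what the predictive model is meant to replace.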
Illustrative example 3
A predictive model for motor insurance fraud
Business problem: Despite having a fraud investigation team that investigates 30% of all motor insurance claims, a motor insurance company is still losing money due to fraudulent claims.
Current business operation: [Figure: Simplified illustration of a motor claim process. Flowchart nodes: Lodge claim; Claim valid? (Yes/No); Investigate? (Yes/No); Investigate (Fraud / Not fraud); Reject claim; Pay-out claim]
Analytical solution: Build a model to determine the likelihood that a claim is fraudulent.
Source: Chapter 2 Fundamentals of Machine Learning for Predictive Data Analytics, 2nd Edition
Once the analytic solution has been defined, the feasibility of the solution should be determined
Data availability: To develop data-driven models, we require data. For the claim problem, a large collection of historical claims and their labels (e.g. fraudulent or non-fraudulent) is required.
Capacity to action: An analytical model is only useful if a business can use it. For the claim problem, the model is envisioned to replace the checklist, helping the fraud investigation team decide whether a claim should be investigated or not.
The analytic base table
Illustrative example
After the business understanding phase, data collection and preparation start
For the motor insurance fraud example, historical claim information can be collected in a table-like structure. Each row of the table represents one claim and each column of the table represents a feature: a measurable property of the object that we want to analyse.
Example of data collected for the motor insurance fraud example
ID | Location | Time of day | Weather | Driver licence # | Claim amount | Fraud
1 | Dorp | 13:00 | Sunny | 42300JK2456 | 10 921 | No
2 | Adam Tas | 13:30 | Cloudy | 600100147J35 | 20 567 | No
3 | Joubert Rd | 12:30 | Sunny | 206100082R6M | 12 347 | Yes
… | … | … | … | … | … | …
Descriptive features: Properties available for a claim; should help to decide if a claim is fraudulent or not
Target feature (Fraud): The feature we want our model to determine for a new claim [A]
[A] Not all analytical solutions will have a target feature
Source: Chapter 2 Fundamentals of Machine Learning for Predictive Data Analytics, 2nd Edition
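The split between descriptive features and the target feature can be sketched in plain Python using the three example claims. The representation (one dictionary per row) and the helper name `split_row` are illustrative choices, not something prescribed by the source.

```python
# One dictionary per row (claim); keys are the column names of the example table.
claims = [
    {"ID": 1, "Location": "Dorp", "Time of day": "13:00", "Weather": "Sunny",
     "Driver licence #": "42300JK2456", "Claim amount": 10921, "Fraud": "No"},
    {"ID": 2, "Location": "Adam Tas", "Time of day": "13:30", "Weather": "Cloudy",
     "Driver licence #": "600100147J35", "Claim amount": 20567, "Fraud": "No"},
    {"ID": 3, "Location": "Joubert Rd", "Time of day": "12:30", "Weather": "Sunny",
     "Driver licence #": "206100082R6M", "Claim amount": 12347, "Fraud": "Yes"},
]

TARGET = "Fraud"  # the column the model should determine for a new claim

def split_row(row: dict):
    """Separate the target value from the remaining columns of one row."""
    descriptive = {name: value for name, value in row.items() if name != TARGET}
    return descriptive, row[TARGET]

# X holds the remaining columns per claim, y the known fraud labels.
X, y = zip(*(split_row(row) for row in claims))

print(X[0])
print(y)
```

Whether an identifier such as `ID` should be fed to a model is a separate decision; this sketch only removes the target column.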
The basic format used to represent data in data analytic
projects is the Analytics Base Table (1/3)
An analytic base table (ABT) is a flat tabular data structure made up of rows and columns.
[Figure: The basic structure of an analytics base table, with rows and columns]
Source: Chapter 2 Fundamentals of Machine Learning for Predictive Data Analytics, 2nd Edition
The basic format used to represent data in data analytic projects is the Analytics Base Table (2/3)
Each column of the ABT represents a feature [A], while each row represents an instance. The columns are divided into two sets: descriptive features and a single target feature.
[Figure: The basic structure of an analytics base table, showing the descriptive features and the target feature]
[A] Features are also referred to as attributes, fields or variables
Source: Chapter 2 Fundamentals of Machine Learning for Predictive Data Analytics, 2nd Edition
The basic format used to represent data in data analytic projects is the Analytics Base Table (3/3)
Each row contains values for both the descriptive features and the target feature, representing an instance [A] for which a prediction can be made. The term one-row-per-subject is often used to describe the structure of the ABT.
[Figure: The basic structure of an analytics base table, showing one instance across the descriptive features and the target feature]
[A] Instances are also referred to as examples, records or observations
Source: Chapter 2 Fundamentals of Machine Learning for Predictive Data Analytics, 2nd Edition
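A minimal sketch of what one-row-per-subject means in practice: every subject appears exactly once, every row has the same columns, and the target column is present. The function name `check_abt` and the sample rows are hypothetical.

```python
from collections import Counter

def check_abt(rows, id_column, target_column):
    """Return a list of problems that break the one-row-per-subject layout."""
    problems = []
    # Every subject should appear in exactly one row.
    id_counts = Counter(row[id_column] for row in rows)
    duplicates = sorted(i for i, n in id_counts.items() if n > 1)
    if duplicates:
        problems.append(f"subjects with more than one row: {duplicates}")
    # Every row should have the same set of columns (a flat table).
    columns = set(rows[0]) if rows else set()
    for row in rows:
        if set(row) != columns:
            problems.append(f"row {row[id_column]} has different columns")
    # A predictive ABT needs the single target feature.
    if target_column not in columns:
        problems.append(f"missing target feature: {target_column}")
    return problems

# Hypothetical rows: the second table repeats subject 2.
good = [{"ID": 1, "Claim amount": 10921, "Fraud": "No"},
        {"ID": 2, "Claim amount": 20567, "Fraud": "No"}]
bad = good + [{"ID": 2, "Claim amount": 500, "Fraud": "Yes"}]

print(check_abt(good, "ID", "Fraud"))
print(check_abt(bad, "ID", "Fraud"))
```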
To create an analytical base table, relevant features must be identified
The selection of features to include in an analytic base table is based on domain knowledge and on analysis of the relationship between features.
To identify features, start by identifying a set of domain concepts: high-level abstractions that describe some characteristic of the prediction subject, from which we derive a set of concrete features.
[Figure: The hierarchical relationship between an analytics solution, domain concepts and descriptive features. The analytics solution branches into a target concept (leading to the target feature) and domain concepts; domain concepts branch into domain sub-concepts, which lead to features]
Source: Chapter 2 Fundamentals of Machine Learning for Predictive Data Analytics, 2nd Edition
Illustrative example
Domain concepts for motor insurance fraud
For the motor insurance claim fraud analytical solution, example domain concepts include
- Policy details: Covers information relating to the policy held by the claimant e.g. age of the policy
- Claim details: Covers details of the claim itself e.g. incident type, claim amount
- Claimant history: Information on previous claims made by the claimant e.g. number of previous claims
- Claimant links: Relationships or links between the claimant and other people involved
- Claimant demographics: Demographic details of the claimant such as age, gender and occupation
[Figure: Example domain concepts for a motor insurance fraud prediction analytics solution. Motor insurance claim fraud branches into the target concept (Fraud outcome) and the domain concepts Policy details, Claim details, Claimant history (Claim types, Claim frequency), Claimant links (Links with other claims, Links with current claims) and Claimant demographics]
Source: Chapter 2 Fundamentals of Machine Learning for Predictive Data Analytics, 2nd Edition
Illustrative example
Once the domain concepts have been identified, the features can be designed (1/2)
The descriptive features in an analytical base table can be either raw features or derived features. Raw features are copied directly from the source data, while derived features are constructed from one or more raw data sources.
Raw feature: Location of accident
Derived feature: Accident “zones”
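The raw-versus-derived distinction can be sketched with the location example. The street names come from the example claims table; the zone names and the lookup itself are invented, since the slides do not define how accident zones are formed.

```python
# ASSUMED lookup from raw location to zone; the zone names are invented.
ZONE_BY_LOCATION = {
    "Dorp": "Central",
    "Adam Tas": "Arterial",
    "Joubert Rd": "Residential",
}

def add_accident_zone(claim: dict) -> dict:
    """Return a copy of the claim with a derived 'Accident zone' feature.

    'Location' is the raw feature (copied from the source data);
    'Accident zone' is derived from it via the lookup above.
    """
    zone = ZONE_BY_LOCATION.get(claim["Location"], "Unknown")
    return {**claim, "Accident zone": zone}

claim = {"ID": 3, "Location": "Joubert Rd", "Claim amount": 12347}
print(add_accident_zone(claim))
```

Grouping many distinct locations into a handful of zones gives a model fewer, better-populated categories to learn from than the raw street names would.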
Once the domain concepts have been identified, the features can be designed (2/2)
Three data considerations are particularly important when we are designing features:
- Availability: Data must be available to implement any feature we want to use. Example: In an online payments scenario, we might want to include the average account balance over the past six months; however, the company might not track the historical balance.
- Timing: Data that will be used to define a feature must be available before the target feature is known. Example: If we want to predict the ice cream sales for tomorrow, we cannot include tomorrow’s temperature as a feature unless we can predict it accurately.
- Longevity: Data can become stale. Example: To determine if a loan should be granted to an applicant, we might want to use salary as a descriptive feature. However, salaries will change over an extended time.
Source: Chapter 2 Fundamentals of Machine Learning for Predictive Data Analytics, 2nd Edition
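The timing consideration can be expressed as a simple date comparison, sketched here for the ice cream example. The dates, the candidate feature names and the function name are hypothetical.

```python
from datetime import date

def usable_at_prediction(available_on: date, prediction_made_on: date) -> bool:
    """A feature passes the timing check only if its value is already
    known on the day the prediction has to be made."""
    return available_on <= prediction_made_on

# Hypothetical setup: today we predict tomorrow's ice cream sales.
prediction_day = date(2024, 1, 10)

# When each candidate feature value becomes known (invented dates).
candidates = {
    "yesterday's sales": date(2024, 1, 9),
    "tomorrow's measured temperature": date(2024, 1, 11),
    "tomorrow's forecast temperature": date(2024, 1, 10),  # forecast issued today
}

usable = {name: usable_at_prediction(known_on, prediction_day)
          for name, known_on in candidates.items()}
print(usable)
```

The measured temperature fails the check, while a forecast issued today passes; whether the forecast is accurate enough to be useful is a separate question.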
An Analytics Base Table is constructed from multiple data sources
Although analytical base tables are the key structure that we use to develop models, data in organisations is rarely kept in a neat table.
Instead, an analytical base table needs to be constructed from the raw data sources available in the organisation.
Creating the final analytics base table may require joining data sources, filtering rows in a data source, filtering fields in a data source, deriving new features by combining or transforming existing features, and aggregating data sources.
[Figure: Different data sources are combined to create an analytics base table. Sources shown: Flat files, Operational databases, Data Warehouse & Data Marts, External data feeds, Exotic data]
Source: Chapter 2 Fundamentals of Machine Learning for Predictive Data Analytics, 2nd Edition
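The construction steps listed above (aggregate, join, filter fields, derive) can be sketched with two small in-memory sources. All table names, column names and values are hypothetical; in practice the sources would be databases or files rather than Python literals.

```python
from collections import defaultdict

# Hypothetical source 1: operational claims table (several rows per claimant).
claims = [
    {"claim_id": 1, "claimant_id": "C1", "amount": 10921},
    {"claim_id": 2, "claimant_id": "C1", "amount": 4000},
    {"claim_id": 3, "claimant_id": "C2", "amount": 12347},
]
# Hypothetical source 2: policy records (one row per claimant).
policies = {
    "C1": {"policy_age_years": 5, "internal_note": "n/a"},
    "C2": {"policy_age_years": 1, "internal_note": "n/a"},
}

# Aggregate: summarise each claimant's claims into one record.
history = defaultdict(lambda: {"num_claims": 0, "total_claimed": 0})
for claim in claims:
    summary = history[claim["claimant_id"]]
    summary["num_claims"] += 1
    summary["total_claimed"] += claim["amount"]

# Join the aggregated history with the policy data, keep only the useful
# policy field, and derive a new feature (average claim amount).
abt = []
for claimant_id, summary in history.items():
    policy = policies[claimant_id]
    abt.append({
        "claimant_id": claimant_id,
        "policy_age_years": policy["policy_age_years"],  # internal_note is dropped
        "num_claims": summary["num_claims"],
        "avg_claim": summary["total_claimed"] / summary["num_claims"],
    })

for row in abt:
    print(row)
```

The result has one row per claimant, which is the layout an analytics base table requires.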
Feature selection
Not all the features collected will be relevant
Feature selection is the process of selecting a subset of the most informative features for use in model construction.
A feature can be classified as either:
- A. An informative feature (relevant, not redundant): a feature that is correlated with the output target but is not correlated with other features
- B. A redundant feature (relevant, redundant): a feature that is correlated with the output target but is also correlated with other features
- C. An irrelevant feature: a feature with no correlation to the output target
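The three categories above can be sketched with a correlation-based rule. Using Pearson correlation and a cut-off of 0.5 are assumptions made for illustration; the slides do not name a correlation measure or a threshold, and the data is invented.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no correlation information
    return cov / sqrt(var_x * var_y)

def classify_features(features, target, threshold=0.5):
    """Label each feature as informative, redundant or irrelevant.

    threshold is an ASSUMED cut-off on the absolute correlation.
    """
    labels = {}
    for name, values in features.items():
        if abs(pearson(values, target)) < threshold:
            labels[name] = "irrelevant"
            continue
        correlated_with_other = any(
            abs(pearson(values, other)) >= threshold
            for other_name, other in features.items()
            if other_name != name
        )
        labels[name] = "redundant" if correlated_with_other else "informative"
    return labels

# Invented data: 'amount_cents' duplicates 'amount'; 'noise' is unrelated.
target = [0, 0, 0, 1, 1, 1]
features = {
    "amount": [1, 2, 3, 4, 5, 6],
    "amount_cents": [100, 200, 300, 400, 500, 600],
    "noise": [1, 0, 1, 0, 1, 0],
}
print(classify_features(features, target))
```

Under this simple rule both members of a correlated pair are flagged as redundant; deciding which one to keep is a separate step.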
Feature selection aims to remove redundant and irrelevant
features
1. Some models, notably support vector machines and neural
networks, are sensitive to irrelevant features
2. Other models like linear or logistic regression are vulnerable to
correlated predictors i.e. redundant features
3. Even when a model is insensitive to extra predictors, removing
features can be beneficial. Fewer features result in simpler models,
shorter training times and a reduced potential to overfit
Source: Chapter 10 – Feature Engineering and Selection: A practical approach for predictive modeling
Feature selection can be performed using various methods
- Filter: Perform a statistical test between each descriptive feature and the target feature to identify relevant features [A]. Flow: All features, Filter, Subset of features, Build model, Evaluate model
- Wrapper: Add or remove features from the model and compare performance. Flow: All features, Generate subset, Build model, Evaluate model
- Embedded: The model performs feature selection; it is embedded in the algorithm. Flow: All features, Build model, Evaluate model
[A] Also consider tests between descriptive features to identify redundant features
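The wrapper approach can be sketched as greedy forward selection: start with no features, repeatedly add the feature that most improves model performance, and stop when nothing helps. The choice of a 1-nearest-neighbour model, leave-one-out accuracy as the performance measure and the tiny data set are all assumptions made to keep the example self-contained.

```python
def loo_accuracy(rows, labels, feature_idx):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier that
    only looks at the columns listed in feature_idx."""
    correct = 0
    for i, row in enumerate(rows):
        best_j, best_dist = None, float("inf")
        for j, other in enumerate(rows):
            if i == j:
                continue  # the held-out row may not be its own neighbour
            dist = sum((row[k] - other[k]) ** 2 for k in feature_idx)
            if dist < best_dist:
                best_j, best_dist = j, dist
        correct += labels[best_j] == labels[i]
    return correct / len(rows)

def forward_selection(rows, labels, n_features):
    """Greedily add the feature that improves accuracy most; stop when
    no remaining feature gives a strict improvement."""
    selected, best_score = [], 0.0
    remaining = list(range(n_features))
    while remaining:
        scored = [(loo_accuracy(rows, labels, selected + [f]), f) for f in remaining]
        score, feature = max(scored, key=lambda pair: pair[0])
        if score <= best_score:
            break
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score

# Invented data: column 0 separates the classes, columns 1 and 2 do not.
rows = [
    (1.0, 5.0, 0.2), (1.2, 1.0, 0.9), (0.9, 3.0, 0.4),
    (3.0, 4.8, 0.3), (3.1, 1.2, 0.8), (2.9, 3.1, 0.5),
]
labels = [0, 0, 0, 1, 1, 1]

selected, score = forward_selection(rows, labels, n_features=3)
print(selected, score)
```

Each candidate subset requires building and evaluating a model, which is why wrapper methods are usually more expensive than filter methods.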