T2 Data to Insights to Decisions
Data Analytics 344
Eldon Burger

Figure: Data collection from a self-driving car. Source: https://www.greenbiz.com/article/driverless-cars-wont-be-good-environment-if-they-lead-more-auto-use

Summary

As part of the business understanding phase of CRISP-DM, (i) a business problem must be converted into an analytical solution and (ii) the feasibility of the solution must be determined.

i. To develop an analytical solution, informative features must be identified, data that corresponds to the features must be collected, and the data must be structured in an analytics base table.
ii. To determine the feasibility of an analytical solution, the availability of historical data and the capacity of the business to action the insights must be considered.

Outline

- The business understanding phase
- The analytic base table
- Feature selection

The business understanding phase

An analytic project starts with understanding the business problem

An analytic problem is not handed to a data analytics practitioner fully defined; rather, a business problem or need is provided. Given a business problem, it is the job of the data analytics practitioner to decide how to address the problem using analytics.

A key step in any data analytics project is to understand the business problem that the organisation wants to solve and then to determine how an analytical model can help the business to solve the problem.

Figure: The six phases of the CRISP-DM process and the key relationships between the six phases: (1) Business understanding, (2) Data understanding, (3) Data preparation, (4) Modelling, (5) Evaluation, (6) Deployment.
Source: Chapter 2, Fundamentals of Machine Learning for Predictive Data Analytics, 2nd Edition

A business problem can be converted into an analytical solution by answering three key questions

1. What is the business problem and what are the goals that the business wants to achieve?
2. How does the business currently work?
3. In what ways could an analytical model address the business problem?
For each possible analytical solution, consider what type of model will be created, where and when the model will be used by the business, and how the model will help to address the business problem.

Illustrative example 1: A predictive model for motor insurance fraud

Despite having a fraud investigation team that investigates 30% of all motor insurance claims, a motor insurance company is losing money due to fraudulent claims. Investigating claims takes time and has a cost associated with it.

                      Fraud             Not fraud
Investigate           True positive     False positive
Do not investigate    False negative    True negative

- False positive: when the outcome of an investigation is that the claim is not fraudulent, the company has wasted time and money.
- False negative: when a fraudulent claim is not investigated, the company loses money due to fraud.

Illustrative example 2: A predictive model for motor insurance fraud (1/2)

Currently, the business uses a fraud checklist to decide whether a claim should be investigated for fraud.

Figure: Simplified illustration of a motor claim process. Steps shown: Lodge claim; Claim valid? (Yes/No); Investigate? (Yes/No); Investigate; Fraud: Reject claim; Not fraud: Pay-out claim.

Illustrative example 2: A predictive model for motor insurance fraud (2/2)

The fraud checklist assigns a risk rating per criterion:

Criterion                                  Low                    Medium                  High
Inconsistencies in claimant's statement    Consistent statement   Minor inconsistencies   Significant contradictions
Support, e.g. witness statement            Supportive             Not available           Conflicting
Previous claim history                     Clean claim history    Frequent claims         Previous fraudulent claim

Illustrative example 3: A predictive model for motor insurance fraud

Business problem: Despite having a fraud investigation team that investigates 30% of all motor insurance claims, a motor insurance company is still losing money due to fraudulent claims.

Current business operation: the simplified motor claim process shown in the figure above.

Analytical solution: Build a model to determine the likelihood that a claim is fraudulent.

Once the analytic solution has been defined, the feasibility of the solution should be determined

Data availability: To develop data-driven models, we require data. For the claim problem, a large collection of historical claims and their labels (e.g. fraudulent or non-fraudulent) is required.

Capacity to action: An analytical model is only useful if a business can use it.
For the claim problem, the model is envisioned to replace the checklist, helping the fraud investigation team decide whether a claim should be investigated or not.

The analytic base table

Illustrative example: After the business understanding phase, data collection and preparation start

For the motor insurance fraud example, historical claim information can be collected in a table-like structure. Each row of the table represents one claim and each column of the table represents a feature: a measurable property of the object that we want to analyse.

Example of data collected for the motor insurance fraud example:

ID   Location     Time of day   Weather   Driver licence #   Claim amount   Fraud
1    Dorp         13:00         Sunny     42300JK2456        10 921         No
2    Adam Tas     13:30         Cloudy    600100147J35       20 567         No
3    Joubert Rd   12:30         Sunny     206100082R6M       12 347         Yes
…    …            …             …         …                  …              …

- Descriptive features: properties available for a claim that should help to decide if a claim is fraudulent or not.
- Target feature (Fraud): the feature we want our model to determine for a new claim. Not all analytical solutions will have a target feature.

The basic format used to represent data in data analytic projects is the Analytics Base Table

An analytics base table (ABT) is a flat tabular data structure made up of rows and columns.

Each column of the ABT represents a feature, while each row represents an instance. The columns are divided into two sets: descriptive features and a single target feature.
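The ABT structure described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the field names are hypothetical, the claim amounts are read as integers (assuming the space in "10 921" is a thousands separator), and treating ID as an identifier rather than a descriptive feature is an assumption.

```python
# A tiny analytics base table: one dictionary per instance (row).
# Values come from the example claims table; field names are hypothetical.
abt = [
    {"id": 1, "location": "Dorp", "time_of_day": "13:00", "weather": "Sunny",
     "driver_licence": "42300JK2456", "claim_amount": 10921, "fraud": "No"},
    {"id": 2, "location": "Adam Tas", "time_of_day": "13:30", "weather": "Cloudy",
     "driver_licence": "600100147J35", "claim_amount": 20567, "fraud": "No"},
    {"id": 3, "location": "Joubert Rd", "time_of_day": "12:30", "weather": "Sunny",
     "driver_licence": "206100082R6M", "claim_amount": 12347, "fraud": "Yes"},
]

TARGET = "fraud"   # the single target feature
ID_FIELD = "id"    # assumption: an identifier, not a descriptive feature


def split_instance(instance, target=TARGET, id_field=ID_FIELD):
    """Split one ABT row into (descriptive features, target value)."""
    descriptive = {
        name: value
        for name, value in instance.items()
        if name not in (target, id_field)
    }
    return descriptive, instance[target]


# Separate the two sets of columns for every instance in the table.
descriptive_rows = []
targets = []
for row in abt:
    descriptive, label = split_instance(row)
    descriptive_rows.append(descriptive)
    targets.append(label)

print(sorted(descriptive_rows[0]))  # names of the descriptive features
print(targets)                      # target value of each instance
```

Keeping the target separate from the descriptive features from the start makes it harder to feed the target to a model as an input by accident.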
Features are also referred to as attributes, fields or variables.

Each row contains values for both the descriptive features and the target feature, representing an instance for which a prediction can be made. The term one-row-per-subject is often used to describe the structure of the ABT. Instances are also referred to as examples, records or observations.

Figure: The basic structure of an analytics base table (columns: descriptive features and a target feature; rows: instances).

To create an analytics base table, relevant features must be identified

The selection of features to include in an analytics base table is based on domain knowledge and on analysis of the relationships between features.

To identify features, start by identifying a set of domain concepts. A domain concept is a high-level abstraction that describes some characteristic of the prediction subject, from which we derive a set of concrete features.

Figure: The hierarchical relationship between an analytics solution, domain concepts and descriptive features (levels shown: Analytics solution; Target concept and Domain concepts; Domain subconcepts; Target feature and Features).

Illustrative example: Domain concepts for motor insurance fraud

For the motor insurance claim fraud analytical solution, example domain concepts include:

- Policy details: Covers information relating to the policy held by the claimant, e.g. age of the policy
- Claim details: Covers details of the claim itself, e.g. incident type, claim amount
- Claimant history: Information on previous claims made by the claimant, e.g. number of previous claims
- Claimant links: Relationships or links between the claimant and other people involved
- Claimant demographics: Demographic details of the claimant such as age, gender and occupation

Figure: Example domain concepts for a motor insurance fraud prediction analytics solution (labels shown: Motor insurance claim fraud; Fraud outcome (target concept); Policy details; Claim details; Claim types; Claimant history; Claim frequency; Claimant links; Links with other claims; Links with current claims; Claimant demographics).

Once the domain concepts have been identified, the features can be designed (1/2)

The descriptive features in an analytics base table can either be raw features or derived features. Raw features are features that are directly "copied" from the source data.
Derived features are features that are constructed from one or more raw data sources. For example, the location of an accident is a raw feature, while accident "zones" is a derived feature.

Once the domain concepts have been identified, the features can be designed (2/2)

Three data considerations are particularly important when we are designing features:

- Availability
  Description: Data must be available to implement any feature we want to use.
  Example: In an online payments scenario, we might want to include the average account balance over the past six months; however, the company might not track the historical balance.
- Timing
  Description: Data that will be used to define a feature must be available before the target feature is known.
  Example: If we want to predict the ice cream sales for tomorrow, we cannot include tomorrow's temperature as a feature unless we can predict it accurately.
- Longevity
  Description: Data can become stale.
  Example: To determine if a loan should be granted to an applicant, we might want to use salary as a descriptive feature. However, salaries will change over an extended time.

An Analytics Base Table is constructed from multiple data sources

Although analytics base tables are the key structure that we use to develop models, data in organisations is rarely kept in a neat table. Instead, an analytics base table needs to be constructed from the raw data sources available in the organisation.

Creating the final analytics base table may require joining data sources, filtering rows in a data source, filtering fields in a data source, deriving new features by combining or transforming existing features, and aggregating data sources.

Figure: Different data sources are combined to create an analytics base table (sources shown: flat files, operational databases, data warehouse and data marts, external data feeds, exotic data).

Feature selection

Not all the features collected will be relevant

Feature selection is the process of selecting a subset of the most informative features for use in model construction. A feature can be classified as one of the following:

A. An informative feature: a feature that is correlated with the output target, but is not correlated with other features (relevant and not redundant)
B. A redundant feature: a feature that is correlated with the output target, but is also correlated with other features (relevant but redundant)
C. An irrelevant feature: a feature with no correlation to the output target

Feature selection aims to remove redundant and irrelevant features

1. Some models, notably support vector machines and neural networks, are sensitive to irrelevant features.
2. Other models, like linear or logistic regression, are vulnerable to correlated predictors, i.e. redundant features.
3. Even when a model is insensitive to extra predictors, removing features can be beneficial. Fewer features result in simpler models, shorter training time and a reduced potential to overfit.

Source: Chapter 10, Feature Engineering and Selection: A practical approach for predictive modeling

Feature selection can be performed using various methods

- Filter: Perform a statistical test between each descriptive feature and the target feature to identify relevant features. Also consider tests between descriptive features to identify redundant features.
  Flow: All features -> Filter -> Subset of features -> Build model -> Evaluate model
- Wrapper: Add or remove features to the model and compare performance.
  Flow: All features -> Generate subset -> Build model -> Evaluate model (repeated for each candidate subset)
- Embedded: The model performs feature selection; it is embedded in the algorithm.
  Flow: All features -> Build model -> Evaluate model
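The filter method above can be sketched as follows. This is only one possible reading, under stated assumptions: Pearson correlation stands in for the "statistical test", the two thresholds are arbitrary illustrative values, and the toy data is made up purely to exercise the code (it is not taken from the insurance example).

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)


def filter_features(features, target, relevance=0.5, redundancy=0.9):
    """Label each feature as informative, redundant or irrelevant.

    relevance and redundancy are assumed cut-offs on the absolute correlation.
    """
    # Visit features from the strongest to the weakest link with the target.
    ranked = sorted(
        features,
        key=lambda name: abs(pearson(features[name], target)),
        reverse=True,
    )
    labels, kept = {}, []
    for name in ranked:
        if abs(pearson(features[name], target)) < relevance:
            labels[name] = "irrelevant"
        elif any(
            abs(pearson(features[name], features[other])) >= redundancy
            for other in kept
        ):
            labels[name] = "redundant"  # duplicates a feature already kept
        else:
            labels[name] = "informative"
            kept.append(name)
    return labels, kept


# Made-up toy data: 1 = fraud, 0 = not fraud (hypothetical values).
target = [0, 0, 0, 1, 1, 1, 0, 1]
features = {
    "claim_amount": [10, 12, 11, 30, 28, 33, 9, 31],
    "claim_amount_x2": [20, 24, 22, 60, 56, 66, 18, 62],  # copy of the above, doubled
    "day_of_month": [3, 17, 9, 12, 5, 20, 14, 8],
}

labels, kept = filter_features(features, target)
for name, label in labels.items():
    print(f"{name}: {label}")
print("subset passed on to model building:", kept)
```

A filter like this never builds a model, which is what distinguishes it from the wrapper and embedded methods; the selected subset would still need to be checked by building and evaluating a model.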