BSAD343 BUSINESS ANALYTICS
Osman Abraz
Quinlan School of Business
Week 4: Data Preparation

LO 4.1: Describe Data Flow and Activities to Understand and Prepare Data
LO 4.2: Describe and Explain Business Logic
LO 4.3: Describe Data Governance
LO 4.4: Describe Basic Statistics Used to Understand Data
LO 4.5: Describe Methods to Identify Data Outliers

Process Mapping
Framing: • Problem Recognition • Review of Previous Findings • Business Objectives
Solving: • Data Collection & Understanding • Data Analysis & Preparation • Model Building • Actionable Results
Reporting: • Communicating • Presenting

Data Flow and High Fidelity
Data written by a doctor may be keyed in by a nurse. Every stage in the chain of data flow is a potential source of error. Creating digital historical records is a big challenge in the health industry! Know where your data came from and follow the chain!
Data is carried from source to destination. As data travels, it goes through multiple stages; it is changed, reconstructed, and subjected to potential noise and interference at each stage. High fidelity data is about preserving the integrity of the data as it travels, similar to HF audio.

Preserving Data Integrity
Veracity is critical and also expensive. As in HF audio transmission, we need the tools and the metrics to monitor (dashboards, event logs) and measure (scorecards) quality and integrity at the separate stages as data travels from source to destination. This becomes more critical, and also more challenging, as data becomes bigger and streams 24/7. Failure to do so greatly increases the risk of fast garbage-in, garbage-out.

Data Flow
[Diagram: data travels from source (1) through intermediate stages (2) to destination (3), with verification at each hand-off.]
Ask for HF data ☺! How to verify? Count the number of characters in the message (in geek terms, a checksum!) and add context to the message.

Data Inspection
• Once the raw data are extracted, they must be reviewed and inspected to assess data quality.
• In addition to visually reviewing data, counting and sorting are among the very first tasks most data analysts perform to gain a better understanding of and insight into the data.
• Counting and sorting data help us verify that the data set is complete or that it may have missing values, especially for important variables.
• Sorting data also allows us to review the range of values for each variable.

Collection, Understanding, Preparation
Collection: • Access • Integration • Assess
Understanding: • Variables • Cases • Business Logic
Preparation: • Missing • Outliers • Evaluate
Where in the data flow does the ERD belong?

Business Logic
• Field Integrity: the integrity of values entered in a field (attribute). Examples: a text value entered in a numerical column, a decimal value in an integer column, or a non-defined value in a categorical column.
• Relationship Integrity: the integrity of relationships between fields. A relationship integrity violation occurs when, for example, the number of years a home has been owned is greater than the owner's age.

Business Logic Examples
Relationship example: a checking account may be opened only by an individual who has reached the age of majority (18 in most states).
Can you think of other field and relationship examples like Checking or Age? (A minimal R sketch of such checks appears below.)

Data Preparation
• Once we have inspected and explored the data, we can start the data preparation process.
• Important data preparation techniques include data cleansing, transformation, handling of missing values and outliers, and data derivation.
• There may be missing values and/or outliers in the key variables that are crucial for subsequent analysis.
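The field and relationship integrity checks described above can be scripted. The following is a minimal R sketch under assumed, hypothetical column names (age, years_owned, has_checking, loan_purpose); it illustrates the idea rather than prescribing an implementation.

# Hypothetical customer records; the column names and values are illustrative assumptions.
customers <- data.frame(
  age          = c("25", "17", "abc", "40"),    # "abc" violates field integrity
  years_owned  = c(3, 1, 10, 55),               # 55 > age 40 violates relationship integrity
  has_checking = c(TRUE, TRUE, FALSE, TRUE),
  loan_purpose = c("New Car", "Furniture", "Boat", "Education"),
  stringsAsFactors = FALSE
)

# Field integrity: age must be numeric; loan_purpose must come from a defined list.
age_num          <- suppressWarnings(as.numeric(customers$age))
bad_age_field    <- is.na(age_num)                       # non-numeric age entries
allowed_purposes <- c("Small Appliance", "Furniture", "New Car", "Education")
bad_purpose      <- !(customers$loan_purpose %in% allowed_purposes)

# Relationship integrity: years owned cannot exceed age, and a checking
# account requires the age of majority (18 in most states).
bad_years    <- !is.na(age_num) & customers$years_owned > age_num
bad_checking <- !is.na(age_num) & customers$has_checking & age_num < 18

which(bad_age_field)   # rows with field integrity violations in age
which(bad_purpose)     # rows with non-defined categorical values
which(bad_years)       # rows where years_owned is greater than age
which(bad_checking)    # rows with a checking account below the age of majority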
Data Preparation Activities
[Diagram: data preparation activity wheel showing Cleansing, Transformation, Detection, Reduction, Abstraction, Filtering, Derivation, Attributes.]
Data Cleansing – How to clean the data? Delete blanks, duplicates, miscodes, bad business logic, or plain-garbage 'dirty records'.
Data Transformation – How to express data variables? Units? Normalize? Unit conversions (currency), standardization of data (no units, common baseline).
Data Imputation – How to handle missing values? Delete the entire row/record or field (watch for must-have data vs. nice-to-have data).
Data Filtering – What to do about outliers and unwanted data? Remove unnecessary information or spam → reduce noise.
Data Abstraction – How to handle temporal or qualitative expressions? Age bands, colors, categories → assign a value (itemization).
Data Reduction – How much data should be used? Remove duplicates and redundancies, reduce dimensionality.
Data Derivation – Need to create new data variables? Use longitude and latitude to derive distance or address.
Data Attributes – How to define them? Text, numeric, allowed values, default values.

Data Preparation Checklist
• No data vendor or provider, no matter how sophisticated or reputable, will guarantee the integrity of the data. Read the fine print in the disclaimer: "use at your own risk".
• Always question the data and conduct your own internal evaluation. Do not assume anything!
Cleansing: • Correction • Accurate • Current
Transformation: • Normalize • Units • Absolute
Detection: • Missing • Deletion • Duplicates
Filtering: • Outliers • Unwanted
Derivation: • Derived • Fitting • Private

Data Filtering
• The process of extracting portions of a data set that are relevant to the analysis is called filtering.
• It is commonly used to pre-process the data prior to analysis.
• Filtering can also be used to eliminate unwanted data such as observations that contain missing values, low-quality data, or outliers.
• Sometimes filtering involves excluding variables instead of observations: variables that are irrelevant to the problem, contain redundant information, or have excessive amounts of missing values.

Missing Values
• Missing values are a common quality problem found in data and can lead to a significant reduction in the number of usable observations.
• Understanding why the values are missing is the first step in the treatment of missing values.
  – When working with surveys, respondents might decline to provide the information due to its sensitive nature, or some of the items may not apply to every respondent.
  – Missing values can also be caused by human errors, problems associated with data collection, or equipment failures.
• There are two strategies for dealing with missing values:
  • Omission: exclude observations with missing values. Appropriate when the number of missing values is small.
  • Imputation: replace missing values with some reasonable values.

Handling Missing Values
• For numerical variables, replace missing values with the mean (average) value across relevant observations.
  – Mean imputation does not increase the variability in the data set.
  – If a large number of values are missing, mean imputation will likely distort the relationships among variables, leading to biased results.
• For categorical variables, the most frequent category is often used.
• If a variable with many missing values is deemed unimportant for further analysis related to the problem, the variable may be excluded from the analysis.
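The omission and imputation strategies above can be expressed in a few lines of R. This is a minimal sketch on a hypothetical data frame (the column names income and purpose are assumptions, not from the course files).

# Hypothetical data with missing values; names and values are illustrative.
dat <- data.frame(
  income  = c(52000, NA, 61000, 48000, NA, 57000),
  purpose = c("Furniture", "New Car", NA, "Furniture", "Education", "Furniture"),
  stringsAsFactors = FALSE
)

# Strategy 1 - Omission: drop observations with any missing value
# (appropriate only when few values are missing).
dat_omit <- na.omit(dat)

# Strategy 2 - Imputation:
# numerical variable: replace NA with the mean of the observed values
dat_imp <- dat
dat_imp$income[is.na(dat_imp$income)] <- mean(dat_imp$income, na.rm = TRUE)

# categorical variable: replace NA with the most frequent category
mode_purpose <- names(which.max(table(dat_imp$purpose)))
dat_imp$purpose[is.na(dat_imp$purpose)] <- mode_purpose

dat_omit   # 3 complete rows remain
dat_imp    # all 6 rows kept, with imputed values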
Missing Data or Filling Data (Padding)
The following is an illustrative example of obvious missing data and of the hard-to-identify common practice of data padding. The example shows the closing trading price for a particular stock over a two-month period in 2012. Filling (padding) can be desirable, but identifying filled data is technically much more challenging than identifying missing data. How would you identify it?

Data Governance
Every major or serious institution working with data has, or should have, a data governance policy. Correcting data, replacing missing data, and calculating new data should be conducted according to strict guidelines and regulations as defined by the business, such as licensing, terms and conditions of use, and storage. Handle with care!
▪ Domain expertise is a must. Data should be handled by those who understand the data!
▪ Maintain an audit trail history: when, what, and by whom.
▪ Proper handling of derived and custom data.
▪ Proper tagging to differentiate sourced data from corrected, replaced, derived, and custom data.
▪ Ability to revert if needed.
▪ Strict rules on who is authorized to access the data and what is permitted with the data.

Use Case: Missing Data & Zeros
How should a missing value be represented? Blank? Null? NA? 0?
Sample rows from 'creditrisk.csv':
Loan Purpose      Checking   Savings
Small Appliance   0          739
Furniture         0          1230
New Car           0          389
Furniture         638        347
Education         963        4754
Furniture         2827       0
If you see a zero value, ask yourself: is it real, or does it mean 'nothing' or 'missing'? Is it a true 0 or a false 0? If it is a false 0, it is better not to have it at all, unless it is understood what the 0 stands for. If in doubt, ask the source or provider!
Suppose 50 rows out of 100 in the Checking column have missing or zero values. Is the average of the column sum(Checking)/50 or sum(Checking)/100? It depends, again, on whether the zero values are true or not. The difference can be huge! (See the R sketch below.)

Use Case: What About Units
If units are not spelled out, go back to the source and find out. Even when spelled out, units can sometimes be confusing. Consider the weight unit definitions for ton and tonne:
• 'ton' as used in the United States is a unit of weight equal to 2,000 pounds (907.18474 kg).
• 'tonne' as used in Britain is a metric unit of mass equal to 1,000 kilograms; it is equivalent to approximately 2,204.6 pounds.

$125M Lost in Translation
Los Angeles Times, October 01, 1999: "Mars Probe Lost Due to Simple Math Error"
• NASA lost its $125-million Mars Climate Orbiter because a navigation team at the Jet Propulsion Laboratory (JPL) used the metric system of millimeters and meters in its calculations, while Lockheed Martin Astronautics in Denver, which designed and built the spacecraft, provided crucial acceleration data in the system of inches, feet and pounds.
• JPL engineers mistook acceleration readings measured in English units of pound-seconds for a metric measure of force called newton-seconds.
Attention to details!

Data Corruption: A Very Common Scenario
How do you differentiate between data as sourced and data that was later corrected or edited? This is why clear and strict data governance is needed on how data is maintained and by whom, together with an audit trail.
[Diagram: example values flowing from data sourcing through editing/correction, rounding, and conversion to delivery; e.g., 2.3567 corrected to 2.3570, 2.4127 rounded, and FIPS code 037 (LA) delivered as 37 (NC).]
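Returning to the zeros use case: whether a zero in Checking is a true balance or a stand-in for 'missing' changes the average dramatically. Below is a minimal R sketch with a few illustrative values (not the full creditrisk.csv data).

# Illustrative Checking values; the zeros may be true balances or placeholders for "missing".
checking <- c(0, 0, 0, 638, 963, 2827)

# If the zeros are true values: divide by all observations (here 6).
mean(checking)                    # sum(checking) / 6 = 738

# If the zeros actually mean "missing": recode them to NA and divide by the observed count (here 3).
checking_na <- replace(checking, checking == 0, NA)
mean(checking_na, na.rm = TRUE)   # sum of observed values / 3 = 1476

# The two averages differ by a factor of two - ask the provider which kind of zero it is!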
Use Case: Text and Numbers Conversion
Not all numerals are numbers! Many software packages, including Excel and other spreadsheets, make the mistake of treating numerals as numbers and stripping the leading zeroes. This can cause a lot of problems down the road!
Consider ZIP codes used by the USPS: Union City, New Jersey is 07087; stripping the leading zero turns it into 7087! Can you think of a business logic check to detect such a problem?
FIPS 5-digit codes uniquely identify counties in the US and are used by the National Weather Service. The first two digits identify the state and the last three the county: in 13029, 13 is the state of Georgia and 029 is Bryan County, GA. Similarly, 037 is the Los Angeles County code; stripping the leading 0 leaves 37, which is the state code for North Carolina!

Data & Basic Statistics
Consider the two data series:
Series 1: (1, 4, 8, 4, 7, 2, 3, 5, 24)
Series 2: (60, 65, 45, 70, 85, 70, 50, 55, 90)
• Mean (average) = sum of values / N (number of values).
• Median = middle value of the sorted data. For Series 1, sorted as (1, 2, 3, 4, 4, 5, 7, 8, 24), it is 4. What if the count is even?
• Mode = observation that occurs most frequently. For Series 1 it is 4.
• Dispersion: Range = Max – Min. For Series 1, 24 – 1 = 23.
• Variance is a good measure of spread; Standard Deviation SD (sigma) = SQRT(Variance).
• SNR (Signal-to-Noise Ratio) = Mean / SD.

Basic Statistics - Mean
The mean, often called the average, is a common way to measure the center of a distribution of data. The mean, denoted µ or x̄, can be calculated as
x̄ = (x1 + x2 + ... + xn) / n
where x1, x2, ..., xn represent the n observed values.

Basic Statistics - Variance & Standard Deviation
The mean was introduced as a method to describe the center of a data set, but variability in the data is also important. We introduce two measures of variability: the variance and the standard deviation. Both are very useful in data analysis.

Basic Statistics - Variance
We call the distance of an observation from its mean its deviation. If we square these deviations and then take an average, the result is the sample variance. Variance is therefore the average squared deviation from the mean.
• We divide by n - 1, rather than by n, when computing a sample's variance. There is a mathematical reason for this, but the end result is that it makes the statistic slightly more reliable and useful.
• In the calculation of variance, we use the squared deviations to get rid of negatives, so that observations equally distant from the mean are weighted equally.

Basic Statistics - Standard Deviation
The standard deviation is the square root of the variance and has the same units as the data. The standard deviation is useful when considering how far the data are spread from the mean; it represents the typical deviation from the mean. For roughly bell-shaped data, about 68% of the observations fall within one standard deviation of the mean and about 95% within two standard deviations.

Basic Statistics - Median
The median is the value that splits the data in half when ordered in ascending order. If there is an even number of observations, the median is the average of the two middle values. Since the median is the midpoint of the data, 50% of the values are below it and the other 50% are above it. Hence, it is also the 50th percentile.

Data & Outliers
Another important data preparation task involves the treatment of extremely small or large values, referred to as outliers. An outlier is a data point that differs significantly from the others. A common and sometimes challenging task is to identify data outliers. One method (not the best) to detect outliers is to measure how many standard deviations (SD) a value lies from the mean; a common rule of thumb is 3 SD.
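The basic statistics above and the 3-SD rule can be computed directly in R. A minimal sketch applied to Series 1 (note that R's var() and sd() use the sample formulas that divide by n - 1):

series1 <- c(1, 4, 8, 4, 7, 2, 3, 5, 24)

mean(series1)                       # mean (average)
median(series1)                     # middle value of the sorted data -> 4
names(which.max(table(series1)))    # mode: most frequent value -> "4"
diff(range(series1))                # range = max - min -> 23
var(series1)                        # sample variance (divides by n - 1)
sd(series1)                         # standard deviation = sqrt(variance)
mean(series1) / sd(series1)         # SNR = mean / SD

# 3-SD rule: flag values more than 3 standard deviations from the mean
m <- mean(series1)
s <- sd(series1)
series1[abs(series1 - m) > 3 * s]   # candidate outliers (empty if none fall outside the band)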
[Figure: normal distribution of sample data; frequency of occurrence (# students) vs. values (grades), with possible outliers beyond Mean – 3*SD and Mean + 3*SD.]

Outlier Detection Method
Consider the two data series:
Series 1: (1, 4, 6, 4, 5, 2, 3, 5, 24)
  Mean = 6, SD = 5.7, Mean + 3*SD = 23.1, Mean – 3*SD = -11.1
Series 2: (60, 65, 45, 70, 85, 70, 50, 55, 90)
  Mean = 65.5, SD = 15.1, Mean + 3*SD = 110.84, Mean - 3*SD = 20.2
Which measures are most affected by an outlier? Consider the mean and the median. What about the range?

Box Plot
The box in a box plot represents the middle 50% of the data, and the thick line in the box is the median.

Q1, Median, Q3, IQR
• The 25th percentile is also called the first quartile, Q1.
• The 50th percentile is also called the median.
• The 75th percentile is also called the third quartile, Q3.
Between Q1 and Q3 is the middle 50% of the data. The range these data span is called the interquartile range, or IQR: IQR = Q3 - Q1.

Whiskers and Outliers
Whiskers of a box plot can extend up to 1.5 x IQR away from the quartiles.
max upper whisker reach = Q3 + 1.5 x IQR
max lower whisker reach = Q1 - 1.5 x IQR
A potential outlier is an observation beyond the maximum reach of the whiskers, i.e., an observation that appears extreme relative to the rest of the data. Should we remove or keep outliers?

What should be done with outliers?
• They should be understood in the context of the data. An outlier for a year of data may not be an outlier for the month in which it occurred, and vice versa.
• They should be investigated to determine if they are in error. The values may simply have been entered incorrectly. If a value can be corrected, it should be.
• They should be investigated to determine why they are so different from the rest of the data. For example, were extra or fewer sales seen because of a special event like a holiday?

Suggested Studies / Self-Study Exercises
→ What is high fidelity data and why is it important?
→ What is data governance and why is it important?
→ Study the use cases of missing data, rounding, and conversion.
→ What is a good way to handle missing data?
→ Consider the Checking and Savings columns in creditrisk.csv. Compute the Mean, SD, Max, and Min for each. Any outliers? Try with both Excel and R. Reconcile the two.
→ Repeat the above calculation after replacing all zero entries in Checking and Savings with 'NA'. Try with both Excel and R. Conclusions?
→ Looking at the Divvy Bikes sharing data (https://www.divvybikes.com/data), can you think of some business logic integrity checks?

BSAD Lab 04
Lab Session 04
Two main tasks: • Data outliers • Data preparation

Lab Session (In-Class and Take-Home)
Task 1: Data Outliers (Done ☐)
• Refer to the last 'creditrisk.csv' file from Sakai.
• Calculate the Mean, STD, Max, and Min for the Age column. Use both Excel and R. Compare the results.
• Use a formula to detect outliers (see the R sketch after the cheat-sheet below). Are there any?

Task 2: Data Preparation (Done ☐)
• Download the new original 'creditriskorg.csv' to your desktop.
• Calculate the Mean for Checking in Excel. Explain how the missing values are treated in the Excel calculation.
• Read the file in R. Watch for the header and the empty line.
• Calculate the Mean for Checking in R. Observations?
• Identify the data issues that must be handled in R to calculate the mean. Hint: $, NA, and comma (1,235 vs 1235).
• Explore ways to resolve them, in Excel and/or in R.

R Commands Cheat-Sheet:
newdata = read.csv("creditriskorg.csv", skip=1, header=TRUE, sep=",")
checking = newdata$Checking
checking = gsub(",", "", checking)    # remove thousands-separator commas
checking = gsub("\\$", "", checking)  # remove dollar signs
checking = as.numeric(checking)       # convert to numeric; non-numeric entries become NA
mean(checking, na.rm=TRUE)            # mean with NA values removed
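For Task 1, the box-plot (1.5 x IQR) rule discussed above can serve as the formula to detect outliers, alongside the 3-SD rule. A minimal R sketch, assuming creditrisk.csv is in the working directory and contains a numeric Age column (verify the column name and file layout against the actual file):

# Assumes an Age column that reads in as numeric; adjust to match the actual file.
credit <- read.csv("creditrisk.csv", header = TRUE)
age    <- credit$Age

# Descriptive statistics for Task 1
mean(age, na.rm = TRUE)
sd(age, na.rm = TRUE)
max(age, na.rm = TRUE)
min(age, na.rm = TRUE)

# Box-plot rule: quartiles, IQR, and maximum whisker reaches
q1    <- quantile(age, 0.25, na.rm = TRUE)
q3    <- quantile(age, 0.75, na.rm = TRUE)
iqr   <- q3 - q1
upper <- q3 + 1.5 * iqr    # max upper whisker reach
lower <- q1 - 1.5 * iqr    # max lower whisker reach

# Potential outliers lie beyond the whisker reaches
age[!is.na(age) & (age < lower | age > upper)]

# boxplot() applies the same rule and returns the flagged points in $out
boxplot(age, main = "Age")$out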
Task 3: Divvy Data Business Logic and Data Modeling (Done ☐)
• Go to https://www.divvybikes.com/data
• Download the latest 2016 Q3/Q4 data (zip file).
• Check the README file. Open the file Divvy_Trips_2016_Q4 in Excel.
• Per the Data License Agreement, who owns the data, and can the data be correlated with other sources?
• Note the size of the file and the number of columns and rows.
• Identify the unique entities and fields.
• Define a relational business logic check for the column field 'tripduration' (see the R sketch after this list).
• Using www.erdplus.com, draw a star schema using the following three tables:
  - A Fact table for Trip
  - A Dimension table for Station
  - A Dimension table for User
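One possible relational business-logic check for 'tripduration': the recorded duration should be positive, should not exceed a sensible maximum (for example, 24 hours), and should agree with the difference between the stop and start times. The sketch below is an assumption-laden illustration: the file name, the column names (tripduration, starttime, stoptime), the unit of tripduration (seconds), and the timestamp format all need to be verified against the actual Divvy_Trips_2016_Q4 file and its README.

# Column names, units, and timestamp format are assumptions; verify against the README.
trips <- read.csv("Divvy_Trips_2016_Q4.csv", header = TRUE, stringsAsFactors = FALSE)

# Parse the start and stop timestamps (format string assumed).
start <- as.POSIXct(trips$starttime, format = "%m/%d/%Y %H:%M")
stop  <- as.POSIXct(trips$stoptime,  format = "%m/%d/%Y %H:%M")

# Field integrity: duration must be a positive number.
bad_nonpositive <- is.na(trips$tripduration) | trips$tripduration <= 0

# Business rule: flag implausibly long trips (here, longer than 24 hours).
bad_too_long <- trips$tripduration > 24 * 60 * 60

# Relationship integrity: tripduration should match stoptime - starttime
# (allowing a small tolerance, here 60 seconds, for rounding in the source data).
elapsed      <- as.numeric(difftime(stop, start, units = "secs"))
bad_mismatch <- !is.na(elapsed) & abs(elapsed - trips$tripduration) > 60

# Counts of violations for each rule
sum(bad_nonpositive, na.rm = TRUE)
sum(bad_too_long,    na.rm = TRUE)
sum(bad_mismatch,    na.rm = TRUE)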