
BSAD343W4

BSAD343
BUSINESS ANALYTICS
Osman Abraz
Quinlan School of Business
Week 4: Data Preparation
LO 4.1: Describe Data Flow and Activities to Understand and Prepare Data
LO 4.2: Describe and Explain Business Logic
LO 4.3: Describe Data Governance
LO 4.4: Describe Basic Statistics Used to Understand Data
LO 4.5: Describe Methods to Identify Data Outliers
Process Mapping
Framing
• Problem Recognition
• Review of Previous Findings
• Business Objectives
Solving
• Data Collection & Understanding
• Data Analysis & Preparation
• Model Building
• Actionable Results
Reporting
• Communicating
• Presenting
Data Flow and High Fidelity
Data written by a doctor may be keyed in by a nurse. Every stage in the chain of
data flow is a potential source of error. Creating digital historical records is a big
challenge in the health industry!
Know where your data came from and follow the chain!
Data is carried from source to destination. As data travels it goes through multiple
stages; it is changed, reconstructed, and subjected to potential noise and
interference at each stage.
High fidelity data is about preserving the integrity of the data as it travels, similar
to high-fidelity (HF) audio.
Preserving Data Integrity
Veracity is critical and also expensive.
As in HF audio transmission, we need the tools and metrics to monitor (dashboards, event
logs) and measure (scorecards) quality and integrity at separate stages as data travels from source to
destination.
This becomes more critical, and also more challenging, as data gets bigger and streams 24/7.
Failure to do so greatly increases the risk of fast garbage-in, garbage-out.
Data Flow: data passes from stage 1 to stage 2 to stage 3 and should be verified at each hand-off.
Ask for HF data ☺!
How to verify?
• Count the number of characters in the message (in geek terms, a checksum!)
• Add context to the message
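A minimal R sketch of the "count the characters" idea (the message text and function name are made up for illustration; real systems use stronger checksums such as CRC or MD5):

simple_checksum <- function(msg) nchar(msg)             # toy checksum: message length
msg_sent     <- "Patient 1023, glucose 5.4 mmol/L"      # message at the source
check_sent   <- simple_checksum(msg_sent)               # checksum travels with the message
msg_received <- "Patient 1023, glucose 54 mmol/L"       # a character was lost in transit
check_recv   <- simple_checksum(msg_received)
check_recv == check_sent                                # FALSE -> flag the record for review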
Data Inspection
• Once the raw data are extracted, they must be reviewed and
inspected to assess data quality.
• In addition to visually reviewing the data, counting and sorting are
among the very first tasks most data analysts perform to gain a
better understanding of, and insight into, the data.
• Counting and sorting data help us verify that the data set is
complete or that it may have missing values, especially for
important variables.
• Sorting data also allows us to review the range of values for
each variable.
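A minimal R sketch of counting and sorting during inspection (the file creditrisk.csv and its column names are assumed from the course materials):

credit <- read.csv("creditrisk.csv")        # assumed course file
nrow(credit); ncol(credit)                  # how many cases and variables?
summary(credit)                             # ranges and NA counts per variable
sum(is.na(credit$Checking))                 # count missing values in a key variable
sort(credit$Age)                            # sort to review the range of values
head(credit[order(credit$Checking), ])      # records with the smallest Checking values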
Collection, Understanding, Preparation
Collection
• Access
• Integration
• Assess
Understanding
• Variables
• Cases
• Business Logic
Preparation
• Missing
• Outliers
• Evaluate
Data Flow
Where does the ERD (entity-relationship diagram) belong?
Business Logic
• Field Integrity
Field integrity is about the integrity of the values entered in a field
(attribute). For example, a text value entered in a numerical
column, a decimal value in an integer column, or an undefined
value in a categorical column.
• Relationship Integrity
Relationship integrity is about the integrity of the relationship between
fields. A relationship integrity violation is when, for example, the
number of years something has been owned is greater than the owner's age.
Business Logic Examples
Relationship example: a checking account may be opened by an individual who has
reached the age of majority (the age of majority is 18 in most states).
Field example: Checking must be a non-negative numeric value; Age must be a whole number.
Can you think of other examples like Checking or Age?
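A minimal R sketch of such business logic checks (the file creditrisk.csv and the column names Age and Checking are assumed):

credit <- read.csv("creditrisk.csv")                          # assumed file and columns

# Field integrity: Age must be present and within a plausible range
bad_age <- is.na(credit$Age) | credit$Age < 0 | credit$Age > 120

# Relationship integrity: a checking account holder must have reached the age of majority
bad_checking <- !is.na(credit$Checking) & credit$Checking > 0 & credit$Age < 18

credit[bad_age | bad_checking, ]                              # flag violating records for review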
Data Preparation
• Once we have inspected and explored data, we can start the data
preparation process.
• Important data preparation techniques include data cleansing, data
transformation, handling of missing values and outliers, and data
derivation.
• There may be missing values and/or outliers in the key variables
that are crucial for subsequent analysis.
Data Preparation Activities
(Activities: Cleansing, Transformation, Detection, Reduction, Abstraction, Filtering, Derivation, Attributes)
Data Cleansing – How to clean the data?
Delete blanks, duplicates, miscodes, bad business logic, or plain garbage 'dirty records'
Data Transformation – How to express data variables? Units? Normalize?
Unit conversions (currency), standardization of data (no units, common baseline)
Data Imputation – How to handle missing values?
Delete the entire row/record or field (watch for must-have data vs. nice-to-have data)
Data Filtering – What to do about outliers and unwanted data?
Remove unnecessary information or spam → reduce noise
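A minimal R sketch of cleansing and transformation on a small made-up data frame (the column names and the exchange rate are invented for illustration):

loans <- data.frame(Purpose = c("Furniture", "Furniture", "", "New Car"),
                    Age = c(35, 35, 41, 28),
                    Savings_eur = c(500, 500, 1200, 300))

# Cleansing: drop duplicate rows and records with a blank key field
clean <- unique(loans)
clean <- clean[clean$Purpose != "", ]

# Transformation: unit conversion and standardization to a common baseline
clean$Savings_usd <- clean$Savings_eur * 1.08                     # illustrative exchange rate
clean$Age_z <- (clean$Age - mean(clean$Age)) / sd(clean$Age)      # unit-free z-score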
Data Preparation Activities
Data Abstraction – How to handle temporal or qualitative expressions?
Age bands, colors, categories → assign a value (itemization)
Data Reduction – How much data should be used?
Remove duplicates and redundancies, reduce dimensionality
Data Derivation – Need to create new data variables?
Use longitude and latitude to derive distance or address
Data Attributes – How to define them?
Text, numeric, allowed values, default values
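A minimal R sketch of abstraction and derivation (the age values and coordinates are made up; the haversine formula is a standard way to derive distance from latitude/longitude):

# Abstraction: map numeric ages into age bands (itemization)
age <- c(19, 27, 34, 52, 66)
cut(age, breaks = c(0, 25, 40, 60, Inf), labels = c("<=25", "26-40", "41-60", "60+"))

# Derivation: distance in km between two latitude/longitude points
haversine_km <- function(lat1, lon1, lat2, lon2, r = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 + cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(sqrt(a))
}
haversine_km(41.88, -87.63, 41.97, -87.66)   # two points in Chicago, roughly 10 km apart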
Data Preparation Checklist
• No data vendor or provider, no matter how sophisticated or reputable, will
guarantee the integrity of the data. Read the fine print in the disclaimer: "use at your
own risk".
• Always question and conduct your own internal evaluation. Do not assume anything!
Cleansing: Correction, Accurate, Current
Transformation: Normalize, Units, Absolute
Detection: Missing, Deletion, Duplicates
Filtering: Outliers, Unwanted
Derivation: Derived, Fitting, Private
Data Filtering
• The process of extracting portions of a data set that are
relevant to the analysis is called filtering.
• It is commonly used to pre-process the data prior to analysis.
• Filtering can also be used to eliminate unwanted data such as
observations that contain missing values, low quality data, or
outliers.
• Sometimes, filtering means excluding variables rather than observations:
variables that are irrelevant to the problem, that contain redundant
information, or that have excessive amounts of missing values.
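A minimal R sketch of filtering (the file creditrisk.csv and the column names, including Months.Employed, are assumed):

credit <- read.csv("creditrisk.csv")

# Keep only the observations relevant to the analysis
adults <- credit[credit$Age >= 18, ]

# Eliminate unwanted observations: rows with missing values in key variables
complete <- credit[!is.na(credit$Checking) & !is.na(credit$Savings), ]

# Or exclude an irrelevant / mostly missing variable instead of observations
reduced <- credit[, setdiff(names(credit), "Months.Employed")]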
Missing Values
• Missing values are a common quality problem found in data and can lead to a
significant reduction in the number of usable observations.
• Understanding why the values are missing is the first step in the treatment of
missing values.
– When working with surveys, respondents might decline to provide the
information due to its sensitive nature or some of the items may not apply to
every respondent
– Can also be caused by human errors, problems associated with data
collection, or equipment failures
• There are two strategies for dealing with missing values:
– Omission: exclude observations with missing values. This is appropriate when
the number of missing values is small.
– Imputation: replace missing values with reasonable substitute values.
Handling Missing Values
• For numerical variables, replace missing values with the
mean/average value across relevant observations.
– This doesn't increase the variability in the data set.
– If a large number of values are missing, mean imputation will likely distort the
relationships among variables, leading to biased results.
• For categorical variables, the most frequent category is often
used.
• If the variable that has many missing values is deemed
unimportant for further analysis related to the problem, the
variable may be excluded from the analysis.
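A minimal R sketch of omission and imputation (the file creditrisk.csv and the column names are assumed):

credit <- read.csv("creditrisk.csv")

# Omission: drop observations that contain missing values (fine when few are missing)
complete_rows <- na.omit(credit)

# Imputation: replace missing numeric values with the mean of the observed values
credit$Checking[is.na(credit$Checking)] <- mean(credit$Checking, na.rm = TRUE)

# Imputation for a categorical variable: use the most frequent category
freq <- table(credit$Loan.Purpose)
credit$Loan.Purpose[is.na(credit$Loan.Purpose)] <- names(freq)[which.max(freq)]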
Missing Data or Filling Data (Padding)
The following is an illustrative example of obvious missing data and of the
harder-to-identify common practice of data padding. The example shows the
closing trading price for a particular stock over a two-month period in 2012.
Filling can be desirable, but it is technically much more challenging to identify
filled data than missing data. How can it be identified?
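One hedged way to look for padded (filled) data is to search for suspiciously long runs of identical values; the price series below is made up for illustration:

price <- c(12.10, 12.15, 12.15, 12.15, 12.15, 12.15, 12.40, NA, 12.55)       # made-up daily closes
runs  <- rle(price)                                                          # run-length encoding
data.frame(value = runs$values, length = runs$lengths)[runs$lengths >= 3, ]  # long runs to review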
Data Governance
Every major/serious institution working with data has, or should have, a data
governance policy. Correcting data, replacing missing data, and calculating new
data should be conducted according to strict guidelines and regulations as defined
by the business, such as licensing, terms and conditions of use, and storage. Handle
with care!
▪ Domain expertise is a must. Data should be handled by those who understand the data!
▪ Maintain an audit trail history: when, what, and by whom
▪ Proper handling of derived and custom data
▪ Proper tagging to differentiate sourced data from corrected, replaced, derived, and custom data
▪ Ability to revert if needed
▪ Strict rules on who is authorized to access the data, and what is permitted with the data
Use Case: Missing Data & Zeros
How should it be represented? Blank? Null? NA? 0?
Loan Purpose       Checking   Savings     (from 'creditrisk.csv')
Small Appliance           0       739
Furniture                 0      1230
New Car                   0       389
Furniture               638       347
Education               963      4754
Furniture              2827         0
If you see a zero value, ask yourself whether it is real or whether it means 'nothing'
or 'missing'. Is it a true 0 or a false 0? If it is false, it is better not to have it, unless it is
understood what the 0 stands for. If unsure, ask the source or provider!
Suppose 50 rows in the Checking column of 100 rows have missing or zero
values. Is the average of the column sum(Checking)/50 or
sum(Checking)/100? Again, it depends on whether the zero values are true or not.
The difference can be huge!
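A minimal R sketch of the difference, using the six Checking values shown above:

checking <- c(0, 0, 0, 638, 963, 2827)                # the six values from the table
mean(checking)                                        # zeros treated as real values: 738
checking_na <- replace(checking, checking == 0, NA)   # zeros treated as missing
mean(checking_na, na.rm = TRUE)                       # mean of the observed values: 1476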
Use Case: What About Units
If units are not spelled out, go back to source and find out. Even when
spelled out it can sometimes be confusing. Consider the weight unit
definitions for ton, and tonne:
‘ton’ as used in the United States is a unit of weight equal to 2,000 pounds
(907.18474 kg).
‘tonne’ as used in Britain is a metric unit of mass equal to 1,000 kilograms;
it is equivalent to approximately 2,204.6 pounds.
$125M Lost in Translation
Los Angeles Times, October 01, 1999
Mars Probe Lost Due to Simple Math Error
• NASA lost its $125-million Mars Climate Orbiter because a navigation team at
the Jet Propulsion Laboratory (JPL) used the metric system of millimeters and
meters in its calculations, while Lockheed Martin Astronautics in Denver, which
designed and built the spacecraft, provided crucial acceleration data in the
system of inches, feet and pounds.
• JPL engineers mistook acceleration readings measured in English units of pound-seconds for a metric measure of force called newton-seconds.
Attention to details!
Data Corruption: A Very Common Scenario
How do you differentiate between data as it was sourced and data that was corrected or edited? This is why
clear and strict data governance is needed: how data is maintained, by whom, and with an audit trail.
Diagram: data passes through Data Sourcing → Data Editing/Correction → Data Rounding →
Data Conversion → Data Delivered. Along the way the sourced value 2.3567 appears as 2.3570 and
2.4127, and the county code 037 (Los Angeles) is delivered as 37 (North Carolina).
Use Case: Text and Numbers Conversion
Not all numerals are numbers!
Many software packages, including Excel and other spreadsheets, often make the mistake of treating
numerals as numbers and stripping the leading zeroes. This can cause a lot of problems down the
road!
Consider ZIP codes used by the USPS:
Union City, New Jersey 07087 → after stripping → 7087!
Can you think of a business logic check to detect such a problem?
FIPS 5-digit codes uniquely identify counties in the US and are used by the National Weather Service.
The first two digits identify the state and the last three the county.
13029 is 13 for the state of Georgia and 029 for Bryan County, GA.
037 is Los Angeles County; stripping the leading 0 leaves 37, which is the state code for North Carolina!
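A minimal R sketch of protecting leading zeroes (the file addresses.csv and the column name zip are assumed for illustration):

zips <- read.csv("addresses.csv", colClasses = c(zip = "character"))   # read codes as text

# Business-logic check: a 5-digit code must have exactly 5 characters
bad <- zips$zip[nchar(zips$zip) != 5]

# Repairing a code that was already stripped by an earlier tool
sprintf("%05d", 7087)    # "07087"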
Data & Basic Statistics
Consider the two data series:
Series 1 (1,4,8,4,7,2,3,5,24)
Series 2 (60,65,45,70,85,70,50,55,90)
Mean (Average) = Sum of values / N (number of values)
Median = Middle value in the sorted data. For Series 1, sorted as (1,2,3,4,4,5,7,8,24), it is 4. What if the count is even?
Mode = Observation that occurs most frequently. For Series 1 it is 4.
Dispersion = Range = Max – Min. For Series 1: 24 – 1 = 23
Variance = a good measure of spread = sum of squared deviations from the mean, divided by (n – 1)
Standard Deviation SD (sigma) = SQRT(Variance)
SNR (Signal-to-Noise Ratio) = Mean / SD
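A minimal R sketch computing these statistics for the two series above:

s1 <- c(1, 4, 8, 4, 7, 2, 3, 5, 24)
s2 <- c(60, 65, 45, 70, 85, 70, 50, 55, 90)

mean(s1); median(s1)                     # measures of center
sort(table(s1), decreasing = TRUE)[1]    # mode: the most frequent value
diff(range(s1))                          # dispersion as max - min
var(s1); sd(s1)                          # spread (sample formulas, n - 1)
mean(s1) / sd(s1)                        # signal-to-noise ratio
mean(s2); sd(s2)                         # same statistics for Series 2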
Basic Statistics - Mean
The mean, often called the average, is a common way to measure the
center of a distribution of data. The mean, denoted as µ or x̄, can be
calculated as
x̄ = (x1 + x2 + ... + xn) / n
where x1, x2, ..., xn represent the n observed values.
Basic Statistics - Variance & Standard Deviation
The mean was introduced as a method to describe the center of a
data set but variability in the data is also important.
We introduce two measures of variability: the variance and the
standard deviation.
Both are very useful in data analysis.
Basic Statistics - Variance
We call the distance of an observation from its mean its deviation. If we
square these deviations and then take an average, the result is equal to
the sample variance.
Variance is the average squared deviation from the mean:
s² = ((x1 − x̄)² + (x2 − x̄)² + ... + (xn − x̄)²) / (n − 1)
• We divide by n-1, rather than dividing by n when computing a sample’s
variance. It has some mathematical reasoning, but the end result is that doing
this makes the statistic slightly more reliable and useful.
• In the calculation of variance, we use the squared deviation to get rid of
negatives so that observations equally distant from the mean are weighed
equally.
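A small R check of the n − 1 formula (the data values are just an example):

x <- c(1, 4, 8, 4, 7, 2, 3, 5, 24)
n <- length(x)
sum((x - mean(x))^2) / (n - 1)   # sample variance, dividing by n - 1
var(x)                           # R's var() uses the same n - 1 formula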
Basic Statistics - Standard Deviation
The standard deviation is the square root of the variance, and has the
same units as the data.
The standard deviation is useful when considering how far the data are
distributed from the mean. It represents the typical deviations from the
mean.
For roughly bell-shaped data, about 68–70% of the data will be within one standard deviation of the
mean and about 95% will be within two standard deviations.
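A quick R sketch checking these proportions on simulated bell-shaped data:

set.seed(1)
x <- rnorm(10000)                        # simulated normally distributed data
mean(abs(x - mean(x)) <= 1 * sd(x))      # roughly 0.68
mean(abs(x - mean(x)) <= 2 * sd(x))      # roughly 0.95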
Basic Statistics - Median
The median is the value that splits the data in half when ordered in
ascending order.
If there are an even number of observations, then the median is
the average of the two values in the middle.
Since the median is the midpoint of the data, 50% of the values are
below it and the other 50% are above.
Hence, it is also the 50th percentile.
Data & Outliers
Another important data preparation task involves the treatment of extremely
small or large values, referred to as outliers.
An outlier is a data point that differs significantly from others.
A common and sometimes challenging task is to identify data outliers. One
method (not the best) is to measure how many standard deviations (SD) a value
lies away from the mean. A common rule of thumb is 3 SD.
Figure: normal distribution of sample data, frequency of occurrence (# students) versus values (grades);
possible outliers fall below Mean – 3*SD or above Mean + 3*SD.
Outlier Detection Method
Consider the two data series:
Series 1 (1,4,6,4,5,2,3,5,24): Mean = 6, SD = 5.7, Mean + 3*SD = 23.1, Mean – 3*SD = -11.1
Series 2 (60,65,45,70,85,70,50,55,90): Mean = 65.5, SD = 15.1, Mean + 3*SD = 110.84, Mean – 3*SD = 20.2
What measures are most affected by an outlier? Consider Mean and
Median
What about range?
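A minimal R sketch of the 3-SD rule and of how the mean and median react to an outlier (note that sd() uses the sample n − 1 formula, so its bounds may differ slightly from the figures on the slide):

s1 <- c(1, 4, 6, 4, 5, 2, 3, 5, 24)
m  <- mean(s1)
s  <- sd(s1)                           # sample standard deviation
s1[s1 > m + 3 * s | s1 < m - 3 * s]    # values flagged by the 3-SD rule
mean(s1); median(s1)                   # the mean is pulled toward the outlier, the median much less
mean(s1[-9]); median(s1[-9])           # the same measures with the value 24 removed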
Box Plot
The box in a box plot represents the middle 50% of the data, and the
thick line in the box is the median.
Q1, Median, Q3, IQR
• The 25th percentile is also called the first quartile, Q1.
• The 50th percentile is also called the median.
• The 75th percentile is also called the third quartile, Q3.
• Between Q1 and Q3 is the middle 50% of the data. The range
these data span is called the interquartile range, or the IQR.
IQR = Q3 - Q1
Whiskers and Outliers
Whiskers of a box plot can extend up to 1.5 x IQR away from the quartiles.
max upper whisker reach = Q3 + 1.5 x IQR
max lower whisker reach = Q1 - 1.5 x IQR
A potential outlier is defined as an observation beyond the maximum
reach of the whiskers. It is an observation that appears extreme relative
to the rest of the data.
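A minimal R sketch of the IQR fences (the twelve values reuse the illustrative Checking and Savings figures shown earlier in the deck):

x   <- c(0, 0, 0, 638, 963, 2827, 739, 1230, 389, 347, 4754, 0)
q1  <- quantile(x, 0.25)
q3  <- quantile(x, 0.75)
iqr <- IQR(x)

upper <- q3 + 1.5 * iqr                 # max upper whisker reach
lower <- q1 - 1.5 * iqr                 # max lower whisker reach
x[x > upper | x < lower]                # potential outliers beyond the whiskers
boxplot(x)                              # R draws the same fences and plots outliers as points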
Should we remove or keep outliers?
What should be done with outliers?
• They should be understood in the context of the data. An outlier for a
year of data may not be an outlier for the month in which it occurred and
vice versa.
• They should be investigated to determine if they are in error. The values
may have simply been entered incorrectly. If a value can be corrected, it
should be.
• They should be investigated to determine why they are so different from
the rest of the data. For example, were extra sales or fewer sales seen
because of a special event like a holiday?
Suggested Studies
Self-Study Exercises
→ What is High Fidelity data and why is it important?
→ What is data governance and why is it important?
→ Study use cases of missing data, rounding, and conversion
→ What is a good way to handle missing data?
→ Consider the two Checking and Savings columns in creditrisk.csv.
Compute the Mean, SD, Max, and Min for each. Any outliers?
Try with both Excel and R. Reconcile the two.
→ Repeat the above calculation by replacing all zero entries in Checking and Savings
with ‘NA’. Try with both Excel and R. Conclusions?
→ Looking at the Divvy Bikes sharing data (https://www.divvybikes.com/data)
can you think of some business logic integrity checks?
BSAD Lab 04
Lab Session 04
Two main tasks
• Data outliers
• Data preparation
Lab Session (In-Class and Take-Home)
TASK 1: Data Outliers (DONE ☐)
• Refer to the last 'creditrisk.csv' file from Sakai.
• Calculate the Mean, STD, Max, and Min for the Age column. Use both Excel and R. Compare results.
• Use a formula to detect outliers. Are there any?

TASK 2: Data Preparation (DONE ☐)
• Download the new original 'creditriskorg.csv' to your desktop.
• Calculate the Mean for Checking in Excel. Explain how the missing values are treated in the Excel calculation.
• Read the file in R. Watch for the header and the empty line.
• Calculate the Mean for Checking in R. Observations?
• Identify the data issues in R to calculate the mean. Hint: $, NA, and comma (1,235 vs 1235)
• Explore ways to resolve them. Consider both Excel and/or R.

R Commands Cheat-Sheet:
newdata = read.csv("creditriskorg.csv", skip=1, header=TRUE, sep=",")
checking = newdata$Checking
checking = sub(",", "", checking)     # remove the thousands comma
checking = sub("\\$", "", checking)   # remove the $ sign
checking = as.numeric(checking)       # convert to numeric
mean(checking, na.rm=TRUE)            # mean with NA values removed

TASK 3: Divvy Data Business Logic and Data Modeling (DONE ☐)
• Go to https://www.divvybikes.com/data
• Download the latest 2016 Q3/Q4 data (zip file).
• Check the README file. Open the file Divvy_Trips_2016_Q4 in Excel.
• Per the Data License Agreement, who owns the data, and can the data be correlated with other sources?
• Note the size of the file and the number of columns and rows.
• Identify the unique entities and fields.
• Define a relational business logic for the column field 'tripduration'.
• Using www.erdplus.com, draw a star schema using the following three tables:
  - A Fact table for Trip
  - A Dimension table for Station
  - A Dimension table for User