Uploaded by Qingyan Cao

DataSummarization and Tidying 2019 Spring (1)

Applied Analytics:
Frameworks and Methods 1
Instructor: Guilherme de Oliveira
Email: gd2271@columbia.edu
Data Summarization and Tidying
Week 3
Agenda
• Descriptive Measures
• Visual Summary
• Transformation
• Restructuring Data
• Missing data
Types of Variables
• Nature of the variable dictates the type of analysis.
• Metric or numerical: Meaningful arithmetic can be performed
  • Discrete (e.g., number of children)
  • Continuous (e.g., income)
• Non-metric or categorical (one that is not numerical)
  • Nominal: Unordered categories (e.g., gender)
  • Ordinal: Ordered categories (e.g., rank)
  • Dummy: Categorical variables with more than two levels are often dummy coded to simplify analysis. A dummy variable is binary. (One-hot encoding: most algorithms need categorical variables to be converted to numeric form.)
• Date or Time

Variable Type    Class in R
Discrete         integer
Continuous       numeric
Nominal          factor (unordered)
Ordinal          factor (ordered)
Date, Time       Date, POSIXct, POSIXt
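The mapping from variable type to R class can be checked directly. The sketch below is only an illustration: all vectors and values are made up for this example and do not come from the course data.

```r
# Hypothetical example values, one vector per variable type.
children <- c(0L, 2L, 1L)                        # discrete   -> integer
income   <- c(52000.50, 61000, 48250.75)         # continuous -> numeric
gender   <- factor(c("F", "M", "F"))             # nominal    -> unordered factor
rank     <- factor(c("low", "high", "medium"),
                   levels = c("low", "medium", "high"),
                   ordered = TRUE)               # ordinal    -> ordered factor
signup   <- as.Date(c("2019-01-15", "2019-02-01", "2019-02-20"))  # date
stamp    <- as.POSIXct("2019-02-01 09:30:00", tz = "UTC")         # date-time

# Inspect the class R assigns to each vector.
class(children)
class(income)
class(gender)
class(rank)
class(signup)
class(stamp)

# Ordered factors support comparisons; unordered factors do not.
rank[1] < rank[2]
```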
Types of Data
• Cross-sectional: Data on a cross-section of the population at a distinct point in time.
  • Question to answer: Will someone churn?
• Time-series or longitudinal: Data gathered over time.
  • Question to answer: What will be our cash position next month?
Descriptive Measures
• Eyeballing even a moderately sized dataset is unlikely to yield useful
insights
• Numerical summaries can yield insights independent of the actual size of
the data
Descriptive Measures for Numeric Variables
• Measures of central tendency
• Mean: Average of all values
• Median: Middle observation for sorted data
• Mode: Most frequent value
• Measures of dispersion
  • Range: Difference between maximum and minimum
  • pth percentile: Value such that p% of values are below it
  • Interquartile range: Difference between the values at the 75th and 25th percentiles
  • Variance: Average of squared deviations from the mean
  • Standard deviation: Square root of the variance
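A minimal sketch of these measures in base R, using a made-up vector x (an assumption for illustration). Two caveats: base R has no built-in function for the statistical mode, so the most frequent value is taken from a frequency table, and var() and sd() compute the sample versions, which divide by n - 1 rather than n.

```r
x <- c(2, 4, 4, 5, 7, 9, 11)   # hypothetical sample

# Measures of central tendency
x_mean   <- mean(x)
x_median <- median(x)
# No built-in mode for data values: pick the most frequent value from a table.
x_mode   <- as.numeric(names(which.max(table(x))))

# Measures of dispersion
x_range <- diff(range(x))               # maximum minus minimum
x_q     <- quantile(x, c(0.25, 0.75))   # 25th and 75th percentiles
x_iqr   <- IQR(x)                       # 75th minus 25th percentile
x_var   <- var(x)                       # sample variance (divides by n - 1)
x_sd    <- sd(x)                        # square root of the sample variance
```

summary(x) prints the minimum, quartiles, median, mean, and maximum in one call.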
Descriptive Measures for Categorical Variables
• Generally involve a cross-tabulation which compares categories by
  • Count (e.g., count of males vs. females), or
  • Descriptive measure of a numerical variable (e.g., median income of males vs. females)
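Both kinds of comparison can be sketched with base R. The data frame below is hypothetical, with a categorical column gender and a numeric column income chosen only for illustration.

```r
# Hypothetical data: one categorical and one numeric variable.
df <- data.frame(
  gender = c("F", "M", "F", "M", "F"),
  income = c(50, 40, 70, 60, 60)
)

# Count of observations in each category.
counts <- table(df$gender)

# Descriptive measure of a numeric variable within each category.
med_income <- tapply(df$income, df$gender, median)

counts
med_income
```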
Descriptive Measures: Functions in R
See DescriptiveMeasures.html
Visual Summary
Benefits: Visual Summary
• Graphs and charts are an effective way to inspect the data quickly
• See the big picture
• Spot trends
• Compare values easily and rapidly
• See and compare patterns
• Sometimes, visualizations uncover relationships that are not
captured by descriptive measures.
Anscombe’s Quartet

             I              II             III             IV
Obs       x      y       x      y       x      y       x      y
1        10   8.04      10   9.14      10   7.46       8   6.58
2         8   6.95       8   8.14       8   6.77       8   5.76
3        13   7.58      13   8.74      13  12.74       8   7.71
4         9   8.81       9   8.77       9   7.11       8   8.84
5        11   8.33      11   9.26      11   7.81       8   8.47
6        14   9.96      14   8.10      14   8.84       8   7.04
7         6   7.24       6   6.13       6   6.08       8   5.25
8         4   4.26       4   3.10       4   5.39      19  12.50
9        12  10.84      12   9.13      12   8.15       8   5.56
10        7   4.82       7   7.26       7   6.42       8   7.91
11        5   5.68       5   4.74       5   5.73       8   6.89

               I                 II                III                IV
Measure      x       y        x       y        x       y        x       y
Mean       9.000   7.501    9.000   7.501    9.000   7.500    9.000   7.501
Median     9.000   7.580    9.000   8.140    9.000   7.110    8.000   7.040
Min        4.000   4.260    4.000   3.100    4.000   5.390    8.000   5.250
Max       14.000  10.840   14.000   9.260   14.000  12.740   19.000  12.500
Mode                                                          8.000
Variance  11.000   4.127   11.000   4.128   11.000   4.123   11.000   4.123
Std Dev    3.317   2.032    3.317   2.032    3.317   2.030    3.317   2.031
Correl         0.816            0.816            0.816            0.817
R2              67%              67%              67%              67%

[Figure: scatterplots of data sets I to IV, one panel each, with both axes running from 0 to 20]
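The near-identical summaries can be reproduced with the anscombe data frame that ships with R (columns x1 to x4 and y1 to y4). This is a sketch for checking the numbers, not part of the original slides; the name summary_tbl is chosen here for illustration.

```r
# anscombe is a built-in dataset in R's datasets package.
sets <- 1:4

summary_tbl <- data.frame(
  set    = sets,
  mean_x = sapply(sets, function(i) mean(anscombe[[paste0("x", i)]])),
  mean_y = sapply(sets, function(i) mean(anscombe[[paste0("y", i)]])),
  var_x  = sapply(sets, function(i) var(anscombe[[paste0("x", i)]])),
  var_y  = sapply(sets, function(i) var(anscombe[[paste0("y", i)]])),
  correl = sapply(sets, function(i) cor(anscombe[[paste0("x", i)]],
                                        anscombe[[paste0("y", i)]]))
)

# Round for display; the four rows are almost indistinguishable.
round(summary_tbl, 3)
```

Plotting each pair, for example with plot(anscombe$x1, anscombe$y1), shows how different the four data sets actually are.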
Visualizations
• There are a number of useful graphs and charts to aid in initial exploration
of data.
• This module will review only a few.
• Others will be reviewed alongside the topics they relate to.
• But, before that, let’s review the visualization systems in R
Visualizing Data in R
• Visualization systems
  • Base Graphics
  • Lattice
  • ggplot2
  • ggvis, googleVis, rbokeh, plotly, ...
ggplot2
• Developed by Hadley Wickham
• Based on the notion of a Grammar of Graphics described in a book by
Wilkinson et al. (2005)
• Here are some resources
• Documentation
• Cheatsheet
• Free book: R Cookbook of Graphics
Grammar of ggplot2
• Main components of ggplot2
  • Data
  • Aesthetic mapping (aes)
    • Describes how variables are mapped onto graphical attributes
    • Visual attributes of data include x-y axes, color, fill, shape, and alpha
  • Geometric objects (geom)
    • Determine how values are rendered graphically: as bars (geom_bar), points in a scatterplot (geom_point), lines (geom_line), etc.
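The three components can be seen in a single call. The sketch below assumes the ggplot2 package is installed and uses the built-in mtcars data; the choice of variables (wt, mpg, cyl) is only for illustration.

```r
library(ggplot2)  # assumes the ggplot2 package is installed

p <- ggplot(data = mtcars,                        # data
            aes(x = wt, y = mpg,                  # aesthetic mapping: position
                color = factor(cyl))) +           # aesthetic mapping: color
  geom_point()                                    # geometric object: points

# Printing the object draws the chart, e.g. print(p) in a script
# or simply p at the console.
```

Swapping geom_point() for another geom, such as geom_line(), changes how the same mapping is rendered without touching the data or aes() parts.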
Histogram
• Frequency distribution of a binned
numeric variable
• Useful for examining the shape of
a distribution
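Binning is the step that turns a numeric variable into a frequency distribution. A minimal sketch with base R's hist(), using mtcars$mpg and bin edges chosen only for this example; plot = FALSE returns the bins and counts without drawing.

```r
# Bin miles-per-gallon into 5-unit intervals (bin edges are an assumption).
h <- hist(mtcars$mpg, breaks = seq(10, 35, by = 5), plot = FALSE)

h$breaks   # bin edges
h$counts   # number of observations falling in each bin

# Drop plot = FALSE to draw the histogram:
# hist(mtcars$mpg, breaks = seq(10, 35, by = 5))
```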
Frequency Polygon
• Outline of a histogram.
• Overcomes the problem of
overlaid histograms
Density curve
• Smoothed view of the distribution, standardized so
that the area under the curve equals 1
Boxplot
• Another useful chart for
showing distribution of a
numeric variable
• Highlights dispersion in the
data as well as outliers
• Particularly handy for
comparing distributions
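The numbers behind a boxplot can be inspected without drawing it. The sketch below uses a made-up vector with one extreme value, plus mtcars for a grouped comparison; both choices are assumptions for illustration.

```r
x <- c(2, 4, 4, 5, 7, 9, 11, 40)   # hypothetical data with one extreme value

b <- boxplot.stats(x)
b$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
b$out     # points beyond 1.5 times the box length from the hinges

# Comparing distributions: one box per level of a grouping variable.
by_cyl <- boxplot(mpg ~ cyl, data = mtcars, plot = FALSE)
by_cyl$names   # group labels
by_cyl$stats   # one column of five box statistics per group
```

Removing plot = FALSE from the boxplot() call draws the side-by-side boxes.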
QQ Plot
• Graphical method to evaluate a
distribution.
• QQ Plots are a useful way to
check data against a distribution
the data is thought to come from.
• Most commonly used to see if
data comes from a normal
distribution by plotting quantiles of
data against quantiles expected
from a normal distribution.
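A rough sketch of the idea with base R's qqnorm(), using simulated data (the seed, sample size, and variable names are assumptions). If the sample is close to normal, the points fall near a straight line, so the correlation between theoretical and sample quantiles is close to 1.

```r
set.seed(1)
normal_data <- rnorm(200)         # simulated normal sample
skewed_data <- exp(normal_data)   # right-skewed (log-normal) sample

# plot.it = FALSE returns the quantile pairs without drawing:
# $x holds the theoretical normal quantiles, $y the sample quantiles.
qq_normal <- qqnorm(normal_data, plot.it = FALSE)
qq_skewed <- qqnorm(skewed_data, plot.it = FALSE)

cor_normal <- cor(qq_normal$x, qq_normal$y)
cor_skewed <- cor(qq_skewed$x, qq_skewed$y)

# To draw the plot with a reference line:
# qqnorm(normal_data); qqline(normal_data)
```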
Bar Charts
• To compare numerical summaries
across levels of a categorical
variable
Scatterplots
• Used for exploring relationship
between a pair of numeric
variables.
Illustration
See VisualSummaries.html
Data Wrangling
• Data generally doesn’t come in an analysis-ready format. This may be due to the
  • Format or nature of variables
  • Structure of the data
  • Missing data
• Data Wrangling is the process of preparing the data for analysis. It goes by many names including
  • Munging
  • Cleaning
  • Carpentry
  • Manipulation
  • Transformation
  • Tidying
• It is one of the less celebrated, yet
critically important tasks of a data
scientist
• Data wrangling will often consume
80% of a data scientist's time.
• We now examine three data
wrangling tasks
• Data Transformation
• Tidying
• Handling missing data
Data Transformation
Transformation
• Common transformations include
• Dummy coding categorical variables
• Combining levels of categorical variables (e.g., pooling “single”, “separated”,
“divorced” into “single”)
• Mean centering a variable
• Combining variables (e.g., mean, sum, fraction)
• Fixing a skewed distribution (e.g., Box-Cox transformations)
• Data reduction techniques such as principal component analysis.
• Binning a numeric variable
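Several of these transformations can be sketched in base R. The data frame, column names, and bin edges below are hypothetical, and the log transform stands in for the Box-Cox family (it is the Box-Cox case with lambda = 0).

```r
# Hypothetical data for illustration.
df <- data.frame(
  status = factor(c("single", "married", "separated", "divorced", "married")),
  income = c(30, 50, 40, 60, 70)
)

# Dummy coding: one 0/1 indicator column per level.
dummies <- model.matrix(~ status - 1, data = df)

# Combining levels: pool "separated" and "divorced" into "single".
df$status2 <- df$status
levels(df$status2)[levels(df$status2) %in% c("separated", "divorced")] <- "single"

# Mean centering: subtract the mean so the new variable averages zero.
df$income_c <- df$income - mean(df$income)

# Reducing right skew: log transform (Box-Cox with lambda = 0).
df$income_log <- log(df$income)

# Binning a numeric variable into labeled intervals (edges are an assumption).
df$income_bin <- cut(df$income,
                     breaks = c(0, 40, 60, Inf),
                     labels = c("low", "mid", "high"))
```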
Illustration
See DataTransformation.html
Restructure Data
• Sometimes, structure of the data may get in the way of analyses
• Hadley Wickham recommends the use of a tidy data structure which is
one in which
• Each variable has its own column
• Each observation has its own row
• Each value has its own cell
• Choice of structure is often driven by the functions to be used and
analytical technique to be conducted.
Tidy Data
[Figure: the same data shown in wide format and in tall format]
Tidy Data in R
• gather()
  • Convert wide data to tall format
  • Most functions and software expect data to be in tall format
• spread()
  • Convert tall data to wide format
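A small round trip between the two formats, assuming the tidyr package is installed. The data frame and the column names quarter and sales are made up for this sketch.

```r
library(tidyr)  # assumes the tidyr package is installed

# Hypothetical wide data: one row per id, one column per quarter.
wide <- data.frame(id = 1:2, q1 = c(10, 20), q2 = c(30, 40))

# Wide to tall: column names go into "quarter", cell values into "sales".
tall <- gather(wide, key = "quarter", value = "sales", q1, q2)

# Tall to wide: spread the "quarter" values back out into columns.
wide_again <- spread(tall, key = quarter, value = sales)

tall
wide_again
```

In tidyr 1.0.0 and later, pivot_longer() and pivot_wider() are the recommended successors; gather() and spread() still work.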
Illustration
See DataStructure.html
Missing Data
• As the name suggests, a cell with no data indicates missing data
• In addition, there are times when we may mark data as missing.
• Data that is out of range. E.g., data for a 1-7 questionnaire item containing 11
• Data that is known to be incorrect. E.g., a negative value for age.
• Outlier. An extreme observation may have an important story to tell so only after
careful deliberation should an outlier be marked as missing.
• Certain analytical techniques are sensitive to outliers, and left as such, they may lead to
biased results.
Managing Missing Values
• Ignore NA values
• Itemwise or listwise deletion: Filter out entire rows with NA values
• Many R functions have an argument that drops NA values before computing. E.g.,
sum(mtcars$mpg, na.rm = TRUE)
• Many analytical techniques such as tree-based techniques can specifically account
for missing data
• Impute missing values
• Replace missing value with a non-biasing value like mean or median
• Predict missing value using data on other variables (e.g., kNN,
Missing Values in R
• R indicates missing values with the symbol NA
• is.na()
• To check for missing values
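A minimal sketch of detecting and handling missing values in base R. The vector and data frame are hypothetical, and median imputation is shown only as one simple option.

```r
x <- c(5, NA, 7, 9, NA)   # hypothetical vector with missing values

# Detect missing values.
is.na(x)
n_missing <- sum(is.na(x))

# Ignore NA values inside a summary function.
x_mean <- mean(x, na.rm = TRUE)

# Listwise deletion: drop every row that contains an NA.
df <- data.frame(a = c(1, NA, 3), b = c("u", "v", NA))
complete <- na.omit(df)

# Simple imputation: replace NA with the median of the observed values.
x_imputed <- ifelse(is.na(x), median(x, na.rm = TRUE), x)
```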
How do various functions deal with NA?

Expression       What do you think it does?    What does it really do? Try it in R
5 + NA
mean(c(5,NA))
NA < 3
NA == 3
NA == NA
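To check your guesses, the expressions can be evaluated together. The list name results is chosen here for illustration; the short version is that NA means "unknown", so any arithmetic or comparison involving it is also unknown.

```r
# Evaluate each expression and keep the result under its own label.
results <- list(
  "5 + NA"        = 5 + NA,
  "mean(c(5,NA))" = mean(c(5, NA)),
  "NA < 3"        = NA < 3,
  "NA == 3"       = NA == 3,
  "NA == NA"      = NA == NA
)

# Every result is NA, including NA == NA.
all_na <- all(sapply(results, is.na))

# This is why is.na(x), not x == NA, is the way to test for missing values.
```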
Illustration
See MissingValues.html
Conclusion
• In this module we learnt about
  • descriptive measures as a way to explore the data
  • graphs and charts to visualize data
  • the need to transform data
  • how to restructure data
  • ways to handle missing data