Applied Analytics: Frameworks and Methods 1
Instructor: Guilherme de Oliveira
Email: gd2271@columbia.edu

Data Summarization and Tidying
Week 3

Agenda
• Descriptive Measures
• Visual Summary
• Transformation
• Restructuring Data
• Missing data

Types of Variables
• The nature of a variable dictates the type of analysis.
• Metric or numerical: Meaningful arithmetic can be performed
  • Discrete (e.g., number of children)
  • Continuous (e.g., income)
• Non-metric or categorical (one that is not numerical)
  • Nominal: Unordered categories (e.g., gender)
  • Ordinal: Ordered categories (e.g., rank)
  • Dummy: Categorical variables with more than two levels are often dummy coded to simplify analysis. A dummy variable is binary. (One-hot encoding: most algorithms need categorical variables converted to numeric form.)
• Date or Time

Variable Type   Class in R
Discrete        integer
Continuous      numeric
Nominal         factor (unordered)
Ordinal         factor (ordered)
Date, Time      Date, POSIXct, POSIXt

Types of Data
• Cross-sectional: Data on a cross-section of the population at a distinct point in time.
  • Question to answer: Will someone churn?
• Time-series or longitudinal: Data gathered over time.
  • Question to answer: What will be our cash position next month?
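The variable types and R classes listed above can be sketched in base R as follows. This is a minimal illustration only: every value and variable name (children, income, region, and so on) is made up for the example and does not come from the course materials.

```r
# Metric variables (all values below are invented for illustration)
children <- c(0L, 2L, 1L)             # discrete   -> integer
income   <- c(52000.50, 61000, 48250) # continuous -> numeric

# Categorical variables
gender <- factor(c("female", "male", "female"))        # nominal -> unordered factor
rank   <- factor(c("low", "high", "medium"),
                 levels = c("low", "medium", "high"),
                 ordered = TRUE)                        # ordinal -> ordered factor

# Date and time
signup_date <- as.Date("2024-01-15")
signup_time <- as.POSIXct("2024-01-15 09:30:00", tz = "UTC")

class(children)     # "integer"
class(income)       # "numeric"
class(rank)         # "ordered" "factor"
class(signup_time)  # "POSIXct" "POSIXt"

# Dummy (one-hot) coding of a three-level nominal variable:
# one binary column per level, exactly one 1 in each row
region  <- factor(c("east", "west", "north"))
dummies <- model.matrix(~ region - 1)
dummies
```

Dropping the intercept with `- 1` keeps one column per level; many modelling functions instead keep the intercept and omit one reference level.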
Descriptive Measures
• Eyeballing even a moderately sized dataset is unlikely to yield useful insights
• Numerical summaries can yield insights regardless of the size of the data

Descriptive Measures for Numeric Variables
• Measures of central tendency
  • Mean: Average of all values
  • Median: Middle observation for sorted data
  • Mode: Most frequent value
• Measures of dispersion
  • Range: Difference between maximum and minimum
  • pth percentile: Value such that p% of values are below it
  • Interquartile range: Difference between the values at the 75th and 25th percentiles
  • Variance: Average of squared deviations from the mean
  • Standard deviation: Square root of the variance

Descriptive Measures for Categorical Variables
• Generally involve a cross-tabulation that compares categories by
  • Count (e.g., count of males vs. females), or
  • Descriptive measure of a numerical variable (e.g., median income of males vs. females)

Descriptive Measures: Functions in R
See DescriptiveMeasures.html

Visual Summary

Benefits: Visual Summary
• Graphs and charts are an effective way to inspect the data quickly
• Help to see the big picture
• Spot trends
• Compare values easily and rapidly
• See and compare patterns
• And sometimes, visualizations may uncover relationships that are not captured by descriptive measures.
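The descriptive measures listed above map onto base R functions roughly as sketched here. The example uses the built-in anscombe and mtcars data sets as stand-ins; stat_mode is a hypothetical helper written for this sketch, since base R has no built-in function for the statistical mode. Note that var() and sd() divide by n - 1 rather than n.

```r
# Built-in data set with columns x1..x4 and y1..y4
data(anscombe)
y1 <- anscombe$y1

# Central tendency
mean(y1)
median(y1)

# Hypothetical helper: most frequent value of a numeric vector
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[which.max(counts)])
}
stat_mode(anscombe$x4)

# Dispersion
diff(range(y1))      # range: maximum minus minimum
quantile(y1, 0.25)   # 25th percentile
IQR(y1)              # interquartile range
var(y1)              # sample variance (n - 1 denominator)
sd(y1)               # standard deviation

# The same summaries for each of the four x-y pairs
summary_table <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), var_x = var(x),
    mean_y = mean(y), var_y = var(y),
    cor_xy = cor(x, y))
})
colnames(summary_table) <- c("I", "II", "III", "IV")
round(summary_table, 3)

# Categorical variables: counts, and a numeric summary by category
table(mtcars$cyl, mtcars$am)            # cross-tabulation of counts
tapply(mtcars$mpg, mtcars$am, median)   # median mpg per transmission type
```

The four columns of summary_table come out nearly identical, which is why numerical summaries alone can hide differences that a plot would reveal.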
Anscombe’s Quartet

Obs    I x   I y     II x  II y    III x  III y   IV x  IV y
1      10    8.04    10    9.14    10     7.46    8     6.58
2      8     6.95    8     8.14    8      6.77    8     5.76
3      13    7.58    13    8.74    13     12.74   8     7.71
4      9     8.81    9     8.77    9      7.11    8     8.84
5      11    8.33    11    9.26    11     7.81    8     8.47
6      14    9.96    14    8.1     14     8.84    8     7.04
7      6     7.24    6     6.13    6      6.08    8     5.25
8      4     4.26    4     3.1     4      5.39    19    12.5
9      12    10.84   12    9.13    12     8.15    8     5.56
10     7     4.82    7     7.26    7      6.42    8     7.91
11     5     5.68    5     4.74    5      5.73    8     6.89

Measure    I x     I y     II x    II y    III x   III y   IV x    IV y
Mean       9.000   7.501   9.000   7.501   9.000   7.500   9.000   7.501
Median     9.000   7.580   9.000   8.140   9.000   7.110   8.000   7.040
Min        4.000   4.260   4.000   3.100   4.000   5.390   8.000   5.250
Max        14.000  10.840  14.000  9.260   14.000  12.740  19.000  12.500
Mode                                                       8.000
Variance   11.000  4.127   11.000  4.128   11.000  4.123   11.000  4.123
Std Dev    3.317   2.032   3.317   2.032   3.317   2.030   3.317   2.031

Measure    I       II      III     IV
Correl     0.816   0.816   0.816   0.817
R2         67%     67%     67%     67%

[Figure: four scatterplots, one per data set I to IV; both axes run from 0 to 20]

Visualizations
• There are a number of useful graphs and charts to aid in the initial exploration of data.
• This module will review only a few.
• Others will be reviewed alongside the topics they relate to.
• But, before that, let’s review the visualization systems in R

Visualizing Data in R
• Base Graphics
• Lattice
• ggplot2
• ggvis, googleVis, rbokeh, plotly, ...

ggplot2
• Developed by Hadley Wickham
• Based on the notion of a Grammar of Graphics described in a book by Wilkinson et al. (2005)
• Here are some resources
  • Documentation
  • Cheatsheet
  • Free book: R Cookbook of Graphics

Grammar of ggplot2
• Main components of ggplot2
  • Data
  • Aesthetic mapping (aes)
    • Describes how variables are mapped onto graphical attributes
    • Visual attributes of the data include the x and y axes, color, fill, shape, and alpha.
  • Geometric objects (geom)
    • Determine how values are rendered graphically: as bars (geom_bar), points (geom_point), lines (geom_line), etc.
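The three components above (data, aesthetic mapping, geom) combine as in the following minimal sketch. It assumes the ggplot2 package is installed and uses the built-in mtcars data set; the choice of columns is arbitrary and only for illustration.

```r
library(ggplot2)  # assumes ggplot2 is installed

# Data: mtcars
# Aesthetic mapping: weight on x, mpg on y, cylinder count as color
# Geometric object: points, i.e. a scatterplot
p_scatter <- ggplot(data = mtcars,
                    mapping = aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point()

# Same data, a different mapping and geom: a histogram of mpg
p_hist <- ggplot(data = mtcars, mapping = aes(x = mpg)) +
  geom_histogram(bins = 10)

# A plot object is only drawn when printed
print(p_scatter)
print(p_hist)
```

Because the plot is built by adding layers with `+`, changing the chart type usually means swapping the geom while leaving the data and mapping untouched.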
Histogram
• Frequency distribution of a binned numeric variable
• Useful for examining the shape of a distribution

Frequency Polygon
• Outline of a histogram.
• Overcomes the problem of overlaid histograms obscuring one another

Density curve
• Smoothed outline of the distribution, with the area under the curve standardized to 1

Boxplot
• Another useful chart for showing the distribution of a numeric variable
• Highlights dispersion in the data as well as outliers
• Particularly handy for comparing distributions

QQ Plot
• Graphical method to evaluate a distribution.
• QQ plots are a useful way to check data against the distribution the data is thought to come from.
• Most commonly used to see whether data comes from a normal distribution, by plotting quantiles of the data against the quantiles expected from a normal distribution.

Bar Charts
• To compare numerical summaries across levels of a categorical variable

Scatterplots
• Used for exploring the relationship between a pair of numeric variables.

Illustration
See VisualSummaries.html

Data Wrangling
• Data generally doesn’t come in an analysis-ready format. This may be due to the
  • Format or nature of variables
  • Structure of the data
  • Missing data
• Data wrangling is the process of preparing the data for analysis. It goes by many names including
  • Munging
  • Cleaning
  • Carpentry
  • Manipulation
  • Transformation
  • Tidying
• It is one of the less celebrated, yet critically important, tasks of a data scientist
• Data wrangling will often consume about 80% of a data scientist's time.
• We now examine three data wrangling tasks
  • Data Transformation
  • Tidying
  • Handling missing data

Data Transformation
• Common transformations include
  • Dummy coding categorical variables
  • Combining levels of categorical variables (e.g., pooling “single”, “separated”, “divorced” into “single”)
  • Mean centering a variable
  • Combining variables (e.g., mean, sum, fraction)
  • Fixing a skewed distribution (e.g., Box-Cox transformations)
  • Data reduction techniques such as principal component analysis
  • Binning a numeric variable

Illustration
See DataTransformation.html

Restructure Data
• Sometimes, the structure of the data may get in the way of analyses
• Hadley Wickham recommends the use of a tidy data structure, which is one in which
  • Each variable has its own column
  • Each observation has its own row
  • Each value has its own cell
• Choice of structure is often driven by the functions to be used and the analytical technique to be conducted.

Tidy Data
[Figure: Wide Data vs. Tall Data]

Tidy Data in R
• gather()
  • Convert wide data to tall format
  • Most functions and software expect data to be in tall format
• spread()
  • Convert tall data to wide format

Illustration
See DataStructure.html

Missing Data
• As the name suggests, a cell with no data indicates missing data
• In addition, there are times when we may mark data as missing.
  • Data that is out of range. E.g., data for a 1-7 questionnaire item containing 11
  • Data that is known to be incorrect. E.g., a negative value for age.
  • Outlier. An extreme observation may have an important story to tell, so an outlier should be marked as missing only after careful deliberation.
  • Certain analytical techniques are sensitive to outliers, and if outliers are left as they are, they may lead to biased results.

Managing Missing Values
• Ignore NA values
  • Itemwise or listwise deletion: Filter out entire rows with NA values
  • Many R functions contain an argument that automatically drops NA.
    E.g., sum(mtcars$mpg, na.rm = TRUE)
  • Many analytical techniques, such as tree-based techniques, can specifically account for missing data
• Impute missing values
  • Replace the missing value with a non-biasing value like the mean or median
  • Predict the missing value using data on other variables (e.g., kNN, …)

Missing Values in R
• R indicates missing values with the symbol NA
• is.na()
  • To check for missing values

How do various functions deal with NA

Expression        What do you think it does    What does it really do? Try it in R
5 + NA
mean(c(5,NA))
NA < 3
NA == 3
NA == NA

Illustration
See MissingValues.html

Conclusion
• In this module we learnt about
  • descriptive measures as a way to explore the data
  • graphs and charts to visualize data
  • the need to transform data
  • how to restructure data
  • ways to handle missing data
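As a recap of the missing-data handling covered in this module, here is a minimal base R sketch. The vector x and the data frame df are invented for illustration; median imputation is shown only as one of the simple options named in the slides, not as a recommended default.

```r
# A small numeric vector with one missing value (values are made up)
x <- c(5, NA, 3)

# Detect missing values
is.na(x)

# Most arithmetic and comparisons involving NA return NA,
# because the missing value could be anything
5 + NA
mean(c(5, NA))
NA < 3
NA == 3
NA == NA   # also NA, so use is.na() to test for missingness

# Ignore NA values through a function argument
sum(x, na.rm = TRUE)
mean(x, na.rm = TRUE)

# Listwise deletion: drop every row that contains an NA
df <- data.frame(a = c(1, NA, 3), b = c("u", "v", NA))
complete_rows <- na.omit(df)
complete_rows

# Simple imputation: replace NA with the median of the observed values
x_imputed <- ifelse(is.na(x), median(x, na.rm = TRUE), x)
x_imputed
```

Listwise deletion here keeps only one of the three rows, which illustrates how quickly it can shrink a data set when missing values are spread across columns.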