Analyzing Surveys - LISA

Analyzing Surveys
Marcos Carzolio
Associate Collaborator for LISA
PhD Student
Department of Statistics, VT
Laboratory for Interdisciplinary Statistical Analysis
Outline
• Data Cleaning and Preprocessing
  • Outlier Detection
  • Missing Value Imputation
• Visualizing and Understanding Data
  • Boxplots, Histograms, and Scatterplots
  • Correlation Matrices
• Analyzing Data
  • Contingency Tables
  • Analysis of Variance (ANOVA)
  • Regression
Laboratory for Interdisciplinary Statistical Analysis
LISA helps VT researchers benefit from the use of Statistics
Experimental Design • Data Analysis • Interpreting Results
Grant Proposals • Software (R, SAS, JMP, SPSS...)
Our goal is to improve the quality of research and the use of statistics at Virginia Tech.
www.lisa.stat.vt.edu
How can LISA help?
• Formulate the research question.
• Screen data for integrity and unusual observations.
• Implement graphical techniques to showcase the data: what is the story?
• Develop and implement an analysis plan to address the research question.
• Help interpret results.
• Communicate! Help with writing the report or giving the talk.
• Identify future research directions.
Designing Experiments • Analyzing Data • Interpreting Results
Grant Proposals • Using Software (R, SAS, JMP, Minitab...)
Collaboration
• From our website, request a meeting for personalized statistical advice
• Great advice right now: meet with LISA before collecting your data
Walk-In Consulting
• Monday-Friday 1-3 pm in 401 Hutcheson
• Also Tuesdays 1-3 pm in ICTAS Café X and Thursdays 1-3 pm in GLC Video Conf. Room
• For questions requiring <30 mins
Short Courses
• Designed to help graduate students apply statistics in their research
All services are FREE for VT researchers.
Some Useful Resources
• R Statistical Computing Software
  • Can be downloaded for free from: http://www.r-project.org/
  • R Studio, a free Integrated Development Environment: http://rstudio.org/
• For a more interactive and user-friendly experience, try JMP
  • Downloadable from the Virginia Tech software library: http://www2.ita.vt.edu/software/department/products/sas/jmp/index.html
• Amelia II: A Program for Missing Data
  • Visit: http://gking.harvard.edu/amelia/
Types of Survey Data
Data Type | Description | Examples | Statistics
Nominal | Data with no intrinsic relative meaning behind labels | Strawberry, Banana, Hispanic | Mode
Ordinal | Data with an ordered structure | Small, Extra Large, Likert Scale* | Median and Percentiles
Interval (continuous or discrete) | Data with meaningful difference relations | Degrees in Celsius, Birthdates, GPS Coordinates | Mean, Standard Deviation, Correlation
Ratio (continuous or discrete) | Data with scale relations | Weight, Income, Length | Mean, Standard Deviation, Correlation
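As a rough illustration of the table, the sketch below picks a summary statistic to match each data type. It uses Python's standard library rather than the R or JMP tools listed earlier, and every response value and variable name in it is hypothetical.

```python
import statistics

# Hypothetical survey responses, one list per measurement level.
flavors = ["Strawberry", "Banana", "Banana", "Strawberry", "Banana"]  # nominal
sizes = ["Small", "Large", "Medium", "Small", "Extra Large"]           # ordinal
weights_kg = [61.0, 72.5, 58.2, 80.1, 69.4]                            # ratio

# Nominal: labels can only be counted, so report the mode.
flavor_mode = statistics.mode(flavors)

# Ordinal: map labels to their rank, then take the middle rank.
size_order = ["Small", "Medium", "Large", "Extra Large"]
ranks = sorted(size_order.index(s) for s in sizes)
median_size = size_order[ranks[len(ranks) // 2]]  # valid because n is odd

# Interval / ratio: differences are meaningful, so mean and SD apply.
weight_mean = statistics.mean(weights_kg)
weight_sd = statistics.stdev(weights_kg)  # sample standard deviation

print(flavor_mode, median_size, round(weight_mean, 2), round(weight_sd, 2))
```

Averaging the ordinal ranks would treat the gaps between sizes as equal, which the ordinal level does not justify; that is why the sketch stops at the median.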
Outlier Detection and Handling
Outliers are data points that deviate far from the main body of data so as to arouse suspicion about their origins
• Visualize your data
  • Boxplots, histograms, and scatterplots
• Extremeness in observations is not in itself cause for data removal
• Only remove outliers that are verifiable errors
• R Package ‘outliers’
[Figure: histogram with one point labelled "Outlier"; x-axis: Observations, y-axis: Frequency]
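One common way to screen for suspicious points is the boxplot (Tukey) rule, sketched below in standard-library Python. The slides do not prescribe a specific rule, so the 1.5 × IQR fences, the function name, and the data are all assumptions. The function only flags values for follow-up; it does not remove them.

```python
import statistics

def flag_outliers(values, k=1.5):
    """Return values outside the Tukey fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical observations with one extreme value.
observations = [-1.2, 0.3, 0.8, -0.5, 1.1, 0.0, -0.9, 0.6, 10.0]
suspects = flag_outliers(observations)
print(suspects)  # candidates to verify against the raw survey forms
```

A flagged value is only a prompt to check the source record. In line with the advice above, it would be dropped only if it turns out to be a verifiable error.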
Missing Value Imputation
• Imputation is the process of filling in the missing values of a dataset
• Before considering imputation, try following up with respondents to obtain their true answers
• Imputation can be very tricky (come to LISA for help)
• If only one or two values are missing in a very large dataset, use the mean of the available values as a “best guess”
Honaker, James et al., AMELIA II: A Program for Missing Data
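The "best guess" approach for a handful of missing values can be sketched as follows, in standard-library Python. The function name and the data are hypothetical, and missing answers are assumed to be recorded as `None`.

```python
import statistics

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to average")
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical hours-of-exercise answers with one missing response.
hours = [3.0, 5.0, None, 4.0, 8.0]
completed = impute_mean(hours)
print(completed)
```

Mean imputation understates the variability of the data, so it is only a reasonable shortcut when very few values are missing. For anything more, a multiple-imputation tool such as Amelia II is the safer route.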
Visualizing Your Data
Boxplots
SAS/GRAPH(R) 9.2: Statistical Graphics Procedures Guide, Second Edition
Visualizing Your Data
Histograms
[Figure: histogram; x-axis: Observations, y-axis: Frequency]
Visualizing Your Data
Scatter Plots
[Figure: scatter plot; x-axis: Dad Height (inches), y-axis: Hours of Exercise Last Week]
Understanding Your Data
Correlation Matrices
             satisfied   crucial     wings       groceries   exercise    mom_height  dad_height  resp_height
satisfied    1           -0.03958    -0.1968128  0.18081434  -0.1611544  0.3382386   0.04296793  0.01574379
crucial      -0.03958    1           0.01744746  -0.0729952  -0.2033839  0.03976917  0.07193248  0.07126904
wings        -0.1968128  0.01744746  1           0.06423461  0.43046745  -0.2028857  -0.0494084  -0.1175538
groceries    0.18081434  -0.0729952  0.06423461  1           0.12282262  0.22865457  0.29162024  0.27358675
exercise     -0.1611544  -0.2033839  0.43046745  0.12282262  1           -0.0089783  0.12405018  0.13014272
mom_height   0.3382386   0.03976917  -0.2028857  0.22865457  -0.0089783  1           0.76119006  0.80283251
dad_height   0.04296793  0.07193248  -0.0494084  0.29162024  0.12405018  0.76119006  1           0.90469003
resp_height  0.01574379  0.07126904  -0.1175538  0.27358675  0.13014272  0.80283251  0.90469003  1
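Each entry in a matrix like this is a pairwise Pearson correlation. The sketch below shows one way such a matrix could be computed, in standard-library Python. The column names echo the survey variables, but the values are hypothetical and the helper names are illustrative.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(columns):
    """Nested dict of pairwise correlations, keyed by column name."""
    names = list(columns)
    return {r: {c: pearson(columns[r], columns[c]) for c in names}
            for r in names}

# Hypothetical responses (not the survey data behind the slide).
data = {
    "mom_height": [62, 64, 65, 67, 60],
    "dad_height": [68, 70, 71, 74, 66],
    "exercise": [4, 1, 6, 3, 5],
}
matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name, [round(v, 3) for v in row.values()])
```

The matrix is symmetric with ones on the diagonal, so only the entries above or below the diagonal need to be read.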
Contingency Tables
• Tabulates the number of responses in each category
• Helps to visualize the distribution of data
• Use the approximate χ² test for independence
Pearson's Chi-squared test
data: tab
X-squared = 0.7658, df = 2, p-value = 0.6819
Warning message:
In chisq.test(tab) : Chi-squared approximation may be incorrect
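To show what the test computes, here is a minimal standard-library Python sketch of the χ² statistic, its degrees of freedom, and the expected counts. The table and function names are hypothetical. R issues the warning shown above when some expected counts are small (below about 5), which is when the χ² approximation becomes unreliable, so the sketch reports those cells as well.

```python
import math

def chi_squared(table):
    """Return (statistic, df, expected counts) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    stat = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df, expected

def p_value_df2(stat):
    """Upper-tail p-value, valid ONLY for df == 2 (closed form exp(-x/2))."""
    return math.exp(-stat / 2)

# Hypothetical 2 x 3 table of response counts.
tab = [[10, 20, 30],
       [20, 20, 20]]
stat, df, expected = chi_squared(tab)
small_cells = sum(e < 5 for row in expected for e in row)
print(round(stat, 4), df, small_cells)

# Check against the output above: X-squared = 0.7658 with df = 2.
print(round(p_value_df2(0.7658), 4))
```

For other degrees of freedom the p-value needs the χ² distribution function from a statistics package. The closed form here is used only because the example above happens to have df = 2.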
Analysis of Variance
• Technique used to test for differences between group means
• Always plot your data before doing analyses
Call:
aov(formula = resp_height ~ gender)
Terms:
                  gender Residuals
Sum of Squares   297.744   588.567
Deg. of Freedom        1        39
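The sums of squares in this output can be turned into an F statistic by dividing each by its degrees of freedom and taking the ratio. The standard-library Python sketch below does this for a one-way layout. The function names and the small example groups are hypothetical; only the figures passed to the last call come from the output above.

```python
def one_way_anova(groups):
    """Return (SS between, df between, SS within, df within)."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    return ss_between, len(groups) - 1, ss_within, len(all_values) - len(groups)

def f_statistic(ss_between, df_between, ss_within, df_within):
    """Ratio of the between-group to the within-group mean square."""
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical heights for two small groups.
parts = one_way_anova([[1, 2, 3], [4, 5, 6]])
print(parts, f_statistic(*parts))

# Sums of squares reported above for resp_height ~ gender.
slide_f = f_statistic(297.744, 1, 588.567, 39)
print(round(slide_f, 2))  # roughly 19.7
```

A large F means the variation between group means is large relative to the variation within groups. Judging significance still requires the F distribution with (1, 39) degrees of freedom, which this sketch does not compute.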
Regression
• Actually a generalization of ANOVA
• Again, always plot your data
Call:
lm(formula = exercise ~ dad_height)
Residuals:
    Min      1Q  Median      3Q     Max
-5.9866 -3.4205 -0.3236  2.6709 14.0949
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -7.8573 10.7968 -0.728 0.471
dad_height 0.1938 0.1546 1.253 0.218
Residual standard error: 4.381 on 37 degrees of freedom
(8 observations deleted due to missingness)
Multiple R-squared: 0.04073, Adjusted R-squared: 0.0148
F-statistic: 1.571 on 1 and 37 DF, p-value: 0.2179
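The estimates in this output come from ordinary least squares. A minimal standard-library Python sketch of a one-predictor fit follows, with hypothetical data and function name. Pairs containing a missing value (assumed to be recorded as `None`) are dropped before fitting, mirroring the "(8 observations deleted due to missingness)" line above.

```python
def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x on complete pairs.

    Returns (slope, intercept, r_squared, n_used).
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_tot = sum((b - my) ** 2 for _, b in pairs)
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in pairs)
    return slope, intercept, 1 - ss_res / ss_tot, n

# Hypothetical predictor and response with one missing predictor value.
x = [1, 2, None, 3, 4]
y = [2, 4, 7, 5, 8]
slope, intercept, r_squared, n_used = fit_line(x, y)
print(round(slope, 3), round(intercept, 3), round(r_squared, 4), n_used)
```

In the output above, the slope for dad_height has a p-value of 0.218 and R-squared is about 0.04, so these data show little evidence that exercise hours vary with father's height.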
Other Useful Resources
• A PowerPoint on more automated outlier detection techniques:
  • http://www.dbs.ifi.lmu.de/~zimek/publications/KDD2010/kdd10-outlier-tutorial.pdf
• R Package ‘outliers’:
  • http://cran.r-project.org/web/packages/outliers/outliers.pdf
• On multiple imputation:
  • http://sites.stat.psu.edu/~jls/mifaq.html#bayes