Lecture 2 - The Department of Statistics and Applied Probability, NUS

Exploratory Data Analysis
[Figure: Height and Weight. Scatterplots of Weight against Height, with points marked Male and Female.]
Data exploration and Statistical analysis
1. Data checking, identifying problems and characteristics
[Diagram: Data; Data exploration; categorical / numerical outcomes]
Analyzing a set of data
• Look at the data (initial checks on the data)
   • Downloading data, formatting, data collection, discrepant data, missing data
• Visualise the data (exploratory data analysis)
   • Descriptive statistics, informative tables, well-constructed figures
• Analyse the data (definitive analysis)
   • Formal statistical analysis
   • Quantify any interesting results
• Report the findings
Types of Variables
• Often, the statistical test to use depends on the type of variable at hand
• Two main classes of variables:
• Categorical
• Numerical
• Categorical variables further divided into two sub-classes:
• Nominal categorical (example: gender, ethnic groups)
• Ordinal categorical (example: size of a car, quality of
teaching)
Numerical variables
• Distinguish between discrete and continuous numerical variables
Discrete
• Integer values (number of male subjects, number of episodes of flu outbreaks)
Continuous
• Takes any value within a range (height, weight)
• Continuous variables are sometimes treated as discrete (e.g. age in completed years)
Exploratory Data Analysis
EDA
• Tabular EDA
• Univariate tables, cross-tabulation of categorical
variables
• Numerical EDA
• Location, spread, skewness, covariance and
correlation
• Graphical EDA
• Frequency plots, histograms, boxplots, scatterplots
The precise form of EDA depends on the data at hand.
Tabular EDA
• Useful for summarising categorical data.
For example, the following table shows the classification of 2,555
students from three schools in a study on the GCSE O-level
results in Mathematics:
                   Dunman High    HCI    RI / RGS    Total
No. of students              6    408        1496     1910
Small counts are problematic
in categorical data analysis
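As a rough sketch of a univariate table, the counts shown above can be turned into percentages in a few lines of Python. The counts are those in the table; the cut-off of 10 for a "small" count is an assumption chosen only for illustration.

```python
# Counts taken from the table above.
counts = {"Dunman High": 6, "HCI": 408, "RI / RGS": 1496}

total = sum(counts.values())
percentages = {school: 100 * n / total for school, n in counts.items()}

# Print a simple frequency table with percentages.
for school, n in counts.items():
    print(f"{school:<12} {n:>5} {percentages[school]:5.1f}%")
print(f"{'Total':<12} {total:>5}")

# Flag small cells; the threshold of 10 is an illustrative assumption.
SMALL_COUNT = 10
small = [school for school, n in counts.items() if n < SMALL_COUNT]
print("Small counts:", small)
```

Reporting percentages alongside the counts makes it easier to spot categories that contribute very little data.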
Tabular EDA
• For two categorical variables, e.g. the distribution of A, B and other grades between two schools
[Table: School (Dunman High, HCI) by grade (A, B, Others)]
Question: It appears that Dunman High has proportionally more students scoring A/B grades than HCI. Does this mean anything?
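A cross-tabulation with row proportions could be sketched as below. The records are hypothetical, invented only to show the mechanics; they are not the study data.

```python
from collections import Counter

# Hypothetical (school, grade) records, invented purely for illustration.
records = [
    ("Dunman High", "A"), ("Dunman High", "A"), ("Dunman High", "B"),
    ("Dunman High", "Others"),
    ("HCI", "A"), ("HCI", "B"), ("HCI", "B"),
    ("HCI", "Others"), ("HCI", "Others"), ("HCI", "Others"),
]

cell_counts = Counter(records)                        # count per (school, grade)
row_totals = Counter(school for school, _ in records)  # count per school

grades = ["A", "B", "Others"]
# Proportion of each grade within each school (each row sums to 1).
row_props = {
    school: {g: cell_counts[(school, g)] / row_totals[school] for g in grades}
    for school in row_totals
}

for school, props in row_props.items():
    print(school, {g: round(p, 2) for g, p in props.items()})
```

Row proportions allow schools of different sizes to be compared; whether an apparent difference means anything is a question for the formal analysis.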
Numerical EDA
• Calculating informative numbers which summarise the dataset
• Which numbers are useful for describing the age of 1,059 individuals with diabetes?
• Location parameters (mean, median, mode)
• Spread (range, standard deviation, interquartile
range)
• Skewness
[Figure: distribution of AGE (axis from 20 to 80), with the mean age (54.6 years) marked]
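The location and spread measures listed above can be computed with Python's standard `statistics` module. This is a minimal sketch; the ages are hypothetical values made up for illustration, not the diabetes dataset.

```python
import statistics

# Hypothetical ages, invented for illustration (not the diabetes dataset).
ages = [34, 41, 47, 52, 52, 55, 58, 61, 66, 74]

# Location
mean_age = statistics.mean(ages)
median_age = statistics.median(ages)
mode_age = statistics.mode(ages)

# Spread
age_range = max(ages) - min(ages)
sd_age = statistics.stdev(ages)  # sample standard deviation (divides by n - 1)

print(f"mean={mean_age}, median={median_age}, mode={mode_age}")
print(f"range={age_range}, sd={sd_age:.2f}")
```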
Numerical EDA
Skewness
[Figure: right-skewed distribution (Frequency against Observation), with the Median and Mean marked]
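One way to put a number on skewness is the moment coefficient, sketched below. The choice of this particular formula (statistical software may apply a small-sample correction) and the observations themselves are assumptions for illustration.

```python
import statistics

def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5, using population moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical right-skewed observations: most are small, a few are large.
obs = [1, 2, 2, 3, 3, 3, 4, 5, 9, 14]

print(f"skewness = {skewness(obs):.2f}")
print(f"mean = {statistics.mean(obs)}, median = {statistics.median(obs)}")
```

For right-skewed data the coefficient is positive and the mean is pulled above the median by the long right tail.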
Normal distribution
• About 68% of the probability lies within 1 standard deviation of the mean
• About 95% of the probability lies within 2 SDs of the mean
[Figure: Exam marks for Mathematics exam (axis from 40 to 80)]
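The 68% and 95% figures can be checked with `statistics.NormalDist` from the Python standard library. The helper name `prob_within` is introduced here only for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

def prob_within(k):
    """Probability that a normal variable lies within k SDs of its mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

print(f"within 1 SD: {prob_within(1):.4f}")  # about 0.6827
print(f"within 2 SD: {prob_within(2):.4f}")  # about 0.9545
```

The result does not depend on the particular mean and standard deviation, so the same percentages apply to exam marks, heights or any other normally distributed variable.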
Numerical EDA
• Sample Quartiles
Q1: 25th percentile (the value below which 25% of the ranked data lie)
Q2: 50th percentile (also known as the median of the data)
Q3: 75th percentile (the value below which 75% of the ranked data lie)
Consider the heights of 1000 people, ranked from shortest to tallest: Q1, Q2 and Q3 are the heights one quarter, one half and three quarters of the way along the ranking.
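A small sketch of sample quartiles using `statistics.quantiles`. The heights are hypothetical, and the `"inclusive"` interpolation method is one convention among several; different software packages may give slightly different quartiles for the same data.

```python
import statistics

# Hypothetical heights in cm, invented for illustration.
heights = [172, 150, 165, 158, 181, 160, 170, 155, 175, 163, 168]

ranked = sorted(heights)  # rank from shortest to tallest
q1, q2, q3 = statistics.quantiles(ranked, n=4, method="inclusive")

print(f"Q1={q1}, Q2={q2}, Q3={q3}, IQR={q3 - q1}")
```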
Location and spread
• When mean is used as the location parameter, the
standard deviation is the appropriate measure for
spread
• When median is used as the location parameter, the
corresponding measure for spread is the interquartile
range
• Interquartile range (IQR)
IQR = Q3 – Q1
• Minimum, Maximum of data (seldom used to quantify
spread, but more for data QC)
Numerical EDA
• Numerical summaries can help identify potential problems with the data
Example: Suppose the heights of 1,496 individuals randomly sampled from the population produce the following summary
IQR = Q3 – Q1 = 188 – 172 = 16
Range = Max – Min = 201 – 0 = 201
A minimum height of 0 is not plausible, so the range flags a problem that the IQR alone would hide.
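The check in this example can be sketched as a simple plausibility test on the summary values. The summary numbers are those quoted above; the plausibility limits (and the reading of the heights as centimetres) are assumptions made for illustration.

```python
# Summary values quoted in the example above (heights, presumably in cm).
summary = {"min": 0, "q1": 172, "q3": 188, "max": 201}

iqr = summary["q3"] - summary["q1"]
data_range = summary["max"] - summary["min"]

# Hypothetical plausibility limits for height in cm (an assumption).
LOW, HIGH = 100, 250
problems = [name for name in ("min", "max") if not LOW <= summary[name] <= HIGH]

print(f"IQR = {iqr}, range = {data_range}")
print("Implausible summary values:", problems)
```

The IQR looks unremarkable, while the minimum falls outside the plausible limits and should be traced back to the raw data.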
Correlation
• Two numerical variables: height and weight
Questions
• Is there any relationship between these variables?
• If there is, how do we quantify this relationship?
• Covariance and Correlation
Measures the degree of association between two
numerical variables.
Covariance and Correlation
• Covariance is scale-dependent, whereas correlation is unit-free.
• Correlation is more intuitive to interpret than covariance.
Example: Covariance for height and weight is 2.4 when assessed using metres and kilograms, but 240,000 when assessed using centimetres and grams. The correlation is 0.83 in both scenarios.
• Correlation is always bounded between -1 and 1 inclusive.
• Useful for investigating relationships between variables,
(e.g. weight and height)
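The scale-dependence of covariance can be seen in a short sketch. The heights and weights below are hypothetical (so the values will not match the 2.4 and 0.83 quoted above), and the functions are written out by hand rather than taken from any particular library.

```python
from math import sqrt

def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance divided by the product of the SDs."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical heights (m) and weights (kg), invented for illustration.
height_m = [1.52, 1.60, 1.68, 1.75, 1.83]
weight_kg = [50.0, 58.0, 61.0, 72.0, 80.0]

# The same measurements in centimetres and grams.
height_cm = [h * 100 for h in height_m]
weight_g = [w * 1000 for w in weight_kg]

cov_m_kg = covariance(height_m, weight_kg)
cov_cm_g = covariance(height_cm, weight_g)
corr_m_kg = correlation(height_m, weight_kg)
corr_cm_g = correlation(height_cm, weight_g)

print(f"covariance: {cov_m_kg:.3f} (m, kg) vs {cov_cm_g:.1f} (cm, g)")
print(f"correlation: {corr_m_kg:.3f} vs {corr_cm_g:.3f}")
```

Changing units multiplies the covariance by 100 × 1000 = 100,000, the same factor that takes 2.4 to 240,000 in the example, while the correlation is unchanged.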
Example
Graphical EDA
• Visual summaries of the data
• Flagging outliers, obvious relationships, check for
distribution
Boxplots
• Univariate boxplot: for 1 numerical variable
Ends of box: Q1 and Q3
Length of box: IQR
White line: Sample median
Whiskers: extend up to 1.5 times the IQR beyond the ends of the box
Lines outside whiskers: Outliers
Circles: Extreme outliers
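The outlier rule behind the whiskers can be sketched as follows. The 1.5 × IQR limit comes from the legend above; the 3 × IQR limit for "extreme" outliers is a common convention assumed here, and the data and the function name `classify_outliers` are hypothetical.

```python
import statistics

def classify_outliers(values):
    """Split points beyond the whisker limits into outliers and extreme outliers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)  # whisker limits
    outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)  # assumed limit for extreme outliers
    outliers = [v for v in values
                if not inner[0] <= v <= inner[1] and outer[0] <= v <= outer[1]]
    extreme = [v for v in values if not outer[0] <= v <= outer[1]]
    return outliers, extreme

# Hypothetical heights in cm with two suspicious values at the top.
values = [150, 155, 158, 160, 163, 165, 168, 170, 172, 175, 200, 260]
outliers, extreme = classify_outliers(values)
print("Outliers:", outliers)
print("Extreme outliers:", extreme)
```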
Boxplots
• Multivariate boxplots: for 1 numerical variable across
different levels of a categorical variable
• Graphical comparison
Scatterplots
• Graphical representation for 2 numerical variables
[Figure: scatterplot of Weight against Height, with points marked Male and Female]
Scatterplots
[Figure: four scatterplots of y against x, titled "Perfect Negative Correlation", "Perfect Positive Correlation", "Correlation = 0.8" and "Correlation = -0.3"]
Scatterplots
[Figure: scatterplot of y against x, titled "Correlation = 0.0"]
Exploratory Data Analysis in
RExcel and SPSS
Comparing height of children
• Height data for 30 children, from 3 groups
• The interest is in comparing the height of children between groups
• Useful (and not useful!) data exploration
Coding numerical variables as factors
Retain the numbers as category labels, or define new names for the categories
Note the deliberate
mistake here! Always
know your variables
well!
Stratified analysis by group
Click on this to define the
variable that contains the
grouping information for
stratification
Boxplots
Choose this to produce
separate boxplots for the three
groups (stratified analysis)
Maximum
3rd quartile
Interquartile range (the box; 25% of the data on each side of the median)
Median
1st quartile
Minimum
An excellent way to observe
graphical/preliminary evidence
of any differences between the
groups!
No comments can be made if the boxes
overlap. Only when two boxes (or more)
do not overlap can we say there is
graphical evidence of a difference
between the two (or more) groups
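The overlap rule of thumb above can be expressed as a comparison of the boxes (Q1 to Q3) for each group. This is a sketch only: the heights for the 30 children are hypothetical, and `box` and `boxes_overlap` are helper names introduced for illustration.

```python
import statistics

# Hypothetical heights (cm) for 30 children in 3 groups, invented for illustration.
groups = {
    1: [100, 102, 104, 105, 106, 107, 108, 110, 111, 113],
    2: [104, 106, 107, 109, 110, 111, 112, 114, 115, 117],
    3: [118, 120, 121, 123, 124, 125, 126, 128, 129, 131],
}

def box(values):
    """Return (Q1, Q3), the ends of the box in a boxplot."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q1, q3

def boxes_overlap(a, b):
    """True if the Q1-Q3 boxes of two groups share any values."""
    (a_lo, a_hi), (b_lo, b_hi) = box(a), box(b)
    return a_lo <= b_hi and b_lo <= a_hi

for g1, g2 in [(1, 2), (1, 3), (2, 3)]:
    verdict = "overlap" if boxes_overlap(groups[g1], groups[g2]) else "do not overlap"
    print(f"groups {g1} and {g2}: boxes {verdict}")
```

In this made-up data the boxes for groups 1 and 2 overlap, so no comment is made, while group 3 is separated from both.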
What about SPSS?
Never choose this when plotting
a histogram to get a gauge of the
distribution of the dataset
To perform a stratified analysis,
place the grouping variable under
Factor List
Default is Stem-and-leaf,
remember to change it to
Histogram
Check this to perform a
quantitative test for normality
Numerical
summaries
Statistical test for departure from normality
Tests of Normality
                  Kolmogorov-Smirnov(a)          Shapiro-Wilk
height  group     Statistic   df   Sig.          Statistic   df   Sig.
        1.00      .182        10   .200*         .899        10   .211
        2.00      .176        10   .200*         .948        10   .648
        3.00      .134        10   .200*         .955        10   .730
*. This is a lower bound of the true significance.
a. Lilliefors Significance Correction
The Sig. columns give the statistical evidence, known as significance levels or P-values.
For the time being:
- If values are > 0.05, interpret as normality assumption is valid;
- If values are < 0.05, interpret as normality assumption is not valid, and the variable does not follow a normal distribution.
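The interim rule of thumb above can be written as a small decision function. The Sig. values are the last three in the SPSS output, read here as the Shapiro-Wilk column (an assumption about the garbled table layout); the function name `interpret` is hypothetical.

```python
# Sig. values from the SPSS output above, read as the Shapiro-Wilk column.
shapiro_sig = {1: 0.211, 2: 0.648, 3: 0.730}

ALPHA = 0.05  # cut-off used in the rule of thumb above

def interpret(p_value, alpha=ALPHA):
    """Apply the rule of thumb. The slide does not cover p exactly equal to
    alpha; that case is treated here as 'not assumed' (an assumption)."""
    return "normality assumed" if p_value > alpha else "normality not assumed"

for group, p in shapiro_sig.items():
    print(f"group {group}: Sig. = {p:.3f} -> {interpret(p)}")
```

All three groups have Sig. values above 0.05, so under this rule the normality assumption is retained for each group.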
There really isn’t much difference between
RExcel and SPSS
Data checking
Omega 3 consumption and mathematical
abilities
• Students from 3 schools participated in a study to assess
the effects of omega 3 on mathematical abilities
• For each student, there is information on:
• school
• gender
• marks before
• marks after
• daily omega 3 consumption (mg)
Zeroes are important to take note of, but
how do we decide whether they are
plausible values or problematic values?
For flagging outliers in boxplots
Now we need to
exclude these
datapoints
Students should be able to
• realise that data exploration prior to formal statistical analysis
is important;
• know what to look out for in data checking of categorical
variables
• know what to look out for in data checking of numerical
variables
• understand the use of frequencies (percentages) for
categorical data summary
• understand which location and variability metrics to use for
numerical data
• understand the use and interpretation of histograms
• interpret boxplots, for variable summary and for graphical
comparisons
• know the usage and interpretation of scatterplots
• perform data entry in RExcel and SPSS
• perform exploratory data analysis in RExcel and SPSS
• identify and remove problematic data in RExcel and SPSS
• generate useful summary tables and figures for a dataset in
investigating research hypotheses
• interpret generated summary tables and figures of a dataset
for investigating research hypotheses