Introduction to Descriptive Statistics

advertisement
Introduction to Descriptive
Statistics
17.871
Spring 2006
First, Some Words about
Graphical Presentation
• Aspects of graphical integrity (following
Edward Tufte, Visual Display of
Quantitative Information)
– Represent number in direct proportion to
numerical quantities presented
– Write clear labels on the graph
– Show data variation, not design variation
– Deflate and standardize money in time series
Population vs. Sample Notation
Population
Greeks
, , 
Vs
Sample
Romans
s, b
Types of Variables
Nominal
(Qualitative)
U&H: “categorical”
~Nominal
(Quantitative)
Ordinal
Interval or
ratio
Describing data
Moment
Center
Spread
Mean
Variance
(standard
deviation)
Skew
Skewness
Non-mean based
measure
Mode, median
Range,
Interquartile
range
--
Peaked
Kurtosis
--
Mean
n
x
 i
i 1
n
X
Variance, Standard Deviation
( xi   )
2
 ,

n
i 1
2
n
( xi   )


n
i 1
n
2
Variance, S.D. of a Sample
( xi   )
2
s ,

n 1
i 1
2
n
Degrees of freedom
( xi   )
s

n 1
i 1
n
2
The z-score
or the
“standardized score”
z
x x
x
Skewness
Symmetrical distribution
• IQ
• SAT
Frequency
• “No skew”
• “Zero skew”
• Symmetrical
Value
Skewness
Asymmetrical distribution
• GPA of MIT students
Frequency
• “Negative skew”
• “Left skew”
Value
Skewness
(Asymmetrical distribution)
• Income
• Contribution to
candidates
• Populations of
countries
• “Residual vote” rates
Frequency
Value
• “Positive skew”
• “Right skew”
Skewness
n
 x  x 
3
i
i 1
n
(n  1)

n2
3/ 2
 x  x 
i 1
3/ 2

i
(mean  mod e) / s  3  (mean  median) / s
Skewness
Frequency
Value
Kurtosis
k>3
leptokurtic
Frequency
k=3
k<3
Value
Beware the “coefficient of excess”
mesokurtic
platykurtic
A few words about the normal
curve
• Skewness = 0
• Kurtosis = 3
1
 ( x   ) / 2 2
f ( x) 
e
 2
More words about the normal
curve
34% 34%
47%
49%
47%
49%
“Empirical rule”
Range
s
6
SEG example
The instructor and/or section leader:
Mean
s.d.
Skew
Kurt
Gives well-prepared, relevant presentations 6.0
0.69
-1.7
8.5
Explains clearly and answers questions
well
5.9
0.68
-1.0
4.8
Uses visual aids well
5.6
0.85
-1.8
8.9
Uses information technology effectively
5.5
0.91
-1.1
5.0
Speaks well
6.1
0.69
-1.5
6.8
Encourages questions & class participation 6.1
0.66
-0.88
3.7
Stimulates interest in the subject
5.9
0.76
-1.1
4.7
Is available outside of class for questions
5.9
0.68
-1.3
6.3
Overall rating of teaching
5.9
0.67
-1.2
5.5
Graph
Graph some SEG variables
The instructor and/or section leader:
s.d.
Skew
Kurt
5.6
0.85
-1.8
8.9
Graph
.6
Fraction
Uses visual aids well
Mean
0
1
6.1
0.66
-0.88
3.7
.6
Fraction
Encourages
questions & class
participation
7
(mean) q3
0
1
7
(mean) q6
Binary data
X  prob( X )  1  proportion of time x  1
s  x (1  x )  s x  x (1  x )
2
x
Commands in STATA for getting
univariate statistics
• summarize varname
• summarize varname, detail
• histogram varname, bin() start() width()
density/fraction/frequency normal
• graph box varnames
• tabulate [NB: compare to table]
Example of Sophomore Test Scores
• High School and Beyond, 1980: A
Longitudinal Survey of Students in the
United States (ICPSR Study 7896)
• totalscore = % of questions answered
correctly on a battery of questions
• recodedtype = (1=public school, 2=religious
private private, 3 = non-sectarian private)
Explore totalscore some more
. table recodedtype,c(mean totalscore)
-------------------------recodedty |
pe
| mean(totals~e)
----------+--------------1 |
.3729735
2 |
.4475548
3 |
.589883
--------------------------
Graph totalscore
1
.5
0
Density
1.5
2
. hist totalscore
-.5
0
.5
totalscore
1
Divide into “bins” so that each bar
represents 1% correct
1
.5
0
Density
1.5
2
• hist totalscore,width(.01)
• (bin=124, start=-.24209334, width=.01)
-.5
0
.5
totalscore
1
Add ticks at each 10% mark
1
.5
0
Density
1.5
2
histogram totalscore, width(.01) xlabel(-.2 (.1) 1)
(bin=124, start=-.24209334, width=.01)
-.2
-.1
0
.1
.2
.3
.4
.5
totalscore
.6
.7
.8
.9
1
Superimpose the normal curve
(with the same mean and s.d. as the
empirical distribution)
1
.5
0
Density
1.5
2
. histogram totalscore, width(.01) xlabel(-.2 (.1) 1)
normal
(bin=124, start=-.24209334, width=.01)
-.2
-.1
0
.1
.2
.3
.4
.5
totalscore
.6
.7
.8
.9
1
Do the previous graph by school
types
.histogram totalscore, width(.01) xlabel(-.2 (.1)1)
by(recodedtype)
(bin=124, start=-.24209334, width=.01)
2
Density
0
1
2
3
1
-.2 -.1 0 .1 .2 .3 .4 .5 .6 .7 .8 .9 1
0
1
2
3
3
-.2 -.1 0 .1 .2 .3 .4 .5 .6 .7 .8 .9 1
totalscore
Graphs by recodedtype
Main issues with histograms
• Proper level of aggregation
• Non-regular data categories (see next)
A note about histograms with
unnatural categories (start here)
From the Current Population Survey (2000), Voter and Registration Survey
How long (have you/has name) lived at this address?
-9
-3
-2
-1
1
2
3
4
5
6
No Response
Refused
Don't know
Not in universe
Less than 1 month
1-6 months
7-11 months
1-2 years
3-4 years
5 years or longer
Simple graph
Fraction
.557134
0
1
6
PES8
Solution, Step 1
Map artificial category onto
“natural” midpoint
-9
-3
-2
-1
1
2
3
4
5
6
No Response  missing
Refused  missing
Don't know  missing
Not in universe  missing
Less than 1 month  1/24 = 0.042
1-6 months  3.5/12 = 0.29
7-11 months  9/12 = 0.75
1-2 years  1.5
3-4 years  3.5
5 years or longer  10 (arbitrary)
Graph of recoded data
Fraction
.557134
0
0
1
2
3
4
5
longevity
6
7
8
9
10
Density plot of data
Total area of last bar = .557
Width of bar = 11 (arbitrary)
Solve for: a = w h (or)
.557 = 11h => h = .051
0
0
1
2
3
4
5
6
longevity
7
8
9
10
15
Density plot template
Category
F
X-min
X-max
X-length
Height
(density)
< 1 mo.
.0156
0
1/12
.082
.19*
1-6 mo.
.0909
1/12
½
.417
.22
7-11 mo.
.0430
½
1
.500
.09
1-2 yr.
.1529
1
2
1
.15
3-4 yr.
.1404
2
4
2
.07
5+ yr.
.5571
4
15
11
.05
* = .0156/.082
Draw the previous graph with a box
plot
1
. graph box totalscore
-.5
0
.5
Upper quartile
Median
Lower quartile
}
}
1.5 x IQR
Inter-quartile
range
Draw the box plots for the different
types of schools
. graph box totalscore,by(recodedtype)
-.5
0
.5
1
1
-.5
0
.5
1
3
Graphs by recodedtype
2
Draw the box plots for the different
types of schools using “over” option
-.5
0
.5
1
graph box totalscore,over(recodedtype)
1
2
3
Issue with box plots
• Sometimes overly highly stylized
Three words about pie charts:
don’t use them
So, what’s wrong with them
• For non-time series data, hard to get a
comparison among groups; the eye is very
bad in judging relative size of circle slices
• For time series, data, hard to grasp crosstime comparisons
Time series example
An exception to the no pie chart rule
The worst graph ever published
Download