Stat 11
January 24, 2008
Descriptive Statistics
(single variable)
Notation
$n$: number of items in sample
$x_1, x_2, \ldots, x_n$: values in the sample (or use y, z, etc., all lower case)
$x_i$: i-th value (where i = 1, …, n)
$\alpha$ (Greek “alpha”) – some number between 0 and 1, usually near 0,
representing, for example, a fraction of observations
$x_1', x_2', \ldots, x_n'$: “order statistics” – same values as $x_1, x_2, \ldots, x_n$, but sorted in
increasing order. For example, $x_n'$ is the largest of the
numbers $x_1, x_2, \ldots, x_n$. Always: $x_1' \le x_2' \le \cdots \le x_n'$.
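As a small illustration of the order statistics, here is a sketch in Python. The sample values are made up, and the variable names (`x_sorted`, `smallest`, `largest`) are just labels chosen for this example.

```python
# Hypothetical sample; the values are made up for illustration.
x = [7, 3, 9, 3, 5]
n = len(x)

# Order statistics: the same values, sorted in increasing order.
x_sorted = sorted(x)

# Python lists are 0-indexed, so the k-th order statistic is x_sorted[k - 1].
smallest = x_sorted[0]       # first order statistic
largest = x_sorted[n - 1]    # n-th order statistic, the largest value
```

Note that ties are kept: the repeated value 3 appears twice in the sorted list, so the order statistics are non-decreasing rather than strictly increasing.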
Graphical representations
Dot plot
Histogram
Bars should normally have equal width. If not, then be sure
that equal areas represent equal quantities (“equal area principle”).
This means that the vertical axis has units of “number of
observations per x-axis unit.”
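The equal area principle can be checked numerically. The sketch below assumes made-up data and made-up bin edges of unequal width; `bar_heights` is a hypothetical helper written for this example, not a library function. Each bar's height is the count divided by the bin width, so that height times width gives back the count.

```python
# Hypothetical data and bin edges, chosen only to show unequal widths.
data = [1, 2, 2, 3, 4, 6, 7, 12, 15, 18]
edges = [0, 5, 10, 20]   # bins [0, 5), [5, 10), [10, 20]

def bar_heights(values, bin_edges):
    """Return (counts, heights), where height = count / bin width."""
    n_bins = len(bin_edges) - 1
    counts = [0] * n_bins
    for v in values:
        for k in range(n_bins):
            lo, hi = bin_edges[k], bin_edges[k + 1]
            is_last = (k == n_bins - 1)
            # Bins are half-open, except the last one includes its right edge.
            if lo <= v < hi or (is_last and v == hi):
                counts[k] += 1
                break
    heights = [c / (bin_edges[k + 1] - bin_edges[k]) for k, c in enumerate(counts)]
    return counts, heights

counts, heights = bar_heights(data, edges)

# Area of each bar (height * width) recovers the number of observations in it.
areas = [h * (edges[k + 1] - edges[k]) for k, h in enumerate(heights)]
```

Here the heights are in units of observations per x-axis unit, which is why the widest bin gets a short bar even though it holds three observations.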
Box plot [figure]
or, per Tufte [figure]
Measures of Location
Mean = Average = Arithmetic mean (AM)
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{x_1 + \cdots + x_n}{n}$
Other kinds of means:
Geometric mean (GM) = $\sqrt[n]{x_1 x_2 \cdots x_n}$
(interesting only if all $x_i$’s are positive)
Harmonic mean (HM) = inverse of average of inverses
= $\frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \cdots + \frac{1}{x_n}}$
Root-mean-square (RMS) = $\sqrt{\frac{x_1^2 + x_2^2 + \cdots + x_n^2}{n}}$
Grammatically challenging. We’ll sometimes call it the RMS mean.
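The four means can be compared side by side on one sample. This is a minimal sketch using a made-up positive sample; the variable names are chosen for this example only.

```python
import math

x = [1.0, 2.0, 4.0, 8.0]   # hypothetical sample, all values positive
n = len(x)

am = sum(x) / n                                 # arithmetic mean
gm = math.prod(x) ** (1 / n)                    # geometric mean: n-th root of the product
hm = n / sum(1 / xi for xi in x)                # harmonic mean: inverse of average of inverses
rms = math.sqrt(sum(xi ** 2 for xi in x) / n)   # root-mean-square
```

For any sample of positive values, HM ≤ GM ≤ AM ≤ RMS, with equality only when all the values are the same. In this sample the arithmetic mean is 3.75.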
Median: $\tilde{x} = \begin{cases} \text{middle value}, & \text{if } n \text{ is odd} \\ \text{average of two middle values}, & \text{if } n \text{ is even} \end{cases}$
Midmean: average of the middle half
(exclude largest 25% and smallest 25%; report the average of the rest)
Trimmed mean: (really, “α-trimmed mean”)
$\bar{x}_\alpha$ = average of remaining values, after $\lfloor \alpha n \rfloor$ lowest values
and $\lfloor \alpha n \rfloor$ highest values are removed
(sensible only if α < 1/2.)
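The median, midmean, and trimmed mean can be sketched as follows. Two assumptions are made here: when α·n is not a whole number, the count removed from each end is rounded down, and the midmean is treated as the trimmed mean with α = 0.25. The function names are hypothetical and the sample is made up.

```python
import math

def median(values):
    """Middle value if n is odd, average of the two middle values if n is even."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2

def trimmed_mean(values, alpha):
    """Average after dropping floor(alpha * n) values from each end (assumed rounding)."""
    if not 0 <= alpha < 0.5:
        raise ValueError("alpha must satisfy 0 <= alpha < 1/2")
    s = sorted(values)
    n = len(s)
    k = math.floor(alpha * n)   # number removed from each end
    kept = s[k:n - k]
    return sum(kept) / len(kept)

def midmean(values):
    """Average of the middle half: the trimmed mean with alpha = 0.25."""
    return trimmed_mean(values, 0.25)

x = [2, 4, 4, 5, 7, 9, 10, 100]   # hypothetical sample with one large outlier
med = median(x)                   # 6.0
mid = midmean(x)                  # 6.25
mean = trimmed_mean(x, 0)         # plain arithmetic mean, pulled up by the outlier
```

The outlier 100 moves the arithmetic mean far more than it moves the median or the midmean.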
Weighted averages:
weighted average = $w_1 x_1 + w_2 x_2 + \cdots + w_n x_n$,
where each $w_i \ge 0$ and $\sum_{i=1}^{n} w_i = 1$.
Percentiles
Percentiles ( = quantiles = fractiles ):
x = value such that fraction α of observations are ≤ x
and fraction 1 – α of observations are  x
(not well defined for certain values of α)
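The sketch below shows why the percentile is not always well defined. It reads the definition as "at least fraction α of observations are ≤ the value, and at least fraction 1 – α are ≥ the value", which is an assumed interpretation, and it lists the sample values that qualify. The function name and the small tolerance `eps` are choices made for this example.

```python
def percentile_candidates(values, alpha, eps=1e-12):
    """Sample values v with at least fraction alpha <= v and at least 1 - alpha >= v."""
    n = len(values)
    candidates = []
    for v in sorted(set(values)):
        frac_le = sum(1 for x in values if x <= v) / n
        frac_ge = sum(1 for x in values if x >= v) / n
        if frac_le >= alpha - eps and frac_ge >= (1 - alpha) - eps:
            candidates.append(v)
    return candidates

four = percentile_candidates([1, 2, 3, 4], 0.25)       # [1, 2]
five = percentile_candidates([1, 2, 3, 4, 5], 0.25)    # [2]
```

With four observations, both 1 and 2 (and any number between them) satisfy the definition for α = 0.25, so there is no single answer. With five observations only the value 2 qualifies.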
Quartiles:
Q1 = lower quartile, same as $x_{0.25}$
Q3 = upper quartile, same as $x_{0.75}$
Extremes:
$x_{\min}$ = minimum value among $x_1, x_2, \ldots, x_n$
$x_{\max}$ = maximum value among $x_1, x_2, \ldots, x_n$
Five-number summary:
minimum value, Q1, median, Q3, maximum value.
(Not quite standard. Some would use extreme percentiles in place
of the minimum and maximum, especially in large data sets prone to outliers.)
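A five-number summary can be sketched with Python's standard `statistics` module. The quartiles here come from `statistics.quantiles` with its default settings, which is only one of several conventions for choosing a quartile; the helper name `five_number_summary` and the sample are made up for this example.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    # quantiles(..., n=4) returns three cut points; the first and last are Q1 and Q3.
    q1, _, q3 = statistics.quantiles(values, n=4)
    return (min(values), q1, statistics.median(values), q3, max(values))

x = [1, 2, 3, 4, 5, 6, 7, 8, 9]   # hypothetical sample
summary = five_number_summary(x)
```

Other software may report slightly different values for Q1 and Q3 on the same data, since the quartiles are not uniquely defined for every sample size.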
Measures of Dispersion
Variance (or “population variance,” VARP( ) in Excel):
$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$ = “mean square deviation”
Standard deviation (or “population standard deviation,” STDEVP( ) in Excel):
$\sigma = \sqrt{\sigma^2}$ = “root-mean-square deviation”
(“standard” often means root-mean-square in statistics,
so “standard deviation” is a formula as well as a name)
Sample variance (VAR( ) in Excel):
$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$
Sample standard deviation (STDEV( ) in Excel):
$s = \sqrt{s^2}$
Range: $x_{\max} - x_{\min}$
(largest value minus smallest)
(not really standard usage; some use the word
“range” to refer to the pair $x_{\min}$, $x_{\max}$.)
Interquartile range (IQR):
Q3 – Q1 = $x_{0.75} - x_{0.25}$
(not well defined if n is a multiple of 4)
Mean absolute deviation:
mean absolute deviation = $\frac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}|$
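The range, interquartile range, and mean absolute deviation can be sketched together. The sample is made up, and the quartiles come from `statistics.quantiles` with default settings, which is one convention among several. Since n = 8 is a multiple of 4 here, the quartiles are not uniquely defined and other conventions may give a different IQR.

```python
import statistics

x = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical sample
n = len(x)
xbar = sum(x) / n

# Range: largest value minus smallest.
value_range = max(x) - min(x)

# Interquartile range, using one particular quartile convention (an assumption).
q1, _, q3 = statistics.quantiles(x, n=4)
iqr = q3 - q1

# Mean absolute deviation: average distance from the mean.
mean_abs_dev = sum(abs(xi - xbar) for xi in x) / n
```

For this sample the range is 7 and the mean absolute deviation is 1.5, which is smaller than the population standard deviation of 2 for the same data.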