Stat 61
October 29, 2007
Descriptive Statistics
(single variable)
Notation
$n$: number of items in sample
$x_1, x_2, \ldots, x_n$: values in the sample (or use $y$, $z$, etc., all lower case)
$x_i$: the $i$-th value (where $i = 1, \ldots, n$)
$\alpha$ (Greek “alpha”): some number between 0 and 1, usually near 0,
representing, for example, a fraction of observations
$x_1', x_2', \ldots, x_n'$: “order statistics”, the same values as $x_1, x_2, \ldots, x_n$, but sorted in
increasing order. For example, $x_n'$ is the largest of the
numbers $x_1, x_2, \ldots, x_n$. Always: $x_1' \le x_2' \le \cdots \le x_n'$.
Graphical representations
Dot plot
Histogram
Bars should normally have equal width. If not, then be sure
that equal areas represent equal quantities (“equal area principle”).
This means that the vertical axis has units of “number of
observations per x-axis unit.”
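The equal area principle can be checked numerically. The sketch below is only an illustration: the function name `histogram_heights`, the sample data, and the bin edges are all made up for this example. It computes bar heights as counts divided by bin widths, so that each bar's area equals the number of observations in its bin.

```python
from bisect import bisect_right

def histogram_heights(values, edges):
    """Bar heights in 'observations per x-axis unit' for the given bin edges.

    Bins are [edges[j], edges[j+1]); the last bin also includes its right edge.
    Values outside [edges[0], edges[-1]] are ignored.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        if v < edges[0] or v > edges[-1]:
            continue
        j = min(bisect_right(edges, v) - 1, len(counts) - 1)
        counts[j] += 1
    widths = [b - a for a, b in zip(edges, edges[1:])]
    return [c / w for c, w in zip(counts, widths)]

# Hypothetical data with unequal bin widths (5, 5, 10).
data = [1, 2, 2, 3, 5, 8, 9, 12, 15, 19]
edges = [0, 5, 10, 20]
heights = histogram_heights(data, edges)
areas = [h * (b - a) for h, a, b in zip(heights, edges, edges[1:])]
print(heights)  # heights per x-axis unit
print(areas)    # areas recover the bin counts
```

The last two bins hold the same number of observations, so the wider one is drawn half as tall.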
Stem and leaf (obsolete)
Box plot [figure omitted]; or, per Tufte, an alternative form [figure omitted]
Empirical cdf (= empirical cumulative distribution function)
F(x) = fraction of observations that are ≤ x
In general, the “empirical distribution” corresponding to $x_1, x_2, \ldots, x_n$ is
a discrete distribution with $p(x)$ = (number of $x_i$’s equal to $x$) / $n$. It is the
distribution corresponding to picking one of the $x_i$’s at random.
Empirical survival function
S(x) = 1 – F(x) = fraction of observations that are > x
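A minimal sketch of the empirical cdf, the empirical distribution p(x), and the survival function, following the definitions above. The function names and the sample data are illustrative assumptions, not part of the handout.

```python
from collections import Counter

def ecdf(values):
    """Return F, where F(x) = fraction of observations that are <= x."""
    n = len(values)
    return lambda x: sum(1 for v in values if v <= x) / n

def survival(values):
    """Return S, where S(x) = 1 - F(x) = fraction of observations > x."""
    F = ecdf(values)
    return lambda x: 1 - F(x)

def empirical_pmf(values):
    """p(x) = (number of observations equal to x) / n."""
    n = len(values)
    return {x: c / n for x, c in Counter(values).items()}

data = [3, 1, 4, 1, 5]  # hypothetical sample
F, S, p = ecdf(data), survival(data), empirical_pmf(data)
print(F(0), F(1), F(4), F(5))
print(S(4))
print(p[1])
```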
Measures of Location
Mean = Average = Arithmetic mean (AM)
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{x_1 + \cdots + x_n}{n}$$
Other kinds of means:
Geometric mean = $\sqrt[n]{x_1 x_2 \cdots x_n}$
(interesting only if all $x_i$’s are positive)
Harmonic mean = inverse of average of inverses
= $\dfrac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \cdots + \frac{1}{x_n}}$
Root-mean-square = $\sqrt{\dfrac{x_1^2 + x_2^2 + \cdots + x_n^2}{n}}$
Grammatically challenging. We’ll sometimes call it the RMS mean.
Transformed means, in general:
(1) transform the individual xi’s by applying some function f;
(2) calculate the arithmetic mean of the transformed values;
(3) apply the reverse transformation, $f^{-1}$.
For the geometric mean, use $f(x) = \log x$.
For the harmonic mean, use $f(x) = 1/x$.
For the root-mean-square, use $f(x) = x^2$.
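The three-step recipe can be sketched as a single helper. The name `transformed_mean` and the sample values are assumptions made for illustration; the square root is used as the inverse of $x^2$, which assumes nonnegative results.

```python
import math

def transformed_mean(values, f, f_inv):
    """Apply f to each value, average the results, then apply f_inv."""
    transformed = [f(x) for x in values]
    return f_inv(sum(transformed) / len(transformed))

data = [1.0, 2.0, 4.0]  # hypothetical positive sample

arithmetic = transformed_mean(data, lambda x: x, lambda y: y)
geometric = transformed_mean(data, math.log, math.exp)
harmonic = transformed_mean(data, lambda x: 1 / x, lambda y: 1 / y)
rms = transformed_mean(data, lambda x: x * x, math.sqrt)

print(arithmetic, geometric, harmonic, rms)
```

For this sample the geometric mean is the cube root of 8, the harmonic mean is 3 / (1 + 1/2 + 1/4) = 12/7, and the RMS mean is the square root of 7.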
Median: $\tilde{x} = \begin{cases} \text{middle value}, & \text{if } n \text{ is odd} \\ \text{average of two middle values}, & \text{if } n \text{ is even} \end{cases}$
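The two cases of the median can be sketched directly on the sorted values (the order statistics). The function below is an illustrative sketch; Python's standard library also provides `statistics.median`, which follows the same rule.

```python
def median(values):
    """Middle order statistic if n is odd; average of the two middle ones if n is even."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2

print(median([5, 1, 3]))     # odd n: the middle value
print(median([4, 1, 3, 2]))  # even n: average of the two middle values
```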
Trimmed mean: (really, “$\alpha$-trimmed mean”)
$\bar{x}_\alpha$ = average of remaining values, after the $\lfloor n\alpha \rfloor$ lowest values
and the $\lfloor n\alpha \rfloor$ highest values are removed
(sensible only if $\alpha < 1/2$.)
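A minimal sketch of the trimmed mean. It assumes that the number of values dropped from each end is $n\alpha$ rounded down; that rounding rule is an assumption of this example, and other conventions exist. The function name and data are illustrative.

```python
import math

def trimmed_mean(values, alpha):
    """Average after dropping floor(n * alpha) values from each end (assumed rule)."""
    if not 0 <= alpha < 0.5:
        raise ValueError("alpha must satisfy 0 <= alpha < 1/2")
    s = sorted(values)
    k = math.floor(len(s) * alpha)  # number removed from each end
    kept = s[k:len(s) - k]
    return sum(kept) / len(kept)

data = [1, 2, 3, 4, 100]  # hypothetical sample with one outlier
print(trimmed_mean(data, 0.0))  # ordinary mean
print(trimmed_mean(data, 0.2))  # drops 1 and 100
```

With one value trimmed from each end, the outlier no longer affects the result.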
Weighted averages:
weighted average = $w_1 x_1 + w_2 x_2 + \cdots + w_n x_n$
where each $w_i \ge 0$ and $\sum_{i=1}^{n} w_i = 1$.
A weighted average is also called a convex combination of the xi’s.
(Without the restrictions on the weights the same expression is
called a linear combination of the xi’s, but it would not
then be called a weighted average.)
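A short sketch of a weighted average that enforces the two restrictions on the weights. The function name and the example numbers are assumptions for illustration.

```python
import math

def weighted_average(values, weights):
    """Convex combination: weights must be nonnegative and sum to 1."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be nonnegative")
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must sum to 1")
    return sum(w * x for w, x in zip(weights, values))

print(weighted_average([10, 20, 30], [0.5, 0.25, 0.25]))
print(weighted_average([1, 2, 3, 4], [0.25] * 4))  # equal weights give the ordinary mean
```

Because the weights are nonnegative and sum to 1, the result always lies between the smallest and largest $x_i$.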
Percentiles (quantiles, fractiles):
$x_\alpha$ = value such that fraction $\alpha$ of observations are $\le x_\alpha$
and fraction $1 - \alpha$ of observations are $\ge x_\alpha$
(not well defined for certain values of $\alpha$)
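The remark that percentiles are not always well defined can be seen with a small check. The sketch below reads the definition as "at least" fraction $\alpha$ at or below and "at least" fraction $1 - \alpha$ at or above; that reading is an assumption of this example, as is the function name.

```python
from fractions import Fraction

def is_alpha_quantile(q, values, alpha):
    """True if at least alpha*n values are <= q and at least (1-alpha)*n are >= q."""
    n = len(values)
    at_most = sum(1 for v in values if v <= q)
    at_least = sum(1 for v in values if v >= q)
    return at_most >= alpha * n and at_least >= (1 - alpha) * n

half = Fraction(1, 2)

# n = 3: only the middle value qualifies as x_0.5.
print([q for q in (1, 1.5, 2, 2.5, 3) if is_alpha_quantile(q, [1, 2, 3], half)])

# n = 4: every value between the two middle observations qualifies.
print([q for q in (1.5, 2, 2.5, 3, 3.5) if is_alpha_quantile(q, [1, 2, 3, 4], half)])
```

When $n\alpha$ is a whole number, a whole interval of values satisfies the definition, so some convention (such as averaging the endpoints) is needed to pick one.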
Quartiles:
$Q_1$ = lower quartile, same as $x_{0.25}$
$Q_3$ = upper quartile, same as $x_{0.75}$
Extremes:
$x_{\min}$ = minimum value among $x_1, x_2, \ldots, x_n$
$x_{\max}$ = maximum value among $x_1, x_2, \ldots, x_n$
Five-number summary:
minimum value, Q1, median, Q3, maximum value.
(Not quite standard. Some would use extreme percentiles in place
of the minimum and maximum, especially in large data sets prone to outliers.)
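A sketch of the five-number summary using Python's `statistics.quantiles` (Python 3.8+). The `method="inclusive"` setting is just one of several quartile conventions and is an assumption of this example, not necessarily the convention intended in these notes; the function name and data are also illustrative.

```python
import statistics

def five_number_summary(values):
    """(minimum, Q1, median, Q3, maximum), with quartiles from statistics.quantiles."""
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, med, q3, max(values))

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical sample
print(five_number_summary(data))
```

Other software may report slightly different quartiles for the same data, since the percentile definition leaves room for different interpolation rules.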
Measures of Dispersion
Variance (or “population variance,” VARP( ) in Excel):
$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$ = “mean square deviation”
Standard deviation (or “population standard deviation,” STDEVP( ) in Excel):
$\sigma = \sqrt{\sigma^2}$ = “root-mean-square deviation”
(“standard” often means root-mean-square in statistics,
so “standard deviation” is a formula as well as a name)
Sample variance (VAR( ) in Excel):
$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$
Sample standard deviation (STDEV( ) in Excel):
$s = \sqrt{s^2}$
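The only difference between the population and sample formulas is the divisor ($n$ versus $n - 1$). A minimal sketch, with illustrative function names and data; Python's standard library offers the same quantities as `statistics.pvariance` and `statistics.variance`.

```python
import math

def population_variance(values):
    """Mean square deviation: divide by n."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)

def sample_variance(values):
    """Same sum of squared deviations, divided by n - 1."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / (len(values) - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample, mean 5
sigma2 = population_variance(data)
s2 = sample_variance(data)
print(sigma2, math.sqrt(sigma2))  # variance and standard deviation
print(s2, math.sqrt(s2))          # sample variance and sample standard deviation
```

Here the squared deviations sum to 32, so the population variance is 32/8 and the sample variance is 32/7.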
Range: $x_{\max} - x_{\min}$
(largest value minus smallest)
(not really standard usage; some use the word
“range” to refer to the pair $x_{\min}$, $x_{\max}$.)
Interquartile range:
$Q_3 - Q_1 = x_{0.75} - x_{0.25}$
(not well defined if $n$ is a multiple of 4)
Mean absolute deviation:
mean absolute deviation = $\frac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}|$
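A brief sketch of two of the simpler dispersion measures, the range (as a single number) and the mean absolute deviation about the mean. Function names and data are assumptions for illustration.

```python
def value_range(values):
    """Largest value minus smallest."""
    return max(values) - min(values)

def mean_absolute_deviation(values):
    """Average absolute distance from the mean."""
    mean = sum(values) / len(values)
    return sum(abs(x - mean) for x in values) / len(values)

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample, mean 5
print(value_range(data))
print(mean_absolute_deviation(data))
```

For this sample the absolute deviations sum to 12, giving 12/8 = 1.5, which is smaller than the root-mean-square deviation of 2 for the same data.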
Moments
The “k-th moment of the $x_i$’s about zero” is the average value of $x_i^k$:
$M_k = \frac{1}{n}\sum_{i=1}^{n} x_i^k$ for $k = 1, 2, \ldots$
(the 0-th moment would always be 1)
The “k-th moment about the mean” is the average value of $(x_i - \bar{x})^k$:
$m_k = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^k$ for $k = 2, 3, \ldots$
So:
Mean = 1st moment about 0;
Variance = 2nd moment about mean.
Skewness: $m_3 / \sigma^3$
Kurtosis: $m_4 / \sigma^4$