Stat11-20Jan06 - Swarthmore College

Jan. 20-23
Shapes of distributions…
“Statistics” for one quantitative variable…
Mean and median
Percentiles
Standard deviations
Transforming data…
Rescale:
Y = c times X
Recenter:
Y = X plus a
adding variables to each other
other transformations
Shape of a distribution…
Outliers
Unimodal --- Bimodal --- Multimodal
Symmetrical
Skew - right or left?
[Figure: Colleges – Datadesk histogram]
[Figure: GE daily changes ($/share)]
[Figure: NH polls, 1/26/04 - errors]
Population vs. Sample
A statistic is anything that can be computed from data.
STATISTICS of a single quantitative variable
MEAN
MEDIAN
QUARTILES ( Q1, Q3 )
Five-number summary
Boxplots
Interquartile range
PERCENTILES / QUANTILES / FRACTILES
(“quantiles” and “fractiles” are synonyms for “percentiles” for people who don’t
like the implied multiplication by 100)
STANDARD DEVIATION
VARIANCE
Statistics of one variable…
MEAN — Sum of values, divided by n
MEDIAN — Middle value
(when values are ranked, smallest to
largest)
(or, average of two middle values)
Number of Colleges (ranked; read down each column, n = 72)
 1    1    2    6    8   12
 1    1    4    6    8   12
 1    1    5    6    8   12
 1    1    5    6    8   13
 1    1    5    6    8   14
 1    1    5    7    8   14
 1    1    5    7    9   14
 1    1    5    7    9
 1    1    5    7   10
 1    1    5    7   10
 1    1    5    7   10
 1    1    6    7   10
 1    2    6    8   12
[Figure: Colleges – Datadesk histogram]
median: 5
mean: 5.36
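As a check on these two numbers, here is a minimal Python sketch that applies the slide's definitions (sum divided by n; middle value of the ranked list) to the 72 college counts. The list is re-entered by hand from the ranked data, and the function names are only illustrative.

```python
# The 72 ranked "Number of Colleges" values, re-entered from the slide.
colleges = ([1] * 25 + [2] * 2 + [4] + [5] * 9 + [6] * 7 + [7] * 7
            + [8] * 7 + [9] * 2 + [10] * 4 + [12] * 4 + [13] + [14] * 3)


def mean(values):
    """Sum of values, divided by n."""
    return sum(values) / len(values)


def median(values):
    """Middle value of the ranked list, or the average of the two middle values."""
    ranked = sorted(values)
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:
        return ranked[mid]
    return (ranked[mid - 1] + ranked[mid]) / 2


print(len(colleges))             # 72
print(round(mean(colleges), 2))  # 5.36
print(median(colleges))          # 5.0
```

With n = 72 (even), the median is the average of the 36th and 37th ranked values, both of which are 5.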
Salaries (ranked; read down each column, n = 60)
 20000    50000     80000
 30000    50000     80000
 30000    50000     80000
 30000    50000     80000
 30000    50000     85000
 30000    50000     90000
 30000    60000    100000
 30000    60000    100000
 35000    60000    100000
 35000    60000    120000
 40000    60000    125000
 40000    60000    150000
 40000    65000    150000
 40000    70000    150000
 40000    70000    200000
 45000    70000    250000
 45000    72500    400000
 50000    75000    500000
 50000    75000    600000
 50000    75000   1000000
[Figure: salaries]
median: 60,000
mean: 106,875
So, which measure of “center” is best?
All the measures agree (roughly) when the distribution is
symmetrical
Mean has attractive mathematical properties
Also, the mean is related to the total, if that’s what you care about
Median may be more “typical” when the distribution is nonsymmetrical
A measure is “robust” if it works reasonably well under a wide
variety of circumstances
Medians are robust
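One way to see the robustness claim is to change a single extreme value and watch what moves. The sketch below uses the 60 salaries from the earlier slide (re-entered by hand) and a hypothetical variant in which only the top salary is raised from 1,000,000 to 10,000,000; that variant is an illustration, not data from the course.

```python
from statistics import mean, median

# The 60 salaries from the slide, re-entered by hand in ranked order.
salaries = ([20000] + [30000] * 7 + [35000] * 2 + [40000] * 5 + [45000] * 2
            + [50000] * 9 + [60000] * 6 + [65000] + [70000] * 3 + [72500]
            + [75000] * 3 + [80000] * 4 + [85000, 90000] + [100000] * 3
            + [120000, 125000] + [150000] * 3
            + [200000, 250000, 400000, 500000, 600000, 1000000])

# Hypothetical variant: raise only the largest salary.
inflated = salaries[:-1] + [10_000_000]

print(mean(salaries), median(salaries))   # mean 106875, median 60000
print(mean(inflated), median(inflated))   # mean 256875, median 60000
```

The mean more than doubles while the median does not move, which is the sense in which the median is robust here.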
Jan. 23
RMS, Geometric mean
Percentiles, Quartiles (Q1, Q3), BOX PLOTS
Measures of spread:
IQR (range containing middle half)
Standard deviation ( σ , s )
Variance
Transforming data…
Rescale:
Y = c times X
Recenter:
Y = X plus a
adding variables to each other
other transformations
“STANDARDIZING” a variable
NORMAL DISTRIBUTIONS
Computing percentiles
To calculate 20-th percentile:
Rank the values from smallest to largest
Compute 20% of n…
20% of 72 = 14.4
Count off that many values (from lowest)…
The value at which you stop is the 20-th percentile.
What if you stop between values ?
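The count-off recipe can be sketched in Python as follows. The slide leaves the "between values" case open; this sketch assumes one common convention, rounding the count up to the next whole number, which is an assumption and not necessarily the rule used in the course. The data are the 72 college counts, re-entered by hand.

```python
import math

# The 72 ranked "Number of Colleges" values, re-entered from the slide.
colleges = ([1] * 25 + [2] * 2 + [4] + [5] * 9 + [6] * 7 + [7] * 7
            + [8] * 7 + [9] * 2 + [10] * 4 + [12] * 4 + [13] + [14] * 3)


def percentile(values, p):
    """p-th percentile by counting off p% of n ranked values.

    Assumption (not from the slides): a fractional count is rounded up.
    """
    ranked = sorted(values)                # rank smallest to largest
    count = p * len(ranked) / 100          # e.g. 20% of 72 = 14.4
    position = max(1, math.ceil(count))    # 14.4 -> stop at the 15th value
    return ranked[position - 1]


print(percentile(colleges, 20))  # 1
```

Other conventions (for example, interpolating between the 14th and 15th values) exist; for these data they give the same answer, because the first 25 ranked values are all 1.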
Number of Colleges: the same 72 ranked values listed earlier under “Number of Colleges (ranked)”.
QUARTILES
Lower quartile (Q1) = 25-th percentile
Upper quartile (Q3) = 75-th percentile
( What’s Q2 ? )
INTERQUARTILE RANGE ( IQR ) = Q3 minus Q1
Five-number summary
maximum
Q3
median
Q1
minimum
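For the college counts, the quartiles, IQR and five-number summary can be sketched as below. The `percentile` helper uses the count-off recipe with fractional counts rounded up, which is an assumed convention; software packages often interpolate instead and may report slightly different quartiles on other data.

```python
import math

# The 72 ranked "Number of Colleges" values, re-entered from the slide.
colleges = ([1] * 25 + [2] * 2 + [4] + [5] * 9 + [6] * 7 + [7] * 7
            + [8] * 7 + [9] * 2 + [10] * 4 + [12] * 4 + [13] + [14] * 3)


def percentile(values, p):
    """Count-off percentile; a fractional count is rounded up (assumed convention)."""
    ranked = sorted(values)
    position = max(1, math.ceil(p * len(ranked) / 100))
    return ranked[position - 1]


q1 = percentile(colleges, 25)  # 18th ranked value
q2 = percentile(colleges, 50)  # 36th ranked value
q3 = percentile(colleges, 75)  # 54th ranked value

five_number = (min(colleges), q1, q2, q3, max(colleges))
print(five_number)  # (1, 1, 5, 8, 14)
print(q3 - q1)      # IQR = 7
```

Here the 50th percentile comes out equal to the median found earlier (5).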
VARIANCE and STANDARD DEVIATION
VARIANCE (s^2):
$$ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} $$
STANDARD DEVIATION (s):
$$ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} $$
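The two definitions translate almost line for line into Python. This is a small sketch on made-up values (the `data` list is illustrative, not from the course), cross-checked against the standard library's `statistics.variance` and `statistics.stdev`, which also divide by n - 1.

```python
import math
from statistics import stdev, variance


def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)


def sample_sd(xs):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(xs))


data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values; mean is 5

print(sample_variance(data))  # 32 / 7, about 4.571
print(sample_sd(data))        # about 2.138
print(math.isclose(sample_variance(data), variance(data)))  # True
print(math.isclose(sample_sd(data), stdev(data)))           # True
```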
Linear Transformations
If you MULTIPLY or DIVIDE a variable by a constant…
Y = c times X
Y=X/c
then…
measures of center are multiplied or divided by c
measures of spread are multiplied or divided by |c|
If you ADD or SUBTRACT a constant from a variable…
Y=X+a
Y=X–a
then…
measures of center are increased (decreased) by a
measures of spread are UNCHANGED.
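These rules are easy to check numerically. The sketch below uses made-up values and arbitrary constants (c = -3, a = 10, both chosen only for illustration); a negative c shows why spread scales by |c| rather than c.

```python
import math
from statistics import mean, median, stdev

x = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values
c, a = -3, 10                 # illustrative constants

rescaled = [c * v for v in x]   # Y = c times X
recentered = [v + a for v in x] # Y = X + a

# Rescaling: center is multiplied by c, spread by |c|.
print(mean(rescaled), c * mean(x))            # both -15
print(median(rescaled), c * median(x))        # both -13.5
print(math.isclose(stdev(rescaled), abs(c) * stdev(x)))  # True

# Recentering: center shifts by a, spread is unchanged.
print(mean(recentered), mean(x) + a)          # both 15
print(math.isclose(stdev(recentered), stdev(x)))         # True
```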
More transformations
ADDING VARIABLES:
W = X + Y
Mean (W) = Mean (X) + Mean (Y)
Standard Deviation of (W) — anything can happen
OTHER TRANSFORMATIONS:
Y = X squared ?
Y = log (X) ?
…NO RELIABLE RULES for mean
or std. dev.
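A small made-up example illustrates both points. Two Y variables with the same mean and standard deviation are added to the same X: the means add in both cases, but the standard deviation of the sum doubles in one case and collapses to zero in the other. Squaring shows that a nonlinear transformation does not carry the mean along in any simple way.

```python
import math
from statistics import mean, stdev

x = [1, 2, 3, 4, 5]            # made-up values
y_same = [1, 2, 3, 4, 5]       # moves together with x
y_reversed = [5, 4, 3, 2, 1]   # same mean and SD, but moves against x

w_same = [xi + yi for xi, yi in zip(x, y_same)]
w_reversed = [xi + yi for xi, yi in zip(x, y_reversed)]

# Means always add.
print(mean(w_same), mean(w_reversed), mean(x) + mean(y_same))  # 6 6 6

# Standard deviations do not follow a single rule.
print(stdev(x), stdev(y_same), stdev(y_reversed))  # all about 1.581
print(stdev(w_same))      # about 3.162 (twice the SD of x)
print(stdev(w_reversed))  # 0.0 (every sum equals 6)

# Nonlinear transformation: mean of squares is not the square of the mean.
squares = [xi ** 2 for xi in x]
print(mean(squares), mean(x) ** 2)  # 11 vs 9
```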
Standardized Variables
Write $\bar{x}$ and S for the mean and standard deviation of X.
Then form the transformed variable:
Z = (X - $\bar{x}$) / S
Then…
mean (Z) = 0
std dev (Z) = 1
Z answers the question: How many standard deviations is this value
above (or below) the mean?
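Standardizing can be sketched in a few lines of Python. The values below are made up for illustration; the check at the end confirms that the standardized variable has mean 0 and standard deviation 1 up to floating-point rounding.

```python
import math
from statistics import mean, stdev

x = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values

x_bar = mean(x)   # 5
s = stdev(x)      # about 2.138

# Z = (X - x_bar) / S, computed value by value.
z = [(v - x_bar) / s for v in x]

print([round(v, 2) for v in z])
print(math.isclose(mean(z), 0, abs_tol=1e-12))  # True
print(math.isclose(stdev(z), 1))                # True
```

For instance, the value 9 is (9 - 5) / 2.138, or about 1.87 standard deviations above the mean.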
Jan. 25
More on transforming and standardizing variables
More on normal distributions
Jan. 27++
Relations among variables --scatterplots
“independent” variables
correlations
linear regressions (best fit lines)
Normal Density Function


X ~ N(μ, σ)
μ = mean, σ = std. dev.
(Why Greek? Why not x-bar, s?)
Trying the integral
Standard normal: mean = 0, std. dev. = 1
Density curve:
$$ f(x) = \frac{1}{\sqrt{2\pi}} \, e^{-\frac{1}{2} x^2} $$
…so the area between a and b is:
$$ \frac{1}{\sqrt{2\pi}} \int_a^b e^{-\frac{1}{2} x^2} \, dx $$
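This integral has no antiderivative in terms of elementary functions, so in practice the area is computed numerically. The sketch below compares a simple midpoint-rule sum with the value obtained from the error function `math.erf`; the function names and the step count are illustrative choices.

```python
import math


def density(x):
    """Standard normal density: exp(-x^2 / 2) / sqrt(2 * pi)."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)


def area_numeric(a, b, steps=10_000):
    """Midpoint-rule approximation of the area under the density from a to b."""
    width = (b - a) / steps
    return sum(density(a + (i + 0.5) * width) for i in range(steps)) * width


def area_erf(a, b):
    """Same area via the error function: Phi(x) = (1 + erf(x / sqrt(2))) / 2."""
    def phi(x):
        return 0.5 * (1 + math.erf(x / math.sqrt(2)))
    return phi(b) - phi(a)


print(round(area_numeric(-1, 1), 4))  # 0.6827
print(round(area_erf(-1, 1), 4))      # 0.6827
```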
The core computation
If X ~ N(μ, σ), what fraction of values are between a and b?
Rule of 68 – 95 – 99.7
Standardizing
Tables and computers
Reversing the calculation
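On the "computers" side, Python's standard library provides `statistics.NormalDist` (Python 3.8+), which can reproduce the 68 – 95 – 99.7 rule and also run the calculation in reverse. This is only a sketch of one way to do it; a printed normal table gives the same numbers.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Fraction of values within k standard deviations of the mean.
within = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}
for k, frac in within.items():
    print(k, round(frac, 4))  # 1 0.6827, 2 0.9545, 3 0.9973

# Reversing the calculation: which z has 97.5% of values below it?
z_975 = std_normal.inv_cdf(0.975)
print(round(z_975, 2))  # 1.96
```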
Standardizing
Same Question:
Is X between a and b ?
Is (X-μ)/σ between (a-μ)/σ and (b-μ)/σ ?
But Z = (X-μ)/σ is a variable with a standard normal
distribution (mean 0, standard deviation 1).
So, if we can answer this question for standard normals,
we can answer it for all normals.
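The standardizing argument can be sketched as a short function: convert a and b to z-values, then ask the question of the standard normal. The numbers in the example (mean 100, standard deviation 15, limits 85 and 130) are hypothetical and chosen only for illustration.

```python
import math
from statistics import NormalDist


def fraction_between(a, b, mu, sigma):
    """Fraction of a N(mu, sigma) variable between a and b, via standardizing."""
    z_a = (a - mu) / sigma  # how many standard deviations a is from the mean
    z_b = (b - mu) / sigma  # likewise for b
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return std_normal.cdf(z_b) - std_normal.cdf(z_a)


# Hypothetical example: mu = 100, sigma = 15, so a = 85 is z = -1 and b = 130 is z = 2.
answer = fraction_between(85, 130, mu=100, sigma=15)
print(round(answer, 4))  # 0.8186

# Cross-check against the unstandardized distribution.
direct = NormalDist(100, 15).cdf(130) - NormalDist(100, 15).cdf(85)
print(math.isclose(answer, direct))  # True
```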