NUMERICALLY SUMMARIZING DATA

NOTATION







N = SIZE OF POPULATION
n = SIZE OF SAMPLE
µ = MEAN OF POPULATION
X̄ = MEAN OF SAMPLE
Σ = SUM OF INDIVIDUAL VALUES
σ = POPULATION STANDARD DEVIATION
s = SAMPLE STANDARD DEVIATION
MEASURES OF
CENTRAL TENDENCY (3.1)

MEAN (or average) of a POPULATION:
\[ \mu = \frac{\sum_{i=1}^{N} X_i}{N} \]

MEAN (or average) of a SAMPLE:
\[ \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \]
MEASURES OF
CENTRAL TENDENCY
5.3  5.5  5.6  5.7  5.7
5.8  5.9  6.2  6.3  6.3
6.4  6.6  6.6  6.7  6.8
7.1  7.1  7.3  7.6  7.9

\[ \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{128.4}{20} = 6.42 \]
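As a quick check of the arithmetic above, here is a minimal Python sketch that reproduces the sample mean. The variable names are illustrative and do not come from the slides.

```python
# The n = 20 sample values shown on the slide.
data = [5.3, 5.5, 5.6, 5.7, 5.7, 5.8, 5.9, 6.2, 6.3, 6.3,
        6.4, 6.6, 6.6, 6.7, 6.8, 7.1, 7.1, 7.3, 7.6, 7.9]

n = len(data)            # sample size
total = sum(data)        # sum of the individual values
sample_mean = total / n  # X-bar = (sum of X_i) / n

print(f"n = {n}, sum = {total:.1f}, mean = {sample_mean:.2f}")
```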
MEASURES OF
CENTRAL TENDENCY
The Sample Mean is ONLY an estimate of the (real) Population Mean.

To know the (real) Population Mean, all
the individuals of the population must be
used in the calculation.
MEASURES OF
CENTRAL TENDENCY
LAW OF LARGE NUMBERS:
AS n → N THEN X̄ → µ
In other words, as the sample size gets closer to the population size, the sample mean tends to get closer to the real population mean.
MEASURES OF
CENTRAL TENDENCY


TRIM MEAN: Remove the minimum value
and the maximum value and then find the
sample mean.
Used to remove possible outliers and find
a more reasonable mean.
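The trim mean described above can be sketched in a few lines of Python. The function name `trim_mean_min_max` and the small list containing an outlier are hypothetical, chosen only for illustration. The 20-value list is the sample used elsewhere in these slides.

```python
def trim_mean_min_max(values):
    """Mean after dropping ONE minimum and ONE maximum value."""
    if len(values) < 3:
        raise ValueError("need at least three values to trim both ends")
    ordered = sorted(values)
    trimmed = ordered[1:-1]          # drop the smallest and the largest
    return sum(trimmed) / len(trimmed)

data = [5.3, 5.5, 5.6, 5.7, 5.7, 5.8, 5.9, 6.2, 6.3, 6.3,
        6.4, 6.6, 6.6, 6.7, 6.8, 7.1, 7.1, 7.3, 7.6, 7.9]
print(f"trim mean of the sample: {trim_mean_min_max(data):.2f}")

# Hypothetical data with one extreme value, to show the effect of trimming.
with_outlier = [5.3, 5.5, 5.6, 5.7, 25.0]
print(f"plain mean: {sum(with_outlier) / len(with_outlier):.2f}")
print(f"trim mean:  {trim_mean_min_max(with_outlier):.2f}")
```

In the hypothetical list, the single value 25.0 pulls the plain mean well above the other four values, while the trim mean stays near them.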
MEASURES OF
CENTRAL TENDENCY

MEDIAN: Middle value (if n is odd) or the
average of the two middle values (if n is
even).
5.3  5.5  5.6  5.7  5.7
5.8  5.9  6.2  6.3  6.3
6.4  6.6  6.6  6.7  6.8
7.1  7.1  7.3  7.6  7.9
MEDIAN=Average of 6.3 & 6.4 = 6.35
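The odd/even rule for the median can be written out directly. This is a minimal sketch, with the standard-library `statistics.median` used only as a cross-check. The helper name `median` is illustrative.

```python
import statistics

def median(values):
    """Middle value if n is odd, average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

data = [5.3, 5.5, 5.6, 5.7, 5.7, 5.8, 5.9, 6.2, 6.3, 6.3,
        6.4, 6.6, 6.6, 6.7, 6.8, 7.1, 7.1, 7.3, 7.6, 7.9]

print(f"median = {median(data):.2f}")
print(f"library check = {statistics.median(data):.2f}")
```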
MEASURES OF
CENTRAL TENDENCY

MODE: The most frequent value(s).
Could be none or several.
5.3  5.5  5.6  5.7  5.7
5.8  5.9  6.2  6.3  6.3
6.4  6.6  6.6  6.7  6.8
7.1  7.1  7.3  7.6  7.9
MODES: 5.7, 6.3, 6.6, 7.1
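A short sketch of finding the mode(s) by counting frequencies. The helper name `modes` is illustrative. Returning an empty list when no value repeats is an assumption made here to match the "could be none" wording. Library functions such as `statistics.multimode` use a different convention in that case.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s), or [] if nothing repeats."""
    if not values:
        return []
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # assumed convention: no repeated value means no mode
    return sorted(v for v, c in counts.items() if c == top)

data = [5.3, 5.5, 5.6, 5.7, 5.7, 5.8, 5.9, 6.2, 6.3, 6.3,
        6.4, 6.6, 6.6, 6.7, 6.8, 7.1, 7.1, 7.3, 7.6, 7.9]

print(modes(data))       # several modes: each appears twice
print(modes([1, 2, 3]))  # no repeated value
```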
MEASURES OF
CENTRAL TENDENCY
[Histogram: SKEWED RIGHT distribution. Vertical axis: FREQUENCY (0 to 20). Horizontal axis: VALUES (0.0 to 1.0).]
MODE = 0.2
MEDIAN = 0.3
MEAN = 0.39
MEASURES OF DISPERSION
(3.2)

RANGE = MAXIMUM – MINIMUM
5.3  5.5  5.6  5.7  5.7
5.8  5.9  6.2  6.3  6.3
6.4  6.6  6.6  6.7  6.8
7.1  7.1  7.3  7.6  7.9
RANGE = 7.9 – 5.3 = 2.6
MEASURES OF DISPERSION


Need a better way. Two distributions can
have the same range, but they can have
significantly different dispersions.
Example using Dot Plots:
MEASURES OF DISPERSION


Using the average distance from the mean
for all data points would be better.
Need to find the mean and then find each
difference x − x̄. Then add them all
together and divide by the number of
points. But there are a few problems.
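One such problem, presumably among those the slide has in mind, is that the signed differences from the mean always cancel out. The sketch below uses the 20-value sample from earlier slides. The variable names are illustrative.

```python
data = [5.3, 5.5, 5.6, 5.7, 5.7, 5.8, 5.9, 6.2, 6.3, 6.3,
        6.4, 6.6, 6.6, 6.7, 6.8, 7.1, 7.1, 7.3, 7.6, 7.9]

mean = sum(data) / len(data)
deviations = [x - mean for x in data]

# Positive and negative deviations cancel, so their sum is zero
# (up to floating-point rounding), whatever the data are.
sum_of_deviations = sum(deviations)
print(f"|sum of deviations| = {abs(sum_of_deviations):.6f}")

# Squaring each deviation removes the sign, so the total no longer cancels.
sum_of_squares = sum(d ** 2 for d in deviations)
print(f"sum of squared deviations = {sum_of_squares:.3f}")
```

This is why the standard deviation is built from squared differences rather than from the raw differences.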








MEASURES OF DISPERSION



STANDARD DEVIATION can be thought of
as the average distance of the values from
the mean.

Population:
\[ \sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}} \]

Sample:
\[ s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}} \]
MEASURES OF DISPERSION
X      X - MEAN               (X - MEAN)^2
5.3    5.3 - 6.42 = -1.12     1.25
5.5    5.5 - 6.42 = -0.92     0.85
*      *                      *
*      *                      *
*      *                      *
7.3    7.3 - 6.42 =  0.88     0.77
7.6    7.6 - 6.42 =  1.18     1.39
7.9    7.9 - 6.42 =  1.48     2.19

SUM(X-MEAN)^2 /(n-1) = 0.53
SQRT[SUM(X-MEAN)^2 /(n-1)] = 0.73
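The table's steps (subtract the mean, square, sum, divide by n − 1, take the square root) can be reproduced with a short Python sketch. The variable names are illustrative. `statistics.stdev` is used only as a cross-check of the sample formula.

```python
import math
import statistics

data = [5.3, 5.5, 5.6, 5.7, 5.7, 5.8, 5.9, 6.2, 6.3, 6.3,
        6.4, 6.6, 6.6, 6.7, 6.8, 7.1, 7.1, 7.3, 7.6, 7.9]

n = len(data)
mean = sum(data) / n

# One table row per value: (X, X - MEAN, (X - MEAN)^2)
rows = [(x, x - mean, (x - mean) ** 2) for x in data]
for x, dev, sq in rows[:2] + rows[-3:]:
    print(f"{x:4.1f}  {dev:6.2f}  {sq:5.2f}")

sample_variance = sum(sq for _, _, sq in rows) / (n - 1)
sample_sd = math.sqrt(sample_variance)
print(f"variance = {sample_variance:.2f}, std dev = {sample_sd:.2f}")
print(f"library check = {statistics.stdev(data):.2f}")
```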
MEASURES OF DISPERSION

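ALTERNATIVE FORMULAS:

\[ \sigma = \sqrt{\frac{\sum_{i=1}^{N} X_i^2 - \frac{\left(\sum_{i=1}^{N} X_i\right)^2}{N}}{N}} \]

\[ s = \sqrt{\frac{\sum_{i=1}^{n} X_i^2 - \frac{\left(\sum_{i=1}^{n} X_i\right)^2}{n}}{n - 1}} = \sqrt{\frac{834.44 - \frac{(128.4)^2}{20}}{19}} \approx 0.73 \]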
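The alternative (computational) formula needs only two running totals, the sum of the values and the sum of their squares. A minimal sketch, with illustrative variable names, confirms that it gives the same sample standard deviation as the definition.

```python
import math

data = [5.3, 5.5, 5.6, 5.7, 5.7, 5.8, 5.9, 6.2, 6.3, 6.3,
        6.4, 6.6, 6.6, 6.7, 6.8, 7.1, 7.1, 7.3, 7.6, 7.9]

n = len(data)
sum_x = sum(data)                    # sum of X_i
sum_x2 = sum(x ** 2 for x in data)   # sum of X_i squared

# s = sqrt( (sum of X^2 - (sum of X)^2 / n) / (n - 1) )
s_shortcut = math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))

# Same quantity from the definition, for comparison.
mean = sum_x / n
s_definition = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))

print(f"sum X = {sum_x:.1f}, sum X^2 = {sum_x2:.2f}")
print(f"s (shortcut) = {s_shortcut:.2f}, s (definition) = {s_definition:.2f}")
```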
MEASURES OF DISPERSION

VARIANCE: The square of the Standard
Deviation.

Population VARIANCE = σ²

Sample VARIANCE = s²
SYMBOL SUMMARY
ITEM                  POPULATION PARAMETER    SAMPLE STATISTIC
SIZE                  N                       n
MEAN                  µ                       x̄
STANDARD DEVIATION    σ                       s
VARIANCE              σ²                      s²
USING THE CALCULATOR

STAT → EDIT: Enter Data Into L1
STAT → CALC → 1: 1-Var Stats ENTER
2nd “1” (L1) ENTER

EXAMPLE


EMPIRICAL RULE

FOR A NORMALLY DISTRIBUTED POPULATION:
Approximately 68% of the population are
within the range of µ ± 1σ
Approximately 95% of the population are
within the range of µ ± 2σ
Approximately 99.7% of the population are
within the range of µ ± 3σ
EMPIRICAL RULE
EXAMPLE: Let Mean = 25 and
Std. Dev. = 5.
Find the % of population:







Greater than 25
Between 20 and 30
Between 15 and 25
Between 20 and 35
Less than 15
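One way to work these five questions is to convert each boundary to a whole number of standard deviations from the mean and combine the 68% / 95% / 99.7% figures using symmetry. The sketch below does exactly that. The function names are illustrative, and it only handles boundaries that fall 0, 1, 2, or 3 standard deviations from the mean.

```python
MEAN, SD = 25, 5
# Percent of a normal population within k standard deviations of the mean.
WITHIN = {1: 68.0, 2: 95.0, 3: 99.7}

def percent_below(x):
    """Approximate percent of the population below x, using only the Empirical Rule."""
    z = (x - MEAN) / SD
    if z != int(z) or abs(z) > 3:
        raise ValueError("x must be 0, 1, 2, or 3 standard deviations from the mean")
    k = abs(int(z))
    half = WITHIN[k] / 2 if k else 0.0   # by symmetry, half lies on each side
    return 50.0 + half if z > 0 else 50.0 - half

def percent_between(low, high):
    return percent_below(high) - percent_below(low)

answers = {
    "Greater than 25": 100.0 - percent_below(25),
    "Between 20 and 30": percent_between(20, 30),
    "Between 15 and 25": percent_between(15, 25),
    "Between 20 and 35": percent_between(20, 35),
    "Less than 15": percent_below(15),
}
for question, pct in answers.items():
    print(f"{question}: about {pct:.1f}%")
```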
CHEBYSHEV’S INEQUALITY


FOR ANY DISTRIBUTION:
THE PERCENT OF THE POPULATION WITHIN +/- K
STANDARD DEVIATIONS OF THE MEAN (FOR K > 1) IS AT LEAST
\[ \left(1 - \frac{1}{K^2}\right) \times 100\% \]
EXAMPLE: IF K=2.5 THEN AT LEAST 84% OF THE POPULATION IS WITHIN +/- 2.5
STANDARD DEVIATIONS OF THE MEAN
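The Chebyshev bound is a one-line calculation. In this sketch the function name `chebyshev_min_percent` is illustrative, and the result is a guaranteed minimum, not an exact percentage.

```python
def chebyshev_min_percent(k):
    """Minimum percent of ANY distribution within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return (1 - 1 / k ** 2) * 100

for k in (1.5, 2, 2.5, 3):
    print(f"K = {k}: at least {chebyshev_min_percent(k):.1f}%")
```

For K = 2 the bound is 75%, noticeably lower than the roughly 95% that holds for a normal distribution, because the bound must hold for every possible shape.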
MEASURES OF CENTRAL
TENDENCY AND DISPERSION
OF GROUPED DATA (3.3)
MEAN AND STANDARD DEVIATION FROM
FREQUENCY DISTRIBUTIONS

MEAN:
\[ \mu = \frac{\sum X_i f_i}{\sum f_i} \]

STANDARD DEVIATION:
\[ \sigma = \sqrt{\frac{\sum (X_i - \mu)^2 f_i}{\sum f_i}} = \sqrt{\frac{\sum X_i^2 f_i - \frac{\left(\sum X_i f_i\right)^2}{\sum f_i}}{\sum f_i}} \]
IF FROM CONTINUOUS FREQUENCY DISTRIBUTION, USE
THE MIDPOINT FROM EACH CLASS.
TO DO ON CALCULATOR, ENTER TABLE IN L1 & L2. THEN DO
STAT → CALC → 1: 1-Var Stats ENTER L1,L2
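The grouped-data formulas can be sketched as follows. The class midpoints and frequencies below are hypothetical, invented purely for illustration and not taken from the slides. The code treats the table as a population, using the sum of the frequencies as the divisor.

```python
import math

# Hypothetical frequency distribution: class midpoints and their frequencies.
midpoints = [5, 15, 25, 35]
freqs = [2, 5, 8, 5]

total_f = sum(freqs)                                    # sum of f_i
sum_xf = sum(x * f for x, f in zip(midpoints, freqs))   # sum of X_i * f_i
grouped_mean = sum_xf / total_f

# Definitional form: sum of (X_i - mean)^2 * f_i, divided by the sum of f_i.
sigma_def = math.sqrt(
    sum((x - grouped_mean) ** 2 * f for x, f in zip(midpoints, freqs)) / total_f
)

# Computational form: sum of X_i^2 * f_i minus (sum of X_i * f_i)^2 / sum of f_i.
sum_x2f = sum(x ** 2 * f for x, f in zip(midpoints, freqs))
sigma_alt = math.sqrt((sum_x2f - sum_xf ** 2 / total_f) / total_f)

print(f"mean = {grouped_mean:.2f}")
print(f"sigma (definition) = {sigma_def:.3f}, sigma (shortcut) = {sigma_alt:.3f}")
```

Both forms give the same value, which is a useful check when working such a table by hand.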
WEIGHTED AVERAGE

Calculated like mean for frequency distribution.
\[ \bar{X} = \frac{\sum X_i f_i}{\sum f_i} \]
Example using Grade Point Average
GRADE    GRADE VALUE (x)    CREDIT HOURS (f)    x*f
C        2                  3                   6
B        3                  4                   12
A        4                  3                   12
A        4                  2                   8
B        3                  4                   12
TOTAL                       16                  50

\[ \bar{X} = \frac{50}{16} \approx 3.1 \]
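The Grade Point Average example translates directly into code. This is a minimal sketch with illustrative variable names. The grade values and credit hours are the ones in the table.

```python
# (grade, grade value x, credit hours f) for each course in the table.
courses = [("C", 2, 3), ("B", 3, 4), ("A", 4, 3), ("A", 4, 2), ("B", 3, 4)]

total_hours = sum(f for _, _, f in courses)        # sum of f
total_points = sum(x * f for _, x, f in courses)   # sum of x * f
gpa = total_points / total_hours

print(f"credit hours = {total_hours}, grade points = {total_points}")
print(f"GPA = {gpa:.3f}")
```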
WEIGHTED AVERAGE

Calculating Grades in a Course:
Labs Worth 10%
Homework Worth 8%
Tests Worth 60%
Final Worth 22%
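A course grade under these weights is another weighted average. In the sketch below the weights come from the list above, while the category scores are hypothetical and chosen only to show the calculation.

```python
# Weights (in percent) from the course breakdown.
weights = {"Labs": 10, "Homework": 8, "Tests": 60, "Final": 22}

# Hypothetical category averages for one student (out of 100).
scores = {"Labs": 90, "Homework": 85, "Tests": 78, "Final": 82}

total_weight = sum(weights.values())
course_grade = sum(weights[c] * scores[c] for c in weights) / total_weight

print(f"weights sum to {total_weight}%")
print(f"course grade = {course_grade:.2f}")
```

Because Tests carry 60% of the weight, the result sits much closer to the test average than to the lab or homework averages.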
MEASURES OF RELATIVE
POSITION (3.4)

Defined as where a data point is (on a
number line) relative to the other data
points in the distribution.
MEASURES OF RELATIVE
POSITION
Z-SCORE:
How far a data point is from the
Mean in terms of Std. Dev.’s
\[ Z = \frac{X - \mu}{\sigma} \quad \text{(population)} \qquad Z = \frac{X - \bar{X}}{s} \quad \text{(sample)} \]
MEASURES OF RELATIVE
POSITION





Used to compare relative position of data in two
separate groups.
“A” has a score of 78 in a class with a mean of
84 and a std. dev. of 6.
“B” has a score of 86 in a class with a mean of
90 and a std. dev of 3.
Who did better relative to their class?
How would you compare baseball pitchers?
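The comparison of "A" and "B" can be worked by computing each Z-score. This is a minimal sketch. The function name `z_score` is illustrative.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

z_a = z_score(78, mean=84, sd=6)
z_b = z_score(86, mean=90, sd=3)

print(f"A: z = {z_a:.2f}")
print(f"B: z = {z_b:.2f}")

# The larger Z-score marks the better relative position.
better = "A" if z_a > z_b else "B"
print(f"{better} did better relative to their class")
```

Both students scored below their class mean, but "A" is one standard deviation below while "B" is about 1.33 below, so "A" did better relative to their class even though "B" has the higher raw score.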
MEASURES OF RELATIVE
POSITION



Percentile: The k-th percentile, Pk, is the value for which k% of the
data set is ≤ Pk.
For instance, if P18=7.6, then 18% of the
sample or population is less than or equal
to 7.6 and 82% are greater than 7.6.
If your MATH SAT score was in the 92nd
percentile, then 92% of the population
had a score less than OR equal to yours.
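The "less than or equal to" reading of a percentile can be checked by counting. The sketch below uses the 20-value sample from elsewhere in these slides. The function name `percent_at_or_below` is illustrative.

```python
def percent_at_or_below(values, x):
    """Percent of the data set that is less than or equal to x."""
    count = sum(1 for v in values if v <= x)
    return 100 * count / len(values)

data = [5.3, 5.5, 5.6, 5.7, 5.7, 5.8, 5.9, 6.2, 6.3, 6.3,
        6.4, 6.6, 6.6, 6.7, 6.8, 7.1, 7.1, 7.3, 7.6, 7.9]

print(f"{percent_at_or_below(data, 6.3):.0f}% of the sample is <= 6.3")
print(f"{percent_at_or_below(data, 7.6):.0f}% of the sample is <= 7.6")
```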
MEASURES OF RELATIVE
POSITION

Three important percentiles:

P25 = Q1: 25% of the data ≤ Q1

P50 = Q2: 50% of the data ≤ Q2 (median)

P75 = Q3: 75% of the data ≤ Q3
MEASURES OF RELATIVE
POSITION
5.3  5.5  5.6  5.7  5.7
5.8  5.9  6.2  6.3  6.3
6.4  6.6  6.6  6.7  6.8
7.1  7.1  7.3  7.6  7.9
Using Calculator:
STAT → EDIT: Enter Data Into L1
STAT → CALC → 1: 1-Var Stats ENTER
2nd “1” (L1) ENTER

Five Number Summary & Box Plots
(3.5)
FIVE NUMBER SUMMARY
MIN, Q1, MEDIAN, Q3, MAX
BOX PLOT
Good for comparing distributions
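A five-number summary can be sketched as below. Quartile conventions differ between textbooks and software. This sketch assumes the "median of each half" convention, with the middle value left out of both halves when n is odd, so other tools may report slightly different Q1 and Q3. The function names are illustrative.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(values):
    """Return (MIN, Q1, MEDIAN, Q3, MAX) using the median-of-halves convention."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]

data = [5.3, 5.5, 5.6, 5.7, 5.7, 5.8, 5.9, 6.2, 6.3, 6.3,
        6.4, 6.6, 6.6, 6.7, 6.8, 7.1, 7.1, 7.3, 7.6, 7.9]

summary = five_number_summary(data)
print("MIN={:.2f}  Q1={:.2f}  MEDIAN={:.2f}  Q3={:.2f}  MAX={:.2f}".format(*summary))
```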
UNUSUAL VALUES





Inter Quartile Range: IQR = Q3 – Q1.
Any Value Less Than Q1 − 1.5(IQR)
Is Considered Unusual (called lower fence).
Any Value Greater Than Q3 + 1.5(IQR)
Is Considered Unusual (called upper fence).
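The fence rule can be sketched as follows. The quartiles Q1 = 5.75 and Q3 = 6.95 are not stated in the slides. They are what the median-of-halves convention gives for the 20-value sample and are used here as an assumption. The function name `fences` is illustrative.

```python
def fences(q1, q3):
    """Return (lower fence, upper fence) = (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [5.3, 5.5, 5.6, 5.7, 5.7, 5.8, 5.9, 6.2, 6.3, 6.3,
        6.4, 6.6, 6.6, 6.7, 6.8, 7.1, 7.1, 7.3, 7.6, 7.9]

Q1, Q3 = 5.75, 6.95   # assumed quartiles (median-of-halves convention)
lower, upper = fences(Q1, Q3)
unusual = [x for x in data if x < lower or x > upper]

print(f"IQR = {Q3 - Q1:.2f}, lower fence = {lower:.2f}, upper fence = {upper:.2f}")
print(f"unusual values: {unusual}")
```

Under these assumed quartiles every value in the sample lies between the fences, so none is flagged as unusual by this rule.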
OTHER DETERMINATIONS OF
UNUSUAL VALUES



If the Z-Score is less than −2 or greater
than +2, the value that corresponds to that
Z-Score is unusual. Recall
\[ Z = \frac{x - \bar{x}}{s} \quad \text{or} \quad Z = \frac{x - \mu}{\sigma} \]
Another way to say this is that a value
outside the boundaries created by
x̄ ± 2s or µ ± 2σ
is considered unusual.
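Applied to the 20-value sample from earlier slides, the ±2 rule can be sketched like this. The variable names are illustrative. `statistics.mean` and `statistics.stdev` supply the sample mean and sample standard deviation.

```python
import statistics

data = [5.3, 5.5, 5.6, 5.7, 5.7, 5.8, 5.9, 6.2, 6.3, 6.3,
        6.4, 6.6, 6.6, 6.7, 6.8, 7.1, 7.1, 7.3, 7.6, 7.9]

x_bar = statistics.mean(data)
s = statistics.stdev(data)      # sample standard deviation (n - 1 divisor)

low, high = x_bar - 2 * s, x_bar + 2 * s
unusual = [x for x in data if x < low or x > high]

print(f"boundaries: {low:.2f} to {high:.2f}")
for x in unusual:
    print(f"{x} is unusual (z = {(x - x_bar) / s:.2f})")
```

Here the largest value, 7.9, falls just outside x̄ + 2s, so the Z-score rule flags it. Different rules for "unusual" can disagree on borderline values.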
QUOTES



“Facts are stubborn, but statistics are more pliable.”
Mark Twain
“Statistics are used much like a drunk uses a lamppost:
for support, not illumination.” Vin Scully
“In baseball, my theory is to strive for consistency, not
to worry about the numbers. If you dwell on statistics
you get shortsighted, if you aim for consistency, the
numbers will be there at the end.” Tom Seaver