Chapter 1 (modified)

Chapter 1
Overview and Descriptive Statistics
Copyright (c) 2004 Brooks/Cole, a division of Thomson Learning, Inc.
1.1 Populations, Samples, and Processes
Populations and Samples
A population is a well-defined collection
of objects.
When information is available for the
entire population we have a census. A
subset of the population is a sample.
Data and Observations
Univariate data consists of observations
on a single variable (bivariate: two
variables; multivariate: more than one
variable).
Branches of Statistics
Descriptive Statistics – summary and
description of collected data.
Inferential Statistics – generalizing
from a sample to a population.
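Relationship Between Probability and Inferential Statistics
[Figure: Population and Sample joined by two arrows. Probability reasons from the population to the sample; inferential statistics reasons from the sample to the population.]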
1.2 Pictorial and Tabular Methods in Descriptive Statistics
Stem-and-Leaf Displays
1. Select one or more leading digits for
the stem values. The trailing digits
become the leaves.
2. List stem values in a vertical column.
3. Record the leaf for every observation.
4. Indicate the units for the stem and
leaf on the display.
Stem-and-Leaf Example
Observed values:
9, 10, 15, 22, 9, 15, 16, 24,11
0 | 99
1 | 01556
2 | 24
Stem: tens digit
Leaf: units digit
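The four construction steps can be sketched in a few lines of Python. This is only an illustration, assuming non-negative integer data with the tens digit as stem and the units digit as leaf; the function name `stem_and_leaf` is made up for this example.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the lines of a stem-and-leaf display.

    Assumes non-negative integers: stem = all digits but the last,
    leaf = last digit.
    """
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        leaves[stem].append(leaf)
    lines = []
    for stem in range(min(leaves), max(leaves) + 1):
        # Empty stems are kept so that gaps in the data stay visible.
        row = "".join(str(d) for d in leaves.get(stem, []))
        lines.append(f"{stem} | {row}")
    return lines

observed = [9, 10, 15, 22, 9, 15, 16, 24, 11]
display = stem_and_leaf(observed)
print("\n".join(display))
print("Stem: tens digit")
print("Leaf: units digit")
```

Sorting first means the leaves on each stem come out in ascending order, which makes the typical value and spread easier to read off.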
Stem-and-Leaf Displays
• Identify typical value
• Extent of spread about a value
• Presence of gaps
• Extent of symmetry
• Number and location of peaks
• Presence of outlying values
Another stem-and-leaf example
The decimal point is 1 digit to the right of the |
2 | 55
3 | 05888
4 | 03558888888
5 | 00000000333335555577777
6 | 00033333555588
7 | 0003333335555588
8 | 00000335588
9 | 088
Dotplots
Represent data with dots.
Observed values:
9, 10, 15, 22, 9, 15, 16, 24,11
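[Dotplot: one dot above each observed value on a horizontal axis marked 5, 10, 15, 20, 25; repeated values are stacked.]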
Types of Variables
A variable is discrete if its set of possible
values is finite or can be listed in an infinite
sequence. A variable is continuous if its
set of possible values consists of an entire
interval on the number line.
Histograms
Discrete Data
Determine the frequency and relative
frequency for each value of x. Then
mark possible x values on a horizontal
scale. Above each value, draw a
rectangle whose height is the relative
frequency of that value.
Ex. Students from a small college were asked how
many charge cards they carry. x is the variable
representing the number of cards and the results are
below.
Frequency Distribution

x    # people    Rel. Freq.
0        12        0.08
1        42        0.28
2        57        0.38
3        24        0.16
4         9        0.06
5         4        0.03
6         2        0.01
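As a quick check of the frequency distribution, the relative frequencies can be recomputed from the counts. The sketch below uses the counts from the charge card example; the variable names are chosen only for this illustration.

```python
# Counts from the charge card example: x -> number of people
counts = {0: 12, 1: 42, 2: 57, 3: 24, 4: 9, 5: 4, 6: 2}

total = sum(counts.values())
# Relative frequency = frequency / total number of observations
rel_freq = {x: n / total for x, n in counts.items()}

print(f"{'x':>2} {'# people':>9} {'Rel. Freq.':>11}")
for x, n in counts.items():
    print(f"{x:>2} {n:>9} {rel_freq[x]:>11.2f}")
```

The relative frequencies sum to 1, which is a useful sanity check before drawing the histogram.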
Histograms
Credit card results:
[Figure: histogram of the charge card relative frequencies. Horizontal axis: Number of Cards, x = 0 to 6. Vertical axis: Relative Frequency, 0 to 0.4.]
Histograms
Continuous Data:
Equal Class Widths
Determine the frequency and relative
frequency for each class. Then mark
the class boundaries on a horizontal
measurement axis. Above each class
interval, draw a rectangle whose height
is the relative frequency.
Histogram example
[Figure: Histogram of e1scores. Horizontal axis: e1scores, 20 to 100. Vertical axis: Frequency, 0 to 15.]
Histograms
Continuous Data: Unequal Widths
After determining frequencies and
relative frequencies, calculate the height
of each rectangle using:
\[ \text{rectangle height} = \frac{\text{relative frequency of the class}}{\text{class width}} \]
The resulting heights are called densities
and the vertical scale is the density scale.
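A small sketch of the density calculation follows. The three classes and their frequencies are hypothetical, invented only to show the arithmetic; they do not come from the slides.

```python
# Hypothetical classes with unequal widths: (lower boundary, upper boundary, frequency)
classes = [(0, 10, 5), (10, 20, 10), (20, 50, 5)]

n = sum(freq for _, _, freq in classes)

densities = []
for lower, upper, freq in classes:
    rel_freq = freq / n
    # rectangle height = relative frequency / class width
    densities.append(rel_freq / (upper - lower))

# Each rectangle's area equals the relative frequency of its class.
areas = [d * (upper - lower) for d, (lower, upper, _) in zip(densities, classes)]

for (lower, upper, _), d in zip(classes, densities):
    print(f"[{lower}, {upper}): density = {d:.4f}")
print("total area =", sum(areas))
```

Because each area is a relative frequency, the total area of a density histogram is 1 regardless of how the class widths are chosen.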
Histogram Shapes
symmetric unimodal
positively skewed
bimodal
negatively skewed
Histogram example:
symmetric, slightly bimodal
[Figure: Histogram of e1scores. Horizontal axis: e1scores, 20 to 100. Vertical axis: Frequency, 0 to 15.]
1.3 Measures of Location
The Mean
The average (mean) of the n numbers
$x_1, x_2, \ldots, x_n$ is $\bar{x}$, where
\[ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i \]
Population mean: $\mu$
Median
The sample median, $\tilde{x}$, is the middle
value in a set of data that is arranged in
ascending order. For an even number
of data points, the median is the average
of the middle two.
Population median: $\tilde{\mu}$
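The mean and median definitions can be sketched directly in Python. The data are the nine observed values from the stem-and-leaf example; `sample_median` is a name made up for this illustration, and the standard library's `statistics.median` is used only as a cross-check.

```python
import statistics

def sample_median(values):
    """Middle value of the ordered data; average of the middle two if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

observed = [9, 10, 15, 22, 9, 15, 16, 24, 11]

mean = sum(observed) / len(observed)   # (x_1 + ... + x_n) / n
median = sample_median(observed)       # n = 9 is odd, so the 5th ordered value

print("mean   =", round(mean, 2))
print("median =", median)
print("agrees with statistics.median:", median == statistics.median(observed))
```

For an even number of points, such as `[1, 2, 3, 4]`, the function averages the two middle values.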
Median example
• In a class of 85 exam scores, the median, $\tilde{x}$,
is the 43rd number if the scores are listed in
ascending order. (Note: In this case there are
42 above the median and 42 below the
median.)

Position:  40    41    42    43    44    45    46
Score:     57.5  57.5  60.0  60.0  60.0  62.5  62.5
Three Different Shapes for a Population Distribution
symmetric: $\mu = \tilde{\mu}$
negative skew: $\mu < \tilde{\mu}$
positive skew: $\mu > \tilde{\mu}$
Slight positive skew
[Figure: Histogram of e1scores. Horizontal axis: 20 to 100. Vertical axis: Frequency, 0 to 15.]
Median = 60.0, Mean = 61.4
1.4 Measures of Variability
Sample Variance
Variance is a measure of the spread of the
data.
The sample variance of the sample $x_1, x_2, \ldots, x_n$
of n values of X is given by
\[ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{S_{xx}}{n - 1} \]
We refer to $s^2$ as being based on n - 1 degrees
of freedom.
Sample variance example
• First, find the sample mean:
\[ \bar{x} = 61.35 \]
• Next, add up the squared deviations from the mean:
\[ (62.5 - 61.35)^2 + \cdots + (90.0 - 61.35)^2 = 21{,}531.9 \]
• Divide by n - 1, where n is the number of
observations (in this case, 85):
\[ \frac{21{,}531.9}{84} = 256.3 \]
Standard Deviation
Standard deviation is a measure of the
spread of the data using the same units as
the data.
The sample standard deviation is the
square root of the sample variance:
\[ s = \sqrt{s^2} \]
Standard deviation example
\[ s = \sqrt{s^2} = \sqrt{256.3} = 16.0 \]
Formula for $s^2$
An alternative expression for the
numerator of $s^2$ is
\[ S_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n} \]
Formula for $s^2$: Shortcut example
• First, sum the scores:
\[ \sum_{i=1}^{n} x_i = 5215 \]
• Next, sum the squares:
\[ \sum_{i=1}^{n} x_i^2 = 341{,}487.5 \]
• Numerator of variance equals
\[ 341{,}487.5 - \frac{5215^2}{85} = 21{,}531.9 \]
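The definition and the shortcut give the same numerator, which can be verified numerically. The 85 exam scores are not listed in the slides, so this sketch uses the nine observed values from the stem-and-leaf example instead; variable names are illustrative.

```python
import math
import statistics

observed = [9, 10, 15, 22, 9, 15, 16, 24, 11]
n = len(observed)
mean = sum(observed) / n

# Definition: sum of squared deviations from the mean
sxx_definition = sum((x - mean) ** 2 for x in observed)

# Shortcut: sum of squares minus (sum)^2 / n
sxx_shortcut = sum(x * x for x in observed) - sum(observed) ** 2 / n

s2 = sxx_definition / (n - 1)   # sample variance, n - 1 degrees of freedom
s = math.sqrt(s2)               # sample standard deviation

print("S_xx (definition):", round(sxx_definition, 4))
print("S_xx (shortcut):  ", round(sxx_shortcut, 4))
print("s^2 =", round(s2, 4), " s =", round(s, 4))
```

The shortcut needs only the running totals of $x_i$ and $x_i^2$, which is why it was convenient for hand calculation; with floating-point arithmetic on large values, the definitional form is usually less prone to cancellation error.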
Properties of $s^2$
Let $x_1, x_2, \ldots, x_n$ be any sample and c be
any nonzero constant.
1. If $y_1 = x_1 + c, \ldots, y_n = x_n + c$, then $s_y^2 = s_x^2$.
2. If $y_1 = c x_1, \ldots, y_n = c x_n$, then $s_y^2 = c^2 s_x^2$,
where $s_x^2$ is the sample variance of the x's
and $s_y^2$ is the sample variance of the y's.
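Both properties are easy to check numerically. The sketch below reuses the nine observed values from the stem-and-leaf example and an arbitrarily chosen constant c = 3.

```python
import math
import statistics

xs = [9, 10, 15, 22, 9, 15, 16, 24, 11]
c = 3  # any nonzero constant; 3 is an arbitrary choice for illustration

shifted = [x + c for x in xs]   # y_i = x_i + c
scaled = [c * x for x in xs]    # y_i = c * x_i

var_x = statistics.variance(xs)
var_shifted = statistics.variance(shifted)
var_scaled = statistics.variance(scaled)

# Property 1: adding a constant leaves the variance unchanged.
print(math.isclose(var_shifted, var_x))
# Property 2: multiplying by c multiplies the variance by c^2.
print(math.isclose(var_scaled, c ** 2 * var_x))
```

Shifting moves every observation and the mean by the same amount, so the deviations are unchanged; scaling multiplies every deviation by c, so the squared deviations pick up a factor of $c^2$.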
Upper and Lower Fourths
After the n observations in a data set are
ordered from smallest to largest, the
lower (upper) fourth is the median of the
smallest (largest) half of the data, where
the median $\tilde{x}$ is included in both halves if
n is odd. A measure of spread that is
resistant to outliers is the fourth spread,
$f_s$ = upper fourth - lower fourth.
Third and first quartiles
After the n observations in a data set are
ordered from smallest to largest, the first
(third) quartile is the median of the
smallest (largest) half of the data, where
the median $\tilde{x}$ is included in both halves if
n is odd. A measure of spread that is
resistant to outliers is the interquartile
range, IQR = 3rd quartile - 1st quartile,
which under this definition equals $f_s$.
Outliers
Any observation farther than $1.5 f_s$ from
the closest fourth is an outlier. An
outlier is extreme if it is more than $3 f_s$
from the nearest fourth, and it is mild
otherwise.
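The fourths and the outlier rule can be sketched as follows. The function names are invented for this illustration, the first data set is the nine observed values from the stem-and-leaf example, and the two extra values 45 and 70 are hypothetical additions used only to trigger the rule.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def fourths(values):
    """Lower and upper fourths; the median is in both halves when n is odd."""
    ordered = sorted(values)
    n = len(ordered)
    half = (n + 1) // 2
    return median(ordered[:half]), median(ordered[n - half:])

def classify_outliers(values):
    """Map each outlying value to 'mild' or 'extreme' using the 1.5 fs / 3 fs rule."""
    lower, upper = fourths(values)
    fs = upper - lower
    labels = {}
    for x in values:
        # Distance beyond the closest fourth (negative if x lies between the fourths)
        distance = max(lower - x, x - upper)
        if distance > 3 * fs:
            labels[x] = "extreme"
        elif distance > 1.5 * fs:
            labels[x] = "mild"
    return labels

observed = [9, 10, 15, 22, 9, 15, 16, 24, 11]
print(fourths(observed))                      # lower and upper fourth
print(classify_outliers(observed))            # no outliers in the original data
print(classify_outliers(observed + [45, 70])) # hypothetical extra values
```

Because the fourths are medians of the two halves, adding one or two extreme values moves them very little, which is what makes $f_s$ resistant to outliers.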
Boxplots
[Figure: annotated boxplot with labels marking the lower fourth, the median, the upper fourth, mild outliers, and extreme outliers.]
Boxplot example
[Figure: boxplot of e1scores; axis from 40 to 100.]
Another boxplot example
[Figure: boxplot of e1questions; axis from 0.45 to 0.75.]