Review: Chapter 1, Section 1, Describing distributions with graphs

Stat 226 – Introduction to Business Statistics I
Spring 2009
Professor: Dr. Petrutza Caragea
Section A
Tuesdays and Thursdays 9:30-10:50 a.m.
Pie charts
Bar graphs
Pareto graphs
Histograms
Stemplots
Stat 226 (Spring 2009, Section A)
Introduction to Business Statistics I
Chapter 1, Section 1.2
Match the histograms to the best description
1. Numbers of medals won by countries in the 1992 Winter Olympics.
2. Last digit of each of 500 students’ social security numbers.
3. Age at death of a sample of 45 persons.
4. The SAT scores of 500 students.
5. The heights in inches of 500 college students.
6. Time on hold at a help line.
Chapter 1.2 – Describing Distributions with Numbers
We want to describe the data NUMERICALLY:
1. CENTER of the data
2. SPREAD of the data
Measuring the center:
associated with locating the “middle” of the data
finding the value that is most typical for the data
three common measures:
mean – average value of all data points
median – “middle” value of all data points
mode – data point(s) with highest frequency (most popular)
Chapter 1.2 – Sample mean
Notation: x̄
The sample mean of a set of observations x1, x2, ..., xn is the arithmetic
average of all observations:
$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i$
Example: # of sick days employees took in a small local business
0, 1, 2, 0, 4, 0, 1, 2, 3
Sometimes, the mean is not an appropriate measure of the center, because
it simply does not reflect a typical value of the data.
This is almost always the case when we have unusually large or small
observations in the data (called outliers).
Example: starting salaries of 5 people after graduating from college
35,000; 37,000; 35,000; 33,000; 210,000
$\bar{x} = \frac{35,000 + 37,000 + 35,000 + 33,000 + 210,000}{5} = 70,000$
70,000 is certainly not a typical starting salary for all 5 people; it is “just”
the average.
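As a minimal sketch, the sample mean can be computed directly from its definition. The function name `sample_mean` is chosen here for illustration and is not part of the course material; the data are the sick-day and starting-salary examples from the slides.

```python
def sample_mean(observations):
    """Arithmetic average: sum of the observations divided by their count n."""
    n = len(observations)
    return sum(observations) / n


sick_days = [0, 1, 2, 0, 4, 0, 1, 2, 3]
salaries = [35_000, 37_000, 35_000, 33_000, 210_000]

sick_mean = sample_mean(sick_days)    # 13 / 9, about 1.44 days
salary_mean = sample_mean(salaries)   # 350,000 / 5 = 70,000

print(f"sick days: {sick_mean:.2f}")
print(f"salaries:  {salary_mean:,.0f}")
```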
After removing the salary of 210,000, the new sample mean is
$\bar{x}_{\text{new}} = \frac{35,000 + 37,000 + 35,000 + 33,000}{4} = 35,000$
Note: The sample mean x̄ is sensitive toward outliers, i.e. it gets pulled
toward the extreme values in a data set.
Chapter 1.2 – Median
A measure of center that is more robust against outliers is the so-called
median.
Notation: median, M
The median corresponds to the value of the data that occupies the middle
position when all observations are ordered from smallest to largest.
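The pull that the single large salary exerts on the sample mean can be checked numerically. This is only a sketch using the five starting salaries from the example; the variable names are illustrative.

```python
salaries = [33_000, 35_000, 35_000, 37_000, 210_000]

# Mean of all five salaries.
mean_all = sum(salaries) / len(salaries)

# Drop the single unusually large salary and recompute the mean.
without_outlier = [s for s in salaries if s != 210_000]
mean_without = sum(without_outlier) / len(without_outlier)

print(mean_all, mean_without)   # 70000.0 35000.0
```

One observation out of five is enough to double the mean here, which is what "sensitive toward outliers" means in practice.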
Finding the median:
1. order all observations from smallest to largest.
2. assess whether the total number of observations n is odd or even.
3. locate middle value of data
odd: median M is the middle observation in the ordered list, i.e. the
$\left(\frac{n+1}{2}\right)$th observation.
even: median M corresponds to the average of the two middle observations in
the ordered list, i.e. the average of the $\left(\frac{n}{2}\right)$th and
$\left(\frac{n}{2} + 1\right)$th observation.
Examples:
Example 1: Salary data ordered
33,000; 35,000; 35,000; 37,000; 210,000
Example 2:
33,000; 35,000; 35,000; 37,000; 39,000; 210,000
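The odd/even rule for locating the median can be sketched as a short function. The name `median_by_position` is hypothetical; the positions in the comments are 1-based, as on the slides, while Python lists are 0-based.

```python
def median_by_position(observations):
    """Median via the odd/even position rule."""
    ordered = sorted(observations)           # step 1: order the data
    n = len(ordered)
    if n % 2 == 1:                           # step 2: n is odd
        return ordered[(n + 1) // 2 - 1]     # the (n+1)/2-th observation
    lower = ordered[n // 2 - 1]              # the (n/2)-th observation
    upper = ordered[n // 2]                  # the (n/2 + 1)-th observation
    return (lower + upper) / 2               # average of the two middle values


example_1 = [33_000, 35_000, 35_000, 37_000, 210_000]
example_2 = [33_000, 35_000, 35_000, 37_000, 39_000, 210_000]

m1 = median_by_position(example_1)   # n = 5, 3rd observation: 35,000
m2 = median_by_position(example_2)   # n = 6, average of 3rd and 4th: 36,000
print(m1, m2)
```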
Chapter 1.2 – Mean vs. Median
Note: Salary data in Example 1: x̄ = 70,000 and M = 35,000
⇒ the median M is obviously less influenced by outliers.
we should not conclude though that the median should always be
preferred over the mean simply because of its robustness against
outliers.
the mean and median measure the center of a data set in different
ways – they are both useful depending on the situation/application
Example: sick days data: 0, 1, 2, 0, 4, 0, 1, 2, 3. Now add
one more data point of x = 56.
If costs are directly associated with the amount of sick days, then the
mean would clearly be a better measure as it takes the extreme
observation into account
If we are just interested in the typical number of sick days for all
employees, the median is probably the more representative measure.
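To see how differently the two measures react, the sick-day example can be run with and without the extra observation of 56. This sketch uses `mean` and `median` from Python's standard `statistics` module; the variable names are illustrative.

```python
from statistics import mean, median

sick_days = [0, 1, 2, 0, 4, 0, 1, 2, 3]
with_extra = sick_days + [56]

mean_before, median_before = mean(sick_days), median(sick_days)
mean_after, median_after = mean(with_extra), median(with_extra)

print(f"before: mean = {mean_before:.2f}, median = {median_before}")
print(f"after:  mean = {mean_after:.2f}, median = {median_after}")
```

The mean moves from about 1.44 to 6.9 days, while the median only moves from 1 to 1.5 days.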
Relation between the shape of a distribution and mean/median
The more symmetric a distribution is, the closer the mean and the median
will be
[Figure: three distribution shapes: perfectly symmetric, skewed to the right, skewed to the left]
Chapter 1.2 – Mode
The mode corresponds to the value of the variable that occurs most
frequently.
Most useful for categorical data with a relatively small number of
possible values.
Example: Stat 226 – classification
Fr – 4
So – 31
J – 41
S – 7
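For categorical data like the class-standing counts, the mode is simply the category with the largest count. A small sketch, assuming the labels Fr, So, J and S stand for the four class standings; `Counter` comes from Python's standard `collections` module.

```python
from collections import Counter

# Frequency of each classification in the Stat 226 example.
class_counts = Counter({"Fr": 4, "So": 31, "J": 41, "S": 7})

# most_common(1) returns a list holding the (category, count) pair with the highest count.
mode_label, mode_count = class_counts.most_common(1)[0]
print(mode_label, mode_count)   # J 41
```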
Chapter 1.2 – Measuring spread/variation
variation is always present in real data
it is important to know how spread out the data are as this tells us
something about the behavior of a variable
furthermore, describing data just using the measures of
location/center is not sufficient: totally different data sets can still
have the same mean/median
Example: Number of sick days for 9 employees, two data sets
Data set 1: 0, 0, 0, 1, 1, 2, 2, 3, 4
Data set 2: 0, 0, 0, 0, 0, 0, 0, 0, 13
Measures of spread
Quartiles Q1, Q2, Q3
are 3 numbers that divide the ordered observations into 4 equally
sized groups (i.e. each group contains 25% of all observations)
describe the position of a specific data value in relation to the rest of
the data
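As a quick check of the two sick-day data sets in the example, the sketch below confirms that they share the same mean even though their values are spread very differently. The distance between the largest and smallest value is used here only as a crude, illustrative gauge of spread.

```python
data_set_1 = [0, 0, 0, 1, 1, 2, 2, 3, 4]
data_set_2 = [0, 0, 0, 0, 0, 0, 0, 0, 13]

mean_1 = sum(data_set_1) / len(data_set_1)   # 13 / 9
mean_2 = sum(data_set_2) / len(data_set_2)   # 13 / 9

# Largest minus smallest value in each data set.
extent_1 = max(data_set_1) - min(data_set_1)   # 4
extent_2 = max(data_set_2) - min(data_set_2)   # 13

print(mean_1 == mean_2, extent_1, extent_2)    # True 4 13
```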
Chapter 1.2 – Quartiles
Finding quartiles:
Q1: median of all observations to the left of the median M
Q2: corresponds to the median M
Q3: median of all observations to the right of the median M
Example: sick days (total number of observations is odd)
0, 0, 0, 1, 1, 2, 2, 3, 4
If an additional observation of x = 56 is added (now total number of
observations is even)
0, 0, 0, 1, 1, 2, 2, 3, 4, 56
Quartiles are also less influenced by outliers
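The quartile rule can be sketched as follows. The function name `quartiles` is hypothetical, and the sketch assumes that for an odd number of observations the median itself belongs to neither half, which is one reading of "to the left/right of the median". Statistical software often uses other conventions and may return slightly different values.

```python
from statistics import median


def quartiles(observations):
    """Return (Q1, M, Q3), with Q1 and Q3 as medians of the lower and upper halves."""
    ordered = sorted(observations)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]                       # observations left of the median
    # For odd n, skip the middle observation; for even n, the halves split cleanly.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)


sick_days = [0, 0, 0, 1, 1, 2, 2, 3, 4]
q_odd = quartiles(sick_days)             # Q1 = 0, M = 1, Q3 = 2.5
q_even = quartiles(sick_days + [56])     # Q1 = 0, M = 1.5, Q3 = 3
print(q_odd, q_even)
```

Adding the extreme value 56 shifts Q3 only from 2.5 to 3, which illustrates why quartiles are less influenced by outliers.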
Chapter 1.2 – Five-number summary
convenient tool to describe both the center and the spread in a data set
5-number summary
The 5-number summary consists of the following measures
Min    Q1    Median    Q3    Max
Example: sick days
0, 0, 0, 1, 1, 2, 2, 3, 4
Note:
Min and Q1 coincide here due to the nature of the data (this is more
the exception than the rule)
Chapter 1.2 – Boxplots
A graphical display of the 5-number summary is a so-called boxplot
boxplots can be either vertical or horizontal
side-by-side boxplots to compare different groups
Example: data on the # of surgeries performed by male and female
surgeons in a hospital
female: 5, 7, 10, 14, 18, 19, 25, 29, 31, 32
male: 20, 25, 25, 27, 28, 31, 33, 34, 36, 36, 37, 44, 50, 59, 85, 86
5-number summary:
side-by-side boxplots:
[Figure: side-by-side boxplots of the number of surgeries for female and male surgeons]
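The 5-number summaries for the two groups of surgeons can be computed with a short sketch. The function name `five_number_summary` is illustrative, and the quartiles follow the "median of each half" rule from these notes (other conventions exist and can give slightly different quartiles).

```python
from statistics import median


def five_number_summary(observations):
    """Return (Min, Q1, Median, Q3, Max) using medians of the lower and upper halves."""
    ordered = sorted(observations)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For odd n the middle observation is left out of both halves.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])


female = [5, 7, 10, 14, 18, 19, 25, 29, 31, 32]
male = [20, 25, 25, 27, 28, 31, 33, 34, 36, 36, 37, 44, 50, 59, 85, 86]

female_summary = five_number_summary(female)   # (5, 10, 18.5, 29, 32)
male_summary = five_number_summary(male)       # (20, 27.5, 35.0, 47.0, 86)
print("female:", female_summary)
print("male:  ", male_summary)
```

These five values per group are exactly what a side-by-side boxplot draws: the box spans Q1 to Q3, with a line at the median.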
Using boxplots to describe distributions
less variability among female surgeons
distribution is also more symmetric for female surgeons
more variability among male surgeons
mean/median is much higher for male surgeons than for female ones
in general:
for a symmetric distribution: Q1 and Q3 are about equally far from M
for a distribution skewed to the right: Q3 will be further away from M than Q1
(and likewise Max further than Min)
for a distribution skewed to the left: Q1 will be further away from M than Q3
(and likewise Min further than Max)
Chapter 1.2 – More measures of spread
Measuring spread: the range, IQR, variance and standard deviation
Need to describe the amount of spread or variability that is present in the
data
Note: Any measure of spread will take the value of zero only if all
observations in the data set have the same value!
range, R
The range R corresponds to the difference between the highest and
lowest value.
Example: # of surgeries performed by the 16 male surgeons
Chapter 1.2 – Interquartile range
Note: the range shows the full range of spread in the data, but the range
depends on the smallest/largest observation, which could be outliers!
Alternatively, we can use the so-called interquartile range, IQR
IQR
IQR = Q3 − Q1
corresponding to the range of the middle 50% of the data.
Example: 16 male surgeons
Chapter 1.2 – More measures of spread: Sample variance
Improve the description of SPREAD by looking at the deviations of each
single observation from the mean, i.e. how far an observation is away
from the overall mean x̄
sample variance s²
The sample variance corresponds to the sum of all squared deviations
of each observation from the sample mean x̄, divided by n − 1
$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1} = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
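For the 16 male surgeons, the range, the IQR and the sample variance can be worked out with the sketch below. It assumes the usual sample variance with divisor n − 1 and quartiles taken as the medians of the lower and upper halves; the helper name `sample_variance` is illustrative.

```python
from statistics import median

male = [20, 25, 25, 27, 28, 31, 33, 34, 36, 36, 37, 44, 50, 59, 85, 86]


def sample_variance(observations):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(observations)
    x_bar = sum(observations) / n
    return sum((x - x_bar) ** 2 for x in observations) / (n - 1)


ordered = sorted(male)
half = len(ordered) // 2          # n = 16 is even, so the two halves split cleanly
q1 = median(ordered[:half])       # 27.5
q3 = median(ordered[half:])       # 47.0

data_range = ordered[-1] - ordered[0]   # 86 - 20 = 66
iqr = q3 - q1                           # 47 - 27.5 = 19.5
s2 = sample_variance(male)              # 5972 / 15, about 398.13

print(data_range, iqr, round(s2, 2))
```

The range of 66 is driven by the two largest counts (85 and 86), while the IQR of 19.5 describes only the middle half of the surgeons.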
Chapter 1.2 – Sample standard deviation
standard deviation, s
The standard deviation is the positive square root of the variance s²
Why work with s instead of s²?
s has the same units of measurement as the observations in the data set
Example: # of surgeries – female surgeons
5, 7, 10, 14, 18, 19, 25, 29, 31, 32
Chapter 1.2 – Variance and Standard deviation
Note:
the variance s² (and hence s) can only be greater than or equal to zero
(as it is based on squared deviations)
s² and s measure the spread about the sample mean x̄
s² = s = 0 only if all observations are of the same value
s² and s are strongly influenced by outliers; one outlier can cause s²
and s to drastically increase in value
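The standard deviation for the female surgeons, and its sensitivity to a single outlier, can be illustrated with a short sketch. The helper name `sample_std` and the extra observation of 100 surgeries are hypothetical additions for illustration only; they do not appear in the course data.

```python
import math


def sample_std(observations):
    """Positive square root of the sample variance (divisor n - 1)."""
    n = len(observations)
    x_bar = sum(observations) / n
    s2 = sum((x - x_bar) ** 2 for x in observations) / (n - 1)
    return math.sqrt(s2)


female = [5, 7, 10, 14, 18, 19, 25, 29, 31, 32]

s_female = sample_std(female)              # sqrt(896 / 9), about 9.98 surgeries

# Hypothetical outlier (not in the original data) to show how strongly s reacts.
s_with_outlier = sample_std(female + [100])

# A data set with no variation at all has s = 0.
s_constant = sample_std([7, 7, 7])

print(round(s_female, 2), round(s_with_outlier, 2), s_constant)
```

Because s is expressed in surgeries rather than squared surgeries, it can be read on the same scale as the observations themselves.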
Choosing a numerical summary
Choice of an appropriate measure of center/spread heavily relies on
the shape of the distribution
the presence of outliers
⇒ If the data are reasonably symmetric and no outliers are present, then
the sample mean x̄ and the standard deviation s can be used
⇒ If the data are skewed and/or outliers are present, the 5-number
summary should be used