Uploaded by AlReem AlMutawa

STAT130 LCN Unit1

Statistics for Business
STAT130
Unit 1: Introduction and
Descriptive Statistics
Chapter 1
An Introduction to Business
Statistics
Applications in Business and Economics

Accounting
Public accounting firms use statistical sampling
procedures when conducting audits for their
clients.

Production
A variety of statistical quality control charts are
used to monitor the output of a production process.

Marketing
Electronic point-of-sale scanners at retail checkout
counters are being used to collect data for a variety
of marketing research applications.

Economics


Economists use statistical information in making
forecasts about the future of the economy or some
aspect of it.
Finance

Financial advisors use a variety of statistical
information, including price-earnings ratios and
dividend yields, to guide their investment
recommendations.
4
Key Definitions






A population is the collection of all items or
things under consideration (people or objects)
A sample is a portion of the population selected
for analysis
A parameter is a summary measure that
describes a characteristic of the population
A statistic is a summary measure computed from
a sample
A survey is the gathering of data about a
particular group of people or items
A census is a survey of the entire population
5
Exercise

A manufacturer of children's toys claims that less
than 5% of his products are defective. When 500
toys were drawn from a large production run, 8%
were found to be defective.
a) What is the population of interest?
b) What is the sample?
c) What is the parameter?
d) What is the statistic?
e) Does the value 5% refer to the parameter or the
statistic?
f) Is the value 8% a parameter or a statistic?
g) Explain briefly how the statistic can be used to make
inferences about the parameter to test the claim.
6
What is Statistics?

Statistics is a science that deals with collecting
and analyzing data, drawing conclusions, and
making decisions.

There are two main areas of Statistics:


Descriptive statistics:
provides tabular and graphical techniques and
numerical measures for describing data.
Inferential statistics:
provides procedures for analyzing data and
making decisions.
7
Descriptive Statistics
Collect data
e.g. Survey
Present data
e.g. Tables and graphs
Characterize data
e.g. Sample mean = $\bar{X} = \dfrac{\sum X_i}{n}$
8
Inferential Statistics

Estimation


e.g.: Estimate the
population mean weight
using the sample mean
weight
Hypothesis testing

e.g.: Test the claim that the
population mean weight is
over 120 pounds
Drawing conclusions and/or making decisions
concerning a population based on sample results.
9
Inferential Statistics

Making statements about a population
by examining sample results
Sample statistics (known) → Inference → Population parameters
(unknown, but can be estimated from sample evidence)
10
Sources of data

The most popular sources of data are:


Published material, observational studies,
experimental studies and surveys.
Published material
found in books, in scientific journals, on
tapes, on CDs, on the Internet, etc…


Data published by the organization that
collected the data are called PRIMARY DATA
Data published by an organization other than
the organization that collected the data are
called SECONDARY DATA.
11
Sources of data

Observational studies:
are studies in which the sample elements are observed
and the information is recorded without controlling any
of the factors that might affect the information or
measurements.

Experimental studies:
are studies in which the measurements are recorded while
controlling some factors that might influence the results
of the study.

Surveys:
are questionnaires designed to solicit information from
people by means of face-to-face interview, telephone
interview, postal mail, e-mail, or fax.
12
Types of data


Data are the facts, figures, or records that
are collected from the sample elements.
Data can be classified:

Qualitative data are labels or names used to
identify attributes of the sample elements.
The labels can be numbers with no real
numerical meaning.


Examples: gender, marital status, race, ..
Quantitative data are numbers (with real
meaning) representing measurements
obtained from the sample elements.

Examples: salary, age, number of branches,..
13
Measurement Scales

Nominal data if the order is not important.


Examples: data representing marital status,
gender, work sector (public, private), get
promoted (yes, no), etc …
Ordinal data if the order is important.

Examples: data representing job performance
(excellent, good, fair, poor), income level (low,
medium, high), educational level (less than
high school, high school, college), etc…
14
Measurement Scales

Interval data: All of the characteristics of ordinal
plus…

Measurements are on a numerical scale with an arbitrary
zero point
The “zero” is assigned: it is nonphysical and not
meaningful
Zero does not mean the absence of the quantity that we
are trying to measure

Can only meaningfully compare values by the interval
between them
Cannot compare values by taking their ratios
“Interval” is the arithmetic difference between the values

Example: temperature
0 °F means “cold,” not “no heat”
80 °F is not twice as warm as 40 °F
15
Measurement Scales

Ratio data: All the characteristics of interval
plus…

Measurements are on a numerical scale with a
meaningful zero point
Zero means “none” or “nothing”

Values can be compared in terms of their interval and
ratio
$30 is $20 more than $10
$0 means no money

In business and finance, most quantitative variables are
ratio variables, such as anything to do with money
Examples: Earnings, profit, loss, age, distance, height,
weight
16
Exercise

After the graduation ceremonies at a university, six
Business graduates were asked whether they will join
an MBA program next year. Some information about
these graduates is shown below.
Graduate   Sex   Age   MBA   Rank
Huda       F     52    1     1
Mohamed    M     24    1     2
Sara       F     33    0     4
Ali        M     38    0     20
Fatima     F     25    1     3
Samer      M     19    0     8

a) How many elements are in the data set?
b) How many variables are in the data set?
c) How many observations are in the data set?
d) Classify the above variables (qualitative/quantitative).
17
Sampling

Reasons for Drawing a Sample



It may cost too much to collect information from each
element of the population.
The population may be too large and it would take a
long time to collect information.
It may not be possible to obtain information from
some elements of the population.
Probability Samples
Simple
Systematic
Stratified
Cluster
18
Simple Random Samples

Every individual or item from the frame
has an equal chance of being selected.

Selection may be with replacement or
without replacement.

Samples can be obtained from computer random
number generators.
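The two selection modes can be sketched in Python. This is only an illustration: the frame of 64 labelled individuals and the seed are assumptions, and the standard-library `random` module stands in for the random number generator mentioned above.

```python
import random

# Hypothetical frame: 64 individuals labelled 1..64 (not from the slides).
frame = list(range(1, 65))
rng = random.Random(130)  # fixed seed so the sketch is repeatable

# Without replacement: no individual can appear twice in the sample.
sample_without = rng.sample(frame, k=8)

# With replacement: the same individual may be drawn more than once.
sample_with = rng.choices(frame, k=8)

print(sample_without)
print(sample_with)
```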
19
Systematic Samples

Decide on sample size: n

Divide frame of N individuals into groups of k
individuals: k=N/n

Randomly select one individual from the 1st
group.

Select every kth individual thereafter
Example: N = 64, n = 8, k = 8
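The systematic procedure can be sketched as follows, using the slide's sizes N = 64 and n = 8. The function name `systematic_sample`, the labels 1..64 and the seed are assumptions made for illustration.

```python
import random

def systematic_sample(frame, n, rng):
    """Pick one random start in the first group, then every k-th item."""
    k = len(frame) // n        # group size k = N / n
    start = rng.randrange(k)   # random position inside the first group
    return frame[start::k][:n]

frame = list(range(1, 65))     # N = 64 (hypothetical labels 1..64)
sample = systematic_sample(frame, n=8, rng=random.Random(1))
print(sample)
```

Only the starting point is random here; once it is fixed, the rest of the sample is determined.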
20
Stratified Samples



Population divided into two or more subgroups
(called strata) according to some common
characteristic.
Simple random sample selected from each
subgroup.
Samples from subgroups are combined into
one.
[Figure: population divided into 4 strata, with a sample drawn from the strata]
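The three steps above (divide into strata, draw a simple random sample from each, combine) might look like this in Python. The strata contents, the equal allocation of 3 items per stratum and the helper name `stratified_sample` are all assumptions for illustration.

```python
import random

def stratified_sample(strata, per_stratum, rng):
    """Draw a simple random sample from each stratum and combine them."""
    combined = []
    for members in strata.values():
        combined.extend(rng.sample(members, per_stratum))
    return combined

# Hypothetical population of 100 items split into 4 strata.
strata = {
    "A": list(range(0, 25)),
    "B": list(range(25, 50)),
    "C": list(range(50, 75)),
    "D": list(range(75, 100)),
}
sample = stratified_sample(strata, per_stratum=3, rng=random.Random(7))
print(sample)
```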
21
Cluster Samples

Population is divided into “clusters,” each
representative of the population

A simple random sample of clusters is
selected

All items in the selected clusters can be used, or items
can be chosen from a cluster using another probability
sampling technique
[Figure: population divided into 16 clusters, with randomly selected clusters forming the sample]
22
Advantages and Disadvantages

Simple random sample and systematic sample
Simple to use
May not be a good representation of the
population’s underlying characteristics that have
small probabilities

Stratified sample
Ensures representation of individuals across the
entire population

Cluster sample
More cost effective
Less efficient (need larger sample to acquire the
same level of precision)
23
Chapter 2
Descriptive Statistics: Tabular
and Graphical Methods
Organizing and Presenting Data

Data in raw form are usually not easy to use
for decision making

Some type of organization is needed



Table
Graph
Techniques reviewed here:




Stem-and-Leaf Display
Frequency Distributions and Histograms
Bar charts and pie charts
Contingency tables and Scatter Diagrams
25
Representing Qualitative Data
Tabulating data: frequency table
Graphing data: bar charts, pie charts
26
Frequency Tables



A frequency table consists of two columns,
one of which shows the categories or classes
and the other specifies the frequency for each
category.
In a frequency table, all frequencies must add
up to the sample size (n).
A relative frequency table consists of two
columns, one of which shows the categories
or classes and the other specifies the relative
frequency for each category.
The relative frequency=(Frequency/sample size)
27
Example

The following table lists all 251 vehicles sold
in 2006 by the greater Cincinnati Jeep dealers

Jeep Model        Frequency
Commander         71
Grand Cherokee    70
Liberty           80
Wrangler          30
Total             251
28
Example: Relative Frequency Table
Jeep Model        Relative Frequency   Percent Frequency
Commander         0.2829               28.29%
Grand Cherokee    0.2789               27.89%
Liberty           0.3187               31.87%
Wrangler          0.1195               11.95%
Total             1.0000               100.00%
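As a quick check of the relative frequency rule (frequency divided by sample size), the Jeep counts can be turned into relative and percent frequencies with a few lines of Python. The variable names are illustrative only.

```python
# Frequencies from the Jeep example.
sales = {"Commander": 71, "Grand Cherokee": 70, "Liberty": 80, "Wrangler": 30}

n = sum(sales.values())  # sample size: all frequencies must add up to n
relative = {model: freq / n for model, freq in sales.items()}

for model, rel in relative.items():
    print(f"{model:<15} {rel:.4f} {rel:8.2%}")
```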
29
Bar Charts and Pie Charts

Bar chart: A vertical or horizontal rectangle
represents the frequency for each category



Height can be frequency, relative frequency, or
percent frequency
What to Look For: Frequently and infrequently
occurring categories.
Pie chart: A circle divided into slices where the
size of each slice represents its relative frequency
or percent frequency

What to Look For: Categories that form large and
small proportions of the data set.
30
Excel Bar Chart
31
Excel Pie Chart
32
Exercise

A random sample of 25 female shoppers was
selected on a given day and each shopper
was asked: “What is your favorite shampoo?”
The data were as follows:
p, p, s, d, s, d, d, s, p, d, p, d, d, s, d, p, s,
s, d, s, p, d, d, s, d
where d = Dove, p = Pantene and s = Sunsilk.
Construct a frequency table, a bar chart and
a pie chart and comment on the plots.
33
Representing Quantitative Data
Ordered array
Stem-and-leaf display
Frequency distributions and cumulative distributions: histogram, polygon, ogive
34
Frequency Distributions

A frequency distribution is a list or a table
containing class groupings (categories or ranges
within which the data fall)
and the corresponding frequencies with which data
fall within each grouping or category
Why Use Frequency Distributions?

A frequency distribution is a way to summarize
data
The distribution condenses the raw data into a
more useful form
It allows for a quick visual interpretation of the data
and easy graphical display
35
Class Intervals and Class Boundaries

If each class grouping has the same width

Determine the width of each interval by
$\text{Width of interval} \approx \dfrac{\text{range}}{\text{number of desired class groupings}}$
Use at least 5 but no more than 15 groupings
Class boundaries never overlap
Round up the interval width to get desirable
endpoints
36
Frequency Distribution Example

A manufacturer of insulation randomly selects 20
winter days and records the daily high
temperature

Sort raw data in ascending order:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
32, 35, 37, 38, 41, 43, 44, 46, 53, 58

Find range: 58 - 12 = 46

Select number of classes: 5 (usually 5 to 15)

Compute class interval (width): 10 (46/5 = 9.2, then round up)

Compute class boundaries (limits): 10, 20, 30, 40, 50, 60

Compute class midpoints: 15, 25, 35, 45, 55

Count observations & assign to classes
37
Frequency Distribution Example
Ordered Data:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

Class                 Frequency   Relative Frequency   Percentage
10 but less than 20   3           .15                  15
20 but less than 30   6           .30                  30
30 but less than 40   5           .25                  25
40 but less than 50   4           .20                  20
50 but less than 60   2           .10                  10
Total                 20          1.00                 100
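The steps of this example (range, width, class boundaries, counting) can be reproduced with a short Python sketch. Rounding the width up with `math.ceil` is an assumption that happens to give the slide's width of 10; in general the width is rounded to whatever gives convenient endpoints.

```python
import math

temps = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
         32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

num_classes = 5
data_range = max(temps) - min(temps)         # 58 - 12 = 46
width = math.ceil(data_range / num_classes)  # 9.2 rounded up to 10

# Lower class boundaries; each class is "lower but less than lower + width".
first_lower = (min(temps) // width) * width
lowers = [first_lower + i * width for i in range(num_classes)]

freq = {lo: sum(lo <= t < lo + width for t in temps) for lo in lowers}
n = len(temps)
for lo, f in freq.items():
    print(f"{lo} but less than {lo + width}: {f:2d}  {f / n:.2f}  {100 * f / n:.0f}%")
```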
38
The Histogram





A graph of the data in a frequency distribution is
called a histogram
The class boundaries (or class midpoints) are
shown on the horizontal axis
frequency is measured on the vertical axis
Bars of the appropriate heights can be used to
represent the number of observations within each
class
What to Look For: Central or typical value, extent
of spread or variation, general shape, location and
number of peaks, presence of gaps and outliers.
39
Histogram Example
Class                 Midpoint   Frequency
10 but less than 20   15         3
20 but less than 30   25         6
30 but less than 40   35         5
40 but less than 50   45         4
50 but less than 60   55         2

[Figure: Histogram: Daily High Temperature. Horizontal axis: class midpoints (15 to 55); vertical axis: frequency. No gaps between bars.]
Shapes of Histograms
symmetric histograms
skewed histograms
41
Frequency Polygons


Plot a point above each class midpoint at a
height equal to the frequency of the class
Useful when comparing two or more distributions
42
Cumulative Distributions and Ogive



Another way to summarize a distribution is to
construct a cumulative distribution
Rather than a count, we record the number of
measurements that are less than the upper
boundary of that class
Ogive: A graph of a cumulative distribution



Plot a point above each upper class boundary at height
of cumulative frequency
Connect points with line segments
Can also be drawn using


Cumulative relative frequencies
Cumulative percent frequencies
43
Cumulative Frequency
Data in ordered array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

Class      Freq   %     Cumulative Class   Cumulative Frequency   Cumulative %
10 - <20   3      15    less than 20       3                      15
20 - <30   6      30    less than 30       9                      45
30 - <40   5      25    less than 40       14                     70
40 - <50   4      20    less than 50       18                     90
50 - <60   2      10    less than 60       20                     100
Total      20     100
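The cumulative columns are running totals of the class frequencies. A minimal Python sketch, using the class frequencies from the temperature example and `itertools.accumulate` for the running sum:

```python
from itertools import accumulate

upper_boundaries = [20, 30, 40, 50, 60]
freqs = [3, 6, 5, 4, 2]

cum_freq = list(accumulate(freqs))                   # running total of counts
cum_pct = [100 * c / cum_freq[-1] for c in cum_freq]

for upper, cf, cp in zip(upper_boundaries, cum_freq, cum_pct):
    print(f"less than {upper}: {cf:2d}  {cp:.0f}%")
```

The pairs (upper boundary, cumulative percentage) are the points that an ogive connects with line segments.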
44
Graphing Cumulative Frequencies:
The Ogive (Cumulative % Polygon)
Class          Class boundary   Cumulative Percentage
less than 10   10               0
less than 20   20               15
less than 30   30               45
less than 40   40               70
less than 50   50               90
less than 60   60               100

[Figure: Ogive: Daily High Temperature. Horizontal axis: class boundaries (10 to 60); vertical axis: cumulative percentage (0 to 100).]
Exercise

A random sample of 25 stocks was selected from
the New York Stock Exchange and the book value
(net worth divided by the number of outstanding
shares) was recorded for each stock. The data
were as follows:
10 11 8 16 14 4 10 8 12 9 7 14 13
7 10 17 8 11 9 15 8 6 18 9 12
Construct a frequency table
Construct a histogram and describe the distribution.
Determine the cumulative frequency table
46
Stem and Leaf Display

Purpose is to see the overall pattern of the data,
by grouping the data into classes




the variation from class to class
the amount of data in each class
the distribution of the data within each class
What to look for: The display conveys
information about a representative or typical
value in the data set, the extent of spread about
such a value, the presence of any gaps in the
data, the extent of symmetry in the distribution
of values, the number and location of peaks, and
the presence of any outliers (unusual points).
47
Example
Data in ordered array:
21, 24, 24, 26, 27, 27, 30, 32, 38, 41

Here, use the 10’s digit for the stem unit:
21 is shown as stem 2, leaf 1
38 is shown as stem 3, leaf 8

Stem | Leaves
2    | 1 4 4 6 7 7
3    | 0 2 8
4    | 1
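A stem-and-leaf display with the 10's digit as the stem can be built by splitting each value with `divmod`. This is a small sketch for two-digit data such as the example above; the variable names are illustrative.

```python
from collections import defaultdict

data = [21, 24, 24, 26, 27, 27, 30, 32, 38, 41]

stems = defaultdict(list)
for value in sorted(data):
    stem, leaf = divmod(value, 10)  # 10's digit is the stem, unit digit the leaf
    stems[stem].append(leaf)

lines = [f"{stem} | {' '.join(map(str, leaves))}"
         for stem, leaves in sorted(stems.items())]
print("\n".join(lines))
```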
48
Car Mileage: Results


Refer to the Car Mileage
Case (Table 2.14)
Looking at the stem-and-leaf display, the distribution
appears almost “symmetrical”

The upper portion (29, 30, 31) is almost a mirror
image of the lower portion of the display (31, 32, 33)
Stems 31, 32*, 32, and 33*
But not exactly a mirror reflection
49
Crosstabulation Tables

Classifies data on two dimensions



Rows classify according to one dimension
Columns classify according to a second
dimension
Requires three variables
1. The row variable
2. The column variable
3. The variable counted in the cells
50
Example: The Investor Satisfaction Case


Investment broker sells several kinds of
investments (stock fund, bond fund, tax-deferred
annuity)
Wishes to study whether satisfaction depends on
the type of investment product purchased
Fund Type              High   Medium   Low   Total
Bond Fund              15     12       3     30
Stock Fund             24     4        2     30
Tax Deferred Annuity   1      24       15    40
Total                  40     40       20    100
51
More on Crosstabulation Tables



Row totals provide a frequency distribution for
the different fund types
Column totals provide a frequency distribution for
the different satisfaction levels
One way to investigate relationships is to
compute row and column percentages


Compute row percentages by dividing each cell’s
frequency by its row total and expressing as a
percentage
Compute column percentages by dividing by the column
total
52
Row Percentage for Each Fund Type
Fund Type              High    Medium   Low     Total
Bond Fund              50.0%   40.0%    10.0%   100%
Stock Fund             80.0%   13.3%    6.7%    100%
Tax Deferred Annuity   2.5%    60.0%    37.5%   100%
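Row percentages divide each cell by its row total. The sketch below recomputes them from the investor satisfaction counts; the dictionary layout and variable names are illustrative choices, not part of the original case.

```python
# Cell counts in the order High, Medium, Low.
counts = {
    "Bond Fund": [15, 12, 3],
    "Stock Fund": [24, 4, 2],
    "Tax Deferred Annuity": [1, 24, 15],
}

row_pct = {
    fund: [round(100 * c / sum(row), 1) for c in row]
    for fund, row in counts.items()
}
column_totals = [sum(col) for col in zip(*counts.values())]

for fund, pcts in row_pct.items():
    print(fund, pcts)
print("Column totals:", column_totals)
```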
53
Scatter Plots

Scatter plots are used for bivariate numerical
data
Bivariate data consists of paired observations
taken from two numerical variables

The Scatter plot:
one variable (dependent) is measured on the
vertical axis and the other variable (independent)
is measured on the horizontal axis.

What to look for:
Describe the type of the relationship (linear,
nonlinear), the direction (positive, negative) and
the strength (strong, moderate, weak).
54
Examples of Scatter Plots
Describing direction
Describing strength
55
Scatter Plot Example
Volume per day   Cost per day
23               125
26               140
29               146
33               160
38               167
42               170
50               188
55               195
60               200

Strong positive linear relationship
56
Chapter 3
Descriptive Statistics:
Numerical Methods
Summary Measures
Describing Data Numerically
Center and Location: Mean, Median, Mode
Variation: Range, Interquartile Range, Variance, Standard Deviation, Coefficient of Variation
Measures of Relative Standing: Percentiles, Quartiles
58
Measures of Central Tendency

In addition to describing the shape of a
distribution, we want to describe the data set’s
central tendency

A measure of central tendency represents the
center or middle of the data
Central Tendency
Mean
Median
Mode
59
Mean (Arithmetic Average)

The Mean is the arithmetic average of data
values

Sample mean
n = Sample Size
$\bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{n} = \dfrac{x_1 + x_2 + \cdots + x_n}{n}$

Population mean
N = Population Size
$\mu = \dfrac{\sum_{i=1}^{N} x_i}{N} = \dfrac{x_1 + x_2 + \cdots + x_N}{N}$
60
Arithmetic Mean



The most common measure of central tendency
Mean = sum of values divided by the number of
values
Affected by extreme values (outliers)
Values 1, 2, 3, 4, 5: Mean = $\dfrac{1 + 2 + 3 + 4 + 5}{5} = \dfrac{15}{5} = 3$
Values 1, 2, 3, 4, 10: Mean = $\dfrac{1 + 2 + 3 + 4 + 10}{5} = \dfrac{20}{5} = 4$
61
Median

Not affected by extreme values
[Figure: two number lines from 0 to 10, each marked Median = 3]

In an ordered array, the median is the
“middle” number (50% above, 50% below)
62
Finding the Median

The location of the median:
$\text{Median position} = \dfrac{n + 1}{2}$ position in the ordered array

If the number of values is odd, the median is the middle
number
If the number of values is even, the median is the
average of the two middle numbers
Note that (n+1)/2 is not the value of the median,
only the position of the median in the ranked data
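The position rule can be written out directly. The function below is a sketch (the name `median_by_position` is made up for illustration) that separates the position (n + 1)/2 from the value found at that position.

```python
def median_by_position(values):
    """Median found via the (n + 1) / 2 position in the ordered array."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2                # 1-based position, not the median itself
    if n % 2 == 1:
        return ordered[int(position) - 1]  # odd n: the middle number
    lower = ordered[n // 2 - 1]            # even n: the two middle numbers
    upper = ordered[n // 2]
    return (lower + upper) / 2

print(median_by_position([1, 2, 3, 4, 5]))
print(median_by_position([1, 2, 3, 4, 10]))
print(median_by_position([1, 2, 3, 4]))
```

The first two calls return the same median even though the largest value changes from 5 to 10, which illustrates that the median is not affected by extreme values.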
63
Mode






A measure of central tendency
Value that occurs most often
Not affected by extreme values
Mainly used for grouped numerical data or
categorical data
There may be no mode
There may be several modes
[Figure: two number lines, one labeled “No Mode” and one labeled “Mode = 9”]
64
Review Example

Five houses on a hill by the beach
House Prices:
$2,000,000
500,000
300,000
100,000
100,000
65
Example: Summary Statistics

House Prices:
$2,000,000
500,000
300,000
100,000
100,000
Sum 3,000,000

Mean: ($3,000,000/5)
= $600,000

Median: middle value of
ranked data
= $300,000

Mode: most frequent value
= $100,000
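The three summary statistics for the house prices can be checked with Python's standard `statistics` module. This is only a verification sketch of the figures above.

```python
import statistics

prices = [2_000_000, 500_000, 300_000, 100_000, 100_000]

mean_price = statistics.mean(prices)      # sum of values / number of values
median_price = statistics.median(prices)  # middle value of the ranked data
mode_price = statistics.mode(prices)      # most frequent value

print(mean_price, median_price, mode_price)
```

The single $2,000,000 house pulls the mean well above the median, which is why the median is often preferred when extreme values are present.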
66
Which measure is the “best”?




Mean is generally used, unless extreme values
(outliers) exist
Then median is often used, since the median is
not sensitive to extreme values.
For a relatively small number of extreme
observations (either very small or very large,
but not both), the median is usually better.
Choosing:



The mode is meaningful on a nominal scale.
The median is meaningful on an ordinal scale.
The mean is meaningful on an interval/ratio scale.
67
Shape of a Distribution


Describes how data is distributed
Symmetric or skewed

If the distribution is symmetric, then mean=median.

If the distribution is skewed to right, then

mode < median < mean
If the distribution is skewed to left, then
mode > median > mean
68
Exercise

The following data represent the ages of 20
randomly selected managers:
43 44 49 37 45 35 46
32 47 42 39 40 41 45
41 43 50 47 41 51
a) Find the mean, median and mode for the
above data.
b) Which measure would you choose to describe
the data? Why?
69
Measures of Variability
Variability: Range, Variance, Standard Deviation, Coefficient of Variation
70
Measures of Variation


Knowing the measures of center is not enough
Both of the distributions below have identical
measures of central tendency
71
Range
Simplest measure of variation
Difference between the largest and the
smallest observations:

Range = maximum – minimum
Example:
[Figure: data plotted on a 0 to 14 axis, with minimum 1 and maximum 14]
Range = 14 - 1 = 13
72
Disadvantages of the Range

Ignores the way in which data are distributed
[Figure: two differently distributed data sets on a 7 to 12 axis, each with Range = 12 - 7 = 5]

Sensitive to outliers
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
73
Variance

Average of squared deviations of values from
the mean

Population variance:
$\sigma^2 = \dfrac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$
Where
μ = population mean
N = population size
Xi = ith value of the variable X

Sample variance:
$S^2 = \dfrac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$
Where
$\bar{X}$ = arithmetic mean
n = sample size
Xi = ith value of the variable X
74
Standard Deviation
Most commonly used measure of variation
The square root of the variance
Shows variation about the mean
Has the same units as the original data

Sample standard deviation:
$S = \sqrt{\dfrac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$
75
Example: Sample Standard Deviation
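Sample Data (Xi): 10 12 14 15 17 18 18 24
n = 8
Mean = $\bar{X}$ = 16
$S = \sqrt{\dfrac{(10 - \bar{X})^2 + (12 - \bar{X})^2 + (14 - \bar{X})^2 + \cdots + (24 - \bar{X})^2}{n - 1}}$
$= \sqrt{\dfrac{(10 - 16)^2 + (12 - 16)^2 + (14 - 16)^2 + \cdots + (24 - 16)^2}{8 - 1}}$
$= \sqrt{\dfrac{130}{7}} = 4.3095$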
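The sum of squared deviations is easy to get wrong by hand, so it is worth recomputing from the eight data values. The sketch below does the calculation step by step and compares it with `statistics.stdev`, which also uses the n - 1 divisor.

```python
import math
import statistics

data = [10, 12, 14, 15, 17, 18, 18, 24]

n = len(data)
mean = sum(data) / n                         # 128 / 8 = 16
squared_devs = [(x - mean) ** 2 for x in data]
sum_sq = sum(squared_devs)                   # 36 + 16 + 4 + 1 + 1 + 4 + 4 + 64 = 130
s = math.sqrt(sum_sq / (n - 1))              # sample standard deviation

print(mean, sum_sq, round(s, 4))
print(round(statistics.stdev(data), 4))
```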
76
Comparing Standard Deviations
[Figure: three data sets (Data A, Data B, Data C) plotted on an 11 to 21 axis, with the same mean but different spread]
Data A: Mean = 15.5, S = 3.338
Data B: Mean = 15.5, S = .9258
Data C: Mean = 15.5, S = 4.57
77
Coefficient of Variation

Measures relative variation

Always a percentage (%)

Shows variation relative to mean

Is used to compare two or more sets of
data measured in different units
$CV = \left(\dfrac{S}{\bar{X}}\right) \cdot 100\%$
78
Comparing Coefficients of Variation


Stock A:
Average price last year = $50
Standard deviation = $5
$CV_A = \left(\dfrac{S}{\bar{X}}\right) \cdot 100\% = \dfrac{\$5}{\$50} \cdot 100\% = 10\%$

Stock B:
Average price last year = $100
Standard deviation = $5
$CV_B = \left(\dfrac{S}{\bar{X}}\right) \cdot 100\% = \dfrac{\$5}{\$100} \cdot 100\% = 5\%$

Both stocks have the same standard deviation, but
stock B is less variable relative to its price
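The comparison of the two stocks amounts to one small function. The name `coefficient_of_variation` is an illustrative choice.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return 100 * std_dev / mean

cv_a = coefficient_of_variation(std_dev=5, mean=50)    # Stock A
cv_b = coefficient_of_variation(std_dev=5, mean=100)   # Stock B

print(f"CV_A = {cv_a:.0f}%, CV_B = {cv_b:.0f}%")
```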
79
The Empirical Rule

If the data distribution is bell-shaped, then
the interval:
a) $(\mu - \sigma, \mu + \sigma)$ contains about 68.26% of the values in
the population.
b) $(\mu - 2\sigma, \mu + 2\sigma)$ contains about 95.44% of the values
in the population.
c) $(\mu - 3\sigma, \mu + 3\sigma)$ contains about 99.74% of the values
in the population.
80
Example


IQs measured on the Stanford Revision of the Binet–
Simon Intelligence Scale have a mean of 100 points and a
standard deviation of 16 points. The interval:
a) (84, 116) contains about 68.26% of the IQ scores.
b) (68, 132) contains about 95.44% of the IQ scores.
c) (52, 148) contains about 99.74% of the IQ scores.
The scores of 25 randomly selected people are shown
below.
66 82 86 88 91 95 96 96 97
98 101 102 102 104 105 106 111
112 115 116 118 121 124 127 129
a) 18 scores (72%) fall in the interval (84, 116).
b) 24 scores (96%) fall in the interval (68, 132).
c) 25 scores (100%) fall in the interval (52, 148).
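The counts in this example can be reproduced by checking each score against the three intervals. One assumption is needed: the endpoints are treated as inside the interval, since the score of 116 has to be counted to reach 18 in part a).

```python
iq = [66, 82, 86, 88, 91, 95, 96, 96, 97,
      98, 101, 102, 102, 104, 105, 106, 111,
      112, 115, 116, 118, 121, 124, 127, 129]

mu, sigma = 100, 16   # mean and standard deviation of the IQ scale

counts = {}
for k in (1, 2, 3):
    low, high = mu - k * sigma, mu + k * sigma
    # Endpoints counted as inside the interval (assumption, see lead-in).
    counts[k] = sum(low <= score <= high for score in iq)
    print(f"({low}, {high}): {counts[k]} scores ({100 * counts[k] / len(iq):.0f}%)")
```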
81
Exercise

The exam scores for the students in an
introductory statistics course are as follows.
88 90 67 63 64 89 76 90 86 84
85 81 82 96 39 100 75 70 34 96
a) Compute the descriptive statistics for the
given exam scores.
b) Apply the empirical rule and check the
consistency with the sample results. Explain
your conclusion.
82
Measures of Relative Standing
Percentiles
The pth percentile in a data set:
p% are less than or equal to this value
(100 – p)% are greater than or equal to this value
(where 0 ≤ p ≤ 100)

Quartiles
1st quartile = 25th percentile
2nd quartile = 50th percentile = median
3rd quartile = 75th percentile
83
Percentiles

The pth percentile in an ordered array of n
values is the value in ith position, where
$i = \dfrac{p}{100}(n + 1)$

Example: The 60th percentile in an ordered array
of 19 values is the value in 12th position:
$i = \dfrac{p}{100}(n + 1) = \dfrac{60}{100}(19 + 1) = 12$

In Excel, write =percentile(array, k), where
array is the range of data and k is the percentile
value in the range 0-1.
Quartiles

Quartiles split the ranked data into 4 equal
groups: 25% | Q1 | 25% | Q2 | 25% | Q3 | 25%

Example: Find the first quartile
Sample Data in Ordered Array: 11 12 13 16 16 17 18 21 22
(n = 9)
Q1 = 25th percentile, so find the (25/100)(9+1) = 2.5 position
so use the value halfway between the 2nd and 3rd values,
so Q1 = 12.5
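The (p/100)(n + 1) position rule, with interpolation when the position falls between two values, can be sketched as below. The function name `percentile_by_position` and the handling of positions that fall outside the array (clamping to the smallest or largest value) are assumptions, not part of the slides.

```python
def percentile_by_position(ordered, p):
    """Value at position i = (p / 100) * (n + 1) in an ordered array."""
    n = len(ordered)
    position = p * (n + 1) / 100        # 1-based, may be fractional
    whole = int(position)
    fraction = position - whole
    if whole < 1:                       # position before the first value
        return ordered[0]
    if whole >= n:                      # position at or past the last value
        return ordered[-1]
    below, above = ordered[whole - 1], ordered[whole]
    return below + fraction * (above - below)

data = [11, 12, 13, 16, 16, 17, 18, 21, 22]
q1 = percentile_by_position(data, 25)   # position 2.5
print(q1)
```

Spreadsheet and library percentile functions may use a different position rule, so their results can differ slightly from this one.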
85
Interquartile Range and Fences


Difference between the first and third
quartiles
IQR = Q3 – Q1
Inner fences: Located 1.5 × IQR away from
the quartiles:
Q1 – (1.5 × IQR)
Q3 + (1.5 × IQR)
Outer fences: Located 3 × IQR away from the
quartiles:
Q1 – (3 × IQR)
Q3 + (3 × IQR)
86
Outliers

Outliers are measurements that are very different
from other measurements
They are either much larger or much smaller than most
of the other measurements

Outliers lie beyond the fences of the box-and-whiskers plot
Measurements between the inner and outer fences are
mild outliers
Measurements beyond the outer fences are severe
outliers

The adjacent values are:
The smallest data point that falls above the lower fence.
The largest data point that falls below the upper fence.
87
Box and Whisker Plot (Boxplot)

A Graphical display of data using 5-number
summary:
Minimum -- Q1 -- Median -- Q3 -- Maximum

The box plots the:



First quartile (Q1), median (Md), third quartile (Q3).
Inner fences, outer fences
The “whiskers” are dashed lines that plot the range
of the data


A dashed line drawn from the box below Q1 down to
the minimum
Another dashed line drawn from the box above Q3 up
to the maximum.
88
Distribution shapes and boxplots
89
How to construct a Boxplot?
1. Determine the quartiles.
2. Determine the potential outliers and the
adjacent values.
3. Draw a horizontal axis on which the numbers
obtained in Steps 1 and 2 can be located. Above
this axis, mark the quartiles and the adjacent
values with vertical lines.
4. Connect the quartiles to each other to make a
box, and then connect the box to the adjacent
values with lines.
5. Plot the potential outlier with an asterisk.
90
Example: Box-and-Whiskers Plots
91
Example

A sample of 20 people yielded the weekly
viewing times, in hours,
25 41 27 32 43 66 35 31 15 5
34 26 32 38 16 30 38 30 20 21

The five-number summary is
5   24   30.5   35.75   66
IQR=35.75-24=11.75
1.5*IQR=1.5*11.75=17.625
Lower Fence=Q1-1.5*IQR=24-17.625=6.375
Upper Fence=Q3+1.5*IQR=35.75+17.625=53.375
The observations 5 and 66 lie beyond the inner
fences and hence should be classified as outliers.
The adjacent values are 15 and 43.
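The fence calculation can be checked in a few lines. The sketch takes Q1 = 24 and Q3 = 35.75 as given in the example rather than recomputing them, because different quartile rules give slightly different values.

```python
times = [25, 41, 27, 32, 43, 66, 35, 31, 15, 5,
         34, 26, 32, 38, 16, 30, 38, 30, 20, 21]

q1, q3 = 24, 35.75            # quartiles as reported in the example
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points beyond the inner fences are potential outliers.
outliers = sorted(x for x in times if x < lower_fence or x > upper_fence)

# Adjacent values: the most extreme points still inside the fences.
inside = [x for x in times if lower_fence <= x <= upper_fence]
adjacent = (min(inside), max(inside))

print(iqr, lower_fence, upper_fence)
print(outliers, adjacent)
```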
92
Example: Excel output

The distribution of the viewing
times is right skewed with two
outliers.
93
Exercise

IQs were measured on the Stanford Revision of the
Binet–Simon Intelligence Scale. The scores of 25
randomly selected people are shown below.
66 82 86 88 91 95 96 96 97
98 101 102 102 104 105 106 111
112 115 116 118 121 124 127 129
Identify potential outliers, if any, and construct
and interpret a boxplot
94