Statistics for Managers Using Microsoft Excel, 4/e

STATISTICS
for
MANAGERS
Fellowship Course on
Health System Management
A Keshtkar MD, MPH, PhD
Assistant Professor of Epidemiology
Why a Manager Needs to
Know about Statistics
To know how to:

properly present information

draw conclusions about populations based
on sample information

improve processes

obtain reliable forecasts
Key Definitions

A population (universe) is the collection of all
items or things under consideration

A sample is a portion of the population
selected for analysis

A parameter is a summary measure that
describes a characteristic of the population
Population vs. Sample
[Figure: a population of items a through z, with a sample of items (b, c, g, i, n, o, r, u, y) drawn from it]
Measures used to describe the
population are called parameters
For example: population MEAN
Measures computed from
sample data are called
statistics
For example: sample MEAN
Two Branches of Statistics

Descriptive statistics


Collecting, summarizing, and describing data
Inferential statistics

Drawing conclusions and/or making decisions
concerning a population based only on sample
data
Descriptive Statistics
3 major Functions:

Collect data
e.g., Survey
Present data
e.g., Tables and graphs
Characterize data
e.g., Sample mean = $\frac{\sum X_i}{n}$
Inferential Statistics
2 major Functions:

Estimation


e.g., Estimate the population
mean weight using the sample
mean weight
Hypothesis testing

e.g., Test the claim that the
population mean weight is 120
pounds
Drawing conclusions and/or making decisions
concerning a population based on sample results.
Data Sources
Primary (Data Collection)
Observation
Survey
Experimentation
Secondary (Data Compilation)
Print or Electronic
Reasons for Drawing a Sample

Less time consuming than a census

Less costly to administer than a census

Less cumbersome and more practical to
administer than a census of the targeted
population
Types of Sampling Methods

Non-probability Sampling


Items included are chosen without regard to
their probability of occurrence
Probability Sampling

Items in the sample are chosen on the basis
of known probabilities
Types of Samples Used
(continued)
Samples
Non-Probability Samples: Judgement, Quota, Chunk, Convenience
Probability Samples: Simple Random, Stratified, Systematic, Cluster
Probability Sampling

Items in the sample are chosen based on
known probabilities
Probability Samples: Simple Random, Systematic, Stratified, Cluster
Simple Random Samples

Every individual or item from the frame has an
equal chance of being selected

Selection may be with replacement or without
replacement

Samples obtained from table of random
numbers or computer random number
generators
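A minimal sketch of drawing a simple random sample with a computer random number generator, using Python's standard `random` module. The frame of 64 numbered individuals, the function name `simple_random_sample`, and the seed are assumptions made only for illustration.

```python
import random

def simple_random_sample(frame, n, with_replacement=False, seed=None):
    """Draw n items from frame, each item having an equal chance of selection."""
    rng = random.Random(seed)  # seeded generator so a draw can be repeated
    if with_replacement:
        # An item may be selected more than once.
        return [rng.choice(frame) for _ in range(n)]
    # Each item can be selected at most once.
    return rng.sample(frame, n)

# Hypothetical frame: 64 individuals numbered 1 to 64.
frame = list(range(1, 65))
sample_without = simple_random_sample(frame, 8, seed=1)
sample_with = simple_random_sample(frame, 8, with_replacement=True, seed=1)
print(sample_without)
print(sample_with)
```

Sampling without replacement never repeats an individual, while sampling with replacement may.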
Systematic Samples

Decide on sample size: n

Divide frame of N individuals into groups of k
individuals: k=N/n

Randomly select one individual from the 1st
group

Select every kth individual thereafter
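Example: N = 64, n = 8, k = 8 (one individual is selected at random from the first group of 8, then every 8th individual thereafter)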
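The systematic sampling steps can be sketched as follows for N = 64 and n = 8. The numbered frame, the function name `systematic_sample`, and the assumption that N is an exact multiple of n are illustrative choices, not part of the original slides.

```python
import random

def systematic_sample(frame, n, seed=None):
    """Pick a random start in the first group of k items, then every kth item."""
    N = len(frame)
    if n <= 0 or N % n != 0:
        raise ValueError("this sketch assumes N is an exact multiple of n")
    k = N // n                      # group size: k = N / n
    rng = random.Random(seed)
    start = rng.randrange(k)        # random position within the first group
    return [frame[start + i * k] for i in range(n)]

# Hypothetical frame: 64 individuals numbered 1 to 64, so k = 8.
frame = list(range(1, 65))
sample = systematic_sample(frame, 8, seed=0)
print(sample)
```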
Stratified Samples

Divide population into two or more subgroups (called
strata) according to some common characteristic

A simple random sample is selected from each subgroup,
with sample sizes proportional to strata sizes

Samples from subgroups are combined into one
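A rough sketch of proportional stratified sampling: a simple random sample is drawn from each stratum, with a size proportional to the stratum size, and the pieces are combined. The four strata and their sizes are made up for illustration, and rounding the per-stratum sizes is an assumption that may not add up exactly to n for other inputs.

```python
import random

def stratified_sample(strata, n, seed=None):
    """Draw a proportional simple random sample from each stratum and combine."""
    rng = random.Random(seed)
    total = sum(len(items) for items in strata.values())
    combined = []
    for name, items in strata.items():
        n_h = round(n * len(items) / total)   # sample size proportional to stratum size
        combined.extend(rng.sample(items, n_h))
    return combined

# Hypothetical population of 100 items divided into 4 strata.
strata = {
    "A": [("A", i) for i in range(40)],
    "B": [("B", i) for i in range(30)],
    "C": [("C", i) for i in range(20)],
    "D": [("D", i) for i in range(10)],
}
sample = stratified_sample(strata, 10, seed=0)
print(sample)
```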
[Figure: a population divided into 4 strata, with a sample drawn from each stratum]
Cluster Samples


Population is divided into several “clusters,”
each representative of the population
A simple random sample of clusters is selected

All items in the selected clusters can be used, or items can be
chosen from a cluster using another probability sampling
technique
[Figure: a population divided into 16 clusters, with randomly selected clusters forming the sample]
Advantages and Disadvantages

Simple random sample and systematic sample
Simple to use
May not be a good representation of the population’s
underlying characteristics
Stratified sample
Ensures representation of individuals across the
entire population
Cluster sample
More cost effective
Less efficient (need larger sample to acquire the
same level of precision)
Types of Data
Data
Categorical (Defined categories)
Examples: Marital Status, Political Party, Eye Color
Numerical
Discrete (Counted items)
Examples: Number of Children, Defects per hour
Continuous (Measured characteristics)
Examples: Weight, Voltage
Levels of Measurement
and Measurement Scales
Ratio Data: differences between measurements, true zero exists
Interval Data: differences between measurements but no true zero
(Ratio and interval data: highest level, strongest forms of measurement)
Ordinal Data: ordered categories (rankings, order, or scaling)
(Ordinal data: higher level)
Nominal Data: categories (no ordering or direction)
(Nominal data: lowest level, weakest form of measurement)
Definition of SURVEY
A “survey” is a type of study that usually has two
characteristics:
1. Representativeness is an important goal
2. The data collection tool is a questionnaire, and the
data collection method is an interview (questioning
and answering)
Evaluating Survey Worthiness






What is the purpose of the survey?
Is the survey based on a probability sample?
Coverage error – appropriate frame?
Non-response error – follow up
Measurement error – good questions elicit good
responses
Sampling error – always exists
Types of Survey Errors

Coverage error or selection bias
Exists if some groups are excluded from the frame and have no
chance of being selected
Non-response error or bias
People who do not respond may be different from those who do
respond
Sampling error
Variation from sample to sample will always exist
Measurement error
Due to weaknesses in question design, respondent error, and
interviewer’s effects on the respondent
Types of Survey Errors
(continued)

Coverage error: excluded from frame
Non-response error: follow up on nonresponses
Sampling error: random differences from sample to sample
Measurement error: bad or leading question
Organizing and Presenting
Data Graphically

Data in raw form are usually not easy to use
for decision making

Some type of organization is needed



Table
Graph
Techniques reviewed here:



Frequency Distributions and Histograms
Bar charts and pie charts
Contingency tables
Tables and Charts for
Numerical Data
Numerical Data
Continuous Data
Discrete Data
Line or
Polygon
Frequency Distributions and
Cumulative Distributions
Histogram
Polygon
Box
plot
Tabulating Numerical Data:
Frequency Distributions
What is a Frequency Distribution?

A frequency distribution is a list or a table …

containing class groupings (categories or
ranges within which the data falls) ...

and the corresponding frequencies with which
data falls within each grouping or category
Why Use Frequency Distributions?

A frequency distribution is a way to
summarize data

The distribution condenses the raw data
into a more useful form...

and allows for a quick visual interpretation
of the data
Class Intervals
and Class Boundaries


Each class grouping has the same width
Determine the width of each interval by
$\text{Width of interval} \approx \frac{\text{range}}{\text{number of desired class groupings}}$
Use at least 5 but no more than 15 groupings
Class boundaries never overlap
Round up the interval width to get desirable
endpoints
Frequency Distribution Example
Example: A manufacturer of insulation randomly
selects 20 winter days and records the daily
high temperature
24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
32, 13, 12, 38, 41, 43, 44, 27, 53, 27
Frequency Distribution Example
(continued)

Sort raw data in ascending order:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

Find range: 58 - 12 = 46

Select number of classes: 5 (usually between 5 and 15)

Compute class interval (width): 10 (46/5 then round up)

Determine class boundaries (limits): 10, 20, 30, 40, 50, 60

Compute class midpoints: 15, 25, 35, 45, 55

Count observations & assign to classes
Frequency Distribution Example
(continued)
Data in ordered array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
Class                  Frequency   Relative Frequency   Percentage
10 but less than 20        3             .15               15
20 but less than 30        6             .30               30
30 but less than 40        5             .25               25
40 but less than 50        4             .20               20
50 but less than 60        2             .10               10
Total                     20            1.00              100
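The tabulation can be reproduced with a short script. This is a sketch that assumes the class scheme used in the example (lower boundary 10, width 10, five classes); the function name `frequency_distribution` is chosen here for illustration.

```python
temps = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
         32, 13, 12, 38, 41, 43, 44, 27, 53, 27]

def frequency_distribution(data, lower, width, n_classes):
    """Count observations in classes of the form 'lo but less than lo + width'."""
    counts = [0] * n_classes
    for x in data:
        idx = int((x - lower) // width)       # index of the class containing x
        if not 0 <= idx < n_classes:
            raise ValueError(f"value {x} falls outside the class boundaries")
        counts[idx] += 1
    n = len(data)
    rows = []
    for i, freq in enumerate(counts):
        lo = lower + i * width
        rows.append((lo, lo + width, freq, freq / n))   # (lower, upper, frequency, relative frequency)
    return rows

rows = frequency_distribution(temps, lower=10, width=10, n_classes=5)
for lo, hi, freq, rel in rows:
    print(f"{lo} but less than {hi}: frequency {freq}, relative {rel:.2f}, percentage {rel * 100:.0f}")
```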
Graphing Numerical Data:
The Histogram

A graph of the data in a frequency distribution
is called a histogram

The class boundaries (or class midpoints)
are shown on the horizontal axis

The vertical axis is either frequency, relative
frequency, or percentage

Bars of the appropriate heights are used to
represent the number of observations within
each class
Histogram Example
Class                  Midpoint   Frequency
10 but less than 20       15          3
20 but less than 30       25          6
30 but less than 40       35          5
40 but less than 50       45          4
50 but less than 60       55          2
[Figure: Histogram: Daily High Temperature. Horizontal axis: Class Midpoints; vertical axis: Frequency. No gaps between bars.]
Histograms in Excel
1. Select Tools/Data Analysis
Histograms in Excel
(continued)
2. Choose Histogram
3. Input data range and bin range (bin range is a cell
range containing the upper class boundaries for each
class grouping)
Select Chart Output and click “OK”
Questions for Grouping Data
into Classes

1. How wide should each interval be?
(How many classes should be used?)

2. How should the endpoints of the
intervals be determined?



Often answered by trial and error, subject to
user judgment
The goal is to create a distribution that is
neither too "jagged" nor too "blocky"
Goal is to appropriately show the pattern of
variation in the data
How Many Class Intervals?
Many (Narrow class intervals)
may yield a very jagged distribution
with gaps from empty classes
Can give a poor indication of how
frequency varies across classes
[Figure: histogram of Temperature with many narrow classes; vertical axis: Frequency]
Few (Wide class intervals)
may compress variation too much and
yield a blocky distribution
can obscure important patterns of
variation.
[Figure: histogram of Temperature with few wide classes; vertical axis: Frequency]
(X axis labels are upper class endpoints)
Graphing Numerical Data:
The Frequency Polygon
Class                  Midpoint   Frequency
10 but less than 20       15          3
20 but less than 30       25          6
30 but less than 40       35          5
40 but less than 50       45          4
50 but less than 60       55          2
[Figure: Frequency Polygon: Daily High Temperature. Horizontal axis: Class Midpoints; vertical axis: Frequency.]
(In a percentage polygon the vertical axis would be defined to
show the percentage of observations per class)
Tabulating Numerical Data:
Cumulative Frequency
Data in ordered array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
Class                  Frequency   Percentage   Cumulative Frequency   Cumulative Percentage
10 but less than 20        3           15                 3                     15
20 but less than 30        6           30                 9                     45
30 but less than 40        5           25                14                     70
40 but less than 50        4           20                18                     90
50 but less than 60        2           10                20                    100
Total                     20          100
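The cumulative columns are running totals of the class frequencies. A small sketch using `itertools.accumulate` from the Python standard library, with the class frequencies from the temperature example:

```python
from itertools import accumulate

frequencies = [3, 6, 5, 4, 2]                  # class frequencies, lowest class first
cum_freq = list(accumulate(frequencies))       # running total of the frequencies
total = cum_freq[-1]                           # last running total is the sample size
cum_pct = [100 * c / total for c in cum_freq]  # cumulative percentage per class

for f, cf, cp in zip(frequencies, cum_freq, cum_pct):
    print(f"frequency {f}, cumulative frequency {cf}, cumulative percentage {cp:.0f}")
```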
Graphing Cumulative Frequencies:
The Ogive (Cumulative % Polygon)
Class boundary   Cumulative percentage of values below the boundary
10                  0
20                 15
30                 45
40                 70
50                 90
60                100
[Figure: Ogive: Daily High Temperature. Horizontal axis: Class Boundaries (Not Midpoints); vertical axis: Cumulative Percentage.]
Scatter Diagrams

Scatter Diagrams are used for
bivariate numerical data


Bivariate data consists of paired
observations taken from two numerical
variables
The Scatter Diagram:
 one variable is measured on the vertical
axis and the other variable is measured
on the horizontal axis
Scatter Diagram Example
Volume per day   Cost per day
23               125
26               140
29               146
33               160
38               167
42               170
50               188
55               195
60               200
[Figure: scatter diagram, Cost per Day vs. Production Volume. Horizontal axis: Volume per Day; vertical axis: Cost per Day.]
Scatter Diagrams in Excel
1. Select the chart wizard
2. Select XY(Scatter) option, then click “Next”
3. When prompted, enter the data range, desired
legend, and desired destination to complete
the scatter diagram
Tables and Charts for
Categorical Data
Categorical Data
Tabulating Data: Summary Table
Graphing Data: Bar Charts, Pie Charts, Pareto Diagram
The Summary Table
Summarize data by category
Example: Current Investment Portfolio
Investment Type   Amount (in thousands $)   Percentage (%)
Stocks                   46.5                   42.27
Bonds                    32.0                   29.09
CD                       15.5                   14.09
Savings                  16.0                   14.55
Total                   110.0                  100.0
(Variables are Categorical)
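The percentage column of a summary table is each category's amount divided by the total. A minimal sketch with the portfolio amounts from the example (in thousands of dollars); the variable names are illustrative.

```python
# Amounts in thousands of dollars, by investment type.
amounts = {"Stocks": 46.5, "Bonds": 32.0, "CD": 15.5, "Savings": 16.0}

total = sum(amounts.values())
# Percentage of the total for each category, rounded to two decimals.
percentages = {name: round(100 * amount / total, 2) for name, amount in amounts.items()}

for name, pct in percentages.items():
    print(f"{name}: {amounts[name]} ({pct}%)")
print(f"Total: {total}")
```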
Bar and Pie Charts

Bar charts and Pie charts are often used
for qualitative (category) data

Height of bar or size of pie slice shows the
frequency or percentage for each
category
Bar Chart Example
Current Investment Portfolio
[Figure: horizontal bar chart, Investor's Portfolio, for the investment data in the summary table. Categories: Savings, CD, Bonds, Stocks; horizontal axis: Amount in $1000's.]
Pie Chart Example
Current Investment Portfolio
[Figure: pie chart for the investment data in the summary table. Stocks 42%, Bonds 29%, Savings 15%, CD 14%. Percentages are rounded to the nearest percent.]
Pareto Diagram

Used to portray categorical data

A bar chart, where categories are shown in
descending order of frequency

A cumulative polygon is often shown in the
same graph

Used to separate the “vital few” from the “trivial
many”
Pareto Diagram Example
[Figure: Pareto diagram, Current Investment Portfolio. The bar graph shows % invested in each category in descending order (Stocks, Bonds, Savings, CD); the line graph shows cumulative % invested.]
Tabulating and Graphing
Multivariate Categorical Data

Contingency Table for Investment Choices ($1000’s)
Investment Category   Investor A   Investor B   Investor C   Total
Stocks                   46.5          55          27.5       129
Bonds                    32.0          44          19.0        95
CD                       15.5          20          13.5        49
Savings                  16.0          28           7.0        51
Total                   110.0         147          67.0       324
(Individual values could also be expressed as percentages of the overall total,
percentages of the row totals, or percentages of the column totals)
Tabulating and Graphing
Multivariate Categorical Data
(continued)

Side by side bar charts
[Figure: side-by-side bar chart, Comparing Investors. Categories: Savings, CD, Bonds, Stocks; one bar each for Investor A, Investor B, and Investor C.]
Side-by-Side Chart Example

Sales by quarter for three sales territories:
         1st Qtr   2nd Qtr   3rd Qtr   4th Qtr
East      20.4      27.4      59        20.4
West      30.6      38.6      34.6      31.6
North     45.9      46.9      45        43.9
[Figure: side-by-side bar chart of sales by quarter (1st Qtr to 4th Qtr) for East, West, and North]
Principles of Graphical Excellence





Present data in a way that provides substance,
statistics and design
Communicate complex ideas with clarity,
precision and efficiency
Give the largest number of ideas in the most
efficient manner
Excellence almost always involves several
dimensions
Tell the truth about the data
Errors in Presenting Data


Using “chart junk”
Failing to provide a relative basis in comparing data
between groups
Compressing or distorting the vertical axis
Providing no zero point on the vertical axis
Chart Junk
[Figure: Bad Presentation and Good Presentation of Minimum Wage (1960: $1.00, 1970: $1.60, 1980: $3.10, 1990: $3.80). The good presentation plots $ against year, 1960 to 1990.]
No Relative Basis
[Figure: A’s received by students, by class (FR, SO, JR, SR). Bad Presentation: vertical axis shows frequency (Freq.). Good Presentation: vertical axis shows percentage (%).]
FR = Freshmen, SO = Sophomore, JR = Junior, SR = Senior
Compressing Vertical Axis
[Figure: Quarterly Sales in $ for Q1 to Q4. Bad Presentation: vertical axis runs from 0 to 200. Good Presentation: vertical axis runs from 0 to 50.]
No Zero Point On Vertical Axis
[Figure: Monthly Sales in $, graphing the first six months of sales (J, F, M, A, M, J). Bad Presentation: vertical axis labeled 36 to 45 with no zero point. Good Presentations: vertical axis labeled 36 to 45 with a zero point shown, or vertical axis running from 0 to 60.]
Different Measures for Describing Data

Measures of central tendency, variation, and
shape





Mean, median, mode, geometric mean
Quartiles
Range, interquartile range (IQR), variance and
standard deviation, coefficient of variation (CV)
Symmetric and skewed distributions
Population summary measures



Mean, variance, and standard deviation
Normal Distribution versus Non-normal Distribution
The empirical ND rule and Chebyshev rule
Summary Measures
Describing Data Numerically
Central Tendency: Arithmetic Mean, Median, Mode, Geometric Mean
Quartiles
Variation: Range, Interquartile Range, Variance, Standard Deviation, Coefficient of Variation
Shape: Skewness
Measures of Central Tendency
Overview
Central Tendency
Arithmetic Mean: $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$
Median: midpoint of ranked values
Mode: most frequently observed value
Geometric Mean: $\bar{X}_G = (X_1 \times X_2 \times \cdots \times X_n)^{1/n}$
Arithmetic Mean

The arithmetic mean (mean) is the most
common measure of central tendency

For a sample of size n:
$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}$
where n is the sample size and $X_1, \ldots, X_n$ are the observed values
Arithmetic Mean
(continued)



The most common measure of central tendency
Mean = sum of values divided by the number of values
Affected by extreme values (outliers)
Values 1, 2, 3, 4, 5: Mean = $\frac{1 + 2 + 3 + 4 + 5}{5} = \frac{15}{5} = 3$
Values 1, 2, 3, 4, 10: Mean = $\frac{1 + 2 + 3 + 4 + 10}{5} = \frac{20}{5} = 4$
Median

In an ordered array, the median is the “middle”
number (50% above, 50% below)
[Figure: two data sets plotted on a scale from 0 to 10; Median = 3 in both]

Not affected by extreme values
Finding the Median

The location of the median:
$\text{Median position} = \frac{n+1}{2}$ position in the ordered data
If the number of values is odd, the median is the middle number
If the number of values is even, the median is the average of
the two middle numbers
Note that $\frac{n+1}{2}$ is not the value of the median, only the
position of the median in the ranked data
Mode






A measure of central tendency
Value that occurs most often
Not affected by extreme values
Used for either numerical or categorical data
There may be no mode
There may be several modes
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Mode = 9
0 1 2 3 4 5 6
No Mode
Review Example

Five houses on a hill by the beach
House Prices:
$2,000,000
500,000
300,000
100,000
100,000
Review Example:
Summary Statistics
House Prices:
$2,000,000
500,000
300,000
100,000
100,000

Mean: sum of values divided by the number of values
Sum 3,000,000
($3,000,000/5)
= $600,000

Median: middle value of ranked data
= $300,000

Mode: most frequent value
= $100,000
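The three summary statistics for the house prices can be checked with Python's standard `statistics` module. This is only a sketch; the variable names are illustrative.

```python
import statistics

house_prices = [2_000_000, 500_000, 300_000, 100_000, 100_000]

mean_price = statistics.mean(house_prices)      # sum of values / number of values
median_price = statistics.median(house_prices)  # middle value of the ranked data
mode_price = statistics.mode(house_prices)      # most frequent value

print(mean_price, median_price, mode_price)
```

The single $2,000,000 house pulls the mean well above the median, which is why the median is often preferred when outliers are present.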
Which measure of location
is the “best”?

Mean is generally used, unless
extreme values (outliers) exist

Then median is often used, since
the median is not sensitive to
extreme values.

Example: Median home prices may be
reported for a region – less sensitive to
outliers
Geometric Mean

Geometric mean
Used to measure the rate of change of a variable
over time
$\bar{X}_G = (X_1 \times X_2 \times \cdots \times X_n)^{1/n}$
Geometric mean rate of return
Measures the status of an investment over time
$\bar{R}_G = [(1 + R_1) \times (1 + R_2) \times \cdots \times (1 + R_n)]^{1/n} - 1$
Where $R_i$ is the rate of return in time period i
Example
An investment of $100,000 declined to $50,000 at the
end of year one and rebounded to $100,000 at end
of year two:
$X_1 = \$100,000$
$X_2 = \$50,000$ (50% decrease)
$X_3 = \$100,000$ (100% increase)
The overall two-year return is zero, since it started and
ended at the same level.
Example
(continued)
Use the 1-year returns to compute the arithmetic
mean and the geometric mean:
Arithmetic mean rate of return:
$\bar{X} = \frac{(-50\%) + (100\%)}{2} = 25\%$ (misleading result)
Geometric mean rate of return:
$\bar{R}_G = [(1 + R_1) \times (1 + R_2) \times \cdots \times (1 + R_n)]^{1/n} - 1$
$= [(1 + (-50\%)) \times (1 + (100\%))]^{1/2} - 1$
$= [(.50) \times (2)]^{1/2} - 1 = 1^{1/2} - 1 = 0\%$ (more accurate result)
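A short sketch comparing the two averages for the investment example, with the yearly returns written as decimals (-0.50 and 1.00). The function name `geometric_mean_return` is chosen here for illustration.

```python
import math

def geometric_mean_return(returns):
    """Geometric mean rate of return: [(1 + R1) * ... * (1 + Rn)]^(1/n) - 1."""
    growth = math.prod(1 + r for r in returns)   # overall growth factor
    return growth ** (1 / len(returns)) - 1

returns = [-0.50, 1.00]   # 50% decrease, then 100% increase

arithmetic = sum(returns) / len(returns)
geometric = geometric_mean_return(returns)

print(f"arithmetic mean rate of return: {arithmetic:.0%}")
print(f"geometric mean rate of return:  {geometric:.0%}")
```

The geometric mean agrees with the fact that the investment ended where it started, while the arithmetic mean suggests a gain that never happened.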
Quartiles

Quartiles split the ranked data into 4 segments with
an equal number of values per segment
[Figure: ranked data split into four segments of 25% each by Q1, Q2, and Q3]
The first quartile, Q1, is the value for which 25% of the
observations are smaller and 75% are larger
Q2 is the same as the median (50% are smaller, 50% are
larger)
Only 25% of the observations are greater than the third
quartile
Quartile Formulas
Find a quartile by determining the value in the
appropriate position in the ranked data, where
First quartile position: Q1 = (n+1)/4
Second quartile position: Q2 = (n+1)/2 (the median position)
Third quartile position: Q3 = 3(n+1)/4
where n is the number of observed values
Quartiles

Example: Find the first quartile
Sample Data in Ordered Array: 11 12 13 16 16 17 18 21 22
(n = 9)
Q1 is in the (9+1)/4 = 2.5 position of the ranked data,
so use the value halfway between the 2nd and 3rd values:
Q1 = 12.5
Q1 and Q3 are measures of noncentral location
Q2 = median, a measure of central tendency
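The position formulas can be sketched in code as below. Linear interpolation between neighbouring ranked values is an assumption made here for fractional positions; it matches the "halfway" rule used for position 2.5, but other texts and software use different conventions. The function names are illustrative.

```python
def value_at_position(ranked, pos):
    """Value at a 1-based, possibly fractional, position in ranked data."""
    pos = min(max(pos, 1), len(ranked))    # keep the position inside the data
    lower = int(pos)                       # whole part of the position
    frac = pos - lower                     # fractional part of the position
    if frac == 0:
        return ranked[lower - 1]
    # Interpolate between the two neighbouring ranked values (assumed convention).
    return ranked[lower - 1] + frac * (ranked[lower] - ranked[lower - 1])

def quartiles(data):
    """Q1, Q2, Q3 using positions (n+1)/4, (n+1)/2 and 3(n+1)/4."""
    ranked = sorted(data)
    n = len(ranked)
    return tuple(value_at_position(ranked, q * (n + 1) / 4) for q in (1, 2, 3))

sample = [11, 12, 13, 16, 16, 17, 18, 21, 22]
q1, q2, q3 = quartiles(sample)
print(q1, q2, q3)
```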
Measures of Variation
Variation: Range, Interquartile Range, Variance, Standard Deviation, Coefficient of Variation
Measures of variation give
information on the spread
or variability of the data
values.
[Figure: two distributions with the same center, different variation]
Range


Simplest measure of variation
Difference between the largest and the smallest
observations:
Range = Xlargest – Xsmallest
Example (values plotted on a scale from 0 to 14; smallest = 1, largest = 14):
Range = 14 - 1 = 13
Disadvantages of the Range

Ignores the way in which data are distributed
[Figure: two differently distributed data sets on a scale from 7 to 12; each has Range = 12 - 7 = 5]
Sensitive to outliers
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
Interquartile Range

Can eliminate some outlier problems by using
the interquartile range

Eliminate some high- and low-valued
observations and calculate the range from the
remaining values

Interquartile range = 3rd quartile – 1st quartile
= Q3 – Q1
Interquartile Range
Example:
X minimum = 12, Q1 = 30, Median (Q2) = 45, Q3 = 57, X maximum = 70
(25% of the values fall in each of the four segments)
Interquartile range
= 57 – 30 = 27
Variance

Average (approximately) of squared deviations
of values from the mean
Sample variance:
$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}$
Where
$\bar{X}$ = arithmetic mean
n = sample size
$X_i$ = ith value of the variable X
Standard Deviation



Most commonly used measure of variation
Shows variation about the mean
Has the same units as the original data

Sample standard deviation:
$S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}}$
Calculation Example:
Sample Standard Deviation
Sample Data ($X_i$): 10 12 14 15 17 18 18 24
n = 8
Mean = $\bar{X}$ = 16
$S = \sqrt{\frac{(10 - \bar{X})^2 + (12 - \bar{X})^2 + (14 - \bar{X})^2 + \cdots + (24 - \bar{X})^2}{n-1}}$
$= \sqrt{\frac{(10 - 16)^2 + (12 - 16)^2 + (14 - 16)^2 + \cdots + (24 - 16)^2}{8-1}}$
$= \sqrt{\frac{130}{7}} = 4.3095$
A measure of the “average”
scatter around the mean
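The calculation can be checked step by step. For the eight sample values the squared deviations from the mean of 16 are 36, 16, 4, 1, 1, 4, 4, and 64, which sum to 130. A minimal sketch, with variable names chosen for illustration:

```python
import math

data = [10, 12, 14, 15, 17, 18, 18, 24]

n = len(data)
mean = sum(data) / n                                 # arithmetic mean
sum_sq_dev = sum((x - mean) ** 2 for x in data)      # sum of squared deviations
sample_variance = sum_sq_dev / (n - 1)               # divide by n - 1 for a sample
sample_std = math.sqrt(sample_variance)

print(f"mean = {mean}, sum of squared deviations = {sum_sq_dev}")
print(f"S^2 = {sample_variance:.4f}, S = {sample_std:.4f}")
```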
Measuring variation
Small standard deviation
Large standard deviation
Comparing Standard Deviations
Data A: Mean = 15.5, S = 3.338
Data B: Mean = 15.5, S = 0.926
Data C: Mean = 15.5, S = 4.570
[Figure: three dot plots on a scale from 11 to 21, one per data set]
Advantages of Variance and
Standard Deviation

Each value in the data set is used in the
calculation

Values far from the mean are given extra
weight
(because deviations from the mean are squared)
Coefficient of Variation

Measures relative variation
Always in percentage (%)
Shows variation relative to mean
Can be used to compare two or more sets of
data measured in different units
$CV = \left(\frac{S}{\bar{X}}\right) \cdot 100\%$
Comparing Coefficient
of Variation

Stock A:
Average price last year = $50
Standard deviation = $5
$CV_A = \left(\frac{S}{\bar{X}}\right) \cdot 100\% = \frac{\$5}{\$50} \cdot 100\% = 10\%$
Stock B:
Average price last year = $100
Standard deviation = $5
$CV_B = \left(\frac{S}{\bar{X}}\right) \cdot 100\% = \frac{\$5}{\$100} \cdot 100\% = 5\%$
Both stocks have the same standard deviation, but
stock B is less variable relative to its price
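A small sketch of the stock comparison, using the means and standard deviations given in the example; the function name `coefficient_of_variation` is illustrative.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation as a percentage of the mean."""
    if mean == 0:
        raise ValueError("the coefficient of variation is undefined for a mean of 0")
    return 100.0 * std_dev / mean

cv_a = coefficient_of_variation(std_dev=5, mean=50)    # Stock A
cv_b = coefficient_of_variation(std_dev=5, mean=100)   # Stock B

print(f"CV of stock A: {cv_a:.0f}%")
print(f"CV of stock B: {cv_b:.0f}%")
```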
Shape of a Distribution

Describes how data is distributed

Measures of shape

Symmetric or skewed
Left-Skewed: Mean < Median
Symmetric: Mean = Median
Right-Skewed: Median < Mean
Using Microsoft Excel

Descriptive Statistics can be obtained
from Microsoft® Excel

Use menu choice:
tools / data analysis / descriptive statistics

Enter details in dialog box
Using Excel
Use menu choice:

tools / data analysis /
descriptive statistics
Using Excel
(continued)

Enter dialog box
details

Check box for
summary statistics

Click OK
Excel output
Microsoft Excel
descriptive statistics output,
using the house price data:
House Prices:
$2,000,000
500,000
300,000
100,000
100,000
Population Summary Measures

Population summary measures are called parameters

The population mean is the sum of the values in the
population divided by the population size, N
$\mu = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{X_1 + X_2 + \cdots + X_N}{N}$
Where
μ = population mean
N = population size
$X_i$ = ith value of the variable X
Population Variance

Average of squared deviations of values from
the mean
Population variance:
$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$
Where
μ = population mean
N = population size
$X_i$ = ith value of the variable X
Population Standard Deviation



Most commonly used measure of variation
Shows variation about the mean
Has the same units as the original data

Population standard deviation:
$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$
The Empirical Rule


If the data distribution is bell-shaped, then
the interval:
μ ± 1σ contains about 68% of the values in
the population or the sample
The Empirical Rule


μ ± 2σ contains about 95% of the values in
the population or the sample
μ ± 3σ contains about 99.7% of the values
in the population or the sample
Chebyshev Rule

Regardless of how the data are distributed,
at least (1 - 1/k^2) of the values will fall within
k standard deviations of the mean (for k > 1)

Examples:
At least               within
(1 - 1/1^2) = 0%       k=1 (μ ± 1σ)
(1 - 1/2^2) = 75%      k=2 (μ ± 2σ)
(1 - 1/3^2) = 89%      k=3 (μ ± 3σ)
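A sketch that computes the Chebyshev lower bound and compares it with the fraction actually observed in a small, strongly skewed data set. The eleven values are used here purely as an illustration, treated as a whole population; the function names are assumptions.

```python
import statistics

def chebyshev_bound(k):
    """Minimum fraction of values within k standard deviations of the mean."""
    if k <= 0:
        raise ValueError("k must be positive")
    return max(0.0, 1 - 1 / k ** 2)

def fraction_within(data, k):
    """Observed fraction of values within k population standard deviations of the mean."""
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)
    inside = [x for x in data if abs(x - mu) <= k * sigma]
    return len(inside) / len(data)

data = [0, 2, 2, 2, 3, 3, 4, 5, 5, 10, 27]   # illustrative right-skewed values

for k in (1, 2, 3):
    print(f"k={k}: bound {chebyshev_bound(k):.2%}, observed {fraction_within(data, k):.2%}")
```

The observed fractions are at or above the bound even though the data are far from bell-shaped, which is the point of the rule.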
Exploratory Data Analysis

Box-and-Whisker Plot: A Graphical display of
data using 5-number summary:
Minimum -- Q1 -- Median -- Q3 -- Maximum
Example:
[Figure: box-and-whisker plot marking Minimum, 1st Quartile, Median, 3rd Quartile, and Maximum, with 25% of the values in each of the four segments]
Shape of Box-and-Whisker Plots

The Box and central line are centered between the
endpoints if data are symmetric around the median
[Figure: box-and-whisker plot labeled Min, Q1, Median, Q3, Max]
A Box-and-Whisker plot can be shown in either vertical
or horizontal format
Distribution Shape and
Box-and-Whisker Plot
[Figure: box-and-whisker plots, each marking Q1, Q2, and Q3, for Left-Skewed, Symmetric, and Right-Skewed distributions]
Box-and-Whisker Plot Example

Below is a Box-and-Whisker plot for the following
data:
0 2 2 2 3 3 4 5 5 10 27
Min = 0, Q1 = 2, Q2 = 3, Q3 = 5, Max = 27
[Figure: box-and-whisker plot of these data]
This data is right skewed, as the plot depicts
The Sample Covariance

The sample covariance measures the strength of the
linear relationship between two variables (called
bivariate data)
The sample covariance:
$\mathrm{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n-1}$
Only concerned with the strength of the relationship
No causal effect is implied
Interpreting Covariance

Covariance between two random variables:
cov(X,Y) > 0
X and Y tend to move in the same direction
cov(X,Y) < 0
X and Y tend to move in opposite directions
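cov(X,Y) = 0
X and Y have no linear relationship (independent variables have zero covariance, but zero covariance does not by itself imply independence)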
Coefficient of Correlation

Measures the relative strength of the linear
relationship between two variables
Sample coefficient of correlation:
$r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \sum_{i=1}^{n} (Y_i - \bar{Y})^2}} = \frac{\mathrm{cov}(X, Y)}{S_X S_Y}$
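A sketch of both formulas applied to the production volume and cost per day data from the scatter diagram example. The function names are illustrative, and no particular value of r is claimed here beyond what the code prints.

```python
import math

volume = [23, 26, 29, 33, 38, 42, 50, 55, 60]            # volume per day
cost = [125, 140, 146, 160, 167, 170, 188, 195, 200]     # cost per day

def sample_covariance(x, y):
    """Sum of cross-products of deviations, divided by n - 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def sample_correlation(x, y):
    """r = cov(X, Y) / (S_X * S_Y)."""
    s_x = math.sqrt(sample_covariance(x, x))   # sample standard deviation of X
    s_y = math.sqrt(sample_covariance(y, y))   # sample standard deviation of Y
    return sample_covariance(x, y) / (s_x * s_y)

cov_xy = sample_covariance(volume, cost)
r = sample_correlation(volume, cost)
print(f"cov(X, Y) = {cov_xy:.2f}, r = {r:.3f}")
```

Unlike the covariance, r is unit free, so changing the units of cost or volume leaves it unchanged.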
Features of
Correlation Coefficient, r

Unit free

Ranges between –1 and 1

The closer to –1, the stronger the negative linear
relationship

The closer to 1, the stronger the positive linear
relationship

The closer to 0, the weaker the linear
relationship
Scatter Plots of Data with Various
Correlation Coefficients
[Figure: scatter plots of Y against X for r = -1, r = -.6, r = +.3, r = +1, and two examples with r = 0]
Using Excel to Find
the Correlation Coefficient

Select
Tools/Data Analysis

Choose Correlation from
the selection menu

Click OK . . .
Using Excel to Find
the Correlation Coefficient
(continued)


Input data range and select
appropriate options
Click OK to get output
Interpreting the Result

r = .733
There is a relatively
strong positive linear
relationship between
test score #1
and test score #2
Students who scored high on the first test tended
to score high on second test
[Figure: Scatter Plot of Test Scores. Horizontal axis: Test #1 Score; vertical axis: Test #2 Score; both axes run from 70 to 100.]
Pitfalls in Numerical
Descriptive Measures

Data analysis is objective


Should report the summary measures that best meet
the assumptions about the data set
Data interpretation is subjective

Should be done in fair, neutral and clear manner
Ethical Considerations
Numerical descriptive measures:



Should document both good and bad results
Should be presented in a fair, objective and
neutral manner
Should not use inappropriate summary
measures to distort facts