Course Number B01.1305
Course Section 60
Meeting Time Monday 6-9:30 pm
Introduction to the instructor
Introduction to the class
• Review of syllabus
• Introduction to statistics
• Class Goals
Types of data
Graphical and numerical methods for univariate series
Minitab Tutorial
Professor S. D. Balkin -- May 20, 2002
Ph.D. in Business Administration, Penn State
Masters in Statistics, Penn State
Mathematics/Economics and Music, Lafayette College
Employment
• Pfizer Inc.
– Management Science Group; Sept. 2001 – current
• Ernst & Young
– Quantitative Economics and Statistics Group; June 1999 – August 2001
STATISTICS: A body of principles and methods for extracting useful information from data, for assessing the reliability of that information, for measuring and managing risk, and for making decisions in the face of uncertainty.
POPULATION: set of measurements corresponding to the entire collection of units
SAMPLE: set of measurements that are collected from a population
OBJECTIVES:
• To make inferences about a population from a sample, including the extent of uncertainty
• Design the data collection process to facilitate drawing valid inferences
Typically due to prohibitive cost of contacting millions of people or performing costly experiments
• Election polls query about 2,000 voters to make inferences regarding how all voters cast their ballots
Sometimes the sampling process is destructive
• Sampling wine quality
Monthly Unemployment Rates (BLS)
Consumer Price Index
Presidential Approval Rating
Quality and Productivity Improvement
Scientific Inquiry
• Training effectiveness
• Advertising impact
“Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write.”
– (H. G. Wells)
“There are three kinds of lies -- Lies, damn lies, and statistics.”
– (Benjamin Disraeli)
“You’ve got to know when to hold ‘em, know when to fold ‘em.”
– (Kenny Rogers, in The Gambler)
“The average U. S. household has 2.75 people in it.”
– (U. S. Census Bureau, 1980)
“4 out of 5 dentists surveyed recommended Trident Sugarless Gum for their patients who chew gum.”
– (Advertisement for Trident)
Understanding data
• Intro to descriptive statistics, interpreting data, and graphical methods
Dealing with and quantifying uncertainty
• Random variables and probability
Using samples to make generalizations about populations
• Assessing whether a change in data is beyond random variation
Modeling relationships and predicting
• Using sample data to create models that give predictions for all values of a population
To gain an understanding of descriptive statistics, probability, statistical inference, and regression analysis so that it may be applied to your job
To be able to identify when statistical procedures are required to facilitate your business decision making
To be able to identify both good and poor use of statistics in business
To teach you statistics and data analysis effectively
To improve my effectiveness as an instructor
I will not teach you anything in this class that is not regularly used in business and industry
If you ask, “Where is this used?” I will have a real example for you
Data
Qualitative / Categorical
A trait that can only be classified into categories
• Cable Appointment (made, missed)
• Employment Status (employed, unemployed)
• Bond Ratings (1, 2, 3, or 4 stars)
• Service Quality (poor, good, excellent)
Quantitative / Continuous
A characteristic measured on a numerical scale
• Cable Appointment Waiting Time (hours)
• Employment Tenure (months)
• Bond Return (percentage)
• Cost (dollars)
Business Horizons (1993) conducted a comprehensive survey of 800 CEOs who run the country's largest global corporations. Some of the variables measured are given below. Classify them as quantitative or qualitative.
• State of birth
• Age
• Educational Level
• Tenure with Firm
• Total Compensation
• Area of Expertise
• Gender
Variables
Univariate Data
Data sets with just one piece of information
• What is a typical value?
• How do the values vary?
Examples:
– GMAT scores for students in this class
– Incomes in a zip code
– Returns for a stock over this past year
– Respondent ages from market research
Bivariate Data
Data sets with two pieces of information
• Is there a relationship?
• How strong is the relationship?
• Is there a predictive relationship?
Examples:
– GMAT scores and college GPA
– Incomes and age in a zip code
– Returns and volume for a stock
– Market research respondent age and purchase intent
Multivariate Data
Data sets with three or more pieces of information
• Are there relationships?
• How strong are the relationships?
• Do predictive relationships exist?
Examples:
– GMAT scores, salary, gender, job tenure, job category, house ownership, etc.
Summarizing Data about One Variable
Unorganized mass of numbers is difficult to interpret
First task in understanding data is summarizing it
• Graphically
• Numerically
Distinguish between qualitative and quantitative variables
Learn graphic representations of univariate data
Learn numerical representations of univariate data
Investigate data acquired over time
A distribution is essentially how many times each possible data value occurs in a set of data.
Methods for displaying distributions
• Qualitative data
– Frequency table
– Bar charts
• Quantitative data
– Histograms
– Stem-Leaf diagrams
– Boxplots
Background : A question on a market research survey asked 17 respondents the size of their households
Data : 1,1,1,2,2,2,2,2,3,3,3,3,3,3,4,4,6
Frequency Table
Household Size    Number of Households
1                 3
2                 5
3                 6
4                 2
5                 0
6                 1
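The frequency table can be rebuilt directly from the raw survey responses. The following is a minimal sketch, assuming Python's standard library; the variable names are illustrative and not part of the course material.

```python
from collections import Counter

# Household sizes reported by the 17 survey respondents (from the example).
household_sizes = [1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 6]

counts = Counter(household_sizes)

# List every size from the smallest to the largest observed value so that
# sizes nobody reported (size 5 here) still appear with a frequency of 0.
frequency_table = {
    size: counts.get(size, 0)
    for size in range(min(household_sizes), max(household_sizes) + 1)
}

print("Household Size  Number of Households")
for size, freq in frequency_table.items():
    # The run of asterisks doubles as a rough text bar chart.
    print(f"{size:>14}  {freq:>2} {'*' * freq}")
```

Filling in the zero-count category by hand is a deliberate choice: a plain count of the data would silently skip household size 5.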
Bar chart : Plot of the frequency with which each category occurs in the data set
[Figure: bar chart of Number of Households (vertical axis, 0 to 7) for each household size (horizontal axis, 1 to 6)]
Background : Forbes magazine published data on the best small firms in 1993. These were firms with annual sales of more than $5 million and less than $350 million. Firms were ranked by five-year average return on investment. The data are the annual salaries of the chief executive officers of the first 60 ranked firms.
Data (in thousands) :
145 621 262 208 362 424 339 736 291 58 498 643 390 332 750
368 659 234 396 300 343 536 543 217 298 1103 406 254 862 204
206 250 21 298 350 800 726 370 536 291 808 543 149 350 242
198 213 296 317 482 155 802 200 282 573 388 250 396 572
Histograms are constructed in the same way as bar charts except:
• User must create classes to count frequencies
• Bars are adjacent instead of separated with space
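Creating classes and counting frequencies can be sketched as follows for the CEO salary data. This is an illustration only: the class width of 200 (thousand dollars) is an assumption chosen to match the axis ticks of the histogram, and the function name is hypothetical. The list holds the 59 values printed in the data slide.

```python
# CEO salaries in thousands of dollars, as listed in the example.
ceo_salaries = [
    145, 621, 262, 208, 362, 424, 339, 736, 291, 58, 498, 643, 390, 332, 750,
    368, 659, 234, 396, 300, 343, 536, 543, 217, 298, 1103, 406, 254, 862, 204,
    206, 250, 21, 298, 350, 800, 726, 370, 536, 291, 808, 543, 149, 350, 242,
    198, 213, 296, 317, 482, 155, 802, 200, 282, 573, 388, 250, 396, 572,
]

CLASS_WIDTH = 200  # assumed class width, in thousands of dollars


def class_frequencies(values, width):
    """Count how many values fall in each class [lower, lower + width)."""
    freqs = {}
    for value in values:
        lower = (value // width) * width  # lower edge of the value's class
        freqs[lower] = freqs.get(lower, 0) + 1
    # Classes with no observations are simply absent from the result.
    return dict(sorted(freqs.items()))


class_counts = class_frequencies(ceo_salaries, CLASS_WIDTH)
for lower, freq in class_counts.items():
    print(f"{lower:>4} to under {lower + CLASS_WIDTH:>4}: {freq:>2} {'#' * freq}")
```

A different class width would give a histogram with a different appearance, which is why the choice of classes is left to the user.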
[Figure: CEO Salary Histogram; x-axis: Salary (in thousands), 0 to 1200]
Questions:
• What is the typical value of CEO salary?
• How much variability is there around this value?
• What is the general shape of the data?
Histogram characteristics:
• Central tendency
• Variability
• Skewness
• Modality
• Outliers
[Figures: Symmetric Distribution; Right Skewed Distribution; Left Skewed Distribution (horizontal axis: Data)]
[Figures: Unimodal Distribution; Bimodal Distribution (horizontal axis: Data)]
[Figure: Distribution with Outlier (horizontal axis: Data)]
Background : Telecom company wants to analyze the time to complete new service orders measured in hours
Data :
42 21 46 69 87 29 34 59 81 97 64 60 87 81 69 77 75 47
73 82 91 74 70 65 86 87 67 69 49 57 55 68 74 66 81 90
75 82 37 94
Diagram:
2 | 19
3 | 47
4 | 2679
5 | 579
6 | 045678999
7 | 0344557
8 | 111226777
9 | 0147
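The diagram above can be reproduced with a few lines of code. This is a minimal sketch, assuming two-digit values where the tens digit is the stem and the units digit is the leaf; the function name is illustrative.

```python
from collections import defaultdict

# Hours to complete new service orders (from the example).
service_hours = [
    42, 21, 46, 69, 87, 29, 34, 59, 81, 97, 64, 60, 87, 81, 69, 77, 75, 47,
    73, 82, 91, 74, 70, 65, 86, 87, 67, 69, 49, 57, 55, 68, 74, 66, 81, 90,
    75, 82, 37, 94,
]


def stem_and_leaf(values):
    """Return one 'stem | leaves' line per tens digit, leaves in sorted order."""
    stems = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # tens digit and units digit
        stems[stem].append(leaf)
    # Stems with no observations are skipped in this simple version.
    return [
        f"{stem} | {''.join(str(leaf) for leaf in leaves)}"
        for stem, leaves in sorted(stems.items())
    ]


diagram = stem_and_leaf(service_hours)
print("\n".join(diagram))
```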
Mode: Value or category that occurs most frequently
Median: Middle value when the data are sorted
Mean: Sum of measurements divided by the number of measurements
Background : A question on a market research survey asked 17 respondents the size of their households
Data : 1,1,1,2,2,2,2,2,3,3,3,3,3,3,4,4,6
Frequency Table
Household Size    Number of Households
1                 3
2                 5
3                 6    (Mode)
4                 2
5                 0
6                 1
Background : A question on a market research survey asked 17 respondents the size of their households
Data : 1,1,1,2,2,2,2,2,3,3,3,3,3,3,4,4,6
Since there are n = 17 observations,
• Median is the (n+1)/2 = 9th observation
Observation     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
Household Size  1  1  1  2  2  2  2  2  3  3  3  3  3  3  4  4  6
Median = 3 (the 9th observation)
Background : Cable company wants to know how long an installer spends at each stop. One employee performed five installations in one day and recorded how many minutes she was at each location.
Data : 45, 23, 36, 29, 52
Mean = (45+23+36+29+52) / 5 = 37 minutes
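The mode, median, and mean from the household-size and installation-time examples can be checked with Python's standard `statistics` module. This is a small sketch for illustration; the variable names are assumptions.

```python
import statistics

# Household sizes from the market research survey example.
household_sizes = [1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 6]
# Minutes the installer spent at each of five stops.
install_minutes = [45, 23, 36, 29, 52]

mode_size = statistics.mode(household_sizes)      # most frequent value
median_size = statistics.median(household_sizes)  # middle value after sorting
mean_minutes = statistics.mean(install_minutes)   # sum divided by the count

print("Mode household size:", mode_size)
print("Median household size:", median_size)
print("Mean installation time (minutes):", mean_minutes)
```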
Mean = 404.1695
Median = 350
WHY THE DIFFERENCE?
[Figure: CEO Salary Histogram; x-axis: Salary (in thousands), 0 to 1200]
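The two summary values can be recomputed from the 59 CEO salaries listed earlier. The sketch below assumes Python's `statistics` module. One possible reading of the histogram, offered as a suggestion rather than as the slide's stated answer, is that a few very large salaries in the right tail pull the mean above the median.

```python
import statistics

# CEO salaries in thousands of dollars, as listed in the example.
ceo_salaries = [
    145, 621, 262, 208, 362, 424, 339, 736, 291, 58, 498, 643, 390, 332, 750,
    368, 659, 234, 396, 300, 343, 536, 543, 217, 298, 1103, 406, 254, 862, 204,
    206, 250, 21, 298, 350, 800, 726, 370, 536, 291, 808, 543, 149, 350, 242,
    198, 213, 296, 317, 482, 155, 802, 200, 282, 573, 388, 250, 396, 572,
]

mean_salary = statistics.mean(ceo_salaries)
median_salary = statistics.median(ceo_salaries)

print(f"n      = {len(ceo_salaries)}")
print(f"Mean   = {mean_salary:.4f}")
print(f"Median = {median_salary}")
```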
Variability is a primary reason for using statistics
If there were no variability, we would not need statistics
Examples:
• Worker productivity
• Stock market
• Promotional expenditures
Measures
• Standard deviation: variation around the mean
• Range: distance between smallest and largest observations
Standard Deviation: summarizes how far away from the mean the data values typically are.
Calculation
• Find the deviations by subtracting the mean from each data value
• Square these deviations, add them up, and divide by n-1
• Take the square root of this number
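Written as a single formula, the three steps above give the sample standard deviation. The notation here is assumed for illustration: $x_1, \dots, x_n$ are the data values and $\bar{x}$ is their mean.

```latex
s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2}
```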
Background : Your firm spends $19 Million per year on advertising, and management is wondering if that figure is appropriate. Other firms in your industry have a mean advertising expenditure of $22.3 Million per year.
Ad Spend ($ millions)    Deviation    Squared Deviation
 8                       -14.29       204.32
19                        -3.29        10.85
22                        -0.29         0.09
20                        -2.29         5.26
27                         4.71        22.15
37                        14.71       216.26
38                        15.71       246.67
23                         0.71         0.50
23                         0.71         0.50
12                       -10.29       105.97
11                       -11.29       127.56
32                         9.71        94.20
20                        -2.29         5.26
18                        -4.29        18.44
23                         0.71         0.50
35                        12.71       161.44
11                       -11.29       127.56
Mean = 22.29
St Dev = 9.18
[Figure: Industry Advertising Histogram; x-axis: Millions of Dollars, 5 to 40]
Difference from peer group average is $3.3 Million
This difference is smaller than the industry standard deviation of $9.18 Million
Conclusion: Your advertising budget, while slightly below the industry average, is typical compared with your industry peers
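The calculation behind this conclusion can be traced step by step. The following is a sketch only, using the 17 industry expenditures from the example and following the three-step standard deviation recipe; the variable names are illustrative.

```python
import math

# Industry advertising expenditures in millions of dollars (from the example).
industry_ad_spend = [8, 19, 22, 20, 27, 37, 38, 23, 23, 12, 11, 32, 20, 18, 23, 35, 11]
own_ad_spend = 19  # your firm's spending, in millions of dollars

n = len(industry_ad_spend)
mean_spend = sum(industry_ad_spend) / n

# Step 1: deviations from the mean.
deviations = [x - mean_spend for x in industry_ad_spend]
# Step 2: square the deviations, add them up, and divide by n - 1.
variance = sum(d ** 2 for d in deviations) / (n - 1)
# Step 3: take the square root.
st_dev = math.sqrt(variance)

# Compare the firm's gap from the peer average with one standard deviation.
gap = mean_spend - own_ad_spend
within_one_sd = abs(gap) < st_dev

print(f"Mean = {mean_spend:.2f}, St Dev = {st_dev:.2f}")
print(f"Gap from peer average = {gap:.1f} million")
print("Within one standard deviation of the mean:", within_one_sd)
```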
If the histogram for a given sample is unimodal and symmetric (mound-shaped), then the following rule-of-thumb may be applied:
Let $\bar{x}$ be the sample mean and $s$ the sample standard deviation. Then
• $\bar{x} \pm 1s$ contains approximately 68% of the measurements;
• $\bar{x} \pm 2s$ contains approximately 95% of the measurements;
• $\bar{x} \pm 3s$ contains approximately all of the measurements.
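Whether a data set follows this rule of thumb can be checked by counting the share of values inside each interval. The sketch below is an illustration under stated assumptions: the helper name `share_within` is hypothetical, and the telecom service-order times from the stem-and-leaf example are reused purely as sample input. That data set is not perfectly symmetric, so only rough agreement with 68% and 95% should be expected.

```python
import statistics

# Hours to complete new service orders (from the stem-and-leaf example).
service_hours = [
    42, 21, 46, 69, 87, 29, 34, 59, 81, 97, 64, 60, 87, 81, 69, 77, 75, 47,
    73, 82, 91, 74, 70, 65, 86, 87, 67, 69, 49, 57, 55, 68, 74, 66, 81, 90,
    75, 82, 37, 94,
]


def share_within(values, k):
    """Fraction of values lying within k sample standard deviations of the mean."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation (divides by n - 1)
    low, high = mean - k * s, mean + k * s
    return sum(low <= v <= high for v in values) / len(values)


for k in (1, 2, 3):
    print(f"Within {k} st. dev.: {share_within(service_hours, k):.1%}")
```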
Description : Stock market returns are supposed to be unpredictable.
Let’s see if the empirical rule holds true.
Data : S&P-500 Daily returns; Jan 01, 1998 – May 17, 2002
S&P-500 Daily Returns Histogram
Mean = 0.0002
St. Dev. = 0.0128
72.8% (95.3%) of the returns fall between the sample mean plus and minus one (two) st.dev.
[Figure: histogram x-axis: Daily Return, -0.06 to 0.06]
Inter-Quartile Range (IQR) provides an alternative approach to measuring variability
Computation:
• Sort the data and find the median
• Divide the data into top and bottom halves
• Find the median of both halves. These are the 25th and 75th percentiles
• IQR = 75th percentile – 25th percentile
Outlier Measure – Any value outside the inner fences is an outlier candidate
• Lower inner fence = 25th percentile – 1.5 × IQR
• Upper inner fence = 75th percentile + 1.5 × IQR
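The computation above can be sketched in code. Two things here are assumptions rather than part of the slide: when the number of observations is odd, the middle value is left out of both halves (textbooks and software packages differ on this), and the telecom service-order times are reused as sample input. The function name is illustrative.

```python
import statistics

# Hours to complete new service orders (from the stem-and-leaf example).
service_hours = [
    42, 21, 46, 69, 87, 29, 34, 59, 81, 97, 64, 60, 87, 81, 69, 77, 75, 47,
    73, 82, 91, 74, 70, 65, 86, 87, 67, 69, 49, 57, 55, 68, 74, 66, 81, 90,
    75, 82, 37, 94,
]


def quartiles_by_halves(values):
    """25th and 75th percentiles as the medians of the bottom and top halves."""
    if len(values) < 2:
        raise ValueError("need at least two observations")
    data = sorted(values)
    half = len(data) // 2
    bottom = data[:half]   # for odd n the middle value is excluded (assumption)
    top = data[-half:]
    return statistics.median(bottom), statistics.median(top)


q1, q3 = quartiles_by_halves(service_hours)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outlier_candidates = [v for v in service_hours if v < lower_fence or v > upper_fence]

print(f"25th percentile = {q1}, 75th percentile = {q3}, IQR = {iqr}")
print(f"Inner fences: {lower_fence} to {upper_fence}")
print("Outlier candidates:", outlier_candidates)
```

Statistical packages often interpolate percentiles differently, so their quartiles may not match this half-splitting method exactly.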
Data : S&P-500 Daily returns; Jan 01, 1998 – May 17, 2002
[Figure: S&P-500 Daily Returns Boxplot, labeled from top to bottom: outliers, upper inner fence, 75th percentile, median, 25th percentile, lower inner fence]
Goal of course is to learn statistical concepts
• Most statistical analyses are performed using computers
• Each company may use a different statistical package
YES…Minitab is used in business!
• Typically in quality control and design of experiments
EXCEL has very limited statistical functionality and is considerably more difficult to use than Minitab
There are many stat packages (SAS, SPSS, Systat, Splus, R, Statistica, Mathematica, etc.)
• Minitab is the easiest program to use right away
• Excellent Help facilities
• Statistical glossary built-in
A hotel kept records over time of the reasons why guests requested room changes. The frequencies were as follows:
Reason                    Frequency
Room not clean            2
Plumbing not working      1
Wrong type of bed         13
Noisy location            4
Wanted nonsmoking         18
Didn’t like view          1
Not properly equipped     8
Other                     6
Exercise 2.8 in book
• Produce graphics
• Produce descriptive statistics
Diversification???
Data : S&P-500 and IBM daily returns from Jan 01, 1998 through May 17, 2002
Probability and Probability Distributions