Class1 - NYU Stern School of Business

Statistics & Data Analysis

Course Number B01.1305

Course Section 60

Meeting Time Monday 6-9:30 pm

CLASS #1

Class #1 Outline

 Introduction to the instructor

 Introduction to the class

• Review of syllabus

• Introduction to statistics

• Class Goals

 Types of data

 Graphical and numerical methods for univariate series

 Minitab Tutorial

Professor S. D. Balkin -- May 20, 2002

Professor Balkin’s Info

 Ph.D. in Business Administration, Penn State

 Masters in Statistics, Penn State

 Mathematics/Economics and Music, Lafayette College

 Employment

• Pfizer Inc.

– Management Science Group; Sept. 2001 – current

• Ernst & Young

– Quantitative Economics and Statistics Group; June 1999 – August 2001

What is Statistics?

 STATISTICS: A body of principles and methods for extracting useful information from data, for assessing the reliability of that information, for measuring and managing risk, and for making decisions in the face of uncertainty.

 POPULATION: set of measurements corresponding to the entire collection of units

 SAMPLE: set of measurements that are collected from a population

 OBJECTIVES:

• To make inferences about a population from a sample, including the extent of uncertainty

• To design the data collection process to facilitate drawing valid inferences

Reasons for Sampling

 Typically due to prohibitive cost of contacting millions of people or performing costly experiments

• Election polls query about 2,000 voters to make inferences regarding how all voters cast their ballots

 Sometimes the sampling process is destructive

• Sampling wine quality
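To see why a poll of about 2,000 voters can stand in for a whole electorate, here is a minimal Python sketch. The 52% true support level and the random seed are hypothetical values chosen only for illustration, and the population is treated as effectively infinite. The margin of error uses the common rule of thumb of two standard errors for a proportion.

```python
import random

def simulate_poll(true_share, sample_size, seed):
    """Draw a simple random sample of voters and return the sample share."""
    rng = random.Random(seed)
    # Each sampled voter supports the candidate with probability true_share
    # (assumption: the population is large enough to ignore its finite size).
    votes = [1 if rng.random() < true_share else 0 for _ in range(sample_size)]
    return sum(votes) / sample_size

TRUE_SHARE = 0.52    # hypothetical population value, not from a real poll
SAMPLE_SIZE = 2000   # poll size mentioned on the slide

estimate = simulate_poll(TRUE_SHARE, SAMPLE_SIZE, seed=1)
# Rule-of-thumb 95% margin of error: 2 * sqrt(p * (1 - p) / n)
margin = 2 * (estimate * (1 - estimate) / SAMPLE_SIZE) ** 0.5
print(f"sample share = {estimate:.3f}, margin of error = +/- {margin:.3f}")
```

With 2,000 respondents the margin of error is a little over two percentage points, which is why contacting millions of voters is rarely worth the cost.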

Statistics in Everyday Life

 Monthly Unemployment Rates (BLS)

 Consumer Price Index

 Presidential Approval Rating

 Quality and Productivity Improvement

 Scientific Inquiry

• Training effectiveness

• Advertising impact

Interesting Statistical Perspectives

 “Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write”.

– (H. G. Wells)

 “There are three kinds of lies -- Lies, damn lies, and statistics”.

– (Benjamin Disraeli)

 “You’ve got to know when to hold ’em, know when to fold ’em.”

– (Kenny Rogers, in The Gambler)

 “The average U. S. household has 2.75 people in it.”

– (U. S. Census Bureau, 1980)

 “4 out of 5 dentists surveyed recommended Trident Sugarless Gum for their patients who chew gum.”

– (Advertisement for Trident)

Semester Overview

 Understanding data

• Intro to descriptive statistics, interpreting data, and graphical methods

 Dealing with and quantifying uncertainty

• Random variables and probability

 Using samples to make generalizations about populations

• Assessing whether a change in data is beyond random variation

 Modeling relationships and predicting

• Using sample data to create models that give predictions for all values of a population

Goals for this Class

 To gain an understanding of descriptive statistics, probability, statistical inference, and regression analysis so that they may be applied in your job

 To be able to identify when statistical procedures are required to facilitate your business decision making

 To be able to identify both good and poor use of statistics in business

Goals for Me

 To teach you statistics and data analysis effectively

 To improve my effectiveness as an instructor

My Promise To You

 I will not teach you anything in this class that is not regularly used in business and industry

 If you ask, “Where is this used?” I will have a real example for you

Types of Data

Data

 Qualitative / Categorical: qualitative trait only classifiable into categories

• Cable Appointment (made, missed)

• Employment Status (employed, unemployed)

• Bond Ratings (1, 2, 3, or 4 stars)

• Service Quality (poor, good, excellent)

 Quantitative / Continuous: characteristic measured on a numerical scale

• Cable Appointment Waiting Time (hours)

• Employment Tenure (months)

• Bond Return (percentage)

• Cost (dollars)

Example: Data Types

 Business Horizons (1993) conducted a comprehensive survey of 800 CEOs who run the country's largest global corporations. Some of the variables measured are given below. Classify them as quantitative or qualitative.

• State of birth

• Age

• Educational Level

• Tenure with Firm

• Total Compensation

• Area of Expertise

• Gender

How Much Data

 Univariate Data: data sets with just one piece of information (one variable)

• What is a typical value?

• How do the values vary?

• Examples

– GMAT scores for students in this class

– Incomes in a zipcode

– Returns for a stock over this past year

– Respondent ages from market research

 Bivariate Data: data sets with two pieces of information (two variables)

• Is there a relationship?

• How strong is the relationship?

• Is there a predictive relationship?

• Examples

– GMAT scores and college GPA

– Incomes and age in a zipcode

– Returns and volume for a stock

– MR respondent age and purchase intent

 Multivariate Data: data sets with three or more pieces of information (three or more variables)

• Are there relationships?

• How strong are the relationships?

• Do predictive relationships exist?

• Examples

– GMAT Scores, Salary, Gender, Job Tenure, Job Category, House Ownership, etc.

CHAPTER 2

Summarizing Data about One Variable

Introduction

 An unorganized mass of numbers is difficult to interpret

 The first task in understanding data is summarizing it

• Graphically

• Numerically

Chapter Goals

 Distinguish between qualitative and quantitative variables

 Learn graphic representations of univariate data

 Learn numerical representations of univariate data

 Investigate data acquired over time

Distribution of Values

 A distribution describes how many times each possible data value occurs in a set of data.

 Methods for displaying distributions

• Qualitative data

– Frequency table

– Bar charts

• Quantitative data

– Histograms

– Stem-Leaf diagrams

– Boxplots

Example: Qualitative Data

 Background : A question on a market research survey asked 17 respondents the size of their households

 Data : 1,1,1,2,2,2,2,2,3,3,3,3,3,3,4,4,6

 Frequency Table

Household Size   Number of Households
1                3
2                5
3                6
4                2
5                0
6                1
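The frequency table can be rebuilt from the raw survey data in a few lines. This is a minimal Python sketch using only the 17 household sizes listed on the slide; the variable names are illustrative.

```python
from collections import Counter

# Household sizes reported by the 17 survey respondents
household_sizes = [1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 6]

counts = Counter(household_sizes)
# Walk every size from the smallest to the largest so that categories
# with no respondents (size 5 here) still appear with a count of zero.
frequency_table = {
    size: counts.get(size, 0)
    for size in range(min(household_sizes), max(household_sizes) + 1)
}

print("Household Size   Number of Households")
for size, n in frequency_table.items():
    print(f"{size:<16} {n}")
```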

Example: Qualitative Data (cont.)

 Bar chart: plot of the frequency with which each category occurs in the data set

[Figure: bar chart of Number of Households (vertical axis, 0 to 7) by household size (horizontal axis)]

Example: Quantitative Data

 Background : Forbes magazine published data on the best small firms in 1993. These were firms with annual sales of more than $5 million and less than $350 million. Firms were ranked by five-year average return on investment. The data are the annual salaries of the chief executive officers of the first 60 ranked firms.

 Data (in thousands) :

145 621 262 208 362 424 339 736 291 58 498 643 390 332 750

368 659 234 396 300 343 536 543 217 298 1103 406 254 862 204

206 250 21 298 350 800 726 370 536 291 808 543 149 350 242

198 213 296 317 482 155 802 200 282 573 388 250 396 572

Example: Quantitative Data (cont.)

 Histograms are constructed in the same way as bar charts except:

• User must create classes to count frequencies

• Bars are adjacent instead of separated with space
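As a rough illustration of "creating classes to count frequencies", the sketch below bins the CEO salaries into classes of width 200 (in thousands) and prints a text histogram. The class width is an assumption made for this example, not necessarily the one used for the slide's chart. The listing above contains 59 values, and those are what the code uses.

```python
# CEO salaries (in thousands) as listed on the slide
ceo_salaries = [
    145, 621, 262, 208, 362, 424, 339, 736, 291, 58, 498, 643, 390, 332, 750,
    368, 659, 234, 396, 300, 343, 536, 543, 217, 298, 1103, 406, 254, 862, 204,
    206, 250, 21, 298, 350, 800, 726, 370, 536, 291, 808, 543, 149, 350, 242,
    198, 213, 296, 317, 482, 155, 802, 200, 282, 573, 388, 250, 396, 572,
]

WIDTH = 200  # hypothetical class width, chosen for illustration

def class_counts(values, width):
    """Count the values in each class [lower, lower + width)."""
    counts = {}
    for v in values:
        lower = (v // width) * width
        counts[lower] = counts.get(lower, 0) + 1
    # Include empty classes so the bars stay adjacent with no gaps.
    return {lo: counts.get(lo, 0) for lo in range(0, max(counts) + width, width)}

salary_classes = class_counts(ceo_salaries, WIDTH)
for lower, n in salary_classes.items():
    print(f"{lower:>4} to {lower + WIDTH - 1:<4} | {'#' * n}")
```

Changing `WIDTH` shows how the choice of classes alters the apparent shape of the distribution.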

Example: Quantitative Data (cont.)

[Figure: CEO Salary Histogram; horizontal axis: Salary (in thousands), 0 to 1200]

Example: Quantitative Data (cont.)

 Questions:

• What is the typical value of CEO salary?

• How much variability is there around this value?

• What is the general shape of the data?

 Histogram characteristics:

• Central tendency

• Variability

• Skewness

• Modality

• Outliers

Skewness

[Figure: three example histograms titled Symmetric Distribution, Right Skewed Distribution, and Left Skewed Distribution; horizontal axis: Data]

Modality

[Figure: two example histograms titled Unimodal Distribution and Bimodal Distribution; horizontal axis: Data]

Outliers

[Figure: example histogram titled Distribution with Outlier; horizontal axis: Data]

Example: Stem-Leaf Diagram

 Background : A telecom company wants to analyze the time, measured in hours, to complete new service orders

 Data :

42 21 46 69 87 29 34 59 81 97 64 60 87 81 69 77 75 47

73 82 91 74 70 65 86 87 67 69 49 57 55 68 74 66 81 90

75 82 37 94

 Diagram:

2 | 19

3 | 47

4 | 2679

5 | 579

6 | 045678999

7 | 0344557

8 | 111226777

9 | 0147
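The diagram above can be generated mechanically: sort the data, use the tens digit as the stem and the units digit as the leaf. A minimal Python sketch, assuming two-digit values as in this example (the function name is illustrative):

```python
from collections import defaultdict

# Hours to complete each of the 40 new service orders
order_times = [
    42, 21, 46, 69, 87, 29, 34, 59, 81, 97, 64, 60, 87, 81, 69, 77, 75, 47,
    73, 82, 91, 74, 70, 65, 86, 87, 67, 69, 49, 57, 55, 68, 74, 66, 81, 90,
    75, 82, 37, 94,
]

def stem_and_leaf(values):
    """Return display lines: tens digit as stem, sorted units digits as leaves."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    # Walk every stem in range so an empty stem would still get a row.
    return [
        f"{stem} | {''.join(str(d) for d in leaves[stem])}"
        for stem in range(min(leaves), max(leaves) + 1)
    ]

diagram = stem_and_leaf(order_times)
print("\n".join(diagram))
```

Unlike a histogram, the display keeps every original value readable while still showing the shape of the distribution.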

Measures of Central Tendency

 Mode: Value or category that occurs most frequently

 Median: Middle value when the data are sorted

 Mean: Sum of measurements divided by the number of measurements

Example: Mode

 Background : A question on a market research survey asked 17 respondents the size of their households

 Data : 1,1,1,2,2,2,2,2,3,3,3,3,3,3,4,4,6

 Frequency Table

Household Size   Number of Households
1                3
2                5
3                6   <-- Mode
4                2
5                0
6                1

Example: Median

 Background : A question on a market research survey asked 17 respondents the size of their households

 Data : 1,1,1,2,2,2,2,2,3,3,3,3,3,3,4,4,6

 Since there are n = 17 observations, the median is the (n+1)/2 = 9th observation

Observation      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
Household Size   1  1  1  2  2  2  2  2  3  3  3  3  3  3  4  4  6

Median = 3 (the 9th observation)
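Both the mode and the median of the household data can be checked in Python. The sketch below applies the (n+1)/2 position rule by hand and compares it with the standard library's `statistics` module; variable names are illustrative.

```python
import statistics

household_sizes = [1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 6]

data = sorted(household_sizes)
n = len(data)

# For an odd number of observations, the median is the
# (n + 1) / 2-th value in sorted order (counting from 1).
position = (n + 1) // 2
median_by_position = data[position - 1]

household_median = statistics.median(household_sizes)
household_mode = statistics.mode(household_sizes)

print(f"n = {n}, median position = {position}")
print(f"median = {household_median}, mode = {household_mode}")
```

When n is even there is no single middle observation, and `statistics.median` averages the two middle values instead.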

Example: Mean

 Background : Cable company wants to know how long an installer spends at each stop. One employee performed five installations in one day and recorded how many minutes she was at each location.

 Data : 45, 23, 36, 29, 52

 Mean = (45+23+36+29+52) / 5 = 37 minutes

Example: Back to the CEO’s Salaries

Mean = 404.1695

Median = 350

WHY THE DIFFERENCE?

[Figure: CEO Salary Histogram; horizontal axis: Salary (in thousands), 0 to 1200]
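One way to explore the question is to recompute both measures with and without the largest salary. This is a minimal Python sketch using the 59 salaries listed earlier; dropping a single value is only an illustration of sensitivity, not a recommended way to clean data.

```python
import statistics

# CEO salaries (in thousands) as listed on the earlier slide
ceo_salaries = [
    145, 621, 262, 208, 362, 424, 339, 736, 291, 58, 498, 643, 390, 332, 750,
    368, 659, 234, 396, 300, 343, 536, 543, 217, 298, 1103, 406, 254, 862, 204,
    206, 250, 21, 298, 350, 800, 726, 370, 536, 291, 808, 543, 149, 350, 242,
    198, 213, 296, 317, 482, 155, 802, 200, 282, 573, 388, 250, 396, 572,
]

mean_salary = statistics.mean(ceo_salaries)
median_salary = statistics.median(ceo_salaries)
print(f"all firms:       mean = {mean_salary:.4f}, median = {median_salary}")

# Remove the single largest salary (1103) and recompute.
trimmed = sorted(ceo_salaries)[:-1]
trimmed_mean = statistics.mean(trimmed)
trimmed_median = statistics.median(trimmed)
print(f"without largest: mean = {trimmed_mean:.4f}, median = {trimmed_median}")
```

The mean falls by about 12 while the median falls by only 3.5: a long right tail of high salaries pulls the mean above the median, whereas the median depends only on the middle of the sorted data.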

Measures of Variation

 Variability is a primary reason for using statistics

 If there were no variability, we would not need statistics

 Examples:

• Worker productivity

• Stock market

• Promotional expenditures

 Measures

• Standard deviation: variation around the mean

• Range: distance between smallest and largest observations

Standard Deviation

 Standard Deviation: summarizes how far away from the mean the data values typically are.

 Calculation

• Find the deviations by subtracting the mean from each data value

• Square these deviations, add them up, and divide by n-1

• Take the square root of this number
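The three calculation steps translate directly into code. The sketch below applies them to the five installation times from the earlier mean example (45, 23, 36, 29, 52 minutes); the function name is illustrative.

```python
import math

def sample_std_dev(values):
    """Sample standard deviation, following the three steps on the slide."""
    n = len(values)
    mean = sum(values) / n
    # Step 1: deviations from the mean
    deviations = [v - mean for v in values]
    # Step 2: square the deviations, add them up, divide by n - 1
    variance = sum(d * d for d in deviations) / (n - 1)
    # Step 3: take the square root
    return math.sqrt(variance)

install_minutes = [45, 23, 36, 29, 52]
sd = sample_std_dev(install_minutes)
print(f"standard deviation = {sd:.2f} minutes")
```

Here the deviations are 8, -14, -1, -8 and 15, their squares sum to 550, and 550 / 4 = 137.5, so the standard deviation is about 11.73 minutes.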

Example: Standard Deviation

 Background : Your firm spends $19 Million per year on advertising, and management is wondering if that figure is appropriate. Other firms in your industry have a mean advertising expenditure of $22.3 Million per year.

Example: Standard Deviation (cont.)

Ad$$$   Deviations   Sq Devs
 8       -14.29      204.32
19        -3.29       10.85
22        -0.29        0.09
20        -2.29        5.26
27         4.71       22.15
12       -10.29      105.97
11       -11.29      127.56
32         9.71       94.20
20        -2.29        5.26
37        14.71      216.26
38        15.71      246.67
23         0.71        0.50
23         0.71        0.50
18        -4.29       18.44
23         0.71        0.50
35        12.71      161.44
11       -11.29      127.56

Mean = 22.29

St Dev = 9.18

[Figure: Industry Advertising Histogram; horizontal axis: Millions of Dollars, 5 to 40]

Example: Standard Deviation (cont.)

 Difference from peer group average is $3.3 Million

 This difference is smaller than the industry standard deviation of $9.18 Million

 Conclusion: Your advertising budget, while slightly below the industry average, is typical compared with your industry peers
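The comparison can be reproduced with a short Python sketch. The 17 industry figures are read from the Ad$$$ column of the table on the previous slide; expressing the gap in standard deviations is one convenient way to state the conclusion, not the only one.

```python
import statistics

# Industry advertising expenditures (millions of dollars), from the slide's table
industry_ad_spend = [8, 19, 22, 20, 27, 12, 11, 32, 20, 37, 38, 23, 23, 18, 23, 35, 11]
your_spend = 19  # your firm's budget, in millions

mean_spend = statistics.mean(industry_ad_spend)
sd_spend = statistics.stdev(industry_ad_spend)   # divides by n - 1

# How many standard deviations your budget sits from the industry mean
gap_in_sds = (your_spend - mean_spend) / sd_spend

print(f"industry mean = {mean_spend:.2f}, st dev = {sd_spend:.2f}")
print(f"your budget is {gap_in_sds:.2f} standard deviations from the mean")
```

A gap of roughly a third of a standard deviation below the mean is well inside the usual spread of the peer group.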

Empirical Rule

 If the histogram for a given sample is unimodal and symmetric (mound-shaped), then the following rule-of-thumb may be applied:

 Let $\bar{x}$ be the sample mean and $s$ the sample standard deviation. Then

• $\bar{x} \pm 1s$ contains approximately 68% of the measurements;

• $\bar{x} \pm 2s$ contains approximately 95% of the measurements;

• $\bar{x} \pm 3s$ contains approximately all of the measurements.
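The rule of thumb is easy to check on mound-shaped data. The sketch below uses simulated values, 10,000 draws from a normal distribution with a hypothetical mean of 50 and standard deviation of 10, because no real data set accompanies this slide. The seed is fixed only so the run is repeatable.

```python
import random
import statistics

rng = random.Random(0)
# Hypothetical mound-shaped sample (simulated, not real measurements)
sample = [rng.gauss(50, 10) for _ in range(10_000)]

xbar = statistics.mean(sample)
s = statistics.stdev(sample)

def share_within(k):
    """Fraction of the sample lying inside xbar +/- k * s."""
    return sum(abs(x - xbar) <= k * s for x in sample) / len(sample)

shares = {k: share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} st. dev.: {share:.1%}")
```

For skewed or heavy-tailed data the observed percentages can differ noticeably from 68% and 95%, which is why the rule is stated only for unimodal, symmetric histograms.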

Example: Stock Market Volatility

 Description : Stock market returns are supposed to be unpredictable. Let’s see if the empirical rule holds true

 Data : S&P-500 Daily returns; Jan 01, 1998 – May 17, 2002

 Mean = 0.0002

 St. Dev. = 0.0128

 72.8% (95.3%) of the returns fall within one (two) standard deviations of the sample mean

[Figure: S&P-500 Daily Returns Histogram; horizontal axis: Daily Return, -0.06 to 0.06]

Inter-Quartile Range

 Inter-Quartile Range (IQR) provides an alternative approach to measuring variability

 Computation:

• Sort the data and find the median

• Divide the data into top and bottom halves

• Find the median of both halves. These are the 25th and 75th percentiles

• IQR = 75th percentile – 25th percentile

 Outlier Measure – Any value outside the inner fences is an outlier candidate

• Lower inner fence = 25th percentile – 1.5 × IQR

• Upper inner fence = 75th percentile + 1.5 × IQR
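The computation can be sketched in Python using the 40 service-order times from the stem-and-leaf example. One assumption is labeled in the code: when n is odd, the middle observation is left out of both halves. The slide does not say how to split in that case, and statistical packages use several slightly different quartile rules.

```python
import statistics

def quartiles_by_halves(values):
    """25th and 75th percentiles as the medians of the lower and upper halves."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    # Assumption: for odd n the middle value belongs to neither half.
    upper = data[-half:]
    return statistics.median(lower), statistics.median(upper)

# Hours to complete each of the 40 new service orders
order_times = [
    42, 21, 46, 69, 87, 29, 34, 59, 81, 97, 64, 60, 87, 81, 69, 77, 75, 47,
    73, 82, 91, 74, 70, 65, 86, 87, 67, 69, 49, 57, 55, 68, 74, 66, 81, 90,
    75, 82, 37, 94,
]

q1, q3 = quartiles_by_halves(order_times)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outlier_candidates = [t for t in order_times if t < lower_fence or t > upper_fence]

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"inner fences: {lower_fence} to {upper_fence}")
print(f"outlier candidates: {outlier_candidates}")
```

For these data the fastest order, 21 hours, falls just below the lower inner fence and is flagged as an outlier candidate.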

Box-Plot – S&P-500 Example

Data : S&P-500 Daily returns; Jan 01, 1998 – May 17, 2002

[Figure: S&P-500 Daily Returns Boxplot, with labels for the outliers, upper inner fence, 75th percentile, median, 25th percentile, and lower inner fence]

Minitab Tutorial

Why Use Minitab???

 Goal of course is to learn statistical concepts

• Most statistical analyses are performed using computers

• Each company may use a different statistical package

 YES…Minitab is used in business!

• Typically in quality control and design of experiments

 EXCEL has very limited statistical functionality and is considerably more difficult to use than Minitab

 There are many stat packages (SAS, SPSS, Systat, Splus, R, Statistica, Mathematica, etc.)

• Minitab is the easiest program to use right away

• Excellent Help facilities

• Statistical glossary built-in

Minitab Tutorial – Case Study 1

 A hotel kept records over time of the reasons why guests requested room changes. The frequencies were as follows:

– Room not clean: 2

– Plumbing not working: 1

– Wrong type of bed: 13

– Noisy location: 4

– Wanted nonsmoking: 18

– Didn’t like view: 1

– Not properly equipped: 8

– Other: 6

Minitab Tutorial – Case Study 2

 Exercise 2.8 in book

• Produce graphics

• Produce descriptive statistics

Minitab Tutorial – Case Study 3

 Diversification???

 Data : S&P-500 and IBM daily returns from Jan 01, 1998 through May 17, 2002

Next Time

 Probability and Probability Distributions
