
STA201 Intermediate Statistics

Lecture Notes

Luc Hens

15 January 2016


How to use these lecture notes

These lecture notes start by reviewing the material from STA101 (most of it covered in Freedman et al. (2007)): descriptive statistics, probability distributions, sampling distributions, and confidence intervals. Following the review, STA201 covers the following topics: hypothesis tests (Freedman et al., 2007, chapters 26 and 29); the t-test for small samples (Freedman et al., 2007, chapter 26, section 6); hypothesis tests on two averages (Freedman et al., 2007, chapter 27); and the chi-square test (Freedman et al., 2007, chapter 28). STA201 then covers correlation and simple linear regression (Freedman et al., 2007, chapters 10, 11, 12). Two related subjects (multiple regression and inference for regression) that are not covered in Freedman et al. (2007) are covered in depth in the lecture notes. Each chapter of the lecture notes ends with Questions for Review; be prepared to answer these questions in class. Work the problems at the end of the chapters in the lecture notes. The key concepts are set in boldface; you should know their definitions. You can find the lecture notes and other material on the course web site: http://homepages.vub.ac.be/~lmahens/STA201.html

We’ll use the open-source statistical environment R with the graphical user interface R Commander. The course web page contains a document (Getting started in STA101) that explains how to install R and R Commander on your computer, as well as R scripts and data sets used in the course. Thanks to a web interface (Rweb) you can also run R from any computer or mobile device (tablet or smartphone) with a web browser, without having R installed. Make sure you are connected to the internet. In your web browser, open a new tab. Point your browser to Rweb: http://pbil.univ-lyon1.fr/Rweb/ Remove everything from the window at the top (data(meaudret) etc.). Type R code (or paste an R script) in the window. Click the Submit button. Wait until Results from Rweb appears. If the script generates a graph, scroll down to see the graph.

Practice is important to learn statistics. Students who wish to work additional exercises can find hundreds of solved exercises in Kazmier (1995) (or a more recent edition). Moore et al. (2012) covers the same ground as Freedman et al. (2007) and has many exercises; the solutions to the odd-numbered exercises are in the back of the book. Older but still useful editions of both books are available in the VUB library.


Remember the following calculation rules:

– Always carry the units of measurement in the calculations. For instance, when you have two measurements in dollars ($2 and $3) and you compute their average, write:

($2 + $3) / 2 = $2.50

– To express a fraction (say 2/5) as a percentage, multiply by 100% (not by 100):

(2/5) × 100% = 40%

The same holds for expressing decimals (say, 0.40) as a percentage:

0.40 × 100% = 40%

(STA201 was for a while taught as STA301 Methods: Statistics for Business and Economics.)

Chapter 1

Descriptive statistics

1.1 Basic concepts of statistics

Suppose you want to find out which percentage of employees in a given company has a private pension plan. The population is the set of cases about which you want to find things out. In this case, the population consists of all employees in the given company; each employee is a case. A variable is a characteristic of each case in the population. In this case you are interested in the variable private pension plan . It can take two values: yes or no (it’s a qualitative variable ). The percentage of employees who have a private pension plan is a parameter : a numerical characteristic of the population. The monthly salary of the employees is a quantitative variable . The average monthly salary of all employees in the company is another parameter. We’ll be mainly concerned with these two types of parameters: percentages (of qualitative variables) and averages (of quantitative variables).

If you conduct a survey and every employee in the company fills out the survey form, the collected data set covers all of the population, and you can find the exact value of the population parameter. In some cases collecting data for the population may not be possible; you may have to rely on a sample drawn from the population. A sample is a subset of the population. The sample percentage (which percentage of employees in the sample has a private pension plan) is called a statistic .

Statistical inference means using a sample to draw conclusions about the population it was drawn from. We’ll see that when the sample is a simple random sample, the sample percentage (the statistic) is a good estimate of the population percentage (the parameter). Much of statistical inference deals with quantifying the degree of uncertainty that is the result of generalizing from sample evidence.

First we will deal with descriptive statistics: ways to summarize data (from a population or a sample) in a table, a graph, or with numbers.

1.2 Summarizing data by a frequency table

How can we summarize information about a quantitative variable of a sample or a population, often consisting of thousands of measurements?


When a particular stock is traded unusually frequently on a given day, this usually indicates something is going on. Table 1.1 shows the number of traded Apple shares for each of the first fifty trading days of 2013. A glance at the data reveals that the trade volumes differ considerably from day to day.

Table 1.1: Volumes of Apple stock traded on NASDAQ on the first 50 trading days of 2013. Source: nasdaq.com

Date (yyyy/mm/dd)   Volume        Date (yyyy/mm/dd)   Volume
2013/03/14          10 828 780    2013/02/06          21 143 410
2013/03/13          14 473 490    2013/02/05          20 422 720
2013/03/12          16 591 730    2013/02/04          17 006 390
2013/03/11          16 888 770    2013/02/01          19 243 490
2013/03/08          13 923 820    2013/01/31          11 349 350
2013/03/07          16 709 980    2013/01/30          14 877 260
2013/03/06          16 408 620    2013/01/29          20 355 270
2013/03/05          22 746 730    2013/01/28          27 967 400
2013/03/04          20 618 900    2013/01/25          43 088 190
2013/03/01          19 688 520    2013/01/24          52 065 570
2013/02/28          11 501 780    2013/01/23          27 298 580
2013/02/27          20 936 410    2013/01/22          16 392 270
2013/02/26          17 862 940    2013/01/18          16 712 490
2013/02/25          13 259 070    2013/01/17          16 128 630
2013/02/22          11 794 320    2013/01/16          24 627 700
2013/02/21          15 937 660    2013/01/15          31 114 650
2013/02/20          16 974 720    2013/01/14          26 145 870
2013/02/19          15 545 710    2013/01/11          12 509 870
2013/02/15          13 981 970    2013/01/10          21 426 660
2013/02/14          12 683 670    2013/01/09          14 535 530
2013/02/13          16 954 690    2013/01/08          16 350 190
2013/02/12          21 677 620    2013/01/07          17 262 620
2013/02/11          18 315 220    2013/01/04          21 196 320
2013/02/08          22 591 910    2013/01/03          12 579 170
2013/02/07          25 089 680    2013/01/02          19 986 670


How can we get a better idea of the typical daily volumes and the spread around the typical volumes? A good start is to rank the values from low to high:

10 828 780, 11 349 350, 11 501 780, 11 794 320, 12 509 870,
12 579 170, 12 683 670, 13 259 070, 13 923 820, 13 981 970,
14 473 490, 14 535 530, 14 877 260, 15 545 710, 15 937 660,
16 128 630, 16 350 190, 16 392 270, 16 408 620, 16 591 730,
16 709 980, 16 712 490, 16 888 770, 16 954 690, 16 974 720,
17 006 390, 17 262 620, 17 862 940, 18 315 220, 19 243 490,
19 688 520, 19 986 670, 20 355 270, 20 422 720, 20 618 900,
20 936 410, 21 143 410, 21 196 320, 21 426 660, 21 677 620,
22 591 910, 22 746 730, 24 627 700, 25 089 680, 26 145 870,
27 298 580, 27 967 400, 31 114 650, 43 088 190, 52 065 570

(In R Commander, type the sort() function in the script window. The name of the variable should be between the brackets.)

The values vary from 10.8 to 52.1 million shares per day. The middle value in the ordered list is called the median. Because we have an even number of values (50), there are two middle values: the values at position 25 (16 974 720) and 26 (17 006 390). In that case, the convention is to take the average of the two middle values as the median:

median = (16 974 720 + 17 006 390) / 2 = 16 990 555

The median gives an idea of the central tendency of the data distribution: half of the days the value (the volume of traded shares) was less than the median, and the other half the value was more than the median.
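You can let R do this work. Here is a minimal sketch, assuming the Apple data set has been loaded in R Commander under the name Dataset (as in exercise 1 at the end of this chapter):

volumes <- Dataset$volume   # the 50 daily trade volumes
sort(volumes)               # rank the values from low to high
median(volumes)             # 16 990 555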

We can summarize the ordered list in a frequency table. First, define class intervals that don’t overlap and cover all data. You don’t want too few class intervals (because that would leave out too much information), nor too many (because that wouldn’t summarize the information from the data). You also want the class intervals to have boundaries that are easy, rounded numbers. The class intervals don’t have to be of the same width. Let us define the first class interval as 10 000 000 to 15 000 000 (10 000 000 included, 15 000 000 not included), the second as 15 000 000 to 20 000 000, and so on, until 50 000 000 to 55 000 000. A frequency table has three columns: class interval, absolute frequency, and relative frequency (table 1.2). The absolute frequency (or count) is how many values fall in each class interval. The first class interval (10 000 000 to 15 000 000) contains 13 values (verify!): the absolute frequency for this interval is 13. Find the absolute frequencies for the other class intervals.

The relative frequency expresses the number of values in a class interval (the absolute frequency) as a percentage of the total number of values in the data set. For the first class interval (10 000 000 to 15 000 000) the relative frequency is:

(13 / 50) × 100% = 26%

Verify the relative frequencies for the other class intervals. Show your work.

The absolute frequencies add up to the number of values in the data set, and the relative frequencies (before rounding) add up to 100%. If that is not the case, you made a mistake.
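If you want to check your frequency table in R, something like the following should work (assuming the volumes vector from the previous sketch):

breaks <- seq(10000000, 55000000, by = 5000000)        # class boundaries
counts <- table(cut(volumes, breaks, right = FALSE))   # left boundaries included
counts                                                 # absolute frequencies
counts / sum(counts) * 100                             # relative frequencies (%)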


Table 1.2: Frequency table of the volumes of Apple stock traded on NASDAQ on the first 50 trading days of 2013. Note. Class intervals include left boundaries and don’t include right boundaries.

Volume              Absolute    Relative
(shares per day)    frequency   frequency (%)
10 to 15 million    13          26
15 to 20 million    19          38
20 to 25 million    11          22
25 to 30 million     4           8
30 to 35 million     1           2
35 to 40 million     0           0
40 to 45 million     1           2
45 to 50 million     0           0
50 to 55 million     1           2
Sum:                50         100

1.3 Summarizing data by a density histogram

The frequency table gives you a pretty good idea of what the most common values are, and how the values differ. One way to graph the information from a frequency table is to plot the values of the variable (in this case: the daily volumes) on the horizontal axis, and the absolute or relative frequency on the vertical axis. The heights of the bars represent the absolute or relative frequencies. The areas of the bars don’t have a meaning. Such a bar chart is called a frequency histogram.

For reasons that will soon be clear, it is more interesting to plot a frequency table in a bar chart where the areas of the bars represent the relative frequencies. Such a bar chart is called a density histogram. The height of each bar in a density histogram represents the density of the data in the class interval.

To construct a density histogram, we have to find the height for each bar. How do we compute the height? Remember that the area of a rectangle (such as the bars in the density histogram) is given by width times height:

area = width × height

The area of the bar is the relative frequency, the width of the bar is the width of the class interval, and the height of the bar is the density. Hence:

relative frequency = width of the interval × density

Divide both sides by the width of the interval, to obtain:

density = relative frequency (%) / width of the interval

This formula is on the formula sheet. For the class interval from 10 million to 15 million shares the relative frequency was 26% (table 1.2). Hence the density for this interval is:

density = 26% / (15 million shares − 10 million shares)
        = 26% / (5 million shares)
        = 5.2% per million shares

Now that you know the height of the bar over the interval from 10 to 15 million shares (5.2% per million shares), you can draw the bar. The density for the interval from 10 to 15 million shares tells us which percentage of all 50 values falls in each interval of one unit wide on the horizontal axis, assuming that the values in the interval from 10 to 15 million shares are uniformly distributed. In the interval from 10 to 15 million shares, about 5.2% of all values falls between 10 and 11 million shares, about 5.2% of all values falls between 11 and 12 million shares, about 5.2% of all values falls between 12 and 13 million shares, about 5.2% of all values falls between 13 and 14 million shares, and about 5.2% of all values falls between 14 and 15 million shares. It is as if the bar is sliced up in vertical strips of one horizontal unit (here: one million shares) wide. The density measures which percentage of all values falls in such a strip of one unit wide. Note the unit of measurement of density: percent per million shares.

More generally, density is expressed in percent per unit on the horizontal axis .

Given a data set such as table 1.1, you should be able to construct a frequency table and a density histogram. The first assignment asks you to do exactly that.

Figure 1.1 shows the density histogram as generated by R. A script to draw this density histogram in R Commander is posted on the course web page.
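If you want to experiment before downloading that script, a minimal sketch in plain R might look like this (assuming the volumes vector from above; note that R reports densities as decimal fractions per horizontal unit, not as percentages):

hist(volumes/1000000, breaks = seq(10, 55, by = 5), right = FALSE,
     freq = FALSE, xlab = "Daily volume (in millions)", main = "")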

Suppose you don’t have the data set or the frequency table, but just the density histogram (figure 1.1). On which percentage of trading days was the volume of traded Apple shares between 20 and 30 million? Show in the density histogram what represents your answer. On (approximately) which percentage of trading days was the volume of traded Apple shares between 24 and 27 million? Show in the density histogram what represents your answer.

We conclude that the area under the histogram between two values represents the percentage of observations that falls between those two values.

What is the area under the whole histogram? ........ %.

In a density histogram the vertical axis shows the density of the data. The areas of the bars represent percentages. The area under a density histogram over an interval is the percentage of data that fall in that interval. The total area under a density histogram is 100%. (Freedman et al., 2007, p. 41)

A density histogram reveals the shape of the data distribution. To assess the shape of the density histogram, locate the median on the horizontal axis and draw a vertical line. Is the histogram symmetric about the median, or is it skewed? Is the histogram skewed to the left (that is, with a long tail to the left) or to the right (with a long tail to the right)? Is the histogram bell-shaped?

See this video on how the world income distribution has changed over the last two centuries: https://youtu.be/_JhD37gSNVU

Although a density histogram is somewhat more complicated than a frequency histogram, a density histogram has several advantages:


[Figure 1.1: Density histogram of the volumes of Apple stock traded on NASDAQ on the first 50 trading days of 2013. Horizontal axis: daily volume (in millions); vertical axis: density.]

– a density histogram allows for intervals with different widths;

– a bell-shaped density histogram can be approximated by the normal curve (see below);

– a density histogram has an interpretation that resembles the interpretation of a probability distribution curve (see below).

1.4 Summarizing data by numbers: average

We already saw that the median is a measure of the central tendency of the data distribution. Another useful measure of central tendency is the average.

The formula to compute the average of a list of measurements is:

average = (sum of all measurements) / (how many measurements there are)


Here is an example. Suppose you collected the price of the same bottle of wine in five restaurants:

€2, €2, €4, €5, €7

The average price is:

average = (€2 + €2 + €4 + €5 + €7) / 5 = €20 / 5 = €4

A disadvantage is that the average is sensitive to outliers (exceptionally low or exceptionally high values). Suppose that the list looked like this:

€2, €2, €4, €5, €22

The average of this list is:

average = (€2 + €2 + €4 + €5 + €22) / 5 = €35 / 5 = €7

The one exceptionally expensive bottle of €22 pulled the average up quite a lot.

In cases like this we often prefer to use a different measure of central tendency: the median. To find the median, first rank the values from low to high. Then take the middle value. The median of the list {€2, €2, €4, €5, €22} is €4. The median of the first list {€2, €2, €4, €5, €7} is also €4. As you can see, the outlier doesn’t affect the median. When a density histogram is skewed or when there are outliers, the median usually is a better measure of the central tendency. One example is the distribution of families by income (Freedman et al., 2007, figure 4, p. 36).
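In R, a quick check of this example:

prices <- c(2, 2, 4, 5, 22)   # prices in euros
mean(prices)                  # 7: pulled up by the outlier
median(prices)                # 4: not affected by the outlier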

1.5 Summarizing data by numbers: standard deviation

We have seen how to summarize the central tendency of a data set. Another feature we would like to capture is the spread (or dispersion) of the data. One way to measure the spread is to look at how much the measurements deviate from the average. Let’s go back to the prices of the same bottle of wine in five restaurants:

€2, €2, €4, €5, €7

The average price is:

average = (€2 + €2 + €4 + €5 + €7) / 5 = €20 / 5 = €4

The deviation from the average measures how much a measurement is below (−) or above (+) the average:

deviation = measurement − average

The deviations are:

€2 − €4 = −€2
€2 − €4 = −€2
€4 − €4 = €0
€5 − €4 = +€1
€7 − €4 = +€3


To get an idea of the typical deviation, we could take the arithmetic mean of the deviations:

[(−€2) + (−€2) + €0 + (+€1) + (+€3)] / 5 = €0

It can be easily proven that—whatever the list of measurements—the arithmetic mean of the deviations is always equal to 0: the negative deviations exactly cancel out the positive ones. Therefore statisticians use the quadratic mean of the deviations as a measure of the spread; the outcome is called the standard deviation .
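You can check this claim in R for any list of measurements, for instance the wine prices:

prices <- c(2, 2, 4, 5, 7)     # prices in euros
mean(prices - mean(prices))    # 0 (up to rounding error): the deviations cancel out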

The standard deviation (SD) is a measure of the typical deviation of the measurements from their mean. It is computed as the quadratic mean (or rootmean-square size) of the deviations from the average.

The quadratic mean is usually referred to as the root-mean-square (RMS) size. To obtain the standard deviation, first find the deviations. Then compute the quadratic mean (or root-mean-square size) of the deviations by applying the root-mean-square recipe in reverse order: first square the deviations, then find the (arithmetic) mean of the result, and finally take the (square) root.

In our example:

1. Square the deviations:

(−€2)² = 4 €²
(−€2)² = 4 €²
(€0)² = 0 €²
(+€1)² = 1 €²
(+€3)² = 9 €²

By squaring we get rid of the minus signs. Note that the unit of measurement (here: €) is squared, too.

2. Next find the arithmetic mean (or average) of the results from the previous step:

mean = (4 €² + 4 €² + 0 €² + 1 €² + 9 €²) / 5 = 18 €² / 5 = 3.6 €²

The unit (€) is still squared (€²).

3. Finally take the square root of the result from the previous step:

√(3.6 €²) ≈ €1.90

This is the standard deviation. Note that by taking the square root, the units are € again: the standard deviation has the same unit as the measurements. In this case, the measurements were in euros, so the standard deviation is also in euros.


Expressed as a formula, we get:

SD = √[ sum of (deviations)² / number of measurements ]

(The formula is on the formula sheet, so you don’t have to learn it by heart.)
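As a check on the wine example, the square-mean-root steps can be applied directly in R:

prices <- c(2, 2, 4, 5, 7)     # prices in euros
dev <- prices - mean(prices)   # deviations from the average
sqrt(mean(dev^2))              # SD ≈ 1.90 euro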

The formula above is for the standard deviation of a population. For reasons I won’t explain, a better formula for the standard deviation of a sample is:

SD+ = √[ sum of (deviations)² / sample size ] × √[ sample size / (sample size − 1) ]

that is, you compute the SD with the usual formula (the quadratic mean of the deviations), which is the first factor in the equation above, and then multiply by

√[ sample size / (sample size − 1) ]

(you don’t have to memorize this formula). Because the second factor is larger than 1, the formula gives a value larger than SD. That’s why Freedman et al. (2007) use the notation SD+. For large samples, the difference between SD and SD+ is small. In what follows, we’ll use the SD formula for both samples and populations, unless stated explicitly otherwise. We’ll return to SD+ when we discuss small samples.
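Note that R’s built-in sd() function computes SD+, not SD. A quick sketch of the relation:

x <- c(2, 2, 4, 5, 7)
n <- length(x)
sd(x)                                           # R computes SD+ (it divides by n − 1)
sqrt(mean((x - mean(x))^2)) * sqrt(n/(n - 1))   # SD times the correction factor: same value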

Remember the following rule: few measurements are more than three SDs from the average.¹ This rule holds for histograms of any shape.

Measurements that are more than three SDs from the average (exceptionally small or exceptionally large measurements) are called outliers. To identify outliers, compute the standard scores of all measurements. The standard score expresses how many standard deviations a measurement is below (−) or above (+) the average:

standard score = (measurement − average) / standard deviation

Converting measurements to standard scores is called standardizing.

Let us return to the daily traded volumes of Apple shares (table 1.1). The volumes of Apple shares traded on the first 50 trading days of 2013 have an average of 19 315 460 and a standard deviation of 7 466 246. On 14 March 2013 only 10 828 780 Apple shares were traded. Is that volume exceptionally small?

Compute the standard score for 10 828 780:

(10 828 780 − 19 315 460) / 7 466 246 ≈ −1.13

The standard score of −1.13 means that the volume of 10 828 780 shares was 1.13 standard deviations below the average. Because the absolute value of the standard score (after omitting the minus sign: 1.13) is smaller than 3, we don’t consider 10 828 780 an outlier.

¹ A more precise statement can be made. It can be proven (Chebychev’s Theorem) that at least 8/9 of the measurements fall within 3 SDs of the average, that is, between [average − 3 · SD, average + 3 · SD]. Hence at most 1/9 of the measurements fall outside that interval. You don’t have to memorize this.
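The computation is a one-liner in R; for the 14 March volume:

ave <- 19315460; SD <- 7466246
(10828780 - ave) / SD   # ≈ −1.1: smaller than 3 in absolute value, so not an outlier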

Standard scores have no units. The following example illustrates this. A list of incomes per person for most countries in the world (the Penn World Table, Heston et al. (2012)) has an average of $15 115 and a standard deviation of $18 651. Income per person in Belgium is $39 759. The standard score for income per person in Belgium is:

($39 759 − $15 115) / $18 651 = $24 644 / $18 651 ≈ 1.32

The units in the numerator ($) and denominator ($) cancel each other out, and hence the standard score has no units. That’s why Freedman et al. (2007) refer to computing standard scores as converting a measurement to standard units.

The standard score of 1.32 means that income per person in Belgium is 1.32 standard deviations above the average of all countries in the list. So is income per person in Belgium an outlier?

Shortcut formula for the SD of 0-1 lists. Computing the SD is tedious. To estimate percentages, we’ll be dealing with lists that consist of just zeroes and ones (0-1 lists): for instance, we will model an employee with a private pension plan as a 1, and an employee without a private pension plan as a 0.

The following shortcut formula simplifies the calculation of the SD of 0-1 lists: the standard deviation of a list that consists of just zeroes and ones can be computed as:

SD of 0-1 list = √[ (fraction of ones in the list) × (fraction of zeroes in the list) ]

(This formula is on the formula sheet, so no need to memorize. Just for your information, I posted a proof on the course home page.)

Here is an example. Consider the list {0, 1, 1, 1, 0}. The average is 3/5. The deviations from the average are {−3/5, 2/5, 2/5, 2/5, −3/5}, or {−0.6, 0.4, 0.4, 0.4, −0.6}.

The SD is the root-mean-square size of the deviations:

1. Square the deviations: {0.36, 0.16, 0.16, 0.16, 0.36}

2. Next find the average of the squared deviations:

(0.36 + 0.16 + 0.16 + 0.16 + 0.36) / 5 = 1.20 / 5 = 0.24

3. Finally take the square root to obtain the SD:

SD = √0.24 ≈ 0.4898979

According to the shortcut rule we can compute the SD as:

√[ (fraction of ones) × (fraction of zeroes) ]

which yields:

√(3/5 × 2/5) = √(6/25) = √0.24 ≈ 0.4898979

which indeed is the same result, with far fewer calculations.
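A quick check of both methods in R:

x <- c(0, 1, 1, 1, 0)
sqrt(mean((x - mean(x))^2))   # usual formula: ≈ 0.4898979
p <- mean(x)                  # fraction of ones (3/5)
sqrt(p * (1 - p))             # shortcut formula: same result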


1.6 The normal curve

Many bell-shaped histograms can be approximated by a special curve called the normal curve. The function describing the normal curve is complicated:

y = (1 / √(2π)) e^(−x²/2)

In practice we won’t need this equation: it is programmed in all statistical software packages. The equation describes the standard normal curve , which is the only version of the normal curve we’ll need. In what follows, I’ll refer to the standard normal curve simply as the normal curve.

Figure 1.2 illustrates the properties of the standard normal curve:

1. the curve is symmetric about 0;

2. the area under the curve is 100% (or 1);

3. the curve is always above the horizontal axis.

[Figure 1.2: The standard normal curve. Horizontal axis: standard units (z); vertical axis: density.]

Statisticians use statistical software (on a calculator or a computer) to find areas under the normal curve. On a TI-84, you find the area under the standard normal curve using the normal cumulative density function (normalcdf). The area under the standard normal curve between − 1 and 2 is:


DISTR → normalcdf(−1, 2)

which yields approximately 0.8186. To express the area as a percentage, multiply by 100%:

0.8186 × 100% = 81.86%

The area under the standard normal curve to the right of −1 (that is, between −1 and infinity) is:

DISTR → normalcdf(−1, 10^99)

The area under the standard normal curve to the left of 2 (that is, between minus infinity and 2) is:

DISTR → normalcdf(−10^99, 2)

For the exams, you have to use the TI-84 to find areas under the normal curve. On the course web page I posted an R script (area-under-normal-curve.R) that computes and plots the area under the normal curve between any two values on the horizontal axis.

R Commander has a built-in function to find the area under the normal curve in the left tail or in the right tail:

Distributions → Continuous distributions → Normal distribution → Normal probabilities . . .
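If you prefer plain R, the same areas can be found with the pnorm() function, which gives the area under the standard normal curve to the left of a value:

pnorm(2) - pnorm(-1)   # area between −1 and 2: ≈ 0.8186
1 - pnorm(-1)          # area to the right of −1
pnorm(2)               # area to the left of 2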

1.7 Approximating a density histogram by the normal curve

These are scores of 100 job applicants who took a selection test:

74, 82, 70, 84, 54, 60, 79, 62, 72, 66, 72, 79, 73, 73, 84, 59, 53, 65, 62, 81,
76, 67, 72, 89, 70, 72, 71, 78, 98, 58, 68, 89, 70, 62, 71, 56, 68, 68, 76, 63,
63, 71, 82, 63, 98, 76, 74, 71, 52, 80, 80, 66, 69, 67, 70, 81, 62, 63, 76, 57,
89, 60, 87, 80, 75, 71, 87, 59, 69, 65, 66, 67, 62, 87, 58, 58, 60, 54, 74, 83,
48, 77, 79, 60, 84, 86, 68, 64, 83, 65, 77, 79, 68, 75, 77, 72, 47, 77, 68, 67

(the data are posted on the course web page)

The average of the test scores is about 70, and the standard deviation is about 10 (verify using R Commander). Figure 1.3 shows the density histogram. The histogram is bell-shaped. In 1870, the Belgian statistician Adolphe Quetelet had the idea to approximate bell-shaped histograms by the normal curve (Freedman et al., 2007, p. 78). The horizontal scale of the histogram differs from that of the standard normal curve: most test scores are between 40 and 100, while most of the area under the standard normal curve extends between −3 and +3 on the horizontal axis; and the center of the density histogram is about 70, while the center of the standard normal curve is 0. If we standardize the values, we get what we want. To obtain the standard scores, do:

standard score = (measurement − average) / standard deviation

For example, to standardize the first test score (74; in this case the variable has no units), do:

standard score = (74 − 70) / 10 = 0.4

The list of standard scores is: 0.4; 1.2; 0.0; . . . ; −0.3. Verify that you can compute the first couple of standard scores.
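In R, standardizing the whole list takes one line. A sketch, assuming the 100 test scores are stored in a vector called scores, and using the rounded average (70) and SD (10) from the text:

z <- (scores - 70) / 10   # standard scores
head(z)                   # 0.4  1.2  0.0 ...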


[Figure 1.3: Density histogram of 100 test scores. Horizontal axis: test score (points); vertical axis: density (% per point).]

Figure 1.4 shows the histogram of the standard scores. If you compare with the histogram of the original test scores (figure 1.3) you notice that the shape of the histogram hasn’t changed.

Consider the original test scores. Count the number of job applicants who had a test score between 75 and 85: 25 out of the 100 job applicants had a test score between 75 and 85. So 25% of the job applicants had a test score between 75 and 85. In the histogram (figure 1.3), the percentage corresponds to the area under the histogram between 75 and 85. The standard scores of 75 and 85 are:

(75 − 70) / 10 = +0.5   and   (85 − 70) / 10 = +1.5

In the histogram of the standard scores (figure 1.4) the percentage (25%) corresponds to the area under the histogram between +0.5 and +1.5. The area under the normal curve between +0.5 and +1.5 approximates the area under the histogram between +0.5 and +1.5. Now carefully look at figure 1.4. The normal approximation overestimates the bar over the interval between +0.5 and +1.0, and underestimates the bar over the interval between +1.0 and +1.5. The area under the normal curve between +0.5 and +1.5 is approximately:

DISTR → normalcdf(0.5, 1.5) ≈ 0.2417 ≈ 24.17%


[Figure 1.4: Density histogram of 100 test scores, standardized. Horizontal axis: standard units; vertical axis: density.]

The normal approximation (24.17%) is quite close to the actual percentage (25%).
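The same normal approximation in R:

pnorm(1.5) - pnorm(0.5)   # ≈ 0.2417, that is, about 24.17%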

Use your TI-84 to find the area under the normal curve between −1 and +1. Using the normal approximation, which percentage of measurements will be between ave − SD and ave + SD? Repeat for −2 and +2, and for −3 and +3.

You see that the normal approximation implies the following rule, called the 68-95-99.7 rule. For a bell-shaped histogram:

– approximately 68% of the measurements are within one SD of the average, that is, between ave − SD and ave + SD;

– approximately 95% of the measurements are within two SDs of the average, that is, between ave − 2 · SD and ave + 2 · SD;

– approximately 99.7% of the measurements are within three SDs of the average, that is, between ave − 3 · SD and ave + 3 · SD.

(The 68-95-99.7 rule is not on the formula sheet; you have to know it by heart.)

The normal approximation will turn out to be very useful in statistical inference (drawing conclusions about population parameters on basis of sample evidence).


1.8 Questions for Review

1. What is the difference between a qualitative and a quantitative variable? Illustrate using examples where you consider different characteristics of the students in the class.

2. What is the difference between a parameter and a statistic?

3. What does descriptive statistics do?

4. What does statistical inference do?

5. How can you summarize the distribution of a numerical data set in a table? In a graph?

6. In a density histogram, what does the density represent? What are the units of density? Explain for a hypothetical distribution of heights (in centimeter) of people.

7. When would the median be a better measure of the central tendency of a distribution than the mean? Illustrate by giving an example.

8. What does the standard deviation measure? How is the standard deviation computed?

9. What are the properties of the normal curve?

10. What does the standard score measure? How is the standard score computed?

11. What does the 68-95-99.7% rule say?

1.9 Exercises

1. Download the data file AAPL-HistoricalQuotes.csv from the course web site: http://homepages.vub.ac.be/~lmahens/STA201.html and save the data file to your STA201 folder (directory). The data set contains data about Apple stock. Run R Commander and load the data set: Data → Import data → from text file, clipboard, or URL. . . . A window opens. For “Location of Data File” select “Local file system.” For “Field Separator” select “Commas.” For “Decimal-Point Character” select “Period [.]”. Press OK, navigate to the data file AAPL-HistoricalQuotes.csv, and double-click the file. Your data should now be loaded by R Commander. In the R Commander menu, click the View Data Set button. A new window opens, showing the data set. The variable volume is the variable from table 1.1. Now enter the following line of script in the R script window:

h <- hist(Dataset$volume/1000000, right=FALSE)

and press the Submit button. This command will compute the numbers needed to make a histogram and store them in an object called h. Next, type in the R script window:

h$breaks

and press the Submit button. The output window will display the breaks between the intervals, that is, the boundaries of the intervals used by R when it computes the frequency table. Next, type in the R script window:

h$counts

and press the Submit button. The output window will display the absolute frequencies (counts) of each interval. Next, type in the R script window:

h$density

and press the Submit button. The output window will display the densities of each interval. The densities are expressed as decimal fractions per horizontal unit; to get densities expressed as percentages per horizontal unit you have to multiply by 100%. Finally, type in the R script window:

h$counts/sum(h$counts)

and press the Submit button. The output window will display the relative frequencies for each interval; to get relative frequencies expressed as percentages you have to multiply by 100%.

2. Use the relative frequencies from table 1.2 to compute the densities for the other intervals. Add a column to show the densities. Then draw the density histogram to scale on squared paper.

3. Figure 1.1 shows that the daily traded volumes of Apple shares have a skewed distribution. The average daily volume is 19 315 460 shares. Find the median. Show your work. How do mean and median compare? Is that what you expected from the shape of the histogram? Explain.

4. Find the standard deviation of {1, 1, 1, 1, 0} using two methods: the usual formula (root-mean-square size of the deviations) and the shortcut formula for 0-1 lists. Do you get the same result?

5. The daily traded volumes of Apple shares (table 1.1) have an average of 19 315 460 and a standard deviation of 7 466 246. Is 52 065 570 an outlier? And 43 088 190? Show your work and explain.

6. Use the TI-84 to find the areas under the standard normal curve:

(a) to the right of 1.87
(b) to the left of −5.20
(c) between −1 and +1
(d) between −2 and +2
(e) between −3 and +3

Make a sketch for every case, with the relevant area shaded. Verify your answers using the R script. We’ll get back to cases (c), (d), and (e) in a moment.


7. For the 100 given test scores, find which percentage of job applicants scored between 50 and 60. Then use the normal approximation. Is the normal approximation close?

8. For 164 adult Belgian men born in 1962 the average height is 175.7 centimeter and the SD is 8.2 centimeter (Garcia and Quintana-Domeque, 2007). Suppose that the histogram of the 164 heights follows the normal curve (heights usually do). What is, approximately, the percentage of men in this group with a height of 170 centimeter or less? What is, approximately, the percentage of men in this group with a height between 170 centimeter and 180 centimeter?

9. Of the volumes of Apple shares traded in the first 50 trading days of 2013 (section 1.2) the average is 19 315 460 and the SD is 7 466 246. Find the actual percentage of values between: ave − SD and ave + SD; ave − 2 · SD and ave + 2 · SD; ave − 3 · SD and ave + 3 · SD. Does the 68-95-99.7 rule give a good approximation? Why (not)?


Chapter 2

Probability distributions

2.1 Chance experiments

Examples of chance experiments are: rolling a die and counting the dots; tossing a coin and observing whether you get heads or tails; or randomly drawing a card from a well-shuffled deck of cards and observing which card you get.

It is convenient to think of a chance experiment in terms of the following chance model : randomly drawing one or more tickets from a box. For instance, rolling a die is modeled as randomly drawing a ticket from the box:

1 2 3 4 5 6

In R:

box <- c(1, 2, 3, 4, 5, 6)
sample(box, 1)

Tossing a coin is like randomly drawing a ticket from the box:

heads  tails

In R:

box <- c("heads", "tails")
sample(box, 1)

2.2 Frequency interpretation of probability

Consider the following chance experiment. Roll a die and count the dots. If you get an ace (1), write down 1; if you don’t get an ace (2, 3, 4, 5, or 6), write down 0. Repeat the experiment many times. After each roll, compute the relative frequency of aces up to that point. Make a graph with the number of tosses on the horizontal axis and the relative frequency on the vertical axis.

Figure 2.1 shows the result of 10 000 repetitions in such an experiment. The frequency of aces tends towards 1/6 (16.666. . . %, the horizontal dashed line).

The frequency interpretation of probability states that the probability of an event is the percentage to which the relative frequency tends if you repeat the chance experiment over and over, independently and under the same conditions (Freedman et al., 2007, p. 222).
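A minimal R sketch of this simulation (the variable names are just illustrative):

set.seed(1)                                  # make the simulation reproducible
rolls <- sample(1:6, 10000, replace = TRUE)  # 10 000 rolls of a die
aces <- as.numeric(rolls == 1)               # 1 if ace, 0 otherwise
relfreq <- 100 * cumsum(aces) / (1:10000)    # running relative frequency (%)
plot(relfreq, type = "l", xlab = "Number of repeats",
     ylab = "Relative frequency of aces (%)")
abline(h = 100/6, lty = 2)                   # the 16.666... % reference line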


[Figure 2.1: Frequency of aces in 10,000 rolls of a die. Horizontal axis: number of repeats; vertical axis: relative frequency of aces (%).]

2.3 Drawing with and without replacement

Consider the following box with tickets:

1 2 3 4 5 6

The probability to draw an even number is 3/6:

P(draw is even) = 3/6

Suppose you randomly draw a ticket from the box. The ticket turns out to be 2. Suppose you replace the ticket, and again randomly draw a ticket from the box. This is called drawing with replacement. The conditional probability to draw an even number on the second draw, given that the first draw was 2, is again 3/6. In mathematical notation:

P(2nd draw is even | 1st draw was 2) = 3/6

The vertical bar (|) is shorthand for “given that.” What comes after the vertical bar (|) is called the condition. A probability with a condition is called a conditional probability.

Note that in this case imposing the condition didn’t affect the probability of drawing an even number: whether the first draw was 2 or not doesn’t matter for the second draw, because we replaced the ticket after the first draw. In both cases, the probability of getting an even number was the same (3/6):

P(2nd draw is even | 1st draw was 2) = P(2nd draw is even)

The two events (getting a 2 on the first draw, and getting an even number on the second draw) are said to be independent: the probability of the second event is not affected by how the first event turned out. That is because we were drawing with replacement.

When drawing with replacement, the events are independent.

Now consider a different chance experiment. Suppose you randomly draw a ticket from the box. The ticket turns out to be 2 . Suppose you don’t replace the ticket. The box now looks like this:

1 3 4 5 6

If we now again randomly draw a ticket from the box, this is called drawing without replacement. The conditional probability to draw an even number on the second draw, given that the first draw was 2, now is:

P(2nd draw is even | 1st draw was 2) = 2/5

In this case, what happened in the first draw (as expressed by the condition “1st draw was 2”) does make a difference: the probability of getting an even number differs:

P(2nd draw is even | 1st draw was 2) ≠ P(2nd draw is even)

The two events (getting a 2 on the first draw, and getting an even number on the second draw) are said to be dependent: the probability of the second event is affected by how the first event turned out. That is because we were drawing without replacement.

When drawing without replacement, the events are dependent.

Think of a population as a box with tickets. A random sample is like drawing a number of tickets without replacement from this box. The number of draws is the sample size. Remember this. We’ll use this box model when doing statistical inference.

2.4 The sum of draws

For the theory of statistical inference, we’ll frequently use the concept of the sum of draws. Here’s a simple example: roll a die twice, and add the numbers.

The chance model has the following box:

1 2 3 4 5 6

Draw two tickets with replacement from the box, and add the outcomes. The result is the sum of draws.

The sum of draws is a brief way to say the following (Freedman et al., 2007, p. 280):

– Draw tickets from a box.

– Add the numbers on the tickets.

As the following activity makes clear, the sum of draws is itself a random variable:

(a) Conduct the chance experiment above using an actual die or the following R script:

box <- c(1, 2, 3, 4, 5, 6)
sample(box, 1) + sample(box, 1)

(b) Repeat the experiment a couple of times and write up the outcomes (using an actual die, or in R by running the line sample(box,1) + sample(box,1)).

Would it be fair to say that the sum of draws is a chance variable? Explain.

2.5 Picking an appropriate chance model

We model a population as a box with tickets. Taking a random sample is like randomly drawing a number of tickets from the box, without replacement; the number of draws is the sample size. In order to use such a chance model for inference, we will use some interesting properties of the sum of draws. The trick is to set up the chance model in such a way that the chance variable of interest is the sum of draws, or is computed from the sum of draws. An example clarifies my argument.

Suppose you roll a die three times, and want to know what the sum of the outcomes is. What is the appropriate chance model? What is the chance variable?

An appropriate chance model is a box with six tickets:

1 2 3 4 5 6

and the chance variable is the sum of three random draws with replacement from the box. For instance, if you roll 3, 2, and 6, this corresponds to drawing tickets 3, 2, and 6. The sum of draws (3 + 2 + 6 = 11) is obtained by adding up the outcomes.

Now suppose that we are interested in another question: how many times (out of three rolls) will we get a six? First, we need the appropriate chance model. When we roll a die, we can get two kinds of outcomes: either we get a six (we’ll label this outcome as a success), or we get another number (1, 2, 3, 4, or 5: not a success). The term success is used here in a technical meaning: the outcome we are interested in. Note that we classify the outcomes of a single roll as a success or not a success. In such a case, the appropriate chance model is a box with six tickets: one ticket 1 for the outcome 6 labelled as a success, and five tickets 0 for the outcomes 1, 2, 3, 4, or 5 labelled as not a success:

0 0 0 0 0 1


Now we are interested in the number of sixes in three rolls, so we need to count the sixes. Counting the sixes is the same thing as taking the sum of three draws from the 0-1 box. For instance, if you roll 3, 2, and 6, this corresponds to drawing tickets 0, 0, and 1 (we classified each outcome as a success or not a success). The sum of draws (0 + 0 + 1 = 1) is the number of sixes (the number of successes). A box like this, with tickets that can only take values 0 and 1, is called a 0-1 box. Remember that when the problem is one of classifying and counting, the appropriate box is a 0-1 box.
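In R, counting the sixes in three rolls with the 0-1 box looks like this:

box <- c(0, 0, 0, 0, 0, 1)            # one success ticket (a six), five failures
sum(sample(box, 3, replace = TRUE))   # number of sixes in three rolls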

Here’s a real-world example. Suppose you are the marketing manager of a telecommunications company that doesn’t cover Brussels yet. You would like to find out which percentage of households in Brussels already has a tablet. The population of interest is all households in Brussels. Think of each household in Brussels as a ticket in a box, so there are as many tickets as households. A ticket takes value 1 if the household has a tablet, and 0 if the household doesn’t.

Taking a random sample of households is like randomly drawing tickets without replacement from this 0-1 box. The number of households in the sample who have a tablet is the sum of draws. The percentage of households in the sample who have a tablet is:

sample percentage = (sum of draws / size of the sample) × 100%

2.6 Probability distributions

Chance experiments can be described using probability distributions. In what follows, we’ll focus on the probability distribution of the sum of draws. Suppose you roll a die twice and add the outcomes. The chance model is: randomly draw two tickets with replacement from the box

1 2 3 4 5 6

and add the outcomes.

The chance variable (the sum of the two draws) can take the following values: { 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 } (the chance variable is discrete; we won’t develop the theory for continuous chance variables). For each of these possible outcomes, we can compute the probability. There are 36 possible combinations:

     1    2    3    4    5    6
1    2    3    4    5    6    7
2    3    4    5    6    7    8
3    4    5    6    7    8    9
4    5    6    7    8    9   10
5    6    7    8    9   10   11
6    7    8    9   10   11   12

Each of these 36 combinations has the same probability, and as the probabilities have to add up to 1, each combination has a probability of 1/36. By applying the rules of probability, we can find the probability that the sum of draws takes the value 2, and then repeat the work to find the probability that the sum of draws takes the value 3, and so on. There are for instance two combinations that yield a sum of 3:


– when the first draw is 1 and the second draw is 2 (row 1, column 2 in the table above)

– when the first draw is 2 and the second draw is 1 (row 2, column 1)

The probability that the sum of draws is 3 is therefore equal to:

P(sum is 3) = P[(first 1, then 2) or (first 2, then 1)]

Apply the addition rule (Freedman et al., 2007, pp. 241–242) to obtain:

P(sum is 3) = P(first 1, then 2) + P(first 2, then 1) − something

The third term (“minus something”) is equal to zero because the events (first 1, then 2) and (first 2, then 1) are mutually exclusive (two events are mutually exclusive when, as one event happens, the other cannot happen at the same time). So we get:

P(sum is 3) = P(first 1, then 2) + P(first 2, then 1) − 0 = 1/36 + 1/36 = 2/36

If you do this for all other possible values of the chance variable, you get the following table:

outcome        2     3     4     5     6     7     8     9    10    11    12
probability   1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

A table that shows all possible values for a (discrete) chance variable and the corresponding probabilities is called a probability distribution .
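You can generate this table in R by listing all 36 equally likely combinations:

sums <- outer(1:6, 1:6, "+")   # 6-by-6 table of all possible sums
table(sums) / 36               # probability of each outcome (compare with the table above)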

We can graph the probability distribution as a bar chart. On the horizontal axis we put the chance variable, and we construct the bar chart in such a way that the area of a bar shows the probability (expressed as a percentage), just as in a density histogram the area of a bar showed the relative frequency (expressed as a percentage) of the data over the interval. That is why Freedman et al. (2007, pp. 310–316) call such a bar chart a probability histogram. For a discrete chance variable the convention is to center the bars on the values that the variable can take: the bar over 2 will start at 1.5 and end at 2.5; the bar over 3 will start at 2.5 and end at 3.5, and so on. The width of each bar is equal to 1. The height of each bar in a probability distribution is called probability density: the probability per unit on the horizontal axis. We find the probability densities by applying the formula for the area of a rectangle:

area = width × height

We want the area to represent the probability (expressed as a percentage) and the height to represent the probability density (expressed as percent per unit on the horizontal axis), and hence the equation becomes:

probability = (width of interval on horizontal axis) × (probability density)

Divide both sides of the equation by (width of interval on horizontal axis) to obtain:

probability density = probability / (width of interval on horizontal axis)


Because the width of each interval on the horizontal axis is one unit, this becomes:

probability density = probability per unit on the horizontal axis

which gives us the meaning of probability density.

For example, the probability to get a 7 is 6/36 (= 16.66. . . %). The probability density over the interval from 6.5 to 7.5 then is equal to:

probability density = 16.66. . . % / (7.5 − 6.5) = 16.66. . . % per unit on the horizontal axis

Figure 2.2 shows the corresponding bar chart representing the probability distribution . The curve traced by the bar chart of the probability distribution is called the probability density function . The probability density function has the following properties:

– the curve is always on or above the horizontal axis, that is, the probability density (on the vertical axis) is always 0 or positive;

– the area under the curve is equal to 1 (or 100%);

– the area under the curve between two values on the horizontal axis gives the probability.

The probability distribution has an expectation and a standard error. The following example illustrates the intuition of these concepts. Roll a die twice and add the numbers. You can do that with an actual die, or run the following

R script:

box <- c(1, 2, 3, 4, 5, 6)
sample(box, 1) + sample(box, 1)

Repeat this a couple of times, and write down the outcomes. You will get something like {6, 7, 10, 8, 10, . . . }. The outcomes are random. The lowest value you can get is 2 (when you roll two aces), and the highest value is 12 (when you roll two sixes). If you repeat the experiment many times you’ll notice that those extreme values occur only occasionally; values like 6, 7, or 8 occur much more frequently. The expectation is the typical value that the random variable will take; the value around which the outcomes vary. Another way to think about the expectation is as the center of the probability distribution (figure 2.2). In this case the expectation is 7 (we’ll see below how to compute the expectation). Now define the difference between the outcome of a chance experiment and the expectation as the chance error. For instance, our first outcome was 6, the expectation is 7, and hence the chance error was:

chance error = outcome − expectation = 6 − 7 = −1

(the negative value means that the outcome was 1 below the expectation).

If we compute the chance errors for the other outcomes, we get:


[Figure 2.2: Probability distribution of the sum of two rolls of a die. Horizontal axis: outcome; vertical axis: probability density (%).]

outcome   chance error   (without the minus sign)
6         −1             (1)
7          0             (0)
10        +3             (3)
8         −1             (1)
10        +3             (3)
. . .     . . .          (. . .)

(The typical value of the outcomes is the expectation; the typical size of the chance errors is the standard error.)

The third column shows the chance errors without the minus sign. The standard error is the typical size of the chance errors (without the minus sign).

Average and expectation are related concepts: the average is a measure of the central tendency of data (represented in a density histogram), and the expectation is a measure of the central tendency of a chance variable (represented in a probability density graph). Similarly, the standard deviation is a measure of the spread of data around the average, and the standard error is a measure of the “spread” of a chance variable around the expectation. In brief:

                   central tendency   spread
data               average            standard deviation (SD)
chance variable    expectation (E)    standard error (SE)

Let us now define these concepts more rigorously.


2.7 Intermezzo: a weighted average

To define the expectation and standard error of a discrete chance variable, we need the concept of a weighted average. A weighted arithmetic average of a list of numbers is obtained by multiplying each value in the list by a weight and adding up the outcomes; each of the weights is a number between zero (included) and one (included), and the weights add up to one. Suppose the first value in the list is x1 and its weight is w1, the second value in the list is x2 with weight w2, . . . , and the last (nth) value in the list is xn with weight wn; then the weighted average is:

(w1 × x1) + (w2 × x2) + . . . + (wn × xn)

An example is the way a professor computes the students’ grades for a course.

Here are the weights for the graded components of a course, and the results for a student:

component                        weight (%)   result (score/20)
assignment 1                      7.50        12
assignment 2                      7.50        14
assignment 3                      7.50        16
assignment 4                      7.50        12
participation and preparedness   10.00        16
midterm exam                     30.00        12
final exam                       30.00        17

Each weight is between 0 and 1: 7.50 percent is 0.075, 10 percent is 0.10, and 30 percent is 0.30. Moreover, the sum of the weights is equal to 1:

7.50% + 7.50% + 7.50% + 7.50% + 10.00% + 30.00% + 30.00% = 100% = 1

The weighted average of the scores is:

(0.075 × 12) + (0.075 × 14) + (0.075 × 16) + (0.075 × 12) + (0.10 × 16) + (0.30 × 12) + (0.30 × 17) = 14.35

So this student has an overall score of 14.35/20.
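R has a built-in function for this computation:

w <- c(7.5, 7.5, 7.5, 7.5, 10, 30, 30) / 100   # weights (they add up to 1)
x <- c(12, 14, 16, 12, 16, 12, 17)             # results (score/20)
weighted.mean(x, w)                            # 14.35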

2.8 Expectation (E)

Just as the average is a measure of the central tendency of a density histogram, the expectation of a chance variable is in a sense a measure for the central tendency of a probability distribution. For a discrete chance variable, the expectation is defined as the weighted average of all possible values that the chance variable can take; the weights are the probabilities.

The probability distribution of the sum of two rolls of a die is:

outcome        2     3     4     5     6     7     8     9    10    11    12
probability   1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

The expectation of the chance variable “sum of two rolls of a die” (or of two draws with replacement from a box with the tickets {1, 2, 3, 4, 5, 6}) is the weighted average:

(2 × 1/36) + (3 × 2/36) + (4 × 3/36) + (5 × 4/36) + (6 × 5/36) + (7 × 6/36) + (8 × 5/36) + (9 × 4/36) + (10 × 3/36) + (11 × 2/36) + (12 × 1/36)
= (2 + 6 + 12 + 20 + 30 + 42 + 40 + 36 + 30 + 22 + 12) / 36
= 252/36
= 7

Let the operator E denote the expectation:

E(sum of two rolls of a die) = 7
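The same weighted average in R:

outcome <- 2:12
prob <- c(1:6, 5:1) / 36   # probabilities 1/36, 2/36, ..., 6/36, ..., 1/36
sum(outcome * prob)        # expectation: 7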

2.9 Standard error (SE)

Just as the standard deviation is a measure of the spread of a density histogram, the standard error of a chance variable is in a sense a measure for the spread of a probability distribution.

We defined the chance error as the difference between the outcome of a chance variable and the expectation of that chance variable.

If the chance experiment is to roll a die twice and add the outcomes, we could get 2 as an outcome; in that case the chance error is 2 − 7 = −5. For the outcome 3, the chance error is 3 − 7 = −4, etc. It is useful to add the chance errors to the table of the probability distribution:

outcome         2     3     4     5     6     7     8     9    10    11    12
probability    1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36
chance error   −5    −4    −3    −2    −1     0    +1    +2    +3    +4    +5

The standard error of a discrete chance variable is defined as the weighted quadratic average of the chance errors; the weights are the probabilities. (A quadratic average is the root-mean-square size.)

Start from the chance errors in the example (the third line we just added to the table of the probability distribution):

−5, −4, −3, −2, −1, 0, 1, 2, 3, 4, 5

1. Square . First square the chance errors: (−5)², (−4)², (−3)², (−2)², (−1)², 0², 1², 2², 3², 4², 5². This yields:

25, 16, 9, 4, 1, 0, 1, 4, 9, 16, 25

2. Mean . Then take the weighted average. Use the probabilities of the chance errors as the weights:

(1/36 × 25) + (2/36 × 16) + (3/36 × 9) + . . . ≈ 5.83

Verify that this indeed yields approximately 5.83 (a spreadsheet is helpful).

3. Root . Finally take the square root:

√5.83 ≈ 2.42

The standard error of the sum of two draws from { 1, 2, 3, 4, 5, 6 } is approximately 2.42. You can think of this as the typical size of the chance errors.
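The square–mean–root steps translate directly into R (a sketch, reusing the vectors from the expectation computation):

outcomes <- 2:12
probs <- c(1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1)/36
errors <- outcomes - 7         # the chance errors
sqrt(sum(probs * errors^2))    # SE: approximately 2.42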


2.10 Expectation and SE for the sum of draws

When doing statistical inference, we’ll use the sum of draws with replacement from a box with tickets. The formulas for the expectation and the standard error of discrete probability distributions from the previous sections also apply if the chance variable is a sum of draws. However, the computations can become tedious. It can be shown that the following formulas hold:

E (sum of draws) = (number of draws) × (average of box)

SE(sum of draws) = √(number of draws) × (SD of the box)

“Average of box” means: the average of the values on the tickets in the box; similarly “SD of the box” means the SD of the values on the tickets in the box.

You don’t have to memorize these formulas; they are on the formula sheet. In inference, the box will represent the population, so the average of the box is the population average and the SD of the box is the population SD.

Let us apply these formulas to the example from the previous sections: roll a die twice and add the outcomes. The chance model is: randomly draw two tickets with replacement from the box { 1, 2, 3, 4, 5, 6 } and add the outcomes. We found in the previous sections that the expectation is 7 and the SE is approximately 2.42. What if we use the formulas for the expectation and the SE of the sum of draws?

To apply the formula for the expectation of the sum of draws we first need the average of the box:

average of box = (1 + 2 + 3 + 4 + 5 + 6) / 6 = 21/6

The expectation of the sum of two draws is:

E(sum of draws) = (number of draws) × (average of box) = 2 × 21/6 = 21/3 = 7

This is the same number we found by applying the definition of the expectation.

To apply the formula for the standard error for the sum of draws, we first need the SD of the box; the SD of the box is about 1.71 ( exercise : verify this).

Then apply the formula for the standard error for the sum of draws:

SE(sum of draws) = √(number of draws) × (SD of the box) ≈ √2 × 1.71 ≈ 2.42

This is the same number we found by applying the definition of the standard error.
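Both formulas are easy to check in R (a sketch; note that R's built-in sd() divides by n − 1, so we compute the SD of the box directly):

box <- 1:6
ave.box <- mean(box)                     # average of box: 3.5
sd.box <- sqrt(mean((box - ave.box)^2))  # SD of box: about 1.71
2 * ave.box                              # E(sum of two draws): 7
sqrt(2) * sd.box                         # SE(sum of two draws): about 2.42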

2.11 The Central Limit Theorem

Consider again the chance experiment: roll a die twice and add the outcomes.

The chance model is: randomly draw two tickets with replacement from the box { 1, 2, 3, 4, 5, 6 } and add the outcomes. The chance variable is the sum of draws. A histogram of the box (the list of numbers { 1, 2, 3, 4, 5, 6 }) is shown in figure 2.3. Note that the histogram is not bell-shaped at all.

(Figure 2.3: Histogram of the dots on a die.)

We already computed and plotted the probability distribution of the sum of two draws (figure 2.2). Figure 2.4 compares the probability distribution with the normal curve. The normal curve approximates the probability distribution reasonably well. From the probability distribution table (section 2.9) we know that the probability to get an outcome between 5 (included) and 7 (included) is

4/36 + 5/36 + 6/36 = 15/36 ≈ 42%

In figure 2.4 the probability of 42% corresponds to the area of the bar over 5 (between 4.5 and 5.5), plus the area of the bar over 6 (between 5.5 and 6.5), plus the area of the bar over 7 (between 6.5 and 7.5). The area under the normal curve between 4.5 and 7.5 approximates the area under the blocks. We can find the area under the normal curve between 4.5 and 7.5 using statistical software.

First, standardize the boundaries of the interval (4.5 and 7.5). The variable on the horizontal axis is a chance variable, not data, so we use the expectation instead of the average and the standard error instead of the standard deviation to standardize:

chance variable in standard units = (value − expectation) / SE

The left boundary (4.5) in standard units is approximately:

(4.5 − 7) / 2.42 ≈ −1.04

The right boundary (7.5) in standard units is approximately:

(7.5 − 7) / 2.42 ≈ 0.21

To find the area under the standard normal curve between −1.04 and 0.21 on the TI-84, use the normalcdf function:

normalcdf(-1.04,0.21)

which yields approximately 0.43 or 43%. The normal approximation (43%) is close to the actual probability (42%).

(Figure 2.4: Probability distribution of the sum of two rolls of a die.)
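In R the same area is given by pnorm, the cumulative normal distribution function (a one-line alternative to the TI-84):

pnorm(0.21) - pnorm(-1.04)   # approximately 0.43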

The example illustrates the central limit theorem :

When drawing at random with replacement from a box, the probability distribution for the sum of draws will follow the normal curve, even if the contents of the box do not. The number of draws must be reasonably large.

When is the number of draws “reasonably large”? Consider a box with 99 tickets 0 and one 1 . The histogram of the box is very skewed (figure 2.5).

Let us now investigate how the sum of 100, 400, or 900 draws from this skewed box is distributed (the calculations to find the probabilities are very tedious and are done using statistical software). The top panel in figure 2.6 shows the distribution of the sum of 100 draws; the probability distribution of the sum is skewed. The middle panel in figure 2.6 shows the distribution of the sum of 400 draws; the probability distribution of the sum is still skewed, but less so than in the case of 100 draws. The bottom panel in figure 2.6 shows the distribution of the sum of 900 draws; the probability distribution of the sum is pretty much bell-shaped.

This example illustrates that the number of draws required to use the normal approximation for the sum of draws differs from case to case. When rolling a die (drawing from a box { 1, 2, 3, 4, 5, 6 }), two draws were sufficient. Generally, when drawing from a box with a histogram that is not too skewed, often 30 draws will suffice. But when drawing from a very skewed box, often hundreds or even thousands of draws are needed before the normal curve is a reasonably good approximation of the probability distribution of the sum of draws.

(Figure 2.5: Histogram of a box with 99 tickets 0 and one 1.)
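You can get a feel for this with a small simulation in R (a sketch; the number of repeats, 10 000, is an arbitrary choice):

box <- c(rep(0, 99), 1)   # the very skewed box
sums <- replicate(10000, sum(sample(box, 900, replace = TRUE)))
hist(sums)   # fairly bell-shaped for 900 draws; try 100 draws to see the skew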

Why is the central limit theorem important? When doing statistical inference, we will use a sample drawn from a population. The sample is like tickets drawn from a box (the box represents the population). The sample statistic (for instance, the sample proportion) is a chance variable: as the sample is random, so is the sample statistic. We can use the central limit theorem to approximate the probability distribution of the sample statistic by the normal curve. But the normal approximation is only good if the sample is large enough.


(Figure 2.6: Probability distributions of the sum of 100, 400, and 900 draws from a box with 99 tickets 0 and one ticket 1; panel (a): sum of 100 draws; panel (b): sum of 400 draws; panel (c): sum of 900 draws.)

2.12 Questions for Review

1. The chance of drawing the queen of hearts from a well-shuffled deck of cards is 1 / 52. Explain what this means, using the frequency interpretation of probability.

2. What is the difference between drawing with and without replacement? Use as an example drawing a ball from a fishbowl filled with white and red balls.

3. When are two events independent? Give an example, referring to a fishbowl filled with white and red balls.

4. What does the sum of draws mean?

5. Explain the difference between adding and classifying & counting .

6. What does the addition rule say?

7. When are two events mutually exclusive?


8. What is a probability distribution for a discrete chance variable? Which properties should it have?

9. What is a probability density histogram? Which properties does it have?

10. What is probability density?

11. What is a chance error?

12. What is a weighted average?

13. What is the expectation of a discrete chance variable?

14. What is the standard error of a discrete chance variable?

15. What does the Central Limit Theorem say?

2.13 Exercises

1. Conduct the experiment described in section 2.2 using an actual die (or with http://www.random.org/dice/?num=1 ). Roll the die ten times. After each roll, compute the relative frequency of aces up to that point. Complete the following table:

Table 2.1: Number of aces in rolls of a die

Repeat   Ace (1) or not (0)   Absolute frequency (*)   Relative frequency, % (*)
1
2
3
4
5
6
7
8
9
10

(*) Absolute and relative frequency of aces in this and all previous repeats

Plot the number of tosses on the horizontal axis and the relative frequency on the vertical axis.

2. Conduct the experiment described in section 2.2 using the R script roll-a-die.R on the course home page (the script simulates 10 000 rolls of a die). What does the graph look like? Run the script again. Is the graph exactly the same? How does it differ? In what respect is it similar? Run the script once more. Is there a pattern?


3. You roll a die twice and add the outcomes. Find the probability to get a 10. Show your work and explain.

4. You toss a coin twice and count the number of heads. Construct a probability distribution table and a probability density histogram. What does the area under a bar in the probability density histogram show? And the height of a bar? Find the expectation, the chance errors, and the standard error. (This was an exam question in Fall 2015.)

5. Consider the following chance experiment: roll a die and count the number of dots. Formulate an appropriate chance model. What are the possible outcomes? What are the probabilities? Make a table and a bar chart of the probability distribution (in the chart, put the probability density on the vertical axis). Compute the expectation and the standard error.

6. Work parts (a) and (b) of Freedman et al. (2007, review exercise 2 p. 304).


Chapter 3

Sampling Distributions

A sample percentage is a chance variable, with a probability distribution. The probability distribution of a sample percentage is called a sampling distribution (the probability distribution of a sample average is also a sampling distribution). This chapter discusses the properties of sampling distributions.

The next two chapters build on the properties of sampling distributions to estimate confidence intervals and test hypotheses for the percentage or the average of a population.

3.1 Sampling distribution of a sample percentage

In a small town there are 10 000 households. 4 600 households (46% of the total) own a tablet. The population percentage (46%) is a parameter : a numerical characteristic of the population.

A market research firm doesn't know the parameter. It tries to estimate the parameter by interviewing a random sample of 100 households. The researcher counts the number of households in the sample who own a tablet and computes the sample percentage:

sample percentage = (number in the sample / size of sample) × 100%

The sample percentage is a statistic : a numerical characteristic of a sample.

We model the population as a box with 10 000 tickets. Every household that owns a tablet is represented by a ticket 1 , and every household that doesn’t own a tablet is represented by a ticket 0 :

5400 tickets 0 4600 tickets 1

Of course, the market research firm doesn't know how many out of the 10 000 tickets are tickets with a 1 (but we do). The random sample is like randomly drawing 100 tickets from the box without replacement. The researcher counts the number of tickets with 1 (the number of households in the sample who own a tablet). Suppose they draw:

0 0 1 0 1 . . . 0

The number of households in the sample who own a tablet is then equal to:

0 + 0 + 1 + 0 + 1 + . . . + 0

that is, the number in the sample is the sum of draws from the 0-1 box . As the researcher computes the sample percentage:

sample percentage = (number in the sample / size of sample) × 100%

the numerator (the number of households in the sample who own a tablet) is the sum of draws from the 0-1 box. Hence the sample percentage is computed from the sum of draws. Remember this.

Will the sample percentage be equal to the percentage in the population?

We can find out by simulating the experiment described above in R. First we define the box with 4 600 tickets 1 and 5 400 tickets 0:

population <- c(rep(1,4600),rep(0,5400))

This line of code generates a list (called "population") of 10 000 numbers: 4 600 times 1 and 5 400 times 0. You can check this by letting R display a table summarizing the contents of the list called "population":

table(population)

Now take a random sample of 100 households from the population:

sample(population,100,replace=FALSE)

You get a list of 100 numbers that looks something like this:

0 1 1 0 1 1 1 0 ... 1 0 0 1

The researcher is interested in the number of households in the sample who own a tablet. That number is the sum of the draws:

sum(sample(population,100,replace=FALSE))

You get something like:

39

So this sample contained 39 households who own a tablet (and 61 who don't). If you divide the number in the sample (39) by the sample size (100) and multiply by 100%, you get the sample percentage:

sample percentage = (number in the sample / size of sample) × 100% = (39/100) × 100% = 39%

So the sample percentage (39%) is not equal to the percentage in the population (46%). That should be no surprise: the sample percentage is just an estimate of the population percentage, based on a random sample of 100 out of the 10 000 tickets. The difference between the estimate (the sample percentage) and the parameter (the population percentage) is called the chance error :

chance error = sample percentage − population percentage

In this case the chance error is:

chance error = 39% − 46% = −7%

(the minus sign indicates that the sample percentage underestimates the population percentage). Of course, because the researcher doesn't know the population percentage, she doesn't know how big the chance error she made is; all she knows is that she made a chance error.


Why is the estimation error called a chance error? That's because the estimation error is a chance variable. That can be easily seen by repeating the line sum(sample(population,100,replace=FALSE)) a couple of times, and computing the sample percentage for every sample. You'll get something like: 39, 43, 46, 37, 43, 52, . . . : the sample percentage is a chance variable. In a table:

                 sample percentage   chance error   (without the minus sign)
                 39                  −7             (7)
                 43                  −3             (3)
                 46                   0             (0)
                 37                  −9             (9)
                 52                   6             (6)
                 . . .               . . .          . . .
typical value:   expectation                        standard error

So in repeated samples, the sample percentage is a chance variable. The sample percentage has a probability distribution (called the sampling distribution of the sample percentage). The expectation of the sample percentage is the typical value around which the sample percentage varies in repeated samples (take a look at the first column: do you have a hunch what the expectation of the sample percentage is?). The standard error of the sample percentage is the typical size of the chance error (after you omitted the minus signs, as shown in the third column).
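You can mimic this table in R (a sketch; it reuses the population vector defined above):

counts <- replicate(6, sum(sample(population, 100, replace = FALSE)))
counts        # e.g. 39 43 46 37 43 52; with n = 100 the count equals the percentage
counts - 46   # the chance errors (46% is the population percentage)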

It can be shown that the expectation of the sample percentage is the population percentage (proof omitted):

E (sample percentage) = population percentage

The sample percentage is said to be an unbiased estimator of the population percentage. This also implies that

E (chance error) = 0

To find the SE for the sample percentage, start from:

sample percentage = (number in the sample / size of sample) × 100%

which can be written as:

sample percentage = (sum of draws / number of draws) × 100%

Take the standard error of both sides:

SE(sample percentage) = (SE(sum of draws) / number of draws) × 100%

From the square root law (p. 29) we know that for random draws with replacement:

SE(sum of draws) = √(number of draws) × (SD of the box)


This is still approximately true for draws without replacement, provided that the population is much larger than the sample:

SE(sum of draws) ≈ √(number of draws) × (SD of the box)

So the expression for the SE for the sample percentage becomes:

SE(sample percentage) = ((√(number of draws) × (SD of the box)) / number of draws) × 100%

or:

SE(sample percentage) ≈ ((SD of the box) / √(sample size)) × 100%

You don’t have to memorize this formula. The formula is only approximately right because taking a sample is drawing without replacement. When the population is much bigger than the sample, the distinction between drawing with and without replacement becomes small (Freedman et al., 2007, pp. 367-370).

In that case, the formula gives a good approximation.

To find the SD of the population, use the shortcut rule for 0-1 lists:

SD of population = √((fraction of ones) × (fraction of zeroes)) = √((4600/10000) × (5400/10000)) ≈ 0.50

(Of course, the researcher doesn't know the fraction of ones in the population (the fraction of households in the population who own a tablet). If the sample is large, she can estimate the SD of the population by the SD of the sample. This technique is called the bootstrap . We'll get back to this when we discuss inference.)

Now we can find the standard error of the sample percentage:

SE(sample percentage) ≈ ((SD of population) / √(sample size)) × 100% ≈ (0.50/√100) × 100% ≈ 5%

In sum: if many researchers would each take a random sample of 100 households and compute the sample percentage, the sample percentage will be about 46% (the expectation), give or take 5% (the standard error).
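A quick check of this SE in R (a sketch):

sd.box <- sqrt(0.46 * 0.54)    # SD of the box: about 0.50
(sd.box / sqrt(100)) * 100     # SE of the sample percentage: about 5 (percentage points)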

What is the shape of the sampling distribution ? A computer simulation is helpful (see the R script 100000-repeats.R). Let us start from the box:

5400 tickets 0   4600 tickets 1

and let a researcher (who doesn't know the contents of the box) randomly draw 100 tickets from the box without replacement. The researcher uses the sample to compute the sample percentage (the percentage of 1s in the sample). We write down the result (say, 39%) and toss the tickets back in the box. Then we let another researcher randomly draw 100 tickets from the box without replacement and compute the sample percentage, and so on. The computer simulation repeats this chance experiment 100 000 times, so 100 000 researchers each

randomly draw 100 tickets from the box without replacement and compute the sample percentage. The result is a list of 100 000 sample percentages (39%, 43%, . . . ). Even for a computer this is a lot of work, so running the simulation can take a while. The program finally plots a density histogram of the sample percentages of the 100 000 researchers (figure 3.1). Given the frequency interpretation of probability and the large number of repeats, the density histogram (figure 3.1) resembles the probability distribution of the sample percentage. The density histogram shows that most researchers found a sample percentage in the neighborhood of 46%: almost all come up with a sample percentage between 31% and 61%. The distribution is clearly bell-shaped. Why is that the case?

(Figure 3.1: Density histogram of the sample percentages in 100 000 repeats.)
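A minimal version of such a simulation looks like this (a sketch; the actual script on the course web site may differ in details, and 100 000 repeats can take a while):

population <- c(rep(1, 4600), rep(0, 5400))
percentages <- replicate(100000, sum(sample(population, 100, replace = FALSE)))
hist(percentages, freq = FALSE)   # bell-shaped, centered near 46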

Remember that each researcher computes the sample percentage as:

sample percentage = (number in the sample / size of sample) × 100% = (sum of draws / size of sample) × 100%

that is, the sample percentage is computed from the sum of draws. From the central limit theorem we know that the sum of draws follows the normal curve, if we draw with replacement and if the number of draws is reasonably large. The researchers drew without replacement, but when the size of the population is much larger than the size of the sample, the normal curve will still be a reasonably good approximation (in this case the size of the population is 10 000 and the size of the sample is 100). If the sum of draws follows the normal curve, so will the sample percentage.


The 68-95-99.7% rule applies (using expectation instead of average, and SE instead of SD). Most (approximately 99.7%) of the sample percentages fall within three standard errors of the expectation, that is, between

46% − 3 × 5%   and   46% + 3 × 5%

or between

31% and 61%

Similarly, approximately 95% of the sample percentages fall within two standard errors of the expectation, that is, between

46% − 2 × 5%   and   46% + 2 × 5%

or between

36% and 56%

To summarize the properties of the sampling distribution of the sample percentage :

sample percentage = (sum of draws / size of the sample) × 100%

1. The sample percentage is an unbiased estimator of the population percentage:

E (sample percentage) = population percentage

2. The standard error is:

SE(sample percentage) ≈ ((SD of population) / √(sample size)) × 100%

This approximation is good if the population is much larger than the sample. The SD of the population (the box) can be found using the shortcut formula for 0-1 lists. (You don't have to memorize the formula for the SE.)

3. If the sample is large, the sampling distribution of the sample percentage approximately follows the normal curve (central limit theorem).

3.2 Sampling distribution of a sample average

We can now follow the same line of reasoning for the sampling distribution of a sample average. Suppose a market research firm is interested in the annual household income of the 10 000 households in a small town. Let us model this population as a box with 10 000 tickets. On each ticket the annual income of a household is written:

23 275   54 982   32 833   . . .

The average annual income of all households in the population is 27 788; this average is a parameter : a numerical characteristic of the population. The standard deviation of the population is 8 245 (another parameter). The market research firm doesn't know these parameters, and would like to estimate the population average by taking a random sample of 100 households. Suppose the sample looks like this:

26 419   47 001   . . .   14 981   (100 tickets)

The sample average is called a statistic (a numerical characteristic of a sample).

The researcher will compute the sample average by adding up the incomes in the sample and dividing by how many there are:

sample average = (26 419 + 47 001 + . . . + 14 981) / 100

that is:

sample average = sum of draws / sample size

Just like in the previous example, the sample average is a chance variable: in repeated samples, the outcome would be different. Just like in the previous example, the sample average is computed from the sum of draws, so (under certain conditions) the central limit theorem applies.

It can be shown that the sample average is an unbiased estimator of the population average. The standard error for the sample average is:

SE(sample average) = SE(sum of draws) / sample size

Using the square root law for the SE of the sum of draws, we get:

SE(sample average) ≈ (√(number of draws) × (SD of the box)) / sample size

The number of draws is the same thing as the sample size, and the SD of the box is the same thing as the SD of the population, so we get:

SE(sample average) ≈ (√(sample size) × (SD of population)) / sample size

which can be written as:

SE(sample average) ≈ SD(population) / √(sample size)

The SD of the population is given as 8 245, so the SE for the sample average is:

SE(sample average) ≈ 8 245 / √100 = 824.5

This means that a researcher who tries to estimate the population average using a random sample of 100 households is typically going to be off by 824.5 or so; 824.5 is the typical size of the chance error that a researcher will make.

(Of course, the researcher doesn't know the SD of the population. If the sample is large, she can estimate the SD of the population by the SD of the sample. This technique is called the bootstrap . We'll get back to this when we discuss inference.)

In sum, the sampling distribution of the sample average

sample average = sum of draws / sample size

has the following properties:


1. The sample average is an unbiased estimator of the population average:

E (sample average) = population average

2. The standard error is:

SE(sample average) ≈ SD(population) / √(sample size)

This approximation is good if the population is much larger than the sample. (You don't have to memorize the formula for the SE.)

3. If the sample is large, the sampling distribution of the sample average approximately follows the normal curve (central limit theorem).
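A simulation along the same lines illustrates these properties (a sketch; the income population below is made up for illustration, using rexp to produce a skewed, income-like list):

population <- rexp(10000) * 27788   # hypothetical skewed population, average near 27 788
averages <- replicate(10000, mean(sample(population, 100, replace = FALSE)))
mean(averages)   # close to mean(population): the sample average is unbiased
sd(averages)     # close to (SD of population)/sqrt(100): the SE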

3.3 Questions for Review

1. “The sample percentage is a statistic.” Explain.

2. What does the term sampling distribution mean? Explain, using an example for the sampling distribution of a sample percentage.

3. "The sample percentage is a random variable." Explain.

4. You want to estimate the percentage of a population. Explain what the chance error is.

5. What does the term expectation of the sample percentage mean? Explain, using the concept of repeated samples.

6. What does the term standard error of the sample percentage mean? Explain, using the concept of repeated samples.

7. "The sample percentage is an unbiased estimator of the population percentage." Explain.

8. Suppose you would be able to take many large samples (each of the same size, say, 2500) from the same population. For each sample, you compute the sample percentage. What would the density histogram of the sample percentages look like (central location, spread, shape)? Explain.

9. Assume that the sample is sufficiently large. What does the probability density of a sample percentage look like (central location, spread, shape)? Explain.

Chapter 4

Confidence intervals

Carefully review all of chapters 21 (skip section 5) and 23 from Freedman et al. (2007), covered in STA101. Below is a summary of the main ideas. The summary is no substitute for reviewing chapters 21 and 23.

4.1 Estimating a percentage

In a small town there are 10 000 households. A market research agency wants to find out which percentage of households own a tablet. The population percentage is an (unknown) parameter. To estimate the parameter, the market research agency interviews a random sample of 100 households. The researcher counts the number of households in the sample who say they own a tablet, and computes the sample percentage (a statistic). How reliable is the estimate?

In order to answer this question, we need an appropriate chance model. We model the population as a box of 10 000 tickets. There is a ticket for every household in the population. This is a case of classifying and counting: we classify a household as owning a tablet or not owning a tablet, and want to count the number of households who own a tablet. In cases of classifying and counting, a 0-1 box is appropriate. For households who own a tablet the value on the ticket is 1, and for households who don't own a tablet the value on the ticket is 0. The number of tickets of each kind is unknown:

??? tickets 0 ??? tickets 1

10 000 tickets

The sample is like 100 random draws without replacement from this box. It will look something like:

{ 0, 0, 1, 0, 1, 0, 0, . . . }   (100 entries)

The number of households in the sample who own a tablet is the sum of draws.

The sum of draws is a chance variable: if the researcher had drawn a different sample of 100 households, the number of households in the sample who own a tablet would most likely have been different.


Suppose that the researcher interviewed 100 random households of whom 41 say that they own a tablet. The sample percentage is:

sample percentage = (number in the sample / size of sample) × 100% = (41/100) × 100% = 41%

The sample percentage is called a point estimator of the population percentage, and the result (41%) is a point estimate .

The decision maker who gave the market research agency the job of estimating the percentage would like to know how reliable the point estimate of 41% is. The sample percentage is a chance variable, subject to sampling variability: a different sample would most likely have generated a different estimate.

We know from the previous chapter that—if the sample is random—the sample percentage is an unbiased estimator of the population percentage:

E (sample percentage) = population percentage

Intuitively that means that if many researchers all would take a random sample of 100 households and each would compute a sample percentage, they would get different results, but the results would vary around the population percentage.

Some researchers might come up with a sample percentage that is exactly equal to the population percentage, but about half of the researchers would come up with a sample percentage that underestimates the population percentage, and about half would come up with a sample percentage that overestimates the population percentage. The difference between the sample percentage and the population percentage is called the chance error : chance error = sample percentage − population percentage

The researcher who came up with the estimate of 41% also made a chance error.

Of course, she won’t be able to find how big the chance error exactly is, because she doesn’t know the population percentage. But she does know that she makes a chance error. In the previous chapter we saw that the typical size of the chance error is called the standard error (SE). We also saw that for a sample percentage, the standard error is:

SE(sample percentage) ≈ ((SD of population) / √(sample size)) × 100%

The bad news is that the researcher doesn't know the SD of the population. The good news is that statisticians have shown that—provided that the sample is large—the SD of the sample is a reasonably good estimate of the SD of the population. So for large samples, we can approximate the SE for the sample percentage by:

SE(sample percentage) ≈ ((SD of sample) / √(sample size)) × 100%

(This is an example of the bootstrap technique.) To find the SD of the sample, the researcher can use the shortcut formula (p. 10):

SD of sample = √((fraction of ones) × (fraction of zeroes)) = √((41/100) × (59/100)) ≈ 0.49

SE(sample percentage) ≈

SD of the sample

√ sample size

× 100%

0 .

49

100

≈ 4 .

9%

× 100%

In sum: the sample estimate (41%) is off by 4.9% or so. It is very unlikely that the estimate is off by more than 14.7% (3 SEs).
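The same bootstrap computation in R (a sketch):

phat <- 41/100
sd.sample <- sqrt(phat * (1 - phat))   # SD of the sample: about 0.49
(sd.sample / sqrt(100)) * 100          # SE of the sample percentage: about 4.9 (percentage points)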

4.2 Confidence interval for a percentage

From the previous chapter we know that, for large samples, the sampling distribution of the sample percentage approximately follows the normal curve (thanks to the central limit theorem ). We also know that the sample percentage is an unbiased estimator of the population percentage:

E(sample percentage) = population percentage

Hence the probability distribution of the sample percentage looks approximately like a normal curve centered at the population percentage, with the sample percentage on the horizontal axis.

The distribution of the sample percentage implies that for 95% of all possible samples, the sample percentage will be in the interval from

population percentage − 2 · SE   to   population percentage + 2 · SE

(SE refers to the SE for the sample percentage). This implies that for 95% of all possible samples the chance error (= sample percentage − population percentage) will be in the interval from

−2 · SE   to   +2 · SE


Put in a different way, for 95% of all possible samples the chance error (without the minus sign) will be smaller than 2 · SE. Or: for 95% of all possible samples, the interval from sample percentage − 2 · SE to sample percentage + 2 · SE will cover the population percentage. This interval is called the 95%-confidence interval for the population percentage .

There is a shorter notation for the 95%-confidence interval: sample percentage ± 2 · SE(sample percentage)

The term 2 · SE is called the margin of error .

The researcher found an estimate of 41% and a standard error of about 4.9%. The sample was reasonably large (100), so it's safe to assume that the normal curve is a good approximation of the probability distribution of the sample percentage. Hence the 95%-confidence interval for the population percentage is the interval between:

41% − 2 × 4.9%   and   41% + 2 × 4.9%
41% − 9.8%       and   41% + 9.8%
31.2%            and   50.8%

In sum: the sample estimate (41%) is off by 4.9% or so. You can be about 95% confident that the interval from 31.2% to 50.8% covers the population percentage.

To compute a confidence interval for a population percentage with the TI-84, do STAT → TESTS → 1-PropZInt (one-proportion z interval). The z refers to the fact that we use the normal approximation (central limit theorem). The value x that you have to enter is the number of times the event occurs in the sample (41 in the example); n is the sample size (100 in the example); C-Level stands for confidence level: for a 95% confidence interval enter .95 (the default value). The procedure gives the sample proportion (p̂) and the boundaries of the confidence interval as decimal fractions; to get percentages multiply by 100%.

The confidence interval provided by the TI-84 (and by statistical software) differs somewhat from what you find using the formula above. Don’t worry: our formula gives a good approximation.
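In R you can compute both versions (a sketch; prop.test uses a slightly different method, which is one reason software output differs somewhat from our formula):

phat <- 41/100
se <- sqrt(phat * (1 - phat)) / sqrt(100)
(phat + c(-2, 2) * se) * 100   # approximately 31.2% to 50.8%
prop.test(41, 100)             # similar interval, computed differently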

4.3 Interpreting a confidence interval

Carefully read Freedman et al. (2007, section 3 pp. 383–386). It’s important that you understand the correct interpretation of a confidence interval.

Suppose you have a fish bowl with 100 000 marbles (the population): 80 000 red marbles and 20 000 blue ones. The proportion of red marbles in the population is:

population proportion = (80 000 / 100 000) × 100% = 80%

The population proportion is a parameter. Now you conduct the following experiment. You hire a researcher and tell her that you don't know the proportion of red marbles, and that you would like her to estimate the proportion of red marbles from a random sample of 2 500 marbles. The researcher takes up the job: she takes a simple random sample of 2 500 marbles, counts the number of red marbles in the sample, and computes the sample proportion as a point estimate of the population proportion:

sample percentage = (number of red marbles in the sample / 2 500) × 100%

Because the sample is large (and hence the central limit theorem applies), she can compute a 95%-confidence interval: sample percentage ± 2 · SE(sample percentage)

The sample percentage is a chance variable: the outcome depends on the sample she took. Had she taken another sample of size 2 500, the sample percentage would most likely have been different (the population percentage would still have been 80%, of course): the chance is in the sampling variability, not in the parameter. Suppose that she finds a confidence interval like that in case (1):

Three confidence intervals (x = the parameter):

(1)   |------x--------|                  (covers)
(2)                  x   |----------|    (does not cover)
(3)   |----------|   x                   (does not cover)

Confidence interval (1) covers the population percentage (the researcher will of course not know this because she doesn’t know the population percentage).

Now you hire another researcher, and you tell him the same thing: that you don't know the proportion of red marbles, and that you would like him to estimate the proportion of red marbles from a random sample of 2 500 marbles. Because he will draw a different sample, he will come up with a different point estimate, a different SE, and a different confidence interval. The confidence interval may cover the population percentage, but—due to sampling variability—it may not: the interval may be too far to the right (case (2)) or too far to the left (case (3)) (Freedman et al. (2007, p. 384) call confidence intervals that don't contain the parameter "lemons"). Again, the researcher doesn't know whether the confidence interval he computed covers the population percentage: it may (case (1)), or it may not (cases (2) or (3)).

You can find out what happens in repeated samples if you repeat the experiment many times (say: you hire 100 researchers) and plot the resulting 95%-confidence intervals. A computer simulation posted on the course web site does exactly that ( interpreting-a-confidence-interval-for-a-percentage.R ). The script generates a diagram like figure 4.1 (compare to figure 1 in Freedman et al. (2007, p. 385)). Each horizontal line represents a 95%-confidence interval computed by a researcher. The vertical line shows the population percentage. Run the script a number of times (and make sure you understand what the script does). Count the number of lemons in each run. Is there a pattern?

In sum: if 100 researchers would each take a simple random sample of 2 500 marbles, and each computes a 95%-confidence interval, we get 100 confidence intervals. The confidence intervals differ because of sampling variability. For about 95% of samples, the interval sample percentage ± 2 · SE(sample percentage) covers the population percentage, and for the other 5% it does not.


(Figure 4.1: Interpreting confidence intervals. The 95%-confidence interval is shown for 100 different samples. The interval changes from sample to sample. For about 95% of the samples, the interval covers the population percentage, marked by a vertical line.)

4.4 Confidence interval for an average

Review all of chapter 23 in Freedman et al. (2007). This section just gives a brief summary of the main ideas, based on example 3 from Freedman et al. (2007, pp. 417–419).

A marketing research agency wants to find out the average years of schooling in a small town. The population consists of all people age 25 and over in the town. The average of the population is an (unknown) parameter. The researcher interviews a random sample of 400 people of age 25 and over, and lists the answers in a table. Together, the 400 interviewed people had 5 225 years of schooling.

That implies a sample average of

5 225 years / 400 ≈ 13.1 years

From the table with responses, the researcher also computes the standard deviation of the sample, which turns out to be 2.74 years. What is the standard error? What is the 95%-confidence interval?

Model the population as a box with many tickets; the researcher may not even know how many tickets exactly. Each ticket represents a person age 25 or over who lives in the town. On each ticket the years of schooling of that person is written. For instance, for someone who completed high school but took no higher education, the ticket says: 12 years. The box will look like this:

12   6   10   . . .   many tickets

Of course the researcher doesn't know the exact contents of the box. The sample is like 400 draws without replacement from the box. The researcher lists the years of schooling of the people in the sample. The list will look something like:

{ 16, 12, 18, 8, . . . }   (400 entries)

To find the sample average, the researcher adds all numbers (suppose the sum is 5 225 years) and divides by how many entries there are in the sample (400):

sample average = sum of draws / sample size = 5 225 years / 400 ≈ 13.1 years

Note that the sample average is computed using the sum of draws. From the list of responses, the researcher can also compute the standard deviation. As noted above, suppose the standard deviation of the sample is 2.74 years.

The sample average is called a point estimator of the population average, and the result (13.1 years) is a point estimate .

The sample average is a chance variable: had the researcher drawn another sample of size 400, the sample average would most likely have been different.

When the researcher reports to the decision maker, the decision maker would like to know how precise the point estimate is. In the previous chapter we learned that the sample average is an unbiased estimator of the population average:

E (sample average) = population average

That doesn’t mean that the sample average found from this sample (13 .

1 years) is equal to the population average. Why not? Because the researcher made a chance error : chance error = sample average − population average

The researcher who came up with the estimate of 13.1 years also made a chance error. Of course, she won't be able to find how big the chance error exactly is, because she doesn't know the population average. But she does know that she makes a chance error. In the previous chapter we saw that the typical size of the chance error is called the standard error (SE). We also saw that the standard error for the sample average is:

SE(sample average) ≈ SD(population) / √(sample size)

The researcher doesn't know the SD of the population, but if the sample is large the SD of the sample is a reasonably good estimate of the SD of the population:

SE(sample average) ≈ SD(sample) / √(sample size)


(This is an example of the bootstrap technique.) The researcher can compute the SD from the sample (and found: SD(sample) = 2.74 years). The resulting estimate for the standard error for the sample average is:

SE(sample average) ≈ SD(sample) / √(sample size) = 2.74 years / √400 ≈ 0.137 years

From the previous chapter we know that—if the sample is reasonably large—the probability distribution of the sample average follows approximately the normal curve. We also know that the sample average is an unbiased estimator of the population average:

E(sample average) = population average

As a result, the probability distribution of the sample average looks approximately like a normal curve centered at the population average, with the sample average on the horizontal axis.

The probability distribution of the sample average implies that for 95% of all possible samples of size 400, the sample average will be between

population average − 2 · SE   and   population average + 2 · SE

(SE refers to the SE for the sample average.) This implies that for 95% of all possible samples of size 400, the interval from

sample average − 2 · SE   to   sample average + 2 · SE

will cover the population average. This interval is the 95%-confidence interval for the population average . A shorter notation is:

sample average ± 2 · SE(sample average)

The researcher in our example found a sample average of 13.1 years and a standard error for the sample average of about 0.137 years. The 95%-confidence interval for the population average is:

13.1 years ± 2 × 0.137 years

or:

13.1 years ± 0.274 years

The margin of error (with a confidence level of 95%) is 0.274 years. So the 95%-confidence interval for the population average is the interval from:

13.1 years − 0.274 years   to   13.1 years + 0.274 years

or from:

12.83 years   to   13.37 years

In sum: the sample estimate (13.1 years) is off by 0.137 years or so. You can be about 95% confident that the interval from 12.83 years to 13.37 years covers the population average.
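The same computation in R (a sketch):

xbar <- 13.1
s <- 2.74
n <- 400
se <- s / sqrt(n)       # about 0.137 years
xbar + c(-2, 2) * se    # approximately 12.83 to 13.37 years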

To compute a confidence interval for a population average with the TI-84, do STAT → TESTS → ZInterval. The z refers to the fact that we use the normal approximation (central limit theorem). The value σ (sigma) that you have to enter is the standard deviation of the population; as you don't know the standard deviation of the population, enter the sample standard deviation instead (you are using the bootstrap , but remember that the bootstrap only works when the sample is large). In the example, the standard deviation of the sample was 2.74 years. The value x̄ (x-bar) that you have to enter is the sample average (13.1 years in the example) and n is the sample size (400 in the example). C-Level stands for confidence level; for a 95% confidence interval enter .95.

4.5 Don't confuse SE and SD

The researcher computed two numbers: she found that the SD of the sample was 2.74 years, and that the SE for the sample average was 0.137 years. These two numbers tell two different stories (Freedman et al., 2007, p. 417):

– the SD says how far schooling is from the average—for typical people.
– the SE says how far the sample averages are from the population average—for typical samples.

People who confuse SE and SD often think that 95% of the people have schooling in the range 13.1 years ± 0.274 years (13.1 years ± 2 · SE). That is wrong. The interval 13.1 years ± 0.274 years covers only a very small part of the years of schooling: the SD is about 2.74 years. The confidence interval measures something else: if many researchers each take a sample of 400 people, and each computes a 95%-confidence interval, then about 95% of the confidence intervals will cover the population average; the other 5% of the confidence intervals won't. The term "confidence" reminds you of the fact that the chance is in the sample variability; the population average doesn't change.

4.6 Questions for Review

1. Why do we need to use the bootstrap when estimating the standard error for a percentage?


2. What is the margin of error (at a 95% confidence level) for the population percentage? For a population mean?

3. Suppose the decision maker requires a level of confidence higher than 95% (say, 99%). Would the margin of error be bigger, smaller, or the same as with a level of confidence of 95%? Explain.

4. Suppose the decision maker is happy with a confidence level of 95% but wants a smaller margin of error. What should you do? Explain.

5. What is the difference between the standard deviation of the sample and the standard error of the sample average? Explain.

6. A researcher computes a 95%-confidence interval for the mean. Right or wrong: 95 percent of the values in the sample fall within this interval. Explain.

7. A researcher computes a 95%-confidence interval for the mean. Explain what the meaning of the interval is, using the concept of repeated samples. Add a sketch.

4.7 Exercises

Work the following exercises from Freedman et al. (2007), chapter 21: A–1; A–2 and B–2(a); A–3; A–4, A–5; A–9; B–1; B–4; C–5; C–6; D–1; D–2.

Work the following exercises from Freedman et al. (2007), chapter 23: A–1; A–2; A–3; A–4; A–5; A–6; A–7; A–8; A–9; A–10; B–1; B–2; B–3; B–4; B–5; B–6; B–7; C–1; C–2; C–3; C–4; D–1; D–2; D–3; D–4.

Chapter 5

Hypothesis tests

Read Freedman et al. (2007, Ch. 26, 29). Leave section 6 of chapter 26 (pp. 488–495) for later.

Questions for Review

1. Freedman et al. (2007, Ch. 26) repeatedly use the term observed difference . Explain (the difference between what and what?).

2. What is a test statistic? Explain using an example.

3. What is the observed significance level (or P -value)?

4. When the P -value is less than 1% (so the results is “highly statistically significant”), what does this imply for the null hypothesis?

Exercises

Work the following exercises from Freedman et al. (2007), chapter 26: Set A: 4, 5. Set B: 1, 2, 5. Set C: 1, 2, 4, 5. Set D: all. Set E: 1–5, 7, 10. Set F: 1–4. Review Exercises: 1–5, 7, 8.

Work the following exercises from Freedman et al. (2007), chapter 29: Set A: 1, 2. Set C: 1, 4, 5, 7. Set D: 2, 5.


Chapter 6

Hypothesis tests for small samples

Read Freedman et al. (2007, Ch. 26, section 6).

The spectrophotometer example used by Freedman et al. (2007, Ch. 26, pp. 488–489) is rather complicated. Consider instead the following story (that uses the same numbers): A large chain of gas stations sells refrigerated cans of Coke. On average the chain sells about 70 per day per station. The manager notices that a competing chain has increased the price of a refrigerated can of Coke, and wonders whether as a result she now is on average selling more than before. She records the number of cans sold by five randomly selected gas stations from the chain:

78 83 68 72 88

Four out of five of these numbers are higher than 70, and some of them by quite a bit. Can this be explained on the basis of chance variation? Or did the mean number of cans sold per gas station increase?

(You can now construct the box model, formulate the null and alternative hypothesis, and compute the test statistic. Continue reading on p. 490 and replace ppm by cans.)
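If you want to check your hand computation afterwards, R's t.test carries out the same small-sample test (a sketch; the alternative is one-sided because the manager suspects an increase):

cans <- c(78, 83, 68, 72, 88)
t.test(cans, mu = 70, alternative = "greater")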

Questions for Review

1. When the sample is small, there is extra uncertainty. How does the test procedure take this extra uncertainty into account (two ways)?

2. What are the properties of Student’s t -curve? Compare with the normal curve.

3. When should Student’s curve be used?


Exercises

1. A long series of the number of refrigerated cans of Coca Cola sold by a large chain of gas stations averages to 253 cans per station per week. Following an advertising campaign by Pepsi Cola, the manager of the chain collects data from ten randomly selected gas stations. She finds that the number of cans of Coca Cola in the sample averages 245 cans, and that the standard deviation is 9 cans. Did the mean fall, or is this chance variation? (This is a variant of Set F: exercise 6.)

2. (I will add more exercises later.)

Chapter 7

Hypothesis tests on two averages

Read Freedman et al. (2007, Ch. 27).

Questions for Review

1. What is the standard error for a difference? Explain using a box model. Give an example of a case where the formula for the SE for a difference used in the textbook does not apply.

2. What are the assumptions of the two-sample z -test for comparing two averages? Can you think of examples when you want to compare two averages but a z -test is not appropriate?

3. When should the χ²-test be used, as opposed to the z-test?

4. What are the six ingredients of a χ²-test?

Exercises

Work the following exercises from Freedman et al. (2007), chapter 28: Set A: 1 (use the TI-84), 2 (use the TI-84), 3, 4, 7, 8. Set C: 2.


Chapter 8

Correlation

Read Freedman et al. (2007, Ch. 8, 9). Skip section 3 ("The SD line," pp. 130–132) from Freedman et al. (2007, Ch. 8).

Questions for Review

1. If you want to summarize a scatter diagram, which five numbers would you report?

2. What are the properties of the coefficient of correlation?

3. How do you compute a coefficient of correlation?

4. What are the units of a coefficient of correlation?

5. Which data manipulations will not affect the coefficient of correlation?

6. In which cases can a coefficient of correlation be misleading? Make sketches to illustrate your point.

7. What are ecological correlations? Why can they be misleading?

8. What is the connection between correlation and causation?

Exercises

Work the following exercises from Freedman et al. (2007), chapter 8: Set A: 1, 6. Set B: 1, 2, 9. Set D: 1.

Work the following exercises from Freedman et al. (2007), chapter 9: Set C: 1, 2. Set D: 1, 2. Set E: 2, 3, 4.


Chapter 9

Line of best fit

Read Freedman et al. (2007), chapters 10 (introduction pp. 158–161 only), 11, 12.

Note

The formula sheet has the following formula for the y-intercept of the line of best fit:

y-intercept = (ave of y) − slope × (ave of x)

This formula is obtained as follows. The equation of the line of best fit is:

predicted value of y = slope × x + y-intercept

As the line of best fit passes through the point of averages (ave of x, ave of y), we know that:

ave of y = slope × (ave of x) + y-intercept

Solving this expression for the y-intercept yields:

y-intercept = (ave of y) − slope × (ave of x)

Q.E.D.
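In R you can compute the slope and the y-intercept from the five summary numbers (a sketch; it uses the standard fact that the slope of the line of best fit is r × (SD of y)/(SD of x), and the numbers below are those of the students data in the additional exercise of this chapter):

r <- 0.75
ave.x <- 174.03; sd.x <- 9.63    # height (cm)
ave.y <- 65.13; sd.y <- 13.36    # weight (kg)
slope <- r * sd.y / sd.x             # about 1.04 kg per cm
intercept <- ave.y - slope * ave.x   # about -115.9 kg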

Questions for Review

1. What does the line of best fit measure?

2. On the average, what happens to y if there is an increase of one SD in x ?

3. Suppose you have a scatter plot with a line of best fit. What is the error

(or residual) of a given point of the scatter plot? Illustrate.

4. What does the standard error (or r.m.s. error) of regression measure?

5. How is the standard error (or r.m.s. error) of regression computed?

6. What properties do the residuals have?


7. What is the difference between a homoscedastic and a heteroscedastic scatter diagram? Illustrate.

8. If you run an observational study, can the line of best fit be used to predict the results of interventions? Why (not)?

9. In what sense is the line of best fit the line that gives the best fit?

Exercises

Work the following exercises from Freedman et al. (2007), chapter 10: Set A: 1, 2, 4.

Work the following exercises from Freedman et al. (2007), chapter 11: Set A: 3, 4, 7. Set D: 2, 3.

Work the following exercises from Freedman et al. (2007), chapter 12: Set A: 1, 2.

Additional exercise

Table 9.1 shows the heights and weights of 30 students (the file is available as students.csv on the course web site). The average height is 174.03 cm and the SD is 9.63 cm. The average weight is 65.13 kg and the SD is 13.36 kg. The coefficient of correlation between height and weight is 0.75.

(a) Make a scatter plot (12 cm tall and 10 cm wide) on squared or graphing paper. Truncate the horizontal axis at 150 cm and let it run to 200 cm (2 cm on the page is 10 cm of a student’s height). Truncate the vertical axis at 40 kg and let it run to 100 kg (2 cm on the page is 10 kg of a student’s weight). In class, plot 10 points or so (you can plot the other points later).

(b) Show the point of averages, the run, the rise, and the second point of the line of best fit. With the two points, draw the line of best fit.

(c) Compute the slope and the y -intercept of the line of best fit. Report the line of best fit (use the actual variable names and pay attention to the units of measurement).

(d) Find the predicted weight for the student with a height of 190 cm. Illustrate in the scatter plot.

(e) Find the residual for the student with a height of 190 cm. Illustrate in the scatter plot.

(f) The r.m.s. error is 8.65 kg (you can verify this using the R script referred to below). Draw the line of best fit plus two r.m.s. errors and minus two r.m.s. errors ( cf. Freedman et al. (2007, p. 183)). Add all 30 points to the scatter plot and count which percentage of points lies within two r.m.s. errors of the line of best fit. Does the rule of thumb give a good approximation?


Use R Commander to find the descriptive statistics (average and standard deviation), the coefficient of correlation, the equation of the line of best fit, and to make a scatter plot with the line of best fit. To get the averages and SDs, select in the menu: Statistics → Summaries → Numerical summaries. To get the coefficient of correlation, select: Statistics → Summaries → Correlation matrix. To get the scatter plot with the line of best fit, select: Graphs → Scatterplot. . . In the Data tab, select the x-variable and the y-variable. In the Options tab, only select the Plot option Least-squares line (unselect other items that may be selected by default). In the Identify Points options, select Do not identify. To get the coefficients of the line of best fit, select: Statistics → Fit models → Linear regression. . . Select the correct response variable (the dependent variable; the variable on the y-axis of the scatter plot) and explanatory variable (the independent variable, the variable on the x-axis of the scatter plot).

Compare your outcomes with the outcomes obtained using R. (Alternatively, the R script R-script-Scatter-plot-of-heights-and-weights.R on the course web site computes the line of best fit and makes the scatter plot.)


Table 9.1: Heights and weights of 30 students

Height (cm)   Weight (kg)
172           63
170           70
170           52
171           52
186           90
183           79
170           66
169           56
175           75
175           65
195           94
176           51
188           76
192           82
172           70
169           53
172           52
178           85
177           59
178           72
160           54
175           54
190           70
178           85
163           55
161           59
162           44
170           54
154           52
170           65

Chapter 10

Multiple Regression

10.1 An example

Suppose we are interested in child mortality.[1] Child mortality differs tremendously across countries: in Sierra Leone, out of every 1000 children that were born alive, 193 die before their fifth birthday; in Iceland, only 3 do (data refer to 2010). We suspect that child mortality is related to income per person and the education of young mothers. To examine whether such a relationship exists, we collected the following data for 214 countries from World Bank (2013):

– child mortality: mortality rate, under-5 (per 1 000 live births), 2010 (indicator code: SH.DYN.MORT);

– income per person: gross national income (GNI) per capita, PPP (current international $), 2010 (indicator code: NY.GNP.PCAP.PP.CD);

– for the education of young mothers I use as a proxy: literacy rate, youth female (% of females ages 15-24), 2010 or most recent (indicator code: SE.ADT.1524.LT.FE.ZS).

The data set is posted on the course web site as STA201-multiple-regression-example.csv. In R Commander, import the data (Data → Import → From text file, clipboard, or URL...). Inspect the data by clicking the View Data button.

To obtain the descriptive statistics, do: Statistics → Summaries → Numerical summaries. . . . The computer output looks like this:

                                   mean          sd   n  NA
GNI.per.capita.PPP          13769.20000 15065.43148 175  39
Literacy.rate.youth.female     88.70234    16.51968 150  64
Mortality.rate.Under.5         39.66198    42.09289 192  22

Always report the descriptive statistics (mean, standard deviation) and units of measurement of all variables. Never include raw computer output (like the one above) in a paper. Summarize the information (including units of measurement and the data sources) in a reader-friendly table (table 10.1). Round numbers to the number of decimals that is relevant to your reader. Carefully document the sources, either in the text or in a note to the table.

[1] I borrowed the example from Gujarati (2003, pp. 213-215 and table 6.4 p. 185) and updated the data.


Table 10.1: Descriptive statistics

                                           mean    SD      n    NA
Income per capita (international $, PPP)   13 769  15 065  175  39
Youth female literacy rate (%)             89      17      150  64
Child mortality rate (per 1000)            40      42      192  22

Note. n is the number of countries for which the data exist. NA is the number of countries for which data are not available.

If the relationship between child mortality (as the response variable) and income per person and literacy of young mothers is linear, the regression equation looks like this:

predicted child mortality = m1 × income + m2 × literacy + b

(or, more generally, for any variables y, x1, and x2:

predicted value of y = m1 x1 + m2 x2 + b)

Regression with more than one explanatory variable is called multiple regression; m1 and m2 are slope coefficients, and b is the y-intercept. We can't (easily) plot this equation because it is no longer the equation of a straight line (it is a plane in three-dimensional space). It is however possible to find the values for the coefficients m1, m2, and b that minimize the r.m.s. error of regression.

The mathematics behind the formula to estimate the coefficients is beyond the scope of this course, but any statistical computer program can compute the coefficients. In R Commander do: Statistics → Fit models... → Linear regression.

This is the R output for the regression of the example:

Call: lm(formula = Mortality.rate.Under.5 ~ GNI.per.capita.PPP
    + Literacy.rate.youth.female, data = Dataset)

Residuals:
    Min      1Q  Median      3Q     Max
-50.834 -14.020  -7.594  11.595  90.680

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)
(Intercept)                 2.131e+02  1.224e+01  17.417  < 2e-16 ***
GNI.per.capita.PPP         -7.709e-04  2.161e-04  -3.567 0.000504 ***
Literacy.rate.youth.female -1.813e+00  1.460e-01 -12.419  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 25.89 on 132 degrees of freedom
  (79 observations deleted due to missingness)
Multiple R-squared: 0.6592, Adjusted R-squared: 0.6541
F-statistic: 127.7 on 2 and 132 DF, p-value: < 2.2e-16


(The regression output of other statistical software is similar.)

The first line shows the regression equation. lm stands for linear model. The variable before the tilde (~) is the dependent variable (the variable you want to predict; y in the mathematical notation). Following the tilde are the independent variables (x1 and x2 in the mathematical notation). Think of the tilde as meaning "is predicted by": y ~ x1 + x2 means "y is predicted by x1 and x2."

The column Estimate in the Coefficients table gives the estimated coefficients of the regression equation. We discuss the other columns of the Coefficients table in the chapter on inference for regression.

Residual standard error is the r.m.s. error of regression (calculated slightly differently than in Freedman et al. (2007, pp. 185-187), so the result may differ somewhat; if the number of cases is small, the difference may be quite large). The r.m.s. error of regression measures by how much the predicted value for y typically deviates from the actual value. Report the equation using the actual variable names, not y, x1, and x2:

predicted child mortality = 213.1 − 0.00077 × income − 1.813 × literacy

The r.m.s. error of regression is 25.89: this is by how much the predicted value for child mortality typically deviates from the actual value. The R output reports that 79 observations were deleted due to missingness, so the regression uses only 135 (= 214 − 79) countries.
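The same regression can be run from a plain R script instead of the menus. A minimal sketch, assuming the imported data frame is called Dataset, as in the output above:

fit <- lm(Mortality.rate.Under.5 ~ GNI.per.capita.PPP
          + Literacy.rate.youth.female, data = Dataset)
coef(fit)            # the estimated coefficients b, m1, and m2
summary(fit)         # the full output shown above
summary(fit)$sigma   # the r.m.s. error of regression, about 25.89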

10.2 Interpretation of the slope coefficients

In a controlled experiment, the researcher controls for variables other than the treatment variable. In observational studies, however, the variables y, x1, and x2 usually all vary at the same time. One of the strengths of multiple regression is that it allows us to isolate the association between the dependent variable (y) and one of the independent variables, keeping the other independent variables in the equation constant.

The slope coefficient m1 measures by how much the predicted value of y changes if x1 increases by one unit, keeping all other independent variables in the equation constant. (In this case, there is only one other independent variable: x2.) Similarly, the slope coefficient m2 measures by how much the predicted value of y changes if x2 increases by one unit, keeping all other independent variables in the equation constant. (If you took calculus: m1 is the partial derivative of y with respect to x1, and m2 is the partial derivative of y with respect to x2.)

For the numerical example, the slope coefficient of income per capita shows that a one unit (in this case: a one dollar) increase in income per capita is associated with a decrease of 0.00077 units (children per 1000 live births) in the predicted child mortality rate. Note that the slope coefficients have as units of measurement: units of the response variable per unit of the independent variable. This is clear from the formula for the slope coefficient in simple regression:

slope = r × (SD of y) / (SD of x)

and it is still true for the slope coefficients in multiple regression. In practice, the units of measurement of the coefficients in multiple regression are often omitted when the equation is reported.


On the scale of income per capita (which typically ranges from $1 950 at the first quartile to $19 150 at the third quartile), a $1 increase is not very meaningful; a $1000 increase is more relevant. So let us reinterpret the slope coefficient of income per capita: the regression predicts that a $1000 increase in income per capita is associated with a decrease of 0.77 in the predicted number of children per 1000 who die before their fifth birthday. The slope coefficient of the literacy rate of young females shows that a 1 percentage point increase in the literacy rate of young females is associated with a drop in the predicted child mortality rate of 1.813 units, that is, a decrease of 1.813 in the number of children per 1000 who die before their fifth birthday.
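A quick way to get these re-scaled interpretations from the fitted model (fit as in the sketch above) is to multiply the relevant coefficient by the size of the increase you care about:

coef(fit)["GNI.per.capita.PPP"] * 1000        # about -0.77 per $1000 of income
coef(fit)["Literacy.rate.youth.female"] * 1   # about -1.813 per percentage point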

Be careful when drawing policy conclusions from an observational study. It is tempting to infer from the regression that a policy that increases the female youth literacy rate by 10 percentage points would cause child mortality to decrease by about 18. However, "[T]he slope cannot be relied on to predict how y would respond if you intervene to change the value of x. If you run an observational study, the regression line only describes the data you see. The line cannot be relied on for predicting the results of interventions." (Freedman et al., 2007, p. 206).

10.3 Coefficient of determination

The following sketch shows, in a scatter plot for a simple regression, the actual value of y, the predicted value of y, and the average of y (many textbooks use the following notation: y for the actual value, ŷ ("y-hat") for the predicted value, and ȳ ("y-bar") for the average):

[Figure here: sketch of a scatter plot marking the actual value of y, the predicted value of y, and the average of y.]

Take a closer look at the following vertical differences:

actual value for y − ave y
predicted value for y − ave y
actual value for y − predicted value for y = residual

Note that:

actual value for y − ave y = (predicted value for y − ave y) + (actual value for y − predicted value for y)


Define the total sum of squares (TSS) as the sum of squared deviations between each actual value for y and the average of y:

TSS = sum of (actual values for y − ave y)²

The total sum of squares is a measure of the variation of y around its average.

The explained sum of squares (ESS) is the sum of squared deviations between each predicted value for y and the average of y:

ESS = sum of (predicted values for y − ave y)²

The explained sum of squares is a measure of the variation of y around its average that is explained by the regression equation.

The residual sum of squares (RSS) is the sum of squared residuals:

RSS = sum of (actual values for y − predicted values for y)²

When we computed the r.m.s. error of regression we already encountered the residual sum of squares as the numerator of the fraction under the square root in Freedman et al. (2007, p. 182):

RSS = (error #1)² + (error #2)² + ... + (error #n)²

It can be shown (proof omitted) that the total sum of squares is equal to the explained sum of squares plus the residual sum of squares:

TSS = ESS + RSS

Divide both sides by TSS:

1 = ESS/TSS + RSS/TSS

The term RSS/TSS measures which proportion of the variation in y around its average is left unexplained by the regression. The term ESS/TSS shows which proportion of the variation in y around its average is explained by the regression. We call this proportion the coefficient of determination (notation: R²): the coefficient of determination (R²) measures the proportion of the variation in the dependent variable that is explained by the regression equation:

R² = ESS/TSS = sum of (predicted values for y − ave y)² / sum of (actual values of y − ave y)²

The coefficient of determination is a measure of the goodness-of-fit of the estimated regression equation. You don't have to memorize the formula, but you do have to know the meaning of R².
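A sketch that verifies R² = ESS/TSS for the fitted model (fit as above); because lm() drops the cases with missing values, the sums of squares are computed over the cases actually used in the regression:

y.predicted <- fitted(fit)                    # predicted values for y
y.actual    <- y.predicted + resid(fit)       # actual values for y (cases used)
TSS <- sum((y.actual - mean(y.actual))^2)     # total sum of squares
ESS <- sum((y.predicted - mean(y.actual))^2)  # explained sum of squares
RSS <- sum(resid(fit)^2)                      # residual sum of squares
ESS / TSS                                     # coefficient of determination
summary(fit)$r.squared                        # matches Multiple R-squared (0.6592)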

In the R computer output, the coefficient of determination is reported as Multiple R-squared. In the example, R² is equal to 0.6592; this means that the estimated regression equation (for the 135 countries in the data set) explains about two-thirds (66%) of the variation of child mortality around its mean. That is quite a lot: the estimated regression equation fits the data quite well.

It can be shown that for simple regression (with only one independent variable), R² is equal to the coefficient of correlation (r) squared: R² = r². For multiple regression, that is not the case (there are several coefficients of correlation, one for each independent variable: between y and x1, and between y and x2). That is why for multiple regression the coefficient of determination is written as a capital R², not a lowercase r².


10.4 Questions for Review

1. How does multiple regression differ from simple regression?

2. What is the interpretation of the coefficient of one of the variables at the right-hand side of a multiple regression equation?

3. How is a residual in a multiple regression model computed?

4. What is the total sum of squares? What is the explained sum of squares? What is the residual sum of squares?

5. How is the coefficient of determination of a regression model computed?

6. What is the meaning of the coefficient of determination? If the coefficient of determination of a regression model is equal to 0.67, what does this mean?

10.5 Exercises

1. For 14 systems analysts, their annual salaries (in $), years of experience, and years of postsecondary education were recorded (Kazmier, 1995, table 15.2 p. 275). Below is the computer output for the descriptive statistics and a multiple regression of the annual salaries on the years of experience and the years of postsecondary education.

(a) Download the data (Kazmier1995-table-15-2.csv) from the course web site. Compute the descriptive statistics and run the regression in R with R Commander. You should get the same output as shown below.

                                         mean         sd  n
annual.salary                    57000.000000 3513.49048 14
years.of.experience                  5.928571    2.94734 14
years.of.postsecondary.education     4.071429    1.12416 14

lm(formula = annual.salary ~ years.of.experience
    + years.of.postsecondary.education, data = Dataset)

Residuals:
    Min      1Q  Median      3Q     Max
-3998.6 -1379.1  -158.7  1067.9  4343.6

Coefficients:
                                 Estimate Std. Error t value Pr(>|t|)
(Intercept)                       45470.9     2731.2  16.649 3.78e-09 ***
years.of.experience                 842.3      207.7   4.056   0.0019 **
years.of.postsecondary.education   1605.3      544.4   2.948   0.0132 *

---

Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1


Residual standard error: 2189 on 11 degrees of freedom

Multiple R-squared: 0.6715, Adjusted R-squared: 0.6118

F-statistic: 11.24 on 2 and 11 DF, p-value: 0.002192

(b) Report the equation like you would in a paper.

(c) Explain what the coefficients, the "residual standard error", and the value of R² mean. Pay attention to the units of measurement.

(d) Predict the annual salary of a systems analyst with four years of education and three years of experience.

(e) Would it be meaningful to use the regression equation to predict the annual salary of a systems analyst with four years of education and twenty years of experience? Why (not)?
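For part (d), a sketch using predict(), assuming the model was fitted as fit with the variable names from the output above:

fit <- lm(annual.salary ~ years.of.experience
          + years.of.postsecondary.education, data = Dataset)
new.analyst <- data.frame(years.of.experience = 3,
                          years.of.postsecondary.education = 4)
predict(fit, newdata = new.analyst)   # predicted annual salary ($)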


Chapter 11

Hypothesis tests for regression coefficients

Until now we have used regression as a tool of descriptive statistics: as a method to describe relationships between variables. Under certain conditions, regression can also be a tool of inferential statistics: we can test hypotheses about regression coefficients. This chapter explains when and how.

11.1 Population regression function

Consider the following (hypothetical) example (drawn from Gujarati (2003, Chapter 2)). Suppose that during one week a population of 60 families had the weekly income and the weekly consumption expenditure shown in table 11.1. The data set is posted on the course web site as Gujarati-2003-table-2-1.csv; the R script as two-variable-regression-analysis.R. Figure 11.1 shows the scatter plot. There are ten income groups (families with incomes of $80, $100, $120, ..., and $260).


Table 11.1: Income and consumption of a population of 60 families ($)

case  weekly family  weekly family       case  weekly family  weekly family
      income         consumption               income         consumption
                     expenditure                              expenditure
 1     80             55                  31    180            115
 2     80             60                  32    180            120
 3     80             65                  33    180            130
 4     80             70                  34    180            135
 5     80             75                  35    180            140
 6    100             65                  36    200            120
 7    100             70                  37    200            136
 8    100             74                  38    200            140
 9    100             80                  39    200            144
10    100             85                  40    200            145
11    100             88                  41    220            135
12    120             79                  42    220            137
13    120             84                  43    220            140
14    120             90                  44    220            152
15    120             94                  45    220            157
16    120             98                  46    220            160
17    140             80                  47    220            162
18    140             93                  48    240            137
19    140             95                  49    240            145
20    140            103                  50    240            155
21    140            108                  51    240            165
22    140            113                  52    240            175
23    140            115                  53    240            189
24    160            102                  54    260            150
25    160            107                  55    260            152
26    160            110                  56    260            175
27    160            116                  57    260            178
28    160            118                  58    260            180
29    160            125                  59    260            185
30    180            110                  60    260            191

Let us first focus on families with a weekly income of $80. The population has five such households. Each family with a weekly income of $80 is represented by a ticket; on the ticket, the amount of the family's consumption expenditures is written. This is the box for the sub-population of families with a weekly income of $80:

$55  $60  $65  $70  $75

Consider the following chance experiment: draw one ticket from the box. The following table shows all possible values for the chance variable and the corresponding probabilities:

y | x = $80    $55  $60  $65  $70  $75
probability    1/5  1/5  1/5  1/5  1/5


This table is the probability distribution of the consumption expenditures (y) for families with a weekly income of $80.

What are the expected consumption expenditures of households with a weekly income of $80? The expectation of a chance variable is the weighted average of all possible values; the weights are the probabilities. This gives:

E(y | x = $80) = $55 × 1/5 + $60 × 1/5 + $65 × 1/5 + $70 × 1/5 + $75 × 1/5 = $65

Similarly, it can be shown that:

E(y | x = $100) = $77
E(y | x = $120) = $89
E(y | x = $140) = $101
E(y | x = $160) = $113
E(y | x = $180) = $125
E(y | x = $200) = $137
E(y | x = $220) = $149
E(y | x = $240) = $161
E(y | x = $260) = $173

(Exercise 1 asks you to verify this for E(y | x = $180).)

The expected values are shown as black dots in figure 11.1. Verify with the TI-84 that the points (x, E(y|x)) are on the straight line with equation:

E(y | x) = 0.6x + 17

This equation is called the population regression function. It is shown as a solid line in the scatter plot (figure 11.1). The relationship between E(y|x) and x need not be a linear one, that is, a function that yields a straight line when you plot it, but we limit our attention to those cases where the population regression function is a linear function:

E(y | x) = mx + b

(or, in the case of multiple regression, a linear equation of the form E(y | x) = m1 x1 + m2 x2 + b).
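The conditional expectations above can also be checked in R. A minimal sketch, assuming the data were imported as a data frame named Gujarati with columns income and consumption (the actual column names in Gujarati-2003-table-2-1.csv may differ):

E.y.given.x <- tapply(Gujarati$consumption, Gujarati$income, mean)
E.y.given.x   # 65 77 89 101 113 125 137 149 161 173
# all the points (x, E(y|x)) lie on the line E(y|x) = 0.6x + 17:
x <- as.numeric(names(E.y.given.x))
all.equal(as.numeric(E.y.given.x), 0.6 * x + 17)   # TRUE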

11.2 The error term

Within each income class (a vertical strip in the scatter plot), we define the error as the difference between the actual value of y and the expected value of y:

error = actual − expected

For the five households with a weekly income of $80 the errors are:

error #1 = $55 − $65 = −$10
error #2 = $60 − $65 = −$5
error #3 = $65 − $65 = $0
error #4 = $70 − $65 = $5
error #5 = $75 − $65 = $10


[Figure 11.1 here: a scatter plot with weekly family income ($) on the horizontal axis.]

Figure 11.1: Weekly income and consumption expenditures of a population of 60 families. (The solid line is the population regression function; the black dots indicate the expected values E(y|x).)

Because error = y − E(y|x), we can write the values of y as:

y = E(y|x) + error    or    y = mx + b + error

The error captures:

– things (other than x) that are associated with y without our knowing or being able to measure them;

– measurement errors in y;

– the intrinsic random nature of behavior.


One assumption of the regression model is that the error terms within a vertical strip of the scatter plot have a probability distribution that is independent from the value of x : if we plot the error terms against x , the resulting error plot should show no pattern.
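A minimal R sketch of such an error plot for the population of table 11.1 (same assumed data frame Gujarati as before; the population regression function is E(y|x) = 0.6x + 17):

error <- Gujarati$consumption - (0.6 * Gujarati$income + 17)
plot(Gujarati$income, error,
     xlab = "Weekly family income ($)", ylab = "Error ($)")
abline(h = 0)   # the errors should scatter around 0 with no pattern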

11.3 Sample regression function

Now suppose that a researcher doesn't know the population that consists of the sixty cases in table 11.1. To estimate the (unknown) population regression function she draws a random sample (sample A) of 10 families from the population and records income and consumption expenditures for the families in the sample:

case  weekly family income (x)  weekly family consumption expenditures (y)
 3    $80                       $65
17    140                        80
19    140                        95
20    140                       103
30    180                       110
37    200                       136
42    220                       137
44    220                       152
50    240                       155
56    260                       175

Verify using the LinReg function of the TI-84 that the regression line for this sample is:

predicted value of y = 0.6186x + 8.2116

This equation is called the sample regression function (SRF) for sample A. The sample regression function for sample A is plotted as a dashed line in figure 11.2.

There are many sample regression functions: a different sample of 10 families would have given a different sample regression function. For instance, if another researcher had drawn households 9, 16, 21, 23, 25, 39, 40, 43, 47, and 57 (sample B), the sample regression function would have been:

predicted value of y = 0.5817x + 25.3188

(verify this using the LinReg function of the TI-84)

The estimated slope varies from one sample to another:

sample   slope of SRF
A        0.6186
B        0.5817
...      ...

In repeated samples, the slope of the sample regression function is a chance variable (and so is the intercept). The slope of the sample regression function has a probability distribution (the sampling distribution of the slope of the sample regression function). The expectation of the slope of the sample regression function is the typical value around which the slope of the sample regression function varies in repeated samples (take a look at the sample estimates for the slope: do you have a hunch what the expectation is?).

[Figure 11.2 here: a scatter plot with weekly family income ($) on the horizontal axis.]

Figure 11.2: Weekly income and consumption expenditures of a population of 60 families. Note. The solid line is the population regression function, the dashed line is the sample regression function for sample A.

It can be shown that the expectation of the slope of the sample regression function (SRF) is the slope of the population regression function (PRF):

E (slope of the SRF) = slope of the PRF

That is, the slope of the sample regression function is an unbiased estimator of the slope of the population regression function. This is only the case if the independent variable ( x ) is not a chance variable (proof omitted).
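You can make the idea of repeated samples concrete with a small simulation. A sketch, again assuming the population of table 11.1 is in a data frame named Gujarati; the number of replications and the seed are arbitrary choices:

set.seed(1)   # arbitrary seed, for reproducibility
slopes <- replicate(1000, {
  s <- Gujarati[sample(1:60, 10), ]                    # a random sample of 10 families
  coef(lm(consumption ~ income, data = s))["income"]   # slope of the SRF
})
mean(slopes)   # close to the population slope 0.6
hist(slopes)   # the sampling distribution of the slope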

The chance error for the slope is defined as: chance error = slope of SRF − E (slope of SRF)

Now we can compute the chance errors that were made by each of the two researchers (of course, the researchers themselves can't compute the chance error they made because they don't know the population regression function). For sample A, the chance error of the slope is:

0.6186 − 0.6000 = 0.0186

For sample B, the chance error of the slope is:

0.5817 − 0.6000 = −0.0183

Here are the chance errors for samples A and B, and some other random samples:

sample   slope of SRF   chance error   (without − sign)
A        0.6186          0.0186        (0.0186)
B        0.5817         −0.0183        (0.0183)
C        0.6287          0.0287        (0.0287)
D        0.6246          0.0246        (0.0246)
E        0.5037         −0.0963        (0.0963)
...      ...             ...           ...

typical value:  expectation                   standard error

The standard error (SE) of the slope of the sample regression function is the typical size of the chance error (after you omit the minus signs, as shown in the last column). The formula for the SE of the slope of the sample regression function uses information about the population. Unlike in the numerical example above, in practice we don't know the population. So how can we find the SE of the slope of the sample regression function? The answer is that, just like when we estimated the SE for a sample average, we will use the bootstrap and the sample data to find an estimate for the SE of the slope of the sample regression function. The formula is complicated and I won't report it here, but statistical software will compute an estimate of the SE based on the sample data. As in the case of the SE for an average, the SE for the slope of the sample regression function gets smaller as the sample size gets bigger: a bigger random sample tends to give a more precise estimate for the slope coefficient.

The same arguments apply to the intercept and, in multiple regression, the slope coefficients of the other independent variables.
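In R, the estimated SEs can be read from the coefficient table that summary() produces for a fitted model. A sketch (fit stands for any model estimated with lm(), such as the child-mortality regression):

summary(fit)$coefficients                    # Estimate, Std. Error, t value, Pr(>|t|)
summary(fit)$coefficients[, "Std. Error"]    # just the estimated SEs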

Samples and populations

Suppose you have a data set covering all 50 states of the U.S. Some would argue that such a data set covers a population (all the states of the U.S.), not a sample. Clearly the states were not randomly selected. Think however of the y values (one for each state) as generated by the following random process:

y = m1 x + b + error = E(y|x) + error

The first part (m1 x + b) is deterministic (determined by the population regression function). The second part (the error term) is random: in terms of a box model, the error term is obtained by randomly drawing a ticket from a box with tickets; each ticket contains a value for the error term. Consider the data to be the result of a natural experiment. As events unfold, "Nature" runs the experiment by drawing an error term from the box whenever x takes a certain value. So a set of observations of x and the corresponding y can be considered as a random sample, even when the observations cover all the possible subjects (such as all 50 states of the US): the chance is in the error terms, not in the cases.


11.4 Example: child mortality

In the previous chapter, we used data for 135 countries to estimate a sample regression function relating child mortality (y) to income per capita (x1) and the literacy rate of young women (x2). The computer output was:

Call: lm(formula = Mortality.rate.Under.5 ~ GNI.per.capita.PPP
    + Literacy.rate.youth.female, data = Dataset)

Residuals:
    Min      1Q  Median      3Q     Max
-50.834 -14.020  -7.594  11.595  90.680

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)
(Intercept)                 2.131e+02  1.224e+01  17.417  < 2e-16 ***
GNI.per.capita.PPP         -7.709e-04  2.161e-04  -3.567 0.000504 ***
Literacy.rate.youth.female -1.813e+00  1.460e-01 -12.419  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 25.89 on 132 degrees of freedom
  (79 observations deleted due to missingness)
Multiple R-squared: 0.6592, Adjusted R-squared: 0.6541
F-statistic: 127.7 on 2 and 132 DF, p-value: < 2.2e-16

The software reports the coefficient estimates in a table. The first column gives the name of the variable, the second the estimated regression coefficient for that variable, and the third column gives the standard error for the coefficient. The standard error is estimated using the bootstrap, so the reported standard errors are only reliable if the sample is sufficiently large. From the computer output we see that the estimated intercept is 213.1, with an SE of 12.24; the estimated slope coefficient of income per capita is −0.00077, with an SE of 0.00022; and the estimated slope coefficient of the literacy rate of young women is −1.813, with an SE of 0.146. The convention is to report the sample regression equation with the standard errors in brackets on the next line, like this:

predicted child mortality = 213.1 − 0.00077 × income − 1.813 × literacy
                    (SEs:)  (12.2)  (0.00022)          (0.146)

In a paper, after reporting the equation above, you would interpret the meaning of the coefficients (see section 10.2): the slope coefficient of income per capita shows that a $1000 increase in income per capita is associated with a decrease of 0.77 in the predicted child mortality rate (the number of children per 1000 who die before their fifth birthday); and the slope coefficient of literacy shows that a 1 percentage point increase in the literacy rate of young females is associated with a drop of 1.813 in the predicted child mortality rate. You would also report and interpret the coefficient of determination: the regression equation (for the 135 countries in the data set) explains about 66% of the variation of child mortality around its mean.


11.5 Confidence interval for a regression coefficient

It can be shown that, if the error terms in the population regression function follow the normal curve, the sampling distribution of the coefficients of the sample regression function also follows the normal curve. Let us also assume that the error terms are homoscedastic, that is, that their spread is the same in each vertical strip. With the estimate and the standard error, we can now compute a 95%-confidence interval for a population regression coefficient using the familiar formula:

coefficient of SRF ± 2 · (SE for coefficient of SRF)

A 95%-confidence interval for the population regression coefficient of income per capita is:

−0.00077 ± 2 × 0.00022

which yields the interval from −0.00121 to −0.00033. So one can be 95% confident that the population regression coefficient of income per capita is between −0.00121 and −0.00033. The interpretation is as before: if 100 researchers each took a random sample and computed a 95%-confidence interval, about 95 of the confidence intervals would cover the population regression coefficient; the other five wouldn't.

This formula works for large samples. For a small sample, you should use a number larger than 2 in the formula above, and the confidence interval will be wider.
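A sketch of this computation in R, using the child-mortality model estimated earlier (fit as in the previous chapter); confint() is shown for comparison, as it uses the exact Student-curve multiplier rather than 2:

est <- coef(fit)["GNI.per.capita.PPP"]
se  <- summary(fit)$coefficients["GNI.per.capita.PPP", "Std. Error"]
c(est - 2 * se, est + 2 * se)                     # about -0.00121 to -0.00033
confint(fit, "GNI.per.capita.PPP", level = 0.95)  # exact interval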

11.6 Hypothesis test for a regression coefficient

With the estimate and the standard error, you can also perform hypothesis tests. The test statistic is:

test statistic = (estimator − hypothetical value) / (SE for estimator)

Suppose you want to test the hypothesis that the population regression coefficient of income per capita is equal to −0.0008, against the two-sided alternative that the coefficient is different from −0.0008. The test statistic then is:

test statistic = (−0.00077 − (−0.0008)) / 0.00022 ≈ 0.135

If the errors of the population regression function follow the normal curve, the test statistic follows the normal curve. The P-value then is the area under the normal curve to the left of −0.135 and to the right of +0.135. This area is equal to about 89% (verify using the normalcdf function of the TI-84). As the P-value is large, we do not reject the null hypothesis.

Suppose we want to test the null hypothesis that the population regression coefficient of income per capita is equal to 0, against the two-sided alternative hypothesis that the population regression coefficient differs from 0. The test statistic is:

test statistic = (−0.00077 − 0) / 0.00022 ≈ −3.567


If the error terms follow the normal curve, so does the test statistic. The P-value then is the area under the normal curve to the left of −3.567 and to the right of +3.567. This area is equal to 0.00036 (verify using the normalcdf function of the TI-84), or about 0.04%. Because the P-value is small, we reject the null hypothesis: the sample evidence supports the alternative hypothesis that the population regression coefficient of income per capita differs from 0. The coefficient is said to be statistically significant (which is short for "statistically significantly different from zero").

Note that the value of the test statistic (−3.567) is shown in the t value column of the table in the R output. The P-value we found is (approximately) the value shown in the Pr(>|t|) column of the R output. The statistical software uses the Student curve to find the P-value; that's why the computer output reports the test statistic as t value rather than as z value. The degrees of freedom are equal to the sample size minus the number of coefficients (including the intercept), in this case: 135 − 3 = 132 (the degrees of freedom are reported in the computer output above). The area under the Student curve with 132 degrees of freedom to the left of −3.567 and to the right of +3.567 is 0.000504 (or about 0.05%), as is reported in the computer output. If the sample is large, there is little difference between a t-test and a z-test, and it is OK to use the normal curve to find the P-value. The codes next to the Pr(>|t|) column are a quick guide to the size of the P-value. The legend below the coefficients table gives the meaning of the symbols:

Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Three asterisks (***) means the P-value is between 0 and 0.001 (0.1%); two asterisks (**) means that the P-value is between 0.001 (0.1%) and 0.01 (1%); one asterisk (*) means that the P-value is between 0.01 (1%) and 0.05 (5%); a dot (.) means that the P-value is between 0.05 (5%) and 0.1 (10%); nothing means that the P-value is between 0.1 (10%) and 1 (100%). So if there is at least one asterisk (*), you can reject the null hypothesis that the coefficient is equal to zero at the 5% significance level.
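If you want to reproduce these P-values in R rather than with the TI-84, a sketch (both curves are symmetric, so twice the left-tail area gives the two-sided P-value):

2 * pnorm(-3.567)          # normal curve: about 0.00036 (0.04%)
2 * pt(-3.567, df = 132)   # Student curve, 132 df: about 0.000504, as in the output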

Remember the following:

– the t value column gives the test statistic for the test of the hypothesis that the population regression coefficient is equal to 0;

– the Pr(>|t|) column gives the P-value for a two-sided test of the hypothesis that the population regression coefficient is equal to 0. If the P-value is sufficiently small, reject the null hypothesis. One or more asterisks means that you can reject the null hypothesis at the conventional significance levels.

Note that statistically significant is not the same as substantive (review Freedman et al. (2007, pp. 552–555)): a coefficient can be statistically significantly different from zero, but at the same time be so small that it is of little substantive importance. Suppose you run a regression relating total sales to advertising spending. You find that a $1000 increase in advertising spending is associated with an increase in predicted total sales of $1, and that the coefficient of advertising spending is statistically significant. From the business context, it is clear that the effect is not substantive, even though it is statistically significant. Conversely, a coefficient can be statistically insignificant but substantive. Suppose that a rehydration set (good for a week-long treatment) costs $5. You find that a drop in the price of a rehydration set by $1 is associated with a drop of 10 in the predicted child mortality rate (per 1000 children under five years old), but that the coefficient is not statistically significant at the 5% level. Should you dismiss the relationship between the cost of a rehydration set and child mortality? Probably not, as the effect you found is substantive: in the sample, a modest drop in the price of a rehydration set is associated with a substantive drop in child mortality. Even though the coefficient was statistically insignificant, it is probably worth paying attention to the price of rehydration sets.

To avoid confusion, use the term "statistically significant" (rather than "significant") when you talk about statistical significance; use the term "substantive" when you talk about the size of the coefficient.

Statistics can tell you whether a coefficient is statistically significant or not, but not whether the size of a coefficient is substantive; to know whether a coefficient is substantive, you should use your judgement in the context of the problem.

11.7 Assumptions of the regression model

In the last two chapters we made a number of assumptions that were needed to make regression work. It is useful to summarize the assumptions (Kennedy, 2003, pp. 41–42):

1. the dependent variable is a linear function of the independent variable(s), plus an error term;

2. the expectation of the error term is zero (if that is not the case, the estimate of the intercept is biased);

3. the observations on the independent variable are not random: they can be considered fixed in repeated samples (if that is not the case, the coefficient estimates are biased);

4. the error terms have the same standard error (are homoscedastic) and are not correlated with each other (if that is not the case, the estimates for the SEs may be far off and hence inference is no longer valid; the computed coefficient of determination may also be misleading. But the estimators are still unbiased);

5. the distribution of the error terms follows the normal curve. This assumption is needed to do inference (make confidence intervals, do hypothesis tests); but even if the error terms don’t follow the normal curve, the estimators are still unbiased.

A final warning concerns time series data. Time series data are values measured at recurring points in time. For instance, annual data from the national income accounts on GDP and its components (consumption, investment, government purchases, and net exports) are time series. Time series data are usually notated with a time index (y_t, x_t). A time series of n observations of the variable y_t is a list that looks like this:

{y_1, y_2, y_3, ..., y_t, ..., y_n}

where y_1 is the value of y (say, consumption) observed in the first period (say, the year 2000), y_2 is the value of y observed in the second period (the year 2001), and so on.

y (say, consumption) observed in the first period (say, is the value of y observed in the second period (the year

It turns out that many time series have a statistical property called nonstationarity . Amongst other things, the presence of a time trend in the data (a tendency for the values to go up or down over time) will make a series non-stationary. To spot a possible time trend, it is a good idea to plot a time series diagram of each time series. A time series diagram is a line diagram with time ( t ) on the horizontal axis and the time series ( y t on the vertical axis

(include figure as example).

If the data are non-stationary, the results of regression (and of inference based on regression) may be wrong.

Be cautious if your data are time series. Two easy fixes may work: include the time variable t as one of the independent variables in the multiple regression, or use the change in y and the change in x in the regression. Still, if you suspect non-stationarity, consult someone who knows how to deal with it.
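A sketch of the two easy fixes on made-up trending series (the numbers are purely illustrative):

set.seed(2)                           # arbitrary seed
t <- 1:40                             # time index
x <- 10 + 0.5 * t + rnorm(40)         # a trending independent variable
y <- 3 + 2 * x + 0.3 * t + rnorm(40)  # a trending dependent variable
fix1 <- lm(y ~ x + t)                 # fix 1: include the time variable t
fix2 <- lm(diff(y) ~ diff(x))         # fix 2: regress the change in y on the change in x
summary(fix1); summary(fix2)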

11.8 Questions for Review

1. What is a population regression function?

2. How is the error of regression defined? What does it capture?

3. What is a sample regression function?

4. Why does an estimated sample regression function differ from the population regression function?

5. What does it mean that the slope of the sample regression function is an unbiased estimator of the slope of the population regression function?

6. How is the chance error of the slope defined?

7. What does the standard error (SE) of the slope of the sample regression function measure?

8. Given that in practice we don't know the population, how can we estimate the standard error (SE) of the slope of the sample regression function?

9. What happens to the standard error (SE) of the slope of the sample regression function if (other things equal) the sample gets bigger?

10. How do you compute a 95% confidence interval for the slope of the population regression function? Under which conditions can you apply the formula?

11. How do you interpret a 95% confidence interval for the slope of the population regression function? Give the exact probability interpretation, using the concept of repeated samples.


12. How do you compute the test statistic for a test on a coefficient from a regression?

13. Suppose that you want to test the null hypothesis that a coefficient of the population regression is equal to zero. How do you interpret the P -value for the test?

14. What does the column Estimate in computer regression output report?

15. What does the column Std. Error in computer regression output report?

16. What does the column t value in computer regression output report?

17. What does the column Pr(>|t|) in computer regression output report?

18. What is the meaning of the Residual standard error in computer regression output?

19. What is the meaning of the R-squared in computer regression output?

20. What are the assumptions underlying the multiple regression model used in this chapter?

21. What are time series data? Illustrate using an example.

22. Why should you be careful when using the multiple regression model for time series data?

11.9 Exercises

1. For the example in section 11.1, verify that E(y | x = $180) = $125. Show your work.

2. Find a 95%-confidence interval for the population regression coefficient of the literacy rate in the child mortality regression. Give the probability interpretation of a 95%-confidence interval. Which assumptions did you have to make?

3. For 14 systems analysts, their annual salaries (in $), years of experience, and years of postsecondary education were recorded (Kazmier, 1995, table 15.2 p. 275) (same regression as in the exercise of the previous chapter). Below is the computer output for the multiple regression of the annual salaries on the years of experience and the years of postsecondary education:

lm(formula = annual.salary ~ years.of.experience
    + years.of.postsecondary.education, data = Dataset)

Residuals:
    Min      1Q  Median      3Q     Max
-3998.6 -1379.1  -158.7  1067.9  4343.6

Coefficients:
                                 Estimate Std. Error t value Pr(>|t|)
(Intercept)                       45470.9     2731.2  16.649 3.78e-09 ***
years.of.experience                 842.3      207.7   4.056   0.0019 **
years.of.postsecondary.education   1605.3      544.4   2.948   0.0132 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2189 on 11 degrees of freedom
Multiple R-squared: 0.6715, Adjusted R-squared: 0.6118
F-statistic: 11.24 on 2 and 11 DF, p-value: 0.002192

(a) Download the data (Kazmier-1995-table-15-2.csv) from the course web site and run the regression in R with R Commander. You should get the same output as shown above.

(b) Report the estimated regression equation (with the SEs), like you would in a paper.

(c) Explain the meaning of the SE for the coefficient of years of experience.

(d) Find a 95% confidence interval for each of the three population regression coefficients. Make explicit which assumptions you made. (Ignore the fact that the sample is small.)

(e) Explain the exact probability meaning of the 95% confidence interval for the coefficient of years of experience.

(f) Test the null hypothesis that the population regression coefficient of years of experience is equal to $1000/year. Make explicit which assumptions you made. (Ignore the fact that the sample is small.)

(g) What do the asterisks ( * ) in the right column of the coefficients table mean? Test the null hypothesis that the intercept of the population regression function is equal to 0. Test the null hypothesis that the population regression coefficient of years of experience is equal to 0. Test the null hypothesis that the population regression coefficient of years of postsecondary education is equal to 0. Make explicit which assumptions you made.


4. A researcher collected the prices (in $) of 30 randomly selected single-family houses, together with the living area (in square feet) and the lot size (in square feet) of each house (Kazmier, 1995, table 15.3 p. 290). Here's the computer output for the descriptive statistics:

                        mean         sd  n
Living.area.sq.ft    1920.00   508.8188 30
Lot.size.sq.ft      15266.67  3204.8813 30
Price.USD          134233.33 33217.4817 30

This is the computer output for the multiple regression of the price on living area and lot size:

lm(formula = Price.USD ~ Living.area.sq.ft + Lot.size.sq.ft, data = House.prices)

Residuals:
     Min       1Q   Median       3Q      Max
-16021.8  -4935.4   -616.8   3352.0  31599.6

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)       22168.582   9556.970   2.320   0.0282 *
Living.area.sq.ft    77.070     11.972
Lot.size.sq.ft       -2.352      1.901  -1.237   0.2266
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9087 on 27 degrees of freedom
Multiple R-squared: 0.9303, Adjusted R-squared: 0.9252
F-statistic: 180.3 on 2 and 27 DF, p-value: 2.407e-16

(The t value and Pr(>|t|) entries for Living.area.sq.ft are omitted on purpose; see part (g).)

(a) Report the estimated regression equation (with the SEs), like you would in a paper.

(b) Interpret the estimated intercept. Should we give much weight to this interpretation? Why (not)?

(c) What are the units of measurement of the slope coefficients? Interpret the estimated slope coefficients.

(d) Explain the meaning of the SE for the coefficient of living area.

(e) Find a 95% confidence interval for each of the three population regression coefficients. Make explicit which assumptions you made.

(f) Explain the exact probability meaning of the 95% confidence interval for the coefficient of living area.

(g) Test the null hypothesis that the population regression coefficient of living area is equal to zero. Use the normal curve to find the P-value. Make explicit which assumptions you made. Complete the columns t value and Pr(>|t|) (which I omitted for the coefficient of Living.area.sq.ft). How many asterisks should be in the last column? Explain.


(h) Interpret the r.m.s. error of regression ( Residual standard error in the computer output).

(i) Interpret the R² (R-squared in the computer output).

(j) Download the data (Kazmier-1995-table-15-3.csv) from the course web site and run the regression with R Commander. You should get the same output as shown above.

Chapter 12

The Chi-Square test

Read Freedman et al. (2007, Ch. 28). Skip the explanation of how to use χ² tables (starting on p. 527 with "In principle, there is one table . . . " and ending with the sketch at the top of p. 528); statisticians use a statistical calculator or statistical software to find areas under the χ²-curve. Also skip section 3 ("How Fisher used the χ²-test"), pp. 533–535.
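In R, areas under the χ²-curve come from pchisq(); the statistic and the degrees of freedom below are hypothetical, just to show the call:

# P-value = area under the chi-square curve to the right of the statistic,
# e.g. a statistic of 7.5 with 3 degrees of freedom:
pchisq(7.5, df = 3, lower.tail = FALSE)
1 - pchisq(7.5, df = 3)   # equivalent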

Questions for Review

1. When should the χ²-test be used, as opposed to the z-test?

2. What are the six ingredients of a χ²-test?

12.1 Exercises

Work the following exercises from Freedman et al. (2007), chapter 28: Set A: 1, 2, 3, 4, 7, 8. Set C: 2. Review exercises: 7.


Bibliography

Freedman, D., Pisani, R., and Purves, R. (2007). Statistics. Norton, New York and London, 4th edition.

Garcia, J. and Quintana-Domeque, C. (2007). The evolution of adult height in Europe: A brief note. Economics & Human Biology, 5(2):340–349.

Gujarati, D. N. (2003). Basic Econometrics. McGraw-Hill, Boston, 4th edition.

Heston, A., Summers, R., and Aten, B. (2012). Penn World Table Version 7.1. Center for International Comparisons of Production, Income and Prices at the University of Pennsylvania, Philadelphia.

Kazmier, L. J. (1995). Schaum's Outline of Theory and Problems of Business Statistics. Schaum's Outline Series. McGraw-Hill, New York.

Kennedy, P. (2003). A Guide to Econometrics. Blackwell, Malden, MA, 6th edition.

Moore, D. S., McCabe, G. P., and Craig, B. A. (2012). Introduction to the Practice of Statistics. Freeman, New York, 7th edition.

end poverty in 15 years – BBC Two. (video).

World Bank (2013). World Bank Open Data. Consulted on 21 November 2013 on http://data.worldbank.org.
