Distributions

Distributions

• When comparing two groups of people or things, we can almost never rely on a single comparison

• Example: Are men taller than women?

Distributions

• We almost always measure several or many representative people or things

• We also almost never measure every person or thing

• Instead, we measure some of them

• The “some of them” that you measure is called a sample because we have “sampled” from the entire population

Distributions

• The population is every possible person or thing that could have been part of the sample (e.g. all of the men in the world, all of the women, etc.)

• We can tell a lot about a population by looking at a sample (e.g. you don’t need to eat a whole container of ice cream to know if you like it!)

Distributions

• When you measure several different things you get (no surprise!) different numbers

• We say that those numbers are distributed

Distributions

• A distribution is a set of numbers.

– Examples: the heights of the men in the room, the heights of the women in the room, the ages in the room, the scores on the mid-term, etc.

Distributions

• Looking at distributions:

– We often conceptualize distributions by graphing them, for example as a histogram of counts or as a probability density function

[Figure: Age Distribution. x-axis: Ages (18 to 30); y-axis: 0 to 60]
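
A distribution like the one graphed above can be tabulated by counting how often each value occurs. The sketch below is only an illustration: the `ages` list is made-up data, not the ages actually plotted on the slide.

```python
from collections import Counter

# Hypothetical ages for a small class (illustrative data only).
ages = [18, 19, 19, 20, 20, 20, 21, 21, 22, 25]

# Count how many times each age occurs: this is the frequency distribution.
frequency = Counter(ages)

# Print a crude text histogram, one row per age from youngest to oldest.
for age in range(min(ages), max(ages) + 1):
    print(f"{age:>2} | {'#' * frequency[age]}")
```

Ages that nobody has (23 and 24 here) show up as empty rows, which is part of the shape of the distribution.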

Distributions

• Looking at distributions:

– Here’s an example of a “normal” distribution

[Figure: Age Distribution. x-axis: Ages (18 to 30); y-axis: 0 to 60]

Distributions

• Looking at distributions:

– Here’s an example of a “rectangular” distribution

[Figure: Birthdays. x-axis: 1 to 31; y-axis: 0 to 60]

Distributions

• key insight: The measurements in a sample are distributed because the population is distributed

• Ponder this: the more people or things in your sample, the more your sample is like the entire population

– It’s like “sampling” ice cream with a really big spoon
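
The “big spoon” idea can be tried out with a small simulation. This is a minimal sketch under stated assumptions: the population is 10,000 made-up heights in whole centimetres, and `sample_mean` is a helper named here for illustration.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population: 10,000 heights in whole centimetres.
population = [random.randint(150, 200) for _ in range(10_000)]
population_mean = sum(population) / len(population)

def sample_mean(k):
    """Mean of k measurements drawn without replacement from the population."""
    sample = random.sample(population, k)
    return sum(sample) / k

# Bigger "spoons" tend to give means closer to the population mean.
for k in (10, 100, 1_000, len(population)):
    print(f"n = {k:>6}: sample mean = {sample_mean(k):.2f} "
          f"(population mean = {population_mean:.2f})")
```

When the sample is the whole population (the last row), the sample mean and the population mean are the same number.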

Describing Distributions

• It’s no good to just have a pile of numbers; we need a way of summarizing the characteristics of the distribution.

What are some ways to describe a distribution?

Describing Distributions

• All distributions have a sum

– We could just add up the samples and talk about, for example, the total height of the men and the total height of the women in the room.

– What’s the problem with this approach?

Describing Distributions

• All distributions have a mean (a.k.a. the average)

– The mean is the normalized sum: it is adjusted for the number of measurements in the sample

– How do we do that?

– Divide the sum by the number in the sample

“The” Mean

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

– $x_1$ is measurement number 1

– $x_n$ is the last measurement in the distribution (of $n$ measurements)

– $x_i$ is any one of the measurements (you can fill in the $i$ with any number between 1 and $n$)

– $\sum$ means “add these up”

– $\bar{x}$ is “x bar” (the mean)

“The” Mean

$$\bar{x} = \frac{\text{Sum of the sample}}{\text{Number of measurements}}$$
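
The definition above translates directly into code. A minimal sketch, where the `mean` function and the five-number sample are illustrative choices:

```python
def mean(xs):
    """Sum of the sample divided by the number of measurements."""
    if not xs:
        raise ValueError("the mean of an empty sample is undefined")
    return sum(xs) / len(xs)

sample = [1, 2, 3, 100, 4]
print(mean(sample))  # 110 / 5 = 22.0
```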

Properties of the Mean

• Every value is some distance from the mean

– this distance is called a “deviation score”: deviation score $= x_i - \bar{x}$

Properties of the Mean

• The mean is the point from which the sum of deviation scores is zero

• This means that the mean is like a balancing point: all the scores below the mean are balanced by the scores above the mean
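
The balancing-point property is easy to check numerically. The sketch below uses an illustrative sample and a helper named `deviation_scores`; neither comes from the slides.

```python
def deviation_scores(xs):
    """Return each measurement's signed distance from the sample mean."""
    x_bar = sum(xs) / len(xs)
    return [x - x_bar for x in xs]

sample = [1, 2, 3, 100, 4]
deviations = deviation_scores(sample)
print(deviations)        # four negative scores and one large positive score
print(sum(deviations))   # the negatives and the positive cancel
```

With real-valued data the sum may come out as a tiny rounding error rather than exactly zero.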

Properties of the Mean

• The sum of the squared deviations from the mean is smaller than from any other number

$$\sum_{i=1}^{n} (x_i - \bar{x})^2 < \sum_{i=1}^{n} (x_i - Y)^2$$

where Y is any other number ($Y \neq \bar{x}$)
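
A quick numerical check of this property, as a sketch: the sample and the candidate values of Y are arbitrary illustrative choices, and `sum_squared_deviations` is a helper named here for the example.

```python
def sum_squared_deviations(xs, anchor):
    """Sum of squared distances of every measurement from `anchor`."""
    return sum((x - anchor) ** 2 for x in xs)

sample = [1, 2, 3, 100, 4]
x_bar = sum(sample) / len(sample)

at_mean = sum_squared_deviations(sample, x_bar)
# Try a few other candidate values Y, on both sides of the mean.
others = {y: sum_squared_deviations(sample, y) for y in (3, 21, 23, 50)}

print(at_mean, others)
```

Every candidate Y gives a larger total than the mean does, and the total grows as Y moves further from the mean.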

Properties of the Mean

• The mean is the number that, when added up n times, gives you the sum of the numbers in the sample

$$n\,\bar{x} = \sum_{i=1}^{n} x_i$$

“Other” Means

• Sometimes just adding the numbers in the sample and dividing by n gives you a number that doesn’t really describe the n numbers

– for example: for a sine wave oscillating between +1 and -1, $\sum x_i = 0$!

“Other” Means

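• Root-Mean-Square (RMS): first square the scores, then take the mean of the squares, then take the square root to undo the squaring. For a wave between +1 and -1 the squares are never negative, so they no longer cancel.

$$x_{\mathrm{RMS}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2}$$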
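
To see the difference between the two kinds of mean, the sketch below samples one full cycle of a sine wave. The 1,000-point sampling is an arbitrary choice for illustration. For a wave of amplitude 1, the RMS should come out close to 1/√2, about 0.707.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def rms(xs):
    """Root-mean-square: square each score, average the squares, take the root."""
    return math.sqrt(sum(x * x for x in xs) / len(xs))

# One full cycle of a sine wave, sampled at 1,000 evenly spaced points.
n = 1000
wave = [math.sin(2 * math.pi * i / n) for i in range(n)]

print(f"mean = {mean(wave):.6f}")
print(f"RMS  = {rms(wave):.6f}")
```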

Other Descriptions of a Distribution: the Median

• The mean is sensitive to outliers

– e.g. 1, 2, 3, 100, 4

– mean = 110/5 = 22 … not particularly representative of the numbers in the sample

Other Descriptions of a Distribution: the Median

• Another descriptive statistic, the median, is less sensitive to outliers

– the median is the ordinal middle of the sample: half of the measurements lie below the median and half of the measurements lie above it.

– in other words, it is the 50th percentile

Other Descriptions of a Distribution: the Median

• for example:

– 1, 2, 3, 100, 4 put into rank order is…

– 1, 2, 3, 4, 100

– so the middle number (obviously) is 3

(remember that the mean was 22!)

Other Descriptions of a Distribution: the Median

• if n is even, take the average of the two middle numbers:

– 1, 2, 3, 100, 4, 5 put into rank order is…

– 1, 2, 3, 4, 5, 100

– so the middle number is the average of 3 and 4 = 3.5

Other Descriptions of a Distribution: the Median

• the median is much less sensitive to outliers than the mean

– notice the median of 1, 2, 3, 4, 5 = the median of 1, 2, 3, 4, 100 = 3
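
The rank-order recipe for odd and even n can be sketched as follows. The `median` function is written out for illustration; Python's standard library offers `statistics.median` for the same job.

```python
def median(xs):
    """Middle value in rank order; average the two middle values if n is even."""
    ranked = sorted(xs)
    n = len(ranked)
    if n == 0:
        raise ValueError("the median of an empty sample is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ranked[mid]
    return (ranked[mid - 1] + ranked[mid]) / 2

print(median([1, 2, 3, 100, 4]))     # 3
print(median([1, 2, 3, 100, 4, 5]))  # 3.5
print(median([1, 2, 3, 4, 5]))       # 3
```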

Measures of Variability

What’s not so good about using the mean to describe a distribution?

Measures of Variability

Example: similar mean temperature in Vancouver and Lethbridge on Sept. 11, 2006

Time      Lethbridge Temperature   Vancouver Temperature
5:00      5.5                      11.9
6:00      5.1                      12.3
7:00      9.6                      14.3
8:00      14.6                     16.6
9:00      18                       18.3
10:00     21                       17.7
11:00     23.7                     17.7
12:00     25.1                     19.3
13:00     26.6                     20.5
14:00     27.7                     20.2
15:00     28.2                     20.2
16:00     29.1                     19.8
17:00     28.6                     19.1
18:00     26.7                     17.7
19:00     19.2                     17.1
20:00     17                       17.2
21:00     19.9                     16.3
mean =    20.3                     17.4

Measures of Variability

Example: BUT the distribution of temperatures is quite different for the two cities

Time                   Lethbridge Temperature   Vancouver Temperature
5:00                   5.5                      11.9
6:00                   5.1                      12.3
7:00                   9.6                      14.3
8:00                   14.6                     16.6
9:00                   18                       18.3
10:00                  21                       17.7
11:00                  23.7                     17.7
12:00                  25.1                     19.3
13:00                  26.6                     20.5
14:00                  27.7                     20.2
15:00                  28.2                     20.2
16:00                  29.1                     19.8
17:00                  28.6                     19.1
18:00                  26.7                     17.7
19:00                  19.2                     17.1
20:00                  17                       17.2
21:00                  19.9                     16.3
mean =                 20.3                     17.4
range =                24.0                     8.6
standard deviation =   7.6                      2.5
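
The summary rows can be recomputed from the hourly readings (5:00 to 21:00). The sketch below assumes the standard deviation divides by n rather than n - 1; that assumption reproduces the 7.6 and 2.5 reported for this example. `summarize` is a helper named here for illustration.

```python
import math

lethbridge = [5.5, 5.1, 9.6, 14.6, 18, 21, 23.7, 25.1, 26.6, 27.7,
              28.2, 29.1, 28.6, 26.7, 19.2, 17, 19.9]
vancouver = [11.9, 12.3, 14.3, 16.6, 18.3, 17.7, 17.7, 19.3, 20.5, 20.2,
             20.2, 19.8, 19.1, 17.7, 17.1, 17.2, 16.3]

def summarize(temps):
    """Return (mean, range, standard deviation) rounded to one decimal place."""
    n = len(temps)
    mean = sum(temps) / n
    spread = max(temps) - min(temps)
    # Assumption: divide by n (not n - 1) when averaging squared deviations.
    sd = math.sqrt(sum((t - mean) ** 2 for t in temps) / n)
    return round(mean, 1), round(spread, 1), round(sd, 1)

for city, temps in (("Lethbridge", lethbridge), ("Vancouver", vancouver)):
    print(city, summarize(temps))
```

The two means are only a few degrees apart, while Lethbridge's range and standard deviation are roughly three times Vancouver's.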

Measures of Variability

• The range is the highest number minus the lowest number

• e.g. X = {1, 3, 23, 45, 62}

• the range is 62 - 1 = 61

• Notice that the range doesn’t tell you much about the distribution of numbers.

– it doesn’t tell you where the distribution is located (the mean)

– it doesn’t tell you how the numbers relate to each other: e.g. 1, 48, 49, 50, 51, 52, 100 has a range of 99!
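
A short sketch of the range and its blind spot. The third sample is a made-up, evenly spread set added for contrast; `value_range` is a name chosen here for illustration.

```python
def value_range(xs):
    """Highest number minus lowest number."""
    return max(xs) - min(xs)

print(value_range([1, 3, 23, 45, 62]))             # 61
print(value_range([1, 48, 49, 50, 51, 52, 100]))   # 99
print(value_range([1, 17, 34, 50, 67, 83, 100]))   # 99
```

The last two samples share a range of 99 even though one is tightly bunched around 50 and the other is spread evenly.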

Measures of Variability

• What’s needed is a measure of the “distance” between the numbers in the distribution: how spread apart are they from each other?

Measures of Variability

Question: How tightly or loosely spaced are the cities?

D²

• One approach would be to calculate the distances between each pair of cities

[Table: distance matrix with one row and one column for each city: Vancouver, Hope, Cache Creek, Kamloops, Salmon Arm, Revelstoke, Lake Louise, Banff, Calgary, Medicine Hat, Swift Current. Entries filled in on successive slides: 0, 150, 343, -150, 0, 193]

D²

• notice that there are n × n = n² pairs

D²

• If you sum up all the differences between numbers you get…

ZERO
D²

• What does a statistician do when things sum to zero?

• Square everything first, then sum the squares, then take the square root

D²

• D² is the sum of the squared differences

• D is the square root of D²
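
The pairwise approach can be sketched in a few lines. The three positions below are assumed values, used only to illustrate the arithmetic for a handful of cities; they are not taken from a distance table.

```python
import math
from itertools import product

# Assumed positions (distance along the route) for three of the cities,
# used only to illustrate the arithmetic.
positions = {"Vancouver": 0, "Hope": 150, "Cache Creek": 343}

# All n * n ordered pairs, including each city paired with itself.
pairs = list(product(positions.values(), repeat=2))

signed_sum = sum(a - b for a, b in pairs)   # positives and negatives cancel
d_squared = sum((a - b) ** 2 for a, b in pairs)
d = math.sqrt(d_squared)

print(len(pairs), signed_sum, d_squared, round(d, 1))
```

Three cities already give nine pairs, which hints at how quickly the bookkeeping grows with n.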

D²

• What is the problem with using D or D²?

• if n is “pretty big”, n² will be huge!

S²: a better choice

• Select a representative “anchor point” and just measure distance from that point

• For example, measure distances relative to Calgary

S²: a better choice

• Notice there are some negative distances

• We don’t care about the sign of the distances; we just care about the distances themselves

S²: a better choice

• S² (called the variance) is like D² except it uses a single “anchor point” (like measuring distances from Calgary)

• That anchor point is the mean

S: the standard deviation

• The standard deviation of a distribution of values is the square root of the variance
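
A minimal sketch of the variance and standard deviation, using the mean as the anchor point. It assumes the squared deviations are averaged by dividing by n; some textbooks divide by n - 1 instead, which gives slightly larger values. The function names are illustrative.

```python
import math
import statistics

def variance(xs):
    """S^2: mean squared distance from the mean (dividing by n is an assumption)."""
    x_bar = sum(xs) / len(xs)
    return sum((x - x_bar) ** 2 for x in xs) / len(xs)

def standard_deviation(xs):
    """S: the square root of the variance."""
    return math.sqrt(variance(xs))

sample = [1, 2, 3, 4, 5]
print(variance(sample))             # 2.0
print(standard_deviation(sample))   # square root of 2

# Cross-check against the standard library's divide-by-n versions.
assert math.isclose(variance(sample), statistics.pvariance(sample))
assert math.isclose(standard_deviation(sample), statistics.pstdev(sample))
```

Unlike D², this needs only n subtractions rather than n² of them.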

S: the standard deviation

• That can be rewritten this way for use with a calculator:
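
One common calculator-friendly rearrangement, assumed here rather than taken from the slide, needs only two running totals, the sum of the scores and the sum of their squares: $S = \sqrt{\left(\sum x_i^2 - (\sum x_i)^2 / n\right) / n}$. The sketch below checks that it agrees with the definition, again under the divide-by-n assumption.

```python
import math

def sd_definitional(xs):
    """Square root of the mean squared deviation from the mean."""
    n = len(xs)
    x_bar = sum(xs) / n
    return math.sqrt(sum((x - x_bar) ** 2 for x in xs) / n)

def sd_computational(xs):
    """Same quantity from two running totals: sum of x and sum of x squared."""
    n = len(xs)
    total = sum(xs)
    total_of_squares = sum(x * x for x in xs)
    return math.sqrt((total_of_squares - total ** 2 / n) / n)

sample = [1, 3, 23, 45, 62]
print(sd_definitional(sample), sd_computational(sample))
```

The two-totals form is convenient by hand, but it can lose precision in floating point when the scores are large and close together.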

Next Time

• Transforming Scores (chapter 4)

• We begin significance testing (chs. 11, 12, 13, 14)
