Chapter 2


MAT 107 Chapter 2 Descriptive Statistics:

(Covered in a different order, and with different emphasis and technique, than your text!)

Graphical Displays

For summarizing categorical data, the primary displays are pie charts and bar graphs.

Pie Charts: circles where each “slice” represents a category and the size of each slice corresponds to the proportion or percentage of observations in that category.

Bar Charts display a vertical bar for each category. The height of the bar is the percentage or proportion of observations in that category.

The proportion of observations in a class or category is the frequency of observations that fall in the class divided by the total number of observations. The percentage is the proportion times 100. Proportions and percentages are both known as relative frequencies. If we let n be the sample size and f be the frequency of a class, then RF = f/n.

A frequency table is a listing of all the classes and their corresponding frequencies. It is necessary to create a frequency table before making a pie chart or bar graph or histogram (later).
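As a minimal sketch of how a frequency table and the relative frequencies RF = f/n fit together, the Python below tallies a small categorical sample. The carrier labels and the sample itself are made up for illustration; they are not data from the class.

```python
from collections import Counter

# Hypothetical sample of a categorical variable (labels are made up).
carriers = ["A", "B", "A", "C", "A", "B", "A", "C", "B", "A"]

def frequency_table(observations):
    """Return {category: (frequency, relative frequency)} with RF = f / n."""
    n = len(observations)
    counts = Counter(observations)
    return {cat: (f, f / n) for cat, f in sorted(counts.items())}

table = frequency_table(carriers)
for cat, (f, rf) in table.items():
    # The percentage is the proportion times 100.
    print(f"{cat}: f = {f}, RF = {rf:.2f}, percent = {100 * rf:.0f}%")
```

The relative frequencies always add up to 1 (and the percentages to 100), which is a quick check on the table.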

Examples:

Nominal: Cell phone carrier.

Ordinal: Year at college

Graphs for quantitative variables

Dot plots: a dot for each observation is placed above the appropriate number on a number line.

Stem and Leaf Plots: each observation is represented as a stem and leaf. The stem usually consists of all the digits of the number except for the last one, which is the leaf.

Sometimes stems can be split (for example, into one row for leaves 0-4 and one row for leaves 5-9).

Dot plots and stem and leaf plots are only reasonable for small data sets.
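The stem-and-leaf idea can be sketched in a few lines of Python. This is only an illustration under two assumptions: the data are non-negative whole numbers, and the small data set below is made up rather than taken from the class.

```python
from collections import defaultdict

# Hypothetical small data set of whole numbers (illustrative only).
data = [47, 52, 55, 58, 61, 61, 63, 70, 74]

def stem_and_leaf(values):
    """Map each stem (all digits but the last) to its sorted list of leaves."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)   # the last digit is the leaf
        plot[stem].append(leaf)
    return dict(plot)

plot = stem_and_leaf(data)
for stem in range(min(plot), max(plot) + 1):
    leaves = "".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem} | {leaves}")
```

Each printed row is one stem followed by its leaves, so the length of a row plays the same role as the height of a histogram bar.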

For larger data sets a histogram is used. (This differs from what your book states.)

A Histogram is a graph that uses bars to portray the frequencies or relative frequencies of the possible outcomes for a quantitative variable. If you are asked to make a histogram, always make a frequency distribution first!

Steps for constructing a Frequency Distribution and Histogram

1. Choose a range of values that captures all of the data. Then divide the range of the data into classes, non-overlapping intervals of equal length. The endpoints are the boundaries. For a discrete set of data with a small number of values, use the actual values as the classes.

2. The text and calculator use the left-endpoint convention, meaning that an observation equal to the lower or left endpoint is included in that class and the upper or right endpoint is not included. Intervals have the form [a, b), like [1, 2), [2, 3), … What is important is that every observation is in one and only one class.

3. Guideline: there should be 5 – 20 intervals and you should use “nice” numbers like [10, 20) or [0, .5), [.5, 1), not numbers like [10.387, 12.451), …

4. Count the number of observations in each class, forming a frequency table.

5. Compute the relative frequencies, by dividing the frequency in each class by the total sample size.

6. Find the cumulative relative frequency of each class: the sum of all the relative frequencies up to and including that class.

7. On the horizontal axis, label the values or the endpoints of the intervals. Draw a bar over each class or value with height equal to its frequency (or percentage). The vertical axis should be scaled and labeled with either the raw or relative frequencies. Both the horizontal and vertical axes should be scaled so that all the classes and frequencies fit and are distinguishable.

Histogram example

Heights of students (inches) from a previous class

Heights:

57, 74, 59, 48, 66, 60, 70, 47, 61, 55, 57, 70, 67, 62, 62, 58, 55, 62, 71, 68

We have 20 data points and want to break them up into 4 or 5 classes.

The range of the data = max – min = 74 – 47 = 27. Note that there are 28 integers between 47 and 74 if you include the endpoints.

So, 4 classes that are 7 units long will work fine.

The classes would then be [47, 54), [54, 61), [61, 68), [68, 75).

Since we have integer data the following classes are equivalent,

[47, 53], [54, 60], [61, 67], [68, 74]. Note that all the data points are included and no point is in more than one class.

The frequency distribution would then be:

Class      Freq   Rel. Freq   Cum. Rel. Freq
[47, 53]   2      .10         .10
[54, 60]   7      .35         .45
[61, 67]   6      .30         .75
[68, 74]   5      .25         1.00

Note that the total is 20. Now just make a bar graph of the frequency distribution.

[Figure: Histogram of Heights. Horizontal axis: Height in inches, with classes [47, 53], [54, 60], [61, 67], [68, 74]. Vertical axis: Freq, scaled from 0 to 8.]

Note that the axes are labeled and the histogram is titled and there are no gaps in the bars.
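The steps for building the frequency distribution can be sketched in Python as a check on the hand count. This is a minimal illustration, not the method of the text or the calculator; the function name and its parameters are made up here, and the classes are the left-endpoint intervals [47, 54), [54, 61), [61, 68), [68, 75) from the example.

```python
# Heights (inches) from the example above.
heights = [57, 74, 59, 48, 66, 60, 70, 47, 61, 55,
           57, 70, 67, 62, 62, 58, 55, 62, 71, 68]

def frequency_distribution(data, start, width, num_classes):
    """Tally data into classes [a, b) using the left-endpoint convention."""
    n = len(data)
    rows, cumulative = [], 0
    for i in range(num_classes):
        a = start + i * width
        b = a + width
        freq = sum(1 for x in data if a <= x < b)   # a included, b excluded
        cumulative += freq
        rows.append((a, b, freq, freq / n, cumulative / n))
    return rows

rows = frequency_distribution(heights, start=47, width=7, num_classes=4)
for a, b, freq, rel, cum in rows:
    # The row of '#' marks is a sideways text version of the histogram bar.
    print(f"[{a}, {b})  {freq:2d}  {rel:.2f}  {cum:.2f}  {'#' * freq}")
```

The frequencies should add up to the sample size, and the last cumulative relative frequency should be 1.00, just as in the hand-built table.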

Note that Excel uses right-endpoint inclusion.

Some books use histograms with unequal class widths. This is almost always a bad idea!

For Quantitative data there are 3 kinds of plots:

Dot plots

Stem and leaf plots

Histograms

Dot plots and Stem and leaf plots are used for small data sets (under 50 observations).

Histograms are more flexible, because of classes

Histograms, dot plots, and stem-and-leaf plots allow us to see the shape of the distribution. Things to look for:

1. Outliers: rare or unusual observations.

2. The mode, or most common observation or class: unimodal vs. bimodal and multimodal.

3. Symmetry of the data set.

a. Symmetric: when you divide the histogram down the middle, the left side is a mirror image of the right side.

b. Skewed left: if the left tail of the histogram is longer than the right tail. The small observations are more extreme than the large observations.

c. Skewed right: if the right tail of the histogram is longer than the left tail. The large observations are more extreme than the small observations.

Ex. 2.7, page 56. Given: a stem and leaf plot of the heights of male semi-professional soccer players.

60; 60.5; 61; 61; 61.5;

63.5; 63.5; 63.5;

64; 64; 64; 64; 64; 64; 64; 64.5; 64.5; 64.5; 64.5; 64.5; 64.5; 64.5; 64.5;

66; 66; 66; 66; 66; 66; 66; 66; 66; 66; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 67; 67;

67; 67; 67; 67; 67; 67; 67; 67; 67; 67; 67.5; 67.5; 67.5; 67.5; 67.5; 67.5; 67.5;

68; 68; 69; 69; 69; 69; 69; 69; 69; 69; 69; 69; 69.5; 69.5; 69.5; 69.5; 69.5;

70; 70; 70; 70; 70; 70; 70.5; 70.5; 70.5; 71; 71; 71;

72; 72; 72; 72.5; 72.5; 73; 73.5;

74;

This stem and leaf plot is terrible. It is not very clear what is going on here.

See Example2-7.xlsx

In class ex:

Misleading your audience with statistics

Guidelines for Constructing Effective Graphs

1. Label both axes and provide title.

2. Compare relative sizes accurately, scale correctly! Y axis should start at 0

3. Use standard shapes and symbols.

4. Displaying more than one group on a single graph can be difficult.

Don’ts

1. Do not use scale breaks in any of your axes!

2. When making a histogram, use classes and bars of the same width.

3. Do not make inferences about the population from one simple statistic like the mean, especially when you have a small sample size.

Numerical Summary Measures

Mean: average

Sample (arithmetic) mean: the average of the sample.

This is a capital sigma: ∑

It means to take the sum.

Symbolically we write the formula for the sample mean as:

xbar = ∑x / n

(read x-bar)

The population mean is denoted by µ, which we will talk more about later.

Median: the middle

Sample median: the middle of the sample

Symbolically we will write the sample median as x~. (read x-squiggle)

To find the sample median:

1. Sort the n observations (in ascending order).

2. If n is odd, let k = (n + 1) / 2. Then x~ = the kth observation.

3. If n is even, let k = n / 2 and j = (n + 2) / 2. Then x~ is the average of the kth and the jth observations.

Outlier: an observation that falls outside the overall pattern of the data.

Ex.

The following is a sample of 10 scores from a test given last semester.

75 84 86 68 93 97 32 90 80 70

Find the mean. Find the median. Make a dot plot.

Are there any outliers? If so identify them.

∑x = 775 and n = 10, so the mean = 775 / 10 = 77.5

Mean = 77.5

Sort the data.

32 68 70 75 80 84 86 90 93 97

n = 10, 10 / 2 = 5 and 12 / 2 = 6, so the median is the average of the 5th and 6th observations = (80 + 84) / 2 = 82.

Median = 82
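The hand calculation can be double-checked with a short Python sketch that follows the three median steps literally. The function names are made up for this illustration; Python lists count from 0, so the kth observation is at position k - 1.

```python
# Test scores from the example above.
scores = [75, 84, 86, 68, 93, 97, 32, 90, 80, 70]

def sample_mean(data):
    """xbar = (sum of x) / n."""
    return sum(data) / len(data)

def sample_median(data):
    ordered = sorted(data)            # step 1: sort ascending
    n = len(ordered)
    if n % 2 == 1:                    # step 2: n odd, k = (n + 1) / 2
        k = (n + 1) // 2
        return ordered[k - 1]         # kth observation (0-indexed list)
    k, j = n // 2, (n + 2) // 2       # step 3: n even
    return (ordered[k - 1] + ordered[j - 1]) / 2

print(sample_mean(scores), sample_median(scores))
```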

Dot plot:

[Figure: dot plot of the 10 scores above a number line; the dot at 32 sits far to the left of the others.]

32 seems to be an outlier.

How can outliers affect the mean and the median?

Assume that the person who got the 32 drops the class, because the student got really sick. The remaining (sorted) data looks like:

68 70 75 80 84 86 90 93 97

So now n = 9 and ∑x = 743, so

Mean = 743 / 9 ≈ 82.6

Median = 5th observation = 84.

The mean increased about 5 points, but the median only increased 2 points.

The mean is a weighted measure, whereas the median is a resistant measure.

A measure is resistant if extreme observations have little, if any, effect on it.
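The comparison above can be reproduced with Python's standard statistics module. This is only a sketch of the before-and-after calculation; the variable names are made up for the illustration.

```python
from statistics import mean, median

# Test scores from the example, with and without the outlier 32.
scores = [75, 84, 86, 68, 93, 97, 32, 90, 80, 70]
without_outlier = [x for x in scores if x != 32]

# How far each measure of center moves when the outlier is dropped.
mean_shift = mean(without_outlier) - mean(scores)
median_shift = median(without_outlier) - median(scores)

print(f"mean moves by {mean_shift:.2f}, median moves by {median_shift}")
```

Dropping one extreme score moves the mean by about 5 points but the median by only 2, which is what it means for the median to be the more resistant measure.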

To calculate the mean and the median, as well as some other important statistics, on the TI83/TI84:

1. Enter the data into a list.

Hit [STAT]

Choose 1. Edit

In L1, enter the data. Ex. 75 [ENTER] 84 [ENTER] … 70 [ENTER]

2. Hit [STAT]. Hit the right arrow to highlight CALC. Choose 1:1-Var Stats and hit [ENTER].

The screen should read: 1-Var Stats. Then enter L1 by hitting [2nd] [1], so that the screen reads: 1-Var Stats L1.

Hit [ENTER]

The output should look like:

1-Var Stats

x̄ = 77.5

∑x = 775

∑x² = 63183

Sx = 18.62047857

σx = 17.66493702

n = 10

(to see more hit the down arrow)

minX = 32

Q1 = 70

Med = 82

Q3 = 90

maxX = 97

If mean = median, then the data are symmetric.

If mean > median, then the data are skewed right.

If mean < median, then the data are skewed left.

The above example is symmetric since the mean and the median are so close.

The mode of a data set of n observations is the value that occurs most often. If each value occurs the same number of times then the data set has no mode. If there is a tie between 2 values then we say the data set is bimodal. If there is a tie between 3 values then the data set is said to be trimodal.
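The definition of the mode can be sketched in Python as below. The function name is made up, the second and third data sets are hypothetical, and one choice is an assumption of this sketch: a data set with only one distinct value is treated as having that value as its mode.

```python
from collections import Counter

def modes(data):
    """Return the values tied for the highest count, or [] when every value
    occurs equally often (the 'no mode' case). Assumes data is non-empty."""
    counts = Counter(data)
    top = max(counts.values())
    if len(counts) > 1 and all(c == top for c in counts.values()):
        return []                      # every value occurs the same number of times
    return sorted(v for v, c in counts.items() if c == top)

print(modes([75, 84, 86, 68, 93, 97, 32, 90, 80, 70]))  # test scores: no mode
print(modes([1, 2, 2, 3]))      # hypothetical data: one mode
print(modes([1, 2, 2, 3, 3]))   # hypothetical data: tie between 2 values (bimodal)
```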

The above data set does not have a mode. There are 6 values that occur twice and 3 values that occur once. This is why the mode is rarely used.

The sample proportion of successes is denoted 𝑝̂ and read p-hat. It is also called the relative frequency of successes. Note that “success” carries no connotation of good or bad; it is simply the outcome being counted.

𝑝̂ = (# of successes) / (sample size)

Ex. In a random sample of 50 students, 37 had failed a class. If we are counting number of students who failed a class, then 𝑝̂ = 37/50 = .74

Measures of Variability

First we measured the center of the data, the mean and the median.

We also looked at the shape of the data, unimodal or bimodal, symmetric or skewed.

Now we look at how spread out the data is.

The first measure is simple but does not tell us much about the spread.

The range is the difference between the largest and smallest observations.

Range = max - min

A better measure would summarize the deviations from the center of the data.

A deviation of an observation x from the mean xbar is (x - xbar), the difference.

A deviation is positive if x is bigger than xbar.

A deviation is negative if x is smaller than xbar.

Unfortunately if we sum all the deviations of any data set, we get 0, because of how xbar is defined.

So before we sum up the deviations, we square them, which makes them all nonnegative.

The average of these squared deviations (dividing by n - 1 rather than n) is called the variance and is denoted by s^2.

The square root of s^2 is s, which is called the standard deviation.

The bigger the standard deviation, the more spread out the data is.

s = √( ∑(x - xbar)^2 / (n - 1) )

Ex.

A random sample of 10 grades is given below. Calculate the mean, and standard deviation of the sample.

Grades (x)   (x - xbar)   (x - xbar)^2
95           17.8         316.84
87           9.8          96.04
45           -32.2        1036.84
76           -1.2         1.44
76           -1.2         1.44
82           4.8          23.04
68           -9.2         84.64
63           -14.2        201.64
92           14.8         219.04
88           10.8         116.64
Sum          0.000        2097.600

xbar = 77.2

s^2 = 2097.600 / 9 = 233.067

s = 15.267

We will not use the formula much because your calculator will do it for you.

Remember under 1-VAR_STATS there was Sx, which is the standard deviation. Technically this is the sample standard deviation which is what we want.
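For anyone who wants to see the formula at work outside the calculator, here is a minimal Python sketch that repeats the table calculation step by step and then compares it with the standard library's sample standard deviation. The variable names are made up for the illustration.

```python
from math import sqrt
from statistics import stdev

# The 10 grades from the example above.
grades = [95, 87, 45, 76, 76, 82, 68, 63, 92, 88]

n = len(grades)
xbar = sum(grades) / n                          # sample mean
deviations = [x - xbar for x in grades]         # these sum to (about) 0
squared_devs = [d ** 2 for d in deviations]
variance = sum(squared_devs) / (n - 1)          # s^2, dividing by n - 1
s = sqrt(variance)                              # sample standard deviation

print(f"xbar = {xbar:.1f}, s^2 = {variance:.3f}, s = {s:.3f}")
print(f"statistics.stdev gives {stdev(grades):.3f}")
```

statistics.stdev also divides by n - 1, so it plays the same role as Sx on the calculator.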

Interpreting the standard deviation.

The more spread out the data are the greater s is.

In general, interpretation can be difficult mathematically, so we will deal with a special case which is easier.

Also, s = 0 means that there is no deviation, which only happens when all the observations are the same.

For example, if your data set was 20, 20, 20, 20, 20, 20, then s = 0.

The special case: when the data is bell or mound shaped meaning that the data is unimodal and symmetric around the mean (median or mode), we can use The Empirical Rule:

Approximately 68% of the observations fall within 1 standard deviation of the mean.

Approximately 95% of the observations fall within 2 standard deviations of the mean.

Approximately 100% (more precisely, about 99.7%) of the observations fall within 3 standard deviations of the mean.

3 intervals

(xbar - s, xbar + s)

(xbar - 2s, xbar + 2s)

(xbar - 3s, xbar + 3s)

Ex.

A random sample of the weights (in ounces) of full term babies born at a large local hospital is collected. The population is assumed to be bell shaped. The sample mean is 124 oz with a standard deviation of 17 oz. Use the empirical rule to find the intervals where about:

68% of full term baby weights should fall?

95% of full term baby weights should fall?

100% of full term baby weights should fall?

Xbar = 124, s = 17

68 % (124 – 17, 124 + 17) = (107, 141)

95 % (124 – 34, 124 + 34) = (90, 158)

100 % (124 – 51, 124 + 51) = (73, 175)

Using the above data do you think a baby that was 170 oz was unusual?

Yes. 170 is outside the 95% interval.

Using the above data do you think a baby that was 100 oz was unusual?

No. 100 is inside the 95% interval.

Using the above data do you think a baby that was 70 oz was unusual?

Yes. 70 is outside the 95% interval and the 100% interval.
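The baby-weight answers can be checked with a short Python sketch. The function names are made up for this illustration, and "unusual" is taken to mean outside the 2-standard-deviation (95%) interval, which is the criterion used in the answers above.

```python
def empirical_intervals(xbar, s):
    """Intervals xbar - k*s to xbar + k*s for k = 1, 2, 3
    (about 68%, 95%, and nearly 100% of bell-shaped data)."""
    return {k: (xbar - k * s, xbar + k * s) for k in (1, 2, 3)}

def is_unusual(x, xbar, s):
    """Flag x as unusual when it falls outside the 2-standard-deviation interval."""
    low, high = xbar - 2 * s, xbar + 2 * s
    return not (low <= x <= high)

intervals = empirical_intervals(124, 17)
print(intervals)
for weight in (170, 100, 70):
    print(weight, "unusual" if is_unusual(weight, 124, 17) else "not unusual")
```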

The pth percentile is a value such that p percent of the observations fall below or at that value.

You have probably seen percentiles on standardized tests. We will not cover percentiles in general, but we will use some special percentiles.

The median is a percentile, the 50th.

Three useful percentiles that we will use are the quartiles.

The first quartile is called Q1 or QL and is the 25th percentile. It is also the median of the lower half of the data.

The median is the second quartile, called Q2.

The third quartile is called Q3 or QU and is the 75th percentile. It is also the median of the upper half of the data.

To calculate the quartiles, find the median. Divide the data into a lower half and upper half. Do not include the median in either half. To find Q1, find the median of the lower half. To find Q3, find the median of the upper half.
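The median-of-halves procedure can be sketched in Python as follows, using the 10 test scores from the earlier 1-Var Stats example. The function names are made up, and the sketch assumes at least 2 observations. Other software may use a different quartile convention and give slightly different values.

```python
def median_of(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(data):
    """Return (Q1, Q2, Q3): the median of the lower half, the median,
    and the median of the upper half. The median itself is left out of
    both halves when n is odd."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median_of(lower), median_of(ordered), median_of(upper)

scores = [75, 84, 86, 68, 93, 97, 32, 90, 80, 70]
q1, q2, q3 = quartiles(scores)
print(q1, q2, q3, "IQR =", q3 - q1)
```

For these scores the sketch agrees with the Q1, Med, and Q3 values in the calculator output shown earlier.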

The calculator calculates all 3 of them for you.

The interquartile range is IQR = Q3 - Q1.

Some people look at Q0 as the minimum observation and Q4 as the maximum observation.

These 5 numbers together are called the 5-number-summary of the data.

These numbers can be used to detect outliers and to create a visual display of the data called a box-whisker plot.

The box-plots that your text described are useless!

Modified Box Plots and the 5-Number Summary.

Constructing a box-plot

1. Calculate Q1, Q2, Q3 and IQR = Q3 – Q1

2. Compute the Inner Fences (IF) and the Outer Fences (OF):

LIF = Q1 – 1.5 * IQR

UIF = Q3 + 1.5 * IQR

LOF = Q1 – 3 * IQR

UOF = Q3 + 3 * IQR

3. Draw a horizontal axis. Draw vertical lines at Q1, Q2, Q3.

4. A whisker (horizontal line) is drawn from Q1 to the smallest observation that is bigger than LIF.

A whisker (horizontal line) is drawn from Q3 to the largest observation that is smaller than UIF.

Any observation that is between the inner and outer fences is a mild outlier and is labeled with a solid circle.

Any observation that is outside the outer fences is an extreme outlier and is labeled with an open circle.

Ex.

Grades (x)   Sorted x
95           95
87           92
35           88
76           87
76           82
82           76
68           76
63           68
92           63
88           35

Median = (76 + 82) / 2 = 79

Q1 = 68    Q3 = 88

IQR = 88 – 68 = 20

1.5 * IQR = 30    3 * IQR = 60

LIF = 68 – 30 = 38    UIF = 88 + 30 = 118

LOF = 68 – 60 = 8    UOF = 88 + 60 = 148

Min = 35    Max = 95

Since 8 < 35 < 38, 35 is a mild outlier.

Since 95 < 118, 95 is NOT an outlier.

As one can see from the box plot, the data set looks pretty symmetric except for the one outlier (35), which skews the distribution left.

Note that mean = 76.2 < 79 = median
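The fence calculations for this example can be sketched in Python. The function name and the returned dictionary keys are made up for the illustration. The notes do not say how to treat an observation that lands exactly on a fence, so the handling of those boundary cases below is an assumption of this sketch.

```python
def classify_outliers(data, q1, q3):
    """Compute inner/outer fences from Q1 and Q3 and sort observations
    into mild and extreme outliers."""
    iqr = q3 - q1
    lif, uif = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # inner fences
    lof, uof = q1 - 3 * iqr, q3 + 3 * iqr       # outer fences
    # Assumption: a point exactly on an inner fence is not an outlier,
    # and a point exactly on an outer fence counts as mild.
    mild = [x for x in data if (lof <= x < lif) or (uif < x <= uof)]
    extreme = [x for x in data if x < lof or x > uof]
    return {"LIF": lif, "UIF": uif, "LOF": lof, "UOF": uof,
            "mild": mild, "extreme": extreme}

grades = [95, 87, 35, 76, 76, 82, 68, 63, 92, 88]
result = classify_outliers(grades, q1=68, q3=88)
print(result)
```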

Ex.

An airline company is wondering about the number of cancellations it receives for a specific commuter flight. The airline takes a random sample of 15 days. The data are listed below. Find the mean and the median for the sample. Make a dot plot of the data. Are there any outliers?

Describe the symmetry of the data.

4, 24, 17, 17, 9, 12, 9, 12, 13, 14, 14, 15, 15, 16, 16.

Answers:

Mean = 13.8

Median = 14

Q1 = 12    Q2 = 14    Q3 = 16

IQR = 16 – 12 = 4

1.5 * IQR = 1.5 * 4 = 6    3 * IQR = 12

LIF = 12 – 6 = 6    UIF = 16 + 6 = 22

LOF = 12 – 12 = 0    UOF = 16 + 12 = 28

4 is a mild outlier because it lies between the LOF and the LIF. Likewise, 24 is a mild outlier because it lies between the UIF and the UOF.

In class:

2.13.1, 2.13.6, 2.13.9, 2.13.10, 2.13.14
