Graphical Representations of Data

Chapter 2 – Descriptive Analysis and Presentation of Single-Variable Data

Definition : When data are in their original form, as collected, they are called raw data.

We want to be able to visualize the characteristics of a data set; hence we construct graphical representations of the data. In order to do so, we must look at the frequency of occurrence of data values.

Definition : A categorical frequency distribution, used for nominal- or ordinal-level data, is a table listing the categories, together with the frequency of occurrence of each category in the observed data.

Example : The following table shows data on class rank of students receiving financial aid at a small 4-year college.

College Class Rank    Frequency
Fr                    18
So                    12
Jr                    6
Se                    4

Often, when the data are numeric, there are too many different data values for a listing of the raw data to be of use in seeing the characteristics of the data. It is common to divide the interval of values of the data into a relatively small number of subintervals, called classes, and to tabulate the data using the frequencies. Each frequency is the number of occurrences of data values in one of the classes.

Definition : A grouped frequency distribution is the organizing of raw data in table form, using classes and frequencies.

Definition : The largest data value that can be included in a class is the upper class limit for that class; the smallest data value that can be included is the lower class limit.

Definition : The number halfway between the upper class limit of one class and the lower class limit of the next-higher class is called the class boundary. When the class limits are whole numbers, the upper class boundary value may be found by adding 0.5 to the upper class limit; the lower class boundary value may be found by subtracting 0.5 from the lower class limit.

Definition : The class width is found by subtracting the upper class limit of one class from the upper class limit of the next-higher class.

Definition : The midpoint of a class is found by taking the average of the class's two boundary values.

Definition : The cumulative frequency for a class is the count of all observed data values in that class or in lower classes.

Rules for constructing a frequency distribution:

1) The number of classes should be between 5 and 20; 5 for small data sets, 20 for large data sets.

2) The class width should preferably be an odd number, so that each class midpoint is a whole number when the class limits are whole numbers.

3) An observed data value must be in one, and only one, class. This means that the classes must be non-overlapping, or mutually exclusive.

4) The classes must be continuous; even if there are no observed data values in a given class, that class must be included, with a frequency value of 0.

5) The classes must be exhaustive; i.e., together they must include all of the data.

6) The classes must be equal in width.

Procedure for constructing a grouped frequency distribution:

1) Find the range by subtracting the lowest value of the data from the highest.

2) Select the number of classes desired (between 5 and 20).

3) Find the class width by dividing the range by the number of classes; round the result up to get the class width.

4) Select a starting point (either the lowest data value or a number just below the lowest data value).

5) Add the class width to the starting point repeatedly to get the lower class limits of the successive classes.

6) Find the upper class limits.

7) Find the class boundaries.

8) Tally the data, counting the number of occurrences within each class.

9) Construct a table, with one column describing the classes, a second column containing the frequencies, and a third column containing the cumulative frequencies.

Example : We have 25 scores on a final exam, as follows:

86, 83, 56, 98, 82, 52, 71, 88, 75, 91, 69, 88, 64, 78, 81, 74, 77, 83, 90, 85, 64, 79, 71, 83, 64

We want a frequency distribution. Since the data set is small, we choose 5 as the number of classes.

The range of the data is R = Largest value – Smallest value = 98 – 52 = 46. To get the class width, we divide the range by 5, obtaining 9.2. We round this number up to obtain the class width, 10. Our starting point for the first class will be just below the smallest data value, at 50. Continuing the steps listed, we obtain for our grouped frequency distribution the table below.

Class Limits    Frequency    Cumulative Frequency
50 – 59         2            2
60 – 69         4            6
70 – 79         7            13
80 – 89         9            22
90 – 99         3            25
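The procedure above can also be checked away from the calculator. The following is a minimal Python sketch, not part of the textbook's method; the variable names are our own, and the starting point of 50 is simply the one chosen in the example.

```python
import math

# The 25 final-exam scores from the example above.
scores = [86, 83, 56, 98, 82, 52, 71, 88, 75, 91, 69, 88, 64,
          78, 81, 74, 77, 83, 90, 85, 64, 79, 71, 83, 64]

num_classes = 5                                     # small data set, so 5 classes
data_range = max(scores) - min(scores)              # 98 - 52 = 46
class_width = math.ceil(data_range / num_classes)   # 9.2 rounded up to 10
start = 50                                          # just below the smallest score (our choice, as in the example)

# Lower limits are found by repeatedly adding the class width;
# upper limits assume whole-number data.
lower_limits = [start + i * class_width for i in range(num_classes)]
upper_limits = [low + class_width - 1 for low in lower_limits]

# Tally the data: each score falls in exactly one class.
frequencies = [sum(low <= x <= high for x in scores)
               for low, high in zip(lower_limits, upper_limits)]

# Cumulative frequency: running total of the class frequencies.
cumulative = []
running_total = 0
for f in frequencies:
    running_total += f
    cumulative.append(running_total)

print("Class Limits  Frequency  Cumulative Frequency")
for low, high, f, cf in zip(lower_limits, upper_limits, frequencies, cumulative):
    print(f"{low} - {high}".ljust(14) + str(f).ljust(11) + str(cf))
```

The last cumulative frequency should equal the size of the data set, which is a quick check that the classes are exhaustive and non-overlapping.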

Graphical Representations of Data

First, we will do several types of graphs that display numeric data. One of the most common ways to graph numeric data is through use of a histogram. We will then use the result from constructing the histogram to construct a frequency table.

Definition : A histogram is a graph that displays the data by using vertical bars of various heights to represent the frequencies.

Characteristics of a histogram:

1) The classes are listed in order along the horizontal axis of the chart.

2) The vertical axis provides a scale for the frequencies.

3) A rectangle, or bar, is constructed for each class so that

a) the height of the bar is the frequency of the class

b) the bar for the class extends from the lower boundary of the class to the upper boundary

4) Each axis of the histogram has a label, and the histogram has a title.

Example : Now let us create a histogram for a data set, and in so doing, generate a grouped frequency distribution.

We first enter a data set into the TI-83 graphing calculator, using Illustration 2-5 on p. 39 of the textbook.

The stat list editor is a table where you can store, edit, and view up to 20 lists that are in memory.

Also, you can create list names from the stat list editor.

1) To display the stat list editor, press STAT , and then select 1:Edit from the STAT EDIT menu.

2) Use the up arrow key to move the cursor to the top row of the table. Press 2ND , and then INS . You will see the Name = prompt at the bottom of the screen. Type the name of your variable using the alphabetic keys (green symbols on your calculator).

3) Use the down arrow to move to the list. Type in the first data value and press ENTER . The cursor will automatically move down to the next space for the next entry. If you make a mistake, use the arrow keys to return to the location of the mistake and make a correction.

4) If you want to erase a list, move the cursor to the list name, and press DEL .

Steps in constructing a histogram using the TI – 83 graphing calculator:

First, you need to clear previous graphs.

1) Press Y= . You will see a list of functions. If any of them have already been defined, use the arrow keys and the CLEAR key to erase them.

2) Next press 2ND , and STAT PLOT . You will see a list of plots. All of them should be off. If any are not, go down to 4:PlotsOff and press ENTER .

3) Clear all drawn figures. Press 2ND and DRAW . Choose 1:ClrDraw , and press ENTER .

4) Set the size of your graph window. Press WINDOW . The Xmin value should be equal to your smallest data value; in this case, we choose Xmin = 98. The Xmax value should be equal to or slightly larger than your largest data value; in this example, we choose Xmax = 220. The Xscl value is your class width. For this example, we choose 7 classes, and so Xscl = 17. The Ymin value should be 0; the Ymax value should be somewhat larger than your expected largest class frequency. Since there are 50 items of data, we choose Ymax = 20.

5) Press 2ND , STAT PLOT , 1:Plot1 , and ENTER . Turn Plot 1 On.

6) Choose the histogram symbol (the third symbol on the third line of the screen).

7) Go down to Xlist: and enter the name of your variable.

8) Press the GRAPH key. You will see the histogram displayed.

To generate the frequency distribution from the histogram:

1) Press the TRACE key.

2) Use the right arrow key to move from one bar of the histogram to the next, reading the class boundaries and the frequencies from the calculator screen. The result for this example is given below.

Weights of Sample
of College Students    Frequency    Cumulative Frequency    Relative Frequency
98 – 114               6            6                       0.12
115 – 131              10           16                      0.20
132 – 148              7            23                      0.14
149 – 165              10           33                      0.20
166 – 182              7            40                      0.14
183 – 199              8            48                      0.16
200 – 216              2            50                      0.04

Note also that the table includes a column for the relative frequencies, which are the proportions of the data set falling into each class.

Defn : The relative frequency associated with a class is the proportion of the data set falling into that class. It is found by dividing the class frequency by the size of the data set.

Defn : The cumulative relative frequency associated with a class is the proportion of the data set falling into that class or lower classes. It is found by dividing the cumulative frequency for a class by the size of the data set.

Illustration 2.6, p. 46

Interpretation of Relative Frequency and Cumulative Relative Frequency: If we randomly select an observation from the data set, the relative frequency for a class is the probability that our selected observation will be found in that class. The cumulative relative frequency for a class is the probability that the observation will be found either in that class or in a lower class.
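As an illustration of these two definitions, the short Python sketch below recomputes the relative and cumulative relative frequencies from the class frequencies in the weight table. The variable names are ours; only the seven frequencies come from the table.

```python
from itertools import accumulate

# Class frequencies from the table of weights of college students.
frequencies = [6, 10, 7, 10, 7, 8, 2]
n = sum(frequencies)                      # size of the data set

# Relative frequency: class frequency divided by the size of the data set.
relative = [f / n for f in frequencies]

# Cumulative frequency, and cumulative relative frequency.
cumulative = list(accumulate(frequencies))
cumulative_relative = [cf / n for cf in cumulative]

for f, cf, rf, crf in zip(frequencies, cumulative, relative, cumulative_relative):
    print(f"{f:>3} {cf:>3} {rf:.2f} {crf:.2f}")
```

The relative frequencies should sum to 1, and the last cumulative relative frequency should equal 1.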

Other Types of Graphs

Defn : A bar graph is used to represent the frequency distribution for a categorical variable, with the frequencies displayed by the heights of the vertical bars. When the bars are arranged in order of decreasing frequency, the graph is called a Pareto chart.

Creating a bar graph using the TI-83 calculator: (Illustration 2.1, p. 31)

1) Assign an odd number to each category.

2) Go to STAT , 1:Edit . In column L1, enter the numbers for the categories. In column L2, enter the frequencies of the categories.

3) Clear all other graphs, plots and drawings, as described previously.

4) Go to WINDOW . Set Xmin at 1, and Xmax at the largest category number. Set Xscl at 1. Set Ymin at 0 and Ymax at the largest frequency.

5) Go to 2ND , STAT PLOT , and turn Plot 1 on. Choose the third Type (histogram, which is also the bar chart). For the Xlist variable, choose L1, the category labels. For FREQ , choose L2, the category frequencies.

6) GRAPH .

Another type of graph used with categorical data is the pie graph.

Defn : A pie graph is a circle that is divided into sections or wedges according to the proportion of the data set in each category.

Note: The TI-83 will not do this type of graph. It must be done by hand.

Illustration 2.1, p. 31.

Note: In any situation in which data are represented using graphical techniques, it is easy to construct the graph in such a way as to mislead the viewer. It is necessary to carefully examine the graph in order to interpret it properly. On pages 94 - 95 of the textbook, there are examples of graphs constructed to be misleading.

After we have become somewhat familiar with the data through representing it graphically and observing the characteristics of the distribution, we want to describe the characteristics with numerical values called descriptive statistics.

Defn : A parameter is a numerical characteristic of a population.

Defn : A statistic is a numerical characteristic of a sample. (Remember, a sample is a subset of a population.)

We want to use the value of a statistic found from the sample data to gain knowledge about the value of the corresponding parameter, which we would be able to get directly if we had access to the entire population.

Measures of Central Tendency give us information about the location of the center (in some sense) of the distribution of (numeric) data values. We will discuss four measures of central tendency: mean, median, mode, and the midrange.

Defn : If we have a set of n sample data values, $x_1, x_2, \ldots, x_n$, the mean of these data values is their arithmetic average:

$\bar{x} = \frac{1}{n}(x_1 + x_2 + \cdots + x_n) = \frac{1}{n}\sum_{i=1}^{n} x_i$.

If we have a set of N population data values, the mean of these values is:

$\mu = \frac{1}{N}(x_1 + x_2 + \cdots + x_N) = \frac{1}{N}\sum_{i=1}^{N} x_i$.

Note: $\bar{x}$ is a statistic; $\mu$ is a parameter.

Example: p. 62, Exercise 2.48

1) Go to STAT , 1:Edit.

2) Enter the data, with a suitable variable name, such as BP.

3) Choose STAT , CALC , 1:1-Var Stats .

4) Enter the variable name, and press ENTER .

5) You will see a list of numerical values for the data, including $\bar{x} = 6.9333$ and $\sum x_i = 104$.

The average, or mean, sleep time for these 15 college students was found to be 6.9333 hours.

Properties of the Mean:

1) One computes the mean by using all of the values of the data.

2) The mean varies less than the other two measures of central tendency when samples are taken from the same population and all three measures are computed for these samples.

3) The mean is used in computing other statistics, such as the variance.

4) The mean for the data set is unique, and not necessarily one of the data values.

5) The mean is affected by extremely high or low values and may not be the appropriate measure to use in these situations.

Example: Suppose that we had made a mistake in entering the data for the first student, entering 50, rather than 5. The computed value of the mean would be $\bar{x} = 9.9333$, much higher than the value computed from the correct data.
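The effect of the single entry error can be verified from the sum alone, without the raw data. The sketch below is only an arithmetic check under the figures quoted in these notes (15 students, a sum of 104 hours, and a first value of 5 mistyped as 50); the variable names are ours.

```python
n = 15              # number of students in the sample
total = 104         # sum of the 15 sleep times, as reported by 1-Var Stats

mean_correct = total / n

# Replace the correct first value (5) with the mistyped value (50).
total_with_error = total - 5 + 50
mean_with_error = total_with_error / n

print(round(mean_correct, 4))     # 6.9333
print(round(mean_with_error, 4))  # 9.9333
```

One wrong value out of 15 shifts the mean by 45/15 = 3 hours, which is why the mean is described as sensitive to extreme values.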

Sometimes, the correct raw data has extreme values. In these situations, the mean may not be the best measure of central tendency to use. In such cases, we might prefer to use the median.

Defn : The median is the midpoint of the data set; it is a value $\tilde{x}$ such that half of the data values lie below $\tilde{x}$ and half lie above $\tilde{x}$.

To find the median without the calculator:

1) First list the data in increasing order. In most cases, this will mean rearranging the data so that the smallest value is listed first, the second smallest value is listed second, etc.

2) If the size, n, of the data set is odd, choose the middle value of the list as the median value; if the size, n, of the data set is even, average the two middle values of the data to get the median value.
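The two steps above translate directly into code. The following Python sketch is our own illustration (the function name `median_by_hand` is hypothetical); it is applied to the 25 final-exam scores given earlier in these notes and compared against the standard library's `statistics.median`.

```python
import statistics

def median_by_hand(values):
    """Median found by sorting and taking the middle value(s)."""
    if not values:
        raise ValueError("the data set must not be empty")
    ordered = sorted(values)              # step 1: list the data in increasing order
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:                        # step 2, n odd: the middle value
        return ordered[middle]
    # step 2, n even: average the two middle values
    return (ordered[middle - 1] + ordered[middle]) / 2

# The 25 final-exam scores from the grouped frequency distribution example.
scores = [86, 83, 56, 98, 82, 52, 71, 88, 75, 91, 69, 88, 64,
          78, 81, 74, 77, 83, 90, 85, 64, 79, 71, 83, 64]

print(median_by_hand(scores))             # 13th of the 25 ordered scores
print(statistics.median(scores))          # library value, for comparison
```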

Example: p. 62, Exercise 2.48

1) Go to STAT , 1:Edit.

2) Enter the data, with a suitable variable name, such as BP.

3) Choose STAT , CALC , 1:1-Var Stats .

4) Enter the variable name, and press ENTER .

5) You will see a list of numerical values for the data, including Med = 7.

The median value of the sleep time for this group of students is 7 hours.

Properties of the Median:

1) The median is used when one must find the center or middle value of a data set.

2) The median is used when one must determine whether the data values fall into the upper half or the lower half of the distribution.

3) The median is used to find the average of an open-ended distribution.

4) The median is affected less than the mean by extremely high or extremely low values.

Example: Let’s return to Exercise 2.48, and assume that the first data value was incorrectly entered as 50, rather than 5. The median value is still 7 hours, since the single extremely large value has little effect on the calculation of the median.

Sometimes, the median is a more appropriate measure of central tendency than the mean for a data set.

Example: The U.S. Department of Labor's Bureau of Labor Statistics gives information about the distribution of personal incomes in the U.S. This distribution, of course, has extreme values. Hence the Bureau uses the median income, rather than the mean, as the appropriate measure of central tendency.

In some situations, the most appropriate measure of central tendency is the mode of the distribution.

Defn : The data value that occurs most often in a data set is called the mode.

Note: Some data sets do not have a mode. For example, the data set consisting of the values 1, 1, 2, 2, 3, 3, 4, 4 does not have a single most frequently occurring value, and hence does not have a mode. For this data, the mean or median would be the most appropriate measure of central tendency.

Example: p. 62, Exercise 2.47.

The calculator will provide only a little help here, in sorting the data.

1) Rearrange the data so that the values are listed in increasing order.

2) Find the value that occurs most frequently, if such a value exists.

For this data set, the mode is 7.

Properties of the Mode:

1) The mode is used when the most typical case is desired.

2) The mode is the easiest average to compute.

3) The mode can be used when the data are nominal, such as religious preference, gender, or political affiliation.

4) The mode is not always unique. A data set can have more than one mode, in which case we say that it does not have a mode.
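A small Python sketch of the mode, following the convention used in these notes that a data set with a tie for most frequent value is said to have no mode. The function name `mode_or_none` and the class-rank sample are hypothetical, chosen only for illustration.

```python
from collections import Counter

def mode_or_none(values):
    """Return the single most frequent value, or None if there is a tie."""
    counts = Counter(values)
    highest = max(counts.values())
    winners = [value for value, count in counts.items() if count == highest]
    return winners[0] if len(winners) == 1 else None

# The data set from the note above: every value occurs twice, so no mode.
print(mode_or_none([1, 1, 2, 2, 3, 3, 4, 4]))

# The mode also works for nominal data (hypothetical class ranks).
print(mode_or_none(["Jr", "Fr", "Fr", "So"]))
```

Because the function only counts occurrences, it never needs to do arithmetic on the values, which is why the mode can be used with nominal data.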

Defn : The midrange is defined as the sum of the lowest and highest data values, divided by 2. Thus it is the arithmetic average of the extreme values of the data.

Example: p. 62, Exercise 2.48.

1) Find the largest and smallest values of the data; in this case 4 and 11.

2) Average them. The midrange for this example is thus MR = 7.5.

Properties of the Midrange:

1) The midrange is easy to compute.

2) The midrange gives the midpoint of the data set.

3) The midrange is affected by extremely high or extremely low values in a data set.

Distribution Shapes: (See p. 53)

1) In a positively skewed distribution, the majority of the data values fall to the left of the mean and cluster at the lower end of the distribution; the tail of the distribution is to the right. In this situation, the following relationship holds among the measures of central tendency: $\text{Mode} < \tilde{x} < \bar{x}$.

2) In a negatively skewed distribution, the majority of the data values fall to the right of the mean and cluster at the upper end of the distribution; the tail of the distribution is to the left. In this situation, the following relationship holds among the measures of central tendency: $\bar{x} < \tilde{x} < \text{Mode}$.

3) In a symmetrical distribution, the data values are evenly distributed on both sides of the mean. A bell-shaped curve is an important example of a type of symmetrical distribution. In this situation, the following relationship holds among the measures of central tendency: $\bar{x} = \tilde{x} = \text{Mode}$.

Measures of Variability: In addition to locating the center (in some sense) of a data distribution, we also want to know how spread out the data values are. We will talk about three measures of variability: Range, Variance, and Standard Deviation.

Defn : The range of a data set is the difference between the largest and smallest data values: Range = Xmax – Xmin.

Example: p. 62, Exercise 2.48 (student sleep times)

Range = 11 – 4 = 7

The range is not the most useful measure of the variability of the data, however, since it ignores much of the information about variability. The following two data distributions have the same range, but we would not say that they have the same variability:

Data set 1: 10, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 90

Data set 2: 10, 10, 10, 10, 50, 50, 50, 50, 90, 90, 90, 90

If we construct histograms for each of these data sets, we see that the first set of data values is more concentrated at the center.

We need another measure of variability that will allow us to distinguish between these two situations.

This measure of variability should include information about the location of each item in the data set relative to the center of the data distribution.

Defn : For an observation $x_i$, define the corresponding deviation from the mean to be $e_i = x_i - \bar{x}$.

Can we use the sum of all of the deviation scores for the data as our measure of variability? No. Why not?

For any data set $x_1, x_2, \ldots, x_n$, we have $\sum_{i=1}^{n} e_i = 0$. Why is this so?

Defn : For a population of N data values, $x_1, x_2, \ldots, x_N$, having population mean

$\mu = \frac{1}{N}(x_1 + x_2 + \cdots + x_N) = \frac{1}{N}\sum_{i=1}^{N} x_i$,

the variance of the population data set is

$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$.

The standard deviation of the population data set is the square root of the variance.

Defn : For a sample of n data values, $x_1, x_2, \ldots, x_n$, having sample mean

$\bar{x} = \frac{1}{n}(x_1 + x_2 + \cdots + x_n) = \frac{1}{n}\sum_{i=1}^{n} x_i$,

the variance of the sample data set is

$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$.

The standard deviation of the sample data set is the square root of the variance.
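To see these definitions at work, the Python sketch below returns to Data set 1 and Data set 2 from the discussion of the range. It is our own illustration, with hypothetical function names; it treats both sets as samples, so it divides by n – 1.

```python
import math
import statistics

# The two data sets with equal range from the discussion of the range.
data_set_1 = [10] + [50] * 10 + [90]
data_set_2 = [10] * 4 + [50] * 4 + [90] * 4

def deviations(values):
    """Deviation of each observation from the sample mean."""
    mean = sum(values) / len(values)
    return [x - mean for x in values]

def sample_variance(values):
    """Sum of squared deviations divided by n - 1."""
    return sum(e ** 2 for e in deviations(values)) / (len(values) - 1)

for name, data in (("Data set 1", data_set_1), ("Data set 2", data_set_2)):
    data_range = max(data) - min(data)
    s = math.sqrt(sample_variance(data))
    # The deviations always sum to zero, so their sum cannot measure spread.
    print(name, "range =", data_range,
          "sum of deviations =", sum(deviations(data)),
          "s =", round(s, 2))
```

Both sets have range 80, but the standard deviation of the second set is about twice that of the first, which matches what the histograms show.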

Why do we need to define two different additional measures of variability for a data set? (Hint: units of measurement).

Why do we divide by n – 1, rather than by n, when computing the sample variance?

Defn : An unbiased estimator of a parameter is a statistic, such that the average of the values of the statistic for repeated random samples of the same size tends toward the true value of the parameter.

When we divide by n – 1, rather than n, to compute the sample variance, we are creating an unbiased statistic for estimating the population variance.
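The claim can be checked exactly on a toy case. The sketch below is an illustration under our own assumptions, not a proof: it uses a hypothetical population of three values, and it draws every possible ordered sample of size 2 with replacement, so that each sample is equally likely. Exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from itertools import product

population = [10, 50, 90]      # hypothetical tiny population
n = 2                          # sample size

def sum_of_squared_deviations(values):
    mean = Fraction(sum(values), len(values))
    return sum((x - mean) ** 2 for x in values)

# Population variance: divide by N.
population_variance = sum_of_squared_deviations(population) / len(population)

# All 9 equally likely ordered samples of size 2, drawn with replacement.
samples = list(product(population, repeat=n))

# Average of the sample variances when dividing by n - 1 ...
average_dividing_by_n_minus_1 = sum(
    sum_of_squared_deviations(s) / (n - 1) for s in samples) / len(samples)

# ... and when dividing by n instead.
average_dividing_by_n = sum(
    sum_of_squared_deviations(s) / n for s in samples) / len(samples)

print(population_variance)              # 3200/3
print(average_dividing_by_n_minus_1)    # 3200/3, matches the population variance
print(average_dividing_by_n)            # 1600/3, too small on average
```

In this small case, dividing by n – 1 gives sample variances that average out to the population variance, while dividing by n underestimates it.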

Example: p. 35, Illustration 2.3.

From the 1-Var Stats function of the calculator, we find that the mean is $\bar{x} = 76.9474$ mi.

The standard deviation of the data set is s = 10.5065. The variance of the data set is then $s^2 = 110.3860$.

Now assume that we have committed two data entry errors, replacing 96 with 86 and replacing 52 with 62. What are the values of the variability measures now?

We find s = 8.5210 and $s^2 = 72.6082$. There is less variability in the data with these two data entry errors.

Example: Given the following data set: 5, 5, 5, 5, 5, 5, 5, 5, what is the standard deviation? (Hint: You don’t need to use the calculator to answer this question.)

Uses of Variance and Standard Deviation:

1) As previously stated, variances and standard deviations can be used to determine the spread of the data. If the variance or standard deviation is large, the data are more dispersed. This information is useful in comparing two or more data sets to determine which is more (most) variable.

2) The measures of variance and standard deviation are used to determine the consistency of a variable. For example, in the manufacture of fittings, like nuts and bolts, the variation in the diameters must be small or parts will not fit together.

3) The variance and standard deviation are used to determine the number of data values that fall within a specified interval in a distribution. For example, Chebyshev’s theorem (explained below) shows that for any distribution, at least 75% of the data values will fall within two standard deviations on either side of the mean.

4) Finally, the variance and standard deviation are used quite often in inferential statistics. These uses will be shown later in the course.

Chebyshev’s Theorem: The proportion of values from a data set that will fall within k standard deviations of the mean will be at least $1 - 1/k^2$, where k is a number greater than 1.

1) For any data set, at least 75% of the data values will be found within two standard deviations on either side of the mean value.

2) For any data set, at least 88.8889% of the data values will be found within three standard deviations on either side of the mean value.

3) For any data set, at least 93.75% of the data values will be found within four standard deviations on either side of the mean value.

Example:

A sample of the hourly wages of employees who work in restaurants in a large city has a mean of $5.02 and a standard deviation of $0.09. Using Chebyshev’s theorem, find the range in which at least 75% of the data values fall.

What is the value of k in this case? Setting $1 - 1/k^2 = 0.75$ gives k = 2.

From the given values of the sample mean and sample standard deviation, we have $\bar{x} - 2s = \$5.02 - (2)(\$0.09) = \$4.84$, and $\bar{x} + 2s = \$5.02 + (2)(\$0.09) = \$5.20$.

Therefore, at least 75% of hourly wages of the employees in the sample are between $4.84 and $5.20.
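The bound and the interval in this example can be reproduced with a few lines of Python. This is only a sketch; the function name `chebyshev_lower_bound` is ours, and `Decimal` is used so that the dollar amounts are handled exactly.

```python
from decimal import Decimal

def chebyshev_lower_bound(k):
    """Minimum proportion of data within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem requires k > 1")
    return 1 - 1 / k ** 2

# Hourly-wage example: mean $5.02, standard deviation $0.09, k = 2.
mean = Decimal("5.02")
sd = Decimal("0.09")
k = 2

low = mean - k * sd
high = mean + k * sd

print(chebyshev_lower_bound(k))   # 0.75
print(low, high)                  # 4.84 5.20
```

Calling `chebyshev_lower_bound(3)` and `chebyshev_lower_bound(4)` gives the other two bounds listed above, 8/9 and 0.9375.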

Chebyshev’s theorem gives us a general rule which applies to any data distribution. As such, it is a rather weak rule. For certain types of distributions, we can make much stronger statements about the range within which we find certain proportions of the data values.

The Empirical Rule: If a data distribution is bell-shaped (or normal), then the following statements are true:

1) Approximately 68% of the data values lie within one standard deviation on either side of the mean.

2) Approximately 95% of the data values lie within two standard deviations on either side of the mean.

3) Approximately 99.7% of the data values lie within three standard deviations on either side of the mean.

Example:

Suppose that we know that the distribution of hourly wages is approximately bell-shaped. From the Empirical Rule, we can then say that approximately 68% of the employees in the sample have hourly wages between $\bar{x} - s = \$5.02 - \$0.09 = \$4.93$ and $\bar{x} + s = \$5.02 + \$0.09 = \$5.11$.

We can also say that approximately 95% of the employees in the sample have hourly wages between $\bar{x} - 2s = \$5.02 - (2)(\$0.09) = \$4.84$ and $\bar{x} + 2s = \$5.02 + (2)(\$0.09) = \$5.20$.

Finally, we can say that approximately 99.7% of the employees in the sample have hourly wages between $\bar{x} - 3s = \$5.02 - (3)(\$0.09) = \$4.75$ and $\bar{x} + 3s = \$5.02 + (3)(\$0.09) = \$5.29$.
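The three Empirical Rule intervals for the hourly-wage example can be tabulated in one pass. The sketch below is our own illustration and assumes, as the example does, that the wage distribution is approximately bell-shaped; the percentages are those stated in the Empirical Rule.

```python
from decimal import Decimal

mean = Decimal("5.02")   # sample mean of the hourly wages
sd = Decimal("0.09")     # sample standard deviation

# Approximate share of a bell-shaped distribution within k standard deviations.
empirical_share = {1: "68%", 2: "95%", 3: "99.7%"}

# Interval (mean - k*sd, mean + k*sd) for k = 1, 2, 3.
intervals = {k: (mean - k * sd, mean + k * sd) for k in empirical_share}

for k, (low, high) in intervals.items():
    print(f"about {empirical_share[k]} of wages between ${low} and ${high}")
```

For k = 2 the interval is the same one obtained from Chebyshev's theorem, but the Empirical Rule attaches the stronger statement of about 95% rather than at least 75%.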

Measures of Position

In addition to summary statistics describing the entire data set, we are often interested in locating particular members of the sample, or particular data values, within the context of the data set as a whole.

Defn : A standard score, or z-score, for a data value is obtained by subtracting the mean from the data value and then dividing by the standard deviation. For an observation $x_i$ in a sample data set, the z-score is $z_i = \frac{x_i - \bar{x}}{s}$. For a member of a population data set, the z-score is $z_i = \frac{x_i - \mu}{\sigma}$.

With z-scores, we can compare relative locations of scores from two different data distributions.

Example: Which of the following exam grades has a better relative position?

a) A grade of 43 on a test with a mean of 40 and standard deviation of 3

b) A grade of 75 on a test with a mean of 72 and standard deviation of 5

For the first test, $x_1 = 43$, $\bar{x} = 40$, s = 3; for the second test, $x_2 = 75$, $\bar{x} = 72$, s = 5.

The z-score corresponding to the data value from the first data set is $z_1 = \frac{43 - 40}{3} = 1$.

The z-score corresponding to the data value from the second data set is $z_2 = \frac{75 - 72}{5} = 0.6$.

The first score is higher, relative to its score distribution, than the second score.
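The comparison of the two exam grades can be written as a short Python sketch. The function name `z_score` is our own; the numbers are those of the example.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    if sd <= 0:
        raise ValueError("the standard deviation must be positive")
    return (x - mean) / sd

z1 = z_score(43, mean=40, sd=3)   # first test
z2 = z_score(75, mean=72, sd=5)   # second test

print(z1, z2)                     # 1.0 0.6
print("first grade is relatively higher" if z1 > z2
      else "second grade is relatively higher")
```

Both grades are 3 points above their means; the z-scores differ only because the second test has the larger standard deviation.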

Percentiles

To locate the position of an individual score relative to its own data set, we often use percentiles.

Defn : The pth percentile of a data set is the value for which at most p% of the data values are less than that value and at most (100 – p)% of the data values are more than that value.

Finding a Data Value Corresponding to the Pth Percentile.

1) Arrange the data in order from lowest to highest.

2) Substitute into the formula $c = \frac{np}{100}$, where n is the size of the data set, and p is the particular percent.

3) a) If c is not an integer, round up to the next higher integer. Starting at the smallest data value in the list, count up to the position corresponding to the rounded value of c. The data value at that position is the pth percentile of the data set.

b) If c is an integer, use the value halfway between the cth and (c+1)th data values, counting from the smallest data value.

Example: What value in the following data set corresponds to the 60th percentile?

1) 12, 28, 35, 42, 47, 49, 50

2) n = 7, and $c = \frac{(7)(60)}{100} = 4.2$.

3) 4.2 is not an integer, so we round up to 5. The 60th percentile is then the 5th data value, counting from the lowest, or 47.

Example: The A. C. Nielsen Company publishes data on the TV-viewing habits of Americans in the Nielsen Report on Television. A sample of 20 people yielded the following data on weekly viewing times:

25, 41, 27, 32, 43, 66, 35, 31, 15, 5, 34, 26, 32, 38, 16, 30, 38, 30, 20, 21.

What is the 25th percentile of the data? I.e., what is the value such that at most 25% of the data values are less than that value and at most 75% of the data values are greater than that value?

1) Rearrange the data in ascending order: 5, 15, 16, 20, 21, 25, 26, 27, 30, 30, 31, 32, 32, 34, 35, 38, 38, 41, 43, 66

2) $c = \frac{np}{100} = \frac{(20)(25)}{100} = 5$.

3) The position of the 25th percentile is halfway between the 5th and the 6th data observations, namely, $\frac{21 + 25}{2} = 23$.

Hence, 25% of the weekly viewing times are less than 23 hours, and 75% of the weekly viewing times are greater than 23 hours.
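The percentile procedure can be sketched in Python as follows. This is our own illustration of the three steps, with a hypothetical function name; it assumes p is a whole number strictly between 0 and 100, and it uses integer arithmetic to decide whether c = np/100 is an integer. Other software may use a different percentile rule and give slightly different answers.

```python
def percentile(values, p):
    """p-th percentile by the rounding rule in these notes (0 < p < 100, p whole)."""
    if not values:
        raise ValueError("the data set must not be empty")
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)                   # step 1: order the data
    n = len(ordered)
    whole, remainder = divmod(n * p, 100)      # step 2: c = np/100
    if remainder != 0:
        position = whole + 1                   # step 3a: round c up
        return ordered[position - 1]           # positions count from 1
    # step 3b: c is an integer; go halfway between the c-th and (c+1)-th values
    return (ordered[whole - 1] + ordered[whole]) / 2

small_set = [12, 28, 35, 42, 47, 49, 50]
viewing_times = [25, 41, 27, 32, 43, 66, 35, 31, 15, 5,
                 34, 26, 32, 38, 16, 30, 38, 30, 20, 21]

print(percentile(small_set, 60))       # 47
print(percentile(viewing_times, 25))   # 23.0
```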

Defn : The first quartile of a data set is $Q_1$, the 25th percentile. The second quartile of the data set is $Q_2$, the median $\tilde{x}$, or 50th percentile. The third quartile of the data set is $Q_3$, the 75th percentile.

Defn : An outlier is an extremely high or extremely low data value, when compared to the rest of the data values.

Defn : The 5-number summary of a data set consists of the lowest value of the data, Xmin, the three quartiles, $Q_1$, $\tilde{x}$, and $Q_3$, and the highest value of the data, Xmax.

Example: For the Nielsen data

1) Enter the data as the variable VIEW.

2) Go to STAT, CALC, 1-Var Stats, and enter the variable name VIEW.

3) Scroll down and read off the 5-number summary of the data set:

Xmin = 5, $Q_1$ = 23, Med = 30.5, $Q_3$ = 36.5, Xmax = 66.

Defn : The interquartile range of a data set is IQR = $Q_3$ – $Q_1$.

We will consider a data value to be an outlier if its value is either greater than $Q_3$ + 1.5IQR or less than $Q_1$ – 1.5IQR.

Example: The Nielsen data. We suspect that the largest value, 66, could be an outlying observation. We calculate $Q_3$ + 1.5IQR = 36.5 + (1.5)(36.5 – 23) = 56.75. Since 66 > 56.75, the largest data value is actually an outlier, and should be investigated individually.
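The 5-number summary and the outlier check for the Nielsen data can be reproduced with the sketch below. It is an illustration only: the quartiles are computed as the 25th, 50th, and 75th percentiles using the rounding rule from these notes, which for this data set agrees with the values read from the calculator. The helper name `percentile` is ours.

```python
def percentile(values, p):
    """p-th percentile by the rule in these notes (0 < p < 100, p whole)."""
    ordered = sorted(values)
    whole, remainder = divmod(len(ordered) * p, 100)   # c = np/100
    if remainder:                       # c not an integer: round up
        return ordered[whole]
    return (ordered[whole - 1] + ordered[whole]) / 2   # c an integer: average

viewing_times = [25, 41, 27, 32, 43, 66, 35, 31, 15, 5,
                 34, 26, 32, 38, 16, 30, 38, 30, 20, 21]

q1, median, q3 = (percentile(viewing_times, p) for p in (25, 50, 75))
five_number_summary = (min(viewing_times), q1, median, q3, max(viewing_times))

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr            # values below this are outliers
upper_fence = q3 + 1.5 * iqr            # values above this are outliers
outliers = [x for x in viewing_times if x < lower_fence or x > upper_fence]

print(five_number_summary)              # (5, 23.0, 30.5, 36.5, 66)
print(lower_fence, upper_fence)         # 2.75 56.75
print(outliers)                         # [66]
```

Checking both fences shows that the smallest value, 5, is not an outlier, even though it is well separated from the rest of the data.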

Two More Types of Plots

Defn : A stem-and-leaf plot is a data plot that uses part of each data value as the stem and part of the data value as a leaf to form groups or classes.

Note: The TI-83 will not do this plot.

Example: The Nielsen data

1) Arrange the data in order from smallest to largest: 5, 15, 16, 20, 21, 25, 26, 27, 30, 30, 31, 32, 32, 34, 35, 38, 38, 41, 43, 66

2) List the values of the tens digit (including 0, if necessary) down the left side of a vertical line.

3) List the values of the units digits in the appropriate rows on the right side of the line.

0 | 5

1 | 5, 6

2 | 0, 1, 5, 6, 7

3 | 0, 0, 1, 2, 2, 4, 5, 8, 8

4 | 1, 3

5 |

6 | 6

The stem-and-leaf plot gives us the shape of a numerical data distribution (for a small data set), just as a histogram allows us to see the shape of the distribution. The stem-and-leaf plot has the advantage that the data set is actually listed in the plot.
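Since the TI-83 does not produce this plot, a few lines of Python can stand in for it. The sketch below is our own illustration; the function name is hypothetical, and it assumes non-negative whole-number data with the tens digit as the stem and the units digit as the leaf.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the rows of a stem-and-leaf plot (stem = tens, leaf = units)."""
    leaves = defaultdict(list)
    for x in sorted(values):               # step 1: order the data
        stem, leaf = divmod(x, 10)         # split into tens and units digits
        leaves[stem].append(leaf)
    rows = []
    # Steps 2 and 3: one row per stem, including stems with no leaves.
    for stem in range(min(leaves), max(leaves) + 1):
        row = ", ".join(str(leaf) for leaf in leaves[stem])
        rows.append(f"{stem} | {row}".rstrip())
    return rows

viewing_times = [25, 41, 27, 32, 43, 66, 35, 31, 15, 5,
                 34, 26, 32, 38, 16, 30, 38, 30, 20, 21]

for row in stem_and_leaf(viewing_times):
    print(row)
```

Keeping the empty stem 5 in the output preserves the gap between the bulk of the data and the value 66, which is the same reason empty classes are kept in a frequency distribution.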

Defn : A boxplot is a graphical representation of a numeric data set, using the 5-number summary.

The data values between $Q_1$ and $Q_3$ are represented by a box, with a vertical line at the median value.

The data values between Xmin and $Q_1$ are represented by a line segment attached to the left end of the box. The data values between $Q_3$ and Xmax are represented by a line segment attached to the right end of the box.

Note: The TI-83 will do boxplots.

Information Obtained from a Boxplot

1.

a) If the median is near the center of the box, the distribution is approximately symmetric.

b) If the median falls to the left of the center of the box, the distribution is positively skewed.

c) If the median falls to the right of the center of the box, the distribution is negatively skewed.

2.

a) If the lines are about the same length, the distribution is approximately symmetric.

b) If the right line is longer than the left line, the distribution is positively skewed.

c) If the left line is longer than the right line, the distribution is negatively skewed.

Example: The Nielsen data

1) Enter the data in the calculator.

2) Set the WINDOW appropriately.

3) Clear all other plots and drawings.

4) Choose STAT PLOT , and turn Plot 1 on.

5) Choose the 5th Type of plot, the Boxplot.

6) For Xlist , choose the name of the variable, in this case VIEW.

7) To display the boxplot, hit the GRAPH key.

8) Use the TRACE key to find Xmin, $Q_1$, the median, $Q_3$, and Xmax.
