Chapter 2 – Organizing and Summarizing Data

Definition: When data are in their original form, as collected, they are called raw data.

We want to be able to visualize the characteristics of a data set; hence we construct graphical representations of the data. In order to do so, we must look at the frequency of occurrence of data values.

Definition: A categorical frequency distribution, used for categorical (qualitative) data, is a table listing the categories, together with the frequency of occurrence of each category in the observed data.

Definition: The frequency for a category is the number of data values falling in that category. The relative frequency for a category is the fraction, proportion, or percentage of the data values that fall within that category.

Example: The following table shows data on class rank of students receiving financial aid at a small 4-year college.

College Class Rank    Frequency    Relative Frequency
Fr                    18           18/40 = 0.45 = 45%
So                    12           12/40 = 0.30 = 30%
Jr                     6            6/40 = 0.15 = 15%
Sr                     4            4/40 = 0.10 = 10%

Often, when the data are numeric, there are too many different data values for a listing of the raw data to be of use in seeing the characteristics of the data. It is common to divide the interval of values of the data into a relatively small number of subintervals, called classes, and to tabulate the data using the frequencies. Each frequency is the number of occurrences of data values in one of the classes.

Definition: A grouped frequency distribution is a table that organizes raw data using classes and frequencies.

Definition: The largest data value that can be included in a class is the upper class limit for that class; the smallest data value that can be included is the lower class limit.

Definition: The class width is the difference between the upper class limit of one class and the upper class limit of the next-higher class.
Definition: The cumulative frequency for a class is the count of all observed data values in that class or in lower classes.

Rules for constructing a frequency distribution:
1) The number of classes should be between 5 and 20: about 5 for small data sets and about 20 for large data sets. “Small” means roughly 25 to 30 observations; “large” means around 1000 or more observations.
2) An observed data value must be in one, and only one, class. This means that the classes must be non-overlapping, or mutually exclusive.
3) The classes must be continuous; even if there are no observed data values in a given class, that class must be included, with a frequency value of 0.
4) The classes must be exhaustive; i.e., together they must include all of the data.
5) The classes must be equal in width.

Procedure for constructing a grouped frequency distribution:
1) Find the range by subtracting the lowest value of the data from the highest.
2) Select the number of classes desired (between 5 and 20).
3) Find the class width by dividing the range by the number of classes; round the result up to get the class width.
4) Go to the TI-83 calculator and construct a graph called a histogram, using the procedure listed below.
5) Use the information read off the calculator screen to construct the grouped frequency table.

Example: We have 25 scores on a final exam, as follows:

86, 83, 56, 98, 82, 52, 71, 88, 75, 91, 69, 88, 64, 78, 81, 74, 77, 83, 90, 85, 64, 79, 71, 83, 64

We want a frequency distribution. Since the data set is small, we choose 5 as the number of classes. The range of the data is R = Largest value – Smallest value = 98 – 52 = 46. To get the class width, we divide the range by 5, obtaining 9.2. We round this number up to a convenient value to obtain the class width, 9.25. We then go to the TI-83 to construct the histogram. We will talk about constructing histograms first, then get back to constructing the grouped frequency table.
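As a rough check on the arithmetic in this example, the range and class width can also be computed in Python. This is only a sketch and is not part of the TI-83 procedure used in these notes. The variable names are illustrative, and rounding up to the next multiple of 0.25 is an assumption chosen so that the result matches the class width of 9.25 used above; the notes themselves do not state a rounding rule.

```python
import math

# The 25 final exam scores from the example.
scores = [86, 83, 56, 98, 82, 52, 71, 88, 75, 91, 69, 88, 64,
          78, 81, 74, 77, 83, 90, 85, 64, 79, 71, 83, 64]

num_classes = 5  # small data set, so we use 5 classes

# Step 1: range = largest value - smallest value.
data_range = max(scores) - min(scores)

# Step 3: divide the range by the number of classes, then round up.
raw_width = data_range / num_classes

# Assumption: round up to the next multiple of 0.25 (not stated in the notes).
step = 0.25
class_width = math.ceil(raw_width / step) * step

print("Range:", data_range)
print("Range / number of classes:", raw_width)
print("Class width after rounding up:", class_width)
```

Changing step to 1 would round the width up to 10 instead, which is another common choice.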
Graphical Representations of Data

We will construct several types of graphs that display numeric data. One of the most common ways to graph numeric data is through use of a histogram.

Definition: A histogram is a graph that displays the data by using vertical bars of various heights to represent the frequencies.

Characteristics of a histogram:
1) The classes are listed in order along the horizontal axis of the chart.
2) The vertical axis provides a scale for the frequencies.
3) A rectangle, or bar, is constructed for each class so that
   a) the height of the bar is the frequency of the class;
   b) the bar for the class extends from the lower boundary of the class to the upper boundary.
4) Each axis of the histogram has a label, and the histogram has a title.

Example: Now let us create a histogram for a data set, and in so doing, generate a grouped frequency distribution.

Entering a data set into the TI-83 graphing calculator, using the statistics exam data:

The stat list editor is a table where you can store, edit, and view up to 20 lists that are in memory. You can also create list names from the stat list editor.
1) To display the stat list editor, press STAT, and then select 1:Edit from the STAT EDIT menu.
2) Use the up arrow key to move the cursor to the top row of the table. Press 2ND, and then INS. You will see the Name = prompt at the bottom of the screen. Type the name of your variable using the alphabetic keys (green symbols on your calculator).
3) Use the down arrow to move to the list. Type in the first data value and press ENTER. The cursor will automatically move down to the next space for the next entry. If you make a mistake, use the arrow keys to return to the location of the mistake and make a correction.
4) If you want to erase a list, move the cursor to the list name, and press DEL.

Steps in constructing a histogram using the TI-83 graphing calculator:

First, you need to clear previous graphs.
1) Press Y=. You will see a list of functions.
If any of them have already been defined, use the arrow keys and the CLEAR key to erase them.
2) Next press 2ND, and STAT PLOT. You will see a list of plots. All of them should be off. If any are not, go down to 4:PlotsOff and press ENTER.
3) Clear all drawn figures. Press 2ND and DRAW. Choose 1:ClrDraw, and press ENTER.
4) Set the size of your graph window. Press WINDOW. The Xmin value should be equal to your smallest data value; in this case, we choose Xmin = 52. The Xmax value should be equal to or slightly larger than your largest data value; in this example, we choose Xmax = 102. The Xscl value is your class width. For this example, we chose 5 classes, and so Xscl = 9.25. The Ymin value should be 0; the Ymax value should be somewhat larger than your expected largest class frequency. Since there are 25 items of data, we choose Ymax = 12.
5) Press 2ND, STAT PLOT, 1:Plot1, and ENTER. Turn Plot 1 On.
6) Choose the histogram symbol (the third symbol on the third line of the screen).
7) Go down to Xlist: and enter the name of your variable.
8) Press the GRAPH key. You will see the histogram displayed.

To generate the frequency distribution from the histogram:
1) Press the TRACE key.
2) Use the right arrow key to move from one bar of the histogram to the next, reading the class boundaries and the frequencies from the calculator screen.

The result for this example is given below.

Class Limits     Frequency    Cumulative Frequency    Relative Frequency
52.00 – 61.24    2             2                      0.08 = 8%
61.25 – 70.49    4             6                      0.16 = 16%
70.50 – 79.74    7            13                      0.28 = 28%
79.75 – 88.99    9            22                      0.36 = 36%
89.00 – 98.25    3            25                      0.12 = 12%

Note also that the table includes a column for the relative frequencies, which are the proportions of the data set falling into each class.

Defn: The relative frequency associated with a class is the proportion of the data set falling into that class. It is found by dividing the class frequency by the size of the data set.
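The grouped frequency table for the exam scores can also be rebuilt in Python as a cross-check on the values read from the calculator. This is a minimal sketch, not the method used in these notes. The variable names are illustrative, and the code assumes that each class includes its lower boundary and excludes its upper boundary, with classes starting at 52 and a width of 9.25.

```python
from itertools import accumulate

# The 25 final exam scores from the example.
scores = [86, 83, 56, 98, 82, 52, 71, 88, 75, 91, 69, 88, 64,
          78, 81, 74, 77, 83, 90, 85, 64, 79, 71, 83, 64]

lower_bound = 52.0   # Xmin in the calculator window
class_width = 9.25   # Xscl in the calculator window
num_classes = 5
n = len(scores)

# Count how many scores fall in each class.
# Assumption: a class contains values x with low <= x < low + class_width.
frequencies = [0] * num_classes
for x in scores:
    index = int((x - lower_bound) // class_width)
    # Guard so a value at the very top edge stays in the last class.
    index = min(index, num_classes - 1)
    frequencies[index] += 1

# Cumulative frequency: running total of the class frequencies.
cumulative = list(accumulate(frequencies))

# Relative frequency: class frequency divided by the size of the data set.
relative = [f / n for f in frequencies]

for i in range(num_classes):
    low = lower_bound + i * class_width
    high = low + class_width
    print(f"{low:6.2f} to under {high:6.2f}  "
          f"{frequencies[i]:2d}  {cumulative[i]:2d}  {relative[i]:.2f}")
```

The frequencies, cumulative frequencies, and relative frequencies printed by this sketch should agree with the table above.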
Defn: The cumulative relative frequency associated with a class is the proportion of the data set falling into that class or lower classes. It is found by dividing the cumulative frequency for a class by the size of the data set.

Interpretation of Relative Frequency and Cumulative Relative Frequency:

If we randomly select an observation from the data set, the relative frequency for a class is the probability that our selected observation will be found in that class. The cumulative relative frequency for a class is the probability that the observation will be found either in that class or in a lower class.

Distribution Shapes: (See p. 88)
1) In a uniform distribution, the frequencies are equal for all classes; the relative frequencies are also equal for all classes.
2) In a bell-shaped distribution, the greatest frequency (or relative frequency) occurs in the middle class, with decreasing frequencies away from the center in either direction. Uniform and bell-shaped distributions are examples of symmetric distributions.
3) In a distribution that is positively skewed, or right-skewed, the majority of the data values fall to the left of the center and cluster at the lower end of the distribution; the tail of the distribution is to the right.
4) In a distribution that is negatively skewed, or left-skewed, the majority of the data values fall to the right of the center and cluster at the upper end of the distribution; the tail of the distribution is to the left.

Other Types of Graphs

Defn: A bar graph is used to represent the frequency distribution for a categorical variable, and the frequencies are displayed by the heights of the vertical bars.

Defn: A Pareto chart is a bar graph whose bars are drawn in decreasing order of frequency or relative frequency.

Example: pp. 63-66

Note: Since we are dealing with non-numeric data, the TI-83 calculator will not do this type of graph.

Another type of graph used with categorical data is the pie graph.
Defn: A pie graph is a circle that is divided into sections or wedges according to the proportion of the data set in each category.

Note: The TI-83 will not do this type of graph. It must be done by hand.

Example 6, p. 68.

Note: In any situation in which data are represented using graphical techniques, it is easy to construct the graph in such a way as to mislead the viewer. It is necessary to examine the graph carefully in order to interpret it properly. On pages 100-109 of the textbook, there are examples of graphs constructed to be misleading.

Time Series Plots

If the values of a variable are measured at regular intervals over a period of time, the data are referred to as time series data. Unlike previous data sets, the items in a time series data set may be related to each other. To represent the data graphically, we use a time series plot.

Defn: A time series plot is obtained by plotting the time at which a variable is measured along the horizontal axis and the measured value of the variable along the vertical axis. Lines are then drawn connecting the points.

Example: p. 98, Ex. 55

To do this type of plot using the TI-83/84, we need to enter two lists of numbers. The first list is the sequence of time points; the second list is the sequence of data values. The type of graph we need is the second of the six plot types available with the STAT PLOT function of the calculator.

Steps in constructing a time series graph using the TI-83/84 graphing calculator:
1) You will need to enter two columns of data. The first column is the set of time values. To avoid making the graph look cluttered, I would enter the time values as 1, 2, 3, … up to the number of time points. The second column is the list of measured values of the variable. These two columns should have the same number of data items.
2) Set the size of your graph window. Press WINDOW. The Xmin value should be 0.
The Xmax value should be slightly larger than your largest time value; in this example, there were 12 time points, so I would set Xmax = 13. For a time series graph, the Xscl value is 1. The Ymin value should be 0; the Ymax value should be somewhat larger than your largest data value; in this example, the largest price is 34.77, so I would set Ymax = 36.
3) Press 2ND, STAT PLOT, 1:Plot1, and ENTER. Turn Plot 1 On.
4) Choose the line graph symbol (the second symbol on the first row of Type).
5) Go down to Xlist: and enter the name of your time list.
6) Go down to Ylist: and enter the name of the variable.
7) Press the GRAPH key. You will see the time series graph displayed.
8) If you press the TRACE key, you can read off the coordinates of each point on the graph.

For time series data, we are looking for trends. In this example, we see that there is a slightly decreasing trend in Closing Price for the 12-month period, and some cyclical fluctuations. The decreasing trend corresponds to the period of the onset of the recession. A stock analyst would also want to find explanations for the cyclical pattern seen in the data.

Graphical Misrepresentations of Data

Data are sometimes graphed in ways that mislead the reader, whether intentionally or not.

Example: pp. 106-109: 3, 7, 8, 11