CLASSIFYING VARIABLES, FREQUENCY DISTRIBUTIONS AND TABLES

A population is the complete collection of measurements, objects, or individuals under study. A survey of the entire population is called a census.

A sample is a portion or subset taken from a population. Because a sample is only part of the population, sample results will not always accurately reflect the population. However, if sampling is done properly and scientifically, the sample results will be sufficiently accurate. The subset should be chosen at random to avoid bias.

A survey is the collection of information from a sample. A random sample is a sample in which each member of the population has the same chance of being selected.

Data is information about individuals in a population.

Categorical variables describe a particular quality or characteristic which can be divided into categories. This is qualitative data.
Examples:

Numerical variables describe a characteristic which has a numerical value that can be counted or measured. This is quantitative data.
Examples:

A parameter is a numerical quantity that measures some aspect of a population.

Outliers are data values that are much larger or much smaller than the general body of data. They should be included in analysis unless they are the result of human or other known error.

A distribution is the pattern of variation of data, which may be described as:
positively skewed     symmetrical     negatively skewed

A discrete numerical variable takes exact number values and is often a result of counting.
Examples:

A continuous variable takes numerical values within a certain continuous range. It is usually a result of measuring.
Examples:

A frequency distribution or frequency table groups data items into classes and then records the number of items that appear in each class. The classes can either be continuous or discontinuous.

Ex. Arrange the following heights (in cm) of a group of students into a frequency table.
156 150 153 172 160 154 168 163 152 153 152 155
170 157 150 160 158 150 170 162 152 156 154 152
160 159 154 160 163 151 172 157 151 174 160 154

Choose your interval width so that there are about 10 classes or intervals (for example, the first interval could be 148-150 cm). Use the difference between the lowest and highest values to determine an appropriate interval width. Mark a check in the appropriate row of the tally column for each entry. Then total the tallies to get the frequency.

Heights (cm)      Tally     Frequency     Relative Frequency
148 ≤ x < 151
151 ≤ x < 154
154 ≤ x < 157
157 ≤ x < 160
160 ≤ x < 163
163 ≤ x < 166
166 ≤ x < 169
169 ≤ x < 172
172 ≤ x < 175
Totals

The total of the frequencies should be the total number of data items.

We will graph the data to best show its distribution. We do this by making a histogram. A histogram is a bar graph that portrays the data found in a frequency distribution. The bars are of equal width and correspond to the class intervals. A histogram is used for continuous data while a column graph is used for discrete (discontinuous) data.

Make a histogram for the data above.

A relative frequency histogram would be similar, but the vertical axis would show the relative frequency as a % rather than the number of items.

Higher Level Only
Alternatively, if the class widths are not uniform, the area of the bar can represent the frequency of the class. The height of the bar is then the frequency density, that is, the frequency of the class per unit of class width.
Area of the bar = width × height, where height = frequency density = frequency / class width.
The area of the bar is the frequency (see pp. 473-474 in Pearson HL Text).

The mode of a data set is the item or class of items that has the highest frequency in the distribution. A distribution can be uniform (each item or class has the same frequency), unimodal, bimodal, symmetric, etc.

A cumulative frequency distribution shows the accumulated frequencies of the table.

Ex.
A cumulative frequency distribution showing litres of fizzy cola syrup sold by 50 employees of Slimline Beverage Company in 1 week.

Litres Sold            < 80   < 90   < 100   < 110   < 120   < 130   < 140   < 150
Number of Employees       0      2       8      18      32      41      48      50

How many employees sold between 110 and 120 litres?

The graphic presentation of a cumulative frequency distribution is called a cumulative frequency graph/curve or an ogive (pronounced "oh-jive").

Ex. Make a cumulative frequency distribution for the data in the table below. Then make a cumulative frequency graph and find the upper quartile, median, and lower quartile.

Salary            Number of Employees
0-9,999            2
10,000-14,999      5
15,000-19,999      9
20,000-24,999     11
25,000-29,999     22
30,000-34,999     30
35,000-39,999     13
40,000-44,999      7
45,000-49,999      1

Salary            Number of Employees
0-9,999
0-14,999
0-19,999
0-24,999
0-29,999
0-34,999
0-39,999
0-44,999
0-49,999

You can create frequency histograms and bar graphs using your graphing calculator. Enter your data into a list, go to STATPLOT, change the plot type to a bar graph, select the appropriate list, then graph.

To find the median, find the halfway point of the cumulative frequencies on the vertical axis and draw a horizontal line across to the curve. Then draw a vertical line down to the x-axis to read off the median value.

For a frequency polygon, the point goes at the middle of the interval.

[Blank grid: Frequency (0 to 35) on the vertical axis, Salaries (0 to 55,000) on the horizontal axis]

Stem and Leaf Plots

Stem   Leaf
0      2 5 7 8
1      0 1 4 4 4 7 9
2      2 3 3 6
3      1 4

MEASURES OF CENTRAL TENDENCY AND QUARTILES

A parameter is a numerical characteristic of a population. A statistic is a numerical characteristic of a sample. A statistic is a single value that summarizes some characteristic of interest. There are many statistics that can be of interest. Some of these statistics are measures of central tendency such as the mean, median, and mode of a set of data.

The mode of a set of data is the most commonly occurring value in a series.
If we are looking at classes, it is the class that contains the most entries (has the highest frequency). It is possible to have one mode, more than one mode (bimodal, trimodal), or no mode at all. The mode is not the most useful of the measures of central tendency and is typically only used when others are not available. The mode will always be a value in the data series.

The arithmetic mean of a set of data is usually what people mean when they say average. The mean is the sum of all the data values divided by the number of data values. The mean does not necessarily have to be a value in the data series.

$\mu = \frac{\sum x}{N}$     mean for a population
$\bar{x} = \frac{\sum x}{n}$     mean for a sample

N = the number of items in the population
n = the number of items in the sample

If the data is in the form of a frequency table with item $x_i$ of frequency $f_i$, then the formula is:

$\text{Mean} = \frac{\sum f_i x_i}{\sum f_i}$

For a table that uses intervals, use the mid-value of the interval as your $x_i$.

The median of a set of data occupies the middle position in an ordered array of values. Half of the items are below the median, and half of the items are above the median. In the case where there is an even number of items, the median is the mean of the two middle values.

To find the median, arrange the items in order of size.
Odd number of scores: the $\frac{\sum f + 1}{2}$th score
Even number of scores: the mean of the $\frac{\sum f}{2}$th and $\left(\frac{\sum f}{2} + 1\right)$th scores

Ex. Find the mean, median, and mode of the ages shown below.

23 43 28 41 35 52 23 29

Mean:
Median:
Mode:

On the Calculator:
Enter the items into list 1 on your calculator. Press STAT, EDIT, SortA(L1) to sort the list in ascending order. Then press 2nd STAT, press MATH and median(L1) to find the median of the list. Press 2nd STAT, MATH, and mean(L1) to find the mean of the set of data.

Note: You can also choose STAT, CALC, 1-Var Stats L1 to find the mean and median. You will have to view the sorted list to find the mode: press 2nd STAT, choose L1, press ENTER and scroll through the list.

Ex.
Find the mean, median, and mode of lengths of 100 Dover sole shown in the table below.

Length (mm)   Number of Fish   Mid-Interval   $f_i x_i$
275-299              1              287           287
300-324              1              312           312
325-349             14              337          4718
350-374             24              362          8688
375-399             30              387         11610
400-424             22              412          9064
425-449              6              437          2622
450-474              2              462           924
Total              100                          38225

Mean:
Median:
Mode:

Note: To do this on the calculator, enter the mid-interval values into L1 and the frequencies into L2. Then press STAT, CALC, 1-Var Stats (L1, L2), ENTER.

Note where the median and mean are for skewed data (median is solid, mean is dashed):
[Figure: three distribution curves labelled positively skewed, symmetrical, negatively skewed]

A quartile contains a quarter of the data values.

The lower quartile (Q1) is the value below which lie one-quarter of the data items. The upper quartile (Q3) is the value above which lie one-quarter of the data items.

The median divides the data series into two parts, an upper and a lower part. The lower quartile is the median of the lower part and the upper quartile is the median of the upper part.

The inter-quartile range (IQR) is the difference between the upper and lower quartiles. It contains the middle 50 percent of the values.

On a cumulative frequency curve, the lower quartile is found by finding a quarter of the total cumulative frequency on the vertical axis, drawing a horizontal line to the curve and dropping a vertical line to the x-axis. The upper quartile is found similarly, using three-quarters of the total. In fact, any percentile can be found using this method. The tenth percentile is the value below which lie ten percent of the data values. The upper quartile is therefore also known as the 75th percentile.

The kth percentile, $P_k$, is a value such that k percent of the data are less than or equal to $P_k$ and (100 - k) percent are greater than or equal to $P_k$.
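The horizontal-then-vertical reading of a cumulative frequency curve can be imitated numerically. The sketch below is only an illustration: it assumes the ogive is drawn with straight segments between the plotted points (a hand-drawn smooth curve would give slightly different readings), and the function name `percentile_from_ogive` is made up for this example. It uses the Slimline Beverage Company cumulative frequencies from earlier in these notes.

```python
def percentile_from_ogive(bounds, cum_freq, k):
    """Estimate the k-th percentile from cumulative frequency points.

    bounds[i] is an upper class boundary and cum_freq[i] is the number of
    items below it. Assumption: straight lines join consecutive points.
    """
    total = cum_freq[-1]
    target = k / 100 * total          # height of the horizontal line
    for i in range(1, len(bounds)):
        low_f, high_f = cum_freq[i - 1], cum_freq[i]
        # Find the segment of the curve that the horizontal line crosses.
        if high_f > low_f and low_f <= target <= high_f:
            fraction = (target - low_f) / (high_f - low_f)
            # Drop a vertical line to the x-axis.
            return bounds[i - 1] + fraction * (bounds[i] - bounds[i - 1])
    raise ValueError("k must be between 0 and 100")


# Litres sold (upper boundaries) and cumulative number of employees.
bounds = [80, 90, 100, 110, 120, 130, 140, 150]
cum_freq = [0, 2, 8, 18, 32, 41, 48, 50]

median = percentile_from_ogive(bounds, cum_freq, 50)
q1 = percentile_from_ogive(bounds, cum_freq, 25)
q3 = percentile_from_ogive(bounds, cum_freq, 75)
print(q1, median, q3, q3 - q1)
```

Reading the graph by eye should give values close to these; small differences come from how the curve is drawn between the points.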
Ex: Find the median and first and third quartiles for the following:
1) 3, 5, 8, 9, 11
2) 3, 5, 6, 8, 9, 11
3) 3, 5, 6, 8, 9, 11, 11
4) 3, 5, 6, 8, 9, 11, 11, 13

Cumulative Frequency

Mark             Frequency     Cumulative Frequency
10 ≤ x < 20          2
20 ≤ x < 30          1
30 ≤ x < 40          4
40 ≤ x < 50          3
50 ≤ x < 60          5
60 ≤ x < 70         11
70 ≤ x < 80         25
80 ≤ x < 90         17
90 ≤ x < 100        11
Total:

Draw a cumulative frequency graph.

Find the following percentiles:
Median (50th):
75th:
15th:
80th:

What percentage of scores were between 30 and 50?
What percentage of students failed?
What percentage of students were between 80 and 100?

BOXPLOTS AND MEASURES OF SPREAD

A box and whisker plot (or boxplot) shows the middle half of the values in a data set, the part spanned by the interquartile range, as a box. Lines, or whiskers, extend to the left and right of the box to indicate the remaining 50 percent of the data items. (Only Standard Level requires the box plot.)

A box and whisker plot uses a five-number summary: the minimum value (minX), the lower quartile (Q1), the median (Q2), the upper quartile (Q3), and the maximum value (maxX). The lower and upper quartiles form the box, while whiskers are drawn from either side to the minimum and maximum values.

Outliers are values that are extremely large or extremely small compared to the rest of the data.

The box is drawn from Q1 to Q3 with a line drawn at the median. The whiskers run from the box to the minimum and maximum (unless they are outliers).

[Figure: boxplot with Min, Q1, Median, Q3 and Max labelled]

Ex. Find the median, Q1 and Q3 of the set of data {0 1 2 2 4 5 5 6 7}.

Ex. Find the five-number summary for the data series below and use it to draw a box and whisker plot.
4 7 6 3 9 6 2 9 5 7 3 2

On the calculator, enter the data into L1, then press STAT, CALC, 1-Var Stats L1 and read off the five-number summary. To graph the boxplot, press 2nd Y=, select Plot1, select the boxplot with the median line, set the Xlist to L1 and the frequency to 1.
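As a cross-check on the calculator output, the five-number summary can also be computed directly. This is a minimal sketch that assumes the "median of each half" convention described above, leaving the overall median out of both halves when the number of items is odd. The function name and the sample data are made up for illustration, and other software may use a different quartile convention and so give slightly different Q1 and Q3.

```python
from statistics import median


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule.

    Needs at least two values so that each half is non-empty.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For an odd count, skip the middle item so it belongs to neither half.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]


# Illustrative data only (not one of the exercises in these notes).
sample = [12, 5, 9, 2, 7, 4, 10, 4, 8]
minimum, q1, q2, q3, maximum = five_number_summary(sample)
print(minimum, q1, q2, q3, maximum)
print("IQR =", q3 - q1)
```

The box of the boxplot would run from `q1` to `q3` with a line at `q2`, and the whiskers would reach out to `minimum` and `maximum`.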
One advantage of a boxplot is that it can show whether data is reasonably symmetrical or skewed. If the median is in the centre of the box then the data set is reasonably symmetrical. However, if the median is closer to either the upper or lower quartile then the data set may be skewed. Skewness is also indicated if one whisker is appreciably longer than the other.

Ex. The boxplot shows the results of a test (out of 100 marks). Find the following:

[Figure: boxplot drawn above a scale from 0 to 100 in steps of 10]

a) The highest mark scored
b) The lowest mark scored
c) The mark above which half of the class scored
d) The mark that 25% of the class scored above
e) The mark that 75% of the class scored above

MEASURES OF SPREAD

Variance

Measures of central tendency are not the only way to describe data. Another way is to use measures of dispersion, that is, measures of the variability that exists in a data set. One reason to measure dispersion is to judge how well the average value depicts the data. Two sets of data can have the same mean, but be very different. For example, look at these two sets of class test scores:

Set 1: 23 45 67 75 84 96
Set 2: 53 59 67 68 69 74

The mean of both sets is 65, but it is easy to see that the scores in Set 1 are much more spread out.

One measure of dispersion is the range of the data set, but this does not necessarily give us enough information about the data set because it only uses the minimum and maximum values in its calculation.

Another measure of dispersion that is used is the variance. The calculation for variance involves using all the values in the data set.

Population variance: $\sigma^2 = \frac{\sum_{i=1}^{k} f_i (x_i - \mu)^2}{n}$, where $n = \sum_{i=1}^{k} f_i$

Sample variance: $s_n^2 = \frac{\sum_{i=1}^{k} f_i (x_i - \bar{x})^2}{n}$, where $n = \sum_{i=1}^{k} f_i$

The variance is the average of the squares of the differences between each item and the mean. Since variance is measured in squared units, taking the square root of the variance gives a typical distance between the items and the mean, in the same units as the data.
This is called the standard deviation.

Standard Deviation

If the standard deviation is high, the data is widely spread. If the standard deviation is low, the data is clustered (usually about the mean).

It is calculated in the following way:

$\sigma = \sqrt{\frac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + \dots + (x_n - \mu)^2}{n}} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}}$

where:
x = any score
$\mu$ = mean
s = sample standard deviation
$\sigma$ = population standard deviation

Ex. Calculate the variance and standard deviation for the following data set.

Data 1 4 5 5 5 5 6 6 8

Note: For data grouped into class intervals, this works the same as for grouped discrete data, except that you have to use class midpoints to represent each class.

f = frequency

variance $\sigma^2 = \frac{\sum f_i (x_i - \mu)^2}{\sum f_i}$ and $\sigma = \sqrt{\frac{\sum f_i (x_i - \mu)^2}{\sum f_i}}$

The larger the variance is, the larger the standard deviation is, and therefore the larger the spread in the data (more deviation from the mean). A very small standard deviation indicates that the data items are clustered around the mean.

Population standard deviation: $\sigma = \sqrt{\frac{\sum_{i=1}^{k} f_i (x_i - \mu)^2}{n}}$, where $n = \sum_{i=1}^{k} f_i$

Sample standard deviation: $s_n = \sqrt{\frac{\sum_{i=1}^{k} f_i (x_i - \bar{x})^2}{n}}$, where $n = \sum_{i=1}^{k} f_i$

Note: For this course we will use the formula for sample standard deviation but we will always use the calculator output that represents the population standard deviation, $\sigma_x$.

Ex. Calculate the mean, variance, and standard deviation for the following data set.

Score    Midpoint (x)   Freq. (f)   $f \cdot x$   $x - \mu$   $(x - \mu)^2$   $f (x - \mu)^2$
0-3
4-7
8-11
12-15
16-19
20-23
24-27
28-31
32-35
                        Total:                                                Total:

Standard deviation and mean may be calculated by using the TI-83/84 Plus. The key strokes are:
STAT ENTER: enter the values in List 1
STAT CALC ENTER (1-Var Stats) ENTER*
*The default for this is the data entered in List 1.

Mean on the calculator is $\bar{x}$ and standard deviation is $\sigma_x$.

To clear the list: STAT ENTER, up arrow (to L1), CLEAR ENTER

Ex.
Use the calculator to determine the mean and standard deviation for the following sets of data:

Data 1: 1 4 5 5 5 6 6 8
Data 2: 1 3 4 5 5 6 7 8
Data 3: 1 2 3 4 5 6 7 8

To calculate the mean and standard deviation of a distribution we always enter OUTCOMES in LIST 1 and FREQUENCY (of each outcome) in LIST 2.

Once the 2 lists are entered, and you have 1-Var Stats on the screen, you must "tell" the calculator that 2 lists must be used. The key strokes are: 2nd 1 (L1) , 2nd 2 (L2), then press ENTER to get the solution.

Ex. Using the calculator, determine the mean and standard deviation for the following data: 3 coins are tossed 40 times and the number of heads showing each time is recorded. The results are:

# of heads    0    1    2    3
Frequency     7   18   11    4

CORRELATION AND THE PEARSON COEFFICIENT OF COVARIANCE

Correlation refers to the relationship or association between two variables.

Follow these steps when looking at the correlation:

Step 1: Look at the scatterplot for any pattern.

Step 2: Look at the spread of points to make a judgement about the strength of the correlation.
For positive relationships:
For negative relationships:

Step 3: Look at the pattern of points to see whether or not it is linear.

Step 4: Look for any outliers. Investigate any outliers as they may be mistakes made in recording or plotting data. If the data is genuinely extraordinary it should be included.

Causation

Correlation and causation are often conflated. Just because two variables are correlated does not mean that one causes the other. For example, if it were found that a correlation existed between raining and being in school, that doesn't mean that going to school causes it to rain, nor that rain causes you to attend school.

Only if the variables are related such that if one is changed the other changes as well can we conclude that there exists a causal relationship between the variables.

Measuring Correlation

The correlation coefficient (r) measures the strength of correlation between two variables.
This value lies between -1 and 1.

Two variables are positively correlated if an increase in one variable is accompanied by an increase in the other in an approximately linear manner.

For positively correlated variables, the value of r lies between 0 and 1. If r is near 0, that indicates that no linear association (correlation) is present. If r is near 1, that indicates a strong positive linear association; r = 1 indicates a perfect linear association (perfect positive correlation).

The following scatter diagrams show various r values for positive correlation.

Two variables are negatively correlated if an increase in one variable is accompanied by a decrease in the other in an approximately linear manner.

For negatively correlated variables, the value of r lies between 0 and -1. If r is near 0, that indicates that no linear association (correlation) is present. If r is near -1, that indicates a strong negative linear association; r = -1 indicates a perfect linear association (perfect negative correlation).

The following scatter diagrams show various r values for negative correlation.

Pearson's Correlation Coefficient

Pearson's correlation coefficient is used to find the degree of linearity between two random variables X and Y, given n ordered pairs:

$(x_1, y_1), (x_2, y_2), (x_3, y_3), \dots, (x_n, y_n)$

$r = \frac{s_{xy}}{s_x s_y}$

where:
$s_{xy}$ = the covariance of X and Y
$s_x$ = the standard deviation of X
$s_y$ = the standard deviation of Y

The second formula is useful because it doesn't require the means of the X and Y distributions to be found.

Ex. A chemical company has been trying out a new product to control the number of lawn beetles in the soil. Determine the extent of the correlation between the quantity of chemical used and the number of surviving lawn beetles per square metre of lawn.

Lawn   Amount of chemical (g)   Number of surviving lawn beetles
A              2                          11
B              5                           6
C              6                           4
D              3                           6
E              9                           3

x    y    $x - \bar{x}$    $y - \bar{y}$    $(x - \bar{x})(y - \bar{y})$    $(x - \bar{x})^2$    $(y - \bar{y})^2$





Totals:

$\bar{x} =$          $\bar{y} =$

$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}$

$r =$

We will usually use technology to find this value.

Calculator:

Ex. Ms.
Allan wants to know if there is a correlation between the focussed time students spend studying for their Statistics Test and their test results. Using your calculator, find the correlation coefficient, r, for the set of data below.

Time spent studying (hours)     3     2    1.6     5
Score on test (%)              84    76    75     85

LINE OF BEST FIT, LINEAR REGRESSION, INTERPOLATION AND EXTRAPOLATION

Line of Best Fit

The method of fitting a line to a set of data and then finding the equation of the line is called regression. Sometimes this line is called the model. We can use a line of best fit to predict a value of the dependent variable given a value of the independent variable. We can fit a straight line 'by eye' or by the method of 'least squares' (linear regression).

To draw a line of best fit by eye, find the mean point $(\bar{x}, \bar{y})$ of the data, plot it, and draw a line through it that follows the trend of the data so that about half the data points are above the line and half are below it.

Ex. Create a scatter plot, find the mean point, and graph the best fit line for the data given below.

Age (months)    8    9    10    12    15
Shoe Size       1    2    2.5    3     4

To find the equation of the line, choose two points that the line passes through and use them to find the slope and then generate the equation $y = mx + b$.

Linear Regression

Of course, drawing the line of best fit by eye isn't very accurate. We will use the method of linear regression to find the best fit line.

A residual is the vertical distance between a data point and the possible line of best fit.

Least Squares Regression for y on x

The method for finding the best line involves a process that minimizes the sum of the squares of the residuals.

Note: $y - \bar{y} = \frac{s_{xy}}{s_x^2} (x - \bar{x})$

We will use technology to find the equation of the best fit line and use the line to predict.

Calculator:

Ex. Use a calculator to find the equation of the best fit line for the data given in the first example (age and shoe size).
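The least squares line $y - \bar{y} = \frac{s_{xy}}{s_x^2}(x - \bar{x})$ can also be worked out without a graphing calculator. The sketch below is an illustration under stated assumptions: the function name `least_squares` and the five data points are made up (they are not taken from the exercises), and the covariance and variance are left as plain sums because the common factor of 1/n cancels in both the slope and r.

```python
from math import sqrt


def least_squares(xs, ys):
    """Return (slope, intercept, r) for the regression line of y on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sums of products of deviations; dividing each by n would give
    # s_xy, s_x^2 and s_y^2, but the 1/n factors cancel below.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar   # the line passes through the mean point
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r


# Hypothetical data, chosen so the arithmetic is easy to follow by hand.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
slope, intercept, r = least_squares(xs, ys)
print(f"y = {slope:.2f}x + {intercept:.2f}, r = {r:.3f}")
```

For this made-up data the mean point is (3, 4), the sum of products is 6 and the sum of squared x-deviations is 10, so the slope is 0.6 and the line is y = 0.6x + 2.2. The same steps applied to any exercise data should agree with the calculator's linear regression output up to rounding.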
Interpolation and Extrapolation

The largest value in the data set is called the upper pole and the smallest value in the data set is called the lower pole (whether you use the independent or dependent values depends on which variable you are looking at).

If we are trying to predict a data value that lies between the poles, we are interpolating. If we are trying to predict a data value that lies outside the poles, we are extrapolating.

The accuracy of an interpolation depends on how linear the data is. The accuracy of an extrapolation depends on how linear the data is and on the assumption that the trend will continue past the poles.

Ex. The table below shows a restaurateur's data for the number of diners in March and the temperature at noon.

Temperature (X °C)       23  25  28  30  30  27  25  28  32  31  33  29  27
Number of diners (Y)     57  64  62  75  69  58  61  78  80  67  84  73  76

a) Graph the data on a scatterplot (use your calculator and sketch it here).

b) Using technology, find the value of the Pearson Correlation Coefficient, r.

c) What is the equation of the least squares regression line? Graph it above.

d) Using the equation you found in part c:
   i) How many diners could be expected if the temperature was 26 °C?
   ii) How many diners could be expected if the temperature was 35 °C?
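The difference between interpolating and extrapolating comes down to checking whether the x-value lies between the poles before trusting the prediction. The short sketch below illustrates that check. The function name `predict`, the line y = 0.6x + 2.2 and the poles 1 and 5 are all hypothetical values chosen for illustration, not results from the exercise above.

```python
def predict(x, slope, intercept, lower_pole, upper_pole):
    """Predict y from a fitted line and say which kind of prediction it is."""
    if lower_pole <= x <= upper_pole:
        kind = "interpolation"
    else:
        # Relies on the assumption that the trend continues past the poles.
        kind = "extrapolation"
    return slope * x + intercept, kind


# Hypothetical fitted line and poles of the x-data.
slope, intercept = 0.6, 2.2
lower_pole, upper_pole = 1, 5

for x in (3, 8):
    y, kind = predict(x, slope, intercept, lower_pole, upper_pole)
    print(f"x = {x}: predicted y = {y:.1f} ({kind})")
```

A prediction labelled as an extrapolation should be reported with extra caution, since nothing in the data confirms that the linear trend holds outside the observed range.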