Tutorial in Statistics

Exercise 1
Which of the following variables are categorical (or qualitative) and which are quantitative?
(i) The color of cars involved in several severe accidents.
(ii) The length of time required for rats to move through a maze.
(iii) The classification of police administration as city, county or state.
(iv) The ratings given to pizza in a taste test as poor, good or excellent.
(v) The number of times subjects in a sociological research study have been married.

Solution
The variables in (i), (iii) and (iv) are categorical: they take non-numerical values that are classified into categories. The variables in (ii) and (v) take numerical values, obtained by measuring and counting respectively, and are quantitative.

Exercise 2
The areas of the continents of the world, in millions of square kilometres, are presented in the table below.

Continent        Area (millions of sq. km)
Africa           30.3
Asia             26.9
Europe           4.9
North America    24.3
Oceania          8.5
South America    17.9
U.S.S.R.         20.5
Total            133.3

Display this data using (i) a bar chart and (ii) a pie chart.

Solution
The areas of the continents, in millions of square kilometres, together with their percentage of the total area, are presented in the table below.

Continent        Area (millions of sq. km)   Percentage Area
Africa           30.3                        30.3/133.3 = 22.7%
Asia             26.9                        26.9/133.3 = 20.2%
Europe           4.9                         4.9/133.3 = 3.7%
North America    24.3                        24.3/133.3 = 18.2%
Oceania          8.5                         8.5/133.3 = 6.4%
South America    17.9                        17.9/133.3 = 13.4%
U.S.S.R.         20.5                        20.5/133.3 = 15.4%
Total            133.3                       100%

(i) The requested bar charts are given below.

[Bar chart: Area of Continents in millions of sq. kilometres. x-axis: Continents (Africa, Asia, N. America, U.S.S.R., S. America, Oceania, Europe); y-axis: Area]

[Bar chart: Percentage Area of Continents. x-axis: Continents (Africa, Asia, N. America, U.S.S.R., S. America, Oceania, Europe); y-axis: Percentage Area]

(ii) The requested pie chart is given below.

[Pie chart: Area of Continents in millions of sq. kilometres. Africa (30, 22.7%), Asia (27, 20.2%), N. America (24, 18.2%), U.S.S.R. (21, 15.4%), S. America (18, 13.4%), Oceania (9, 6.4%), Europe (5, 3.7%)]

Exercise 3
The breakdown of total dollars spent on business trips in the United States is estimated as follows: (a) 41% on air fares, (b) 22% on lodgings, (c) 12% on meals, (d) 8% on car rentals and (e) the remainder on other expenses.
(i) Construct a pie chart to show this information.
(ii) Construct a bar chart to show this information.

Solution
We have 41% on air fares, 22% on lodgings, 12% on meals and 8% on car rentals, which leaves 100% - (41% + 22% + 12% + 8%) = 17% on other expenses. The total is 100%.

(i)
[Pie chart: Percentage breakdown of Business Trips. Air fares (41.0%), car rentals (8.0%), other expenses (17.0%), lodgings (22.0%), meals (12.0%)]

The same chart with the categories arranged in decreasing order clockwise:

[Pie chart: Percentage breakdown of Business Trips. Air fares (41.0%), lodgings (22.0%), other expenses (17.0%), meals (12.0%), car rentals (8.0%)]

(ii)
[Bar chart: Breakdown of Expenses of Business Trips. x-axis: air fares, car rentals, lodgings, meals, other expenses; y-axis: Percentage]

[Bar chart: Breakdown of Expenses of Business Trips, in decreasing order. x-axis: air fares, lodgings, other expenses, meals, car rentals; y-axis: Percentage]

Exercise 4
The final marks of 80 students at a university are recorded below.

68 84 75 82 68 90 62 88 76 93
73 79 88 73 60 93 71 59 85 75
61 65 75 87 74 62 95 78 63 72
66 78 82 75 94 77 69 74 68 60
96 78 89 61 75 95 60 79 83 71
79 62 67 97 78 85 76 65 71 75
65 80 73 57 88 78 62 76 53 74
86 67 73 81 72 63 76 75 85 77

(a) Construct a Frequency Distribution, a Relative Frequency Distribution and a % Frequency Distribution table for this data, and draw the histograms for (i) the Frequency Distribution with its Frequency Polygon and (ii) the % Frequency Distribution with its % Frequency Polygon. Comment on the shape of the histogram. Take 5 equal classes.
(b) Construct a Cumulative Frequency Distribution table and draw the respective histogram and ogive. Take 5 equal classes.
(c) How many students got less than or equal to 75 marks, and how many less than or equal to 80 marks?
(d) Construct a Cumulative % Frequency Distribution table and draw the respective histogram and ogive. Take 5 equal classes.
(e) What % of students got less than or equal to 84 marks, and what % less than or equal to 89 marks?

Solution
(a) We take 5 classes of equal width. Then

$\text{class size} = \frac{97 - 53}{5} = \frac{44}{5} = 8.8 \approx 9.$

We will use 5 classes of equal size 9, with midpoints at 57.5, 66.5, 75.5, 84.5 and 93.5 respectively. The class boundaries are shown in the frequency table below, together with the class width.

The required Frequency Distribution Table is given below.

Examination Marks
Class Boundaries      Class Midpoint   Class Width   Frequency
53 to less than 62    57.5             9             8
62 to less than 71    66.5             9             16
71 to less than 80    75.5             9             33
80 to less than 89    84.5             9             14
89 to less than 98    93.5             9             9
                                                     Sum = 80

[Histogram and frequency polygon: Frequency Distribution of Examination Marks of 80 students. x-axis: Examination Marks (class midpoints 57.5 to 93.5); y-axis: Number of students]

N.B. The histogram is single-peaked and approximately symmetrical.

The complete frequency table, including (i) frequency, (ii) relative frequency and (iii) % frequency, is given below.

Class Boundaries      Class Midpoint   Frequency   Relative Frequency   % Frequency
53 to less than 62    57.5             8           8/80 = .1            10.00%
62 to less than 71    66.5             16          16/80 = .20          20.00%
71 to less than 80    75.5             33          33/80 = .4125        41.25%
80 to less than 89    84.5             14          14/80 = .175         17.50%
89 to less than 98    93.5             9           9/80 = .1125         11.25%
                                       Sum = 80    Sum = 1.0            Sum = 100%

The last two columns are calculated from the third column according to the following notes.

Note 1: $\text{Relative frequency of a class} = \frac{\text{frequency of that class}}{\text{sum of all frequencies}}$

Note 2: $\text{Percentage} = \text{Relative frequency} \times 100$

Note: The Relative Frequency and the Percentage Frequency histograms have the same shape; only the units on the vertical axis differ. Hence we will only plot the Percentage Frequency, as this is the one most frequently used.
[Histogram and % frequency polygon: % Frequency Distribution of Examination Marks of 80 students. x-axis: Examination Marks (class midpoints 57.5 to 93.5); y-axis: % Number of students]

N.B. The histogram is single-peaked and approximately symmetrical.

(b) Cumulative Frequency Distribution

Definition: A cumulative frequency distribution gives the total number of values that fall below the upper boundary of each class.

Using the data for the exam results of 80 students, we will illustrate the cumulative frequency distribution and the % cumulative frequency.

Definition: An ogive is a curve drawn for the cumulative frequency distribution by joining with straight lines the dots marked above the upper boundaries of the classes, at heights equal to the cumulative frequencies of the respective classes.

Example 1
Construct a cumulative frequency distribution and an ogive for the exam results of 80 students given in the table above.

Examination Marks
Class Boundaries      Frequency   Cumulative Frequency
53 to less than 62    8           8
62 to less than 71    16          8+16 = 24
71 to less than 80    33          8+16+33 = 57
80 to less than 89    14          8+16+33+14 = 71
89 to less than 98    9           8+16+33+14+9 = 80
                      Sum = 80

The lower boundary of the first class, 53, is taken as the lower limit of every class in the cumulative frequency distribution. The upper boundaries of all classes are the same as in the frequency distribution table. To obtain the cumulative frequency of a class, add the frequency of that class to the frequencies of all the preceding classes. The cumulative frequencies are recorded in the third column, while the class boundaries are recorded in the first column.
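The counting and running-total steps described above can be checked with a short Python sketch. It is only an illustration under stated assumptions: the helper name `frequency_table` is introduced here and does not come from the tutorial, and the classes are taken as half-open intervals "lower to less than upper" starting at 53 with width 9, as in the tables above.

```python
from itertools import accumulate

# Final marks of the 80 students from Exercise 4.
marks = [
    68, 84, 75, 82, 68, 90, 62, 88, 76, 93,
    73, 79, 88, 73, 60, 93, 71, 59, 85, 75,
    61, 65, 75, 87, 74, 62, 95, 78, 63, 72,
    66, 78, 82, 75, 94, 77, 69, 74, 68, 60,
    96, 78, 89, 61, 75, 95, 60, 79, 83, 71,
    79, 62, 67, 97, 78, 85, 76, 65, 71, 75,
    65, 80, 73, 57, 88, 78, 62, 76, 53, 74,
    86, 67, 73, 81, 72, 63, 76, 75, 85, 77,
]


def frequency_table(data, start, width, n_classes):
    """Hypothetical helper: count values in classes [lower, lower + width)."""
    counts = [0] * n_classes
    for x in data:
        counts[(x - start) // width] += 1  # index of the class containing x
    return [(start + i * width, start + (i + 1) * width, c)
            for i, c in enumerate(counts)]


table = frequency_table(marks, start=53, width=9, n_classes=5)
frequencies = [c for _, _, c in table]
# Cumulative frequency: add each class frequency to all preceding ones.
cumulative = list(accumulate(frequencies))

for (lower, upper, freq), cum in zip(table, cumulative):
    relative = freq / len(marks)
    print(f"{lower} to less than {upper}: frequency={freq:2d}, "
          f"relative={relative:.4f}, %={100 * relative:.2f}, cumulative={cum}")
```

Running the sketch prints one row per class; the last cumulative value equals the number of students, which is a quick consistency check on any hand-built table.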
[Histogram and ogive: Cumulative Frequency Distribution of Examination Marks of 80 students. x-axis: Examination Marks; y-axis: Cumulative Number of students (0 to 80)]

(c) The advantage of the cumulative frequency table and the ogive is that they can answer questions such as the following.

Example: "How many students got an exam mark less than or equal to 75, and how many less than or equal to 80?"
Answer: approximately 40 and 57 respectively.

(d) Cumulative Relative Frequency and Cumulative Percentage

The Cumulative Relative Frequency and the Cumulative Percentage are easily obtained from the cumulative frequency distribution using the following formulae.

$\text{Cumulative relative frequency} = \frac{\text{cumulative frequency of a class}}{\text{total observations in the data set}}$

$\text{Cumulative percentage} = \text{Cumulative relative frequency} \times 100$

We will illustrate the Cumulative Relative Frequency and the Cumulative Percentage using the example above.

Class Boundaries   Cumulative Relative Frequency   Cumulative Percentage
53-62              8/80 = .10                      10.00%
62-71              24/80 = .30                     30.00%
71-80              57/80 = .7125                   71.25%
80-89              71/80 = .8875                   88.75%
89-98              80/80 = 1.00                    100.00%

Note: The Cumulative Relative Frequency and the Cumulative Percentage plots have the same shape; only the units on the vertical axis differ. Hence we will only plot the Cumulative Percentage and its ogive.

[Histogram and ogive: Cumulative % Frequency Distribution of Examination Marks of 80 students. x-axis: Examination Marks; y-axis: Cumulative % Number of students (0 to 100)]

(e) The advantage of the % cumulative frequency table and the ogive is that they can answer questions such as the following.

Example: "What % of students got an exam mark less than or equal to 84, and what % less than or equal to 89?"
Answer: approximately 80.00% and 88.75% respectively.

Exercise 5
Consider the following example. The total payrolls (rounded to millions of dollars) for all 30 major league baseball teams in the U.S.A. for 1999 are given in the table below.
Total Payrolls of Major League Baseball Teams for 1999

Team                 Total Payroll            Team                 Total Payroll
                     (millions of dollars)                         (millions of dollars)
Anaheim              51                       Milwaukee            43
Arizona              70                       Minnesota            16
Atlanta              79                       Montreal             15
Baltimore            75                       New York Mets        72
Boston               72                       New York Yankees     92
Chicago Cubs         55                       Oakland              25
Chicago White Sox    25                       Philadelphia         30
Cincinnati           38                       Pittsburgh           24
Cleveland            74                       St. Louis            46
Colorado             54                       San Diego            47
Detroit              37                       San Francisco        46
Florida              15                       Seattle              45
Houston              56                       Tampa Bay            38
Kansas City          17                       Texas                81
Los Angeles          77                       Toronto              49

(a) Construct a Frequency Distribution, a Relative Frequency Distribution and a % Frequency Distribution table for this data, and draw the histograms for (i) the Frequency Distribution with its Frequency Polygon and (ii) the % Frequency Distribution with its % Frequency Polygon. Comment on the shape of the histogram. Take 5 equal classes.
(b) Construct a Cumulative Frequency Distribution table and draw the respective histogram and ogive. Take 5 equal classes.
(c) Find the number of major league baseball teams with a payroll of $50 million or less.
(d) Construct a Cumulative % Frequency Distribution table and draw the respective histogram and ogive. Take 5 equal classes.
(e) What % of major league baseball teams had a 1999 payroll of $62 million or less?

Solution
(a) First we decide on the number of classes, say 5. Then

$\text{class size} = \frac{92 - 15}{5} = \frac{77}{5} = 15.4.$

Take class size = 16. We will use 5 classes of equal size 16, with midpoints at 23, 39, 55, 71 and 87 respectively.

The required Frequency Distribution Table is given below.

Total Payroll (millions of dollars)
Class Limits          Class Boundaries   Class Midpoint   Class Width   Frequency
15 to less than 31    14.5-30.5          23               16            8
31 to less than 47    30.5-46.5          39               16            7
47 to less than 63    46.5-62.5          55               16            6
63 to less than 79    62.5-78.5          71               16            6
79 to less than 95    78.5-94.5          87               16            3
                                                                        Sum = 30

The resulting histogram and polygon are shown below.
[Histogram and frequency polygon: Frequency Distribution and Frequency Polygon of the Payroll of the Major Baseball Teams. x-axis: Payroll in millions of dollars (class midpoints 23 to 87); y-axis: Number of Teams]

Note: The shape of the overall histogram is right-skewed.

The complete frequency table, including (i) frequency, (ii) relative frequency and (iii) % frequency, is given below.

Total Payroll
(millions of dollars)   Class Midpoint   Frequency   Relative Frequency   % Frequency
15 to less than 31      23               8           8/30 = .2667         26.67%
31 to less than 47      39               7           7/30 = .2333         23.33%
47 to less than 63      55               6           6/30 = .2            20%
63 to less than 79      71               6           6/30 = .2            20%
79 to less than 95      87               3           3/30 = .1            10%
                                         Sum = 30    Sum = 1.0            Sum = 100%

The last two columns are calculated from the Frequency column according to the following notes.

Note 1: $\text{Relative frequency of a class} = \frac{\text{frequency of that class}}{\text{sum of all frequencies}}$

Note 2: $\text{Percentage} = \text{Relative frequency} \times 100$

From this frequency distribution table we can draw the histograms for the Frequency Distribution, the Relative Frequency Distribution and the % Frequency Distribution. Since the Relative Frequency Distribution and the % Frequency Distribution differ only in the units on the vertical axis, we only plot the % Frequency Distribution. The resulting histogram and polygon are shown below.

[Histogram and % polygon: % Frequency Distribution and % Polygon of the Payroll of the Major Baseball Teams. x-axis: Payroll in millions of dollars (class midpoints 23 to 87); y-axis: % Number of Teams]

Note: The shape of the overall histogram is right-skewed.

(b) Cumulative Frequency Distribution

Definition: A cumulative frequency distribution gives the total number of values that fall below the upper boundary of each class.

Using the total payrolls (rounded to millions of dollars) for all 30 major league baseball teams in the U.S.A. for 1999, we will illustrate the cumulative frequency distribution.
Example 1
Construct a cumulative frequency distribution for the total payrolls (rounded to millions of dollars) of all 30 major league baseball teams in the U.S.A. for 1999, which are given in the table below.

Total Payroll
(millions of dollars)   Class Boundaries   Frequency   Cumulative Frequency
15-31                   15-31              8           8
31-47                   15-47              7           8+7 = 15
47-63                   15-63              6           8+7+6 = 21
63-79                   15-79              6           8+7+6+6 = 27
79-95                   15-95              3           8+7+6+6+3 = 30
                                           Sum = 30

The lower limit of the first class, 15, is taken as the lower limit of every class in the cumulative frequency distribution. The upper limits of all classes are the same as in the frequency distribution table. To obtain the cumulative frequency of a class, add the frequency of that class to the frequencies of all the preceding classes. The cumulative frequencies are recorded in the fourth column, while the cumulative class boundaries are recorded in the second column.

Definition: An ogive is a curve drawn for the cumulative frequency distribution by joining with straight lines the dots marked above the upper boundaries of the classes, at heights equal to the cumulative frequencies of the respective classes.

The Cumulative Frequency Distribution and ogive of the Payroll of the Major Baseball Teams are given below.

[Histogram and ogive: Cumulative Distribution and Ogive of the Payroll of the Major Baseball Teams. x-axis: Payroll in millions of dollars; y-axis: Cumulative Number of Teams (0 to 30)]

(c) One advantage of the ogive is that it can be used to approximate the cumulative frequency for any interval.

Find the number of major league baseball teams with a payroll of $50 million or less.
Answer: approximately 17, read from the ogive.

(d) Cumulative Relative Frequency and Cumulative Percentage

The Cumulative Relative Frequency and the Cumulative Percentage are easily obtained from the cumulative frequency distribution using the following formulae.

$\text{Cumulative relative frequency} = \frac{\text{cumulative frequency of a class}}{\text{total observations in the data set}}$

$\text{Cumulative percentage} = \text{Cumulative relative frequency} \times 100$
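The two formulae above, and the act of reading a value off an ogive, can be sketched in Python as follows. This is a minimal illustration, not a standard routine: the function name `ogive_value` is introduced here, and it assumes the ogive is made of straight-line segments between class boundaries, starting at height 0 at the lower limit of the first class.

```python
def ogive_value(x, boundaries, heights):
    """Hypothetical helper: read the ogive at x by linear interpolation.

    boundaries: lower limit of the first class, then each upper limit.
    heights:    0, then the cumulative value at each upper limit.
    """
    if x <= boundaries[0]:
        return 0.0
    for i in range(1, len(boundaries)):
        if x <= boundaries[i]:
            x0, x1 = boundaries[i - 1], boundaries[i]
            y0, y1 = heights[i - 1], heights[i]
            # Straight line joining the two neighbouring ogive points.
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    return float(heights[-1])


# Payroll classes and cumulative frequencies from the table above.
boundaries = [15, 31, 47, 63, 79, 95]
cum_freq = [0, 8, 15, 21, 27, 30]
total = cum_freq[-1]

# Cumulative relative frequency and cumulative percentage per class.
cum_rel = [c / total for c in cum_freq]
cum_pct = [100 * r for r in cum_rel]

teams_50 = ogive_value(50, boundaries, cum_freq)  # teams with payroll <= 50
pct_62 = ogive_value(62, boundaries, cum_pct)     # % of teams with payroll <= 62

print(f"Teams at or below $50 million: about {teams_50:.1f}")
print(f"Percentage at or below $62 million: about {pct_62:.1f}%")
```

Straight-line interpolation gives about 16 teams at $50 million and about 69% at $62 million. Values read by eye from a drawn ogive can differ from these by a team or a percentage point, which is why such answers are quoted as approximate.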
We will illustrate the Cumulative Relative Frequency and the Cumulative Percentage using the example above.

Class Boundaries   Cumulative Relative Frequency   Cumulative Percentage
15-31              8/30 = .267                     26.7%
15-47              15/30 = .500                    50.0%
15-63              21/30 = .700                    70.0%
15-79              27/30 = .900                    90.0%
15-95              30/30 = 1.00                    100.0%

Note: The Cumulative Relative Frequency and the Cumulative Percentage plots have the same shape; only the units on the vertical axis differ. Hence we will only plot the Cumulative Percentage.

The Cumulative Percentage and the % ogive for the Payroll of Major League Baseball Teams are given below. The coordinates of the ogive are (15, 0), (31, 26.7), (47, 50), (63, 70), (79, 90) and (95, 100).

[Histogram and % ogive: Cumulative % Distribution and % Ogive of the Payroll of the Major Baseball Teams. x-axis: Payroll in millions of dollars; y-axis: Cumulative % Number of Teams (0 to 100)]

(e) What % of major league baseball teams had a 1999 payroll of $62 million or less?
Answer: approximately 70% of major league baseball teams had a 1999 payroll of $62 million or less.

Exercise 6
The incomes in 2001 of 16 randomly chosen people from the U.S. census who have high school diplomas but no third-level qualifications were, to the nearest thousand dollars,

12 43 20 5 67 32 19 6 43 47 21 40 31 25 22 24.

Find (i) the range, (ii) the five-number summary and (iii) the interquartile range (IQR).

Solution
Note that n = 16.

Step 1: Arrange the data in ascending order.

5 6 12 19 20 21 22 24 25 31 32 40 43 43 47 67

(i) The Range = 67 - 5 = 62.

(ii) The median M is at the $\frac{n+1}{2} = \frac{17}{2} = 8.5$th position in the list, i.e. between the 8th and 9th positions, so $M = \frac{24 + 25}{2} = 24.5$.

The first quartile $Q_1$ is the median of the values in the ordered list to the left of the location of the overall median M, that is, the median of

5 6 12 19 20 21 22 24

which is at the $\frac{n+1}{2} = \frac{9}{2} = 4.5$th position (where n = 8), i.e. $Q_1 = \frac{19 + 20}{2} = 19.5$. Alternatively, $Q_1$ is at the $\frac{n+1}{4} = \frac{17}{4} = 4.25$th position of the overall ordered list, i.e. between the 4th and 5th positions, which we again take as $Q_1 = \frac{19 + 20}{2} = 19.5$.

The third quartile $Q_3$ is the median of the values in the ordered list to the right of the location of the overall median M, that is, the median of

25 31 32 40 43 43 47 67

which is at the $\frac{n+1}{2} = \frac{9}{2} = 4.5$th position (where n = 8), i.e. between the 4th and 5th positions, so $Q_3 = \frac{40 + 43}{2} = 41.5$. Alternatively, $Q_3$ is at the $\frac{3(n+1)}{4} = \frac{3 \times 17}{4} = 12.75$th position of the overall ordered list, i.e. between the 12th and 13th positions, which we again take as $Q_3 = \frac{40 + 43}{2} = 41.5$.

Thus the five-number summary is Minimum = 5, $Q_1 = 19.5$, M = 24.5, $Q_3 = 41.5$ and Maximum = 67.

(iii) The Interquartile Range (IQR) = $Q_3 - Q_1$ = 41.5 - 19.5 = 22.0.

Minitab gives the following results.

Descriptive Statistics: non-graduate income

Variable    N     Mean     Median   TrMean   StDev    SE Mean
non-grad    16    28.56    24.50    27.50    16.37    4.09

Variable    Minimum   Maximum   Q1       Q3
non-grad    5.00      67.00     19.25    42.25

N.B. Q1 and Q3 have slightly different values (19.25 and 42.25 respectively) from the values we calculated (19.5 and 41.5).

Note: Some software packages use slightly different rules to calculate the quartiles, so computer results may differ slightly from the results calculated by the above rules. Here Minitab interpolates at the 4.25th and 12.75th positions, giving 19 + 0.25(20 - 19) = 19.25 and 40 + 0.75(43 - 40) = 42.25. The difference is small and can usually be ignored.

Stem-and-Leaf Display: non-graduate income

Stem-and-leaf of non-grad   N = 16
Leaf Unit = 1.0

  2    0 | 56
  4    1 | 29
 (5)   2 | 01245
  7    3 | 12
  5    4 | 0337
  1    5 |
  1    6 | 7

[Boxplot: non-graduate income. Axis: 0 to 70]

A boxplot (box-and-whisker plot) is a graph of the five-number summary and is of the following form.

[Boxplot: Box plot of the numbers 5 7 9 10 11 13 15. Axis: 5 to 15]

The central box spans the quartiles $Q_1$ and $Q_3$, a line in the box marks the median M, and lines (the whiskers) extend from the box out to the smallest and largest data values.

Exercise 7
The incomes of 15 people with bachelor's degrees, chosen at random from the U.S. Census Bureau data in March 2002, were, to the nearest thousand dollars,

110 25 50 50 55 30 35 30 4 32 50 30 31 74 60.

Draw the boxplot for this set of data.

Solution
Find the five-number summary for this data. In ascending order the data are

4 25 30 30 30 31 32 35 50 50 50 55 60 74 110

so the five-number summary is Minimum = 4, $Q_1 = 30$, M = 35, $Q_3 = 55$ and Maximum = 110, and the boxplot is

[Boxplot: Income of Graduates. Axis: Income to the nearest thousand dollars, 0 to 100]

Note: The asterisk in the boxplot indicates that the value 110 may be an outlier.

Exercise 8
The incomes in 2001 of 16 randomly chosen people from the U.S. census who have high school diplomas but no third-level qualifications were, to the nearest thousand dollars,

12 43 20 5 67 32 19 6 43 47 21 40 31 25 22 24.

Draw the boxplot for this set of data and compare it with the boxplot for graduates.

Solution
Find the five-number summary for this data. From Exercise 6, the five-number summary is Minimum = 5, $Q_1 = 19.5$, M = 24.5, $Q_3 = 41.5$ and Maximum = 67, and the boxplot is

[Boxplot: Income of Non-Graduates. Axis: Income to the nearest thousand dollars, 0 to 70]

[Boxplot: Income of Graduates. Axis: Income to the nearest thousand dollars, 0 to 100]

Because boxplots show less detail than histograms or stemplots, they are best used for side-by-side comparison of more than one distribution, as in the figure. Be sure to include a numerical scale in the graph. When you look at a boxplot, first locate the median, which marks the center of the distribution. Then look at the spread. The quartiles show the spread of the middle half of the data, and the extremes (the smallest and largest observations) show the spread of the entire data set.

We see from the figure that holders of a bachelor's degree as a group earn considerably more than people with no education beyond high school. For example, the first quartile for college graduates (30) is higher than the median for high school graduates (24.5). The spread of the middle half of incomes (the box in the boxplot) is roughly the same for both groups: the IQR is 25 for graduates and 22 for non-graduates.
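The five-number summaries being compared here can be reproduced with a short Python sketch. It assumes the quartile rule used in this tutorial (the quartiles are the medians of the values to the left and right of the overall median's location); the function names are introduced only for illustration, and library routines or packages such as Minitab may use other rules and return slightly different quartiles.

```python
def median(sorted_values):
    """Median of a list that is already sorted in ascending order."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def five_number_summary(data):
    """Hypothetical helper: (minimum, Q1, median, Q3, maximum).

    Q1 and Q3 are the medians of the values to the left and right of
    the overall median's location; for odd n the median itself is
    excluded from both halves.
    """
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower = values[:half]
    upper = values[half:] if n % 2 == 0 else values[half + 1:]
    return (values[0], median(lower), median(values), median(upper), values[-1])


non_graduates = [12, 43, 20, 5, 67, 32, 19, 6, 43, 47, 21, 40, 31, 25, 22, 24]
graduates = [110, 25, 50, 50, 55, 30, 35, 30, 4, 32, 50, 30, 31, 74, 60]

for label, incomes in (("Non-graduates", non_graduates), ("Graduates", graduates)):
    low, q1, med, q3, high = five_number_summary(incomes)
    print(f"{label}: min={low}, Q1={q1}, M={med}, Q3={q3}, max={high}, IQR={q3 - q1}")
```

Printing both summaries side by side makes the comparison in the text easy to verify, for instance that the graduates' first quartile lies above the non-graduates' median.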
A boxplot also gives an indication of the symmetry or skewness of a distribution. In a symmetric distribution, the first and third quartiles are equally distant from the median. In most distributions that are skewed to the right, on the other hand, the third quartile will be farther above the median than the first quartile is below it. That is the case for both distributions in the figure. The extremes behave the same way, but remember that they are just single observations and may say little about the distribution as a whole.

Exercise 9
Given the data set 11, 10, 9, 18, 11, 8, 3, calculate (i) the range, (ii) the median, (iii) the mean and (iv) the variance $s^2$ and the standard deviation $s$.

Solution
Put the data in ascending order: 3, 8, 9, 10, 11, 11, 18.

Range = 18 - 3 = 15.

Median = 10.

Mean $\bar{x} = \frac{3 + 8 + 9 + 10 + 11 + 11 + 18}{7} = \frac{70}{7} = 10$.

Data   Deviation     Squared Deviation
3      3-10 = -7     (-7)^2 = 49
8      8-10 = -2     (-2)^2 = 4
9      9-10 = -1     (-1)^2 = 1
10     10-10 = 0     (0)^2 = 0
11     11-10 = 1     (1)^2 = 1
11     11-10 = 1     (1)^2 = 1
18     18-10 = 8     (8)^2 = 64
       Sum = 0       Sum = 120

The variance is

$s^2 = \frac{(3-10)^2 + (8-10)^2 + (9-10)^2 + (10-10)^2 + (11-10)^2 + (11-10)^2 + (18-10)^2}{6} = \frac{120}{6} = 20.$

The standard deviation is $s = \sqrt{s^2} = \sqrt{20} = 4.4721359 \approx 4.47$.

Exercise 10
Given the data set 75, 10, 30, 10, 15, 30, calculate (i) the range, (ii) the median, (iii) the mean and (iv) the variance $s^2$ and the standard deviation $s$.

Solution
Put the data in ascending order: 10, 10, 15, 30, 30, 75.

Range = 75 - 10 = 65.

Median = $\frac{15 + 30}{2} = \frac{45}{2} = 22.5$.

Mean $\bar{x} = \frac{10 + 10 + 15 + 30 + 30 + 75}{6} = \frac{170}{6} = 28.333333 \approx 28.33$.

Data   Deviation            Squared Deviation
10     10-28.33 = -18.33    (-18.33)^2 = 335.9889
10     10-28.33 = -18.33    (-18.33)^2 = 335.9889
15     15-28.33 = -13.33    (-13.33)^2 = 177.6889
30     30-28.33 = 1.67      (1.67)^2 = 2.7889
30     30-28.33 = 1.67      (1.67)^2 = 2.7889
75     75-28.33 = 46.67     (46.67)^2 = 2178.0889
       Sum = 0.02 ≈ 0       Sum = 3033.3334

The sum of the deviations is 0.02 rather than exactly 0 only because the mean was rounded to 28.33.

The variance is

$s^2 = \frac{(10-28.33)^2 + (10-28.33)^2 + (15-28.33)^2 + (30-28.33)^2 + (30-28.33)^2 + (75-28.33)^2}{5} = \frac{3033.3334}{5} = 606.6667.$

The standard deviation is $s = \sqrt{s^2} = \sqrt{606.6667} = 24.630604 \approx 24.63$.
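The hand calculations in Exercises 9 and 10 can be checked with a few lines of Python. This is a minimal sketch: the function name `sample_variance` is introduced here for illustration, and it uses the same n - 1 divisor as the solutions above, without rounding the mean before squaring.

```python
import math


def sample_variance(data):
    """Sample variance: sum of squared deviations divided by n - 1."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)


exercise_9 = [11, 10, 9, 18, 11, 8, 3]
exercise_10 = [75, 10, 30, 10, 15, 30]

for label, data in (("Exercise 9", exercise_9), ("Exercise 10", exercise_10)):
    values = sorted(data)
    data_range = values[-1] - values[0]
    mean = sum(data) / len(data)
    variance = sample_variance(data)
    std_dev = math.sqrt(variance)
    print(f"{label}: range={data_range}, mean={mean:.2f}, "
          f"variance={variance:.4f}, s={std_dev:.2f}")
```

Because the mean is not rounded to two decimals here, the Exercise 10 variance can differ from the hand-worked value in the last decimal place; the standard deviations agree to two decimals.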