Lecture on Statistics
By Dr. Brendan Browne

Introduction

Every day, managers must make sense of the facts, or data, that businesses accumulate through their ongoing activities. Information is obtained by arranging, summarizing or transforming these data in some logical manner, and information in turn forms the basis of rational decision making. When the information comes in the form of numerical data it is said to be quantitative, and the methods needed to make sense of such data are quantitative methods, e.g. coordinate geometry. An element of uncertainty or randomness is often associated with quantitative data. The appropriate quantitative methods for such situations are statistical methods.

Statistics is the science that processes and analyzes data in order to provide managers with useful information to aid decision making. Descriptive statistics focuses on the collection, summarization and characterization of a set of data. Inferential statistics estimates a characteristic of a set of data, or helps uncover patterns in data sets that are unlikely to occur by chance. The mathematics of probability theory forms the foundation of inferential statistics. Inferential methods work with samples, which are portions of an entire set of data, rather than with the complete set itself, which statisticians call the population. Inferential methods use the sample data to calculate summary measures (called statistics) that decision-makers can use to estimate the characteristics of the entire population (called parameters).

Today, advances in computer processing have made practical the application of computationally complex inferential methods that were beyond the computational capabilities available to early statistical researchers. In this course we use the widely used MINITAB statistical package to do our statistical calculations.
Picturing Distributions with Graphs

Statistics is a group of methods used to collect, analyze, present and interpret data and to make decisions. The volume of data available to us is overwhelming. Each March, for example, the United States Census Bureau collects economic and employment data from more than 200,000 people. From the bureau's Web site you can choose to examine more than 300 items of data for each person (and more for households): child care assistance, child care support, hours worked, weekly earnings, and much more. To make sense of such large volumes of data we must first organize the data in a systematic manner. Before we give methods for organizing large volumes of data we need some definitions.

Definition: Individuals or observations are the objects described by a set of data. Individuals or observations may be people, but they may also be animals or things.

Any set of data contains information about some group of individuals or observations. The information is organized in variables.

Definition: A variable is any characteristic of an individual or observation. A variable can take different values for different individuals or observations.

In statistics there are in general two types of variables, namely categorical (or qualitative) variables and quantitative variables.

Definition: A categorical variable places an individual or observation into one of several groups or categories.

Definition: A quantitative variable takes numerical values for which arithmetic operations such as adding and averaging make sense.

Definition: The distribution of a variable tells us what values it takes and how often it takes these values.

Example
Here is part of the data in which a professor records information about student performance in a course.
A               B        C        D         E        F           G      H
Name            School   Major    HW total  Midterm  Final Exam  Total  Grade
Smith, John     Edu      EdPsych  95        80       88          263    A
Arthur, Brenda  Law      Psych    32        61       54          147    D
Fox, Des        Science  Biol     74        68       70          212    B
Boggs, Joan     Science  Math     86        75       94          255    A

The individuals described are the students. Each row records data on one individual. Each column contains the values of one variable for all the individuals. In addition to the student's name, there are 7 variables. School and major are categorical variables. Scores on homework, the midterm and the final exam, and the total score, are quantitative. Grade is recorded as a category (A, B, and so on), but each grade also corresponds to a quantitative score (A = 4, B = 3, and so on) that is used to calculate student grade point averages.

Most data tables follow this format: each row is an individual, and each column is a variable. This data set appears in a spreadsheet program that has rows and columns ready for your use. Spreadsheets are commonly used to enter and transmit data and to do simple calculations such as adding homework, midterm and final scores to get total points.

Example 1: Fuel economy.
Here is a small part of a data set that describes the fuel economy (in miles per gallon) of 2002 model motor vehicles.

Make and model  Vehicle type           Transmission type  Number of cylinders  City MPG  Highway MPG
Acura NSX       Two-seater             Automatic          6                    17        24
Audi A4         Compact                Manual             4                    22        31
Buick Century   Midsize                Automatic          6                    20        29
Dodge Ram 1500  Standard pickup truck  Automatic          8                    15        20

(a) What are the individuals in this data set?
(b) For each individual, what variables are given? Which of these variables are categorical and which are quantitative?

Solution
(a) The individuals are the 2002 model motor vehicles.
(b) Make and model, Vehicle type and Transmission type are categorical variables, while Number of cylinders, City MPG and Highway MPG are quantitative variables.

Exercise 1
Which of the following variables are categorical (or qualitative) and which are quantitative?
(i) The color of cars involved in several severe accidents.
(ii) The length of time required for rats to move through a maze.
(iii) The classification of police administration as city, county or state.
(iv) The ratings given to pizza in a taste test as poor, good or excellent.
(v) The number of times subjects in a sociological research study have been married.

Exploratory Data Analysis

Statistical tools and ideas help us to examine data in order to describe their main features. This examination is called exploratory data analysis. The two basic strategies that help us organize our exploration of data are:
(i) Examine each variable by itself and, if there is more than one variable, study the relationships among the variables.
(ii) Begin with a graph or graphs that describe the data. Then add numerical summaries for a more complete description.

The proper choice of graph depends on the nature of the variable. We shall first study categorical variables. The distribution of a categorical variable lists the categories and gives either the count or the percent of individuals who fall in each category.

Categorical variables: pie charts, bar graphs

The main graphs that we use for categorical variables are (i) bar graphs and (ii) pie charts.

Definition: A bar graph is a graph made up of bars whose heights represent the frequencies or percentages of the respective categories.

Note: The bar graphs for the relative frequencies and percentages of the categories are drawn in the same way, simply by marking relative frequencies or percentages, instead of class frequencies, on the vertical axis.

Example 1: A sample was taken of 25 high school seniors who were planning to go to university. Each student was asked which of the following majors he or she intended to study: Business, Economics, Management Information Systems (MIS), Behavioural Science (BS), Other.
The responses of these students were as follows:

Economics  Business  BS     Other  Economics
MIS        Business  BS     Business  MIS
Economics  Other     MIS    MIS    Other
Business   Business  Other  Other  Other
MIS        Business  Other  Other  MIS

Construct a frequency distribution table and a relative frequency and percentage table for this categorical data. Hence construct the corresponding bar graphs.

Step 1: Construct a tally and class frequency table for the given categorical data.

Major      Tally      Frequency
Business   ||||/ |    6
Economics  |||        3
MIS        ||||/ |    6
BS         ||         2
Other      ||||/ |||  8
                      Sum = 25

Step 2: Construct a relative frequency and percentage table from the table above, using

\[ \text{Relative frequency of a category} = \frac{\text{Frequency of that category}}{\text{Sum of all frequencies}} \]

\[ \text{Percentage of a category} = (\text{Relative frequency}) \times 100 \]

Major      Relative frequency  Percentage %
Business   6/25 = .24          .24(100) = 24%
Economics  3/25 = .12          .12(100) = 12%
MIS        6/25 = .24          .24(100) = 24%
BS         2/25 = .08          .08(100) = 8%
Other      8/25 = .32          .32(100) = 32%
                               Sum = 100%

From these tables we can draw the following bar graphs.

[Figure: bar graph "Student Choice of University Course". Vertical axis: Number of students. Horizontal axis: Courses (BS, Business, Economics, MIS, Other).]

[Figure: bar graph "Student Choice of University Course", categories in decreasing order. Vertical axis: Number of students. Horizontal axis: Courses (Other, Business, MIS, Economics, BS).]

[Figure: relative frequency bar graph "Student Choice of University Course". Vertical axis: Relative frequency. Horizontal axis: Courses (Other, Business, MIS, Economics, BS).]

[Figure: percentage bar graph "Student Choice of University Course". Vertical axis: Percentage. Horizontal axis: Courses (Other, Business, MIS, Economics, BS).]

Pie Charts: A pie chart is more commonly used to display percentages, although it can also be used to display frequencies or relative frequencies. The whole pie (or circle) represents the total sample or population. The pie is divided into portions that represent the percentages of the population or sample belonging to the different categories.
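Returning to Example 1, the frequency, relative frequency and percentage columns (the quantities that a bar graph or pie chart displays) can be reproduced in a few lines of code. The following is a minimal sketch in Python, which is an assumption here since the course itself uses MINITAB. The names `responses` and `frequency_table` are illustrative and are not part of the lecture.

```python
from collections import Counter

# The 25 student responses from Example 1.
responses = (
    "Economics Business BS Other Economics MIS Business BS Business MIS "
    "Economics Other MIS MIS Other Business Business Other Other Other "
    "MIS Business Other Other MIS"
).split()


def frequency_table(values):
    """Return {category: (frequency, relative frequency, percentage)}."""
    counts = Counter(values)
    total = sum(counts.values())
    return {
        category: (count, count / total, 100 * count / total)
        for category, count in counts.items()
    }


table = frequency_table(responses)

# Print the rows in the same order as the tables in the notes.
for major in ["Business", "Economics", "MIS", "BS", "Other"]:
    freq, rel, pct = table[major]
    print(f"{major:<10} {freq:>3} {rel:>6.2f} {pct:>5.0f}%")
```

Each printed row should agree with the corresponding row of the tally and relative frequency tables in Example 1.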
Definition: A pie chart is a circle divided into portions that represent the relative frequencies or percentages of a population or sample belonging to different categories.

To construct a pie chart: A circle contains 360 degrees. We multiply 360 by the relative frequency of each category to obtain the size, in degrees, of the angle that represents that category. For the student choice of university course data above, the calculation of the angle sizes for the various categories is shown in the table below.

Major      Percentage %  Angle size (degrees)
Business   24            360 x .24 = 86.4
Economics  12            360 x .12 = 43.2
MIS        24            360 x .24 = 86.4
BS         8             360 x .08 = 28.8
Other      32            360 x .32 = 115.2
           Sum = 100     Sum = 360

The required pie chart is shown below.

[Figure: pie chart "Student Choice of University Course". Slices: Economics (3, 12.0%), Business (6, 24.0%), MIS (6, 24.0%), BS (2, 8.0%), Other (8, 32.0%).]

Example 2
The breakdown of American municipal waste in 2000, in millions of tons, is given in the following table.

Material                   Weight (millions of tons)
Food scraps                25.9
Glass                      12.8
Metals                     18.0
Paper, paperboard          86.7
Plastics                   24.7
Rubber, leather, textiles  15.8
Wood                       12.7
Yard trimmings             27.7
Other                      7.5
Total                      231.9

Note: The weights add to 231.8 and not to the 231.9 given in the table, because of roundoff error.

Construct a percentage distribution table for the above categorical data and draw (i) a bar chart and (ii) a pie chart for this percentage distribution.

Solution
We calculate each category as a percentage of the total waste. The percentage distribution is given below.

Material                   Weight (millions of tons)  Percentage of total
Food scraps                25.9                       11.2%
Glass                      12.8                       5.5%
Metals                     18.0                       7.8%
Paper, paperboard          86.7                       37.4%
Plastics                   24.7                       10.7%
Rubber, leather, textiles  15.8                       6.8%
Wood                       12.7                       5.5%
Yard trimmings             27.7                       11.9%
Other                      7.5                        3.2%
Total                      231.9                      100.0%

(i) From this distribution table we can draw the frequency bar graph and the % frequency bar chart, as shown below.
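The percentage column of the waste table, and the angle sizes of the earlier pie chart, follow from two one-line formulas. The sketch below is in Python (an assumption, since the course uses MINITAB), and the function names `percentage_distribution` and `pie_angles` are hypothetical helpers introduced only for illustration.

```python
def percentage_distribution(amounts, total=None):
    """Percentage of the total for each category, rounded to one decimal place."""
    if total is None:
        total = sum(amounts.values())
    return {name: round(100 * value / total, 1) for name, value in amounts.items()}


def pie_angles(percentages):
    """Angle in degrees for each category: 360 times the relative frequency."""
    return {name: round(360 * pct / 100, 1) for name, pct in percentages.items()}


# American municipal waste in 2000, in millions of tons (Example 2).
waste = {
    "Food scraps": 25.9,
    "Glass": 12.8,
    "Metals": 18.0,
    "Paper, paperboard": 86.7,
    "Plastics": 24.7,
    "Rubber, leather, textiles": 15.8,
    "Wood": 12.7,
    "Yard trimmings": 27.7,
    "Other": 7.5,
}

# The notes divide by the published total of 231.9, not the column sum of 231.8.
waste_pct = percentage_distribution(waste, total=231.9)

# Percentages for the student majors of Example 1.
major_pct = {"Business": 24, "Economics": 12, "MIS": 24, "BS": 8, "Other": 32}
major_angles = pie_angles(major_pct)

for name, pct in waste_pct.items():
    print(f"{name:<26} {pct:>5.1f}%")
for name, angle in major_angles.items():
    print(f"{name:<10} {angle:>6.1f} degrees")
```

Passing the total explicitly is a design choice that mirrors the note about roundoff: it lets the percentages match the table even though the individual weights sum to 231.8.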
[Figure: bar graph "Breakdown (in millions of tons) of American waste in 2000". Vertical axis: Weight of waste in millions of tons. Horizontal axis: Waste (paper, yard, food, plastic, metals, rubber, glass, wood, other).]

[Figure: bar graph "% breakdown of American waste in 2000". Vertical axis: Percentage. Horizontal axis: Waste (paper, yard, food, plastic, metals, rubber, glass, wood, other).]

(ii) From the % frequency distribution table we can also construct the % pie chart for the above data, using the method outlined above. Two versions of the pie chart are given below.

[Figure: pie chart "% of American municipal waste in 2000", with the categories in the order given in the table.]

[Figure: pie chart "% of American municipal waste in 2000", with the categories in decreasing order anticlockwise.]

Exercise 2: The areas of the various continents of the world, in millions of square kilometres, are presented in the table below.

Continent      Area   %
Africa         30.3   22.7
Asia           26.9   20.2
Europe         4.9    3.7
North America  24.3   18.2
Oceania        8.5    6.4
South America  17.9   13.4
U.S.S.R.       20.5   15.4
Total          133.3  100.0

Display this data using (i) a bar chart and (ii) a pie chart.

Exercise 3: The breakdown of total dollars spent on business trips in the United States is estimated as follows: (a) 41% on air fares, (b) 22% on lodgings, (c) 12% on meals, (d) 8% on car rentals and (e) the remainder on other expenses.
(i) Construct a pie chart to show this information.
(ii) Construct a bar chart to show this information.

Quantitative variables

The two main tools for organizing and displaying quantitative data are (i) histograms and (ii) stem-and-leaf displays.
Definition: A histogram is a graph in which classes (groups of observations or individuals) are marked on the horizontal axis and frequencies, relative frequencies or percentages are marked on the vertical axis. The frequencies, relative frequencies or percentages are represented by the heights of the bars. In a histogram the bars are drawn adjacent to each other.

A graph of a distribution is clearer if nearby values are grouped together, and the most common graph of the distribution of one quantitative variable is a histogram. To draw a histogram we first have to construct a frequency distribution table. Data presented in the form of a frequency distribution table are called grouped data.

Definition: A frequency distribution for quantitative data lists all the classes (or groups) and the number of values that belong to each class.

To construct a frequency distribution table, and hence a histogram, we first have to decide how many groups or classes to divide the data set into. Usually the number of classes varies from 5 to 20, depending on the number of data values in the data set. It is preferable to have more classes as the size of the data set increases. Too few classes will give a "skyscraper" graph, with all values in a few classes with tall bars. Too many will produce a "pancake" graph, with most classes having one or no observations. Neither choice will give a good picture of the shape of the distribution. You must use your judgment in choosing classes to display the shape. Statistics software such as MINITAB will choose the classes for you. The software's choice is usually a good one, but you can change it if you want. The decision about the number of classes is arbitrary and is made by the data organizer. A rough guide is given below.

Rules for constructing a Frequency Distribution

Rule 1: Class Intervals
Class intervals must be inclusive and non-overlapping.
Each observation or individual must belong to one and only one class, and class boundaries must not overlap.

Rule 2: Number of Intervals (Rough Guide)

Sample size    Number of classes
Fewer than 50  5-6 classes
50-100         6-8 classes
Over 100       8-10 classes

Rule 3: Interval Width
The approximate class width is

\[ \text{width} = \frac{\text{largest data value} - \text{smallest data value}}{\text{number of groups}} \]

The interval width is often rounded to the most convenient integer.

Definition: Class boundaries are the end data values of the groups or classes into which the data are divided.

Definition: Class width = Upper boundary - Lower boundary.

Definition:

\[ \text{Midpoint} = \frac{\text{Lower class limit} + \text{Upper class limit}}{2} \]

where both limits belong to the same class.

We will illustrate all these definitions by preparing a frequency table, and drawing the histograms of the frequency distribution and the % frequency distribution, for the following example.

Example 1
Prepare a frequency table for the following data. Hence draw a histogram of the frequency distribution and of the % frequency distribution for this data. The data below show the weights (to the nearest gram) of 40 bags of flour.

Data
501 500 498 498 490 513 493 494
502 499 505 499 503 494 503 501
501 505 502 505 507 503 507 496
496 502 500 488 499 511 505 503
499 501 499 499 499 500 498 507

Solution
The range = 513 - 488 = 25. Taking 6 classes,

\[ \text{Class width} = \frac{25}{6} \approx 4.17 \approx 5. \]

Note: Any convenient number equal to or less than the smallest data value can be used as the lower limit of the first class.

The required frequency distribution table is given below.

Weights to nearest gram
(class boundaries)     Tally                    Class width  Class midpoint  Frequency
488 to less than 493   ||                       5            490.5           2
493 to less than 498   ||||/                    5            495.5           5
498 to less than 503   ||||/ ||||/ ||||/ ||||/  5            500.5           20
503 to less than 508   ||||/ ||||/ |            5            505.5           11
508 to less than 513   |                        5            510.5           1
513 to less than 518   |                        5            515.5           1
                                                                             Sum = 40

Example: For the second class,
Class width = Upper class boundary - Lower class boundary = 498 - 493 = 5.

\[ \text{Class midpoint} = \frac{493 + 498}{2} = 495.5 \]

The equal class widths and the class midpoints are needed only to draw the histogram, or to enter the classes in a statistical software package such as MINITAB.

From the above frequency distribution table we can draw the following histogram.

[Figure: histogram "Frequency Distribution of number of flour bags". Vertical axis: Number of flour bags. Horizontal axis: Weights to nearest gram, marked at the class midpoints 490.5 to 515.5.]

N.B. The shape of the histogram is single peaked and approximately symmetrical.

A polygon is another device that can be used to present quantitative data in graphic form.

Definition: A polygon is a graph formed by joining the midpoints of the tops of successive bars in a histogram with straight lines. Two extra classes with zero frequency are added, one at each end, and their midpoints are marked. When the histogram is a frequency histogram, the graph is called a frequency polygon.

[Figure: "Frequency Distribution and Frequency Polygon of the weights of Flour Bags". Vertical axis: Number of flour bags. Horizontal axis: Weights to nearest gram.]

We can also draw a histogram of the percentage (%) frequency, and a % frequency polygon, by constructing a frequency table for these frequencies. A complete frequency table showing (i) the frequency and (ii) the % frequency is given below for the above data set.

Weights to nearest gram
(class boundaries)     Tally                    Class midpoint  Frequency  Relative frequency  Percentage % frequency
488 to less than 493   ||                       490.5           2          2/40 = .05          5%
493 to less than 498   ||||/                    495.5           5          5/40 = .125         12.5%
498 to less than 503   ||||/ ||||/ ||||/ ||||/  500.5           20         20/40 = .5          50%
503 to less than 508   ||||/ ||||/ |            505.5           11         11/40 = .275        27.5%
508 to less than 513   |                        510.5           1          1/40 = .025         2.5%
513 to less than 518   |                        515.5           1          1/40 = .025         2.5%
                                                                Sum = 40   Sum = 1.0           Sum = 100%

The last two columns are calculated from the fourth column according to the following notes.

Note 1:

\[ \text{Relative frequency of a class} = \frac{\text{frequency of that class}}{\text{sum of all frequencies}} \]

Note 2:

\[ \text{Percentage} = (\text{Relative frequency}) \times 100 \]

From this frequency distribution table we can draw the histogram for the % frequency distribution and the % frequency polygon. This is shown below.

Definition: A percentage polygon is a graph formed by joining the midpoints of the tops of successive bars in a percentage (%) frequency histogram with straight lines. Two extra classes with zero frequency are added, one at each end, and their midpoints are marked.

[Figure: "% Frequency Distribution and % Frequency Polygon of the weights of Flour Bags". Vertical axis: % Number of flour bags. Horizontal axis: Weights to nearest gram.]

N.B. The shape of the histogram is single peaked and approximately symmetrical.

Cumulative Frequency Distribution

Definition: A cumulative frequency distribution gives the total number of values that fall below the upper boundary of each class.

We will illustrate the concepts of (i) the cumulative frequency distribution and (ii) the cumulative % frequency distribution using the data for the weights of the 40 flour bags above.

Example 1
(i) Construct a cumulative frequency distribution table and draw its corresponding histogram for the weights of the 40 flour bags above. (ii) Also construct a cumulative % frequency distribution table and draw its corresponding histogram for the same data.

Solution
The complete frequency distribution table for this example is the one given above. Adding up its frequency column class by class gives the cumulative frequency table.

Weights to nearest gram
(class boundaries)     Class midpoint  Frequency  Cumulative frequency
488 to less than 493   490.5           2          2
493 to less than 498   495.5           5          2+5 = 7
498 to less than 503   500.5           20         2+5+20 = 27
503 to less than 508   505.5           11         2+5+20+11 = 38
508 to less than 513   510.5           1          2+5+20+11+1 = 39
513 to less than 518   515.5           1          2+5+20+11+1+1 = 40
                                       Sum = 40

Note: In a cumulative frequency distribution, the lower boundary of the first class (488) is taken as the lower boundary of every class. The upper boundaries of the classes are the same as in the frequency distribution table. To obtain the cumulative frequency of a class, add the frequency of that class to the frequencies of all the preceding classes. The cumulative frequencies are recorded in the fourth column, while the class boundaries are recorded in the first column.

From this table we can draw the histogram of the cumulative frequency distribution, which is given below.

[Figure: histogram "Cumulative Frequency Distribution of number of flour bags". Vertical axis: Cumulative number of flour bags. Horizontal axis: Weights to nearest gram.]

The advantage of the cumulative frequency histogram or table is that it can answer the question "How many observations fall below the upper limit of a class?"

Example: How many bags of flour weigh less than 503 grams? Answer: 27, the cumulative frequency of the class 498 to less than 503.

Cumulative Relative Frequency and Cumulative Percentage

The cumulative relative frequency and the cumulative percentage are easily obtained from the cumulative frequency distribution using the following formulae.

\[ \text{Cumulative relative frequency} = \frac{\text{cumulative frequency of a class}}{\text{total observations in the data set}} \]

\[ \text{Cumulative percentage} = (\text{Cumulative relative frequency}) \times 100 \]

We will illustrate the cumulative relative frequency and the cumulative percentage using the example above.
The cumulative relative frequency and cumulative percentage distribution table for this data is:

Weights to nearest gram
(class boundaries)     Class midpoint  Cumulative frequency  Cumulative relative frequency  Cumulative % frequency
488 to less than 493   490.5           2                     2/40 = .05                     5%
493 to less than 498   495.5           2+5 = 7               7/40 = .175                    17.5%
498 to less than 503   500.5           2+5+20 = 27           27/40 = .675                   67.5%
503 to less than 508   505.5           2+5+20+11 = 38        38/40 = .95                    95%
508 to less than 513   510.5           2+5+20+11+1 = 39      39/40 = .975                   97.5%
513 to less than 518   515.5           2+5+20+11+1+1 = 40    40/40 = 1.0                    100%

Note: The cumulative relative frequency and the cumulative percentage give the same picture, except that the vertical axes have different units. Hence we will only plot the cumulative percentage.

From this table we can draw the histogram for the cumulative percentage, which is given below.

[Figure: histogram "Cumulative % Frequency Distribution of number of flour bags". Vertical axis: Cumulative % number of flour bags. Horizontal axis: Weights to nearest gram.]

Definition: An ogive (pronounced "ojive"), or cumulative frequency polygon, is a curve drawn for the cumulative frequency distribution by joining with straight lines the dots marked above the upper boundaries of the classes at heights equal to the cumulative frequencies of the respective classes. The ogive starts at the lower boundary of the first class and ends at the upper boundary of the last class.

[Figure: "Cumulative Frequency Distribution and Cumulative Frequency Polygon or Ogive of the weights of Flour Bags". Vertical axis: Cumulative number of flour bags. Horizontal axis: Weights to nearest gram.]

One advantage of an ogive is that it can be used to approximate the cumulative frequency for any interval. For example, the number of bags with weights less than or equal to 504 grams is approximately 29: the ogive is at 27 above 503 grams and rises by 11 over the next 5 grams, so at 504 grams it reads 27 + 11/5, or about 29.
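The frequency, cumulative frequency and cumulative percentage columns for the flour bags can be rebuilt directly from the 40 raw weights, and an ogive can be read by straight-line interpolation between class boundaries. The following Python sketch is only an illustration under stated assumptions: Python is not the course software, and the names `frequency_distribution`, `cumulative` and `ogive` are hypothetical.

```python
# Weights (to the nearest gram) of the 40 bags of flour from the example.
weights = [
    501, 500, 498, 498, 490, 513, 493, 494,
    502, 499, 505, 499, 503, 494, 503, 501,
    501, 505, 502, 505, 507, 503, 507, 496,
    496, 502, 500, 488, 499, 511, 505, 503,
    499, 501, 499, 499, 499, 500, 498, 507,
]

LOWER, WIDTH, N_CLASSES = 488, 5, 6  # first lower boundary, class width, classes


def frequency_distribution(data, lower, width, n_classes):
    """Count the values in each class 'lower + k*width to less than lower + (k+1)*width'."""
    counts = [0] * n_classes
    for x in data:
        k = int((x - lower) // width)
        if not 0 <= k < n_classes:
            raise ValueError(f"{x} falls outside the classes")
        counts[k] += 1
    return counts


def cumulative(counts):
    """Running totals: the number of values below each upper class boundary."""
    totals, running = [], 0
    for count in counts:
        running += count
        totals.append(running)
    return totals


def ogive(x, lower, width, cum):
    """Read the ogive at x by linear interpolation between class boundaries."""
    boundaries = [lower + k * width for k in range(len(cum) + 1)]
    heights = [0] + cum  # the ogive starts at zero at the first lower boundary
    if x <= boundaries[0]:
        return 0.0
    if x >= boundaries[-1]:
        return float(cum[-1])
    k = int((x - lower) // width)
    fraction = (x - boundaries[k]) / width
    return heights[k] + fraction * (heights[k + 1] - heights[k])


freq = frequency_distribution(weights, LOWER, WIDTH, N_CLASSES)
cum = cumulative(freq)
cum_pct = [100 * c / len(weights) for c in cum]

print(freq)     # [2, 5, 20, 11, 1, 1]
print(cum)      # [2, 7, 27, 38, 39, 40]
print(cum_pct)  # [5.0, 17.5, 67.5, 95.0, 97.5, 100.0]
print(ogive(503, LOWER, WIDTH, cum))  # 27.0
```

Between class boundaries the ogive is only a straight-line estimate. It assumes the values are spread evenly within each class, so it will not in general equal the exact count obtained from the raw data.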
Definition: A % ogive, or % cumulative frequency polygon, is a curve drawn for the % cumulative frequency distribution by joining with straight lines the dots marked above the upper boundaries of the classes at heights equal to the % cumulative frequencies of the respective classes. The % ogive starts at the lower boundary of the first class and ends at the upper boundary of the last class.

[Figure: "% Cumulative Frequency Distribution and % Cumulative Frequency Polygon or % Ogive of the weights of Flour Bags". Vertical axis: % cumulative number of flour bags. Horizontal axis: Weights to nearest gram.]

Note: From the % ogive, 50% of the bags have a weight of approximately 502 grams or less.

EXAMINING A DISTRIBUTION

Making a statistical graph is not an end in itself. The purpose of the graph is to help us understand the data. After you make a graph, always ask, "What do I see?" Once you have displayed a distribution, you can see its important features as follows.

In any graph of data, look for the overall pattern and for striking deviations from that pattern. You can describe the overall pattern of a histogram by its (i) shape, (ii) center and (iii) spread. An important kind of deviation is an outlier, an individual value that falls outside the overall pattern.

We will learn how to describe center and spread numerically later. For now, we can describe the center of a distribution by its midpoint or median, the value with roughly half the observations taking smaller values and half taking larger values. We can describe the spread of a distribution by giving the range, that is, the largest value minus the smallest value.

Example: Examine the histogram which we obtained for the data on the weights of 40 flour bags, which is reproduced below.

[Figure: "% Frequency Distribution and % Frequency Polygon of the weights of Flour Bags". Vertical axis: % Number of flour bags. Horizontal axis: Weights to nearest gram.]
(i) Shape: The histogram is single peaked and approximately symmetrical, with no obvious outliers.
(ii) Center: The center is given by the midpoint or median, which is approximately 500.5 from the graph.
(iii) Spread: The spread is given by the range, which from the class boundaries on the graph is 518 - 488 = 30.

Example: Examine the histogram which we obtained for the data on the % of the American population per state that was of Hispanic origin. The histogram is reproduced below.

[Figure: "Frequency Distribution and Frequency Polygon of the % of population of Hispanic origin". Vertical axis: Number of states. Horizontal axis: % of population of Hispanic origin.]

(i) Shape: The histogram is single peaked and right-skewed. The single peak represents states that are less than 5% Hispanic. Most states have no more than 10% Hispanics, but some states have much higher percentages, so that the graph trails off to the right.
(ii) Center: The frequency distribution table and the histogram show that about half the states have less than 4.7% Hispanics among their residents and half have more. So the midpoint of the distribution is close to 4.7%.
(iii) Spread: The spread is from about 0% to 42%, but only four states fall above 20%.

Outliers: Arizona, California, New Mexico and Texas stand out. Whether they are outliers or just part of the long right tail of the distribution is a matter of judgement. There is no rule for calling an observation an outlier, but we will give a rough rule later. Once you have spotted possible outliers, look for an explanation. Some outliers are due to mistakes, such as typing 4.2 as 42. Other outliers point to the special nature of some observations. These four states are heavily Hispanic by history and location.
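The description of the Hispanic data can be checked numerically. The sketch below uses the state percentages from the table given later in these notes and computes the median, the smallest and largest values, the states above 20%, and the class counts for classes of width 5 starting at 0. The class width and starting point are assumptions inferred from the midpoints marked on the histogram, and Python and all variable names are illustrative only.

```python
from statistics import median

# Percentage of each state's population of Hispanic origin in 2000.
hispanic_pct = {
    "Alabama": 1.5, "Alaska": 4.5, "Arizona": 25.3, "Arkansas": 2.8,
    "California": 34.4, "Colorado": 17.1, "Connecticut": 9.4, "Delaware": 4.8,
    "Florida": 16.8, "Georgia": 5.3, "Hawaii": 7.2, "Idaho": 7.9,
    "Illinois": 10.7, "Indiana": 3.5, "Iowa": 2.8, "Kansas": 7.0,
    "Kentucky": 1.5, "Louisiana": 2.4, "Maine": 0.7, "Maryland": 4.3,
    "Massachusetts": 6.8, "Michigan": 3.3, "Minnesota": 2.9, "Mississippi": 1.3,
    "Missouri": 2.1, "Montana": 2.0, "Nebraska": 5.5, "Nevada": 19.7,
    "New Hampshire": 1.7, "New Jersey": 13.3, "New Mexico": 42.1, "New York": 5.1,
    "North Carolina": 4.7, "North Dakota": 1.2, "Ohio": 1.9, "Oklahoma": 5.2,
    "Oregon": 8.0, "Pennsylvania": 3.2, "Rhode Island": 8.7, "South Carolina": 2.4,
    "South Dakota": 1.4, "Tennessee": 2.0, "Texas": 32.0, "Utah": 9.0,
    "Vermont": 0.9, "Virginia": 4.7, "Washington": 7.2, "West Virginia": 0.7,
    "Wisconsin": 3.6, "Wyoming": 6.4,
}

values = sorted(hispanic_pct.values())

# Center: the median, with half the states below it and half above.
center = median(values)

# Spread: the smallest and largest values.
smallest, largest = values[0], values[-1]

# Candidate outliers: states above 20%.
above_20 = sorted(state for state, pct in hispanic_pct.items() if pct > 20)

# Shape: counts in the classes 0 to <5, 5 to <10, ..., 40 to <45 (assumed classes).
class_counts = [0] * 9
for pct in values:
    class_counts[int(pct // 5)] += 1

print(center)             # 4.7
print(smallest, largest)  # 0.7 42.1
print(above_20)           # ['Arizona', 'California', 'New Mexico', 'Texas']
print(class_counts)       # [27, 14, 2, 3, 0, 1, 2, 0, 1]
```

The class counts make the right skew visible without a graph: 41 of the 50 states fall in the first two classes, and the counts then trail off towards the four states above 20%.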
Exercise 1
The final marks of 80 students at a university are recorded below.

68 84 75 82 68 90 62 88 76 93
73 79 88 73 60 93 71 59 85 75
61 65 75 87 74 62 95 78 63 72
66 78 82 75 94 77 69 74 68 60
96 78 89 61 75 95 60 79 83 71
79 62 67 97 78 85 76 65 71 75
65 80 73 57 88 78 62 76 53 74
86 67 73 81 72 63 76 75 85 77

(a) Construct a frequency distribution, relative frequency distribution and % frequency distribution table for this data and draw the histograms for (i) the frequency distribution and frequency polygon, and (ii) the % frequency distribution and % frequency polygon. Take 7 equal classes.
(b) Construct a cumulative frequency distribution and cumulative % frequency distribution table and draw their respective histograms and ogives. Take 7 equal classes.
(c) Examine and describe the frequency distribution.

Exercise 2: The total payrolls (rounded to millions) for all 30 major league baseball teams in the U.S.A. for 1999 are given in the table below.

Total Payrolls of Major League Baseball Teams for 1999

Team               Total payroll            Team              Total payroll
                   (millions of dollars)                      (millions of dollars)
Anaheim            51                       Milwaukee         43
Arizona            70                       Minnesota         16
Atlanta            79                       Montreal          15
Baltimore          75                       New York Mets     72
Boston             72                       New York Yankees  92
Chicago Cubs       55                       Oakland           25
Chicago White Sox  25                       Philadelphia      30
Cincinnati         38                       Pittsburgh        24
Cleveland          74                       St. Louis         46
Colorado           54                       San Diego         47
Detroit            37                       San Francisco     46
Florida            15                       Seattle           45
Houston            56                       Tampa Bay         38
Kansas City        17                       Texas             81
Los Angeles        77                       Toronto           49

(a) Construct a frequency distribution, relative frequency distribution and % frequency distribution table for this data and draw the histograms for (i) the frequency distribution and frequency polygon, and (ii) the % frequency distribution and % frequency polygon. Take 5 equal classes.
(b) Construct a cumulative frequency distribution and cumulative % frequency distribution table and draw their respective histograms and ogives. Take 7 equal classes.
(c) Examine and describe the frequency distribution.

Description and Interpretation of Histograms

A histogram is a pictorial representation of the frequency distribution of a data set. Interpreting the histogram is very important, as this interpretation helps us to understand the data and what the data tell us about the process they represent. In examining a histogram you should note its most important features, which are
(i) its overall shape or pattern, together with any deviations,
(ii) its centre,
(iii) its spread.
An important kind of deviation is an outlier, which is an individual data value that lies outside the overall pattern of data values.

Firstly, as regards the shape, note whether it is irregular, multimodal (many peaked) or unimodal (single peaked).

An example of an irregular histogram is the histogram displaying the costs for the 2002-2003 academic year of 56 four-year colleges in Massachusetts. The overall pattern is neither symmetric nor skewed but irregular, with two separate clusters of colleges: 11 colleges costing less than $16,000 (public colleges) and the remaining 45 colleges costing more than $20,000 (private colleges).

A histogram which is multimodal (in fact bimodal) is shown below. Such a bimodal histogram generally represents a mixture of two different types of data. In this case half of the men are Irish and half are pygmies.

Most of the histograms that we study are unimodal (single peaked), and the most common shapes are (i) symmetric, (ii) skewed, (iii) uniform or rectangular. Graphs of these shapes together with their frequency curves are shown below.

(i) Histogram of the vocabulary scores of all 947 seventh-grade students in Gary, Indiana. The smooth curve shows the overall shape of the distribution and is an approximation of the frequency polygon for this large data set. This curve is a mathematical model for the distribution. A mathematical model is an idealized description.
It gives a compact picture of the overall pattern of the data but ignores minor irregularities as well as any outliers. It is what the frequency polygon approaches as the data set becomes very large. This symmetric, bell-shaped histogram is the most important and most widely occurring histogram shape in statistics.

The table and histogram of the percentage of the population of Hispanic origin, by state, in 2000 are given below.

State          Percentage    State            Percentage    State            Percentage
Alabama         1.5          Louisiana         2.4          Ohio              1.9
Alaska          4.5          Maine             0.7          Oklahoma          5.2
Arizona        25.3          Maryland          4.3          Oregon            8.0
Arkansas        2.8          Massachusetts     6.8          Pennsylvania      3.2
California     34.4          Michigan          3.3          Rhode Island      8.7
Colorado       17.1          Minnesota         2.9          South Carolina    2.4
Connecticut     9.4          Mississippi       1.3          South Dakota      1.4
Delaware        4.8          Missouri          2.1          Tennessee         2.0
Florida        16.8          Montana           2.0          Texas            32.0
Georgia         5.3          Nebraska          5.5          Utah              9.0
Hawaii          7.2          Nevada           19.7          Vermont           0.9
Idaho           7.9          New Hampshire     1.7          Virginia          4.7
Illinois       10.7          New Jersey       13.3          Washington        7.2
Indiana         3.5          New Mexico       42.1          West Virginia     0.7
Iowa            2.8          New York          5.1          Wisconsin         3.6
Kansas          7.0          North Carolina    4.7          Wyoming           6.4
Kentucky        1.5          North Dakota      1.2

Its histogram is given below.

[Histogram: Percentage of population of Hispanic origin. Horizontal axis: Percentage Hispanic (class midpoints 2.5 to 42.5). Vertical axis: Number of states (0 to 30).]

Shape of Histogram: Its overall shape is unimodal and skewed to the right, i.e. most states have no more than 10% Hispanics but some states have much higher percentages, so that the graph tails off to the right.

Centre of Histogram: The centre, as shown by the table, is about 4.7%, i.e. about half the states have less than 4.7% Hispanics among their residents and half have more. Hence the midpoint of the distribution is close to 4.7%.

Spread of Histogram: The spread is from about 0% to 42%, with only 4 states falling above 20%.
Outliers of Histogram: Arizona, California, New Mexico, and Texas stand out. Whether they are outliers or just part of the long right tail of the distribution is a matter of judgement. There is no rule for calling an observation an outlier. Once you have spotted possible outliers, look for an explanation. Some outliers are due to mistakes, such as typing 4.2 as 42. Other outliers point to the special nature of some observations. These four states are heavily Hispanic by history and location.

Note: When you describe a distribution, concentrate on the main features. Look for major peaks, not for minor ups and downs in the bars of the histogram. Look for clear outliers, not just for the smallest and largest observations. Look for rough symmetry or clear skewness.

The most common shape for a distribution is a symmetric pattern like the one shown below for the heights of 1000 men, called a normal distribution. The normal distribution arises so often that when we see a non-normal pattern it is worth asking why it is not normal.

The histogram below shows the heights of 1000 men with a distribution that is not normal, and there is a reason for this. The left tail is missing, and the distribution is called a truncated normal. The reason it is not normal is that the men are all members of the police, which has a minimum height requirement for all police recruits.

Describing Distributions with Numbers

Two common features of all the distributions which we have encountered are
(i) the data cluster about a central data value,
(ii) the data spread or vary about this central data value.
We would like to have numbers that measure these two characteristics of data distributions, namely the center and the spread of the data.

The two main numerical measures of the center of data are (i) the mean and (ii) the median. The main numerical measures of spread or variability are (i) the range, (ii) the quartiles and (iii) the standard deviation.
Measuring Center of Data: the Mean

The most common measure of the center of a set of data is the arithmetic mean, usually called just the mean.

Definition: Mean
The mean, denoted by \(\bar{x}\), of \(n\) data values \(x_1, x_2, x_3, \ldots, x_n\) is

\[ \bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i . \]

Example 1: The incomes of 15 people who have bachelor's degrees, chosen at random from U.S. Census Bureau data in March 2002, were, to the nearest thousand dollars,

110 25 50 50 55 30 35 30 4 32 50 30 31 74 60.

Find the average income. Also find the average income of 14 of the same people, excluding the person earning $110,000.

Solution

\[ \bar{x} = \frac{110 + 25 + 50 + 50 + 55 + 30 + 35 + 30 + 4 + 32 + 50 + 30 + 31 + 74 + 60}{15} = \frac{666}{15} = 44.4 \]

or $44,400.

The average for the 14 people is

\[ \bar{x} = \frac{25 + 50 + 50 + 55 + 30 + 35 + 30 + 4 + 32 + 50 + 30 + 31 + 74 + 60}{14} = \frac{556}{14} = 39.714286 \approx 39.7 \]

or $39,700.

Note: 110 was an outlier and its presence raised the mean from $39,700 to $44,400. This illustrates the important fact that the mean, as a measure of the center of data, is sensitive to the influence of a few extreme values. These may be outliers, but a skewed distribution with no outliers may also pull the mean towards its long tail. Because the mean cannot resist the influence of extreme data values, we say that the mean is not a resistant measure of the center.

Measuring Center of Data: the Median

Another important measure of the center of data is the median M.

Definition: Median
The median M is the midpoint of a distribution, such that half the data values are smaller and half are larger.

To find the median of a distribution:
Step 1 Arrange all the data values in order of size from the smallest to the largest.
Step 2 If the number of data values \(n\) is odd, the median M is the center data value in the ordered list. Find the location of the median by counting \((n+1)/2\) values up from the bottom of the list.
Step 3 If the number of data values \(n\) is even, the median M is the mean of the two center data values in the ordered list.
The location of the median is again found by counting \((n+1)/2\) values up from the bottom of the list.

Note: The formula \((n+1)/2\) does not give the median, just the location of the median in the ordered list. The median requires little arithmetic and, for a small data set, is easy to compute.

Example: The incomes of 15 people who have bachelor's degrees, chosen at random from U.S. Census Bureau data in March 2002, were, to the nearest thousand dollars,

110 25 50 50 55 30 35 30 4 32 50 30 31 74 60.

Find the median income (the case where \(n\) is odd). Also find the median income of 14 of the same people, excluding the person earning $110,000 (the case where \(n\) is even).

Solution
The earnings of the 15 college graduates arranged in ascending order are

4 25 30 30 30 31 32 35 50 50 50 55 60 74 110

The median is situated at the \(\frac{n+1}{2} = \frac{16}{2} = 8\)th location in the data set, and its value is M = 35 or $35,000.

The earnings of the 14 college graduates arranged in ascending order are

4 25 30 30 30 31 32 35 50 50 50 55 60 74

The median is situated at the \(\frac{n+1}{2} = \frac{15}{2} = 7.5\)th location in the data set, that is, halfway between the 7th and 8th positions in the ordered list. Thus the median is

\[ M = \frac{32 + 35}{2} = \frac{67}{2} = 33.5 \]

or $33,500.

Note: The outlier 110 changes the median by $1,500.

Comparing the Mean and the Median

We see from the calculations above that the single outlier $110,000 changes the mean by $4,700 while it changes the median by only $1,500. The median is thus considered a resistant measure of the center of data, while the mean is not.

More generally, the mean and the median are close together in a symmetric distribution. In fact, if the distribution is exactly symmetric the mean and the median are the same. In a skewed distribution the mean is further out in the long tail than the median.
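The comparison above can be checked with a short Python sketch. It uses only the standard `statistics` module and the income data from the example; the variable names and the helper `median_location` are illustrative choices, not part of the course material (which uses MINITAB).

```python
from statistics import mean, median

# Incomes in thousands of dollars, as listed in the example above.
incomes = [110, 25, 50, 50, 55, 30, 35, 30, 4, 32, 50, 30, 31, 74, 60]

# The same people with the single $110,000 earner left out
# (110 is the largest value, so it is last in the sorted list).
incomes_14 = sorted(incomes)[:-1]


def median_location(n):
    """1-based location of the median in the ordered list: (n + 1) / 2."""
    return (n + 1) / 2


mean_15, median_15 = mean(incomes), median(incomes)
mean_14, median_14 = mean(incomes_14), median(incomes_14)

print("n = 15: location", median_location(15),
      "mean", round(mean_15, 1), "median", median_15)
print("n = 14: location", median_location(14),
      "mean", round(mean_14, 1), "median", median_14)

# How far the single extreme value moves each measure of center.
mean_shift = mean_15 - mean_14
median_shift = median_15 - median_14
print("shift in mean:", round(mean_shift, 1),
      "shift in median:", median_shift)
```

The shift in the mean is several times larger than the shift in the median, which is what is meant by calling the median resistant and the mean not resistant.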
Measuring the Spread or Variability of a Distribution of Data

Two quantities that are used to measure the spread or variability of the distribution of data are (i) the range of the data and (ii) the quartiles of the data.

Definition: The Range of the data is the maximum data value minus the minimum data value.

This measure of spread or variability can be unreliable because the data may contain outliers. A more reliable measure of the spread or variability of the distribution of data is the Interquartile Range (IQR), which measures the spread or variability of the middle half of the data.

Definition: Quartiles \(Q_1\) and \(Q_3\)
With the data arranged in increasing order, the first quartile \(Q_1\) lies one quarter of the way up the list of data. The third quartile \(Q_3\) lies three quarters of the way up the list of data. In other words, the first quartile \(Q_1\) is larger than 25% of the data values and the third quartile \(Q_3\) is larger than 75% of the data values.

Note: The second quartile \(Q_2\) is the median M, which is larger than 50% of the data values. Thus the quartiles divide the data into quarters.

Definition: The Interquartile Range is \(\mathrm{IQR} = Q_3 - Q_1\).

Note: The interquartile range measures the spread of the middle half of the data.

To find the quartiles \(Q_1\) and \(Q_3\):
Step 1 Arrange the data in increasing order, listing all the data including those that are equal (always regard data that are equal as distinct), and locate the median M.
Step 2 The first quartile \(Q_1\) is the median of the data values of the ordered list of data to the left of the location of the overall median M. In other words, \(Q_1\) is located in the \(\frac{n+1}{4}\) position of the ordered list of data values.
Step 3 The third quartile \(Q_3\) is the median of the data values of the ordered list of data to the right of the location of the overall median M. In other words, \(Q_3\) is located in the \(\frac{3(n+1)}{4}\) position of the ordered list of data values.
Step 4 Find the Interquartile Range \(\mathrm{IQR} = Q_3 - Q_1\).
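The four steps above can be sketched in Python as follows. This is a minimal illustration of the "median of each half" rule, applied to the income data from Example 1; the function name `quartiles` is a hypothetical helper written for this sketch, and it assumes at least two data values.

```python
from statistics import median


def quartiles(data):
    """Return (Q1, M, Q3) using the 'median of each half' rule.

    Q1 is the median of the values to the left of the median's location
    and Q3 is the median of the values to the right of it.
    Assumes len(data) >= 2.
    """
    ordered = sorted(data)            # Step 1: arrange in increasing order
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]            # values left of the median's location
    # When n is odd the median itself belongs to neither half.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)   # Steps 2 and 3


# Incomes in thousands of dollars (Example 1).
incomes = [110, 25, 50, 50, 55, 30, 35, 30, 4, 32, 50, 30, 31, 74, 60]

q1, m, q3 = quartiles(incomes)
iqr = q3 - q1                         # Step 4: interquartile range
data_range = max(incomes) - min(incomes)

print("Q1 =", q1, " M =", m, " Q3 =", q3)
print("IQR =", iqr, " Range =", data_range)
```

Built-in quartile routines in statistical software may follow a different convention, so results from a package need not agree exactly with this hand rule on every data set.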
Note: Some software packages use slightly different rules to calculate the quartiles, so computer results may differ slightly from the results calculated by the above rules. However, the difference will generally be very small and can be ignored.

Five-number summary and Box-and-Whisker plot (Boxplot)

The smallest and the largest data values tell us little about the data distribution as a whole, but they give us information about the tails of the distribution that is missing if we know only \(Q_1\), M and \(Q_3\). To get a quick summary of both center and spread we combine all five numbers into what is called the five-number summary.

Definition: The five-number summary of a distribution of data consists of the smallest data value, the first quartile \(Q_1\), the median M, the third quartile \(Q_3\), and the largest data value, written in order from the smallest to the largest. In symbols the five-number summary is

Minimum  \(Q_1\)  M  \(Q_3\)  Maximum.

Note: These five numbers offer a reasonably complete description of the center and the spread or variability of a data distribution. Of course Minimum \(\le Q_1 \le\) M \(\le Q_3 \le\) Maximum.

Example 2: The incomes of 15 people who have bachelor's degrees, chosen at random from U.S. Census Bureau data in March 2002, were, to the nearest thousand dollars,

110 25 50 50 55 30 35 30 4 32 50 30 31 74 60.

Find (i) the range, (ii) the five-number summary and (iii) the interquartile range (IQR).

Solution
Note \(n = 15\).
Step 1 Arrange the data in ascending order:

4 25 30 30 30 31 32 35 50 50 50 55 60 74 110

(i) The Range = 110 - 4 = 106.
(ii) The median M is at the \(\frac{n+1}{2} = \frac{16}{2} = 8\)th position in the list, i.e. M = 35.

The first quartile \(Q_1\) is the median of the data values of the ordered list of data to the left of the location of the overall median M, that is, the median of

4 25 30 30 30 31 32

which is at the \(\frac{n+1}{2} = \frac{8}{2} = 4\)th position (where \(n = 7\)), i.e. \(Q_1 = 30\). Alternatively, the first quartile \(Q_1\) is at the \(\frac{n+1}{4} = \frac{16}{4} = 4\)th position of the overall ordered data list, i.e. \(Q_1 = 30\).
The third quartile \(Q_3\) is the median of the data values of the ordered list of data to the right of the location of the overall median M, that is, the median of

50 50 50 55 60 74 110

which is at the \(\frac{n+1}{2} = \frac{8}{2} = 4\)th position (where \(n = 7\)), i.e. \(Q_3 = 55\). Alternatively, the third quartile \(Q_3\) is at the \(\frac{3(n+1)}{4} = \frac{3 \times 16}{4} = 12\)th position of the overall ordered data list, i.e. \(Q_3 = 55\).

Thus the five-number summary is Minimum = 4, \(Q_1 = 30\), M = 35, \(Q_3 = 55\) and Maximum = 110.

(iii) The Interquartile Range \(\mathrm{IQR} = Q_3 - Q_1 = 55 - 30 = 25\).

Minitab gives the following results.

Descriptive Statistics: income

Variable     N     Mean   Median   TrMean   StDev   SE Mean
income      15    44.40    35.00    42.46   24.90      6.43

Variable   Minimum   Maximum      Q1      Q3
income        4.00    110.00   30.00   55.00

Stem-and-leaf of income   N = 15
Leaf Unit = 1.0

   1    0  4
   1    1
   2    2  5
  (6)   3  000125
   7    4
   7    5  0005
   3    6  0
   2    7  4
   1    8
   1    9
   1   10
   1   11  0

[Boxplot of income; axis: income, 0 to 100]

Note: The asterisk in the boxplot denotes the outlier 110.

Measuring the Spread or Variability of Data: The Standard Deviation

The five-number summary is not the most common numerical description of a distribution. That distinction belongs to the combination of the mean, to measure the center of the distribution, and the standard deviation, to measure the spread or variability of the data. The standard deviation measures the spread of data by measuring how far the data values are from the mean of the data.

Definition: Variance \(s^2\) and Standard Deviation \(s\)
The variance \(s^2\) of a set of data is the average of the squares of the deviations of the data from the mean. That is, if we have a set of \(n\) data values \(x_1, x_2, x_3, \ldots, x_n\), then

\[ s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + (x_3 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} = \frac{1}{n - 1}\sum_{i=1}^{n} (x_i - \bar{x})^2 , \]

where \(\bar{x}\) is the mean.

The standard deviation \(s\) is the square root of the variance \(s^2\), that is

\[ s = \sqrt{\frac{1}{n - 1}\sum_{i=1}^{n} (x_i - \bar{x})^2} . \]

Example 3: A person's metabolic rate is the rate at which the body consumes energy. Metabolic rate is important in studies of weight gain, dieting and exercise.
Below are the metabolic rates, measured in calories, of 7 men who took part in a study of dieting:

1792 1666 1362 1614 1460 1867 1439.

The mean is

\[ \bar{x} = \frac{1792 + 1666 + 1362 + 1614 + 1460 + 1867 + 1439}{7} = \frac{11200}{7} = 1600 \text{ calories.} \]

Data    Deviation                 Squared deviation
1792    1792 - 1600 =  192        192^2 = 36864
1666    1666 - 1600 =   66         66^2 =  4356
1362    1362 - 1600 = -238        238^2 = 56644
1614    1614 - 1600 =   14         14^2 =   196
1460    1460 - 1600 = -140        140^2 = 19600
1867    1867 - 1600 =  267        267^2 = 71289
1439    1439 - 1600 = -161        161^2 = 25921
        ------------------        ---------------
        Sum = 0                   Sum = 214,870

The variance is

\[ s^2 = \frac{(1792 - 1600)^2 + (1666 - 1600)^2 + (1362 - 1600)^2 + (1614 - 1600)^2 + (1460 - 1600)^2 + (1867 - 1600)^2 + (1439 - 1600)^2}{6} = \frac{214870}{6} = 35811.67 . \]

The standard deviation is \( s = \sqrt{s^2} = \sqrt{35811.67} = 189.24 \) calories.

Note 1: The variance \(s^2\) does not have the same units as the data, which are in calories. Hence we take the square root of \(s^2\) to get the standard deviation \(s\), which has the same units as the data.

Note 2: We had to square the deviations because otherwise, when we sum the deviations, we get zero, which would be useless.

Note 3: When averaging the sum of the squares of the deviations we divide by \(n - 1\), where you would expect to divide by \(n\), the number of squared deviations. The reason why we divide by \(n - 1\) and not \(n\) is as follows. The sum of the deviations \(\sum_{i=1}^{n} (x_i - \bar{x})\) is always zero (because of the definition of the mean), so knowing \(n - 1\) of them determines the last one. Thus only \(n - 1\) of the squared deviations can vary freely, and so we average by dividing by \(n - 1\) rather than \(n\). The number \(n - 1\) is called the degrees of freedom of the variance or standard deviation.

Note 4: \(s\) measures the spread about the mean \(\bar{x}\) and should be used only when the mean is chosen as the measure of the center of the distribution.

Note 5: \(s = 0\) only when there is no spread, that is, when all the data values are the same. Otherwise \(s > 0\).

Note 6: \(s\) has the same units of measurement as the data, and this is the reason why the standard deviation \(s\) is chosen in preference to the variance \(s^2\).

Note 7: Like the mean \(\bar{x}\), the standard deviation \(s\) is not resistant.
Strong skewness or a few outliers can greatly increase \(s\). For example, if 1439 is replaced by 1999 in the above data, the new mean and new standard deviation are \(\bar{x} = 1680\) and \(s = 224.85\).

Choosing Measures of Center and Spread of Data

The five-number summary is usually better than the mean \(\bar{x}\) and the standard deviation \(s\) for describing a skewed distribution or a distribution with strong outliers. Use the mean \(\bar{x}\) and the standard deviation \(s\) only for reasonably symmetric distributions that are free of outliers.

Note: Most computer packages give both the five-number summary and the mean \(\bar{x}\) and the standard deviation \(s\). For example, MINITAB gives the following for the above data, namely

1792 1666 1362 1614 1460 1867 1439

The data in ascending order are

1362 1439 1460 1614 1666 1792 1867.

Descriptive Statistics: C1

Variable     N     Mean   Median   TrMean   StDev   SE Mean
C1           7   1600.0   1614.0   1600.0   189.2      71.5

Variable   Minimum   Maximum       Q1       Q3
C1          1362.0    1867.0   1439.0   1792.0

Stem-and-Leaf Display: C1

Stem-and-leaf of C1   N = 7
Leaf Unit = 10

   1   13  6
   3   14  36
   3   15
  (2)  16  16
   2   17  9
   1   18  6

[Plot of C1; horizontal axis: C1, 1400 to 1900]

Example 4: A new machine has been purchased to cut drinking straws from lengths of plastic tubing. The straws should be approximately 203 mm long. A random sample of drinking straws cut by the new machine had the following lengths in mm:

204 202 204 204 201.

Find the range and show that their mean is 203 mm.

A random sample of drinking straws was also cut by the old machine, with the following results in mm:

207 200 204 197 207.

Find the range and the median. Show that their mean is 203 mm. Why is the new machine better?
Hint: Calculate the standard deviations of both sets.

Solution
New machine. Put the data in ascending order:

201 202 204 204 204

Range = 204 - 201 = 3.
Median = 204.

\[ \text{Mean} = \bar{x} = \frac{201 + 202 + 204 + 204 + 204}{5} = \frac{1015}{5} = 203 . \]

Data    Deviation            Squared deviation
201     201 - 203 = -2       2^2 = 4
202     202 - 203 = -1       1^2 = 1
204     204 - 203 =  1       1^2 = 1
204     204 - 203 =  1       1^2 = 1
204     204 - 203 =  1       1^2 = 1
        ---------------      ---------------
        Sum = 0              Sum = 8

\[ \text{Variance } s^2 = \frac{4 + 1 + 1 + 1 + 1}{4} = \frac{8}{4} = 2 . \]

\[ \text{Standard deviation } s = \sqrt{s^2} = \sqrt{2} = 1.414213562 \approx 1.41 . \]

Old machine. Put the data in ascending order:

197 200 204 207 207

Range = 207 - 197 = 10.
Median = 204.

\[ \text{Mean} = \bar{x} = \frac{197 + 200 + 204 + 207 + 207}{5} = \frac{1015}{5} = 203 . \]

Data    Deviation            Squared deviation
197     197 - 203 = -6       6^2 = 36
200     200 - 203 = -3       3^2 = 9
204     204 - 203 =  1       1^2 = 1
207     207 - 203 =  4       4^2 = 16
207     207 - 203 =  4       4^2 = 16
        ---------------      ---------------
        Sum = 0              Sum = 78

\[ \text{Variance } s^2 = \frac{36 + 9 + 1 + 16 + 16}{4} = \frac{78}{4} = 19.5 . \]

\[ \text{Standard deviation } s = \sqrt{s^2} = \sqrt{19.5} = 4.415880433 \approx 4.42 . \]

Comment: A typical drinking straw from the new machine is 203 mm long, but the lengths typically differ from 203 mm by 1.41 mm. We can say that this is the precision of the machine: it typically makes an error of 1.41 mm each time it cuts a straw. Note that this is the typical error, not the maximum error made by the machine. The old machine also cuts drinking straws that are typically 203 mm long, but their lengths typically differ from 203 mm by 4.42 mm. So the old machine typically makes an error of 4.42 mm each time it cuts a straw, while the new machine typically makes an error of 1.41 mm. Hence the new machine is better, as the straws from the new machine are less variable in length, i.e. their lengths are in general closer to 203 mm.
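The standard deviation calculations in Examples 3 and 4 can be reproduced with a short Python sketch. The helper names `sample_variance` and `sample_std` are illustrative; they follow the definition given earlier, dividing the sum of squared deviations by n - 1, and assume at least two data values.

```python
from math import sqrt
from statistics import mean


def sample_variance(data):
    """Sum of squared deviations from the mean, divided by n - 1."""
    xbar = mean(data)
    squared_deviations = [(x - xbar) ** 2 for x in data]
    return sum(squared_deviations) / (len(data) - 1)


def sample_std(data):
    """Standard deviation: the square root of the sample variance."""
    return sqrt(sample_variance(data))


# Straw lengths in mm (Example 4).
new_machine = [204, 202, 204, 204, 201]
old_machine = [207, 200, 204, 197, 207]

# Metabolic rates in calories (Example 3).
metabolic_rates = [1792, 1666, 1362, 1614, 1460, 1867, 1439]

for name, values in (("new machine", new_machine),
                     ("old machine", old_machine),
                     ("metabolic rates", metabolic_rates)):
    print(name,
          "| mean:", mean(values),
          "| range:", max(values) - min(values),
          "| variance:", round(sample_variance(values), 2),
          "| std dev:", round(sample_std(values), 2))
```

Both machines give the same mean, so the mean alone cannot separate them; the smaller standard deviation of the new machine is what shows that its straws are more consistent in length.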