Chapter One Introduction

1.1. Definition of Statistics

How do we define statistics? The word has two meanings. In the more common usage (the layman's definition), statistics refers to a collection of numerically expressed facts or data. Examples: the number of colleges in a city; the number of students in a college; per capita income statistics; statistics of imports, exports, consumption, etc.

But the subject of statistics has a much broader meaning than just collecting and publishing numerical information. Therefore, we define statistics as the science of collecting, organizing, presenting, analyzing, and interpreting numerical data to assist in making more effective decisions. According to Dominick Salvatore and Derrick Reagle, “statistics refers to collection, presentation, analysis and utilization of numerical data to make inferences and reach decisions in the face of uncertainty in economics, business and other social and physical sciences.”

As the definition suggests, the first step in investigating a problem is to collect data. The data must then be organized in some way and perhaps presented in a chart. Only after the data have been organized and presented can we analyze and interpret them.

Example: If students of economics at a university would like to know the monthly household income of 200 residents in a town, then they
a) have to collect the data, that is, the income of the households under study,
b) should organize the data (say, by arranging them in ascending or descending order),
c) should present the data by using charts, tables, etc., and
d) should do some analysis (say, find the average, median, mode, variance, standard deviation, etc.) and interpret the data.

1.2. Types of Statistics

The study of statistics is usually divided into two categories:

a) Descriptive Statistics
It is a statistical method that deals with describing (summarizing) a given set of data without drawing conclusions about the larger population from which the data come.
It consists of the collection, organization and presentation of data in an informative way. Tables, graphs and numerical summary measures may be used to describe data. In descriptive statistics, the statistician tries to describe a situation.

Examples of descriptive statistics:
1) Consider the national census conducted by the Ethiopian government in 1999 E.C. The results of this census give the average age, average household income, and other characteristics of the Ethiopian population, and these are descriptive statistics.
2) A survey found that 49% of the population of Ethiopia are male. The statistic 49 describes the number out of every 100 persons who are male.
3) According to Consumer Reports, Sony TV owners reported 2 defective TVs per 100 TVs (2%) in 2001. The statistic 2 (2%) describes the number of problems out of every 100 TVs.
4) According to the bureau of labor statistics, the average daily wage of workers in a town was birr 15 in August 2007.
5) The GDP of country X was 100 million in 1960 and 140 million in 2007. If we calculate the percentage growth of GDP from 1960 to 2007, that is still a descriptive statistic. What is the percentage growth of GDP from 1960 to 2007? [Answer: (140 - 100)/100 × 100% = 40%]

Query: Would it be descriptive statistics if we used this GDP growth rate (40%) to estimate the GDP of country X in the year 2010? Why? What type of statistics is it?

b) Inferential Statistics
It is also called statistical inference or inductive statistics. It is a statistical method that involves taking a sample from a population, computing a statistic based on the sample, and inferring from the statistic the value of the corresponding parameter. It is the branch of statistics that is used to determine something about a population on the basis of a sample taken from that specific population. An inference is a decision, estimate, prediction, or generalization about a population, based on a sample.
Examples:
1) The accounting department of a large firm will select a sample of invoices and check them in order to judge the accuracy of all the invoices of the company.
2) Wine tasters sip a few drops of wine to make a decision with respect to all the wine waiting to be released for sale.

Note the words “population” and “sample” in the definition of inferential statistics. A population is a collection of all possible individuals, objects or measurements of interest. When a researcher gathers data from the whole population for a given measure of interest, it is called a census (complete enumeration). A sample is a portion or part of the population of interest.

When we discuss inferential statistics we have to differentiate between a parameter and a statistic. A parameter is a value calculated from a population (say, the population mean, population standard deviation, etc.) and a statistic is a value calculated from a sample (say, the sample mean, sample standard deviation, etc.). The difference between a sample statistic and its corresponding parameter is called the sampling error.

Example on sample vs. population:
i. If we want to do research on the impact of high school GPA (transcript result) on the college GPA of economics students at a university, the population is all economics students at that university.
ii. A researcher may select all students of economics at Debre Markos University as a sample to learn the impact of high school GPA on college GPA and infer (conclude) something about the impact of high school GPA on the college GPA of economics students at all Ethiopian colleges/universities.

Exercise
The marketing department of a bank asked a sample of 1960 customers to try a newly developed banking system. Of the 1960 customers sampled, 1176 said they would use the new system if it were marketed. What would the marketing department report to the bank officials regarding the acceptance of the new system in the population? Is this an example of descriptive or inferential statistics?
Solution: Based on the sample of 1960 customers, we estimate that, if it is marketed, sixty percent (1176/1960 × 100%) of all customers will use the new system. This is inferential statistics, because a sample was used to draw a conclusion about how all customers in the population would react if the new system were marketed.

1.3. Why Do We Study Statistics?

Statistics is required for many college programs such as business, economics, engineering, psychology, medicine, etc. The course content is basically the same. The biggest differences are the examples used and the level of mathematics required. Colleges of business and economics usually teach the course at a more applied level. Thus, in business and economics, we are interested in such things as profits (revenue minus cost), Gross Domestic Product (GDP), demand, supply, consumption, cost, wages, etc.

Dear distance learners, why is statistics required in so many fields of study? We study statistics for the following reasons:

1) The first reason is that numerical information is everywhere. If you look in the magazines in Ethiopia, you are going to find a lot of numerical information like exchange rates (say $1 = 10 birr), unemployment rates (say 5% in Bahir Dar, where unemployment rate = unemployed / labor force), per capita income (= Gross National Income / Population), the consumption rate of cement, export of coffee, import of cars, inflation rate, demand for kerosene, enrollment rates of high schools, etc. Therefore, to be an educated consumer of this information, an understanding of the concepts of basic statistics will be useful.

2) Students and/or professionals may be called on to conduct research in their fields, since statistical procedures are basic to research. To accomplish this, they must be able to design experiments; collect, organize, analyze and summarize data; and possibly make reliable predictions or forecasts for future use.
They must also be able to communicate the results of the study in their own words.

3) Students, like professionals, must be able to read and understand the various statistical studies performed in their field. To have such understanding, they must be knowledgeable about the vocabulary, concepts and statistical procedures used in these studies.

4) Data are everywhere, and no matter what your future line of work, you will make decisions that involve data. An understanding of statistical methods will help you make these decisions more effectively.

1.4. Uses of Statistics

The importance of statistics is clearly stated in the following words of Carol D. Wright of the USA: “to a very striking degree, our culture has become a statistical culture. Even a person who may never have heard of an index number is affected by those index numbers which describe the cost of living. It is impossible to understand psychology, sociology, economics, business, finance, or physical science without some general idea of the meanings of an average, of variations, of sampling, of how to interpret charts and tables.” According to H.G. Wells, “statistical thinking will one day be as necessary for effective citizenship as the ability to read and write.”

The main function of statistics is to enlarge our knowledge of complex phenomena. That is:
i. It presents facts in a definite and precise form. Example: instead of saying that the per capita income of Ethiopia is low, it is better and clearer to say it is 110.
ii. It reduces data, i.e. it simplifies a complex mass of data and presents it in a few clear and useful summaries. Bulky data may be summarized in totals, averages, percentages, etc.
iii. It measures the magnitude of variation in data.
iv. It furnishes techniques of comparison.
v. It helps to estimate an unknown population parameter from a sample.
vi. It helps to formulate and test hypotheses.
vii. It helps to study the relationship between two or more variables.
viii.
It helps to forecast future events.

1.5. Users of Statistics

Most people become familiar with statistics through radio, television, newspapers, and magazines, and statistical methods are used in almost all fields of human endeavor. Statistical methods help people identify and solve many problems concerning the environment, the economy, transportation, public health and other matters of public concern.

Economists use statistical techniques to predict future economic conditions, to understand economic problems, to formulate economic policies, to do research in the areas of economics, to do market analysis, etc.
Doctors use such methods to determine whether certain drugs help in the treatment of medical problems.
Weather forecasters use statistics to help them predict the weather more accurately.
Engineers use it to set standards for product safety and quality.
Statistical ideas help scientists design effective experiments.
Lawyers are increasingly turning to statisticians to help weigh evidence and determine reasonable doubt.
In education, researchers might want to know if new methods of teaching are better than the old ones.

1.6. Application of Statistics in Business and Economics

Nowadays the success of a particular business or industry depends very much on the accuracy and precision of statistical analysis. Before undertaking a new venture, or in order to improve an existing one, business executives must have a large number of quantitative facts. Examples: cost of raw materials, demand for products in the market, price of products in the market, various taxes to be paid, labor conditions, sales forecasts. All these facts are to be analyzed statistically before starting a new enterprise or before fixing the price of a commodity.

Statistical methods are now used for exploring possibilities for advertising campaigns, for adjusting production methods, and as an aid in establishing standards.
Statistical techniques help in forecasting future markets. Market research and market surveys by statistical sampling methods are now extremely useful for any business person.

In industry, statistics is widely used in quality control. In production engineering, statistical tools like inspection plans, control charts, etc. are of great use in finding whether the product conforms to specification.

Wide application of statistics can be found in insurance companies, where the premium rates are fixed on the basis of mortality, average length of life, possibilities of investment, etc.

1.7. Limitations of Statistics

i. Statistics deals only with quantitative information, i.e. the information should be capable of being expressed numerically, either directly or indirectly.
ii. Statistics deals only with aggregates of facts and not with individual data items.
iii. Statistical data are only approximately, not mathematically, correct.
iv. Statistics can be easily misused and, therefore, should be used only by experts.

Misuse of statistics
Knowingly: advertising media; government for political cause.
Unknowingly: inappropriate comparison; lack of knowledge in statistics or in the subject matter to which it is applied; incomplete information.

1.8. Steps of Statistical Investigation

A statistical study involves the following stages:
i. Determining the objective of the study;
ii. Collecting the data;
iii. Organizing the collected data;
iv. Presenting the data;
v. Analyzing the data; and
vi. Interpreting the results of the study and making recommendations.

1.9. Types of Variables

What is a variable? A variable is a measurable characteristic of a given phenomenon (object, process, event, etc.) which can take different values in a given population or sample of elements. Equivalently, it is a characteristic of each element of a population or a sample.
Examples: annual income (it can be Birr 2000, Birr 3000, Birr 4000, or any other value), quantity demanded (it can be 2000 units, 3000 units, 4000 units, or any other value), price (it can be Birr 2 per unit, Birr 4 per unit, Birr 10 per unit or any other value), gender (female or male), etc.

Data (singular: datum) are the set of values collected for the variable from each of the elements of the sample; they are the actual measurements or observations that result from an investigation or survey; they are the values (responses) of the variable associated with the elements of a population or a sample.
Example: The variable monthly household income of a family in a town can assume different values (say, Birr 1000, Birr 3000, etc.). But if we collect the monthly household income of 100 households, then the collected values are called data.

Data set: a collection of data values. Example: the monthly household incomes of 100 residents in a town form a data set.

Raw data: data collected in their original form (not yet organized).

Information: a set of data corresponding to a specific aspect of knowledge, combined in an organized way. Information is processed data that can be used directly. It can transfer knowledge and meaning.

From the point of view of statistical methods, variables can be broadly classified into qualitative (or categorical) and quantitative (or numerical) variables.

Qualitative Variable: When the characteristic being studied is non-numeric, the variable is called a qualitative variable or attribute. It is a variable or characteristic which cannot be measured in quantitative form but can only be identified by name or category. Examples include gender, religious affiliation, type of automobile owned, place of birth, eye color, etc. When the data are qualitative, we are usually interested in how many or what proportion fall in each category. For example, what percent of the population are male? What percent of the population own a Nokia mobile phone?
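The idea of asking "what proportion fall in each category" can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the list of responses is made up, and the helper name category_shares is hypothetical rather than anything defined in this module.

```python
from collections import Counter

# Hypothetical responses for one qualitative variable (phone brand owned).
# These ten values are invented for illustration; they are not survey data.
responses = ["Nokia", "Samsung", "Nokia", "Other", "Nokia",
             "Samsung", "Other", "Nokia", "Samsung", "Nokia"]


def category_shares(values):
    """Return {category: percentage of observations in that category}."""
    counts = Counter(values)   # how many observations fall in each category
    total = len(values)
    return {category: 100 * count / total for category, count in counts.items()}


shares = category_shares(responses)
for category, percent in sorted(shares.items()):
    print(f"{category}: {percent:.1f}%")
```

Note that the code only counts category labels; it never adds or averages them, which is the appropriate way to summarize qualitative data.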
Note that: Generally, although numerical codes can be assigned to the different categories of a qualitative variable, arithmetic operations (addition, subtraction, multiplication and division) are not applicable to qualitative data.

Quantitative Variable: It is a variable that can be measured and expressed numerically. Examples: the balance in your checking account, minutes remaining in class, the number of children in a family, time taken to finish an exam, etc.

Quantitative variables can be classified as either discrete or continuous.

1) Discrete variables can only assume certain values, and there are usually “gaps” between the values. Discrete variables can be assigned values such as 0, 1, 2, 3, 4, 4.5, 7.75, etc., and are said to be countable; typically, discrete variables result from counting. Examples: the number of bedrooms in a house, the number of cars sold at a car market, etc.

2) A continuous variable can assume any value within a specified range. Examples: the pressure in a tire, the weight of a stone, the height of students in a class, the distance from Debre Markos to Bahir Dar, age, temperature, etc. Typically, continuous variables result from measuring something and, therefore, their values must be rounded to the limit of the measuring device.

Review exercises

1) A common (layman) definition of statistics is:
a. A collection of numerical values
b. A single value
c. The sum of several values
d. The largest value in a set of observations

2) In descriptive statistics our main objective is to
a) Infer something about the sample
b) Describe the data we collected
c) Infer something about the population
d) Compute an average and conclude about the population from which the data is collected

3) Which of the following statements is true regarding a population?
a) It must be a large number of values
b) It must refer to people
c) It is a collection of individuals, objects, or measurements
d) None of the above

4) Which of the following statements is true regarding a sample?
a) It is a part of the population
b) It is a subset of the population
c) It is taken because a census is sometimes costly
d) All of the above are correct

5) A qualitative variable
a) Always refers to a sample
b) Is not numeric
c) Is numeric
d) All of the above are correct

6) A discrete variable
a) Is an example of a qualitative variable
b) Can assume only whole number values
c) Can assume only certain clearly separated values
d) Cannot be negative
e) All except A

7) In inferential statistics our main objective is to
a) Describe the population
b) Describe the data we collected
c) Infer something about the population based on the sample
d) Compute an average

8) In each of these statements, tell whether descriptive or inferential statistics have been used.
a) In the year 2015, the enrolment rate of elementary schools in Ethiopia will be 100%.
b) The average household income for people aged 25-34 is birr 2000/month.
c) Drinking coffee may raise cholesterol levels by 7%.
d) Some economists say that the National Bank of Ethiopia (NBE) may increase the interest rate on deposits to lower the money supply of the economy.

9) Classify each of the following variables as qualitative or quantitative.
a) Color of an automobile
b) Number of desks in classrooms
c) Gender (1=female, 0=male)
d) Number of pages in a book

10) Classify each of the following variables as discrete or continuous.
a) Water temperature of the sauna at a given health spa
b) Income of a household
c) Lifetime of batteries in a tape recorder
d) Weights of newborn infants at a certain hospital

11) Consider the following: the selling price of a house depends on the following factors:
a. Number of bedrooms
b. Size of the house in square feet
c. Swimming pool (1=yes, 0=no)
d. Distance from the center of the city
e. Township
f. Garage attached (1=yes, 0=no)
g. Number of bathrooms
Which of the variables given above are qualitative and which are quantitative? Why?
12) Briefly explain the difference between the following concepts and give examples, if necessary.
a) Qualitative variable vs. quantitative variable
b) Quantitative data vs. qualitative data
c) Descriptive statistics vs. inferential statistics
d) Sampling vs. census
e) Parameter vs. statistic
f) Sample vs. population

13) Describe the importance of statistics for an economist.

14) Select a newspaper article (say, from the Ethiopian Herald) that involves a statistical study and write a paper answering the following questions.
a. Is the study descriptive or inferential in nature? Explain your answer.
b. What are the variables used in the study? Classify the variables as qualitative or quantitative.

15) Which one of the following is not true?
a. A population is sometimes referred to as the universe
b. The height of Ras Dashen mountain (4440 m) can be considered a continuous variable
c. The age of students at Debre Markos University is a variable
d. None

16) The difference between the sample mean and the population mean is called
a) Population mean
b) Population standard deviation
c) Standard error of the mean
d) Sampling error

17) The numbers of TVs sold by a certain shop during the months of November, December, January and February are 25, 40, 35, and 32, respectively. Indicate whether each of the following conclusions belongs to the domain of descriptive statistics or inferential statistics.
a) During the four months, the average number of TVs sold per month was 33.
b) Since the average number of TVs sold per month was small, the shop should invest more in advertisement.
c) Out of the four months, the sale in November was the least.
d) The number of TVs sold in December was the highest because of Christmas.

18) Is a student's ID number a qualitative or quantitative variable? Why?

19) Is the plate number of a car a qualitative or quantitative variable? Why? How about a house number? Why?

20) Is a telephone number/region number a qualitative or quantitative variable? Why?
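As a closing illustration for this chapter, the relationship between a parameter, a statistic and the sampling error (Section 1.2) can be tried out numerically. The sketch below is only a toy under stated assumptions: the population of 1,000 household incomes is randomly generated rather than real, the sample size of 50 is arbitrary, and all variable names are hypothetical.

```python
import random
import statistics

random.seed(1)  # fixed seed so that the sketch is repeatable

# Hypothetical population: monthly incomes (in birr) of 1,000 households.
population = [random.randint(500, 5000) for _ in range(1000)]

# Parameter: a value computed from the whole population (as in a census).
population_mean = statistics.mean(population)

# Statistic: the same measure computed from a sample of 50 households.
sample = random.sample(population, 50)
sample_mean = statistics.mean(sample)

# Sampling error: the sample statistic minus the corresponding parameter.
sampling_error = sample_mean - population_mean

print(f"Population mean (parameter): {population_mean:.2f}")
print(f"Sample mean (statistic):     {sample_mean:.2f}")
print(f"Sampling error:              {sampling_error:.2f}")
```

Changing the seed or the sample size and rerunning shows that the sampling error varies from sample to sample, while the parameter stays fixed.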
Chapter Two Sampling Theory

Chapter Objectives: When you have completed this chapter, you are expected to:
Comprehend the basic concepts of sampling theory.
Understand the reasons for sampling.
Identify the basic sampling techniques.
Demonstrate a knowledge of basic sampling methods.
Apply sampling theory in business and economics.

2.1. Basic Concepts of Sampling Theory

Students are expected to know the following concepts in sampling theory:

1) Population or universe is the group of all elements/observations (persons, animals, objects, measurements, etc.) under consideration in a certain problem. The word population is a technical term in statistics, not necessarily referring to people. Examples: all students in this university; all households in Debre Markos town; all light bulbs produced by a firm in a single day; all fish in a lake, etc.

2) Census is a collection of data from the whole population (that is, complete enumeration). It is the actual measurement or observation of all possible elements of the population, or a survey of everyone in the population.

3) Reference population (source or target population) is the population of interest, to which the researcher would like to generalize the results of the study. Example: If a researcher would like to study the effect of a new fertilizer on crop yield in Ethiopia, then the reference population is all farmers in Ethiopia who are using the new fertilizer.

4) Sampling theory is the study of relationships existing between a population and samples drawn from the population. Attaining a specified precision at minimum cost is the main intention of sampling theory. In sampling theory, a population is often required as an assumption.

5) Sample is the small group that is chosen for the study. It is a part, portion or subset of a population taken so that some generalizations about the population can be made.
The main concern in sampling is to ensure that the sample accurately represents the population we are interested in studying. That is, samples are taken in such a way that they will be representative of the population.

6) Sampling is the process of selecting a finite number of elements from a given population of interest for the purposes of an inquiry. Example: In industry, the quality of a product is assessed through sampling; public opinion on social, economic and political problems is ascertained through sampling.

7) Sample size is the number of individuals or observations in a sample (usually denoted by n).

8) Parameter is any measurable characteristic of a population. Examples: population means, population standard deviations, population medians, etc.

9) Statistic is a number resulting from the manipulation of sample data. That is, it is any measurable characteristic of a sample. Examples: sample means, sample standard deviations, sample medians, etc. A statistic is used to estimate a population parameter such as the population mean (μ), the population standard deviation (σ), etc.

10) The sampling error is the difference between a sample statistic and its corresponding population parameter. It is the error that occurs because a sample has been taken instead of a census. For example, the sample mean may differ from the true population mean.

11) Sampling unit is the ultimate unit to be sampled (an element of the population to be sampled). It is the unit of selection in the sampling process. Examples: in a sample of households, the sampling unit is a household; in a sample of students, a student is the sampling unit; in a sample of districts, the sampling unit is a district, etc.

12) Sampling frame is the list of all possible units in the reference population, from which a sample is to be drawn.
Example: If a researcher would like to do research on the poverty levels of residents in a town and has decided that the sampling unit for the study is an individual, then the sampling frame would be the list of all individuals living in that town. A student roster is a sampling frame for a sample of students.

13) Sample design is a set of procedures for selecting the units from the population that are to be in the sample.

14) Sampling fraction (sampling interval) is the ratio of the number of units in the sample to the number of units in the sampling frame or in the reference population. For example, a sampling fraction or ratio of 1:3 is equivalent to a sampling interval of 1 in every 3 units. This means that the sample constitutes 33.3% of the total units in the sampling frame or in the reference population.

An application of the terminologies
Population: All students in Debre Markos University in 2001 E.C.
Sampling frame: All students appearing in the list of students prepared by the registrar on Hidar 30, 2001 E.C.
Sample design: Probability sampling
Sample size: 200 students selected from the sampling frame
Sampling unit (unit of analysis): A student
Statistic: Students in the sample have spent an average of 300 birr per month.
Parameter: Students in the university are probably spending, on average, between 250 birr and 350 birr per month (estimate derived from the sample statistic).

2.2. Reasons for Sampling

Why a sample instead of a census? When studying the characteristics of a population, there are many practical reasons why we prefer to select samples from the population. Some of the reasons for sampling are:

a) A census can be extremely expensive and time-consuming. Contacting every member of a large population would require great expenditures of time and money, and sampling from the list can provide satisfactory results more quickly and at much lower cost. Efficiency is the most commonly known advantage of sampling.
For example, a researcher may wish to determine the average annual income of households in Ethiopia. Surveying a sample of households would take fewer days and cost less than interviewing all the households in Ethiopia. Therefore, a sample has to be taken.

b) The physical impossibility of checking all items in the population (sometimes a census is impossible). Example: populations of fish, birds, mosquitoes and the like are large and constantly moving, being born and dying. Therefore, we take some samples to do research, as it is impractical to carry out a census on such populations.

c) A census can be destructive. The Awash wine factory, like every other winery, employs wine tasters to ensure the consistency of product quality. Naturally, it would be counterproductive if the tasters consumed all of the wine, since none would be left to sell to thirsty customers. Likewise, a firm wishing to ensure that its steel cable meets tensile-strength requirements could not test the breaking strength of its entire output. As in the Awash factory situation, the product "sampled" would be lost during the sampling process, so a complete census is out of the question.

d) The sample results are usually adequate. In practice, a sample can be more accurate than a census.

e) Speed. The collection and analysis of data can be done more quickly if the data are not excessive. Time and energy are saved. That is, the data can be collected and summarized more quickly with a sample than with a census. This is a valid consideration when the information is urgently needed.

f) It enables the researcher to get more detailed information about the particular subject under investigation. If only a few people are surveyed, the researcher can conduct in-depth interviews by spending more time with each person, thus getting more information about the subject. That is not to say the smaller the sample, the better; in fact, the opposite is true.
In general, larger samples, if correct sampling techniques are used, give more reliable information about the population.

Disadvantages of sampling:
i. Reliability: If the sample is not truly representative of the population, then we may sacrifice reliability in favor of less time and money.
ii. If complete information is required on each and every element of the population, a census should be applied.

2.3. Sampling Methods

The population is often too large for information to be collected from all its members. Usually, a representative sub-group of the population (a sample) is included in the investigation. Sampling involves the selection of a number of study units from a defined population. The main concern in sampling is, therefore, to ensure that the sample accurately represents the population we are interested in studying. Sampling methods can be categorized as probability and non-probability.

2.3.1. Probability Sampling

A probability sample is a sample selected such that each item in the population being studied has a known chance (greater than zero) of being included in the sample. These methods remove human judgment from the selection process, which helps to produce a more representative sample.

Methods of Probability Sampling: The four basic types of probability sampling methods are simple random sampling, systematic sampling, stratified sampling, and cluster sampling. The choice of which to use in any given situation will depend on the type of problem being investigated, the aim of the research and the available resources.

a) Simple Random Sample (SRS): In SRS, each item in the population has the same known, nonzero chance of being included in the sample. Random samples are selected by using methods such as random numbers (which can be generated by computers) or the lottery method.
To select a simple random sample you need to follow these procedures:
Make a numbered list of all units in the population (the sampling frame).
Number each unit on the list in sequence from 1 to N (where N is the size of the population).
Select the required number of study units, using a "lottery" or a table of random numbers.

Lottery Method in SRS
1) Numbered or named papers, each representing a unit in the population, are placed in a hat.
2) The papers are thoroughly mixed and a number of papers equal to the sample size is selected from the hat. For a sample of 200 students, the researcher would select 200 papers.
3) The sample then consists of all units of the population corresponding to the selected papers.

Random Number Table Method in SRS
1) The researcher assigns a number to each unit of the population and constructs the random number table.
2) Then s/he randomly selects a starting place (point), goes through the table across the rows or down the columns, and lists the numbers as they appear in the table.
3) Members of the population with the selected numbers constitute the sample.
4) A random number table is a list of numbers generated by a computer that has been programmed to yield a set of random numbers.
5) It is possible for a unit's number to be selected more than once.

Advantages of SRS
It ensures that the sample is unbiased, in that every individual and every possible sample has an equal chance of being chosen.
SRS is the basic sampling method assumed in survey statistical computations, so it can be used with confidence.

Disadvantages of SRS
SRS requires a sampling frame, and this is sometimes impossible to obtain (the case of a fish population).
It is difficult to take samples if the reference population is scattered.
If the population is extremely large, it is tedious and time-consuming to number the units and select the sample.
Minority subgroups of interest in the population may not be represented in the sample.
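The lottery method described above has a direct computer analogue, sketched below in Python. This is a minimal sketch rather than a prescribed procedure: the function name simple_random_sample, the frame of 250 units, the sample size of 10 and the seed value are all assumptions chosen for illustration. The standard library call random.sample draws without replacement, which plays the role of pulling distinct papers out of the hat.

```python
import random


def simple_random_sample(population_size, sample_size, seed=None):
    """Draw a simple random sample of unit numbers from 1..population_size.

    Sampling is without replacement, so no unit number is drawn twice.
    """
    rng = random.Random(seed)               # separate generator; seed makes runs repeatable
    frame = range(1, population_size + 1)   # numbered sampling frame: 1, 2, ..., N
    return sorted(rng.sample(frame, sample_size))


# Hypothetical use: select 10 units from a numbered frame of 250 units.
chosen = simple_random_sample(population_size=250, sample_size=10, seed=42)
print(chosen)
```

Leaving seed as None gives a different sample on each run, which is what a real draw would need; a fixed seed is used here only so that the sketch can be rerun and checked.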
Note that: In SRS, when we apply the table of random numbers, we have to ignore repeated numbers and those lying above the range of the population size. The following table shows random numbers generated by a computer.

731 065 777 796 870 963 130 610
759 454 704 173 030 130 611 005
796 465 951 662 591 414 219 145
343 330 606 637 765 155 590 333
873 496 739 665 456 265 126 687
034 005 258 910 055 349 929 365
984 496 905 172 400 609 844 408
846 838 362 542 485 489 230 221
293 378 496 696 911 898 308 662
250 825 716 795 080 180 487 769
074 750 467 029 647 057 017 108
798 719 839 769 780 814 610 744
629 042 308 361 067 619 658 839
744 159 596 527 650 205 151 875
325 634 664 409 052 842 734 503
675 794 821 221 194 412 879 012
804 975 965 539 105 841 188 430
132 407 945 213 351 859 816 246
321 714 049 895 120 705 025 756
235 042 620 205 048 563 859 040

Example: Suppose a researcher wants to know the impact of microfinance on the clients' household income. S/he wishes to select 10 clients out of 250 clients and a research assistant is required to select a random sample. Assuming that you are the research assistant, select a simple random sample of 10 clients.

Solution:
1. Number each client from 1 to 250 (based on the alphabetical order of their names or on their identity numbers).
2. Using the random numbers shown above, find the starting point. To find the starting point, one generally closes one's eyes and places one's finger anywhere on the table. In this case, let us select number 005 in the 6th row and 2nd column.
3. Going down the column and continuing from the top of the next columns, select the first 10 numbers that lie between 001 and 250, skipping larger numbers and any number already selected (042 appears a second time at the bottom of the 2nd column and is ignored).
4. The numbers are 005, 042, 159, 049, 173, 172, 029, 221, 213 and 205. Therefore, clients with these numbers will be included in the sample for further analysis.

b) Systematic Sampling (Quasi-random sampling): In systematic sampling, the elements to be included in a sample are picked at a constant interval.
That is, the items or individuals of the population are arranged in some order, a random starting point is selected from 1 through k (where k = population size N / sample size n), and then every kth member of the population is selected for the sample.

In systematic sampling:
• A complete list of all the elements within the population (sampling frame) is required.
• The procedure is to take every kth item from the sampling frame.
• Let N = population size, n = sample size and k = sampling interval, with k = N/n.
• Choose any number between 1 and k. Suppose it is j (1 ≤ j ≤ k).
• The jth unit is selected first, then the (j+k)th, then the (j+2k)th, and so on, until the required sample size is reached.

Example 1: Suppose there are 2000 subjects in the population and a sample of 50 subjects is needed. Select a systematic sample of these 50 subjects.
Solution: The sampling interval (k) is 40 (2000/50). The number of the first subject to be included in the sample is chosen randomly, for example, by blindly picking one out of 40 pieces of paper numbered 1 to 40. Suppose subject 12 is the first subject selected; then the sample consists of the subjects whose numbers are 12, 52, 92, etc., until 50 subjects are obtained. A sample chosen this way is not strictly random: although every member of the population has the same chance (1/k) of being selected, not every possible combination of n members can be chosen, because once the starting point is fixed the rest of the sample is determined.

Example 2: Suppose a researcher wants to know the impact of microfinance on the clients' household income. S/he wishes to select 10 clients out of 250 clients and a research assistant is required to select a systematic sample. Assuming that you are the research assistant, select a systematic sample of 10 clients.
Solution:
1. Number each client from 1 to 250 (based on the alphabetical order of their names or on their identity numbers).
2. Since there are 250 clients and 10 are to be selected, the rule is to select every 25th client. This rule is determined by dividing 250 by 10, which gives 25.
3. The number of the first subject to be included in the sample is chosen randomly from the numbers 1 to 25. In this case let us select number 5.
4. Then select every 25th number on the list starting from 5. The numbers are 5, 30, 55, 80, 105, 130, 155, 180, 205 and 230. Therefore, clients with these numbers will be included in the sample for further analysis.
Note: The answer is not unique, as it depends on which number is picked as the first subject.

Advantages of Systematic Sampling:
• It is less time consuming and easier to perform than SRS.
• It is more convenient to use than SRS.
• It provides a good approximation to SRS.

Disadvantages of Systematic Sampling:
• If there is any sort of cyclic ordering of the subjects, the sample will not be representative of the population.

Example: If subjects in the population are arranged in a manner such as
1) Defective item
2) Non-defective item
3) Defective item
4) Non-defective item
5) etc.,
the selection of the starting point could produce a sample of all defective items or all non-defective items, depending on whether the number to be added (k) is even or odd. For example, a defective starting item with an even k gives a sample of all defective items, and a non-defective starting item with an even k gives a sample of all non-defective items.

Example: Moha Company stores boxes containing Pepsi and Mirinda in the following order:
1) Box containing Pepsi
2) Box containing Mirinda
3) Box containing Pepsi
4) Box containing Mirinda
and so on, up to box 200.
The quality department of the company would like to check the expiry date of the products by taking a systematic sample of 40 boxes containing either Pepsi or Mirinda. Assume that you are working in the quality department of the company and select the systematic sample required. Is the sample you selected representative?
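The systematic selection rule (units j, j+k, j+2k, ...) can be sketched in a few lines of Python. This is an illustration under the assumption that N is an exact multiple of n; the function name systematic_sample and its arguments are hypothetical and chosen only for this example.

```python
import random

def systematic_sample(population_size, sample_size, start=None, seed=None):
    """Return unit numbers j, j+k, j+2k, ... with interval k = N // n."""
    k = population_size // sample_size            # sampling interval
    if start is None:
        start = random.Random(seed).randint(1, k) # random start j, 1 <= j <= k
    return [start + i * k for i in range(sample_size)]

# Example 2 of the text: N = 250, n = 10, starting point 5
clients = systematic_sample(250, 10, start=5)
print(clients)

# Cyclic ordering: odd-numbered boxes hold one product, even-numbered the other.
# With an even interval (here k = 200 // 50 = 4) only one product is ever picked.
boxes = {i: "Pepsi" if i % 2 == 1 else "Mirinda" for i in range(1, 201)}
picked = {boxes[i] for i in systematic_sample(200, 50, start=1)}
print(picked)
```

The second part uses a sample of 50 boxes rather than the 40 of the exercise, purely to obtain an even interval and show the hazard described under the disadvantages.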
c) Stratified Sampling: In stratified sampling, a population is first divided into subgroups, called strata (singular: stratum), and a sample is selected from each stratum by simple random or systematic sampling. The strata are formed so that the units within each stratum are homogeneous with respect to characteristics such as sex, race, region or institutional affiliation (for example, faculty). This sampling method is appropriate when the distribution of the characteristic to be studied is strongly affected by certain variables.
Note: Stratified sampling is applied if the population is heterogeneous.

Stratified sampling can be proportionate or non-proportionate. In the latter case, an equal number of elements is drawn from each stratum, while in the former case a proportionate number is drawn.

a) Proportionate Stratified Sampling: The number of units selected from each stratum is directly proportional to the size of the stratum. If Pi represents the proportion of the population included in stratum i, and n represents the total sample size, the number of elements selected from stratum i is n × Pi.

Examples:
1) Let us suppose that we want a sample of size 30 to be drawn from a population of size 8000 which is divided into three strata of size 4000, 2400 and 1600. Adopting proportional allocation, find the sample size for each stratum.
Solution: We get the sample sizes for the different strata as follows:
a. N1 = 4000, so P1 = 4000/8000 = 0.5 and hence n1 = n × P1 = 30 × 0.5 = 15
b. N2 = 2400, so P2 = 2400/8000 = 0.3 and hence n2 = n × P2 = 30 × 0.3 = 9
c. N3 = 1600, so P3 = 1600/8000 = 0.2 and hence n3 = n × P3 = 30 × 0.2 = 6
Check: N = N1 + N2 + N3 = 8000, P1 + P2 + P3 = 1 and n1 + n2 + n3 = 15 + 9 + 6 = 30.
Thus, using proportional allocation, the sample sizes for the different strata are 15, 9 and 6 respectively, which is in proportion to the sizes of the strata, namely 4000:2400:1600.
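The proportional allocation rule n × Pi can be checked with a short Python sketch. The function name proportional_allocation is an illustrative assumption; the figures are those of the worked example above.

```python
def proportional_allocation(stratum_sizes, sample_size):
    """Return n * (N_i / N) for each stratum size N_i."""
    total = sum(stratum_sizes)                        # N = N1 + N2 + ...
    return [sample_size * size / total for size in stratum_sizes]

allocation = proportional_allocation([4000, 2400, 1600], 30)
print(allocation)        # [15.0, 9.0, 6.0]
print(sum(allocation))   # 30.0
```

In this example every share is a whole number. In general the shares may be fractional and have to be rounded, in which case the rounded values should be checked so that they still add up to n.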
2) In a class of students, you can stratify the whole class on the basis of gender (F or M). You would then draw either an equal number of students from each group (disproportionate) or an unequal number of students from each group, depending on the proportion of males to females in the original class list (proportionate). Let us take a numerical example: If there are 50 students in a class, of which 10 are female, and 10 students are needed for some study,
a) select a proportionate stratified sample of 10 students (8M, 2F)
b) select a disproportionate stratified sample of 10 students (5M, 5F)

Advantage:
• The representativeness of the sample is improved.
Disadvantages:
• If there are many variables of interest, dividing a large population into representative subgroups requires a great deal of effort.
• If the variables are somewhat complex or ambiguous (such as beliefs, attitudes, etc.), it is difficult to separate individuals into subgroups according to these variables.

Example (class work): Using the population of 20 students given below, select a sample of 8 students on the basis of gender (female/male) and grade level (freshman/sophomore).

S.No  Name      Gender  Grade level
1     Abebe     M       Fr
2     Bekele    M       So
3     Birtukan  F       Fr
4     Chaltu    F       Fr
5     Dagmawit  F       Fr
6     Dagne     M       Fr
7     Huluka    M       Fr
8     Lulit     F       So
9     Melaku    M       So
10    Mohammed  M       So
11    Melat     F       Fr
12    Nigussie  M       Fr
13    Petros    M       So
14    Rosa      F       So
15    Regassa   M       Fr
16    Selam     F       Fr
17    Solomon   M       So
18    Tigist    F       So
19    Tibeyin   F       So
20    Tirhas    F       So

Solution:
Steps:
1) Divide the population into two groups based on gender.
2) Divide each subgroup further into two groups of freshman and sophomore.
3) Determine how many students need to be selected from each subgroup to have a proportional representation of each subgroup in the sample. There are four groups of equal size and, since a total of eight students are needed for the sample, two students must be selected from each subgroup.
4) Select two students from each group by using SRS or systematic sampling.

Solution:
1) Divide the population into two groups based on gender as shown below:

Males
S.No  Name         Gender  Grade Level
1     Abebe K.     M       Fr
2     Bekele M.    M       So
3     Dagne K.     M       Fr
4     Huluka G.    M       Fr
5     Melaku J.    M       So
6     Mohammed A.  M       So
7     Nigussie K.  M       Fr
8     Petros L.    M       So
9     Regassa K.   M       Fr
10    Solomon K.   M       So

Females
S.No  Name         Gender  Grade Level
11    Melat A.     F       Fr
12    Lulit L.     F       So
13    Birtukan L.  F       Fr
14    Rosa M.      F       So
15    Chaltu C.    F       Fr
16    Selam A.     F       Fr
17    Dagmawit B.  F       Fr
18    Tigist M.    F       So
19    Tibeyin Y.   F       So
20    Tirhas W.    F       So

2) Divide each subgroup further into two groups of freshman and sophomore as shown below:

Group 1
S.No  Name         Gender  Grade Level
1     Abebe K.     M       Fr
2     Dagne K.     M       Fr
3     Huluka G.    M       Fr
4     Nigussie K.  M       Fr
5     Regassa K.   M       Fr

Group 2
S.No  Name         Gender  Grade Level
1     Melat A.     F       Fr
2     Birtukan L.  F       Fr
3     Chaltu C.    F       Fr
4     Selam A.     F       Fr
5     Dagmawit B.  F       Fr

Group 3
S.No  Name         Gender  Grade Level
1     Mohammed A.  M       So
2     Melaku J.    M       So
3     Petros L.    M       So
4     Solomon K.   M       So
5     Bekele M.    M       So

Group 4
S.No  Name         Gender  Grade Level
1     Lulit L.     F       So
2     Rosa M.      F       So
3     Tigist M.    F       So
4     Tibeyin Y.   F       So
5     Tirhas W.    F       So

3) Determine how many students need to be selected from each subgroup to have a proportional representation of each subgroup in the sample. There are four groups of five students each and, since a total of eight students are needed for the sample, two students must be selected from each subgroup.
4) Select two students from each group by using random numbers. In this case we can select the following students: Group 1: Students 5 & 4, Group 2: Students 5 & 2, Group 3: Students 1 & 3, Group 4: Students 3 & 4.
5) The stratified sample then consists of the following students:

S.No  Name         Gender  Grade Level
1     Nigussie K.  M       Fr
2     Regassa K.   M       Fr
3     Mohammed A.  M       So
4     Petros L.    M       So
5     Birtukan L.  F       Fr
6     Dagmawit B.  F       Fr
7     Tigist M.    F       So
8     Tibeyin Y.   F       So

d) Cluster Sampling: If the population is homogeneous and very large, or resides in a large area, it is costly and time consuming to take samples by using the three methods mentioned above. In this case, we divide the population into groups called clusters and then select representative clusters randomly. Finally, the sample is taken from the selected clusters. We can take either all members of the selected clusters or select samples from those clusters by using other sampling techniques.

Procedures:
1) The reference population is divided into clusters or subgroups, preferably similar in size.
2) A sample of the clusters is taken by random or systematic sampling.
3) All the units in the selected clusters are then studied, or a sample is selected from each cluster.
If only part of the elements in each cluster is included in the sample, the procedure is called two-stage sampling. The first stage is selecting a sample of clusters and the second stage is selecting a sample of elements from each selected cluster.

Advantages:
• A list of all individual study units in the reference population is not required.
• It reduces cost, simplifies field work and is convenient.
Disadvantages:
• The members of a cluster are often more homogeneous than the members of the whole population and, therefore, the sample may not be representative.
• The elements in a cluster may not have the same variation in characteristics as elements selected individually from the population.

e) Multi-Stage Sampling: is a sampling technique that is used when the reference population is large and widely scattered. Selection of samples is done in stages until the final sampling unit is obtained. The number of stages of sampling is the number of times a sampling procedure is carried out. The primary sampling unit (PSU) is the sampling unit in the first sampling stage and the secondary sampling unit (SSU) is the sampling unit in the second sampling stage, etc.
For example: the PSU can be the weredas, the SSU can be the kebeles, etc. From the PSUs, we select a sample based on a suitable method; each of the selected PSUs is further sub-divided into second stage units (say kebeles), and from these SSUs a sample is again taken by some suitable method. Further stages may be added if required.

Example: A multistage sampling procedure was used to conduct a research entitled "Health Service Utilization in Amhara Region of Ethiopia."
Procedures followed:
• The previous provinces of Gondar, Gojjam and Wollo were each divided into two zones.
• One of the two Gondar zones, one of the two Gojjam zones and one of the two Wollo zones were randomly selected. Later one more zone, North Shoa, was included (four zones in total).
• Two districts were selected from each zone, except North Shoa, where only one district was selected (seven districts in total).
• Two rural kebeles and one urban kebele were chosen from each selected district (14 rural kebeles and 7 urban kebeles).

Advantages
• Cuts the costs of preparing a sampling frame.
Disadvantages
• Gives less precise estimates than SRS for the same sample size.

2.3.2. Methods of Non-Probability Sampling
Non-Probability Sampling: In non-probability sampling, not every unit in the population has a known chance of being included in the sample, and the process involves at least some degree of personal subjectivity instead of following predetermined, probabilistic rules for selection. This sampling technique is:
• used when a sampling frame doesn't exist,
• a non-random selection (and so likely to be unrepresentative),
• inappropriate if the aim is to measure variables and generalize findings,
• easier, quicker and cheaper to carry out than probability designs.

There are three non-probability sampling methods. These are:
a) Convenience Sampling: is a method in which a sample is chosen with ease of access being the primary concern. Example: Interviews conducted in convenient locations such as a student lounge.
b) Purposive (Judgmental) Sampling: the researcher exercises deliberate subjective choice in drawing the sample, selecting the units s/he regards as most informative for the study at hand.
c) Quota Sampling: is a method that ensures that a certain number of sample units from different categories with specific characteristics are represented. Here, judgmental and convenience sampling methods are combined. Quota sampling can be applied for affirmative action. Example: Suppose we know that 54% of the adults in a community are females, and the study requires 100 respondents as a sample. In quota sampling, we might interview the first 54 females and the first 46 males.

2.4. Errors in Sampling
There are two types of errors:
1. Sampling error: is the discrepancy between the population value (parameter) and the sample value (statistic). It may arise from applying an inappropriate sampling technique. It can be reduced by increasing the size of the sample. When n = N, sampling error = 0.
2. Non-sampling error (bias): arises from procedural bias such as:
• subjects' non-response,
• incorrect responses,
• problems with the sampling frame,
• measurement error,
• errors at different stages in processing the data.

Ways to reduce data error
• Ensure that survey instruments are well prepared, simple to read, and easy to understand.
• Properly select and train interviewers to control data gathering bias or error.
• Use sound editing, coding, and tabulating procedures to reduce the possibility of data processing error.

Review Exercises
1) What are the reasons for sampling? Discuss and give an example for each reason.
2) Differentiate between a parameter and a statistic. Which one is the result of taking a sample?
3) Define systematic sampling and explain how it is carried out. Describe how you would obtain a systematic sample of 80 students from a population of 1600 students.
4) Briefly explain the difference between the following concepts and give examples, if necessary.
   • Sampling vs. Census
   • Cluster sampling vs. Stratified sampling
   • Sampling frame vs. Sampling unit
5) Assume that you are going to undertake research on Ethiopian culture. Before taking a sample, you observe that the cultures are highly diversified and large in number. Which type of sampling method are you going to use so that your sample will represent all the cultures? Why?
6) Briefly explain cluster sampling. For which type of population is it the preferred method of selecting a sample?
7) Assume that there are 500 students in FBE, DMU, in five departments with 150, 100, 50, 150 and 50 students respectively. Assume that 20 students are to be selected from these five departments for a scholarship based on probability sampling. Further assume that students from all departments have an equal chance of being selected, i.e., departments with a large number of students will send more students than others. If you are assigned to select 20 students from FBE, then
   a) Which type of sampling method are you going to use?
   b) Determine the sample size to be selected from each department.
8) To study the reaction of students to a policy issued by a college, a sample of 100 students is required. The number of male students in the college is 1000 and the number of female students is 1500. If you want to select your sample of 100 students using proportional allocation, how many students of each sex should you include in your sample?
9) Suppose you are a Woreda administrator having five kebeles with respective population sizes 10000, 5000, 15000, 20000 and 50000. If you are supposed to select 1000 representatives of the Woreda, determine the number of individuals to be selected in each kebele so that your selection is fair.
10) Classify each of the following samples as simple random, systematic, stratified or cluster.
   a. In a large school district, all teachers from two buildings are interviewed to determine whether they believe the students have less homework to do now than in previous years.
   b. Every 7th customer entering a shopping mall is asked to select his or her favorite shoes.
   c. Nursing supervisors are selected using random numbers to determine annual salaries.

Choose the best answer from the given choices and encircle it.
1. Which of the following is not a reason for sampling?
   a. The destructive nature of certain tests
   b. Sometimes a census is impossible
   c. The adequacy of sample results
   d. None
2. If n = N, then the sampling error is:
   a. Less than zero
   b. Greater than zero
   c. Equal to zero
   d. None of the above
3. Which of the following is a method of non-probability sampling?
   a. Simple random sampling
   b. Systematic sampling
   c. Stratified sampling
   d. Quota sampling
4. A sample size
   a) Is the number of sampling units included in the sample
   b) Has more than 30 observations
   c) Is usually identified as n
   d) All of the above
5. In a simple random sample
   a) Every kth item is selected to be in the sample
   b) Every item has a chance to be in the sample
   c) Every item has the same chance to be in the sample
   d) All of the above

CHAPTER THREE: DATA COLLECTION AND PRESENTATION
Chapter Objectives: When you have completed this chapter, you will be able to:
• identify the types of data,
• identify the sources of data,
• convert raw data into a data array,
• organize data using frequency distributions,
• visually represent data using graphs and charts.

3.1. Data collection
3.1.1 What is data?
Data are the facts or values that variables assume. Data are raw facts that will be used to draw a conclusion or make a decision. They are a raw description of a variable, obtained by measuring or counting and ready to be analyzed. In research, statisticians use data in many different ways. Data can be used to describe situations or events or to make an inference.

3.1.2 Classification of data
Data are classified as:
i) Quantitative or qualitative data
ii) Primary or secondary data
iii) Time series or cross-sectional data

i) Quantitative or qualitative data
a. Quantitative data are data that are expressed numerically, that is, numerical observations of variables. Example: age, Grade Point Average (GPA), sales, etc. Valid computations such as the mean, variance, etc. are possible in the case of quantitative data.
b. Qualitative data are data that are non-numeric. Example: marital status (married, single, widowed, divorced), race (Asian, African, etc.), gender (male/female), blood type (A, B, O, AB). Valid computation: proportions in each category are possible. Example: What percent of the students in this class are female?

ii) Primary or secondary data
a) Primary data: Data originally collected by the investigator himself or herself for the purpose of the specific inquiry or problem at hand, that is, data generated from a primary source.
b) Secondary data: When an investigator uses data which have already been collected by others, such data are called "secondary data." They are data generated from a secondary source, that is, by someone else for some other purpose. Secondary data can be obtained from journals, reports, government publications, publications of professionals and research organizations, the internet, videos, libraries, etc. One must be very careful before using secondary data, as they may contain errors such as transcribing errors, estimating errors, errors due to bias, etc.

iii) Cross-sectional or time series data
a. Cross-sectional data: Data collected from a population at a given point in time. Example: The data collected on the households of a town in 2001 can be presented as cross-sectional data as follows:

Observation Number  Monthly household income in Birr
1                   200
2                   300
3                   189

b. Time series data: Data collected over time on one or more variables. Example:

Year  Unemployment rate
1950  5%
1951  8%
1952  10%

3.1.3. Methods of data collection
Sources of data
There are two sources of data: primary sources and secondary sources.
i. Primary sources
A primary source is a source of data that provides first hand information for immediate use. Data collected from primary sources are called primary data. They are new data which did not exist before and for which the researcher receives full credit.
ii. Secondary sources
Secondary sources are individuals or agencies which provide data originally collected for another purpose, by them or by others. Usually they are published or unpublished materials, records, reports, magazines, market reports, etc. Secondary data do not originate with the investigator himself or herself but are obtained from someone else's records. Compared to primary data, which are costly but more accurate and reliable, secondary data are less costly and less accurate. Primary data can become secondary data when someone else uses them. Secondary sources exist as a store of previously collected information. Example: archival or library sources, published books, unpublished documents, videos, the internet, annual reports, statistical abstracts, censuses of population, economic censuses, etc.

Methods of collecting primary data
a) Survey research
b) Experimentation
c) Observational research

a) Survey research
In survey research, we communicate with a sample of individuals in order to generalize about the characteristics of the population from which the sample was drawn.
Types of surveys: The three most common surveys are:
i. The mail survey: It can be conducted by electronic mail (e-mail) or through the post office.
A questionnaire is a set of questions printed on paper. Questionnaires are groups or sequences of questions designed to collect data on a subject. The questionnaire is either filled out personally by the respondent or administered and completed by an interviewer.
Types of questions:
• Multiple choice
• Dichotomous (having only two choices: yes/no, female/male, etc.)
• Open-ended (where the respondents are free to give any response)

Characteristics of mail survey
• A detailed questionnaire, once drafted, can be mailed to the respondents for filling in, or can be put in the charge of enumerators who go around and fill it in after obtaining the desired observations.
• It is relatively less costly.
• The individual must be literate to give an appropriate response.
• Non-response error may be high if mailing is costly.
• This survey can be used to cover a wider geographic area than telephone surveys or personal interviews, since mailed questionnaire surveys are less expensive to conduct.
• It tends to yield a low number of responses and some inappropriate answers to questions.
• It has a low return rate.
• Some people may have difficulty in reading or understanding the questions.

Enumerator method
Here, a questionnaire is designed, but selected agents called "enumerators" do the task of filling in the questionnaire.
• The method can be adopted even if the respondents are illiterate.
• It is more expensive than the mailed questionnaire method.
• Non-response is low.

ii. The personal interview: It is an oral questioning of respondents, either individually or in a group.
Characteristics of personal (face-to-face) interview
• It tends to be relatively expensive and time consuming and hence is not ideal for a large group of informants.
• It offers a lot of flexibility in allowing the interviewer to explain questions and to probe more deeply into the answers provided.
• It is more accurate and reliable.
• It maximizes trust and cooperation between the interviewer and the interviewee.
• It has a higher rate of response.
• It decreases refusals.
• The investigator presents himself or herself personally before the informant and questions carefully.
• It is useful in situations where a study of great depth is required.
• In a face-to-face interview, the interviewer can see and assess the respondent's non-verbal behavior.
• Face-to-face interviews can take place with respondents who don't have phones or the ability to read a mailed questionnaire.
iii. The telephone interview
Characteristics of telephone interview
• It is similar to the personal interview, but uses the telephone instead of personal interaction.
• It makes it possible to complete a study in a relatively short span of time.
• It has a high response rate.
• It is less effective in a community with few telephone lines.
• It is less costly than the personal interview.
• A major drawback is that some people in the population may not have phones or may not be at home when the calls are made. Hence, not all people have a chance of being surveyed.

b) Experimentation
In experimentation, researchers are interested in identifying cause and effect relationships between variables. We carry out an experiment and record its results.

c) Observational Research
We see what is happening and record it, e.g. traffic accidents.
• Observation relies on watching or listening, and then counting or measuring.
• There are no respondents.
• It is time consuming and expensive.

3.2. Data Presentation
3.2.1. Tabular Methods of Data Presentation
Tabulation is the arrangement of information or data in tables. There are various techniques of tabulation.

a) Data Array is a table showing data arranged in descending or ascending order.
Descending (100, 99, 98, 97, ...)
Ascending (1, 2, 3, 4, 5, 6, 7, 8, 9, ...)
Examples:
• An alphabetical list of post office box renters can be considered a data array of qualitative information.
• A list of monthly incomes recorded for several years and arranged in descending or ascending order is a data array.
In general, the data array offers a number of advantages:
a) We can determine at a glance the highest and lowest values contained in the data.
b) We can identify groups of similar data values.
c) We can easily see differences between values in the data.
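Building a data array amounts to sorting the raw values. As a small sketch in Python, the 50 household income values used as the example data set in this section can be arranged and summarized as follows (the variable names are illustrative):

```python
# Raw (ungrouped) household income data, 50 observations
incomes = [
    112, 100, 127, 120, 134, 105, 110, 118, 109, 112,
    110, 118, 117, 116, 118, 114, 114, 122, 105, 109,
    107, 112, 114, 115, 118, 118, 122, 117, 106, 110,
    116, 108, 110, 121, 113, 119, 111, 120, 104, 110,
    120, 113, 120, 117, 105, 118, 112, 110, 114, 114,
]

ascending = sorted(incomes)                  # data array, smallest value first
descending = sorted(incomes, reverse=True)   # data array, largest value first
data_range = max(incomes) - min(incomes)     # range = maximum - minimum

print(ascending[0], ascending[-1], data_range)   # 100 134 34
```

Once the values are sorted, the lowest and highest values sit at the two ends of the array, which is the first advantage listed above.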
Given the following data set on Household Income (raw or ungrouped data):

112 100 127 120 134 105 110 118 109 112
110 118 117 116 118 114 114 122 105 109
107 112 114 115 118 118 122 117 106 110
116 108 110 121 113 119 111 120 104 110
120 113 120 117 105 118 112 110 114 114

Data Array
Ascending order (read across the rows):

100 104 105 105 105 106 107 108 109 109
110 110 110 110 110 110 111 112 112 112
112 113 113 114 114 114 114 114 115 116
116 117 117 117 118 118 118 118 118 118
119 120 120 120 120 121 122 122 127 134

Descending order: take the ascending array in reverse.
Maximum data value = 134
Minimum data value = 100
Range = 134 - 100 = 34

b) Frequency Distribution
A frequency distribution is a table that groups data into non-overlapping intervals called classes and records the number of observations in each class. The frequency distribution summarizes data in a condensed form that can be readily understood and easily interpreted.

Key Terms in frequency distribution
Class: each category of the frequency distribution is called a class.
Frequency: the number of data values falling within each class.
Total frequency: the sum of the class frequencies. If the classes are $x_1, x_2, \ldots, x_k$ with frequencies $f_1, f_2, \ldots, f_k$ (k being the number of classes), then $f_1 + f_2 + \cdots + f_k$ = total frequency, that is, $\sum_{i=1}^{k} f_i$ = total frequency = n = number of observations (sample size).
Class limits: the boundaries for each class. These determine which data values are assigned to that class. Class limits can be lower or upper class limits, and they have the same number of decimal places as the data values.
Class boundaries: those limits which are determined mathematically so that no gap exists between classes. They are also called true class limits.
Class interval: the width of each class. This is the difference between the lower (or upper) limit of the class and the lower (or upper) limit of the next higher class.
Approximate class width = Range / (number of classes desired), where Range = Maximum value - Minimum value.
Class mark: the midpoint of each class. This is midway between the upper and lower class limits.

Guidelines for the frequency distribution
In constructing a frequency distribution for a given data set, the following guidelines should be followed.
a) The set of classes must be mutually exclusive. That is, a given data value should fall into only one class/category. There should be no overlap between classes, and limits such as the following would be inappropriate:

Class   Frequency
15-20   4
20-25   5

This is not allowed, since a value of 20 could fit into either class and a clear boundary has to be set.

Class       Frequency
17.0-23.5   5
22.0-28.5   10

This is also not allowed, since there is an overlap between the classes. If we have a data value of, say, 22, in which class shall we group it? It fits both classes, and this problem must be avoided.

b) The classes must be exhaustive. That is, we have to include all possible data values. No data value should fall outside the range covered by the frequency distribution. In the data set given above, maximum data value = 134 and minimum data value = 100. If the last class has class limits of 128-133, then the distribution is not exhaustive (complete), as the maximum value (134) is not included in the classes.

c) If possible, the classes should have equal widths. Unequal class widths make it difficult to interpret both the frequency distribution and its graphical presentation. One exception occurs when there is an open-ended distribution, i.e., one that has no specific beginning value or no specific ending value. Example:

Class
< 10 (any value below 10 will be tallied in this class)
10 - 20
21 - 31
32 - 42
43 - 53
54 - 64
≥ 65 (values of 65 and above will be tallied in the last class)

Generally, open-ended classes are classes with either no lower limit or no upper limit: the lowest class lacks a lower limit or the highest class lacks an upper limit.
d) Selecting the number of classes to use. There is no hard and fast rule for determining the number of classes of a data set; it is a subjective process. If we have too few classes, important characteristics of the data may be buried within the small number of categories. If there are too many classes, many categories will contain either zero or only a small number of values. In general, 5 to 20 classes are recommended.
e) When possible, class widths should be round numbers (e.g. 5, 10, 25, 50, 100, etc.).
f) If possible, avoid using open-ended classes.

Example: The following frequency distribution is given.
Births (per 1000 population) | Number of countries (f)
10-15 | 29
15-20 | 8
20-25 | 10
25-30 | 12
30-35 | 10
35-40 | 4
40-45 | 18
45-50 | 12
50-55 | 2
Total | 105

Take the class limits 20-25, that is, [20, 25) or 20 ≤ X < 25. All values within this class are at least 20 but less than 25. The frequency of the class 20 ≤ X < 25 is 10, which is the number of countries with a birth rate in this category.
Class interval/width is the difference between the lower class limit of a class and that of the next higher class (25 - 20 = 5).
Class mark = $\dfrac{\text{lower limit} + \text{upper limit}}{2} = \dfrac{20 + 25}{2} = 22.5$

Types of frequency distributions
There are three types of frequency distribution tables: a) the absolute frequency; b) the relative frequency; c) the cumulative frequency.

a) Absolute frequency: An absolute frequency distribution table shows the absolute number of occurrences of an entry or group of entries in a data set. To construct an absolute frequency distribution table, list all the scores in the first column and count the number of times each score occurs in the original data set. Record this count against each item in the second column.

b) Relative frequency: The relative frequency distribution table shows the number of occurrences of each item or class of items in the data set as a proportion of the total number of observations. This can be expressed in decimal, fraction or percentage form.
$RF = \dfrac{AF}{TF} = \dfrac{AF}{n}$, where RF = relative frequency, AF = absolute frequency and TF = total frequency (the total number of observations, that is, n).

c) Cumulative frequency: The cumulative frequency distribution table shows the absolute frequencies added up at each successive class in the data set. Alternatively, one can use the relative cumulative frequency table, which is based on relative frequencies.

Given the following frequency distribution:
Class limits | Class boundaries | Absolute frequency | Cumulative frequency | Relative frequency | Cumulative relative frequency
24-30 | 23.5-30.5 | 3 | 3 | 3/25 | 3/25
31-37 | 30.5-37.5 | 1 | 4 (= 3 + 1) | 1/25 | 4/25
38-44 | 37.5-44.5 | 5 | 9 | 5/25 | 9/25
45-51 | 44.5-51.5 | 9 | 18 | 9/25 | 18/25
52-58 | 51.5-58.5 | 6 | 24 | 6/25 | 24/25
59-65 | 58.5-65.5 | 1 | 25 | 1/25 | 25/25
Total | | 25 | | 25/25 = 1 |

[23.5, 30.5) implies that all values within the class are at least 23.5 but less than 30.5.
With k classes, $\sum_{i=1}^{k} f_i = n = 25$ and $\sum_{i=1}^{k} \dfrac{f_i}{n} = 1$.
For example, 4 values are less than 37.5 and 22 values are at least 30.5.

The class boundaries in the second column are used to separate the classes so that there are no gaps in the frequency distribution. The basic rule of thumb is that the class limits should have the same decimal place value as the data, but the class boundaries should have one additional place value and end in a 5.
Example: lower limit - 0.5 = 31 - 0.5 = 30.5 => lower boundary
upper limit + 0.5 = 37 + 0.5 = 37.5 => upper boundary

The "less than" and "more than" cumulative frequencies
The "less than" cumulative frequency of a class is the total frequency of all values less than the upper boundary of the class. The "more than" cumulative frequency of a class is the total frequency of all values greater than the lower boundary of the class.
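The two cumulative frequencies just defined can be sketched in a few lines of Python. This is only an illustration, assuming the absolute frequencies 3, 1, 5, 9, 6, 1 of the table above; the function name `cumulative_frequencies` is hypothetical and not part of the original text.

```python
from itertools import accumulate

def cumulative_frequencies(freqs):
    """Return ("less than", "more than") cumulative frequencies.

    freqs holds the absolute class frequencies, lowest class first.
    """
    # "Less than": running total from the lowest class upwards,
    # read at each upper class boundary.
    less_than = list(accumulate(freqs))
    # "More than": running total from the highest class downwards,
    # read at each lower class boundary.
    more_than = list(accumulate(reversed(freqs)))[::-1]
    return less_than, more_than

# Absolute frequencies of the classes 24-30, 31-37, ..., 59-65 (assumed input).
freqs = [3, 1, 5, 9, 6, 1]
less_than, more_than = cumulative_frequencies(freqs)

print(less_than)   # [3, 4, 9, 18, 24, 25]
print(more_than)   # [25, 22, 21, 16, 7, 1]
```

The second entries reproduce the two readings quoted with the table: 4 values lie below 37.5 and 22 values lie above 30.5.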
Example:
Class limits | Class boundaries | Upper boundary | Absolute frequency | Relative frequency | Less than cumulative frequency | Lower boundary | More than cumulative frequency
100-104 | 99.5-104.5 | 104.5 | 2 | 0.04 (= 2/50) | 2 | 99.5 | 50
105-109 | 104.5-109.5 | 109.5 | 8 | 0.16 | 10 | 104.5 | 48
110-114 | 109.5-114.5 | 114.5 | 18 | 0.36 | 28 | 109.5 | 40
115-119 | 114.5-119.5 | 119.5 | 13 | 0.26 | 41 | 114.5 | 22
120-124 | 119.5-124.5 | 124.5 | 7 | 0.14 | 48 | 119.5 | 9
125-129 | 124.5-129.5 | 129.5 | 1 | 0.02 | 49 | 124.5 | 2
130-134 | 129.5-134.5 | 134.5 | 1 | 0.02 | 50 | 129.5 | 1
Total | | | 50 | 1 | | |

Example: The following data are given on the monthly household income of a community. Construct a frequency distribution and
a) calculate the absolute, relative and cumulative frequencies;
b) calculate the less than and the more than cumulative frequencies;
c) interpret the values found in (a) and (b) above.

Data set (n = 50):
112 100 127 120 134 105 110 118 109 112
110 118 117 116 118 114 114 122 105 109
107 112 114 115 118 118 122 117 106 110
116 108 110 121 113 119 111 120 104 110
120 113 120 117 105 118 112 110 114 114

Solution: Steps:
1. Array the data.
2. Determine the number of classes. Rules of thumb:
i) Recommended number of classes (based on the number of observations):
Number of observations | Number of classes
< 50 | 5-7
50-200 | 7-9
200-500 | 9-10
500-1000 | 10-11
1000-5000 | 11-13
5000-50,000 | 13-17
> 50,000 | 17-20
So, the recommended number of classes for this data set can be 7.
ii) We could use Sturges' formula to determine the number of classes (k): $k = 1 + 3.322 \log n$, where n is the number of observations. In this case, log 50 ≈ 1.7, so k = 1 + 3.322 × 1.7 = 1 + 5.64 = 6.64, which rounds up to 7.
iii) Apply the $2^k$ rule: this guide suggests selecting the smallest number k for the number of classes such that $2^k$ is greater than the number of observations n. Here n = 50: $2^5 = 32 < 50$ and $2^6 = 64 > 50$, so the recommended number of classes is 6.
3.
Determine the class interval/width.
Width = Range / Number of classes
Highest value = 134, lowest value = 100, Range = 134 - 100 = 34, k recommended = 7, so width = 34/7 ≈ 4.9, which is rounded up to the nearest whole number, 5.
4. Select a starting point for the lowest class limit. This can be the smallest data value or any convenient number less than the smallest data value. In this case let us use 100 as the starting point. Add the width to the starting point to get the lower limit of the next class. Keep adding until there are 7 classes. Subtract one unit from the lower limit of the second class to get the upper limit of the first class. Then add the width to each upper limit to get all the upper limits.
105 - 1 = 104
1st class = 100-104
2nd class = 105-109, etc.
Find the class boundaries by subtracting 0.5 from each lower class limit and adding 0.5 to each upper class limit.
99.5-104.5 means 99.5 ≤ x < 104.5, i.e. [99.5, 104.5), a half-closed interval.
104.5-109.5 means 104.5 ≤ x < 109.5, i.e. [104.5, 109.5).
5. Tally the data.
6. Find the frequencies from the tallies.

The completed frequency distribution is given as:
Class limits | Class boundaries | Upper boundary | Absolute frequency | Relative frequency | Less than cumulative frequency | Lower boundary | More than cumulative frequency
100-104 | 99.5-104.5 | 104.5 | 2 | 0.04 | 2 | 99.5 | 50
105-109 | 104.5-109.5 | 109.5 | 8 | 0.16 | 10 | 104.5 | 48
110-114 | 109.5-114.5 | 114.5 | 18 | 0.36 | 28 | 109.5 | 40
115-119 | 114.5-119.5 | 119.5 | 13 | 0.26 | 41 | 114.5 | 22
120-124 | 119.5-124.5 | 124.5 | 7 | 0.14 | 48 | 119.5 | 9
125-129 | 124.5-129.5 | 129.5 | 1 | 0.02 | 49 | 124.5 | 2
130-134 | 129.5-134.5 | 134.5 | 1 | 0.02 | 50 | 129.5 | 1
Total | | | 50 | 1 | | |

* Note that the sum of the relative frequencies is always 1 or 100%. That is, $\sum \dfrac{f_i}{n} = 1$.
Proof: With k classes,
$\sum_{i=1}^{k} \dfrac{f_i}{n} = \dfrac{f_1}{n} + \dfrac{f_2}{n} + \dfrac{f_3}{n} + \cdots + \dfrac{f_k}{n} = \dfrac{f_1 + f_2 + f_3 + \cdots + f_k}{n}$.
Since $f_1 + f_2 + f_3 + \cdots + f_k = \sum f_i = n$, it follows that $\sum \dfrac{f_i}{n} = \dfrac{n}{n} = 1$.

c) Interpretation
31 (18 + 13) of the households earn a monthly income of birr 110-119.
62% of the households earn a monthly income of birr 110-119 (31/50 × 100%).
28 of the households earn a monthly income of less than birr 114.5.
40 of the households earn a monthly income of at least birr 109.5.

Note: One can construct several different but correct frequency distributions for the same data by using a different class width, a different number of classes or a different starting point.

The reasons for constructing a frequency distribution are:
a) To organize the data in a meaningful way.
b) To enable researchers to draw charts and graphs for the presentation of data.
c) To enable a reader to make comparisons among different data sets.

3.2.2. Graphic Method of Data Presentation
After the data have been organized into a frequency distribution, they can be presented in graphical form. Why graphs? Graphs are used to:
Convey the data to the viewers in pictorial/graphic form;
Get the audience's attention in a publication or a speaking presentation;
Discuss an issue, reinforce a critical point, or summarize a data set;
Make the data more understandable than when they are presented in tables and frequency distributions;
Discover a trend or pattern in a situation over a period of time.

The three most commonly used graphs in research are:
a) The histogram
b) The frequency polygon
c) The cumulative frequency graph or o-give (pronounced "o-jive")

a) The histogram: a graph that displays the data by using adjacent vertical rectangles (unless the frequency of a class is zero) of various heights to represent the frequencies of the classes. That is, in a histogram the class boundaries are marked on the horizontal axis and the class frequencies on the vertical axis.
N.B: The height of the adjacent rectangles of a histogram (along the y-axis) can represent either the absolute or the relative frequencies of the classes.
The tallest rectangle in a histogram is associated with the class having the greatest number of observations (the highest frequency).

Example-1: Construct a histogram given the following frequency distribution.
Class boundaries | Absolute frequency
99.5-104.5 | 2
104.5-109.5 | 8
109.5-114.5 | 18
114.5-119.5 | 13
119.5-124.5 | 7
124.5-129.5 | 1
129.5-134.5 | 1
Total | 50

Solution: Steps:
1) Draw the x-y axes.
2) Label the class boundaries on the x-axis and the frequencies on the y-axis.
3) Using the frequencies as the heights, draw a vertical bar for each class.

The class with the greatest number of data values (18) is 109.5-114.5. We would have reached the same conclusions, and the shape of the histogram would have been the same, had we used a relative frequency distribution instead of the absolute (actual) frequencies. The only difference is that the vertical axis would have been reported in percents (proportions) of households instead of the number of households.

b) The frequency polygon: The frequency polygon consists of line segments connecting the points formed by the intersection of the class marks with the class frequencies. Relative frequencies or percentages may also be used in constructing the figure. Empty classes are included at each end so that the curve will intersect the x-axis.

Using the frequency distribution given in Example 1 above, construct a frequency polygon.
Solution: Steps:
1. Find the class marks.
Class boundaries | Class mark | Frequency
99.5-104.5 | 102 | 2
104.5-109.5 | 107 | 8
109.5-114.5 | 112 | 18
114.5-119.5 | 117 | 13
119.5-124.5 | 122 | 7
124.5-129.5 | 127 | 1
129.5-134.5 | 132 | 1
2. Draw the x-y axes. Label the x-axis with the class marks and use a suitable scale on the y-axis for the frequencies (absolute or relative).
3. Connect the coordinates (x, y) with line segments.

c) The cumulative frequency graph (o-give): The o-give is a graph that displays cumulative values for frequencies, relative frequencies or percentages.
These values can be either "more than" or "less than".

Example: Construct an o-give for the frequency distribution given in Example 1 above.
Solution: Steps:
1. Find the cumulative frequency for each class.
Class boundaries | Less than cumulative frequency | Found by
99.5-104.5 | 2 | 2 + 0
104.5-109.5 | 10 | 2 + 8
109.5-114.5 | 28 | 2 + 8 + 18
114.5-119.5 | 41 | 2 + 8 + 18 + 13
119.5-124.5 | 48 | 2 + 8 + 18 + 13 + 7
124.5-129.5 | 49 | 2 + 8 + 18 + 13 + 7 + 1
129.5-134.5 | 50 | 2 + 8 + 18 + 13 + 7 + 1 + 1
2. Draw the x-y axes and label the x-axis with the class boundaries and the y-axis with the cumulative frequencies.
3. Plot the cumulative frequency at each upper class boundary. Upper class boundaries are used because the cumulative frequencies represent the number of data values accumulated up to the upper boundary of each class.

Cumulative frequency graphs (less than cumulative frequency) are used to show visually how many values lie below a certain upper class boundary. For example, to find how many households earn less than birr 114.50, we locate 114.5 on the x-axis, draw a vertical line up until it intersects the graph, and then draw a horizontal line from that point to the y-axis. The value is 28 households.

The "more than" cumulative frequency (more than the lower boundary)
Lower boundaries | CF | Found by
more than 99.5 | 50 | ∑fi
more than 104.5 | 48 | ∑fi - 2
more than 109.5 | 40 | ∑fi - 10
more than 114.5 | 22 | ∑fi - 28
more than 119.5 | 9 | ∑fi - 41
more than 124.5 | 2 | ∑fi - 48
more than 129.5 | 1 | ∑fi - 49

Note: The abscissa (x-value) of the point of intersection of the two o-give curves (less than and more than) gives the median of the given data.

3.2.3. Other Methods of data presentation
a) Line graphs
b) Bar charts
c) Pie-charts

a) Line graphs (charts): Line charts are particularly effective for business and economic data because they show the change or trend in a variable over time. Time series data are most effectively presented on a line chart.
The variable of interest, such as the number of units sold or the total value of sales, is scaled along the y-axis and time along the x-axis. Line graphs are widely used by investors to support decisions to buy and sell stocks and bonds in the financial market. The idea is to show a trend that is likely to continue into the future, and to use that pattern to make predictions for the immediate future.

Example: Given the following data on the unemployment rate of a country from 1992 to 2000:
Year | Unemployment rate
1992 | 14.80%
1993 | 13.70%
1994 | 11%
1995 | 10.20%
1996 | 11.30%
1997 | 12.40%
1998 | 13.50%
1999 | 14.60%
2000 | 15.70%

NB: Two or more series of data can be plotted on the same line chart. Thus a chart can show the trend of several different variables, which allows for a comparison of several series over the same period of time.

b) Bar charts: These are used when the horizontal axis deals with information that is qualitative or non-continuous in nature, e.g. gender, marital status, etc. When we represent data using bar charts, the bars are not joined together. All the bars must have equal width and the distance between the bars must be equal.
Example:
Education level | Earnings/year
High school diploma | 22,895.00
Bachelor degree | 40,478.00
Master's degree | 73,165.00

c) Pie-chart: This is useful for displaying a relative frequency distribution. A circle is divided proportionally to the relative frequencies and portions of the circle are allocated to the different groups.
Example: A sample of 200 athletes were asked to indicate their favorite type of running shoe. Draw a pie-chart based on the following data.
Type of shoe | Number of athletes | Relative frequency | Percent | Angle
Nike | 92 | 0.46 | 46% | 46% × 360° = 165.6°
Adidas | 49 | 0.245 | 24.50% | 24.5% × 360° = 88.2°
Reebok | 37 | 0.185 | 18.50% | 0.185 × 360° = 66.6°
Asics | 13 | 0.065 | 6.50% | 0.065 × 360° = 23.4°
Other | 9 | 0.045 | 4.50% | 0.045 × 360° = 16.2°
Total | 200 | 1 | 100% | 360°

Review Exercises
Multiple Choice Questions
1) To find the class mark
a. We have to divide the class interval in half
b. We have to find the average of the lower and upper class limits of a class
c. We have to divide the upper class limit in half
d. All are true
2) Which one of the following is not true?
a. The sum of the relative frequencies is always 1
b. A telephone interview is an example of a primary source of data
c. A face-to-face interview is less costly than a mail survey
d. The Internet is a secondary source of data
3) In a frequency distribution, the categories/classes must
a) Be mutually exclusive and exhaustive
b) Have at least 5 observations
c) Be of the same size
d) Contain open-ended classes
4) To determine the class interval (width)
a) Divide the class frequencies in half
b) Divide the class frequency by the number of observations
c) Find the difference between consecutive lower class limits or upper class limits
d) Count the number of observations in the class
5) The class frequency is
a) The number of observations in each class
b) The difference between consecutive lower class limits
c) Always contains at least 5 observations
d) Usually a multiple of the lower limit of the first class
6) A research organization is making a study of the selling price of personal computers (PCs). There are 45 PCs in the study. How many classes would you recommend? (Apply the $2^k$ rule).
a) 10 b) 20 c) 6 d) 3
7) To convert a frequency distribution to a relative frequency distribution
a) Find the difference between consecutive lower class limits
b) Divide the absolute frequency by the total number of observations
c) Divide the lower limit of the first class by the class interval
d) Multiply the class frequency by 100
8) Which one is not correct?
a) A pie-chart is important to show the trends or changes in a variable over time.
b) A line obtained by taking class marks on the y-axis and class limits/boundaries on the x-axis is called a frequency polygon.
c) A line chart is important to show the trends or changes in a variable over time.
d) None of the above
9) Data which are collected afresh and which are original in character are
a) Chronological data
b) Secondary data
c) Quantitative data
d) Primary data
10) The difference between a histogram and a bar chart is:
a) The midpoints are connected with a histogram but not with a bar chart
b) The bars must be next to each other on a histogram and separated in a bar chart
c) Cumulative frequencies are required in a bar chart
d) None of the above

Workout/Short answers/explanations
1) If the maximum and minimum heights of students in a class are 1.90 m and 1.40 m, and if it is desired to group the students into five classes based on their height, what will be the size of the class width?
2) Explain the terms primary data and secondary data. Give some illustrations.
3) Define the following concepts:
a. Frequency distribution
b. Frequency
c. Relative frequency
d. Less than cumulative frequency
e. More than cumulative frequency
f. Class mark
g. Class width
4) From a certain frequency distribution table, if the upper class boundary and the lower class limit of the 3rd class are 20.5 and 16 respectively, determine the class mark of the 3rd class.
5) Given the following frequency distribution, find the values of x, y, z, a, b and c.
Class limit | Absolute frequency | Cumulative frequency | Relative frequency
100-104 | 2 | 2 | 0.02
105-109 | 8 | 10 | 0.08
110-114 | x | y | z
115-119 | 10 | a | 0.1
120-124 | 20 | b | 0.2
125-129 | 15 | c | 0.15
130-134 | 20 | 100 | 0.2
6) The following table is a grouped frequency distribution of the money spent per visit by a random sample of 100 customers at a department store.
Amount spent (in birr) | Number of customers
3-7 | 10
8-12 | 30
13-17 | 35
18-22 | 20
23-27 | 5
Total | 100
i) State for each of the above classes
a. The class limits
b. The class boundaries
c. The class marks
d. The class width
ii) Construct
a) A histogram
b) The cumulative frequency distribution
c) The relative frequency as well as the relative cumulative frequency distribution
iii) If possible, find the number of customers who spent:
a) At most Birr 12.50
b) Birr 12.50 or more
c) Less than Birr 12.50
d) At least Birr 17.50
e) Exactly Birr 12
7) A distribution has a constant class width with 6 classes and 8 as the class mark of the second class.
a) If the class mark of the 4th class is 18, find
i) The class width
ii) The class limits and class boundaries of the distribution
8) The balance of payments (BOP) of Ethiopia over the years 1985-1990 was as follows.
Year | BOP (million)
1985 | -40
1986 | 0
1987 | 10
1988 | -5
1989 | 20
1990 | 30
Present the above data by an appropriate graph.

Chapter Four
Measures of Central Tendency

Chapter Objectives
When you have completed this chapter, you will be able to:
Calculate the arithmetic mean, the weighted mean, the geometric mean, the harmonic mean, the median and the mode for ungrouped and grouped data;
Explain the characteristics, properties and uses of each measure of central tendency;
Identify the position of the mean, median and mode for symmetric and skewed distributions;
Understand other measures of location (quartiles, deciles and percentiles).

4.1.
Introduction
In this chapter, we shall continue to develop methods to describe data by finding a typical single value that describes a set of data. We refer to this single value as a measure of central tendency. Measures of central tendency describe a distribution near its center. They indicate the middle, most likely or most frequent values. In other words, they tell us where the center of the distribution of the data is located.

The Summation Notation
Statistical formulae often require the addition of many values. Summation or sigma notation is a convenient and simple form of shorthand used to give a concise expression for the sum of the values of a variable.
In statistics, the symbol $\Sigma$ (Greek letter sigma) means to add or find the sum. For example, $\sum x$ means to add the numbers represented by the variable X. Thus, if X represents 5, 2, 8, 4 and 6, then $\sum x = 5 + 2 + 8 + 4 + 6 = 25$.
Sometimes a subscript notation is used, such as $\sum_{i=1}^{5} x_i$. This notation means to find the sum of the five numbers represented by X. It is read as follows: sum the values of $x_i$ from $x_1$ through $x_5$.
$\sum_{i=1}^{5} x_i = x_1 + x_2 + x_3 + x_4 + x_5$
In order to make formulas more general, variables can be used with the summation notation. For example, $\sum_{i=1}^{n} x_i$ means to sum the values of X from the 1st to the nth, where n can be any number.
Often an abbreviated form of the summation notation is used. For example, ΣX means to sum all the values of X. When only a subset of the values of X is to be summed, the full version is required. Thus, the sum of all elements of X except the first and the last would be indicated as $\sum_{i=2}^{n-1} x_i$, which is read as the sum of X with i going from 2 to n-1.
Some formulas require that each number be squared before the numbers are summed. This is indicated by $\sum_{i=1}^{n} x_i^2$. For example, $\sum_{i=1}^{5} x_i^2$ means to square each value before summing.
For the values 5, 2, 8, 4 and 6: $\sum_{i=1}^{5} x_i^2 = 5^2 + 2^2 + 8^2 + 4^2 + 6^2 = 25 + 4 + 64 + 16 + 36 = 145$
It is very important to note that it makes a big difference whether the numbers are squared first and then summed, or summed first and then squared. The symbol (ΣX)² indicates that the numbers should be summed first and then squared. For the present example, this equals (5 + 2 + 8 + 4 + 6)² = 25² = 625. This, of course, is quite different from 145.

Sometimes a formula requires that the sum of cross products be computed. For instance, given
X | Y
2 | 3
1 | 6
4 | 5
what is ΣXY? The sum of cross products is (2 × 3) + (1 × 6) + (4 × 5) = 32.

The notation $\sum (x - \bar{x})^2$ means perform the following steps:
1) Find the mean ($\bar{x} = \dfrac{\sum x}{n}$)
2) Subtract the mean from each value
3) Square the answers
4) Find the sum
Example: Find the value of $\sum (x - \bar{x})^2$ for the values 5, 2, 8, 4, 6. Here $\bar{x} = 25/5 = 5$.
$x$ | $x - \bar{x}$ | $(x - \bar{x})^2$
5 | 0 | 0
2 | -3 | 9
8 | 3 | 9
4 | -1 | 1
6 | 1 | 1
$\sum (x - \bar{x})^2 = 20$

Basic properties of summation notation:
1. Σ(X ± Y) = ΣX ± ΣY
Example:
X | Y
3 | 8
2 | 3
4 | 1
Σ(X + Y) = 11 + 5 + 5 = 21
ΣX = 3 + 2 + 4 = 9
ΣY = 8 + 3 + 1 = 12
ΣX + ΣY = 9 + 12 = 21
2. (ΣX)(ΣY) ≠ ΣXY
In the above example, (ΣX)(ΣY) = 9 × 12 = 108 and ΣXY = 3 × 8 + 2 × 3 + 4 × 1 = 34. Thus, 108 ≠ 34.
3. ΣX² ≠ (ΣX)²
In the above example, ΣX² = 9 + 4 + 16 = 29 and (ΣX)² = 9 × 9 = 81. Thus, 29 ≠ 81.
4. For any constant c, $\sum_{i=1}^{n} c = nc$ and $\sum_{i=1}^{n} c x_i = c \sum_{i=1}^{n} x_i$
Example: $\sum_{i=1}^{4} 5 = 5 + 5 + 5 + 5 = 4 \times 5 = 20$
To check $\sum_{i=1}^{5} 5x_i = 5 \sum_{i=1}^{5} x_i$, let $x_i$ = 5, 2, 4, 8, 6, so that $\sum_{i=1}^{5} x_i = 5 + 2 + 4 + 8 + 6 = 25$.
$5x_i$ = 5 × 5, 5 × 2, 5 × 4, 5 × 8, 5 × 6, so $\sum_{i=1}^{5} 5x_i = 25 + 10 + 20 + 40 + 30 = 125$
$5 \sum_{i=1}^{5} x_i = 5 \times 25 = 125$

Solved Exercises
Data
i | xi
1 | 1
2 | 2
3 | 3
4 | 4
1. Find
2. Find
Data
i | xi
1 | -1
2 | 3
3 | 7
and c, which is a constant, = 11
3. Find
4. Find
5. Find
Data
i | xi | yi
1 | 10 | 0
2 | 8 | 3
3 | 6 | 6
4 | 4 | 9
5 | 2 | 12
6. Find
7. Find
8. Find
9. Find

4.2. Types of measures of central tendency
1) Arithmetic Mean: The arithmetic mean is the sum of the data set values divided by the number of observations. The arithmetic mean, or average value of a variable, is the most important numerical measure of central tendency.
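The definition above, together with the summation steps used earlier in this section, can be checked numerically. What follows is a minimal Python sketch, assuming the five values 5, 2, 8, 4 and 6 from the summation examples; the variable names are illustrative only.

```python
# Values taken from the summation examples (assumed input).
values = [5, 2, 8, 4, 6]

n = len(values)                 # number of observations
total = sum(values)             # sum of x
mean = total / n                # arithmetic mean = (sum of x) / n

# Sum of squared deviations: subtract the mean, square, then add.
squared_deviations = [(x - mean) ** 2 for x in values]
sum_sq_dev = sum(squared_deviations)

# Summing first and then squaring gives a different quantity.
square_of_sum = total ** 2

print(total)          # 25
print(mean)           # 5.0
print(sum_sq_dev)     # 20.0
print(square_of_sum)  # 625
```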
For ungrouped data, the population mean (usually denoted by μ) is the sum of all the population values divided by the total number of population values:
$\mu = \dfrac{\sum_{i=1}^{N} X_i}{N}$
where N = number of elements in the population and μ = population mean.
The population mean applies when the data represent all of the items within the population.

For ungrouped data, the sample mean is the sum of all the sample values divided by the number of sample values:
$\bar{X} = \dfrac{\sum_{i=1}^{n} X_i}{n}$
where $\bar{X}$ = sample mean and n = number of elements in the sample (sample size).

Example: A sample of five executives received the following salaries (birr in thousands): 14.0, 15.0, 17.0, 16.0 and 15.0. Find the mean salary.
$\bar{X} = \dfrac{\sum X_i}{n} = \dfrac{14.0 + \cdots + 15.0}{5} = \dfrac{77}{5} = 15.4$
Therefore, the mean salary of the executives is Birr 15,400.00.

Properties of the arithmetic mean
a) The arithmetic mean is the most widely used measure of location/central tendency.
b) All the values are included in computing the mean.
c) A set of data has a unique mean.
d) Every set of quantitative data has a mean.
e) The mean is affected by unusually large or small data values, called outliers, and may not be the appropriate average to use in such situations.
f) We cannot determine a mean for open-ended data.
g) The sum of the deviations of each value from the mean is always zero: $\sum (x - \bar{x}) = 0$
Example: Given $x_i$ = 5, 2, 4, 8, 6, $\bar{x} = 5$
$\sum (x - \bar{x}) = (5 - 5) + (2 - 5) + (4 - 5) + (8 - 5) + (6 - 5) = 0 - 3 - 1 + 3 + 1 = 0$
Mathematically, since $\bar{x}$ is a constant,
$\sum (x - \bar{x}) = \sum x - n\bar{x} = \sum x - n \left( \dfrac{\sum x}{n} \right) = \sum x - \sum x = 0$
h) If $\bar{x}_1$ and $\bar{x}_2$ are the arithmetic means of $n_1$ and $n_2$ observations respectively, then the combined mean is
$\bar{x}_c = \dfrac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2}$ (this is the same as the weighted mean)
Example: The mean ages of 12 men and 10 women are 45 and 42 respectively. What is the combined mean age?
Solution: $\bar{x}_c = \dfrac{12 \times 45 + 10 \times 42}{12 + 10} = 43.6$
i) A short-cut formula can be used if the figures in the calculation have many digits.
First transform the observations ($x_i$'s) as $y_i = x_i - c$, where c is any chosen value near the center; then $\bar{x} = \bar{y} + c$.
j) The arithmetic mean is affected by both change of origin and change of scale. That is:
Given a mean for a set of data values, if we add or subtract a constant number c from all data values, the new mean will be the old mean plus or minus c (change of origin).
Given a mean for a set of data values, if we multiply all data values by a constant number c, then the new mean will be c times the old one (change of scale).
Example: The mean life of a certain brand of bulbs is 1030 hours.
a) If a new process adds 50 hours to the life of each bulb, what will be their mean life? (ans. 1080 hours)
b) If a recently developed method of production doubles the life of each bulb, what will happen to their mean life? (ans. 2060 hours)

Arithmetic mean for grouped data
The mean of a sample of data organized in a frequency distribution is computed by the following formula:
$\bar{X} = \dfrac{\sum_{i=1}^{k} f_i X_i}{\sum_{i=1}^{k} f_i}$
where $f_i$ = frequency of the i-th class, $X_i$ = class mark of the i-th class, and k = number of classes.
Example: Compute the arithmetic mean for the following grouped data:
Class boundaries | Class mark (Xi) | fi | fiXi
5.5-10.5 | 8 | 1 | 8
10.5-15.5 | 13 | 2 | 26
15.5-20.5 | 18 | 3 | 54
20.5-25.5 | 23 | 5 | 115
25.5-30.5 | 28 | 4 | 112
30.5-35.5 | 33 | 3 | 99
35.5-40.5 | 38 | 2 | 76
Total | | $\sum_{i=1}^{7} f_i = 20$ | $\sum_{i=1}^{7} f_i X_i = 490$
$\bar{X} = \dfrac{490}{20} = 24.5$

2) Weighted mean: This is a special case of the arithmetic mean. It occurs when there are several observations of the same value, as might happen if the data have been grouped into a frequency distribution. It is the mean of data values that have been weighted according to their relative importance. The formula for the weighted mean for a population or a sample is as follows:
$\mu_w$ or $\bar{X}_w = \dfrac{\sum w_i x_i}{\sum w_i}$
where $\mu_w$ is the population weighted mean, $\bar{X}_w$ is the sample weighted mean, $w_i$ is the weight assigned to the i-th data value, and $x_i$ is the i-th data value.
Examples:
i. During a one hour period on Saturday afternoon a waiter served fifty drinks.
She sold 5 drinks for birr 0.50, 15 for birr 0.75, 15 for birr 0.90, and 15 for birr 1.10. Compute the weighted mean price of the soft drinks.
$\bar{X}_w = \dfrac{5 \times 0.50 + 15(0.75 + 0.90 + 1.10)}{50} = 0.875$
ii. A student scored an A in Sophomore English (3 credit hours), a C in Psychology (3 credit hours), a B in Microeconomics-I (4 credit hours) and a D in Civics (2 credit hours). Assuming that A has 4 grade points, B has 3 grade points, C has 2 grade points and D has 1 grade point, calculate the grade point average (GPA).
Ans. The total grade points are 4 × 3 + 2 × 3 + 3 × 4 + 1 × 2 = 32 and the total credit hours are 3 + 3 + 4 + 2 = 12, so GPA = 32/12 = 2.67.

3) Geometric mean: The geometric mean (GM) of n positive numbers is defined as the nth root of their product. The formula is:
$GM = \sqrt[n]{X_1 X_2 X_3 \cdots X_n} = \sqrt[n]{\prod x_i}$, where $\prod$ denotes multiplication.
The geometric mean is useful in finding the average of percents, ratios, indexes, or growth rates. It has wide application in business and economics because we are often interested in finding the percentage changes in sales, revenues, profits, GDP, etc.
Examples:
a) The GM of 4 and 16 is $\sqrt{4 \times 16} = 8$
b) The GM of 1, 3, 9 is $\sqrt[3]{1 \times 3 \times 9} = 3$
c) The interest rates on three bonds were 5, 21, and 4 percent. The average interest rate is $GM = \sqrt[3]{5 \times 21 \times 4} = 7.49$
d) The returns on investment earned by a company for four successive years were 30%, 20%, -40% and 200%. What is the geometric rate of return on investment?
Solution: A 30% return means an additional gain on top of what we have (i.e. on top of 100%), so a 30% return is expressed as 1.3; -40% implies a reduction (1 - 0.4 = 0.6).
$GM = \sqrt[4]{(1.3)(1.2)(0.6)(3.0)} = 1.294$
The geometric mean rate of return is therefore 1.294 - 1 = 29.4%.

Another use of the geometric mean is to determine the percent increase in sales, production or other business or economic series from one time period to another:
$GM = \sqrt[n]{\dfrac{\text{value at end of period}}{\text{value at beginning of period}}} - 1$, where n = time gap/number of time periods.
Example: 1) The production of soaps for a soap factory increased from 755,000 in 1992 to 835,000 in 2000. What would be the rate of production increase?
Rate of production increase: $GM = \sqrt[8]{\dfrac{835{,}000}{755{,}000}} - 1 = 1.27\%$
2) The population of Ethiopia increased from 53,000,000 in 1980 to 73,000,000 in 2000. What is the average annual increase?
$GM = \sqrt[20]{\dfrac{73{,}000{,}000}{53{,}000{,}000}} - 1 = 0.016 = 1.6\%$
3) If a person receives a 20% raise after one year of service and a 10% raise after the second year of service, the average percentage raise is not 15% (that is, $\dfrac{20\% + 10\%}{2}$) but 14.89%, as shown below:
$GM = \sqrt{1.2 \times 1.1} = 1.1489$ or $GM = \sqrt{120 \times 110} = 114.89\%$
His salary is 120% of the previous year's salary at the end of the first year and 110% at the end of the second year. This is equivalent to an average raise of 14.89%, since 114.89% - 100% = 14.89%.
This answer can also be checked by assuming that the person earns Birr 10,000 to start with and receives two raises of 20% and 10%.
Raise 1 = 10,000 × 20% = Birr 2000
Raise 2 = 12,000 × 10% = Birr 1200
His total raise is Birr 3200. This total is equivalent to:
Birr 10,000 × 14.89% = Birr 1489
(Birr 10,000 + 1489) = 11,489 × 14.89% = Birr 1710.71
Total increase = Birr 1489 + Birr 1710.71 = Birr 3199.71 (almost equal to Birr 3200)
4) The price of a certain commodity in 1970 was 1.06 times that of 1969, and in 1971 it was 1.04 times that of 1970. In the next two years it was 1.10 and 1.23 times that of the respective preceding years. What is the average annual percentage increase in the given period?
$GM = \sqrt[4]{1.06 \times 1.04 \times 1.10 \times 1.23} = 1.105$
(1.105 - 1) × 100% = 10.5% (the average annual increase is 10.5%)

For grouped data, the geometric mean is calculated as:
$GM = \sqrt[n]{x_1^{f_1} \times x_2^{f_2} \times \cdots \times x_m^{f_m}}$
where $f_i$ is the frequency of the i-th class, $x_i$ is the class mark of the i-th class, m is the number of classes, and n is the total number of observations.
Example: 1) Find the geometric mean for the following grouped data on the percentage increase in salary of 16 employees of a company.
% increase in salary | Number of employees | Class mark
0-4 | 5 | 2
5-9 | 6 | 7
10-14 | 3 | 12
15-19 | 2 | 17
Solution: $GM = \sqrt[16]{2^5 \times 7^6 \times 12^3 \times 17^2} = 5.85\%$.
The geometric mean percentage increase in salary is 5.85%.

If n is a large number, computing the nth root of the product is tedious work. To facilitate the computation of the GM, we make use of logarithms. Starting from
$GM = \sqrt[n]{X_1 X_2 X_3 \cdots X_n} = \sqrt[n]{\prod x_i}$
and taking logarithms,
$\log GM = \log \sqrt[n]{X_1 X_2 X_3 \cdots X_n} = \dfrac{\log x_1 + \log x_2 + \cdots + \log x_n}{n} = \dfrac{1}{n} \sum_{i=1}^{n} \log x_i$
so that
$GM = \text{antilog} \left[ \dfrac{1}{n} \sum_{i=1}^{n} \log x_i \right]$

4) Harmonic Mean
The harmonic mean of n positive observations is defined as the number of values divided by the sum of the reciprocals of the values. That is,
$HM = \dfrac{n}{\dfrac{1}{x_1} + \dfrac{1}{x_2} + \cdots + \dfrac{1}{x_n}} = \dfrac{n}{\sum_{i=1}^{n} \dfrac{1}{x_i}}$
It is used for averaging rates of change, for example speeds.
Example: Find the HM of 60, 50 and 40.
$HM = \dfrac{3}{\dfrac{1}{60} + \dfrac{1}{50} + \dfrac{1}{40}} = 48.65$
Example: Suppose a person drove 100 km at 40 km/hr and returned driving at 50 km/hr. What is the average speed?
Solution: Speed = Distance / Time, so Time = Distance / Speed.
$t_1 = \dfrac{S}{V} = \dfrac{100 \text{ km}}{40 \text{ km/hr}} = 2.5$ hours to make the first trip
$t_2 = \dfrac{S}{V} = \dfrac{100 \text{ km}}{50 \text{ km/hr}} = 2$ hours to return
Total time = 2.5 hours + 2 hours = 4.5 hours
Total distance = 100 km + 100 km = 200 km
$V = \dfrac{S}{t} = \dfrac{200 \text{ km}}{4.5 \text{ hr}} = 44.44$ km/hr
Arithmetic mean (weighted by time) = $\dfrac{2.5 \times 40 + 2 \times 50}{4.5} = 44.44$ km/hr
This value can also be found by using the harmonic mean formula:
$HM = \dfrac{2}{\dfrac{1}{40} + \dfrac{1}{50}} = 44.44$ km/hr
Here, we do not use the simple arithmetic mean to find the average speed, because the man traveled equal distances at different speeds on the two trips. If, however, he had traveled for equal times on the two trips, the arithmetic mean would have been the correct average. If we want to use the arithmetic mean, we have to take the weights (here, the travel times) into account, as in the weighted mean shown above.

Harmonic mean for grouped data
$HM = \dfrac{n}{\dfrac{f_1}{x_1} + \dfrac{f_2}{x_2} + \cdots + \dfrac{f_m}{x_m}} = \dfrac{n}{\sum_{i=1}^{m} \dfrac{f_i}{x_i}}$
where $x_i$ is the class mark and $f_i$ the frequency of the i-th class, m is the number of classes, and n is the total number of observations.

Relationship between Arithmetic Mean, Geometric Mean and Harmonic Mean
For a set of data containing n positively valued observations, the following relationship always holds:
HM ≤ GM ≤ AM
The three means are equal if and only if all values in the set of data are equal.
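The three means and the inequality HM ≤ GM ≤ AM can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the helper names are hypothetical, the geometric mean is computed through logarithms as described above (natural logarithms are used here, which gives the same result as common logarithms), and all inputs are assumed to be positive.

```python
import math

def arithmetic_mean(xs):
    """Sum of the values divided by the number of values."""
    return sum(xs) / len(xs)

def geometric_mean(xs):
    """nth root of the product, computed as the antilog of the mean log."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def harmonic_mean(xs):
    """Number of values divided by the sum of their reciprocals."""
    return len(xs) / sum(1 / x for x in xs)

data = [60, 50, 40]            # values from the harmonic mean example
am = arithmetic_mean(data)
gm = geometric_mean(data)
hm = harmonic_mean(data)

print(round(am, 2))            # 50.0
print(round(hm, 2))            # 48.65
print(hm <= gm <= am)          # True

# Average speed over two equal distances driven at 40 km/hr and 50 km/hr.
print(round(harmonic_mean([40, 50]), 2))   # 44.44
```

Python's standard `statistics` module also provides `mean`, `geometric_mean` and `harmonic_mean`; the hand-written versions are used here only to mirror the formulas in the text.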
5) Median (MD)

The median of a set of values arranged in order of magnitude, i.e., in an array, is the middle value or the arithmetic mean of the two middle values. The median is the value of a variable that divides an array of items so that the number of items below it is equal to the number of items above it.

a) Median for Ungrouped Data

If the number of observations is odd, then

$MD = \left(\frac{n+1}{2}\right)^{th}$ observation

Example: Find the median of the following data set: 1, 5, 3, 9, 10, 12, 6

Solution: First array the data: 1, 3, 5, 6, 9, 10, 12; n = 7 (odd)

$MD = \left(\frac{n+1}{2}\right)^{th}$ observation $= \left(\frac{7+1}{2}\right)^{th}$ observation = 4th observation = 6

If the number of observations is even, then

$MD = \frac{\left(\frac{n}{2}\right)^{th}\text{ observation} + \left(\frac{n}{2}+1\right)^{th}\text{ observation}}{2}$

Example: Find the median of the following data set: 1, 5, 2, 9, 7, 10, 12, 13

Solution: First array the data: 1, 2, 5, 7, 9, 10, 12, 13; n = 8 (even)

$MD = \frac{\left(\frac{8}{2}\right)^{th}\text{ observation} + \left(\frac{8}{2}+1\right)^{th}\text{ observation}}{2} = \frac{4^{th}\text{ observation} + 5^{th}\text{ observation}}{2}$

The 4th observation is 7 and the 5th observation is 9, so $MD = \frac{7 + 9}{2} = 8$

b) Median for Grouped Data

For grouped data, the median is calculated by using the following formula:

$MD = L_{md} + \left(\frac{\frac{n}{2} - cf}{f}\right) \times i$

Where
$L_{md}$ is the lower class boundary/class limit of the median class
n is the total number of observations
cf is the cumulative frequency preceding the median class
i is the class interval/width
f is the frequency of the median class

Example: Find the median from the following frequency distribution.

    Class Limit    Frequency    Cumulative Frequency
    30-40          2            2
    40-50          18           20
    50-60          24           44
    60-70          20           64
    70-80          8            72
    80-90          3            75
    Total          75

Solution:
Steps:
a. Find the cumulative frequencies.
b. Find $n = \sum f_i = 75$ (odd).
c. Find the median class: $\left(\frac{75+1}{2}\right)^{th}$ observation = 38th observation.
d. In which class does the 38th observation fall? In the 3rd class, and thus the 3rd class is the median class.
e. Find the cumulative frequency preceding the median class. 20 in this case.
f. Find the class width. 10 in this case.
g. Find the frequency of the median class. 24 in this case.
Substituting these values into the formula:

$MD = 50 + \left(\frac{\frac{75}{2} - 20}{24}\right) \times 10 = 57.29$

Properties of Median
1. The data must be arranged in an array before the median is calculated.
2. There is a unique median for each data set.
3. Geometrically, the median divides the histogram or the cumulative frequency curve into two parts of equal area.
4. The median is unaffected by the magnitude of the extreme values.
5. It can be calculated for an open-ended frequency distribution if the median class does not lie in an open-ended class.

6) Mode (MO)

The mode is the value that appears most frequently in a data set. Equivalently, the mode of a distribution is the value around which the observations are most heavily concentrated, i.e., the value that occurs the greatest number of times.

Example: The examination scores for ten students are: 81, 93, 84, 75, 68, 87, 81, 75, 81 and 87. Because the score of 81 occurs three times, more often than any other score, it is the mode.

A data set may have
A. No mode at all, e.g. 1, 3, 9, 0, 7, 8
B. One mode (unimodal), e.g. 1, 3, 1, 7, 1, 9; the mode is 1
C. Two modes (bimodal), e.g. 7, 2, 4, 4, 7; the modes are 7 and 4
D. Many modes (multimodal), e.g. 1, 0, 0, 1, 3, 2, 2, 3, 7, 7, 4, 9; the modes are 1, 0, 3, 2 and 7

Mode of grouped data

The approximate modal value of grouped data is calculated by the following formula:

$\text{Mode} = L_0 + \frac{f - f_1}{(f - f_1) + (f - f_2)} \times i = L_0 + \frac{f - f_1}{2f - f_1 - f_2} \times i$

Where:
$L_0$ = lower class boundary of the modal class (i.e., the class with the highest frequency)
f = frequency of the modal class
$f_1$ = frequency of the class immediately preceding the modal class
$f_2$ = frequency of the class immediately following the modal class
i = class interval/width

Note: the data must be arranged in an array.
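The grouped-data formulas for the median and the mode can be sketched in Python as follows. This is an illustrative sketch rather than part of the original text: the function names and the `(lower, upper, frequency)` table layout are assumptions, the median class is located as the first class whose cumulative frequency reaches n/2, and a missing neighbour of the modal class is treated as having frequency 0.

```python
def grouped_median(classes):
    """classes: list of (lower_boundary, upper_boundary, frequency), ascending."""
    n = sum(f for _, _, f in classes)
    cf = 0  # cumulative frequency of the classes before the current one
    for lower, upper, f in classes:
        if cf + f >= n / 2:  # this class contains the (n/2)th item
            return lower + (n / 2 - cf) / f * (upper - lower)
        cf += f
    raise ValueError("empty frequency table")


def grouped_mode(classes):
    """Approximate mode: L0 + (f - f1) / (2f - f1 - f2) * i."""
    freqs = [f for _, _, f in classes]
    k = freqs.index(max(freqs))  # index of the modal class
    lower, upper, f = classes[k]
    f1 = freqs[k - 1] if k > 0 else 0               # preceding class (assumed 0 if none)
    f2 = freqs[k + 1] if k + 1 < len(freqs) else 0  # following class (assumed 0 if none)
    return lower + (f - f1) / (2 * f - f1 - f2) * (upper - lower)


# Frequency table from the grouped median example above.
table = [(30, 40, 2), (40, 50, 18), (50, 60, 24),
         (60, 70, 20), (70, 80, 8), (80, 90, 3)]

print(round(grouped_median(table), 2))  # 57.29
print(round(grouped_mode(table), 2))    # 56.0
```

The median agrees with the worked value of 57.29. The mode of 56 is simply what the mode formula gives when applied to the same table (modal class 50-60, with f = 24, f1 = 18, f2 = 20).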
Example: Find the mode of the following distribution:

    Class Limit    Frequency
    90-100         10
    100-110        37
    110-120        65
    120-130        80
    130-140        51
    140-150        35
    150-160        18
    160-170        4

Solution: The modal class is 120-130, so

$\text{Mode} = 120 + \frac{80 - 65}{2 \times 80 - 65 - 51} \times 10 = 120 + \frac{150}{44} = 123.41$

Properties of mode
- It is the easiest average to compute.
- It can be obtained for both qualitative and quantitative data.
- It is not affected by extreme values.
- The mode may not exist for a data set.
- It is not unique. A data set can have more than one mode.
- The mode is not based on all observations.

Distribution, shape and measures of central tendency

The relative values of the mean, median and mode depend strongly on the shape of the distribution of the data they describe. Data distributions may be described in terms of symmetry and skewness. In other words, data can be either symmetric or skewed, depending on how the values are distributed around the center.

Symmetrical (normal, bell-shaped) distribution: occurs when the data values are evenly distributed around the center. In a symmetrical distribution, the left and right sides of the distribution are mirror images of each other, and the values of the mean, median and mode are equal.

Skewed distribution: occurs when the data values are not evenly distributed around the center. Skewness refers to the tendency of the distribution to "tail off" to the right or left. Skewness is the lack of symmetry of a distribution.

Right (positively) skewed distribution: The mean is greater than the median, which in turn is greater than the mode. In such distributions, the median tends to be a better measure of central tendency than the mean. In a positively skewed distribution (when the majority of the data values fall to the left of the mean and cluster at the lower end of the distribution), the arithmetic mean is the largest of the three measures, because the mean is influenced by a few extremely high values more than the median or the mode.
Mode < Median < Mean

Left (negatively) skewed distribution: The mean is less than the median, which in turn is less than the mode. As with the positively skewed distribution, the median is less influenced by extreme values and tends to be a better measure of central tendency than the mean.

Mean < Median < Mode

4.3. Quartiles, Deciles and Percentiles

Descriptive measures that describe the position (place) of a value in a given data set or distribution are called positional averages. Measures that divide the data into a number of equal parts are called quantiles (fractiles). The most important of these are quartiles, deciles and percentiles. To obtain such measures, we first have to arrange the data in increasing order.

Quartiles

Quartiles divide the data into four equal parts. The jth quartile, denoted $Q_j$ where j = 1, 2, 3, is defined as

$Q_j = \left(\frac{j(n+1)}{4}\right)^{th}$ observation

Q1 gives the value where 25% of the observations lie below and 75% above it
Q2 gives the value where 50% of the observations lie below and 50% above it
Q3 gives the value where 75% of the observations lie below and 25% above it

Example: Find the quartiles (Q1, Q2 and Q3) of the following data: 8, 4, 8, 3, 4, 8, 5, 5, 10

Solution: Arrange first: 3, 4, 4, 5, 5, 8, 8, 8, 10

$Q_1 = \left(\frac{1(9+1)}{4}\right)^{th}$ item = (2.5)th item = 2nd item + 0.5(3rd item - 2nd item) = 4 + 0.5 * (4 - 4) = 4

$Q_2 = \left(\frac{2(9+1)}{4}\right)^{th}$ item = (5)th item = 5

$Q_3 = \left(\frac{3(9+1)}{4}\right)^{th}$ item = (7.5)th item = 7th item + 0.5(8th item - 7th item) = 8 + 0.5(8 - 8) = 8

For grouped data,

$Q_j = L_j + \left(\frac{\frac{j \times n}{4} - cf}{f_j}\right) \times w$

Where j = 1, 2, 3
$L_j$ = lower class boundary of the jth quartile class (the class which contains the $\left(\frac{j \times n}{4}\right)^{th}$ item)
w = class width
$f_j$ = frequency of the jth quartile class
n = total number of observations
cf = the cumulative frequency of the class preceding the jth quartile class

Example: Find Q1, Q2 and Q3 for the following frequency distribution.

    Class Boundaries    f    cf
    5.5-10.5            1    1
    10.5-15.5           2    3
    15.5-20.5           3    6
    20.5-25.5           5    11
    25.5-30.5           4    15
    30.5-35.5           3    18
    35.5-40.5           2    20

Q1: The $\left(\frac{n}{4}\right)^{th}$ item = $\left(\frac{20}{4}\right)^{th}$ item = 5th item is Q1, and it falls in the 3rd class, so 15.5-20.5 is the first quartile class.

$Q_1 = 15.5 + \left(\frac{\frac{1 \times 20}{4} - 3}{3}\right) \times 5 = 18.83$

Q2: The $\left(\frac{2n}{4}\right)^{th}$ item = $\left(\frac{40}{4}\right)^{th}$ item = 10th item is Q2, and it falls in the 4th class, so 20.5-25.5 is the second quartile class.

$Q_2 = 20.5 + \left(\frac{\frac{2 \times 20}{4} - 6}{5}\right) \times 5 = 20.5 + 4 = 24.5$ = median

Q3: The $\left(\frac{3n}{4}\right)^{th}$ item = $\left(\frac{60}{4}\right)^{th}$ item = 15th item is Q3, and it falls in the 5th class, so 25.5-30.5 is the third quartile class.

$Q_3 = 25.5 + \left(\frac{\frac{3 \times 20}{4} - 11}{4}\right) \times 5 = 25.5 + 5 = 30.5$

Deciles

Deciles are measures that divide a distribution/data set into ten equal parts. The jth decile for ungrouped data (a simple frequency distribution), denoted $D_j$ where j = 1, 2, 3, ..., 9, is defined as

$D_j = \left(\frac{j(n+1)}{10}\right)^{th}$ observation

D1 gives the value where 10% of the observations lie below and 90% above it
D2 gives the value where 20% of the observations lie below and 80% above it
D3 gives the value where 30% of the observations lie below and 70% above it
. .
D9 gives the value where 90% of the observations lie below and 10% above it

For grouped data,

$D_j = L_j + \left(\frac{\frac{j \times n}{10} - cf}{f_j}\right) \times w$

Where j = 1, 2, 3, 4, ..., 9
$L_j$ = lower class boundary of the jth decile class (the class which contains the $\left(\frac{j \times n}{10}\right)^{th}$ item)
w = class width
$f_j$ = frequency of the jth decile class
n = total number of observations
cf = the cumulative frequency of the class preceding the jth decile class

Percentiles

Percentiles divide a distribution/data set into 100 equal parts.
The jth percentile for ungrouped data (a simple frequency distribution), denoted $P_j$ where j = 1, 2, 3, ..., 99, is defined as

$P_j = \left(\frac{j(n+1)}{100}\right)^{th}$ observation

P1 gives the value where 1% of the observations lie below and 99% above it
P2 gives the value where 2% of the observations lie below and 98% above it
P3 gives the value where 3% of the observations lie below and 97% above it
. .
P99 gives the value where 99% of the observations lie below and 1% above it

For grouped data,

$P_j = L_j + \left(\frac{\frac{j \times n}{100} - cf}{f_j}\right) \times w$

Where j = 1, 2, 3, 4, ..., 99
$L_j$ = lower class boundary of the jth percentile class (the class which contains the $\left(\frac{j \times n}{100}\right)^{th}$ item)
w = class width
$f_j$ = frequency of the jth percentile class
n = total number of observations
cf = the cumulative frequency of the class preceding the jth percentile class

Observe that:
1. $Q_2 = D_5 = P_{50}$ = Median
2. $D_j = P_{10j}$, j = 1, 2, 3, 4, 5, 6, 7, 8, 9
3. $Q_j = P_{25j}$, j = 1, 2, 3

Review exercises

Choose the best answer

1. Which of the following measures of central tendency is affected most by extreme values (outliers)?
a. Median
b. Mean
c. Mode
d. Geometric Mean

2. In a set of observations, which measure of central tendency reports the value that occurs most often?
a. Mean
b. Median
c. Mode
d. Geometric Mean

3. The relationship between the geometric mean and the arithmetic mean is
a. They will always be the same
b. The geometric mean will always be larger
c. The geometric mean will be equal to or less than the mean
d. The mean will always be larger than the geometric mean

4. Suppose you compare the mean of raw data and the mean of the same raw data grouped into a frequency distribution. These two means will be
a. Exactly equal
b. The same as the median
c. The same as the geometric mean
d. Approximately equal

5. In a set of 10 observations the mean is 20 and the median is 15. There are 2 values that are 6, and all other values are different. What is the mode?
a. 15
b. 20
c. 6
d. None of the above

6.
Which of the measures of central tendency is the largest in a positively skewed distribution?
a. Mean
b. Mode
c. Median
d. Geometric Mean

7. The weighted mean is a special case of the
a. Mean
b. Mode
c. Median
d. Geometric Mean

Work out/explain the following questions

1. Show that the sum of the deviations of each value from the mean is always zero.

2. Show that $\sum (x - \bar{x})^2 = \sum x^2 - n\bar{x}^2$

3. Given the data values 5, 12, 8, 3, 4, find $\sum x$, $\sum x^2$, $(\sum x)^2$, $\sum (x - \bar{x})$ and $\sum (x - \bar{x})^2$.

4. Calculate the per-capita income (average income) from the following data.

    Salary (in Birr)    No. of Persons
    120.00              4
    400.00              4
    10,000.00           1
    50,000.00           1

5. A teacher assigns weights 4, 2 and 3 respectively to seminar work, class work and monthly tests of students. What is the average academic performance of a student scoring the following marks?

    Work            Marks (100%), x    Weight, w    wx
    Seminar         45                 4            180
    Class work      62                 2            124
    Monthly test    52                 3            156
    Total                              9            460

Weighted mean = 460/9 = 51.1

The weights show that, from the teacher's point of view, seminar work is twice as important as class work.

6. Two colleges show the following results. Which one is better on average?

    Category       College A              College B
    First year     70% (200 students)     80% (150 students)
    Second year    60% (150 students)     60% (100 students)
    Third year     80% (100 students)     80% (50 students)

7. In a class of 40 students, 10 have failed and their average mark is 30. The total mark secured by the entire class was 2400. Find the average mark of those who have passed.

8. The average salary of 20 individuals working in a small-scale industry was Birr 1000. Five qualified persons were then employed, which increased the average salary to Birr 1200. What was the mean salary of the newly employed employees?

9. A nation faces a rate of inflation of 2% in 1990, 5% in 1992, and 12.5% in 1993. Find the geometric mean of the inflation rates.

10. A firm pays 5/12 of its labour force an hourly wage of Birr 5, 1/3 of the labour force a wage of Birr 6 and 1/4 a wage of Birr 7.
What is the average wage paid by this firm?

11. In a certain examination, the average grade of all students in section A is 70 and that of students in section B is 75. If the average of both sections combined is 72, find the ratio of the number of students in section A to the number of students in section B.

12. The average weekly wage of workers in a certain firm is Birr 50. The mean wage of female workers is Birr 52 and that of male workers is Birr 42. What is the percentage of female workers and male workers in the firm?

13. A household purchased Birr 600 worth of teff for consumption in three equal purchases of Birr 200 each over a three-month period. The first pack of teff cost Birr 2.95/kg, the second Birr 3.10/kg and the third Birr 3.25/kg. What was the average price per kg paid for all the teff?

14. If sixty percent of the population of Ethiopia earns an average monthly income of Birr 1,000.00 and the remaining population earns Birr 2,000.00, calculate the average monthly income of the whole population of Ethiopia.

15. The mean of 200 items is 50. Later on it is discovered that two items were wrongly taken as 92 and 8 instead of 192 and 88. Find the correct mean.

16. Find the combined mean from the following data:

                       Series X    Series Y
    Arithmetic Mean    12          20
    No. of items       80          60

17. The mean age of all students in a class of 50 students is 17 years. If the mean age of 30 of them is 18 years, find the mean age of the remaining 20 students.

18. The mean mark obtained by 300 students is 56. The mean of the top 100 students was found to be 80 and the mean of the bottom 100 was found to be 22. What is the mean of the remaining 100 students?

19. The arithmetic mean of two observations is 10 and the geometric mean is 8. Find the values of the two items.

20. The Central Statistical Authority has calculated the per-capita income of one million individuals and found it to be Birr 1500.
Later, it is found that a person with an income of Birr 20,000.00 was not taken into account. Calculate the correct per-capita income, including this person's income in the computation.

21. The arithmetic mean of 20 observations is found to be 20. Later on, it was found that the sample values 5 and 15 were incorrect. The correct values are 9 and 12. Find the correct mean.

22. A student scored B, A, C and B in ECON 211, ACCT 201, MGMT 211 and FLEN 201, which have credit hours 4, 3, 1 and 3 respectively. Calculate the GPA of this student.

23. The average monthly salary of employees in a company was Birr 2,500.00. Recently, each employee was given an additional monthly salary of Birr 200.00. Calculate the new average monthly salary of the employees.

24. The mean age of 100 persons was found to be 30. Later, it was discovered that age 60 was misread as 40. Find the correct mean.

25. Out of the total population of Ethiopia, 60% earn a mean income of Birr 2,000.00 and the rest earn a mean income of Birr 5,000.00. Find the average income of the entire population.

26. The mean weight of 150 students in a certain class is 60 kg. The mean weight of boys is 70 kg and that of girls is 55 kg. Find the number of boys and girls.

27. A motor car covered a distance of 100 km four times: the first time at 50 km/hr, the second time at 40 km/hr, the third time at 45 km/hr and the fourth time at 30 km/hr. Calculate the average speed.

28. If the arithmetic mean of the following frequency distribution is 28, find the missing frequency.

    Class Limit    0-10    10-20    20-30    30-40    40-50    50-60
    Frequency      12      18       27       f1       17       6

29. If the median and mode are 25 and 24 respectively, find the missing frequencies and the arithmetic mean from the following frequency distribution.

    Class Limit    0-10    10-20    20-30    30-40    40-50    Total
    Frequency      14      f1       27       f2       15       105

30. For a sample of 50 stocks traded yesterday on the American Stock Exchange, 10 showed a decline of $1.00, 15 showed no change, and 25 increased by $2.00. Find the weighted mean.

31.
In the following grouped data, X is the class mark and C is a constant. If the arithmetic mean of the original distribution is 35.84, find the value of X corresponding to X - C = 0.

    X - C    -21    -14    -7    0     7     14    21
    f        2      12     19    29    20    13    5

32. If the class midpoints in a frequency distribution of the ages of a group of persons are 25, 32, 39, 46, 53 and 60, what are the class boundaries of the first class?

33. The following frequency distribution reports the number of students enrolled in each of the 50 sections of various courses taught in the College of Business last summer.

    Students          Frequency
    0 up to 10        3
    10 up to 20       8
    20 up to 30       16
    30 up to 40       10
    40 up to 50       9
    50 up to 60       4
    Total             50

a. Determine the mean number of students per section.
b. Determine the median number of students per section.

Chapter Five
Measures of Dispersion

Chapter Objectives: Dear reader, when you have completed this chapter, you will be able to:
- Compute and interpret the quartile deviation, the mean deviation, the variance and the standard deviation of ungrouped and grouped data.
- Explain the characteristics, uses, advantages and disadvantages of each measure of dispersion.
- Compute and interpret the inter-quartile range and its relative measure.
- Compute and interpret the relative measures of dispersion.
- Compute and interpret the Z-score.
- Understand and measure moments, skewness and kurtosis.

5.1. Types of Measures of Dispersion/Variation

Dispersion is the scatter or variation of items around a measure of central tendency. It measures the extent to which the values vary among themselves.

Example 5.1. Consider the following data on the expenditures of two groups of workers:

Group A: Br 6200, 2200, 1700, 1700, 1200 (the mean is Br 2400)
Group B: Br 1600, 1700, 1300, 4200, 3200 (the mean is Br 2400)

If we were given only the average expenditure of the two groups, without knowing the actual expenditures, we would simply conclude that the two groups spend identical amounts.
But the actual observations indicate that more variation is observed in group A. More generally, it is often difficult to say which set of data is better represented by its mean value unless we refer to dispersion. Two or more sets of sample data with the same mean (as in the previous example) may differ considerably in their degree of dispersion. For instance, the average income in a community is not an adequate indicator of the well-being of the community, since it does not show the inequality among the residents, whereas a measure of dispersion can show this inequality. Therefore, it is useful to have a measure of dispersion to observe the variability of data.

A measure of dispersion may be in absolute form or relative form. A measure is said to be in absolute form when it shows the actual amount of variation of an item from a measure of central tendency, while a relative measure is a quotient obtained by dividing the absolute measure by the quantity with respect to which the absolute deviation has been computed. Relative measures are unitless and are used to compare variability between different sets of data.

The following are some of the qualities of a good measure of dispersion.
- It should be based on all observations.
- It should be easily calculated.
- It should be easily understandable.
- It should be affected as little as possible by sampling fluctuations.
- It should be capable of further statistical treatment.

There are many types of measures of dispersion, as listed below:
1. Range
2. Quartile deviation
3. Mean deviation
4. Variance and standard deviation
5. Coefficient of variation

As stated above, when these measures express the magnitude of dispersion in the same unit of measurement in which the data are recorded, they are known as measures of absolute dispersion. When dispersion is expressed in percentages or ratios, these measures are called measures of relative dispersion.

1.
Range

Range is defined as the difference between the largest and the smallest observations in a given set of raw data. Obtaining the range from raw data thus requires identifying only these two extreme values and taking the difference between them.

Properties of range
- Only two values are used in its calculation.
- It is influenced by extreme values.
- It is easy to compute and understand.
- It is the crudest measure of dispersion.
- It cannot be determined for open-ended data.
- The greater the range, the higher the variability of the data, and vice versa.

Example 5.2. Find the range of the raw data given in example 5.1 above.

Solution:
For Group A:
The highest expenditure = 6200 Birr
The lowest expenditure = 1200 Birr
Range = highest value - lowest value = 6200 - 1200 = 5000 Birr

For Group B:
The highest expenditure = 4200 Birr
The lowest expenditure = 1300 Birr
Range = 4200 - 1300 = 2900 Birr

Therefore, in terms of expenditure, more variation is observed in group A.

Note that for discrete grouped data we use the same formula as given above, i.e., highest value minus lowest value.

Example 5.3. Compute the range of the following data.

Table 5.1. Results (out of 35%) of 20 students in Econometrics test.

    Xi    6    24    18    22    30    15
    Fi    3    2     5     1     4     5

Maximum value = 30 marks
Minimum value = 6 marks
Range = highest value - lowest value = 30 - 6 = 24

In the case of continuous grouped data, the range can be obtained in the following three ways:

i) In the first, the range is found by taking the difference between the upper class limit of the last class and the lower class limit of the first class. This is because the lowest and the highest observations are not identifiable in the case of continuous grouped data. That is,
Range = UCLL - LCLF
Where UCLL = upper class limit of the last class
LCLF = lower class limit of the first class

ii) In the second, the range is found by taking the difference between the upper class boundary of the last class and the lower class boundary of the first class.
That is,
Range = UCBL - LCBF
Where UCBL = upper class boundary of the last class
LCBF = lower class boundary of the first class

iii) In the third, the range is found by taking the difference between the mid-points of the last and the first class. This tends to yield a result closer to the actual range, as it reduces the margin of error incurred when the range is computed by the first or the second method.

Example 5.4. Compute the range of the data given below in table 5.2.

Table 5.2. Results (out of 35%) of 40 students in Econometrics test

    Score (35%)    Class Boundary    Number of Students (Fi)
    6-10           5.5-10.5          5
    11-15          10.5-15.5         10
    16-20          15.5-20.5         15
    21-25          20.5-25.5         7
    26-30          25.5-30.5         3

Solution:
Range = UCBL - LCBF = 30.5 - 5.5 = 25, or
Range = UCLL - LCLF = 30 - 6 = 24, or
it can be computed as the difference between the mid-point of the last class and the mid-point of the first class. That is, Range = 28 - 8 = 20

Note that the range in the above discussion is measured in absolute form. Such a measure cannot be used for comparing variabilities expressed in different units. Therefore, there is a need for a measure of relative dispersion/variation. The relative range, or coefficient of range, is defined as:

Coefficient of range $= \frac{\text{Range}}{\text{Sum of extreme values}} \times 100\% = \frac{\text{Highest value} - \text{Lowest value}}{\text{Highest value} + \text{Lowest value}} \times 100\%$ for raw data and discrete grouped data.

Coefficient of range $= \frac{UCB_L - LCB_F}{UCB_L + LCB_F} \times 100\%$ for continuous grouped data.

Example 5.5. Compute the coefficient of range for the following raw data: 2, 4, 6, 8, 16, 18, 20

Solution: Coefficient of range $= \frac{20 - 2}{20 + 2} \times 100\% = \frac{18}{22} \times 100\% = 81.8\%$

Example 5.6. Find the coefficient of range (relative range) for the data given in table 5.2.

Solution: UCBL = 30.5, LCBF = 5.5.
Coefficient of range $= \frac{30.5 - 5.5}{30.5 + 5.5} \times 100\% = \frac{25}{36} \times 100\% = 69.4\%$

Besides being simple to compute and understand, the range is as good a measure of dispersion as any other when the data consist of only a few observations, and it is advantageous when one wants to know only the extent of the extreme dispersion under "ordinary" conditions. However, its major drawbacks are that (i) it tells us nothing about the dispersion of the values that fall between the two extremes, (ii) it is highly sensitive to sample size, and (iii) it is strongly affected by any change in the values of the two extremes. Despite these and some other limitations, it is often used to express the degree of dispersion.

2. Quartile Deviations

Quartiles are the values which divide the array into four equal parts. Q1 gives the value of the item which is 1/4th of the way up the distribution, Q2 gives the value of the item which is half of the way up, and Q3 gives the value of the item which is 3/4th of the way up the distribution.

The inter-quartile range is the difference between Q3 and Q1. That is,

Inter-quartile range = Q3 - Q1

The quartile deviation, denoted QD, is defined as

$QD = \frac{Q_3 - Q_1}{2}$

The quartile deviation is also called the semi-interquartile (semi-quartile) range.

Example 5.7. Find the quartile deviation of the following data.

Table 5.3. Results (out of 35%) of 40 students in Econometrics test.
    Scores (35%)    Class Boundary    Frequency (fi)    Less-than cumulative frequency
    6-10            5.5-10.5          5                 5
    11-15           10.5-15.5         10                15 (Q1 class, as 1n/4 = 10th value)
    16-20           15.5-20.5         15                30 (Q3 class, as 3n/4 = 30th value)
    21-25           20.5-25.5         7                 37
    26-30           25.5-30.5         3                 40
    Total                             40

Solution: The ith quartile is computed as

$Q_i = L_{Q_i} + \left(\frac{\frac{i \times n}{4} - CF_{PQ_i}}{F_{Q_i}}\right) \times CW_{Q_i}$

Where:
n = sample size
$L_{Q_i}$ = lower class boundary of the quartile class
$CF_{PQ_i}$ = cumulative frequency of the class preceding the quartile class
$CW_{Q_i}$ = class width of the quartile class
$F_{Q_i}$ = frequency of the quartile class

$Q_1 = 10.5 + \left(\frac{\frac{1 \times 40}{4} - 5}{10}\right) \times 5 = 10.5 + \frac{25}{10} = 13$

$Q_3 = 15.5 + \left(\frac{\frac{3 \times 40}{4} - 15}{15}\right) \times 5 = 15.5 + 5 = 20.5$

Quartile deviation (semi-quartile range) $= \frac{Q_3 - Q_1}{2} = \frac{20.5 - 13}{2} = 3.75$

Note that the coefficient of quartile deviation, which provides a relative measure, is defined as

Coefficient of QD $= \frac{(Q_3 - Q_1)/2}{(Q_3 + Q_1)/2} \times 100\% = \frac{Q_3 - Q_1}{Q_3 + Q_1} \times 100\%$

Example 5.8. Compute the coefficient of quartile deviation for the data given in table 5.3.

Solution: Q3 = 20.5, Q1 = 13

Coefficient of QD $= \frac{Q_3 - Q_1}{Q_3 + Q_1} \times 100\% = \frac{20.5 - 13}{20.5 + 13} \times 100\% = \frac{7.5}{33.5} \times 100\% = 22.4\%$

Advantages of quartile deviation include:
- It is easy to compute and understand.
- It can be computed for open-ended classes, provided that Q3 and Q1 can be found.
- It is not affected by extreme values.

Disadvantages of quartile deviation include:
- It ignores the first 25% and the last 25% of the items.
- It is not capable of further mathematical manipulation.
- Its value is very much affected by sampling fluctuations.
- It does not show the scatter around an average, but only a distance on a scale.

3. Mean Deviation

The mean deviation, also called the average deviation, measures the average deviation (scatter) of a set of observations about a central value, usually the mean or the median of the distribution. It is computed by subtracting the mean/median from each individual observation, summing all the deviations while ignoring the negative signs, and dividing the sum by the total number of observations.
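The computation just described can be sketched in a few lines of Python. This is an illustrative sketch, not part of the original text: the function name `mean_deviation` is an assumption, and the data are the ten student ages used in Example 5.9 of this chapter.

```python
from statistics import mean, median


def mean_deviation(values, center):
    """Average of the absolute deviations of the values from a central value."""
    return sum(abs(x - center) for x in values) / len(values)


# Ages of the 10 students from Example 5.9.
ages = [18, 19, 19, 19, 20, 21, 21, 22, 23, 24]

md_from_mean = mean_deviation(ages, mean(ages))      # deviations taken from 20.6
md_from_median = mean_deviation(ages, median(ages))  # deviations taken from 20.5

print(round(md_from_mean, 2), round(md_from_median, 2))  # 1.6 1.6

# Without the absolute value, the deviations from the mean cancel out
# (up to floating-point rounding), which is why the sign is ignored.
signed_sum = sum(x - mean(ages) for x in ages)
print(abs(signed_sum) < 1e-9)  # True
```

Passing the central value as an argument keeps one function usable for both the deviation from the mean and the deviation from the median.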
The negative signs are ignored because otherwise the sum of the deviations from the mean, i.e., $\sum (X_i - \bar{X})$, would be zero.

The mean absolute deviation from the mean for a set of sample data consisting of n observations is computed as

MD from the mean $= \frac{\sum |X_i - \bar{X}|}{n}$

Similarly, the MD from the median is obtained as

MD from the median $= \frac{\sum |X_i - M_d|}{n}$

in the case of ungrouped data. In the case of grouped data it is obtained as

MD from the mean $= \frac{\sum f_i |X_i - \bar{X}|}{\sum f_i}$

MD from the median $= \frac{\sum f_i |X_i - M_d|}{\sum f_i}$

where the $X_i$'s are the class mid-points and $\sum f_i = n$.

Example 5.9. The ages of a sample of 10 students from a class are given below.
18, 19, 19, 19, 20, 21, 21, 22, 23, 24
Find the mean deviation (i) from the mean and (ii) from the median.

Solution:
Arithmetic mean $= \frac{\sum X_i}{n} = \frac{206}{10} = 20.6$

Median $= \frac{\left(\frac{n}{2}\right)^{th}\text{ value} + \left(\frac{n}{2}+1\right)^{th}\text{ value}}{2} = \frac{20 + 21}{2} = 20.5$

    Age      Absolute deviation from the mean    Absolute deviation from the median
    18       |18 - 20.6| = 2.6                   |18 - 20.5| = 2.5
    19       |19 - 20.6| = 1.6                   |19 - 20.5| = 1.5
    19       |19 - 20.6| = 1.6                   |19 - 20.5| = 1.5
    19       |19 - 20.6| = 1.6                   |19 - 20.5| = 1.5
    20       |20 - 20.6| = 0.6                   |20 - 20.5| = 0.5
    21       |21 - 20.6| = 0.4                   |21 - 20.5| = 0.5
    21       |21 - 20.6| = 0.4                   |21 - 20.5| = 0.5
    22       |22 - 20.6| = 1.4                   |22 - 20.5| = 1.5
    23       |23 - 20.6| = 2.4                   |23 - 20.5| = 2.5
    24       |24 - 20.6| = 3.4                   |24 - 20.5| = 3.5
    Total    16                                  16

Therefore,

MD from the mean $= \frac{\sum |X_i - \bar{X}|}{n} = \frac{16}{10} = 1.6$

MD from the median $= \frac{\sum |X_i - M_d|}{n} = \frac{16}{10} = 1.6$

Example 5.10. Find the mean absolute deviation from the mean and from the median for the data given in table 5.2.
Solution: First arrange the data as follows:

    Score (35%)    Fi    Class mark Xi    |Xi - mean|    Fi|Xi - mean|    |Xi - Md|    Fi|Xi - Md|
    6-10           5     8                9.125          45.625           9.167        45.835
    11-15          10    13               4.125          41.250           4.167        41.670
    16-20          15    18               0.875          13.125           0.833        12.495
    21-25          7     23               5.875          41.125           5.833        40.831
    26-30          3     28               10.875         32.625           10.833       32.499
    Total          40                                    173.75                        173.33

$\sum f_i X_i$ = (5 x 8) + (10 x 13) + (15 x 18) + (7 x 23) + (3 x 28) = 40 + 130 + 270 + 161 + 84 = 685

Mean $= \frac{\sum f_i X_i}{n} = \frac{685}{40} = 17.125$

Median $= L_{md} + \left(\frac{\frac{n}{2} - CF_{PMd}}{F_{Md}}\right) \times CW_{md} = 15.5 + \left(\frac{20 - 15}{15}\right) \times 5 = 17.167$

Therefore,

MD from the mean $= \frac{\sum f_i |X_i - \bar{X}|}{n} = \frac{173.75}{40} = 4.344$

MD from the median $= \frac{\sum f_i |X_i - M_d|}{n} = \frac{173.33}{40} = 4.333$

Note: The coefficients of mean deviation (relative measures) from the mean and from the median are given as follows:

(i) Coefficient of MD from the mean $= \frac{\text{MD from the mean}}{\text{mean}} \times 100\%$

(ii) Coefficient of MD from the median $= \frac{\text{MD from the median}}{\text{median}} \times 100\%$

Example 5.11. Compute the coefficient of mean deviation from the mean and from the median for the data given in example 5.10.

Solution:
MD from the mean = 4.344
MD from the median = 4.333
Mean = 17.125
Median = 17.167

Thus, coefficient of MD from the mean $= \frac{4.344}{17.125} \times 100\% = 25.37\%$

Coefficient of MD from the median $= \frac{4.333}{17.167} \times 100\% = 25.24\%$

Advantages of Mean Deviation
- It is easier to understand and compute than the standard deviation.
- It is not unduly influenced by large or small values.
- All values are used in its calculation.

Disadvantages of Mean Deviation
- It ignores the algebraic sign of the deviations.
- It is not suitable for further mathematical processing.

4. Variance and Standard Deviation

Like the other measures, the variance and the standard deviation quantify the dispersion of the observations around the mean value. The population variance is defined as the arithmetic mean of the squared deviations from the population mean.

Properties of Population variance
- All values are used in its calculation.
- The units are awkward: they are the square of the original units.
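The definition of variance as a mean of squared deviations translates directly into code. The following is a minimal Python sketch, not part of the original text: the function name `variance` and the `ddof` parameter (the amount subtracted from the number of observations in the divisor) are assumptions made for illustration, and the data sets are the ones used in Examples 5.12 and 5.13 of this chapter.

```python
import math


def variance(values, ddof=0):
    """Mean of squared deviations from the mean.

    ddof=0 divides by N (population variance);
    ddof=1 divides by n - 1 (sample variance for small samples).
    """
    n = len(values)
    center = sum(values) / n
    return sum((x - center) ** 2 for x in values) / (n - ddof)


# Population variance of the family ages from Example 5.12.
family_ages = [2, 18, 34, 42]
print(variance(family_ages))  # 236.0

# Sample variance and standard deviation of the data from Example 5.13.
sample = [10, 15, 30, 22, 41, 32]
s2 = variance(sample, ddof=1)
print(round(s2, 1), round(math.sqrt(s2), 2))  # 132.8 11.52
```

The standard deviation is the square root of the variance, which brings the measure back to the original units of the data.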
The formula for the population variance for raw data is:

$\sigma^2 = \frac{\sum (X_i - \mu)^2}{N}$

where:
$\mu$ = mean (population)
N = total number of observations

The sample variance is

$S^2 = \frac{\sum (X_i - \bar{X})^2}{n - 1}$

where:
n = sample size
$\bar{X}$ = sample mean

Alternatively, we can simplify it as follows:

$S^2 = \frac{\sum (X_i - \bar{X})^2}{n - 1} = \frac{\sum (X_i^2 - 2\bar{X}X_i + \bar{X}^2)}{n - 1} = \frac{\sum X_i^2 - 2\bar{X}\sum X_i + n\bar{X}^2}{n - 1} = \frac{\sum X_i^2 - 2n\bar{X}^2 + n\bar{X}^2}{n - 1} = \frac{\sum X_i^2 - n\bar{X}^2}{n - 1}$

$S^2 = \frac{n\sum X_i^2 - (\sum X_i)^2}{n(n - 1)}$ for a small sample size, and

$S^2 = \frac{n\sum X_i^2 - (\sum X_i)^2}{n^2}$ for a large sample.

Why n - 1? In a small sample, dividing by n - 1 provides a better estimate of the variance of the population from which the sample is drawn. However, as n increases above about 30, we can use n instead of n - 1, as the two versions give approximately the same result for practical purposes.

Example 5.12. The ages of the members of a family (in years) are: 2, 18, 34, 42. What is the population variance?

Solution:

$\mu = \frac{\sum X_i}{N} = \frac{96}{4} = 24$

$\sigma^2 = \frac{(2 - 24)^2 + (18 - 24)^2 + (34 - 24)^2 + (42 - 24)^2}{4} = \frac{944}{4} = 236$

The population standard deviation is the square root of the population variance:

$\sigma = \sqrt{\frac{\sum (X_i - \mu)^2}{N}}$

and the sample standard deviation is the square root of the sample variance:

$S = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n - 1}}$ for a small sample size, and

$S = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n}}$ for a large sample size.

Alternatively, for a small sample (fewer than about 30 observations),

$S = \sqrt{\frac{n\sum X_i^2 - (\sum X_i)^2}{n(n - 1)}}$

Example 5.13. From the sample data given below, compute the variance and standard deviation.
10, 15, 30, 22, 41, 32

Solution: n = 6

    Xi       Xi^2
    10       100
    15       225
    30       900
    22       484
    41       1681
    32       1024
    Total
    150      4414

So, $S^2 = \frac{n\sum X_i^2 - (\sum X_i)^2}{n(n - 1)} = \frac{6(4414) - (150)^2}{6(5)} = 132.8$

$S = \sqrt{S^2} = \sqrt{132.8} = 11.52$

Variance and standard deviation for grouped data

For grouped data, the population and sample variances, denoted by $\sigma^2$ and $S^2$ respectively, are given by:

$\sigma^2 = \frac{\sum f_i (X_i - \mu)^2}{\sum f_i} = \frac{\sum f_i \sum f_i X_i^2 - (\sum f_i X_i)^2}{(\sum f_i)^2}$

$S^2 = \frac{\sum f_i (X_i - \bar{X})^2}{n} = \frac{n\sum f_i X_i^2 - (\sum f_i X_i)^2}{n^2}$

in which the $X_i$'s are the class mid-points, $\sum f_i = N$ for the population and $\sum f_i = n$ for the sample.

Alternatively, for a small sample size we can use:

$S^2 = \frac{n\sum f_i X_i^2 - (\sum f_i X_i)^2}{n(n - 1)}$

By definition, the standard deviations in each case are the square roots of the respective variances.

Example 5.14.
From the continuous frequency distribution given in table 5.2, compute the sample variance and standard deviation.

Solution:

| Class limits (scores) | Class mark Xi | fi | fiXi | Xi - X̄ | (Xi - X̄)² | fi(Xi - X̄)² | Xi² | fiXi² |
|---|---|---|---|---|---|---|---|---|
| 6 - 10 | 8 | 5 | 40 | -9.125 | 83.26 | 416.328 | 64 | 320 |
| 11 - 15 | 13 | 10 | 130 | -4.125 | 17.016 | 170.16 | 169 | 1690 |
| 16 - 20 | 18 | 15 | 270 | 0.875 | 0.7656 | 11.48 | 324 | 4860 |
| 21 - 25 | 23 | 7 | 161 | 5.875 | 34.516 | 241.609 | 529 | 3703 |
| 26 - 30 | 28 | 3 | 84 | 10.875 | 118.26 | 354.80 | 784 | 2352 |
| Total | | 40 | 685 | | 253.82 | 1194.375 | | 12925 |

Here $\bar{X} = 685/40 = 17.125$. Therefore, for small sample size:

$$S^2 = \frac{\sum f_i (X_i - \bar{X})^2}{n-1} = \frac{1194.375}{40-1} = 30.625$$

$$S = \sqrt{S^2} = \sqrt{30.625} = 5.534$$

Alternatively,

$$S^2 = \frac{n\sum f_i X_i^2 - \left(\sum f_i X_i\right)^2}{n(n-1)} = \frac{40(12925) - (685)^2}{40(39)} = 30.625$$

$$S = \sqrt{30.625} = 5.534$$

Important properties of Variance / Standard Deviation

The following are some useful mathematical properties of variance and standard deviation:

1. The variance/standard deviation of any constant is always zero. A standard deviation of zero implies that there is no variation at all in the data set. In other words, the data values are all the same.
2. A variance/standard deviation can never be a negative number.
3. If a constant is added to or subtracted from each observation, the variance/standard deviation of the resulting observations will not be affected.
4. If every observation is multiplied by a constant K, then the new variance will be $K^2$ times the original variance and the new standard deviation will be $|K|$ times the original standard deviation.
5. If there are two sets of data consisting of $n_1$ and $n_2$ observations with $S_1^2$ and $S_2^2$ as their respective variances, the combined variance $S_C^2$ of the $(n_1 + n_2)$ observations is

$$S_C^2 = \frac{n_1\left(S_1^2 + d_1^2\right) + n_2\left(S_2^2 + d_2^2\right)}{n_1 + n_2}$$

where $d_1 = \bar{X}_1 - \bar{X}_C$ and $d_2 = \bar{X}_2 - \bar{X}_C$. Herein, the combined mean is

$$\bar{X}_C = \frac{n_1\bar{X}_1 + n_2\bar{X}_2}{n_1 + n_2}$$

In case $\bar{X}_1 = \bar{X}_2$,

$$S_C^2 = \frac{n_1 S_1^2 + n_2 S_2^2}{n_1 + n_2}$$

Further, when $n_1 = n_2$,

$$S_C^2 = \frac{S_1^2 + S_2^2}{2}$$

6. If Y represents a linear transformation of X as Y = a + bX, with a as the additive constant and b as the multiplicative constant, then the variance of Y is $S_Y^2 = b^2 S_X^2$, where $S_X^2$ is the variance of X.
It follows that the standard deviation of Y is $|b|S_X$, where $S_X$ is the standard deviation of X.

Example 5.15. Calculate the standard deviation of the combined group of 400 items from the following data.

Table 5.4.

| | Group A | Group B | Group C |
|---|---|---|---|
| Number of items (ni) | 50 | 150 | 200 |
| Mean (X̄i) | 40 | 50 | 60 |
| Variance (Si²) | 81 | 100 | 121 |

Solution:

$$\bar{X}_C = \frac{n_1\bar{X}_1 + n_2\bar{X}_2 + n_3\bar{X}_3}{n_1 + n_2 + n_3} = \frac{50(40) + 150(50) + 200(60)}{50 + 150 + 200} = 53.75$$

With $d_i = \bar{X}_i - \bar{X}_C$:

d1 = 40 - 53.75 = -13.75
d2 = 50 - 53.75 = -3.75
d3 = 60 - 53.75 = 6.25

Consequently, the combined variance is given as

$$S_C^2 = \frac{n_1\left(S_1^2 + d_1^2\right) + n_2\left(S_2^2 + d_2^2\right) + n_3\left(S_3^2 + d_3^2\right)}{n_1 + n_2 + n_3} = \frac{50\left(81 + 13.75^2\right) + 150\left(100 + 3.75^2\right) + 200\left(121 + 6.25^2\right)}{400} = \frac{13503 + 17109 + 32012}{400} = 156.56$$

$$S_C = \sqrt{156.56} = 12.512$$

5. Coefficient of Variation

The coefficient of variation, developed by Karl Pearson (1857 - 1936), is a relative measure of dispersion. It is very useful when either the data are in different units or the data are in the same units but the means are far apart. It is defined as the ratio of the standard deviation to the arithmetic mean (where the mean is different from zero), expressed as a percentage:

$$CV = \frac{\text{Standard deviation}}{\text{Mean}} \times 100\%$$

For a population, $CV = \frac{\sigma}{\mu} \times 100\%$, while for a sample it is obtained as $CV = \frac{S}{\bar{X}} \times 100\%$.

The coefficient of variation (CV) helps us compare the variability, heterogeneity/homogeneity, uniformity, and consistency of two or more distributions. A series/distribution with a smaller coefficient of variation is said to be more homogeneous/uniform/consistent than the other distribution. A series/distribution with a larger CV is said to be more variable or more heterogeneous than the other distribution.

Example 5.16. The number of employees, the average wages and the variance of the wages for two factories are given below.

Table 5.5. Summary of wages and employees of two factories.

| | Factory A | Factory B |
|---|---|---|
| Number of employees | 50 | 100 |
| Average wages | 120 | 85 |
| Variance of the wages | 9 | 16 |

Which factory is more consistent with respect to the wages of employees?
Solution:

Factory A. Given: $n_A = 50$, $\bar{X}_A = 120$, $S_A^2 = 9$.

$$CV_A = \frac{S_A}{\bar{X}_A} \times 100\% = \frac{3}{120} \times 100\% = 2.5\%$$

Factory B. Given: $n_B = 100$, $\bar{X}_B = 85$, $S_B^2 = 16$.

$$CV_B = \frac{S_B}{\bar{X}_B} \times 100\% = \frac{4}{85} \times 100\% = 4.7\%$$

Conclusion: $CV_A < CV_B$, so the wages of employees of factory A are more consistent than those of factory B.

Interpretation of Standard Deviation

Theorem (GAUSSIAN RULE). If the data in a sample are approximately normally distributed, then
a. $\bar{X} \pm S$ includes approximately 68% of the data.
b. $\bar{X} \pm 2S$ includes approximately 95% of the data.
c. $\bar{X} \pm 3S$ includes approximately 99.7% (practically all) of the data.

Standard Scores (Z-Scores)

The Z-score indicates the number of standard deviations that an observation lies below or above the mean, depending on whether the Z-score is negative or positive. Z is called the standard value and is given by

$$Z = \frac{X_i - \bar{X}}{S}$$

Example 5.17. Helen scored 65 in Auditing and Samuel scored 70 in Auditing. If the average score of all the students in Auditing is 67 and the standard deviation equals 3, which student performs better?

Solution:

$$Z_{Helen} = \frac{X_{Helen} - \bar{X}}{S} = \frac{65 - 67}{3} = -0.67$$

$$Z_{Samuel} = \frac{X_{Samuel} - \bar{X}}{S} = \frac{70 - 67}{3} = 1$$

Therefore, Samuel performs better in Auditing than Helen, and better than the average result of all the students.

Exercise: In a sample, 100 students doing a master's program in management were tested in a general knowledge paper carrying 100 marks. At the end of the exercise, they were found distributed according to marks obtained as follows:

| Marks obtained | 30-34 | 35-39 | 40-44 | 45-49 | 50-54 | 55-59 | 60-64 |
|---|---|---|---|---|---|---|---|
| Number of students | 5 | 8 | 12 | 20 | 27 | 20 | 8 |

Find
a) The range of the distribution,
b) Quartile deviation,
c) Mean absolute deviation from the mean,
d) Variance and standard deviation, and
e) Coefficient of variation.

Answer: a) using class limits = 34 / using mid-points = 30 b) QD = 5.375 c) MD = 6.46 d) S² = 61.24 and S = 7.82 e) CV = 15.8%

5.2. Moments, Skewness, and Kurtosis

In this section, we will deal with two other important characteristics of a frequency distribution.
One refers to lack of symmetry in the distribution, or its departure from being bell-shaped. The other relates to the degree of flatness or peakedness of a distribution at its top. The former is described as skewness and the latter as kurtosis.

5.2.1. Moments

Moments give us information about the "shape" of the distribution. The rth moment is represented by $M_r$, r = 0, 1, 2, .... We can have moments about any constant number: about the mean, zero or any desired value. In general, the rth moment about any arbitrary constant number, say A, is given by

$$M_r = \frac{\sum (X_i - A)^r}{n}$$

Example 5.18. Consider the following data and compute the first four moments about five (5): 2, 2, 3, 4, 4, 5, 6, 7, 8

Solution: A = 5, n = 9

| Xi | Xi - 5 | (Xi - 5)² | (Xi - 5)³ | (Xi - 5)⁴ |
|---|---|---|---|---|
| 2 | -3 | 9 | -27 | 81 |
| 2 | -3 | 9 | -27 | 81 |
| 3 | -2 | 4 | -8 | 16 |
| 4 | -1 | 1 | -1 | 1 |
| 4 | -1 | 1 | -1 | 1 |
| 5 | 0 | 0 | 0 | 0 |
| 6 | 1 | 1 | 1 | 1 |
| 7 | 2 | 4 | 8 | 16 |
| 8 | 3 | 9 | 27 | 81 |
| Total | -4 | 38 | -28 | 278 |

Using $M_r = \frac{\sum (X_i - 5)^r}{n}$:

$$M_0 = \frac{\sum (X_i - 5)^0}{9} = \frac{9}{9} = 1, \quad M_1 = \frac{\sum (X_i - 5)}{9} = \frac{-4}{9}, \quad M_2 = \frac{\sum (X_i - 5)^2}{9} = \frac{38}{9}$$

$$M_3 = \frac{\sum (X_i - 5)^3}{9} = \frac{-28}{9}, \quad M_4 = \frac{\sum (X_i - 5)^4}{9} = \frac{278}{9}$$

Note: For grouped data the rth moment about any constant number, say A, is given as:

$$M_r = \frac{\sum f_i (X_i - A)^r}{\sum f_i}$$

where $f_i$ is the frequency of $X_i$ in case of discrete grouped data, or the frequency of the ith class in case of continuous grouped data, in which case $X_i$ is the class mark of the ith class.

Note: $M_0$ is always equal to 1.

Example 5.19. Find the first three moments about 4 for the data given in table 5.6.

Table 5.6 Number of children in ten families

| Xi | 2 | 3 | 4 | 5 |
|---|---|---|---|---|
| fi | 3 | 2 | 3 | 2 |

Solution:

| Xi | fi | Xi - 4 | fi(Xi - 4) | (Xi - 4)² | fi(Xi - 4)² | (Xi - 4)³ | fi(Xi - 4)³ |
|---|---|---|---|---|---|---|---|
| 2 | 3 | -2 | -6 | 4 | 12 | -8 | -24 |
| 3 | 2 | -1 | -2 | 1 | 2 | -1 | -2 |
| 4 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 2 | 1 | 2 | 1 | 2 | 1 | 2 |
| Total | 10 | | -6 | | 16 | | -24 |

$$M_0 = \frac{\sum f_i}{\sum f_i} = \frac{10}{10} = 1, \quad M_1 = \frac{-6}{10} = -0.6, \quad M_2 = \frac{16}{10} = 1.6, \quad M_3 = \frac{-24}{10} = -2.4$$

Central Moments (Moments about the mean)

The rth central moment for ungrouped data is given by the formula

$$M_r = \frac{\sum (X_i - \mu)^r}{N}$$

for the population with N observations and mean $\mu$, and

$$M_r = \frac{\sum (X_i - \bar{X})^r}{n}$$

for sample data with sample size n and mean $\bar{X}$.
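The moment formulas above translate directly into a few lines of code. The following is a minimal Python sketch, not part of the original module: the function names `moment_about` and `central_moment` are hypothetical helpers introduced only for illustration, and the data are those of Examples 5.18 and 5.19.

```python
def moment_about(values, r, a=0.0, freqs=None):
    """Return the r-th moment of `values` about the constant `a`.

    `freqs` optionally holds the frequency of each value (grouped data);
    when it is omitted every value is counted once (ungrouped data).
    """
    if freqs is None:
        freqs = [1] * len(values)
    total = sum(freqs)
    return sum(f * (x - a) ** r for x, f in zip(values, freqs)) / total


def central_moment(values, r, freqs=None):
    """Return the r-th moment about the arithmetic mean."""
    mean = moment_about(values, 1, 0.0, freqs)
    return moment_about(values, r, mean, freqs)


# Example 5.18: ungrouped data, moments about A = 5.
data = [2, 2, 3, 4, 4, 5, 6, 7, 8]
raw = [moment_about(data, r, a=5) for r in range(5)]
# raw corresponds to 1, -4/9, 38/9, -28/9, 278/9

# Example 5.19: discrete grouped data, moments about A = 4.
children = [2, 3, 4, 5]
families = [3, 2, 3, 2]
grouped = [moment_about(children, r, a=4, freqs=families) for r in range(4)]
# grouped corresponds to 1, -0.6, 1.6, -2.4
```

Calling `central_moment(data, 1)` returns zero up to rounding error, since the first moment about the mean is always 0, and `central_moment(data, 2)` gives the variance computed with divisor n.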
Similarly, for grouped data the central moment is defined as:

$$M_r = \frac{\sum f_i (X_i - \mu)^r}{\sum f_i}$$

for the population, and

$$M_r = \frac{\sum f_i (X_i - \bar{X})^r}{\sum f_i}$$

for sample data, where
$\sum f_i = N$ for the population,
$\sum f_i = n$ for the sample,
$X_i$ = class mark of the ith class in case of continuous grouped data,
$f_i$ = frequency of $X_i$ in case of discrete grouped data, and frequency of the ith class in case of continuous grouped data.

Example 5.20. Find the first three central moments for the population data given by: X = 2, 3, 7

Solution:

$$\mu = \frac{\sum X_i}{N} = \frac{2 + 3 + 7}{3} = \frac{12}{3} = 4$$

$$M_0 = 1, \quad M_1 = \frac{(2-4) + (3-4) + (7-4)}{3} = 0$$

$$M_2 = \frac{(2-4)^2 + (3-4)^2 + (7-4)^2}{3} = \frac{14}{3} = 4.67, \quad M_3 = \frac{(2-4)^3 + (3-4)^3 + (7-4)^3}{3} = \frac{18}{3} = 6$$

Note: For central moments
$M_0 = 1$
$M_1 = 0$
$M_2 = \sigma^2$ (variance of X)
$M_2$, $M_3$ and $M_4$ help us to measure skewness and kurtosis.

The moment about the origin (i.e., A = 0) is given by:

$$M_r = \frac{\sum X_i^r}{n}$$

Example 5.21. Compute the first four moments about the mean for the following sample data (discrete frequency distribution).

Table 5.7

| Xi | -3 | 1 | 2 | 3 | 5 |
|---|---|---|---|---|---|
| fi | 2 | 1 | 4 | 2 | 3 |

Solution: $n = \sum f_i = 12$ and $\bar{X} = \frac{\sum f_i X_i}{n} = \frac{24}{12} = 2$

| Xi | fi | Xi - X̄ | fi(Xi - X̄) | (Xi - X̄)² | fi(Xi - X̄)² | (Xi - X̄)³ | fi(Xi - X̄)³ | (Xi - X̄)⁴ | fi(Xi - X̄)⁴ |
|---|---|---|---|---|---|---|---|---|---|
| -3 | 2 | -5 | -10 | 25 | 50 | -125 | -250 | 625 | 1250 |
| 1 | 1 | -1 | -1 | 1 | 1 | -1 | -1 | 1 | 1 |
| 2 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 2 | 1 | 2 | 1 | 2 | 1 | 2 | 1 | 2 |
| 5 | 3 | 3 | 9 | 9 | 27 | 27 | 81 | 81 | 243 |
| Total | 12 | | 0 | | 80 | | -168 | | 1496 |

$M_0 = 1$
$M_1 = 0$
$M_2 = \frac{80}{12} = 6.6667$
$M_3 = \frac{-168}{12} = -14$
$M_4 = \frac{1496}{12} = 124.67$

5.2.2. Skewness

Skewness refers to lack of symmetry. We study skewness to get an idea about the shape of the curve which we can draw with the help of the frequency distribution. Frequency distributions are often found to be skewed on either side of the central value. As a result, the distribution has a longer tail either to the left or to the right. When there is a longer tail to the right of the center, the distribution is said to be positively skewed. If the tail is longer to the left of the center, the distribution is said to be negatively skewed. A positive skewness means a greater dispersal of individual observations towards the right of the central value. A negative skewness, on the other hand, implies that individual observations have greater dispersal towards the left of the central value.
Skewness, therefore, not only refers to the lack of symmetry in a distribution; it also shows the direction of dispersion of individual observations on either side of the center of the distribution. Accordingly, a measure of skewness quantifies the extent of departure from symmetry and also indicates the direction in which the departure takes place.

[Figure: shapes of frequency curves: a) positively skewed, b) symmetrical distribution, c) negatively skewed]

Of the measures of skewness, two shall be discussed here:
a) Moment coefficient of skewness
b) Pearsonian coefficient of skewness

a) Moment coefficient of Skewness

In terms of the moment coefficient, skewness is defined as:

$$\alpha_3 = \frac{M_3}{M_2^{3/2}} = \frac{M_3}{S^3}$$

where $M_2 = S^2$ = variance and $M_2$, $M_3$ are central moments.

Interpretation:
(1) If $\alpha_3 = 0$ => symmetrical distribution
(2) If $\alpha_3 < 0$ => negatively skewed distribution
(3) If $\alpha_3 > 0$ => positively skewed distribution
(4) A greater or smaller absolute value of $\alpha_3$ means a greater or smaller degree of skewness.

Example 5.22. Find the skewness of the distribution given in example 5.18.

Solution: For the data 2, 2, 3, 4, 4, 5, 6, 7, 8 we have n = 9 and $\bar{X} = 41/9 = 4.556$, so that

$$M_2 = \frac{\sum (X_i - \bar{X})^2}{9} = 4.025, \quad M_3 = \frac{\sum (X_i - \bar{X})^3}{9} = 2.343$$

$$\alpha_3 = \frac{M_3}{M_2^{3/2}} = \frac{2.343}{(4.025)^{3/2}} = 0.29$$

Thus $\alpha_3 = 0.29 > 0$; therefore the distribution is positively skewed.

b) Pearsonian coefficient of Skewness

The Pearsonian coefficient of skewness was developed by Karl Pearson. This measure is based on the fact that when a distribution drifts away from symmetry, its mean, median, and mode tend to deviate from each other. This results from the presence of exceptionally high or low observations, which affect the value of the mean the most and that of the mode the least. The value of the mean tends to be the highest and that of the mode the lowest when some observations in a given set of data are exceptionally high. Consequently, a distribution having exceptionally high observations has a longer tail towards the right. Conversely, the mean tends to be the lowest, and the mode the highest, when a set of data contains some exceptionally low observations. As a result, the distribution will have a longer tail towards the left.
Thus, it is the direction in which the mean drifts from the mode that determines whether a distribution will have positive or negative skewness. Using this conclusion, the Pearsonian coefficient of skewness, denoted as $S_k$, is defined as

$$S_k = \frac{\text{Mean} - \text{Mode}}{S}$$

in which S is the standard deviation. Using the empirical relationship among mean, mode and median in a moderately skewed distribution, i.e., mode = mean - 3(mean - median), the above equation can be modified as

$$S_k = \frac{3(\text{Mean} - \text{Median})}{S}$$

Note:
1. $-3 \le S_k \le 3$
2. If $S_k = 0$, the distribution is symmetrical
3. If $S_k > 0$, the distribution is positively skewed
4. If $S_k < 0$, the distribution is negatively skewed

Example 5.23. Find the skewness of the following data using the Pearsonian coefficient of skewness.

Solution: Arrange the data in increasing order: 1, 2, 4, 5, 6, 7, 8, 10, 30, 32

Median = 6.5
Mean $\bar{X}$ = 10.5
$S^2$ = 124.06
$S$ = 11.14

Therefore,

$$S_k = \frac{3(\text{Mean} - \text{Median})}{S} = \frac{3(10.5 - 6.5)}{11.14} = 1.077$$

Interpretation: The distribution is positively skewed.

5.2.3. Kurtosis

Another attribute of a frequency distribution is its peakedness, or flatness, at its top. A distribution may have a smaller or greater degree of flatness at its top. Thus, it is the characteristic of flatness or peakedness at the top of the distribution that kurtosis describes and measures. Taking the normal distribution as a frame of reference, a distribution which is more peaked than the normal, as in (a) below, is known as a leptokurtic distribution. One whose polygon is flat at its top, as in (c) below, is called a platykurtic distribution. A distribution with a polygon which is neither too high in peak nor too flat at the top, as in (b), is termed a mesokurtic distribution.

[Figure: a. Leptokurtic, b. Mesokurtic, c. Platykurtic]

We have two measures of kurtosis:
(i) The coefficient of Kurtosis
(ii) Moment coefficient of Kurtosis

(i) The coefficient of Kurtosis

The coefficient of kurtosis, denoted by K, is defined as the ratio of the inter-quartile range to the inter-decile range.
$$K = \frac{Q_3 - Q_1}{D_9 - D_1}$$

Interpretation:
If K = 0.5 approximately, the distribution is mesokurtic.
If K > 0.5 approximately, the distribution is leptokurtic.
If K < 0.5 approximately, the distribution is platykurtic.

(ii) Moment coefficient of Kurtosis

The moment coefficient of kurtosis expresses kurtosis in terms of the fourth moment about the mean. It is denoted by $\beta_2$ and is defined as

$$\beta_2 = \frac{M_4}{M_2^2} = \frac{M_4}{S^4}$$

where S is the standard deviation.

Interpretation:
If $\beta_2 = 3$ => mesokurtic distribution
If $\beta_2 > 3$ => leptokurtic distribution
If $\beta_2 < 3$ => platykurtic distribution

Review Exercises

1. Which of the following is not a measure of dispersion?
a) Range
b) Standard deviation
c) Variance
d) Harmonic mean

2. A disadvantage of the range is
a) Only two values are used in its calculation
b) It is in different units than the mean
c) It is easy to calculate
d) All of the above

3. The standard deviation is
a) Based on squared deviations from the mean
b) In the same units as the mean
c) Uses all the observations in its calculation
d) All of the above

4. The variance is
a) Found by dividing the mean deviation by N
b) In the same units as the original data
c) Found by squaring the standard deviation
d) All of the above

5. In a positively skewed distribution
a) The mean, median, and mode are all equal
b) The mean is larger than the median
c) The median is larger than the mean
d) The standard deviation must be larger than the mean or the median

6. In a symmetric distribution
a) The mean, median, and mode are equal
b) The mean is the largest measure of location
c) The median is the largest measure of location
d) The standard deviation is the largest value

7. A coefficient of skewness of -2.73 was computed for a set of data. We conclude that
a) The mean is larger than the median
b) The median is larger than the mean
c) The standard deviation is a negative number
d) Something is wrong as the coefficient of skewness can't be less than -1.00

8. Which of the following statements is true regarding the standard deviation?
a) It cannot assume a negative value
b) If it is zero, then all the data values are the same
c) It is in the same units as the mean
d) All of the above are correct

9. The standard deviation of a normal distribution is found to be 3. What must be the value of the fourth central moment for the distribution to be:
a) Mesokurtic
b) Leptokurtic
c) Platykurtic

10. The mean and standard deviation of 25 observations were found to be 30 and 3 respectively. After the calculations were made, it was found that two of the observations were recorded as 29 and 31 incorrectly. Find the mean and standard deviation if the incorrect observations are excluded.

11. A person invested his money in two areas, A and B. His net profits (in Birr) for the first three months are:

| Area A | 72 | 76 | 74 |
|---|---|---|---|
| Area B | 45 | 92 | 85 |

a) Find the mean net profit for each area of investment.
b) Find the range of net profit in both areas.
c) Which area is riskier to invest in? In which area is the net profit more consistent?

12. The yearly salaries of all employees working for a company have a mean of Birr 42350 and a standard deviation of Birr 3820. The years of schooling for the sample of employees have a mean of 15 years and a standard deviation of 2 years. Is the relative variation in the salaries higher or lower than that in years of schooling for these employees? Why?

13. The coefficient of variation of a distribution is 60% and its standard deviation is 12. Find its mean.

14. The mean and variance of five observations are 4.8 and 4.56 respectively. If three of the five observations are 2, 5 and 6, find the other two observations.

15.
Using the frequency distribution given below, find
a) The range,
b) Quartile deviation,
c) Mean absolute deviation from the mean,
d) Variance and standard deviation,
e) Pearsonian coefficient of skewness using the two different formulas.

| Class Intervals | 50 - 52 | 53 - 55 | 56 - 58 | 59 - 61 | 62 - 64 |
|---|---|---|---|---|---|
| Frequencies | 5 | 10 | 21 | 8 | 6 |

Chapter 6
Simple Linear Regression and Correlation

Chapter Objective: Dear reader, after studying this chapter, you will be able to:
- Define regression analysis
- Define and fit simple linear regression
- Predict the population average value of the dependent variable on the basis of known (fixed) values of the independent variable
- Understand correlation
- Compute the Pearsonian and rank correlation coefficients

6.1. Simple Linear Regression

In the preceding chapters we have been dealing with data on a single variable. Here we shall focus on methods of dealing with paired data, which may be related in some way.

Regression analysis is concerned with describing and evaluating the relationship between a dependent variable and one or more independent variables. Regression is therefore used to bring out the nature of the relationship and to use it to find the best approximate value of one variable from the other. In what follows, we will deal with the problem of estimating and/or predicting the population mean (average) value of the dependent variable on the basis of known values of the independent variable(s). The variable whose value is to be estimated or predicted is known as the dependent variable, while the variables which help us in determining the value of the dependent variable are known as independent variables. A regression equation which involves only two variables, one dependent and one independent, is referred to as simple regression. This model assumes that the dependent variable is influenced by only one systematic variable and the error term.
However, when several variables (necessarily more than two) are included in the model, it is called multiple/multivariate regression. The relationship between any two variables may be linear or non-linear. The former implies a constant absolute change in the dependent variable in response to a unit change in the independent variable, while the latter implies a varying marginal change in the dependent variable in response to changes in the independent variable. In this chapter we will confine ourselves to regression involving only two variables whose relationship is linear. This is called simple linear regression.

6.1.1. The Scatter Diagram

Consider the following data collected by taking a sample of five industries in a given industrial sector on their input (number of workers) and output (thousands of Birr).

Table 6.1.

| Industry | Output (Yi), thousands of Birr | Inputs (Xi), no. of workers | Paired data (Xi, Yi) |
|---|---|---|---|
| 1 | 4 | 2 | (2, 4) |
| 2 | 7 | 3 | (3, 7) |
| 3 | 3 | 1 | (1, 3) |
| 4 | 9 | 5 | (5, 9) |
| 5 | 17 | 9 | (9, 17) |

Output level (Yi) is believed to depend on the number of workers (Xi). Accordingly, Yi is the dependent variable and Xi is the independent variable. In order to visualize the form of the regression we plot these points on a graph as shown in fig. 6.1. What we get is a scatter diagram.

[Fig. 6.1: Scatter diagram of output (Y, thousands of Birr) against number of workers (X)]

When carefully observed, the scatter diagram at least shows the nature of the relationship: whether it is positive or negative and whether it is linear or non-linear. When the general course of movement of the paired points is best described by a straight line, the next task is to fit a regression line which lies as close as possible to every point on the scatter diagram. This can be done by means of either free-hand drawing or the method of least squares. The latter is the most widely used method.

6.1.2.
The Regression Equation

A regression equation is a statement of equality that defines the relationship between two variables. The equation of the line which is to be used in predicting the value of the dependent variable takes the form $Y_e = a + bX$. The most universally used and statistically accepted method of fitting such an equation is the method of least squares.

The Method of Least Squares

This method requires that a straight line be fitted such that the sum of the squared vertical deviations of the observed Y values from the straight line (the predicted Y values) is a minimum. As shown in fig. 6.1, if $e_1, e_2, \ldots, e_5$ are the vertical deviations of the observed Y values from the straight line (predicted Y values, $Y_e$), fitting a straight line in keeping with the above condition requires that (for sample size n)

$$\sum_{i=1}^{n} e_i^2 \text{ is a minimum.}$$

This can be done by partially differentiating $\sum e_i^2$ with respect to a and b and equating the derivatives to zero. $e_i$ is the error made when taking $Y_e$ instead of $Y_i$. Therefore, $e_i = Y_i - Y_e$, and

$$\sum e_i^2 = \sum (Y_i - Y_e)^2 = \sum (Y_i - a - bX_i)^2$$

Differentiating with respect to a:

$$\frac{\partial \sum e_i^2}{\partial a} = -2\sum (Y_i - a - bX_i) = 0 \;\Rightarrow\; \sum Y_i = na + b\sum X_i \;\Rightarrow\; a = \bar{Y} - b\bar{X}$$

Differentiating with respect to b:

$$\frac{\partial \sum e_i^2}{\partial b} = -2\sum (Y_i - a - bX_i)X_i = 0 \;\Rightarrow\; \sum X_iY_i = a\sum X_i + b\sum X_i^2$$

Therefore,

$$b = \frac{\sum X_iY_i - n\bar{X}\bar{Y}}{\sum X_i^2 - n\bar{X}^2}$$

Or equivalently, multiplying both the numerator and the denominator by n, we get:

$$b = \frac{n\sum X_iY_i - \sum X_i \sum Y_i}{n\sum X_i^2 - \left(\sum X_i\right)^2}$$

Example 6.1. Suppose we want to study the relationship between input (number of workers) and output (thousands of Birr) of the five factories given in table 6.1 above. To fit the regression line of Yi (thousands of Birr) on Xi (number of workers), we can employ the method of least squares as follows:

Solution. Arrange the data in tabular form.

Table 6.2.

| | Yi | Xi | YiXi | Xi² |
|---|---|---|---|---|
| | 4 | 2 | 8 | 4 |
| | 7 | 3 | 21 | 9 |
| | 3 | 1 | 3 | 1 |
| | 9 | 5 | 45 | 25 |
| | 17 | 9 | 153 | 81 |
| Total (Σ) | 40 | 20 | 230 | 120 |
| Mean | 8 | 4 | | |

where $\bar{Y}$ = mean of Y, $\bar{X}$ = mean of X, Σ = summation/total, and n = sample size (n = 5).

Substituting these values in the above equations, we get

$$b = \frac{5(230) - 20(40)}{5(120) - (20)^2} = \frac{350}{200} = 1.75$$

$$a = \bar{Y} - b\bar{X} = 8 - 1.75(4) = 1$$

Therefore, the least squares regression equation is $Y_e = 1 + 1.75X_i$.

Estimate the amount of Birr that a factory will have if it has 8 workers.
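The least-squares fit for the data of table 6.1 can also be reproduced numerically. The following is a minimal Python sketch, not part of the original module: the function name `fit_least_squares` is a hypothetical helper, and it simply applies the standard closed-form expressions b = (nΣXY - ΣXΣY) / (nΣX² - (ΣX)²) and a = Ȳ - bX̄.

```python
def fit_least_squares(x, y):
    """Return (a, b) of the least-squares line Ye = a + b*X."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    # Slope: (n*ΣXY - ΣX*ΣY) / (n*ΣX² - (ΣX)²)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Intercept: mean of Y minus slope times mean of X
    a = sum_y / n - b * sum_x / n
    return a, b


# Table 6.1: number of workers (X) and output in thousands of Birr (Y).
workers = [2, 3, 1, 5, 9]
output = [4, 7, 3, 9, 17]

a, b = fit_least_squares(workers, output)

# Vertical deviations e_i = Y_i - Ye of each observation from the line.
residuals = [yi - (a + b * xi) for xi, yi in zip(workers, output)]
```

For these data the sketch gives a = 1 and b = 1.75. The residuals sum to zero (up to rounding), which is exactly the condition obtained by setting the partial derivative with respect to a equal to zero.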
For $X_i = 8$: $Y_e = 1 + 1.75(8) = 15$. Consequently, if a factory has 8 workers, its level of output will be 15 thousand ETB.

Example 6.2. In what follows you are provided with sample observations on price and quantity supplied of a commodity X by a competitive firm.
a) Construct the scatter diagram.
b) What is the linear regression of Yi (quantity supplied) on Xi (price of the commodity X)?
c) Suppose the price of commodity X is 32. What will be the quantity supplied by the firm?

Tab. 6.3. Data on price and quantity supplied.

| | Yi | Xi | XiYi | Xi² |
|---|---|---|---|---|
| | 40 | 15 | 600 | 225 |
| | 45 | 20 | 900 | 400 |
| | 40 | 25 | 1000 | 625 |
| | 50 | 30 | 1500 | 900 |
| | 55 | 35 | 1925 | 1225 |
| | 60 | 40 | 2400 | 1600 |
| | 60 | 45 | 2700 | 2025 |
| | 65 | 50 | 3250 | 2500 |
| | 70 | 55 | 3850 | 3025 |
| | 75 | 60 | 4500 | 3600 |
| | 55 | 40 | 2200 | 1600 |
| | 60 | 45 | 2700 | 2025 |
| Total | 675 | 460 | 27,525 | 19,750 |

a) [Figure: scatter diagram of quantity supplied (Yi) against price (Xi)]

b) With n = 12:

$$b = \frac{n\sum X_iY_i - \sum X_i\sum Y_i}{n\sum X_i^2 - \left(\sum X_i\right)^2} = \frac{12(27{,}525) - 460(675)}{12(19{,}750) - (460)^2} = \frac{19{,}800}{25{,}400} = 0.7795$$

$$a = \bar{Y} - b\bar{X} = 56.25 - 0.7795(38.33) = 26.3718$$

Therefore, the estimated supply function is $Y_e = 26.3718 + 0.7795X_i$.

c) For $X_i = 32$:

$Y_e = 26.3718 + 0.7795X_i = 26.3718 + 0.7795(32) = 26.3718 + 24.944 = 51.3158$

If the price of X is 32, the estimated quantity supplied will be approximately equal to 51 units.

6.1.3. Regression of X on Y

In sub-topic 6.1.2 above we explored regression of Y on X. Sometimes it is possible and of interest to fit the regression of X on Y, i.e., treating Y as independent and X as dependent. In such cases, the general form of the equation is given by:

$$X_e = a_0 + b_0Y_i$$

where $X_e$ = expected value of X, $a_0$ = X-intercept, and $b_0$ = slope of the regression.

Applying the principle of least squares as before, the constants $a_0$ and $b_0$ are given as follows:

$$b_0 = \frac{n\sum X_iY_i - \sum X_i\sum Y_i}{n\sum Y_i^2 - \left(\sum Y_i\right)^2}, \qquad a_0 = \bar{X} - b_0\bar{Y}$$

N.B. The regression line of Y on X and that of X on Y intersect at $(\bar{X}, \bar{Y})$.

6.2. Correlation

The correlation coefficient measures the degree to which two variables are related/associated. For two variables this is simple correlation, denoted by r. For more than two variables we have multiple correlation. Two variables may have positive correlation, negative correlation, or may not be correlated.
Furthermore, depending on the form of the relationship, the correlation between two variables may be linear or non-linear. In this section, we shall be concerned with quantifying the degree of association between two variables with a linear relationship. Contrary to the regression analysis explained in the previous section (6.1), the computation of the coefficient of correlation does not require one variable to be designated as dependent and the other as independent.

The measure of the degree of relationship between any two variables, known as the Pearsonian coefficient of correlation and usually denoted by r, is defined as

$$r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}}$$

and is termed the product-moment formula. It can be further simplified as

$$r = \frac{n\sum X_iY_i - \sum X_i\sum Y_i}{\sqrt{\left[n\sum X_i^2 - \left(\sum X_i\right)^2\right]\left[n\sum Y_i^2 - \left(\sum Y_i\right)^2\right]}}$$

NB. The building blocks of this formula are, therefore, $\sum X_i$, $\sum Y_i$, $\sum X_iY_i$, $\sum X_i^2$, $\sum Y_i^2$ and n (sample size).

Properties of the Pearsonian coefficient of correlation
1. $-1 \le r \le 1$.
2. When r = 0, there is no linear correlation between the two variables.
3. When r = 1 or r = -1, there is perfect positive or perfect negative correlation respectively.
4. Adding a constant number to each value of X and Y, as well as multiplying each value by a positive constant, does not affect the value of r.
5. The closeness of the relationship is not proportional to the value of r.
6. When r is positive and close to 1 there is high positive correlation, while when it is close to zero it shows low positive correlation. Similarly, when r is negative and close to -1 there is high negative correlation, while when it is close to zero it shows low negative correlation.
7. It is free of any units used.

Example 6.3. Find the Pearsonian coefficient of correlation for the two variables in the data of table 6.1.

Solution

Table 6.4.

| | Yi | Xi | Xi² | Yi² | XiYi |
|---|---|---|---|---|---|
| | 4 | 2 | 4 | 16 | 8 |
| | 7 | 3 | 9 | 49 | 21 |
| | 3 | 1 | 1 | 9 | 3 |
| | 9 | 5 | 25 | 81 | 45 |
| | 17 | 9 | 81 | 289 | 153 |
| Total | 40 | 20 | 120 | 444 | 230 |

$$r = \frac{5(230) - 20(40)}{\sqrt{\left[5(120) - (20)^2\right]\left[5(444) - (40)^2\right]}} = \frac{350}{\sqrt{(200)(620)}} = 0.99$$

Interpretation: it implies a strong positive relation.

Example 6.4. Find the Pearsonian coefficient of correlation for the two variables in the data of table 6.3.

Solution: Table 6.5.
| | Yi | Xi | Xi² | Yi² | XiYi |
|---|---|---|---|---|---|
| | 40 | 15 | 225 | 1600 | 600 |
| | 45 | 20 | 400 | 2025 | 900 |
| | 40 | 25 | 625 | 1600 | 1000 |
| | 50 | 30 | 900 | 2500 | 1500 |
| | 55 | 35 | 1225 | 3025 | 1925 |
| | 60 | 40 | 1600 | 3600 | 2400 |
| | 60 | 45 | 2025 | 3600 | 2700 |
| | 65 | 50 | 2500 | 4225 | 3250 |
| | 70 | 55 | 3025 | 4900 | 3850 |
| | 75 | 60 | 3600 | 5625 | 4500 |
| | 55 | 40 | 1600 | 3025 | 2200 |
| | 60 | 45 | 2025 | 3600 | 2700 |
| Total | 675 | 460 | 19,750 | 39,325 | 27,525 |

Therefore,

$$r = \frac{12(27{,}525) - 460(675)}{\sqrt{\left[12(19{,}750) - (460)^2\right]\left[12(39{,}325) - (675)^2\right]}} = \frac{19{,}800}{\sqrt{(25{,}400)(16{,}275)}} = 0.974$$

Interpretation: It implies a strong positive relation between X and Y.

Example 6.5. Adding a constant number, say 1, to each value of X and Y given in table 6.1, show that property 4 holds true.

Solution

Table 6.6.

| | Yi | Xi | Xi² | Yi² | XiYi |
|---|---|---|---|---|---|
| | 5 | 3 | 9 | 25 | 15 |
| | 8 | 4 | 16 | 64 | 32 |
| | 4 | 2 | 4 | 16 | 8 |
| | 10 | 6 | 36 | 100 | 60 |
| | 18 | 10 | 100 | 324 | 180 |
| Total | 45 | 25 | 165 | 529 | 295 |

$$r = \frac{5(295) - 25(45)}{\sqrt{\left[5(165) - (25)^2\right]\left[5(529) - (45)^2\right]}} = \frac{350}{\sqrt{(200)(620)}} = 0.99$$

Therefore, we have shown that property 4 is true.

Spearman's Rank Correlation Coefficient

The Pearsonian coefficient of correlation cannot be used in cases where direct quantitative measurement of the phenomenon under study is not possible. In such cases, we make use of the rank correlation coefficient.

Steps involved in calculating Spearman's coefficient of rank correlation:
1. Rank the X values among themselves, giving rank 1 to the largest (or smallest) value, rank 2 to the next largest (or smallest) value, and so on.
2. Rank the Y values among themselves in the same way as for X.
3. When there are ties in rank, i.e., when there are values sharing the same rank, assign to each of the tied observations the mean of the ranks they jointly occupy; the next rank is then skipped.
4. Find the sum of the squares of the differences between the ranks of the two variables.
5. Apply the formula

$$r_s = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)}$$

where n = number of pairs of observations and $d_i$ = ith difference between the ranks of X and Y.

As the steps above indicate, $r_s$ may be calculated for numerical data after ranking the values according to numerical size.

Example 6.6.
Consider the ranks given by two judges for five ladies in a beauty contest:

Table 6.7

| Ladies | AZEB | TIZITA | FATUMA | LEMLEM | CHALTU |
|---|---|---|---|---|---|
| Judge A (RA) | 1 | 3 | 4 | 2 | 5 |
| Judge B (RB) | 2 | 4 | 3 | 1 | 5 |

Solution: With $d_i = R_B - R_A$:

| Ladies | di | di² |
|---|---|---|
| AZEB | 1 | 1 |
| TIZITA | 1 | 1 |
| FATUMA | -1 | 1 |
| LEMLEM | -1 | 1 |
| CHALTU | 0 | 0 |
| Total | | 4 |

$$r_s = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)} = 1 - \frac{6(4)}{5(25 - 1)} = 1 - \frac{24}{120} = 0.8$$

Interpretation: Since $r_s = 0.8$, there is a high degree of similarity between the ranks of Judge A and Judge B.

Review Exercises

1. Define and distinguish between:
a) Regression and correlation
b) Simple and multiple regression
c) Linear and non-linear relationship

2. Bring out the relevance of a scatter diagram in regression analysis.

3. Explain the meaning and status of the two constants a and b in the regression equation $Y_e = a + bX_i$.

4. The marks obtained by 10 students in their graduation with a B.A. degree in management and in the MBA entrance test were found as given below.

| Graduation (Xi) | 50 | 52 | 55 | 60 | 62 | 65 | 65 | 66 | 70 | 75 |
|---|---|---|---|---|---|---|---|---|---|---|
| Entrance test (Yi) | 52 | 50 | 57 | 65 | 65 | 62 | 65 | 65 | 71 | 75 |

Find
a) The two regression equations
b) The correlation coefficient between the two sets of marks

5. Obtain the regression equations of X on Y and Y on X for the paired data given below. Also compute the coefficient of correlation.

| Market price of X | 26 | 28 | 30 | 31 | 35 |
|---|---|---|---|---|---|
| Market price of Y | 20 | 27 | 28 | 30 | 25 |

6. Ten students got the following marks in Maths and Statistics.

| Student | A | B | C | D | E | F | G | H | I | J |
|---|---|---|---|---|---|---|---|---|---|---|
| Maths (X) | 78 | 36 | 98 | 25 | 75 | 82 | 90 | 62 | 65 | 39 |
| Statistics (Y) | 84 | 51 | 91 | 60 | 68 | 62 | 86 | 58 | 58 | 47 |

Compute the coefficient of rank correlation and interpret the result.

7. For a certain set of paired data on X and Y, 3Xi + 2Yi - 26 = 0 and 6Xi + Yi - 31 = 0 are the two regression equations.
a) Find the mean values
b) Find the coefficient of correlation

8. A leading company engaged in the production of detergents has 10 vacancies for salesmen, for which 15 (n) persons were called for personal interviews. The interview board consisted of the sales manager and a psychologist. The ranks given by the two to all 15 candidates who attended the interview are given below.

Sr.No.
in the interview list: 1, 2, 4, 5, 8, 9, 10, 11, 13, 14, 15, 17, 18, 19, 20
Ranking by the sales manager (Xi): 2, 3, 1, 5, 4, 6, 8, 7, 9, 10, 12, 11, 13, 14, 15
Ranking by the psychologist (Yi): 1, 3, 2, 4, 6, 5, 7, 9, 8, 11, 10, 12, 14, 13, 15

Compute the rank correlation coefficient.

Chapter Seven
Elementary Probability

Chapter Objectives: Dear learner, at the end of this chapter, you are expected to:
- Define probability.
- Understand the basic terms such as experiment, outcome and event.
- Calculate probabilities applying the rules of addition and multiplication.
- Define the terms conditional probability and joint probability.
- Understand permutation and combination.
- Define the terms random variable and probability distribution.
- Distinguish between a discrete and a continuous probability distribution.
- Calculate the mean, variance and standard deviation of discrete probability distributions.
- Understand binomial and normal probability distributions.
- Define and calculate the Z-value.
- Compute probabilities using the standard normal distribution.

7.1. Introduction

Probability as a general concept can be defined as the chance of an event occurring. Probability theory gives us methods of dealing with uncertainty. As nothing is accurately predictable, uncertainty is a common feature of every decision-making process. In such situations probability theory comes to our aid by providing the necessary methods to take appropriate decisions even under conditions of risk and uncertainty.

7.2. Definition and basic concepts

An Experiment is the process that leads to the occurrence of one or more possible observations.
Example:
- Tossing a coin
- Rolling two dice once
- Drawing a card from a deck

Sample Space is a complete listing of all elementary events of an experiment.
Example. The sample space for the experiment of tossing a coin is (H, T). If two coins are tossed once, the sample space is (H1, H2), (H1, T2), (T1, H2), (T1, T2). The sample space for the roll of a single die is (1, 2, 3, 4, 5, 6).
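The sample spaces listed so far can be enumerated mechanically. The following is a minimal Python sketch, not part of the original module: the function name `sample_space` is a hypothetical helper, and it relies on `itertools.product` from the standard library to list every ordered combination of outcomes.

```python
from itertools import product


def sample_space(outcomes, trials):
    """List every ordered result of repeating an experiment `trials` times."""
    return list(product(outcomes, repeat=trials))


coin = ["H", "T"]
die = [1, 2, 3, 4, 5, 6]

one_coin = sample_space(coin, 1)   # [('H',), ('T',)]
two_coins = sample_space(coin, 2)  # [('H','H'), ('H','T'), ('T','H'), ('T','T')]
one_die = sample_space(die, 1)     # six single-element tuples, (1,) to (6,)
```

The same helper works for any number of repetitions, for example `sample_space(die, 2)` for two dice, and `len()` of the result gives the number of sample points.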
If two dice are rolled once, the possible outcomes (sample space) are:

(1,1) (1,2) (1,3) (1,4) (1,5) (1,6)
(2,1) (2,2) (2,3) (2,4) (2,5) (2,6)
(3,1) (3,2) (3,3) (3,4) (3,5) (3,6)
(4,1) (4,2) (4,3) (4,4) (4,5) (4,6)
(5,1) (5,2) (5,3) (5,4) (5,5) (5,6)
(6,1) (6,2) (6,3) (6,4) (6,5) (6,6)

Sample points are the elements of a sample space. Example: 2 is one sample point of rolling a die. To find the number of sample points, apply the formula $K^n$, where n is the number of experiments and K is the number of possible outcomes of a single experiment. For two dice, $6^2 = 36$.

An event is a collection of one or more outcomes of an experiment.
Events are mutually exclusive if the occurrence of any one event means that none of the others can occur at the same time. That is, if two events cannot occur at the same time, they are mutually exclusive.
Events are independent if the occurrence of one event does not affect the occurrence of another.
Events are collectively exhaustive if at least one of the events must occur when an experiment is conducted.

Example: A fair die is rolled once. The experiment is rolling a die. The possible outcomes are the numbers 1, 2, 3, 4, 5 and 6. If the event is the occurrence of an even number, it consists of the outcomes 2, 4 and 6.

Probability is a measure of the chance or likelihood that a particular event will happen in the future. It can only assume values between 0 and 1, inclusive. The probability of an event E, written P(E), is a number with the following properties:
- P(E) = 0 means the event will not happen; such an event is called an impossible event.
- P(E) = 1 means we are 100% sure that the event will occur (sure event).

Probability can be defined using three different approaches:
(i) Classical probability
(ii) Relative frequency (empirical) probability
(iii) Subjective probability

i) Classical Probability: It is based on the assumption that the outcomes of an experiment are equally likely. It applies rules and laws and involves an experiment.
$P(E) = \frac{n}{N}$
where N = the total number of possible outcomes of an experiment, and n = the number of outcomes in which the event occurs out of the N outcomes.
Examples:
a) In a coin tossing experiment, what is the probability of getting a head on one toss of a coin? As there are only two possible outcomes, the probability is 50%, or 0.5, or 1/2.
b) An unbiased die is thrown. What is the probability that the digit 2 appears? Ans. 1/6.

ii) Relative Frequency (Empirical) Probability: This method is based on cumulative past historical data.
$P(E) = \frac{\text{number of times the event occurred in the past}}{\text{total number of observations}}$
Examples:
a) Suppose that, of the last 70 days with conditions like those forecast for today, it rained on 12 days. What is the probability of rain today based on those historical days? 12/70 = 0.17 or 17%
b) Throughout her teaching career a professor has awarded 186 A's out of 1200 students. What is the probability that a student in her section this semester will receive an A grade? 186/1200 = 0.155

iii) Subjective Probability: It uses a probability value based on an educated guess or estimate, employing opinions and inexact information. For example, a seismologist might say that there is a 45% probability that an earthquake will occur in Afar after thirty years.

7.3. Basic Rules of Probability
If two events A and B are mutually exclusive, the special rule of addition states that the probability of A or B occurring equals the sum of their respective probabilities:
P(A or B) = P(A) + P(B)

Definition: Two events of a single experiment are said to be mutually exclusive if they cannot occur simultaneously as a result of the experiment. This is equivalent to saying that mutually exclusive events must have disjoint event sets.

Example: Abay Zuria transport association has recently supplied the following information on their trips from Bahir Dar to Debre Markos:

Arrival      Frequency
Early           100
On time         800
Late             75
Cancelled        25
Total          1000

If A is the event that a bus arrives early, then P(A) = 100/1000 = .10. If B is the event that a bus arrives late, then P(B) = 75/1000 = .075. The probability that a bus is either early or late is: P(A or B) = P(A) + P(B) = .10 + .075 = .175.
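The bus example can be checked numerically. The sketch below, in Python, computes the relative-frequency probabilities from the arrival table and applies the special rule of addition; the dictionary layout and the helper name relative_frequency are assumptions made for illustration only.

```python
# Arrival frequencies from the Abay Zuria example above.
arrivals = {"Early": 100, "On time": 800, "Late": 75, "Cancelled": 25}
total = sum(arrivals.values())

def relative_frequency(outcome):
    """Empirical probability of one arrival category."""
    return arrivals[outcome] / total

p_early = relative_frequency("Early")
p_late = relative_frequency("Late")

# "Early" and "Late" are mutually exclusive, so their probabilities add.
p_early_or_late = p_early + p_late

print(round(p_early_or_late, 3))  # 0.175
```

The addition is valid only because a single bus cannot be both early and late; for overlapping events the general rule of addition is needed instead.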
The complement rule
The complement rule is used to determine the probability of an event occurring by subtracting the probability of the event not occurring from 1. If P(A) is the probability of event A and P(~A) is the probability of the complement of A, then P(A) + P(~A) = 1, or P(A) = 1 - P(~A).

Examples:
1) Two events X and Y are mutually exclusive. Suppose P(X) = 0.05 and P(Y) = 0.02. What is the probability that either X or Y will occur? (0.07) What is the probability that neither X nor Y will happen? (0.93)
2) Suppose the probability that you will score an A in this class is 0.25 and the probability that you will get a B is 0.50. What is the probability that your grade will be above C? (0.75)
3) The probabilities of events A and B are 0.20 and 0.30 respectively. The probability that both A and B occur is 0.15. What is the probability that either A or B will occur? (0.35)
4) A student is taking two courses, microeconomics and statistics. The probability that the student will pass the microeconomics course is 0.60 and the probability of passing the statistics course is 0.70. The probability of passing both is 0.50. What is the probability of passing at least one course? (0.80)

The general rule of addition
If A and B are two events that are not mutually exclusive, then P(A or B) is given by the following formula:
P(A or B) = P(A) + P(B) - P(A and B)

Example: In a sample of 500 students, 320 said they had a radio, 175 said they had a TV, and 100 said they had both. If a student is selected at random, what is the probability that the student has a radio (S), has a TV (T), and has both a radio and a TV?
Solution: P(S) = 320/500 = .64. P(T) = 175/500 = .35. P(S and T) = 100/500 = .20.
If a student is selected at random, what is the probability that the student has either a radio or a TV in his or her room?
Solution: P(S or T) = P(S) + P(T) - P(S and T) = .64 + .35 - .20 = .79.

Joint Probability
A joint probability measures the likelihood that two or more events will happen at the same time.
An example would be the event that a student has both a radio and a TV in his or her dorm room.

Special rule of multiplication
The special rule of multiplication requires that two events A and B are independent. Two events A and B are independent if the occurrence of one has no effect on the probability of the occurrence of the other. Two events originating from independent experiments will be independent, while two events originating from the same experiment will not, in general, be independent.
Example: Suppose two coins are tossed. The outcome of one coin (head or tail) is unaffected by the outcome of the other coin. That is, the outcome of the second event does not depend on the outcome of the first event.
This rule is written: P(A and B) = P(A)P(B)

7.4. Conditional Probability
A conditional probability is the probability of a particular event occurring, given that another event has occurred. The probability of the event A given that the event B has occurred is written P(A|B).

General rule of multiplication
The general rule of multiplication is used to find the joint probability that two events will occur. It states that for two events A and B, the joint probability that both events will happen is found by multiplying the probability that event A will happen by the conditional probability of B given that A has occurred. The joint probability P(A and B) is given by the following formula:
P(A and B) = P(A)P(B|A) or P(A and B) = P(B)P(A|B)
where P(B|A) = the probability of B given that event A has occurred.
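To see the difference between the two multiplication rules, the radio/TV survey from Section 7.3 can be reused. The sketch below is written in Python with exact fractions; the variable names are illustrative, and the conditional probability is obtained by restricting attention to the 320 radio owners, which is an assumption about how the counts would be read rather than a step stated in the text.

```python
from fractions import Fraction

# Counts from the radio/TV survey in Section 7.3.
n_students = 500
n_radio = 320
n_tv = 175
n_both = 100

p_radio = Fraction(n_radio, n_students)   # P(S)
p_tv = Fraction(n_tv, n_students)         # P(T)

# P(T|S): among the 320 radio owners, 100 also own a TV.
p_tv_given_radio = Fraction(n_both, n_radio)

# General rule of multiplication: P(S and T) = P(S) * P(T|S).
p_both = p_radio * p_tv_given_radio

# The special rule P(S) * P(T) would hold only if S and T were independent.
p_if_independent = p_radio * p_tv

print(float(p_both))            # 0.2
print(float(p_if_independent))  # 0.224
```

The general rule reproduces the joint probability of .20 found earlier, while the product P(S)P(T) gives .224, so owning a radio and owning a TV are not independent events in this sample.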
Conditional probability:
$P(A|B) = \frac{P(A \text{ and } B)}{P(B)}, \quad P(B) \neq 0$

Example: The Dean of the School of Business at a university collected the following information about undergraduate students in her college:

Major         Male   Female   Total
Accounting     170     110      280
Finance        120     100      220
Marketing      160      70      230
Management     150     120      270
Total          600     400     1000

a) If a student is selected at random, what is the probability that the student is a female (F) and an Accounting major (A)?
P(A and F) = 110/1000 = .11
b) Given that the student is a female, what is the probability that she is an Accounting major?
P(A|F) = P(A and F)/P(F) = [110/1000]/[400/1000] = .275

Let an experiment have a sample space S with E as any event. We define the probability of E occurring, written P(E), as a number satisfying the following conditions: P(S) = 1 and $\sum p_i = 1$, where $p_i$ is the probability of the i-th sample point.

Additional examples:
1. An experiment is performed by tossing a normal coin and observing which side (H or T) is shown uppermost.
   a. Write down the sample space: S = (H, T)
   b. Calculate P(H): P(H) = 1/2
   c. Show that P(S) = 1: P(S) = 1/2 + 1/2 = 1
   d. Show that E1 (H) and E2 (T) are mutually exclusive.
2. A fair die is rolled once as an experiment with S = (1, 2, 3, 4, 5, 6).
   a. P(1 or 2) = P(1) + P(2) = 1/6 + 1/6 = 1/3
   b. P(X < 4) = 1/2
   c. P(even number) = 1/2
   d. P(even or less than 4) = P(even number) + P(< 4) - P(even number and < 4) = 1/2 + 1/2 - 1/6 = 5/6

7.5. Counting Procedures
A permutation is any arrangement of r objects selected from n possible objects. The formula to count the total number of different permutations is
${}_nP_r = \frac{n!}{(n - r)!}$
where $n! = n(n - 1)(n - 2)\cdots 2 \cdot 1$. By definition, 0! (read as zero factorial) = 1.
NB. The arrangements abc and bac are different permutations.

Example: If you have three guests (Abebe, Bekele, Chala) invited to come to your house,
a. In how many ways can they sit on the chairs available in your house?
   Sitting arrangements:
   Abebe, Bekele, Chala
   Abebe, Chala, Bekele
   Bekele, Abebe, Chala
   Bekele, Chala, Abebe
   Chala, Abebe, Bekele
   Chala, Bekele, Abebe
   ${}_3P_3 = \frac{3!}{(3 - 3)!} = 6$
   Therefore, there are 6 different arrangements for the three guests.
b. If you want to arrange seats for two guests out of three, in how many ways can you arrange them?
   Abebe, Bekele
   Abebe, Chala
   Bekele, Abebe
   Bekele, Chala
   Chala, Abebe
   Chala, Bekele
   ${}_3P_2 = \frac{3!}{(3 - 2)!} = 6$
   Therefore, there are 6 different sitting arrangements for the two guests.
c. What if you are trying to give a seat to one guest out of the three guests?
   Abebe
   Bekele
   Chala
   ${}_3P_1 = \frac{3!}{(3 - 1)!} = 3$
   Therefore, there are 3 different sitting arrangements for one guest.

A combination is the number of ways to choose r objects from a group of n objects when order does not matter. Formula:
${}_nC_r = \frac{n!}{r!(n - r)!}$

Example: If executives Abebe, Bekele and Chala are to be chosen as a committee to negotiate on the price of a car,
a. How many combinations of these three executives are possible?
   Solution: ${}_3C_3 = \frac{3!}{3!(3 - 3)!} = 1$
   There is only one combination of these three. The committee of Abebe, Bekele and Chala is the same as the committee of:
   Bekele, Chala and Abebe
   Chala, Abebe and Bekele
   Bekele, Abebe and Chala
   Chala, Bekele and Abebe
   Abebe, Chala and Bekele
b. How many combinations are possible if two executives are supposed to negotiate to buy a car?
   Abebe, Bekele
   Abebe, Chala
   Bekele, Chala
   ${}_3C_2 = \frac{3!}{2!(3 - 2)!} = \frac{3!}{2! \cdot 1!} = 3$
   Three combinations are possible.
c. How many combinations are possible if one executive is supposed to negotiate to buy a new car?
   Abebe
   Bekele
   Chala
   ${}_3C_1 = \frac{3!}{1!(3 - 1)!} = \frac{3!}{1! \cdot 2!} = 3$
   Three combinations are possible.

7.6. Probability Distributions and Random Variables
Probability Distribution: It is a listing of all the outcomes of an experiment and the probability of each of these outcomes, in either tabular or graphical form.

Random Variables
A random variable is a numerical value determined by the outcome of an experiment.

Types of Probability Distributions
A discrete probability distribution can assume only certain outcomes.
A continuous probability distribution can assume an infinite number of values within a given range.
Examples of a discrete distribution are:
- The number of students in a class.
- The number of children in a family.
- The number of cars entering a carwash in an hour.
Examples of a continuous distribution include:
- The distance students travel to class.
- The time it takes an executive to drive to work.

Features of a Discrete Distribution
The main features of a discrete probability distribution are:
- The sum of the probabilities of the various outcomes is 1.00.
- The probability of a particular outcome is between 0 and 1.00.
- The outcomes are mutually exclusive.

Example: Consider a random experiment in which a coin is tossed three times. Let x be the number of heads. Let H represent the outcome of a head and T the outcome of a tail. The possible outcomes for such an experiment will be: TTT, TTH, THT, THH, HTT, HTH, HHT, HHH. Thus the possible values of x (number of heads) are 0, 1, 2, 3.
- The outcome of zero heads occurred once.
- The outcome of one head occurred three times.
- The outcome of two heads occurred three times.
- The outcome of three heads occurred once.
From the definition of a random variable, x as defined in this experiment is a random variable. The probability distribution is given as:

X      P(X)
0      1/8
1      3/8
2      3/8
3      1/8

The Mean of a Discrete Probability Distribution
The mean:
- reports the central location of the data.
- is the long-run average value of the random variable.
- is also referred to as its expected value, E(X), in a probability distribution.
- is a weighted average.
The mean is computed by the formula:
$\mu = \sum [x P(x)]$
where $\mu$ represents the mean and P(x) is the probability of the various outcomes x.

The Variance of a Discrete Probability Distribution
The variance measures the amount of spread (variation) of a distribution. The variance of a discrete distribution is denoted by the Greek letter $\sigma^2$ (sigma squared). The standard deviation is the square root of $\sigma^2$.
The variance of a discrete probability distribution is computed from the formula:
$\sigma^2 = \sum [(x - \mu)^2 P(x)]$

Examples:
1. The tables listed below show random variables and their probabilities. However, only one of these is actually a probability distribution:

      (i)              (ii)             (iii)
  X    P(X)        X    P(X)        X    P(X)
  5    0.30        5    0.10        5    0.50
 10    0.30       10    0.30       10    0.30
 15    0.20       15    0.20       15   -0.20
 20    0.40       20    0.40       20    0.40

a) Which one is a probability distribution? (Table (ii): its probabilities all lie between 0 and 1 and sum to 1.00. The probabilities in table (i) sum to 1.20, and table (iii) contains a negative probability.)
b) Using the correct probability distribution, find the probability that X is
   1) Exactly 15 (0.20)
   2) Not more than 10 (0.40)
   3) More than 5 (0.90)
c) Calculate the mean, variance and standard deviation of the correct probability distribution.
   Mean = 5(.10) + 10(.30) + 15(.20) + 20(.40) = 0.5 + 3 + 3 + 8 = 14.5
   Variance = $(5 - 14.5)^2(.10) + (10 - 14.5)^2(.30) + (15 - 14.5)^2(.20) + (20 - 14.5)^2(.40) = 9.025 + 6.075 + 0.05 + 12.1 = 27.25$
   Standard deviation = $\sqrt{27.25} \approx 5.22$

2. According to recent information published in Capital magazine, 36 percent of the households in Ethiopia have one TV set, 47 percent have 2 sets, 15 percent have 3 sets, and 2 percent have 4 sets.
a) Depict the probability distribution.

   X       1      2      3      4
   P(X)   0.36   0.47   0.15   0.02

b) What is the mean number of sets per household?
   $\mu = 1(.36) + 2(.47) + 3(.15) + 4(.02) = 1.83$
c) What is the variance of the number of sets per household?
   $\sigma^2 = (1 - 1.83)^2(.36) + (2 - 1.83)^2(.47) + (3 - 1.83)^2(.15) + (4 - 1.83)^2(.02) = .5611$

3. The head of a department estimated the distribution of student admission to his department for the next semester, based on past experience, as follows:

   Admission   Probability
   1000           0.60
   1200           0.30
   1500           0.10

a) What is the expected number of students who will be admitted to the department next semester? (Ans. 1110)
b) Compute the variance and standard deviation.

The binomial distribution
The binomial distribution has the following characteristics:
- An outcome of an experiment is classified into one of two mutually exclusive categories, such as a success or a failure.
- The data collected are the results of counts.
- The probability of success stays the same for each trial.
- The trials are independent.

Mean & Variance of the Binomial Distribution
The mean is found by: $\mu = n\pi$
The variance is found by: $\sigma^2 = n\pi(1 - \pi)$

To construct a binomial distribution, let
- n be the number of trials
- x be the number of observed successes
- $\pi$ be the probability of success on each trial
The formula for the binomial probability distribution is:
$P(x) = {}_nC_x \, \pi^x (1 - \pi)^{n - x}$

Example: The Department of Labor reports that 20% of the workforce is unemployed. From a sample of 14 workers, calculate the following probabilities:
- Exactly three are unemployed.
- At least three are unemployed.
- At least one is unemployed.

Solution
The probability of exactly 3:
$P(3) = {}_{14}C_3 (.2)^3 (1 - .2)^{11} = 364 \times 0.008 \times 0.0859 = 0.2501$
The probability of at least 3:
$P(x \geq 3) = {}_{14}C_3 (.2)^3 (1 - .2)^{11} + {}_{14}C_4 (.2)^4 (1 - .2)^{10} + \cdots + {}_{14}C_{14} (.2)^{14} (1 - .2)^0 = 0.552$
The probability of at least one being unemployed:
$P(x \geq 1) = 1 - P(0) = 1 - {}_{14}C_0 (.2)^0 (1 - .2)^{14} = 0.956$

The Normal Probability Distribution
Characteristics of a Normal Probability Distribution
- The normal curve is bell-shaped and has a single peak at the exact center of the distribution.
- The arithmetic mean, median, and mode of the distribution are equal and located at the peak. Thus half the area under the curve is above the mean and half is below it.
- The normal probability distribution is symmetrical about its mean.
- The normal probability distribution is asymptotic. That is, the curve gets closer and closer to the X-axis but never actually touches it.
- It is a continuous probability distribution.
- Theoretically, the curve extends to infinity in both directions.

The Standard Normal Probability Distribution
The standard normal distribution is a normal distribution with a mean of 0 and a standard deviation of 1. It is also called the z distribution. A z-value is the distance between a selected value, designated X, and the population mean, divided by the population standard deviation.
The formula is:
$z = \frac{X - \mu}{\sigma}$

Example: The bi-monthly starting salaries of recent MBA graduates follow the normal distribution with a mean of Birr 2,000 and a standard deviation of Birr 200. What is the z-value for a salary of Birr 2,200?
$z = \frac{X - \mu}{\sigma} = \frac{2{,}200 - 2{,}000}{200} = 1.00$
What is the z-value for a salary of Birr 1,700?
$z = \frac{X - \mu}{\sigma} = \frac{1{,}700 - 2{,}000}{200} = -1.50$
A z-value of 1 indicates that the value of Birr 2,200 is one standard deviation above the mean of Birr 2,000. A z-value of -1.50 indicates that Birr 1,700 is 1.5 standard deviations below the mean of Birr 2,000.

Example: The daily water usage per person in New Providence, New Jersey is normally distributed with a mean of 20 gallons and a standard deviation of 5 gallons. About 68 percent of those living in New Providence will use how many gallons of water?
About 68% of the daily water usage will lie within one standard deviation of the mean, that is, between 15 and 25 gallons.

What is the probability that a person from New Providence selected at random will use between 20 and 24 gallons per day?
$z = \frac{X - \mu}{\sigma} = \frac{20 - 20}{5} = 0.00$
$z = \frac{X - \mu}{\sigma} = \frac{24 - 20}{5} = 0.80$
The area under a normal curve between a z-value of 0 and a z-value of 0.80 is 0.2881. We conclude that 28.81 percent of the residents use between 20 and 24 gallons of water per day.

What percent of the population use between 18 and 26 gallons per day?
$z = \frac{X - \mu}{\sigma} = \frac{18 - 20}{5} = -0.40$
$z = \frac{X - \mu}{\sigma} = \frac{26 - 20}{5} = 1.20$
The area associated with a z-value of -0.40 is .1554. The area associated with a z-value of 1.20 is .3849. Adding these areas, the result is .5403. We conclude that 54.03 percent of the residents use between 18 and 26 gallons of water per day.

Review Exercises
1) Which of the following is a correct statement about a probability?
   a. It may range from 0 to 1
   b. It may assume negative values
   c. It may be greater than 1
   d. It cannot be reported to more than 1 decimal place
   e. All the above are correct
2) An experiment is a
   a. Collection of events
   b. Collection of outcomes
   c. Always greater than 1
   d. The act of taking a measurement or the observation of some activity
   e. None of the above is correct
3) Events are independent if
   a. By virtue of one event happening another cannot
   b. The probability of their occurrence is greater than 1
   c. We can count the possible outcomes
   d. The probability of one event happening does not affect the probability of another event happening
   e. None of the above
4) When we find the probability of an event happening by subtracting the probability of the event not happening from 1, we are using
   a. Subjective probability
   b. The complement rule
   c. The general rule of addition
   d. The special rule of multiplication
   e. Joint probability
5) The Special Rule of Addition is used to combine
   a. Independent events
   b. Mutually exclusive events
   c. Events that total more than one
   d. Events based on subjective probabilities
   e. Found by using joint probabilities
6) When we determine the number of combinations
   a. We are really computing a probability
   b. The order of the outcomes is not important
   c. The order of the outcomes is important
   d. We multiply the likelihood of two independent trials
   e. None of the above
7) The difference between a permutation and a combination is
   a. In a permutation order is important and in a combination it is not
   b. In a permutation order is not important and in a combination it is important
   c. A combination is based on the classical definition of probability
   d. A permutation is based on the classical definition of probability
   e. None of the above
8) Which of the following is not a requirement of a binomial distribution?
   a. A constant probability of success
   b. Only two possible outcomes
   c. A fixed number of trials
   d. Equally likely outcomes
9) The expected value of a probability distribution
   a. Is the same as the random variable
   b. Is another term for the mean
   c. Is also called the variance
   d. Cannot be greater than 1
10) The normal distribution is a
   a. Discrete distribution
   b. Continuous distribution
   c. Positively skewed distribution
   d. None of the above
11) Which of the following are characteristics of the normal distribution?
   a. It is a symmetric distribution
   b. It is bell-shaped
   c. It is asymptotic
   d. All of the above
12) Which of the following statements is correct regarding the standard normal distribution?
   a. It is also called the z distribution
   b. Any normal distribution can be converted to the standard normal distribution
   c. The mean is 0 and the standard deviation is 1
   d. All of the above are correct
13) The area under a normal curve between 0 and -1.75 is
   a. 0.0401
   b. 0.9599
   c. 0.4599
   d. None
14) The area under a normal curve less than 1.75 is
   a. 0.0401
   b. 0.9599
   c. 0.4599
   d. None
15) In the standard normal distribution, what is the probability of finding a z value between -1.25 and -1.00?
   a. 0.3944
   b. 0.3413
   c. 0.7357
   d. 0.0531
16) Which of the following is not a requirement of a probability distribution?
   a. Equally likely probability of a success
   b. Sum of the possible outcomes is 1.00
   c. The outcomes are mutually exclusive
   d. The probability of each outcome is between 0 and 1
17) In a continuous probability distribution
   a. Only certain outcomes are possible
   b. All the values within a certain range are possible
   c. The sum of the outcomes is greater than 1.00
   d. None of the above
18) In a normal distribution the relationship between the mean, median, and the mode is
   a. They are all equal
   b. The mean is the largest
   c. The median is the largest
   d. None of the above

Problems
19) Sixty percent of the students at Scandia Tech drive to class and 30 percent have GPAs of at least 3.00. Ten percent of the students have a GPA of at least 3.00 and drive to class. If we select a student at random, what is the likelihood that the student has a GPA of at least 3.00 or drives to class?
20) An insurance sales representative has appointments with four clients today. From long experience she knows that the probability of selling a policy to a client is .80.
   a. What is the probability of selling a policy to all 4 clients?
   b. What is the probability of selling a policy to three or more clients?
21) There are 600 employees at the Tuesday Morning's Department Store corporate headquarters in Columbia. See the following breakdown.

   Gender   No College   College   Total
   Male         25         225      250
   Female       75         275      350
   Total       100         500      600

   An employee is selected at random.
   a. What is the probability the employee is female?
   b. What is the probability the employee is either female or attended college?
   c. What is the probability the employee attended college, given that the employee is female?
22) For a particular group of taxpayers, 25 percent of the returns are audited. Six taxpayers are randomly selected from the group.
   a. What is the probability exactly two are audited?
   b. What is the probability two or more are audited?
23) Suppose P(A) = 0.75 and P(B|A) = 0.40. What is the joint probability of A and B?

Sample Answer for Review Exercises

Chapter one – Introduction
1. A  2. B  3. C  4. D  5. B  6. E  7. C
8. a. Inferential  b. Descriptive  c. Deferential  d. Inferential
9. a. Qualitative  b. Quantitative  c. Qualitative  d. Qualitative
15. D  16. D

Chapter two – Sampling Theory
7. a. Probability  b. From 150 = 6 sample; from 100 = 4 sample; from 50 = 2 sample
9. 100 from 10,000; 50 from 5000; 150 from 15,000; 2000 from 20,000 & 500 from 50,000
Choose the best answer
1. D  2. C  3. D  4. D  5. C

Chapter three – Data Collection & Presentation
Choose the best answer
1. B  2. C  3. A  4. C  5. A  6. C  7. B  8. C  9. D  10. B
Work Out
5. X = 25, Y = 35, a = 45, b = 65, c = 80, z = 0.25
6. iii) a = 40, b = 60, c = 40, d = 25, e = no answer

Chapter Four – Measures of central tendency
Choose the best answer
1. B  2. C  3. C  5. C  6. A  7. A
Work Out
7. 70;  8. Birr 2000;  9. 5%;  10. Birr 5.83;  11. 2/3;  12. 20% & 80% respectively;  13. Birr 3.095/Kg;  14. Birr 1400;  15. 50.9;  16. 15.43;  17. 15.5 years;  18. 66;  19. 4 & 6;  20. Birr 1500.0185;  21. 20.05;  22. 3.18;  23. 2700;  24. 30.20;  25. 3200;  26. 50 & 100 respectively;  27. 39.08;  28.
f1 = 20;  29. f1 = 25 & f2 = 24;  30. $0.80;  31. 35;  32. 21.5 - 28.5;  33. a. 30.2  b. 28.75

Chapter Five – Measures of Dispersion
1. D  2. A  3. A  4. D  5. B
9. a. 243  b. > 243  c. < 243
10. Mean = 30, S.D = 3.1
11. a. 74  b. 4 & 47
12. CV for salaries = 9%; CV for years of schooling = 13.33%
13. Mean = 20
14. X = 8, Y = 3

Chapter Six – Simple linear regression & Correlation
4. a) Ye = 1.434 + 0.993 Xi; Xe = 6.182 + 0.886 Yi  b) r = 0.938
5. Xe = 20.59 + 0.36 Yi; Ye = 12.29 + 0.46 Xi; r = 0.41
6. rs = 0.818
7. $\bar{X}$ = 4, $\bar{Y}$ = 7, r = -0.5
8. rs = 0.96

Chapter Seven – Introduction to Probability
1. A  2. D  3. D  4. B  5. B  11. D  18. A
19. P(A or D) = 0.6 + 0.3 – 0.1 = 0.8
20. P(4) = 0.4096; P(X ≥ 3) = 0.8192
21. P(F) = 0.5833; P(F or C) = 0.9583; P(C|F) = 0.7857
22. P(2) = 0.2966; P(X ≥ 2) = 0.466
23. 0.30

Bibliography
1. D.A. Lind, W.G. Marchal and S.A. Wathen, Statistical Techniques in Business and Economics, 12th edition
2. A.G. Bluman, Elementary Statistics: A Step by Step Approach, 2nd and 5th edition
3. Gupta, Introduction to Statistics
4. Ghosh and Saha, Business Mathematics and Statistics, 10th edition
5. William Mendenhall, et al., Introduction to Probability and Statistics
6. Monga, G.S. (1972), Mathematics and Statistics for Economics, Vikas Publishing House