KENYA METHODIST UNIVERSITY Department of Economics and Applied Statistics STATISTICS AND BIOMETRICS By Prof. George K. King’oriah PhD., MBS. P. O. Box 45240 NAIROBI - 00100 Kenya Tel: (020) 2118443/2247987/2248172 SafariCom + 254 0725 – 751878 Celtel 0735 - 372326 Email: nairobicampus@kemu.ac.ke © KEMU SCHOOL OF BUSINESS AND ECONOMICS 1 GENERAL STATISTICS CURRICULUM OUTLINES ELEMENTARY BUSINESS STATISTICS The nature of variables, Sums and summation signs, Types of statistical data, Samples, Sampling, Population and the universe. The functioning of factorials, frequency distributions, Measures of central tendency like the mean, median, mode, proportion. Measures of variability – Variance, and Standard deviation; and Standard errors. Probability, Rules of probability, Permutations and combinations. The binomial theorem and approach to normal curve. Random variables and probability distributions. The normal curve and use of normal tables, the normal deviate “Z”, the “t” distribution the proportion “ ” and standard error of the proportion “ ”, confidence intervals and hypothesis testing, use of standard errors in statistics. Analysis of variance, and the use of “F” statistic; simple linear regression/correlation analysis; the Chi-square statistic (introduction) INTERMEDIATE BUSINESS STATISTICS Non- parametric statistics; Chi-square statistical analysis, Median test for one sample, Mann-Whitney test for two independent samples, Wilcoxon rank test for two matched samples, Kruskal-Wallis test for several independent variables: Rank order correlation, Spearman’s rank correlation “ s ”, Kendall’s Rank correlation coefficient, Kendall’s partial Tau “ ”; Biserial Correlation. Poisson distribution “ ”, Bayes Theorem, and Posterior probability Non- linear regression correlation, partial regression correlation, Multiple regression by Snedecor method, Multiple regression using linear equations and Matrix algebra, significant testing for linear, non linear and multiple regression and correlation parameters. Advanced analysis of variance, randomized block design and Latin squares. ADVANCED BUSINESS AND APPLIED STATISTICS Use of linear and non-linear regression/correlation analysis, use of multiple regression in measurement of economic and non-economic variables, identification and identification problems, Ordinary least squares and generalized least squares models and their use, models and model building, multi-collinearity, Heteroscedasticity, Two-stage least squares models, Maximum likelihood methods, Bayesian methods of estimation, use of Dummy Variables; Logit, Probit and Tobit models, Aggregation problems and their use in the estimation of structural variables, Index Numbers, Time Series Analysis, Autocorrelation, Analysis of Seasonal Fluctuations, cyclical movements and irregular fluctuations, Price Indices, Decision Theory and decision making. Application of computer packages for data analysis (SPSS, Excel, E-Views, SHAZAM, MINITAB, etc). Use of equation editors for statistical and mathematical presentations 2 STATISTICS AND BIOMETRICS Purpose of the course The course is designed to introduce the learner to the purpose and meaning of Statistics; and to the use of statistics in research design, data collection, data analysis and research reporting, especially in biometrics and agricultural research. Objectives of the Course By the end of the course the learner is expected to :1. Understand and use sampling and sampling methods 2. Understand the nature and use of Statistics 3. 
Use Statistics for data collection, data analysis and report writing within the environment of biometric research. 4. Be able to participate in national and international fora and discussions which involve analysis and interpretation of statistical and biometric research data. 5. Be able to explain (in a simplified and an understandable manner) the meaning of research data and findings to students, farmers, ordinary citizen, Politicians and other stakeholders in the Agricultural Sector. 6. Be able to carry out Biometric project evaluation, management, and monitoring. Course Description Common concepts in Statistics, Graphic Presentation of Data, Measures of Central Tendency and Dispersion, Measures of variability or dispersion. Simple Rules of Probability, Counting Techniques, Binomial Distribution. Normal Distribution, The t distribution The Proportion and Normal Distribution. Sampling Methods, Hypothesis testing, Confidence Intervals. Chi-Square Distribution. Analysis of Variance. Linear Regression/correlation. Partial Correlation. Logarithmic and Mathematical Transformations. Non-Linear Regression and Correlation. Non-parametric Statistics: Median Test for one Sample. Mann-Whitney Test for Two Independent Samples, Wilcoxon Signed-Rank Test, Kruskal-Wallis Test, Spearman’s Rank Order Correlation. Use of statistical methods for measurement, data collection analysis, and Research reporting. 3 Course Outline 1. Statistics, Biostatistics, measurements and analysis 2. Basic Probability Theory 3. The Normal Distribution 4. Sampling Methods and Statistical Estimation 5. Confidence Intervals 6. Hypothesis Testing using Small and Large Samples 7. The Chi-Square Statistic and Testing for the Goodness of Fit 8. Analysis of Variance 9. Linear Regression and Correlation 10. Partial Regression and Multiple Regression 11. Non-Linear Regression and Correlation 12. Significance Testing using Regression and Correlation methods 13. Non-Parametric Statistics Teaching Methods 1. Classroom Lectures and learner interaction in lectures and discussions 2. Distance Study materials and frequent instructor supervision at distance learning centers 3. Students may be guided to design simple experiments and test various hypotheses, using various statistical and biometric methods. Recommended Textbooks Class Text Books King’oriah, George K. (2004), Fundamentals of Applied Statistics, Jomo Kenyatta Foundation, Nairobi. Steel, Robert G.D. and James H. Torrie, (1980); Principles and Procedures of Statistics, A Biometric Approach. McGraw Hill Book Company, New York. 4 Useful References (The Most recent editions of these textbooks should be obtained) Gibbons, Jean D., (1970) Non-Parametric Statistical inference. McGraw-Hill Book Company, New York (N.Y.) Keller, Gerrald, Brian Warrack and Henry Bartel; Statistics for Management and Economics. (1994) Duxbury Press, Belmont (California). Levine, David M., David Stephan, (et. al.) (2006), Statistics for Managers. Prentice Hall of India New Delhi. Pfaffenberger, Roger C., Statistical Methods for Business and Economics. (1977), Richard D. Irwin, Homewood, Illinois (U.S.A) Salvatore, Dominick, and Derrick Reagle; (2002) Statistics and Econometrics, McGrawHill Book Company, New York (N.Y.) Siegel, Sidney C. Non-Parametric Statistics for the Behavioral Sciences. (1956) McGraw-Hill Book Company, New York Snedecor, George W. and William G. Cochran; Statistical Methods (1967) Iowa University Press, Ames, Iowa. (U.S.A.) 5 STATISTICS AND BIOMETRICS By Prof. George K. 
King’oriah, B.A., M.I.S.K., M.Sc., Ph.D., M.B.S.

CHAPTER ONE: STATISTICS, BIOSTATISTICS, MEASUREMENT AND ANALYSIS
Objectives and uses of Statistics
Types of Statistics
Basic common concepts in Statistics
Graphic Presentation of Data
Measures of Central Tendency and Dispersion
Measures of variability or dispersion

CHAPTER TWO: BASIC PROBABILITY THEORY
Some basic terminology
Simple Rules of Probability
Counting Techniques
Binomial Distribution

CHAPTER THREE: THE NORMAL CURVE AS A PROBABILITY DISTRIBUTION
The Normal Distribution
Using Standard Normal Tables
The "t" Distribution and Sample Data
The Proportion and the Normal Distribution

CHAPTER FOUR: STATISTICAL INFERENCE USING THE MEAN AND PROPORTION
Sampling Methods
Hypothesis Testing
Confidence Intervals
Type I and Type II Errors
Confidence Interval using the Proportion

CHAPTER FIVE: THE CHI-SQUARE STATISTIC AND ITS APPLICATIONS
The Chi-Square Distribution
Contingency Tables and Degrees of Freedom
Applications of the Chi-Square Test

CHAPTER SIX: ANALYSIS OF VARIANCE
Introduction
One-Way Analysis of Variance
Two-Way Analysis of Variance

CHAPTER SEVEN: LINEAR REGRESSION AND CORRELATION
Introduction
Regression Equation of the Linear Form
The Linear Least Squares Line
Tests of Significance for Regression Coefficients
Coefficient of Determination and the Correlation Coefficient
Analysis of Variance for Regression and Correlation

CHAPTER EIGHT: PARTIAL REGRESSION, MULTIPLE LINEAR REGRESSION AND CORRELATION
Introduction
Partial Correlation
Computational Techniques
Significance Tests
Analysis of Variance

CHAPTER NINE: NON-LINEAR CORRELATION AND REGRESSION
Logarithmic and other Mathematical Transformations
Non-Linear Regression and Correlation
Testing for a Non-Linear Relationship
Significance Tests using Non-Linear Regression/Correlation

CHAPTER TEN: NON-PARAMETRIC STATISTICS
Need for Non-Parametric Statistics
Median Test for One Sample
Mann-Whitney Test for Two Independent Samples
Wilcoxon Signed-Rank Test
Kruskal-Wallis Test for Several Independent Variables
Spearman's Rank Order Correlation

CHAPTER ONE: STATISTICS, BIOSTATISTICS, MEASUREMENT AND ANALYSIS

Objectives and uses of Statistics
After working for some time in adult life, we may be required to use statistics in our efforts to analyze data measured from real-life phenomena. The immediate question most of us ask is why statistics has not been necessary all this time, and why it is required now. The answer to this puzzle is that, during all the time we were not asked to use statistics, our lives may have been much simpler than they are now, and we did not require rigorous data analysis, rigorous proofs, and rigorous demonstration of data accuracy and veracity. Now that we are interested in all these things, I welcome you to one of the most powerful tools available to mankind for data collection, data analysis, and data reporting or presentation. This tool is largely a product of the technological developments of the 19th and 20th centuries.
That is when it became necessary to be accurate with data (of all kinds) because of the nature of production mechanisms involved in all the industries that sprang up as a result of the industrial revolution, all current human endeavor, and the current information age (using computers and allied equipment) at the dawn of the 21st Century. For example, in this regard, we are never satisfied by reports from anybody that something is bigger or better than another one. The current technical minded man or woman requires to know how much bigger the object is, and how better it is. This means that there must be reliable data (always), upon which to base our decision making. This is the nature of today’s life. Think of the time your child comes home from school at the end of the term, with a report form. He or she tells you that he/she was number ten in this term’s examination. Unlike our forefathers, we are not satisfied by the mere number (or position) that our child scored (obtained) in school. We need to know the average mark that the child obtained, so that we can know the actual academic strength of our child. We then may consider the average or the mean grade. This is when we are able to preliminarily rate our child’s ability. There are very few of us who have not done this, to ascertain whether our child is good material for admission to secondary schools, or to universities or other 8 institutions of higher learning. This is one of the commonest and simplest uses of Statistics, which we shall be considering soon in this module. The average mark is called the mean grade or mean mark! Do you see how close Statistics is to our daily lives? Then it pays to learn statistics! For a summary of how to begin reading this exciting subject, learners are referred to George K. Kingoriah’s Fundamentals of Applied Statistics (2004, Jomo Kenyatta Foundation, Nairobi.) Chapter One. These days, statistical calculations are required in almost all human activities and all academic disciplines. Advances in science and technology in all fields of study, all production areas, and all areas of human endeavor have necessitated the use of statistics. Some phenomena are easily quantified and can be easily measured. Other natural phenomena are not that easy to measure. In all these cases, we resort to the use of statistics so that we can know how to handle each case we come across. In this module we shall learn how to analyze facts and figures, and to summarize many diffuse data types using the tools that are available within the discipline of statistics. We may therefore define statistics as a body of knowledge in the realm of applied mathematics with its own ethics, terminology content, theorems and techniques. When we master this discipline though our rigorous study we seek to master all these theorems and analytical techniques and to make them our toolkit which we can use as our second nature in order to understand the universe around us and to make sure that we are not duped by con-people who push facts and figures to us so that we can believe them in order they can gain emotionally, politically, spiritually or materially from our data indigestion. In this way we view the discipline of statistics as a tool for helping us in our daily needs of data analysis and understanding. It may not be necessary to use the tool always, but if it is available we can use it any time we need it. It is better to have it on stand-by, than to miss it altogether. This way we become flexible in the understanding of our universe. 
Origins of Statistics 1. Statistics has many origins. However, there are two chief sources of this discipline. One of the earliest sources of statistics is in the book of Numbers within the 9 Holy Bible, when it was necessary to know how many of the children of Israel were proceeding to the Promised Land, which flowed with milk and honey. Activity Learners are requested to appreciate that the numbers given in the Book of Numbers are facts and figures of governing any community, or any state. This is where the name Statistics came from. Please read the Book of Numbers in the Old Testament of the Holy Bible to be able to appreciate this. Also one of the other ancient events most copiously recorded happened when Jesus Christ was born. Remember how there was data needs within the ancient Roman Empire, which caused Caesar Augustus to have an enumeration of all people in the Roman Empire? When it came to Palestine, it affected the Holy Family, and they had to travel to Bethlehem so that the population count can be accurate for the purposes of the emperor. They had to be counted in their district of origin, not where they worked and lived. Upon this event, Jesus was born in Bethlehem. The term “Statistics” 1. Heads of states and other officials of government need to know the characteristics of the populations they are serving, such as birth rates, death rates, numbers of livestock per person, the amount of crop yields of farms in the areas they administer, and so on. These are the facts and figures of state. Hence the name Statistics. 2. The second major source of statistics lay in the desire of kings and noblemen in Europe during classical and Baroque times (1600 - 1900 A.D.), who needed to know chances of winning in their gambling endeavors. This led to the development of Probability theory and the games of chance; which is very much in use within the discipline of statistics today. 3. The word Statistics is used in describing the subject matter of the discipline where facts and figures and probability theory are used to collect, analyze and present data in a rigorous manner, so that we can be assisted to make informed decisions. Anybody studying Statistics today is therefore interested in this kind of meaning of the word Statistics. The techniques, terminology and methodology of applying facts and figures 10 which are encountered in daily life - especially as a result of research or scientific investigation, are the subject matter of the discipline of statistics. In this regard, the uses and applications of various algorithms which mankind has inherited from the study of mathematics is the discipline of Statistics. Science and Scientific Method During the age of discovery (1300 - 1800) people used to travel out of, and go back to Europe, and give accounts of what wonderful things they had seen and what wonderful events they had witnessed during their colorful travels. For a satirical example of this activity, we need to look at novels like Jonathan Swift’s Gulliver’s Travels (see any encyclopedia for this reference), and to note how mysterious the strange lands are described in this book. Reports of activities of strangely gigantic races and dwarfish races are given. Scholars in Europe got fed up with con-people who after taking a ship overseas, came to report of such fantastic events. (Jonathan Swift was not one of these). 
Sir Francis Bacon (1561 - 1626) (among many other endeavors and achievements) pioneered in the development of a method to avoid academic and factual conmanship, which he called Novum Organum or “the new instrument of knowledge”. The essence of this method was to prescribe what must done to ensure that what was reported by the traveler or the investigator is true replicable, and verifiable. According to him, experiments or methods of study must be carefully outlined or defined. The expected results must be carefully stated. The way of ascertaining whether the expected results were correct must be carefully stated even before going to the field to collect data or to attempt the required exploratory feat. The method of study or exploration must be replicable by all other persons in the same discipline of study. It is only after these rigorous steps that the person proposing the study must begin that study. Ethics dictate that he must explain why and when he fails to achieve his objective, and why he knows he has failed. If he succeeds, he must mention succinctly why he has succeeded. This is the process which has been used to pioneer all the discoveries and the inventions of mankind since that time (including computers and the marvels of internet). The methodology involved is the so-called Scientific Method of investigation; and the 11 discipline of Statistics is one of the tools used to ensure the success and rigor of this method. Science is therefore a systematic discipline of studying the natural world around us. It is not looking through sophisticated equipment, or reading subjects like physics, chemistry, and botany (etc.) alone which comprise science. Science is a discipline of study or investigation. Later we shall learn how to set up study hypotheses and how to prove them using the scientific method of investigation. Biostatistics There are physical and social sciences. Social sciences are involved in the investigation of human behavior and human activities using statistical and other investigative models. For example, Economics is a social science. Physical sciences, on the other hand, are amenable to experimentation because of the fact that the variables under investigation can be accurately measured and controlled. Bio-statistics is a subdiscipline of statistics adopted to analyze all the phenomena which occur in our environmental milieu, and which manifest themselves as biological substances or creatures. Their behavior, the qualitative and quantitative data that result from their individual or collective manifestation is amenable to the study and investigation using statistical tools. These tools are what we call biostatistics. There is no great difference between biostatistics and other types of statistics. The only difference is in the specialized application of the broad statistical methodology to investigation of those phenomena which arise in the biological environment. Activity Learners are requested to write a short essay on what science is; and on the differences between physical and social sciences. Consult all the reference material that may be available in your University library, especially books on Research Methodology. Functions of Statistics 1. Statistics is not a method by which one can prove almost anything one wants to prove. Carefully laid rules of interpreting facts and figures help to limit the prevalence of 12 this abuse in data analysis and data presentation. 
Although there is nothing significant to prevent the abuse, statisticians are instructed against intellectual dishonesty; and are cautioned against possible misuse of this tool. During the instruction of Statistics, ethical behavior is assiduously encouraged. 2. Statistics is not simply a collection of facts and figures. There could not be any point of studying the subject if this were so, since the collection would have no meaning, especially after such collection has been accomplished. 3. Statistics is not a substitute for abstract or theoretical thinking. Theory cannot be replaced by Statistics, but is usually supplemented by statistics through accurate data collection, data analysis and data presentation. 4. Statistics does not seek to avoid the existence of odd or exceptional cases. It seeks to reveal such existence, and to facilitate careful examination of the cases within the scope of research at hand. In some cases it is through the investigation of the odd, the peculiar or the residual cases that we have come to understand the world better. Sir Alexander Fleming (1881 - 1955) invented Penicillin in 1924, because he got interested in the strange behavior and death of the bacteria which he was culturing after they were accidentally exposed to mould (stale) bread-crumbs that fell within his petri-dishes by accident. In the first instance, he had not set off to discover penicillin, he was doing other things. The world has been saved from millions of bacterial infections as a result of him investigating this strange, fortuitous event, which took place in his bacteria-containing petri-dishes by accident. 5. Statistics is not a “fix-all” tool-kit for every type of scientific investigation. It is one of the many tools which are available to the scientist for data analysis and problem solution. If in any event statistics is not required, it is obviously foolish to try to force statistical analysis when other tools like deductive and inductive reasoning, cartography, calculus, historical analysis, (and others) can do. Also, one does not use advanced statistical techniques when simple ones are sufficient to do the job; simply for the purposes of impressing his/her peers, all and sundry. Always use the most effective and the simplest statistical tool available to avoid making a fool of yourself among all other scholars. That simple and humble habit will also save time and money in any research environment. 13 Types of Statistics Descriptive Statistics Descriptive statistics are sometimes called inductive statistics because their very nature induces the conclusion of the researcher or the investigator. Information collected from research is sometimes too much to be meaningful. This information requires to be sifted and arranged to a point where the researcher or the investigator can see the pattern of trends and what is in it. Information is sometimes summarized - but most often arranged in a comprehensive manner, in accordance with the subject matter of the research or the investigation in question. This arrangement involves computing such measures as means, ratios standard deviations (to be explained later) and others. By so doing, we dilute the massive amounts of data to manageable and understandable proportions. Note carefully, that this may not necessarily involve summarizing the data collected. Where it does we must be very careful we do not lose any important data. If we do, our conclusions could be misleading. Cautious interpretation is therefore necessary. 
In the event that some data must be omitted, the limitations of such omission must ne carefully outlined and the reasons why such omission was done must be exhaustively justified. Inferential Statistics This type of statistics is sometimes called deductive statistics because the researcher deduces the meaning of data from its behavior after its analysis. Sometimes it is necessary to generalize on the basis of limited information which is caused by certain difficult or constraining circumstances. Time and money may not allow for exhaustive investigation of the subject matter and exhaustive data collection. Therefore, the researcher draws a sample representing a small proportion of the group under investigation which he considers has representative characteristics of each member of the larger group. From this sample, he can make inferences, and he can deduce his conclusions. This is the most abused type of statistics, and the researcher or investigator must be careful not to misuse the information if he is interested in intellectual honesty and integrity. 14 Ethical requirements in biostatistics are as rigorous as in all other types of statistics and other means of measurement, data analysis and compilation. All final reports must be accurate to the best of the knowledge and ability of the investigator. Basic common concepts in Statistics Operational Definitions in Statistics Population or the Universe : This is a term used by statisticians to refer to the totality of the population of all objects they are interested in studying at any one moment. For example, all students enrolled in the course of Biostatistics in Kenya Methodist University in any one year (if this is the subject of our interest), comprise the population on which our study is centered. The word Population generally means the actual number of people in the demographic sense. However, statisticians stretch the word to mean any group of inanimate objects which are the subject of interest to the investigator. This is also called the universe - to connote the fact that this is all there is of the number of the subjects having some characteristic of interest. A sample: Sometimes it is not possible to study the whole population of the objects of our interest. It becomes necessary to choose a few representative members of that population which will be expected to have adequate characteristics of the universe. Rigorous methods of sampling are used to ensure that all members are represented by the resulting sample, and that the sample has the representative characteristics of the entire population. Unless this is done, it is possible to come to erroneous conclusions through the use of some unrepresentative sample. A variable : A variable is a phenomenon which changes with respect to the influence of other phenomena or objects. This changing aspect of the phenomenon is what makes it be known as a variable, because it varies with respect to changes in other phenomena. There are two types of variables: dependent variables and independent variables. The former vary because of other phenomena which exist outside the domain of the investigation. Therefore its magnitude depends on the magnitude or the presence of other variables - hence the term dependent variable. Independent variables are those phenomena which are capable of influencing other variables without they (themselves) 15 being influenced. 
They are outside the domain of investigation, but they determine the result or the size of the subject dependent variables in the experiment or in the model. In the statement the weight of on individual depends on the dietary habits of an individual we have both the dependent and the independent variable. The weight of an individual is a dependent variable, while the dietary habits which influence that weight are the independent variable. A model : is a theoretical construction of all expected interrelationships of facts and figures which cause and affect any natural physical or social phenomenon. Sometimes it is impossible to represent the complex reality in the manner that it can be easily understood and interpreted. A model is an attempt to construct a simplified explanation of the complex reality in a manner which can be easily understood, and whose nature can generally be said to be possible at all times. Models are often used to describe phenomena in social sciences because the interaction of variables is of a complex nature and it is impossible to run controlled experiments to determine the nature of the interaction of such variables. For example, the habits of any consumer population can only be modeled, because we cannot lock any member or members of the population or any two members of the population in a laboratory for a long time to study their consumer habits. An Experiment: This term will no doubt appeal to students of biostatistics. This is an attempt to observe scientifically physical characteristics of natural variables and the interaction thereof in a laboratory, or within some controlled environment. In so doing, it is possible to have some alternative but equal and equally representative variables which are set up, but denied the influence of the dependent variable, to ensure that really the independent variable within an experiment is the one which is the cause of change in the variable under investigation. An experiment is possible in all physical and biological sciences. Students of biostatistics no doubt have done some experiment or another during their academic careers. An experiment is more exact than a model. The standard of measurement can be made very precise, and the results are precisely determined within this experiment using mathematical tools. 16 A measurement is a precise quantification of phenomena determined using some instrument or some precise method of observation. To measure is to quantify or to determine precisely the magnitude of any kind of variable. A measure is a quantity so determined. An Index is an imprecise measurement which is only capable of indicating a trend. Indexes or indices are designed only where the procedure of measurement is able to yield an imperfect or an imprecise indicator. Examples of this are the consumer price index in economics, or the discomfort index in climatology and weather forecasting. An index is only an imperfect indicator of some underlying concept or variable which is not directly measurable. Validity of an Index This refers to the appropriateness of an index in measuring some underlying concept. A valid index should be appropriate for such measurement, ideally, despite any operational procedure used. Any change in operation in the procedure which changes the index implies invalidity of the index. In practice it is difficult to find indices obeying this rule precisely. This accounts for the many disagreements regarding the use and the results of most indices. 
A hypothesis is an educated guess about the nature of the relationship between variables which is capable of being tested during the process of modeling or experimentation. We shall consider hypotheses later in the course of this module. In order to determine whether any hypothesis is testable, valid measurements or indices must be used. Only operationally defined concepts make it feasible for a hypothesis to be testable. Concepts which are not operationally defined, but which appear in a hypothesis, lead to endless debate as to their validity, and had better be avoided.

Summation Signs
One of the most confusing features of statistics to a beginner is the summation sign. We are all used to individual counts of phenomena, but these compact instructions present some difficulty, because it is not every day that we are instructed in mathematical symbolism to sum quantities resulting from many different observations of different aspects of one phenomenon. Learners are requested to pay close attention and to learn the nature of these instructions. In fact, I do not mind if a whole day is taken to learn this section, because then we guarantee ourselves an enjoyable time in the tactical use of these instructions. Learners should therefore take their time, because this is the only hurdle we have to clear in the use of these symbols. For more work on these instructions refer to King'oriah (Jomo Kenyatta Foundation, 2004) and/or any other text on Statistics.

The Greek letter Sigma, written "Σ", is used to herald these instructions. Using this convenient symbol we are instructed to add all that is listed in front of it. For example, the instruction $\sum x_i$ means you add all observations of the characteristic "i". In this regard, the subscript "i" is only a description of the nature of the objects to be added. This means we must add all the $x_1, x_2, \dots, x_n$ observations in the subject population. The instruction $\sum_{i=1}^{n} x_i$ means we add all observations from the first observation to the last, which is the n-th observation. In actual arithmetic terms it implies:

$\sum_{i=1}^{n} x_i = x_1 + x_2 + \dots + x_n$

From this expression we find that the summation sign "Σ" is surrounded by a curious arrangement. At the bottom there is "i = 1", and at the top there is "n". The former instructs the analyst where to start the work, and the latter where to end the summation. These two are called the indices of summation: "i = 1" is the running index, and "n" marks the end of the summation. Obviously $x_i$ is the variable designation of what is being summed.

Summation signs follow the simple operations of algebra. This means that in dealing with each expression we must follow all the rules of algebra. We must remember the BODMAS rule when we encounter complicated expressions. Thus, in manipulating summation instructions we deal with all the expressions in the Brackets first, followed by Orders (powers, roots and "of" operations), then Division, Multiplication, Addition and finally Subtraction; hence BODMAS.
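The module itself works by hand and with packages such as Excel or SPSS, but the summation rules above can also be checked with a few lines of code. The following is only an illustrative sketch in Python, using small made-up lists x and y and an arbitrary constant k (hypothetical values, not taken from the module); it shows that a sum of sums equals the sum of the separate sums, that a constant can be taken outside the summation sign, and that the sum of squares is not the square of the sum.

```python
# Minimal sketch of the summation rules, using hypothetical data.
x = [10, 15, 6, 12, 11]   # five observations of a variable x_i
y = [2, 4, 1, 3, 5]       # five observations of a second variable y_i
k = 3                     # an arbitrary constant

# The sum of (x_i + y_i) equals the sum of x plus the sum of y.
print(sum(xi + yi for xi, yi in zip(x, y)))   # 69
print(sum(x) + sum(y))                        # 69

# A constant multiplier can be taken outside the summation sign.
print(sum(k * xi for xi in x))                # 162
print(k * sum(x))                             # 162

# The sum of squares is NOT the same as the square of the sum.
print(sum(xi ** 2 for xi in x))               # 626
print(sum(x) ** 2)                            # 2916
```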
Activity
1. Simplify the following expressions: (a) $\sum_{i=1}^{n} (x_i + y_i)$; (b) $\sum_{i=1}^{n} (x_i + y_i)^2$; (c) $\sum_{i=1}^{n} k x_i$. All of these are explained in King'oriah (2004), Jomo Kenyatta Foundation, Nairobi, Chapter Two.
2. After a busy working day, five orange vendors were observed, special regard being paid to the number of oranges remaining unsold at the end of a busy market day. The following table illustrates the results:
(a) Sum all the unsold oranges.
(b) Sum the squares of the unsold oranges.
(c) Find the square of the total number of unsold oranges.
(d) Explain the difference, if any, between the sum of the squares in (b) and the square of the total in (c).

Types of Statistical Data
All data measurements can be classified into the following categories:
(a) Nominal data
(b) Ordinal data
(c) Interval data
(d) Ratio data
These terms denote the nature of the data and the measurement level at which such data has been acquired.

Nominal Data
This is the weakest level of measurement. It entails the classification of data qualitatively by name - hence the term "nominal". For example, data may be labeled into the two categories "men" and "women"; these two categories can be known only by name. Meat can be classified as "fresh" and "stale". Names like Caroline, Chege, Acheampong, Patel, Kadija, and so on, are classifications on the nominal scale. If you classify cats as "black" and "white", you are measuring them on the nominal scale. Analysis and manipulation of this data requires those statistical techniques which can handle names and nominal data. The Chi-Square statistic in Chapter Five of this module is one of the few that are available for this work.

Ordinal Data
This is the kind of data which is categorized using qualities one can differentiate by size. In other words, the data is transitive, that is, it has magnitude and direction. Thus, data classified as big, bigger, biggest, or large, larger, largest, and similar qualities, is data which has been acquired and arranged in an ordinal manner. It is ordinal data, and the level of measurement for such data is ordinal. Chi-Square and Analysis of Variance (to be learned later), together with any other measures and statistics that can handle this type of data, are used to manipulate and to make deductions with this kind of data.

Interval Data
This is data acquired through a process of measurement where equal measuring units are employed. Such data has magnitude and direction (it is transitive), and the size of the interval between each observation and the one above it is the same for all observations. This data therefore contains all the characteristics of nominal and ordinal data. In addition, the scale of measurement moves uniformly in equal intervals up and down the respective sizes of the data - hence the name "interval" data. The only weakness of this kind of data is that the position of zero is not clear, unless it can be assumed. Thus data like 2001, 2002, 2003 and so on is interval data; the zero year can then be assumed to be 2001. Data like temperature readings have an absolute zero so far removed that it is not practical to find it and use it in everyday data manipulation. The same applies to time in hours or even in minutes, and so on. The statistics used for analysis include such measures as analysis of variance and regression/correlation. However, ratios are difficult to compute with interval data.

Ratio Data
This is the highest level of measurement, with transitivity (magnitude and direction), equal-interval qualities, and a zero which can be identified and used conveniently. It is possible to perform all mathematical manipulations using this data, whereas with other data such an exercise is not possible due to the lack of a zero level.
Division and ratio computation between one group of observations and another is possible - hence the use of the word ratio. All the known statistical techniques are useful with this kind of data. This is the kind of data most people can handle with ease, because the observations are countable and divisible.

Statistical Generalization
After the above considerations we now come to the other meaning of the word "statistic": a quantity computed from a sample and obtained after the manipulation of data using any known method of analysis. The characteristics of a sample are called statistics, while those of a population or a universe are called parameters.

The Concept of Probability
Probability is used in statistics to measure how likely events are to occur. Many times, events of interest to ordinary people and researchers are observed to happen. Due to the nature of the circumstances surrounding the occurrence of these events, one is not sure whether they can be repeated. Over the years, since the gambling exercises of the baroque era, mathematicians have developed methods of assessing the certainty that any single event can happen. The probability that any event A can happen is computed as the ratio of the number of successes (or favorable outcomes) to the number of all the observations in that population. Thus:

P(A) = r / n, where r = the number of favorable outcomes and n = the total number of trials.

Favorable outcomes need not be the most pleasant outcomes at all times. The term may be used to describe even highly unpleasant affairs. Consequently the term is used to denote all the events which are of interest to the observer, however pleasant or unpleasant they may be. A farmer who is interested in the frequency of the failure of the short rains is dealing with an event that is very unpleasant to all farmers. However, since this is the event of his statistical interest, any failure of the short rains is a favorable event from his statistical point of view. We shall deal with probability in greater detail in Chapter Two of this module.

Activity
1. Compute the probability of obtaining four heads in six tosses of an evenly balanced coin.
2. Compute the probability of obtaining a six in a single toss of a fairly balanced die. (Plural - dice)
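A short computational sketch can accompany the rule P(A) = r/n and the two activity questions above. This is only an illustrative aside in Python (the module does not prescribe any programming language); it uses the standard math module, and the coin question anticipates the binomial counting techniques of Chapter Two.

```python
from math import comb

# Single toss of a fair die: one favorable outcome out of six equally likely ones.
p_six = 1 / 6
print(round(p_six, 4))            # 0.1667

# Exactly four heads in six tosses of a fair coin:
# C(6, 4) favorable arrangements out of 2**6 equally likely sequences.
p_four_heads = comb(6, 4) / 2 ** 6
print(p_four_heads)               # 0.234375
```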
The Normal Curve
This concept will be discussed in greater detail in Chapter Three of this module. However, it is appropriate to introduce it here as one of the most frequently used probability models in statistics. We shall see later that the normal curve is a logical concept described by means of a precise mathematical formula. It is used frequently in statistics because many real-world variables like weight, height, length, and so on, which occur naturally in populations of interest to statisticians, are distributed in a manner that approximates the shape of this curve. The normal curve is actually a representation of how frequently any real-life phenomenon occurs. It records graphically the frequency with which an observation can be expected to occur within a specific range of all the characteristics available either in a sample or in any given population.

Frequency Distribution
For example, the average height of maize plants after three weeks of growth in a normal season with good rains can be recorded in terms of the number of plants that have grown to specific heights in meters or centimeters, as in the diagram below. If we were to count all maize plants which achieve each height over and above any specified level, we obtain a number of plants which, if we divide by the number of all the maize plants involved in our experiment - say 100 - gives the probability of finding maize satisfying that category in that field.

Activity
1. Use Figure 1 - 1 and count all the maize plants which have attained the height of 0.30 meters after three weeks of growth in our hypothetical field. What is the number of these plants? This number is the frequency of occurrence of maize with this characteristic, or this quality of being 0.30 meters high.
2. What is the ratio of this number to the total number of plants in our sample? This is the proportion of the sample with this characteristic relative to all the maize considered in our experiment.
3. Count how many plants are 0.30 meters and below in this array of data. Divide this number by the total number of maize plants in our sample. The result gives you the ratio, or the proportion, of all the maize which is 0.30 meters and below in the sample of our interest.
4. Plot a graph of the number of maize plants in each category of data over the whole range of measurement qualities in our sample. This should give you a curve of the distribution of all characteristics of maize plants having different qualities in our sample. The graph is what is called a frequency distribution, which has the quality of our interest on the horizontal axis and the number of plants having each of these qualities on the vertical axis.

The Normal Curve as a Frequency Distribution
We shall see later that the normal curve is a frequency distribution of characteristics which are supposed to be the same, but which differ slightly in magnitude. The greatest number of the observations is clustered at the center of the distribution, and the smaller numbers are clustered towards the ends of the distribution, just like our maize plants in the example illustrated in Figure 1 - 1.

Figure 1 - 1: The numbers of maize plants counted for each height in meters, from 0.05 meters to 0.70 meters, at 0.03 meter intervals. (Source: George K. King'oriah, op. cit., 2004)

Factorials
We shall need to understand the meaning of factorials, especially when we deal with permutations and combinations. The factorial of any number n is that number multiplied by all the positive whole numbers less than it. The factorial is denoted by an exclamation mark behind the number which is to be manipulated in this way. Thus we can have such numbers as 8!, 20!, 3!, and any other number that we wish; and we can denote this statement as n!. The general expression for a factorial is:

n! = n × (n - 1) × (n - 2) × ... × 2 × 1.

In this case "n" is the number whose factorial we are looking for, and "!" is the notation for the factorial, as we have seen above. As an example, we take 4! to be:

4! = 4 × 3 × 2 × 1 = 24.

Do not bother to take factorials of large numbers, because this kind of instruction makes the answer grow so fast that most hand calculators cannot handle numbers above about 12!. This makes it tricky to manipulate factorials. However, if we remember that we can cancel out expressions in any quotient, then life is very much simplified, as in the following example:

8!/5! = (8 × 7 × 6 × 5 × 4 × 3 × 2 × 1) / (5 × 4 × 3 × 2 × 1).

Here we can cancel the 5! = 5 × 4 × 3 × 2 × 1 above and below within this expression, to be left with

8!/5! = 8 × 7 × 6 = 336.

Not a big deal! Therefore, do all the necessary cancellation and then multiply what is left of the factorial quotient. Addition and subtraction of factorials are rarely required in this course.
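As an illustrative aside (not part of the original module), the factorial rules above can be checked with Python's standard math module; the cancellation result 8!/5! = 336 worked by hand above is confirmed directly.

```python
from math import factorial

print(factorial(4))                     # 24, i.e. 4 x 3 x 2 x 1

# Quotients of factorials are best simplified by cancellation;
# the computer confirms the hand result 8!/5! = 8 x 7 x 6 = 336.
print(factorial(8) // factorial(5))     # 336
print(8 * 7 * 6)                        # 336

# Factorials grow explosively, which is why hand calculators give up early.
print(factorial(12))                    # 479001600
print(factorial(20))                    # 2432902008176640000
```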
Graphic Presentation of Data
After any field survey, data will need to be arranged and analyzed. This is because there are too many data entries for the observer to make any sense out of the field collection. Graphic presentation of data can be resorted to so that the investigator can make sense of the data. This is a section of a broad area of statistics which can be called descriptive statistics. Most of us are familiar with data in newspapers, periodicals, government reports, etc., which is illustrated by means of colorful bar-graphs, pie charts, wind-roses, proportional balls, etc. For a closer look at the wide use of this technique, learners can consult Arthur Robinson, Randall Sale, and Joel Morrison's Elements of Cartography, which is cited at the end of this chapter, or any other text. Figure 1 - 2 is an example of two cases where graphic presentation of data has been found necessary. There are many others illustrated in the cited textbooks (King'oriah, 2004) and in other materials. Frequency distributions, cumulative frequency distributions, histograms, ogives, frequency polygons (and others) are good examples. These are easy to learn and draw, as long as you make sure you have a good textbook at your disposal. Readers are requested to read and learn from these examples, and then participate in the activity prescribed below.

Activity
Given the following data, comprising the heights in millimeters of 105 maize plants after two weeks of growth:

129 148 139 141 150 148 138 141 140 146 153 141 148 138 145 141 141 142 141 141 143 140 138 138 145 141 142 131 142 141 140 143 144 145 134 139 148 137 146 121 148 136 141 140 147 146 144 142 136 137 140 143 148 140 136 146 143 143 145 142 138 148 143 144 139 141 143 137 144 144 146 143 158 149 136 148 134 138 145 144 139 138 143 141 145 141 139 140 140 142 133 139 149 139 142 145 132 146 140 140 132 145 145 142 149

Figure 1 - 2: An Example of a Histogram and a Pie Chart. The upper part of the diagram is a histogram, and the lower one is a Pie Chart. (Source: King'oriah 2004, and World Bank 1980.)

Draw a table of the distribution of different classes of heights of maize plants in millimeters after two weeks of growth. Identify the distribution of heights of these classes by grouping the data into 10 classes of equal ranges and equal class intervals. In this regard, you must draw diagrams of appropriately crossed tallies to indicate how you obtained the groups of your choice. Show the upper and the lower class limits of each data interval. Identify the class marks of each data group or data interval. Draw a histogram of the same data, a frequency polygon and an ogive. All this work must show your clear understanding of the terms in italics. Learners are expected to consult the given textbooks (King'oriah, 2004) and to submit the work within the period specified by your instructors.
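The grouping exercise above is meant to be done by hand with tally marks, but the same idea can be sketched in a few lines of Python. The sketch below is offered only as an illustration: it groups the first fifteen observations of the activity data (as listed above) into classes of width 5 mm. Both the class width of 5 and the use of only fifteen values are choices made here for brevity; they are not part of the activity, which asks for 10 classes over all 105 values.

```python
from collections import Counter

# First fifteen observations from the activity data (heights in mm).
heights = [129, 148, 139, 141, 150, 148, 138, 141, 140, 146,
           153, 141, 148, 138, 145]

width = 5                                   # chosen class interval (mm)
low = min(heights)                          # lower limit of the first class

# Assign each observation to the lower class limit of its interval.
classes = Counter(low + ((h - low) // width) * width for h in heights)

for lower in sorted(classes):
    upper = lower + width - 1
    print(f"{lower}-{upper} mm : {classes[lower]} plants")
# Prints frequencies 1, 2, 5, 5, 2 for the five classes 129-133 ... 149-153.
```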
Measures of Central Tendency and Dispersion

Representative Measures
The frequency distribution and its graphical relatives can provide considerable insight into the nature of the distribution of any set of data, as you have discovered in the above activity. One can quickly tell the form and shape of any distribution. It is also possible to represent this information through results obtained from the computation of numerical measures like averages and other quantities which measure the spread or the clustering of data and observations. The first kind of these measures is called representative measures. They are designed to represent all the observations in either a sample or a population by attempting to locate the center (or middle) value of any given distribution. Another way of thinking about these representative measures is that they attempt to measure or locate the central tendency of the characteristics of a given range of observations. In that regard, they are called measures of central tendency. We consider each of these measures in the presentation given hereunder, beginning with the arithmetic mean.

The Arithmetic Mean
In everyday colloquial language this measure is known as the average. It is a very important measure in statistics because, arguably, the whole discipline of statistics is based on the manipulation of the mean in one way or another, as we shall see in this module. Statisticians represent this measure using the Greek letter "μ" for the mean of the population, and the Roman capital letter "X" with a bar on top for the sample mean. The latter is usually called "X-bar" and is written $\bar{X}$. Note carefully that the symbol is in capital-letter form, because the small "x" means something else: it denotes an error or a deviation from a trend line or surface, as we shall see later when we consider the standard deviation, and regression/correlation and related materials. Therefore learners must show the capital-letter form of this symbol in all their presentations, particularly in examinations, to avoid being ambiguous or being misunderstood by the examiners.

Nearly everybody knows how to compute the average. However, in the summation notation that we have just learned, when you compute the average (or mean) this is what you do:

$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$ for the population mean.

Notice that we are using the capital "N" to indicate that we are considering the universe, or everything that we have to consider. The analogous notation for the sample mean is:

$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$ for the sample mean.

Here we are using the lower-case "n" to indicate that we are using only data sampled from the larger population whose number is "N". In statistics we observe strict symbolic discipline, especially when presenting data in longhand writing, so that we may not be mistaken.

Let us try a simple example comprising five observations belonging to a sample, presented in the following manner:

X1 = 10, X2 = 15, X3 = 6, X4 = 12, and X5 = 11.

Now let us look for the sample mean. Using our symbolism we see that:

$\bar{X} = \frac{\sum_{i=1}^{5} X_i}{5} = \frac{10 + 15 + 6 + 12 + 11}{5} = \frac{54}{5} = 10.8$

You can do the same with a large sample of more than 30 observations, because this is where we begin assuming the universe, or a large collection of data, which we can classify as the population. For more general characteristics of the mean, see King'oriah (2004) or any other standard textbook on statistics, especially the ones given in the references at the end of this chapter.
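As a quick check on the worked example, a couple of lines of Python reproduce the sample mean of 10.8. This is only a side illustration; the statistics module used here is part of the Python standard library, not something prescribed by the module.

```python
import statistics

sample = [10, 15, 6, 12, 11]          # the five observations X1 ... X5

print(sum(sample) / len(sample))      # 10.8, computed from the definition
print(statistics.mean(sample))        # 10.8, using the library routine
```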
The Median
This is an important measure in statistics, especially when the data is such that the mean cannot conveniently represent the given population. In some cases the end observations in the data array used to investigate the central tendency are so large that they tend to pull the value of the mean either way. Imagine a statistician who is looking for the mean rent of all dwellings in a certain neighborhood in your town or city. No doubt some houses in the same locality are very expensive, and others are very cheap. The mean would be pulled upwards by the expensive houses, and would be a poor representative of the central tendency in this case. The median therefore comes to our rescue, because it reports only the middle observation of the data array which is given. In so doing, the median is an indication of the rough position of the mean under ordinary circumstances, when the data is not being pulled upwards by expensive units.

Now, given the same data as we had for the computation of the mean, we may wish to determine the median:

X1 = 10, X2 = 15, X3 = 6, X4 = 12, and X5 = 11.

The median is determined most easily if we arrange the data in an orderly manner, from the smallest observation to the largest, as follows:

6, 10, 11, 12, 15

In this case the median of the data, which we denote as "Md", is the middle value of the array, which is clearly 11. The middle value is Md = 11. You can try many more examples on your own, especially data with a large observation at one end which is not balanced by an equally large observation at the opposite end. Unlike the mean, the median is not easily manipulated mathematically. However, it is a useful measure to indicate the middle point of the range of any data. Generally, the mean gives more information because of the quality of the observations used in its computation: when the mean is computed, the entire range of data is included. Therefore the mean is a more representative measure than the median.

The Mode
The interpretation of the statistical mode is analogous to that given to the same word as it is used in the fashion industry. Any person dressed in current style is said to be trendy, or "in mode", meaning "fashionable", or wearing what everybody would like to wear according to the current fashion trends. The mode is the easiest of the measures of central tendency, because all that needs to be done to obtain it is to observe which observation repeats itself more times than all the other observations. Therefore, given the following array of simple data:

X1 = 5, X2 = 6, X3 = 5, X4 = 7, and X5 = 4,

clearly the mode of this data is 5, because that number is repeated twice. In view of this we may expect that some data can have no mode, in the case where there is no repeating observation. At the other extreme, data could be multi-modal, where more than two observations repeat themselves several times. Bi-modal data is data which has two figures repeating themselves. Therefore, all we need to do is to observe the nature of the data array after we have arranged it from the smallest to the largest observation.
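The two small examples above can be verified with the standard statistics module. This is only a side illustration in Python, mirroring the hand working.

```python
import statistics

observations = [10, 15, 6, 12, 11]        # the same five observations as before
print(sorted(observations))               # [6, 10, 11, 12, 15]
print(statistics.median(observations))    # 11, the middle value Md

repeats = [5, 6, 5, 7, 4]                 # the mode example
print(statistics.mode(repeats))           # 5, the most frequently occurring value
print(statistics.multimode(repeats))      # [5]; a multi-modal array would list several values
```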
The Proportion
In some areas of scientific observation it becomes necessary to deal with qualities rather than quantities. Imagine a biological experiment where color is involved. Clearly, you cannot measure color in numerical terms. All we can say is that a certain proportion of our data is of a certain color. The percentage and the ratio mean the same thing, only that they are arithmetic transformations of the same measure. Mathematically, the proportion is of the same family as the mean, and in statistics it is treated as a kind of mean. The proportion is the ratio of the number of observations with the desired characteristic to the total number of observations. For the population, the proportion is represented by the (small) Greek letter "π", and for the sample the lower-case letter "p" will do. To calculate the proportion we have the following procedure for sample data:

p = (number of observations in the category) / (sample size "n"),

and for the population data we have:

π = (number of observations in the category) / (population size "N").

Learners must read more about these measures in the prescribed textbooks to be able to answer the set questions at the end of this chapter.

Measures of Variability or Dispersion
The concept of variability is very important in statistics. For example, in production management an area of major concern is the variability of the quality of the product being produced. In biostatistics, one may wish to know the variability of a crucial variable like the diameter of the stems of certain plants of interest to the investigator. It is in these instances, and many more, that we are interested in investigating how dispersed the data under our control is. Our academic and research colleagues, and even the common man, are usually interested in data dispersion in order to investigate aspects of uniformity as well. We begin with the simplest measure, the range.

The Range
The range can be described as the difference in magnitude between the largest value and the smallest value in any data array. It measures the total spread in the set of data. It is computed by taking the largest observation and subtracting the value of the smallest observation. Depending on our data analysis requirements, there are several related measures. One may divide the data into quartiles. This means determining the end of the lowest quarter, the half mark, the end of the third quarter and the beginning of the fourth quarter. Each portion of the data is a quartile. The first quartile runs from 0% to 25%, the second quartile from 25% to 50%, the third quartile from 50% to 75%, and the final quartile from 75% to 100%. When data is arranged this way we may be interested in the inter-quartile range, which describes the middle 50% of the data, in an attempt to determine how dispersed or how centralized the data is. Learners should know how to compute this range. Generally, subtract the first quartile from the third quartile to obtain the following general formula for the inter-quartile range:

Inter-quartile range = Q3 - Q1.

This measure considers the spread within the middle 50% of the data (or the mid-spread) and therefore is not influenced by extreme values. Using an ordered array of data, the inter-quartile range is computed in the following manner. Given the following hypothetical data, ordered as in the following array:

-6.1, -2.8, -1.2, -0.7, 4.3, 4.5, 5.9, 6.5, 7.6, 8.3, 9.6, 9.8, 12.9, 13.1, 18.5

To find the inter-quartile range we find the mark of the first quartile and the third quartile. In this case the first quartile mark is -0.7, and the third quartile mark is 9.8. These can be designated as:

Q1 = -0.7 and Q3 = 9.8.

Therefore, the inter-quartile range is computed as follows:

Inter-quartile range = 9.8 - (-0.7) = 10.5.

This interval, usually called the middle fifty (%), examines the nature of the middle data and supports various conclusions about the spread of the data. If this data were representing some rate of return from a bond, the larger the spread the greater the risk, because the bond can fluctuate wildly during the trading period and cause losses of money. Small inter-quartile ranges measure consistency. This implies the consistent maintenance of the value of the bond, and therefore makes the bond more attractive. It means that there are no extreme outliers influencing the value of the bond from either the lowest side or the highest side. In modern computer-assisted packages the inter-quartile range is always included as a measure to show the strength of the mean as a measure of central tendency. A small inter-quartile range is always an indicator that data is strongly centralized and closely clustered around the population mean.
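Quartile conventions differ slightly between packages, but the position-based rule used above (Q1 at the 4th and Q3 at the 12th of the fifteen ordered values) is reproduced by the standard statistics.quantiles routine with its default settings. The short Python sketch below is given only as an illustration; it recomputes the inter-quartile range of the hypothetical bond data and also shows a simple proportion, the share of positive values in the same array (a quantity chosen here for illustration, not taken from the module).

```python
import statistics

returns = [-6.1, -2.8, -1.2, -0.7, 4.3, 4.5, 5.9, 6.5,
           7.6, 8.3, 9.6, 9.8, 12.9, 13.1, 18.5]

q1, q2, q3 = statistics.quantiles(returns, n=4)   # quartile marks
print(q1, q3)                                     # -0.7 and 9.8
print(round(q3 - q1, 1))                          # 10.5, the inter-quartile range

# A simple proportion: the share of observations that are positive.
p = sum(1 for r in returns if r > 0) / len(returns)
print(round(p, 3))                                # 0.733
```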
This implies consistent maintenance of the value of the bond, and therefore makes the bond more attractive. It means that there are no extreme outliers pulling the value of the bond from either the lowest side or the highest side. In modern computer packages the inter-quartile range is usually included as a measure of the strength of the mean as a measure of central tendency. A small inter-quartile range is always an indicator that the data is strongly centralized and closely clustered around the population mean.

The Variance and Standard Deviation

Mean Absolute Deviation

The range does not tell us very much about the characteristics of observations lying within the interval which contains the smallest and the largest observation. However, given the characteristics of the observations at the exact center of the data distribution, and the range defining the extent of both ends, an observer should be more enlightened with respect to the nature of the subject data. We have already seen the use of the inter-quartile range in measuring the spread in the set of data around the mean value. More information is required regarding how varied the observations are compared to the mean size.

Historically, statisticians tried to look at the deviations of the data from the mean and to find the average deviation from the mean. They immediately found that the deviations above the mean balance the deviations below the mean: the negative deviations equal the positive ones, so any quest for an average deviation results in a zero value unless we deal with absolute figures. This led to the use of absolute values (in which the negative and positive signs are ignored during addition), and hence to a measure called the Mean Absolute Deviation (MAD). Although this measure tells us something similar to the information obtained from the inter-quartile range, it is not very useful for advanced statistical analysis. It is a weak measure, and one cannot deduce very much from its use. See King'oriah (2004). Consider the following data on pineapple sizes from a 6-hectare plot.

TABLE 1 - 1: AN ARRAY OF PINEAPPLE SIZES PICKED FROM DIFFERENT PARTS OF A 6-HECTARE PLOT (Source: King'oriah, 2004)

Pineapple number   Diameter (cm)   Deviation (Xi − X̄)   Absolute deviation
1                  13               0.14                 0.14
2                  12              −0.86                 0.86
3                  15               2.14                 2.14
4                  11              −1.86                 1.86
5                  14               1.14                 1.14
6                   9              −3.86                 3.86
7                  16               3.14                 3.14
Total              90              not meaningful       13.14

The mean diameter is X̄ = 90/7 = 12.86 cm. From the table the sum of absolute deviations is 13.14, so the mean absolute deviation is:

MAD = Σ|Xi − X̄| / 7 = 13.14 / 7 ≈ 1.88

Big values of MAD imply that the data is dispersed about the mean value, and small values of MAD mean that the variability of the data about the mean value is small. The farmer of these pineapples could compare the MAD of his sample pineapples with that of other farmers, to gauge the consistency of the method of pineapple breeding practised on his farm. A large MAD implies great variation in quality from one farm, whereas a small MAD implies consistency in the quality and size of the pineapples from this farm.

Mean Squared Deviation (Variance) "σ²"

As we have seen, dealing with absolute deviations is not easy in mathematics. However, if we square the deviations we eliminate the signs, and specifically the negative signs.
The sum of squared deviations, when divided by the number of observations, gives the Mean Squared Deviation, or the Variance, denoted "σ²". Expressed this way, the symbol denotes the population variance; the sample variance is written "s²". This is not surprising, since σ (sigma) is the Greek form of s. Statisticians chose the Greek letter for the population, just as they chose the Greek letter μ for the population mean, and the Roman letter s² therefore took the denomination of the sample. We can now use our example to demonstrate the computation of the mean squared deviation, using the same pineapple width data set out in the modified Table 1 - 2. The algorithm for the computation of the variance is:

\[ s^2 = \frac{\sum_{i=1}^{7}(X_i - \bar{X})^2}{7} = \frac{34.8572}{7} \approx 4.98 \]

In this connection, the farmer can begin telling his friends that the variance of his pineapples is a tangible figure, found using the steps outlined in Table 1 - 2: that is, s² ≈ 4.98.

TABLE 1 - 2: STEPS IN THE COMPUTATION OF VARIANCE (Source: King'oriah, 2004)

Diameter (cm)   Deviation (Xi − X̄)   Squared deviation (Xi − X̄)²
13               0.14                 0.0196
12              −0.86                 0.7396
15               2.14                 4.5796
11              −1.86                 3.4596
14               1.14                 1.2996
 9              −3.86                14.8996
16               3.14                 9.8596
Total: 90       not meaningful       34.8572

(The mean diameter X̄ remains 12.86 cm.)

But then he would be talking of a measure given in terms of centimetres squared! This is the main weakness of using the variance, because no one would like to present his figures in squared units. Picture somebody saying "goats squared", "metres squared", "kilograms squared", and so on.

The Standard Deviation

Since the use of the variance arose from the need to shed the signs (especially the negative sign) from the deviations, statisticians found themselves in a very lucky position: merely taking the square root of the variance gives a result without any negative sign. The original signs which existed when looking for deviations no longer recur in this case. The resulting measure is called the Standard Deviation, and it has no negative sign either in front of it or behind it! Its computation algorithm is represented in the following manner:

\[ s = \sqrt{\frac{\sum_{i=1}^{7}(X_i - \bar{X})^2}{7}} = \sqrt{\frac{34.8572}{7}} \approx 2.23 \text{ cm} \]

This time the farmer can report to his compatriots a measure that is easily understandable. Large standard deviations mean that the data is more dispersed about its mean than other data with smaller standard deviations; and this time the measure is in actual units. We have now computed the sample standard deviation. The computation of the population standard deviation is done in the same manner. The representative symbol for the population standard deviation is the Greek letter "σ", because it is the square root of "σ²". We shall deal with the computation of the sample standard deviation later, in greater and more rigorous detail, especially when we consider degrees of freedom; for now this is sufficient information to show the derivation of these important measures. The symbolism denoting the population standard deviation is:

\[ \sigma = \sqrt{\frac{\sum_{i=1}^{N}(X_i - \mu)^2}{N}} \]

Obviously, the method of computation of the population standard deviation is identical to the steps we have shown above. In the modern digital age the computer program gives the standard deviation as a matter of course. The learner should therefore know that the machine uses the same method as we have used above.
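As a quick arithmetic check on Tables 1 - 1 and 1 - 2, here is a minimal Python sketch using the standard statistics module. Like the text at this stage, it divides by n rather than n − 1:

```python
import statistics

diameters = [13, 12, 15, 11, 14, 9, 16]            # pineapple diameters in cm
mean = statistics.mean(diameters)                    # about 12.86 cm

# Mean absolute deviation
mad = sum(abs(x - mean) for x in diameters) / len(diameters)
print(round(mad, 2))                                 # about 1.88

# Variance and standard deviation with divisor n
print(round(statistics.pvariance(diameters), 2))     # about 4.98
print(round(statistics.pstdev(diameters), 2))        # about 2.23 cm
```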
Degrees of Freedom

When using the sample standard deviation, as compared to that of the whole population, the formula is not as straightforward as the one discussed above. Statisticians always make the denominator of the sample standard deviation smaller, by adjusting the usual formula (which we have derived above) by one degree of freedom. If the population standard deviation is denoted by:

\[ \sigma = \sqrt{\frac{\sum_{i=1}^{N}(X_i - \mu)^2}{N}} \]

then the sample standard deviation is, by convention, represented by:

\[ s = \sqrt{\frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n - 1}} \]

(This adjustment is also valid for the expression used to denote the variance, before the radical sign is placed over the formula to obtain the standard deviation.) The smaller denominator in the sample standard deviation expression is said to be adjusted by one degree of freedom. The sample denominator is made smaller than the population denominator by one (1.0) to provide some mathematical justification for using this expression, despite the few data involved in the computation of the sample standard deviation. The argument is that, since the deviations from the mean of a stretch of data must balance out, one is free to choose any value along that stretch except the last one; in that manipulation the investigator loses one degree of freedom. This adjustment is applied to sample data because we cannot claim that a sample satisfies the requirement of the large numbers of observations available in a population - hence the form of the formula. (See King'oriah 2004.) We shall be dealing with degrees of freedom as we proceed with our work on other statistical measures; a brief numerical illustration is also given after the exercises below.

EXERCISES

1. Outline four meanings of the word Statistics, differentiating each meaning from all the others mentioned in your discussion.
2. Explain the meanings of the following terms: Science, Scientific Method, Ethics in Research, Natural Sciences, Social Sciences, and Behavioral Sciences.
3. Explain how it is possible to misrepresent the real-world situation using facts and figures, charts and diagrams (cheating through the use of statistics). In this regard, what is the place of ethical behavior in Statistics, and how can you avoid this temptation?
4. Briefly describe the following terms: population, independent variable, running index, nominal data, ordinal data, interval data, and ratio data.
5. Explain what you understand by the word sampling. Discuss various methods that are available for carrying out sampling exercises.
6. The following figures represent the number of trips made by the lecturers within the Faculty of Agriculture in your local university to inspect and supervise field experiments done by the students of Agriculture:
29 6 10 11 13 13 16 11 23 19 19 17 21 39 25 22 9 16 18 13 18 3 9
(a) Construct a frequency distribution with five classes for this data. Give the relative frequencies, and construct a histogram from the frequency distribution.
(b) Compute the mean, the median, the range, and the variance for this data.
7. If there are 13,270 female chickens and 12,914 male chickens in a chicken-rearing location, what is the proportion of either type of chicken to the total number of chickens in this area? Read the prescribed textbooks and explain why the proportion is treated as a kind of mean in statistical analysis.
8. Explain what you understand by the term degrees of freedom, and how this quantity is used in the computation of standard deviations.
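Before leaving this chapter, here is a minimal Python sketch of the degrees-of-freedom adjustment just described, again using the pineapple diameters. The standard library's stdev divides by n − 1, while pstdev divides by n:

```python
import statistics

diameters = [13, 12, 15, 11, 14, 9, 16]

# Divisor n: treat the seven pineapples as a complete population
print(round(statistics.pstdev(diameters), 2))   # about 2.23

# Divisor n - 1: treat the seven pineapples as a sample,
# losing one degree of freedom
print(round(statistics.stdev(diameters), 2))    # about 2.41
```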
CHAPTER TWO

BASIC PROBABILITY THEORY

Introduction

We have already touched on the definition of statistics, and seen how the tool of statistics is used to determine the chances of occurrence of phenomena. We need to mention here that there are two aspects of the concept of probability. First there is the idea of a priori probability. This term refers to the quantification and determination of the probability of occurrence of an event from the circumstances of the given facts, without going to the trouble of rigorous testing. For example, if we are asked for the probability of obtaining heads when tossing an evenly balanced coin, we may straight away say "50%". This is what happens every day. "What are the chances that it will rain today?" Kamau may ask Kaberia. Kaberia may answer, "Fifty-fifty." Then Kamau would know that he has a 50% chance of a rain shower during the day.

However, this is not technically accurate. Ideally, the concept of probability applies over many trials: as the trials are repeated and the number of trials gets bigger and bigger, the ratio of successful outcomes to the total number of trials approaches the probability of interest. In this connection, we need to understand two things. Firstly, probability is expressed as a fraction of 1.0, and it defines the fraction out of 1.0 with which a favorable event will occur. Secondly, the concept of probability is a limiting concept: it is the limiting ratio of the relative frequency of favorable outcomes to the total number of trials, as the number of trials approaches infinity. Thus, whereas we can say that the probability P that an event A will occur is

P(A) = r / n

where r is the number of favorable outcomes, n is the number of trials and A is the favorable event, the most accurate way of presenting this statement is:

\[ P(A) = \lim_{n \to \infty} \frac{r}{n} \]

This statement means that the probability of an event A happening is the limit, as the number of trials approaches infinity, of the ratio r/n. Any ratio of this kind can be regarded as the probability of occurrence of an event A only if we can be sure that after many trials (whose number approaches infinity) the ratio settles at what we have cited as the probability ratio r/n. The implication of this is that a trained statistician cannot afford to be careless with the term probability. He or she must ascertain that the ratio given is a true representation of the actual probability as the number of trials approaches infinity.

In addition, there can never be a negative probability. Since both r and n are counts, the smallest possible value of the ratio r/n is zero, which occurs when no favorable event is ever observed. The measure of probability therefore has two extremes. The first extreme is when the number of successes equals the number of trials: every time one tries, one obtains a favorable event, so the ratio - which we call the relative frequency - is r/n = n/n = 1.0, no matter how many trials are done. The second extreme is when a success never occurs, no matter how many trials are done; this time the relative frequency is zero. The probability measure therefore ranges between zero and 1.0.
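The limiting behavior of the ratio r/n can be illustrated with a small simulation. The sketch below (plain Python, with hypothetical trial counts) tosses a fair coin n times and prints the relative frequency of heads, which drifts towards 0.5 as n grows:

```python
import random

random.seed(1)  # fixed seed so the illustration is repeatable

for n in (10, 100, 10_000, 1_000_000):
    heads = sum(1 for _ in range(n) if random.random() < 0.5)
    # relative frequency r/n approaches the probability 0.5 as n grows
    print(n, heads / n)
```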
Events which are more likely to occur have high probabilities, above 0.5, and those which are less likely to occur have low probabilities, between zero and 0.5. Obviously, Kaberia's statement above represents the "fifty-fifty" level of probability of occurrence, which is 0.5.

Some definitions

Any two events are said to be mutually exclusive if they cannot occur together during the same trial or in the same experiment. For example, there are only two possible outcomes in a coin toss - heads or tails. The nature of the experiment of flipping a coin is such that a head and a tail cannot occur together in any single toss of a coin; a head and a tail are therefore mutually exclusive in this experiment.

Events are described as collectively exhaustive if, taken together, they comprise all possibilities in a situation or an experiment. In a deck of cards, for example, every card is printed either in a red suit or in a black suit. To describe a card drawn from a well-shuffled deck as "red suit or black suit" is collectively exhaustive, because one is certain that one of these events will definitely occur, no matter which card is picked; adding "spade" to the list changes nothing, because every spade already falls within the black-suit category.

Any two events are said to be independent when the chances of one event occurring are not influenced by the happening of the other event. When two people with no influence on one another each toss a coin, the outcome of one toss is not affected by the outcome of the other person's toss.

When we consider any one experiment, either an event will occur or it will not occur. The occurrence of an event is considered a success, and the non-occurrence of an event a failure, in statistical terms. The probability of success is said to be the complement of the probability of failure; likewise, the probability of failure is the complement of the probability of success. An event and its complement are collectively exhaustive: together they account for all that can happen with this kind of event, and therefore their probabilities total 1.0.

Conditional probability is defined as the probability that an event will occur, given that another event has occurred. For example, in a deck of 52 cards there are 13 spades. The probability of obtaining a spade after shuffling the cards is no doubt 13/52 = 0.25. However, if we remove and set aside the spade after we have obtained it, we have a different result, because the probability of obtaining a spade on condition that another spade has already been drawn is 12/51. Thus we can say:

P(A) = P(spade) = 13/52 = 0.25

Then, in symbolic terms, we can state the second condition - the conditional probability, conditional on the happening of the first event - as follows:

P(B | A) = P(spade | a spade has been drawn) = 12/51 ≈ 0.2353

The vertical slash symbol "|" indicates "on condition that". Learners can therefore read the last expression conveniently and compute conditional probabilities. If in any difficulty, refer to the recommended texts (King'oriah, 2004).

Simple Rules of Probability

Ordinarily, the researcher does not compute probabilities himself. He usually makes use of various probability tables and charts which have been prepared by statisticians over time.
The basic mathematical rules of probability described hereunder are meant to give some insight into how probabilities were conceptualized and developed, giving rise to the various kinds of tables and distributions which we shall consider later in this module, and which are given in the recommended textbooks. The order of discussing these rules is not the one we might expect in ordinary mathematics - beginning with addition, then subtraction, and ending with the other operative functions. Our discussion begins with the multiplication rule, because this is considered the easiest route into these rules; all the rules are then summarized in the expected order at the end of the section.

The Multiplication rule

If A and B are independent events, the probability of the joint occurrence of these two events is equal to the product of the probabilities of the occurrence of each of them. Since two coin flips are independent events, the probability of obtaining a head on the first coin and then on the second coin is the product of the probabilities of each of these events. Thus we can symbolically represent this phenomenon by means of the following statement:

P(H1 and H2) = P(H1) × P(H2) = 0.5 × 0.5 = 0.25

This rule can be extended to any number of events, as long as they are all independent of one another. Thus the probability of the joint occurrence of five heads in five coin tosses can be calculated in the following manner:

P(H1 and H2 and H3 and H4 and H5) = P(H1) × P(H2) × P(H3) × P(H4) × P(H5)
= 0.5 × 0.5 × 0.5 × 0.5 × 0.5 = 0.03125

Notice that the effect of multiplication reduces the size of the fraction. This is true in mathematics and also true conceptually, because of the difficulty of such a lucky coincidence occurring in real life. Each event is difficult to achieve, and the succession of all the other events is equally difficult. Therefore the chances that one could be so lucky become slimmer and slimmer as the number of required events increases.

When the outcome of an event A affects the outcome of some other event B, the probability of the joint occurrence of A and B depends on the conditional probability of the occurrence of B, given that A has already occurred. This means that:

P(A and B) = P(A) × P(B | A)

For example, in a deck of cards one may consider the probability of drawing a spade on two consecutive draws without replacing the first spade. This means that:

P(spade on two consecutive draws) = P(spade on first draw) × P(spade on second draw | spade on first draw)

Using our actual figures we have:

P(first draw) = 13/52 = 0.25 (call this event A)
P(second draw | first draw) = 12/51 ≈ 0.2353 (call this event B)

Therefore:

P(A and B) = P(A) × P(B | A) = 0.25 × 0.2353 ≈ 0.0588

Here we see that the chances of the two events happening together have been considerably reduced. This is because there are possibilities that we may miss either event, or both of them completely.

Addition Rule

In situations where one is faced with alternative events, each one of which counts as a success, the chances of success increase because the investigator is satisfied by any of the successes. Therefore the number of ways in which he could be successful is the sum of all the ways in which each of the favorable outcomes may occur. The symbolism for this process is as follows:

P(A or B) = P(A) + P(B)

The equation is additive, meaning that success is recorded when A happens, and equal success is recorded when B happens.
The addition rule of computing probabilities is therefore applicable to alternative events of this kind. When events are not mutually exclusive some outcomes overlap, and there is a chance of both events occurring together. This is best illustrated using circles on a plane, which either stand apart or overlap, depending on whether there is interaction between the phenomena represented by the circles. Figure 1 - 3 is an illustration of these circles. This visual technique of showing interaction or non-interaction using circles uses Venn Diagrams, after their inventor, John Venn, who lived between 1834 and 1923.

Figure 1 - 3: Venn Diagrams

In the upper part of the diagram, the two circles do not touch or overlap. This means there is no interaction between the two phenomena represented by the two circles. In the lower part of Figure 1 - 3, the circles overlap to create an area covered by both A and B: there is some interaction between the phenomena represented by circles A and B over the shaded area. In this case the addition rule still applies, but with a subtraction term (the negative form of addition), so that we cut off the double-counted effect of the interaction over the shaded area in order to obtain the probability of A or B.

Consider a deck of well-shuffled cards. There are four aces in the deck: the ace of spades, the ace of clubs, the ace of hearts and the ace of diamonds. Suppose we were interested in the probability of obtaining an ace or a heart. Here the two events interact, because one card - the ace of hearts - is both an ace and a heart. We are faced with an additive situation in the following manner:

P(A) = the probability of obtaining an ace;
P(B) = the probability of obtaining a heart;
P(A and B) = the probability of obtaining the ace of hearts, the card where the two events interact.

There are four aces and thirteen hearts in a deck of 52 cards. Therefore:

P(A) = 4/52 and P(B) = 13/52

There is only one card, the ace of hearts, which forms the area of interaction. In this regard the probability of the area of interaction is:

P(A and B) = 1/52

Thereafter, we subtract out the interaction space, in order to obtain the probability of getting A or B without double counting:

P(A or B) = P(A) + P(B) − P(A and B)

In numerical terms, the whole of this exercise becomes:

P(A or B) = 4/52 + 13/52 − 1/52 = 0.0769 + 0.25 − 0.0192 ≈ 0.3077

Now, if we summarize the rules that we have learned, and arrange them in the way the arithmetic operation signs manifest themselves - addition, subtraction, multiplication, and so on - we find that we have four rules of probability for the time being.

1. Addition rule for mutually exclusive events:
P(A or B) = P(A) + P(B)

2. Addition rule for events which have a chance of occurring together, and therefore are not mutually exclusive:
P(A or B) = P(A) + P(B) − P(A and B)

3. Multiplication rule for dependent events, where the happening of one event influences the chances of the happening of subsequent events:
P(A and B) = P(A) × P(B | A)

4. Multiplication rule for independent events, where the happening of one event has no influence on the happening of subsequent events:
P(A and B) = P(A) × P(B)

For a more exhaustive elaboration of these rules of probability, learners are advised to consult the prescribed textbooks.
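A minimal Python sketch of the four rules, using the coin and card figures from this section (nothing beyond ordinary arithmetic is needed):

```python
# Rule 4: independent events - five heads in five tosses
p_five_heads = 0.5 ** 5
print(p_five_heads)                       # 0.03125

# Rule 3: dependent events - two spades in two draws without replacement
p_two_spades = (13 / 52) * (12 / 51)
print(round(p_two_spades, 4))             # about 0.0588

# Rule 1: mutually exclusive events - an ace or a king in one draw
p_ace_or_king = 4 / 52 + 4 / 52
print(round(p_ace_or_king, 4))            # about 0.1538

# Rule 2: overlapping events - an ace or a heart in one draw
p_ace_or_heart = 4 / 52 + 13 / 52 - 1 / 52
print(round(p_ace_or_heart, 4))           # about 0.3077
```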
The rationale for the application of each of these rules must be noted, together with the many examples illustrating the use of each rule. It is only after doing this that learners can successfully attempt the problems at the end of the chapter (and any other set problems) for evaluation by instructors.

Counting Techniques

There is a reverse way of looking at probabilities. Instead of asking what probability there is that an isolated event may occur, one is interested in the number of alternative ways a set of events may turn out. Of course, if the total number n of alternative ways that an event can turn out is known, then the probability that any one of those ways turns out is 1/n. Consider the twenty-six letters of the alphabet. One may be interested in the number of alternative ways of arranging them. Computation shows that there are 26! ≈ 4 × 10^26 ways of arranging these letters - roughly 400 trillion trillion arrangements. The order we have memorized since we were small is just one of these, so the probability that this particular order is selected at random is 1/26!, a vanishingly small number. Such problems, and many more, are the concern of the counting techniques of statistics.

One method of approaching problems similar to the simpler types of this example is by means of decision trees. This involves dividing the problem into its events (or stages) and drawing the various interconnections between them. Suppose, for example, there is one car-assembly production line that produces cars of three engine capacities: 1100 cc, 1500 cc and 2000 cc. Suppose further that within each engine capacity there are three colour configurations - yellow, blue and cream - and three body-build alternatives: saloon, pick-up and station-wagon. One may then ask what the probability is of obtaining one particular car which is 1500 cc, yellow in colour and a saloon. This is a typical simple problem which can be solved by means of a decision tree, as shown in Figure 1 - 4.

Figure 1 - 4: A Decision Tree

Physically, one can count the ultimate ends of the decision tree and find that there are 27 alternatives available. We may conclude that the probability of obtaining one trail of alternatives - say 1500 cc, then yellow, then saloon - is exactly 1/27, because there are these 27 alternatives. Decision trees are convenient only for evaluating simple sets of alternatives like the one given. The more stages of alternatives there are, the more prohibitively complicated such family-tree diagrams become, and this is where alternative counting techniques are necessary. It is these counting techniques, together with appropriate computer programs, that enable us to obtain all the possible arrangements of the 26 letters of the alphabet. These are the methods to which we briefly allude in the discussion that follows. Learners are advised to read more within the prescribed textbooks.

Permutations

This is one of the techniques available which can save us from the pain of always drawing decision trees. It answers problems which arise from the desire to know how many arrangements of n objects, taken r at a time, are possible. In permutations the order of arrangement is important. Therefore we have the following formula for the algorithm used to compute permutations:

\[ {}^{n}P_{r} = \frac{n!}{(n-r)!} \]

The formula reads: the permutation of n objects, taken r at a time, is equal to the ratio of n! to the difference (n − r)!
In this formula, n is the total number of objects. The symbol r represents the equal groups of objects that are handled each time to effect the required arrangement. P is the permutation - another name for the number of alternative arrangements.

Example
Find the number of all possible arrangements of five objects, taking three at a time, where the objects must be arranged in a sequential manner.

Solution
Here a specific arrangement (order) is prescribed. This means that we must use the permutation computational algorithm to solve this problem. Therefore we use the formula

\[ {}^{n}P_{r} = \frac{n!}{(n-r)!} \]

and insert the figures given in the problem. Accordingly we have:

\[ {}^{5}P_{3} = \frac{5!}{(5-3)!} = \frac{5 \times 4 \times 3 \times 2 \times 1}{2!} = 5 \times 4 \times 3 = 60 \]

This means you can arrange five objects, taking three objects at a time, in 60 different ways. Learners must try as many problems of this kind as possible to familiarize themselves with the use of permutations. The probability of obtaining any one of these possible arrangements is 1/60 ≈ 0.0167.

Combinations

We use combinations when the order of arrangement of the objects is not important. In that case, the formula for this type of computation is as follows:

\[ {}^{n}C_{r} = \frac{n!}{r!\,(n-r)!} \]

If we examine the formula, we note that it is similar to the one for computing permutations, but there is an extra r! in the denominator. The formula reads: the combination of n objects, taking r objects at a time, is the ratio of n! to the product of r! and the difference (n − r)!

Example
Find the number of all possible selections of five objects, taking two at a time, where the objects may be arranged in no specific order.

Solution
Here no specific arrangement is prescribed. This means that we must use the combination computational algorithm to solve this problem. Therefore we use the formula and insert the given figures:

\[ {}^{5}C_{2} = \frac{5!}{2!\,(5-2)!} = \frac{5 \times 4 \times 3 \times 2 \times 1}{(2 \times 1)(3 \times 2 \times 1)} = \frac{5 \times 4}{2} = 10 \]

This means that five objects can be arranged in ten ways if we take two objects at a time when shuffling these objects. The probability of obtaining any one of these possible combinations is therefore 1/10 = 0.1.

The Binomial Distribution

If a medical team wishes to know whether or not their new drug provides their patients some relief from a certain disease, they need a computational technique to be able to determine exactly the probability that this drug will work. If the probability distribution for this kind of computation is known, then it becomes much easier to make the many decisions in life which revolve around the two alternative options of success or failure. The Binomial Distribution assists us to compute these probabilities. Before we consider this probability distribution we need to consider and understand the concept of random variables.

A random variable is a numerical quantity whose value is unknown to begin with, but whose value will arise, by chance, from among the many alternative options available. It is a variable whose value is determined by chance - a theoretical concept arising from the knowledge that there could be many possible outcomes of a process. In a coin-toss activity, we know before the toss that heads could be one of the outcomes; in the same manner, tails is another outcome that could happen. The quantity whose value will be settled by the toss - for example, the number of heads obtained - is a random variable.
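Before turning to probability distributions of random variables, here is a minimal check of the counting formulas just used, with Python's standard math module (math.perm and math.comb require Python 3.8 or later). It also evaluates the alphabet-arrangement figure mentioned earlier:

```python
import math

print(math.perm(5, 3))      # 60  - ordered arrangements of 5 objects, 3 at a time
print(math.comb(5, 2))      # 10  - unordered selections of 5 objects, 2 at a time

# Arrangements of the 26 letters of the alphabet: 26! is about 4.03 x 10**26
print(math.factorial(26))   # 403291461126605635584000000
```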
The probability distribution of a random variable is an array of the probabilities that each of the possible values of the random variable will occur. The random variable becomes an actual value once one of its possible outcomes actually takes place. Consider tossing a coin five times. Before the experiment, the number of heads that will appear - say three - is a random variable, because nothing has actually happened yet. When the tossing actually takes place, "three heads" becomes an actual figure.

Random Variables, Probability Distribution and Binomial Distribution

Looking at the "whether or not" situation, we note that each experiment, or each phenomenon, has its probability distribution determined by its own circumstances. There is no reason, for example, why exactly 50% of all the patients who are subjected to a medical experiment will respond to some drug of interest. The doctors in this case are faced with two possibilities, involving success or failure of the drug. This is the same as saying, "Let us see whether or not the drug will work," and it is this situation which makes the experiment a binomial experiment: the kind of experiment which has two likely outcomes, success or failure.

There are still more conditions which make an experimental process a binomial process. These are the so-called Bernoulli Conditions, named after Jacob Bernoulli, whose work (published posthumously in 1713) gave a comprehensive account of the nature of this hit-or-miss process. For each binomial experiment involving the probability of success or failure, the Bernoulli conditions are as follows:

1. The experimenter must determine a fixed number of trials within which the experiment must take place. For example, in a series of coin tosses where we are trying to see whether or not we shall get heads, the number of trials n must be definitely known.
2. Each trial must be capable of only two possible outcomes - either success or failure.
3. The outcome of any one trial must be independent of all the other trials before and after it.
4. The expected probability of success (π) must be constant from trial to trial. The probability of failure (1 − π) must also be constant from trial to trial.
5. The experiment must be the result of simple random sampling, giving each trial an equal chance of either failure or success each time.

When trial outcomes are the results of a Bernoulli process, the number of successes is a random variable when examined from an a priori point of view - before the experiment has been tried. When the experiment matures and the results are known, the variables which were initially conceptualized as random variables become actual variables which reflect the nature of the actual outcomes.

Example
An evenly balanced coin is tossed five times. Find the probability of obtaining exactly two heads.

Solution
In this problem we wish to learn systematically the method of handling binomial experiments. We begin by examining the given facts using the Bernoulli conditions which we have just learned.

Bernoulli Conditions for the experiment
We find that condition (1) is satisfied by the definite number of trials specified in the question. Condition (2) is also satisfied, because each trial has exactly two outcomes - success or failure, heads or tails.
Each coin toss is independent of any other toss, previous or subsequent, and this satisfies condition (3). The nature of coin-tossing experiments is such that the expected probability of success or failure is constant from trial to trial - thus condition (4) is satisfied. The fifth condition (5) is satisfied by the fact that the coin tossing is repeated and random, and the results of the tosses come out of a random "experimental" process.

Once we are satisfied that the experiment meets all the Bernoulli conditions, we must use the Binomial formula to compute the probability we are looking for in our example. We shall take the formula empirically, and only satisfy our curiosity by learning that this is what Bernoulli developed for solving this kind of problem. The formula is as follows:

\[ P(R = r) = {}^{n}C_{r}\,\pi^{r}(1-\pi)^{n-r} = \frac{n!}{r!\,(n-r)!}\,\pi^{r}(1-\pi)^{n-r} \]

For the beginner this formula looks formidable - the puzzle being how to memorize and use it within an examination environment or in daily biometric life. On second thoughts we see that it is really very simple. Let us take the first and the middle expressions:

\[ P(R = r) = {}^{n}C_{r}\,\pi^{r}(1-\pi)^{n-r} \]

We have already learned the combination formula, and all the designated variables can be interpreted as follows:

1. P is the obvious sign indicating that we are looking for a probability in this experimental situation.
2. The bold R is the designation for the random variable, whose value is unknown now but which will be determined during the experimental process.
3. The lower-case r is the actual outcome of the random variable after experimentation.
4. nCr is the combination formula which we have already come across when learning about combinations and permutations. This is the only material you need to memorize in this case, and it is:
\[ {}^{n}C_{r} = \frac{n!}{r!\,(n-r)!} \]
Remember that it is the permutation formula with an extra r! in the denominator, which neutralizes the specific orderings counted by permutations; the permutation formula is nPr = n!/(n − r)!. This is easy, is it not?
5. The variable π is the population (or universal, or expected) probability of success, and (1 − π) is its complement: the probability of failure.
6. The lower-case n, which appears often in this expression as a superscript, is the fixed number of trials. The lower-case r, which appears as an exponent on the right-hand side of the equation, is the actual number of successes, identical to what has been described in (3) above.

Study these six facts very carefully before leaving this section, because they will help you gain some working knowledge of Binomial experiments which satisfy the Bernoulli conditions. Note that Statistics as a subject is plagued with symbol difficulties, because the same formula can appear in different textbooks with slightly different-looking symbol configurations. While we recommend the notation used within this module, because it is widely used in statistical textbooks in print, we need to indicate some other forms which you may come across during your studies.

(a) The first form goes like this:
\[ P(x \text{ successes},\; n-x \text{ failures}) = \binom{n}{x}\,\pi^{x}(1-\pi)^{n-x} \]

(b) The other alternative form, still stating the same facts, looks like this:
\[ b(x; n, \pi) = \binom{n}{x}\,\pi^{x}(1-\pi)^{n-x} \]

This one emphasizes the fact that you are dealing with a binomial experiment - hence the "b" in front of the brackets on the left-hand side of the equals sign.
All of these carry instructions identical to P(R = r) = nCr π^r (1 − π)^(n−r).

(c) Another form you might come across is:
\[ P(K) = \binom{N}{K}\,p^{K} q^{N-K} \]

Usually when these expressions are presented, the author makes sure he or she has defined the variables very carefully and explained all the symbols. Pay meticulous attention to the use of symbols in the textbook you will be using, in this case and in all the other instances of symbol use in Statistics. The results of using these symbols will be identical.

Having understood all this, it is now time to work out our example of the five coin tosses with which we started. This is now a simple process, because it merely involves inserting the given values in the appropriate places and using your calculator, together with the BODMAS rule we learned in junior school, to obtain the solution:

\[ P(R = r) = {}^{n}C_{r}\,\pi^{r}(1-\pi)^{n-r} = \frac{5!}{2!\,(5-2)!}\,(0.5)^{2}(0.5)^{3} = 10 \times 0.25 \times 0.125 = 0.3125 \]

Activity
Study the use of this method of working meticulously, using the interpretation we have given to the formula in the foregoing discussion, before leaving this part of the discussion. The material on pages 83 to 95 of Fundamentals of Applied Statistics (King'oriah, 2004, Jomo Kenyatta Foundation, Nairobi) will be very useful and easy for the learner to supplement the information given in this module; a few easy, real-life-like examples of using the Binomial formula are given there. I do not mind even if you use half a day for the study of this one formula before going on. Then try to solve the problems at the end of Chapter Four (King'oriah, 2004), pages 93 to 95.

EXERCISES
1. Explain the meaning of the following concepts used in Probability Theory:
(a) Mutually Exclusive Events
(b) Independent Events
(c) Collectively Exhaustive Events
(d) A priori examination of an experiment
(e) Union of Events
(f) Intersection of Events
(g) Success and failure in probability
(h) Complementary Events
(i) Conditional Probability
2. Use the binomial formula to compute the probability of obtaining tails in the process of tossing an evenly balanced coin eight times.
3. (i) Explain in detail the meaning of the probability of an event.
(ii) In a single toss of an honest die, calculate:
(a) the probability of getting a 4
(b) the probability of not getting a 4
(c) the probability of getting a 6 and a 4
(d) the probability of getting a 2 and a 5
4. A club has 8 members.
(a) How many different committees of 3 members each can be formed from the club, with the realization that two committees are different even when only one member is different? (This means without concern for the order of arrangement.)
(b) How many committees of three members each can be formed from the club if each committee is to have a president, a treasurer and a secretary?
In each of the above two cases, give adequate reasons for your answer.

CHAPTER THREE

THE NORMAL CURVE AS A PROBABILITY DISTRIBUTION

Introduction

The graph of any probability distribution is usually constructed in such a way as to have all the possible outcomes or characteristics on the horizontal axis, and the frequency of occurrence of those characteristics on the vertical axis. The Normal curve is no exception to this rule. In fact we can regard this distribution literally as the mother of all distributions in Statistics.
We shall introduce the normal curve systematically, using a histogram constructed from the probabilities of the outcomes expected from an experiment of tossing an evenly balanced coin five times. Here we are moving slowly from the "known to the unknown": we have just completed learning the Binomial Distribution, and we shall use this to accomplish our present task. Study Table 3 - 1.

TABLE 3 - 1: THE OUTCOMES OF TOSSING A COIN FIVE TIMES

Possible number of heads (R)   Probability that the actual number of heads r equals R: P(R = r) = 5Cr (0.5)^r (0.5)^(5−r)
0    5!/(0! 5!) × (0.5)^0 (0.5)^5 = 0.03125
1    5!/(1! 4!) × (0.5)^1 (0.5)^4 = 0.15625
2    5!/(2! 3!) × (0.5)^2 (0.5)^3 = 0.31250
3    5!/(3! 2!) × (0.5)^3 (0.5)^2 = 0.31250
4    5!/(4! 1!) × (0.5)^4 (0.5)^1 = 0.15625
5    5!/(5! 0!) × (0.5)^5 (0.5)^0 = 0.03125
Total probability = 1.00000

Table 3 - 1 is the result of our computation. We begin by asking ourselves what the probability is of obtaining zero heads, one head, two heads, three heads, four heads and five heads in an experiment where we toss an evenly balanced coin five times. For each case, we go ahead and use the binomial formula to compute the probability of obtaining that number of heads, just as we did in the last chapter. Accordingly, from our computation activities, we come up with the information listed in Table 3 - 1. You are advised to work out all the figures in each row of this table, and to verify that the probability values given at the end of each row are correct. Unless you verify this, you will find it difficult to form an intuitive understanding of the normal curve. The next step is for us to draw histograms of these results, so that we can compare the outcomes.

Figure 3 - 1: A Histogram of the Binomial Probabilities that the Actual Number of Heads (r) equals the Possible Number of Heads (R) in Five Coin Tosses.

The reader is reminded of a similar diagram in Chapter One (on page 19) showing the heights of three-week-old maize plants. Both diagrams have one characteristic in common: they have one mode, with the histogram "bars" highest in the middle. In both diagrams, if the number of observations were increased so that the characteristic on the horizontal axis could be graded as finely as possible, instead of in discrete steps, we would have very fine gradations of the quality along the horizontal axis. If we drew frequency distributions of infinitely many observations in both cases (Chapter One and Chapter Three), the result would be a smooth graph, highest in the middle, which flattens out to be lowest at both ends. The second characteristic of both graphs is that the most frequent observations occur in, or near, the middle - within the region of the most typical characteristic. This is the line of thought taken by the early statisticians who contributed to the discovery of the Normal curve, especially Abraham de Moivre (1667 - 1754) and Carl Friedrich Gauss (1777 - 1855).

The diagram in Chapter One (Figure 1 - 1, page 19) enables us to count physically how many observations are recorded in one category, or in several categories. Once we do this, we can divide by the total count of all the crosses to obtain the ratio of the subject observations to the total number of observations.
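As a quick check of Table 3 - 1 itself, the sketch below evaluates the binomial formula in Python for every possible number of heads and confirms that the probabilities sum to 1.0; the r = 2 row reproduces the 0.3125 found in the Chapter Two example:

```python
import math

n, pi = 5, 0.5                      # five tosses of a fair coin

probabilities = []
for r in range(n + 1):
    p = math.comb(n, r) * pi**r * (1 - pi)**(n - r)
    probabilities.append(p)
    print(r, p)                     # 0.03125, 0.15625, 0.3125, 0.3125, 0.15625, 0.03125

print(sum(probabilities))           # 1.0 - the total probability in Table 3 - 1
```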
We are even able to compute the area under the curve formed by the tops of the columns of crosses, an area which can be expressed in terms of the count of those crosses under the general "roof-line" of the whole group. This count-and-ratio approach approximates the logic of the normal curve. In Figure 3 - 1 above, the early pioneers of Statistics were able to develop the frequency distribution which they called the Normal Distribution by turning the observations along the horizontal axis into a very large number of successively finer categories, and then plotting (mathematically at least) the frequency over each category of the characteristic. When they did this mathematically, they obtained a distribution which is bell-shaped, with one peak over the most typical observation, the mean (μ). In Figure 1 - 1, the most typical characteristic is 0.3 of a metre. In Figure 3 - 1, the most typical probability is 0.31250, and we can rightfully argue that this is the modal (or most typical) probability under the circumstances of the experiment, where π = 0.5, n = 5, and the number of heads R = r varies between 0 and 5.

We shall not delve into the mathematical computations involved in the derivation of the normal curve, but we need to note that the curve is a probability distribution with an exact mathematical equation. Using this equation, an exact probability distribution is possible, because we can accurately plot the bell-shaped curve and, most importantly, compute the area under it. Any such value of the area under the curve is the probability of finding an observation whose value lies within the designated range of the subject characteristic.

Activity
1. Go to Figure 1 - 1 and count all the maize plants which are 0.25 metres high and below; these are represented by all the crosses in this characteristic range. You will find the crosses to be 31. In that regard, we can say that the probability of finding maize which is 0.25 metres and below is:
P(maize 0.25 metres and below) = actual observations / total observations = 31/100 = 0.31
2. Count all the plants which are 0.5 metres high. Compute the percentage of all the plants which are 0.5 metres high, remembering that there are 100 plants in total. What is the proportion of these plants to the total number of plants? What is the probability that one could randomly find a plant which is 0.5 metres high?
3. Count all the plants which are 0.35 metres high and above. Compute the probability of finding some plant within this region by chance, considering that the total number of plants in this sample is 100. What is the probability of finding a plant with a height below 0.35 metres?
4. Make your own table like Table 3 - 1, computing all the probabilities of finding zero heads, 1 head, 2 heads, and so on up to 8 heads in eight tosses.
5. Using Table 3 - 1, find the total probability of obtaining 3 heads and below. Explain why this kind of computation is possible.

In all the above activities we have actually been computing areas under the curve - the frequency distribution formed by observations of the characteristics listed along the horizontal axis. The proportion, or the probability, of an observation or a set of observations under the curve is actually a statement about area. Please read the recommended texts and satisfy yourself that you are dealing with areas.
The Normal Distribution

The area under the bell-shaped Normal curve

The founding fathers of statistics were able to use calculus to compute areas under curves of different types. Using the Binomial distribution it was possible to conceptualize very fine gradations of quality along the horizontal axis, and to compute the frequency for each of the closely spaced characteristics, so forming the normal curve. The general equation which they derived for describing the normal curve - which we do not have to prove - is as follows:

\[ f(X)\,dX = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(X-\mu)^2}{2\sigma^2}}\, dX \]

In this equation π is the usual geometric constant, defining the number of times the diameter of a circle goes into its circumference, which is 3.1416.... The "e" is the base of natural logarithms, usually known as Euler's constant, whose value is 2.7183.... The symbol μ is the universal mean of the population under consideration, and σ is the population standard deviation. The values of π and e can be obtained from our scientific calculators at the touch of a button, because they are very useful in all advanced mathematical work. The population mean, represented by the symbol μ, can either be computed from any number of observations above 30 (the more observations the better), or be a figure known from observation of the nature of the data over a long time. The population standard deviation σ is likewise obtained from the raw population, in the manner discussed in Chapter One. Except for the constants π and e - which exist to define the curvature of the curve and the continuity of the same curve, respectively - the normal curve can be said to be completely determined by the values of the population mean (μ) and the population standard deviation (σ). Given these values, it is possible to determine the position of each point on the bell-shaped normal curve - the normal frequency distribution.

Consider the example of the maize plants given in Figure 1 - 1. In this example it was possible to tell the height (count) of each of the columns of crosses for each category on the horizontal axis. The highest cross in each column determines the position of the frequency curve at that point, at that level of the characteristic along the horizontal axis. The more we increase the total number in the sample, and the more we keep drawing histograms representing the number of counts for each characteristic, the more closely the graph of the distribution approaches the smooth shape of the normal curve, as in the two diagrams shown in Figure 3 - 2.

To demonstrate that we are dealing with a real phenomenon, and not merely a mathematical abstraction, we will now show that it is possible to compute the height of the normal curve at any point using the above formula, given the two parameters: the population mean μ and the population standard deviation σ. We shall strictly stick to evaluating one point on the normal curve. Remember that a point is the smallest part of any curve - even of a straight line. In this case we can say that a curve such as the normal curve is a succession of points plotted using some definite equation, just as we plotted graphs in elementary school when we were learning how to draw them.
This means that if we can obtain one point using the formula, then we can get the whole succession of points comprising the normal curve using the same formula:

\[ f(X) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(X-\mu)^2}{2\sigma^2}} \]

We now begin this very interesting exercise.

Figure 3 - 2: The more observations there are along the horizontal axis, and the closer their values along that axis, the more closely their relative frequencies fit the smooth shape of the normal curve.

Example
A variable X has a population mean μ = 3 and a standard deviation σ = 1.6. Compute the height of the normal curve at X = 2, using the equation for one point on the normal curve given above.

Solution
Identifying the height of the curve using this formula means that we plot one point on the curve using the given data. If we insert the given values, together with those of the constants, into the equation and evaluate it, we shall get the required height. We shall first evaluate the exponent, which sits above Euler's constant "e" in the equation. We also need to remind ourselves that "f(X)" in our equation means "y", which in mathematics is a value on the y-axis: the value we described as the frequency of occurrence of the "point" characteristic defined along the horizontal axis. It is like asking how many maize plants are 0.2 metres high in Figure 1 - 1; if we count these we find that they are eleven, and this is the height of the jagged curve defined by the tops of the crosses in the diagram at X = 0.2. In this case, we are asking the same question for the value X = 2. Now we know what we are looking for.

Step One: Evaluate the exponent −(X − μ)²/(2σ²), using the values given in the question:
−(X − μ)²/(2σ²) = −(2 − 3)²/(2 × 1.6²) = −1/5.12 = −0.19531250

Step Two: Evaluate the fraction 1/(σ√(2π)), using the given value σ = 1.6 and the known constant π:
1/(σ√(2π)) = 1/(1.6 × √6.283185) = 1/(1.6 × 2.5066) = 0.24933893

Step Three: Evaluate Euler's constant e raised to the power of the exponent calculated in Step One:
e^(−0.19531250) = 2.7183^(−0.19531250) = 0.82257649

Step Four: Multiply the result of Step Three by the result of Step Two:
0.82257649 × 0.24933893 ≈ 0.20510

Amazing! We can therefore conclude that the height of the normal curve defined by the parameters X = 2, μ = 3 and σ = 1.6 is approximately 0.2051, expressed in whatever frequency units are being used in the subject experiment. We can conclude that, using the two population parameters - the population mean μ and the population standard deviation σ - it is possible to compute a succession of all the points along the trend of the normal curve using our equation for one value of "y", or "f(X)", along this curve. The continuous surface of the normal curve is made up of all these heights joined together; mathematically, we say it is the locus of all these points.

Similarly, if we are given the two important parameters, the population mean (μ) and the population standard deviation (σ), we can compute the area of any slice under the normal curve. The expression for the total area under the normal curve is:

\[ \int_{-\infty}^{\infty} f(X)\,dX = \int_{-\infty}^{\infty} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(X-\mu)^2}{2\sigma^2}}\, dX \]

Using the same logic, the equation defining the area subtended by any two values a and b located on the horizontal axis
(that is, any characteristic values a and b) could be obtained using calculus, by re-defining the end limits of the general equation, as stated in the following expression:

\[ \int_{a}^{b} f(X)\,dX = \int_{a}^{b} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(X-\mu)^2}{2\sigma^2}}\, dX \]

Those of you who understand the branch of mathematics called Integral Calculus know that this is true. In addition, if we can compute the areas of slices under the normal curve, those areas will be identical to the probabilities of finding a value within the slice defined by the characteristic values a and b along the horizontal axis. Luckily, we do not have to use the formula every time for this purpose, because there exist tables, developed by the founding fathers of Statistics, to assist us in computing these areas - once we learn the method of using them, as we are about to do in the discussion which follows. For now, we need to study the characteristics of the normal curve.

Characteristics of the Normal Distribution

The probability distribution of any random variable gives a probability for each possible value of that random variable. In the case of the normal distribution, the random variable is the characteristic of interest, which is usually plotted along the horizontal axis; we already know that along the vertical axis lies the frequency of each characteristic. This is why we involved ourselves in the exercise of counting the maize plants in Figure 1 - 1. We also need to remember that the frequency of any normal population is greatest around the mean characteristic (μ) of the population. This is why we involved ourselves in the binomial computation of the probabilities of success after tossing a coin five times, as we did using Table 3 - 1; we did similar things using Figure 1 - 1.

When the normal distribution is considered merely in terms of the characteristics of interest and their frequency of occurrence, it may be called a frequency distribution. For the normal curve to be a probability distribution, we have to think in terms of the characteristic of interest and the probability that such a characteristic may turn out to be the real characteristic after investigation. This is where the concept of the random variable comes in. Before any investigation is done, there is an a priori conception of all the possible ranges of the random variable of interest. In the case of the maize example, all possible heights of the maize plants after a certain period of growth may be important. The characteristic heights would then form the horizontal axis, and the frequency of their occurrence the values along the vertical axis. In that regard, we may compute the probability of interest in a similar manner to that discussed above, either using the Binomial Distribution or using Integral Calculus. When such a probability distribution is plotted, a curve of the probability distribution is the result. The probability distribution is actually an arithmetic transformation of the actual frequencies, which can also be plotted as a curve. We call the actual frequencies raw scores, and the observations along the continuous probability distribution probability frequency values. The resulting curve is, of course, a frequency distribution.
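Because the integral above has no simple closed form, tables (or software) are used in practice. As an illustration only, the following Python sketch evaluates the height of the curve for the worked example above, and approximates the area of a slice between two values a and b using the standard library's error function (math.erf), to which the normal curve's cumulative area is mathematically related; the limits a = 2 and b = 4 are chosen purely for illustration:

```python
import math

def normal_height(x, mu, sigma):
    """Height f(x) of the normal curve with mean mu and standard deviation sigma."""
    return (1 / (sigma * math.sqrt(2 * math.pi))) * math.exp(-(x - mu) ** 2 / (2 * sigma ** 2))

def normal_area(a, b, mu, sigma):
    """Area under the normal curve between a and b, via the error function."""
    phi = lambda z: 0.5 * (1 + math.erf(z / math.sqrt(2)))   # cumulative area up to z
    return phi((b - mu) / sigma) - phi((a - mu) / sigma)

# Worked example from the text: mu = 3, sigma = 1.6
print(round(normal_height(2, 3, 1.6), 4))    # about 0.2051

# Area of the slice between a = 2 and b = 4
print(round(normal_area(2, 4, 3, 1.6), 4))   # about 0.468
```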
We know that the peak of the normal curve lies above the mean characteristic in any distribution, and therefore we begin by stating the first characteristic:

(a) The normal curve is uni-modal in appearance. It has a single peak above the mean characteristic value μ. When we use sample values to estimate the normal curve, the parameter μ is replaced by the statistic X̄. The highest point on a normal curve defined using sample observations is therefore above the sample mean X̄.

(b) The expression f(X) = [1 / (σ√(2π))] e^[ −(X − μ)² / (2σ²) ] describes a bell-shaped curve which falls steeply on both sides of the mean value (μ, or X̄) and then flattens out towards both ends as it approaches the horizontal axis. Statistically, we say that the curve has an upper tail and a lower tail, as in Figure 3 - 3 below.

Figure 3 - 3: The Shape of the Normal Curve

(c) The normal curve depends only on two parameters: the mean (μ for the population, or X̄ for the sample) and the standard deviation (σ for the population, or s for the sample).

(d) The total area under the curve accounts for all there is of the observations, whether the population or the sample of interest. If we consider the curve as a probability distribution, the area under it represents the total probability of finding members of that population or that sample across the whole possible range of the characteristic of interest. Since we are considering all there is of the known characteristics under the normal curve, the total probability is 1.0. We can state this using the equation for the definition of random variables:

Σ P(X = xᵢ) = 1.0

If we have a finite number of frequencies and observations to consider, we can translate this equation to cover the observations numbered 1 to n:

Σ (i = 1 to n) P(X = xᵢ) = 1.0

(e) The shape of the normal curve is completely determined by the population or sample mean and the population or sample standard deviation. Each curve has the same general configuration as all the others, but in fact differs from them depending on the circumstances of the experiment and the characteristics of the population of interest. However, if we standardize the curve, keeping the total probability of all the existing observations at 1.0, we obtain what is called the Standard Normal Curve. Such a curve is standardized in such a manner as to leave it with a mean of zero and a standard deviation of 1.0.

Activity
The process of standardization can easily be modeled by requesting a class of 30 students (or more) to measure their heights accurately. Then request them to compute their mean height. After this, request them to stand up. Some students will be taller and others shorter than the mean height. You may now request all the members of the class who have exactly the mean height to sit down. For the remaining students, record exactly how much taller or shorter each is than the mean height. Then find the standard deviation of the class heights.

Results
1. When all the students whose heights are exactly equal to the mean height are requested to sit down or to leave the group, this is like saying that the measurement of all the deviations shall begin from the mean value. The mean value thus becomes your zero value.

2. Those students who are taller than the mean will record a positive difference above the mean; record these positive deviations by deducting the mean height from each of their heights. The students whose heights are smaller than the mean will record a negative difference from the mean.
Do all these subtractions and record the results with their negative values, clearly indicating the negative signs. For both groups, ignore the mean value and record only the deviations from the mean, with their appropriate signs shown in front of their respective values. This is like making the mean value zero.

3. Examine the value of the standard deviation. Using the standard deviation as a unit of measurement, find out how much shorter or taller than the mean each of the remaining members of the class is. After all, the standard deviation is a deviation like all the others, only it is standard, which means it is the typical, or expected, deviation from the mean. This comparison is done by dividing every deviation from the mean (whether large or small) by the standard deviation, which is, after all, the regular standard of measurement.

4. The result gives the number of standard deviations which separate each height observation from the mean value (μ), which we have discounted and assumed to be zero. These deviations from the mean, large and small, will be negative or positive depending on which side of zero they happen to be located. The number of standard deviations for each case, in absolute value, can take decimal values and usually lies between zero and about 4 standard deviations (for reasons we shall discuss soon).

5. This number of standard deviations from the mean for each observation (covering each student who was not excluded when we asked those with the mean height to step aside) is the so-called Normal Deviate value, usually designated the "Z-value". Any Z-value measures the number of standard deviations by which any observation in any sample or any normal population stands away from the mean value. This value can take fractional values; it can be as large as one or two, or even between three and four standard deviations from the mean. Our work of manipulating the normal curve will rest heavily on understanding this characteristic of the Z-value. We shall do this presently.

6. The area under the normal curve covering any interval on either side of the population mean depends solely upon the distance which separates the end-points of the interval, expressed as a number of standard deviations, which from now on we shall call Z-values, or numbers of normal deviates. Any normal curve created out of deviations from the mean population values is called a standardized normal curve. Using Z-values we can compute the area under the normal curve between any two end-limits. This saves us from the esoteric mathematics of integral calculus, which is equally valid and brings identical results if accurately used. In terms of standard deviations, the proportion of the normal curve between any two equal intervals on both sides of the mean is in accordance with Table 3 - 2. This table reflects the very nature of the normal curve which we have been explaining above. We must keep reiterating that we are concerned with the number of members of any population having any particular measurable characteristic. We also reiterate that in any normal population which is nearly homogeneous in character, the magnitude of the characteristic of interest is nearly identical from member to member, and does not differ considerably from the mean.
This is why most of the population is found near the mean, both in terms of the magnitude of the characteristic of interest and in terms of the distance, measured in standard deviations, away from the mean.

7. The number of standard deviations of any observation from the population or sample mean is calculated by dividing the amount by which that observation lies above or below the mean by the population or sample standard deviation.

Table 3 - 2: AREA UNDER THE NORMAL CURVE SPANNING BOTH SIDES OF THE MEAN, AS A PERCENTAGE OF THE POPULATION, AS A PROPORTION OF THE TOTAL AREA, AND AS A PROBABILITY

Number on each side of the mean | Range | Population % having this range of the characteristic | Proportion of the total area under the normal curve | Probability of finding a member of the population within this bracket
One standard deviation | μ − σ to μ + σ | 68.0% | 0.68 | 0.68
Two standard deviations | μ − 2σ to μ + 2σ | 95.5% | 0.955 | 0.955
Three standard deviations | μ − 3σ to μ + 3σ | 99.7% | 0.997 | 0.997
Four standard deviations | μ − 4σ to μ + 4σ | 99.994% | 0.99994 | 0.99994

Using Standard Normal Tables

This discussion now brings us to a very interesting situation, where we can look for any area under the normal curve subtended by any two end-limits along the characteristic axis (the X-axis) without any use of complicated integral calculus, which is the tool mathematicians use for calculating areas under any curve. Amazing! But we must thank our mathematical fathers, especially Carl Friedrich Gauss (1777 - 1855), for the painstaking effort of developing the theory behind the so-called Standard Normal Tables. His work on the Standard Normal Tables and the associated theories and paradigms led mathematicians to classify the standard normal curve as a Gaussian Distribution. We can now comfortably learn how to use the Standard Normal Tables, together with the fact that a normal curve is a probability distribution, to solve a few easy problems. For additional work on this exercise you are advised to read King'oriah (2004, Jomo Kenyatta Foundation, Nairobi), Chapter Five.

Example
A certain nut has an expected population mean weight of 50 grams and a population standard deviation of 10 grams. How many standard deviations "Z" away from the population mean is a nut which you have randomly picked from the field and which weighs 65 grams?

Solution
This is a good example of looking for the normal deviate Z. We must obtain the difference between the population mean value and the sample which we have taken from the field, in terms of actual raw weight. The formula for this activity is:

Z = (Xᵢ − μ) / σ

where Xᵢ is any value of the random variable X (any actual value of an individual observation), μ is the mean of the population under consideration, and σ is the standard deviation of the population under consideration. This formula is applied in a simple manner to our nut example. The difference between the actual observation and the population mean is:

65 gm − 50 gm = 15 gm

This difference must be translated into a number of standard deviations so that we can calculate the Z-value:

Z = (65 gm − 50 gm) / 10 gm = 15 gm / 10 gm = 1.5 standard deviations

Here we are using the standard deviation as a measure of how far the actual field observation lies away from the mean; in this case the nut we picked lies 1.5 standard deviations away from the mean value μ.
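Both the normal deviate just computed and the proportions listed in Table 3 - 2 can be checked with a short Python sketch. This is only an illustration; the scipy library and the variable names are our own assumptions, not part of the course material.

from scipy.stats import norm

mu, sigma = 50.0, 10.0            # population mean and standard deviation, in grams
z = (65.0 - mu) / sigma           # normal deviate for a 65-gram nut
print(z)                          # 1.5 standard deviations above the mean

# Proportion of a normal population lying within k standard deviations of the mean
for k in (1, 2, 3, 4):
    print(k, round(norm.cdf(k) - norm.cdf(-k), 5))   # about 0.68, 0.95, 0.997, 0.99994

The loop reproduces, to within rounding, the proportions quoted in Table 3 - 2.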
Example
Suppose in the above example we find another nut which weighs 34 grams. Calculate the Z-value of this difference between Xᵢ and μ.

Solution
Using the same units which we have been given in the earlier example, we have:

Z = (Xᵢ − μ) / σ = (34 − 50) / 10 = −16 / 10 = −1.6

The Standard Normal Tables

Look at page 487 in your textbook (King'oriah, 2004), or any book with the standard normal tables entitled Areas Under the Normal Curve. In the first nut example we obtained 1.5 standard deviations from the mean. Look down the left-hand column of the table in front of you. You will find that beside the value Z = 1.5 down the Z-column is the entry "4332". This means that the area under the normal curve with end-limits bounded by the mean and the actual observation of 65 grams is 0.4332 out of 1.0000 of the total area of the normal curve. And since the Z-value is positive, the slice of the normal curve lies on the right-hand side of the mean, as shown in Figure 3 - 3.

In the second case, where Z = −1.6, the actual observation lies on the left-hand side of the mean μ, as indicated by the negative sign in front of the Z-value. The reading in the table against Z = 1.6 is "4452". This means that an observation of 34 grams and the mean μ = 50 grams, taken as end-limits along the characteristic values (X-axis), subtend an area which is 0.4452 out of the total area of 1.0000. This area is shown on the left-hand side of the mean in Figure 3 - 3.

Figure 3 - 3: Showing the Number of Standard Deviations for the Weight of a Nut, Above and Below the Mean

Having computed the areas of the slices on both sides of the mean which are subtended by the given end-limits (as proportions of 1.0000), the next question we need to ask ourselves is what the probability is that we can find an observation within the combined area subtended by the two intervals. Both proportions which we have already computed, namely 0.4332 and 0.4452, are records of the probability that one can find some observation within the areas subtended by the respective end-limits. The probability that one can find a nut weighing between 34 grams and 65 grams, spanning both sides of the population mean, is the sum of the two probabilities:

Total probability = 0.4332 + 0.4452 = 0.8784

Example
What proportion of the normal curve lies between Xᵢ = 143 and μ = 168, when the standard deviation of this population is σ = 12?

Solution
Step One: Compute the normal deviate Z over this interval.

Z = (Xᵢ − μ) / σ = (143 − 168) / 12 = −25 / 12 = −2.08

Like all calculations of this kind, we ignore the negative sign in our answer, because it merely tells us that the observation 143 is smaller than the mean 168.

Step Two: For Z = 2.08 we look up the proportion in the standard normal tables on page 487 of your textbook (King'oriah, 2004). The row for 2.0 down the left-hand Z-column and the column headed 0.08 along the top converge within the body of the table at a reading of "4812". This means that 0.4812 of the normal curve lies between the observations given in the question, namely between Xᵢ = 143 and μ = 168. The real-life implication of this is better appreciated using percentages: 48.12% of all the members of the population have a value of the characteristic of interest lying between 143 and 168. These characteristics could be measured in any units. Suppose the example refers to the weights of two-month-old calves of a certain breed of cattle. Then 48.12% of these calves must weigh between 143 kilograms and 168 kilograms.
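The table look-ups used in the examples above can also be reproduced numerically. The sketch below is our own illustration (scipy assumed available); it uses the cumulative normal distribution in place of the printed tables, tabulating the same quantity, the area between the mean and a given Z-value.

from scipy.stats import norm

def area_from_mean(z):
    # Area under the standard normal curve between the mean and z (the quantity in the tables)
    return norm.cdf(abs(z)) - 0.5

print(round(area_from_mean(1.5), 4))    # about 0.4332  (nut weighing 65 g)
print(round(area_from_mean(-1.6), 4))   # about 0.4452  (nut weighing 34 g)
print(round(area_from_mean(1.5) + area_from_mean(-1.6), 4))   # about 0.8784, a nut between 34 g and 65 g
print(round(area_from_mean(-2.08), 4))  # about 0.4812  (calves between 143 kg and 168 kg)

The same helper can be reused for any of the intervals discussed in this chapter.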
Example
Using the parameters and statistics given in the above example, compute the percentage of the two-month-old calves which weigh between 143 kilograms and 175 kilograms.

Step One: We know that between 143 kg and 168 kg are found the weights of 48.12% of all the calves in this population.

Step Two: Compute the normal deviate between 168 kilograms and 175 kilograms.

Z = (Xᵢ − μ) / σ = (175 − 168) / 12 = 7 / 12 = 0.58333

Step Three: Look for the area under the normal curve between the mean of 168 kilograms and 175 kilograms. We shall now symbolize the area which we find this way, and all those found earlier, using the designation "A_Z". This symbol means "the area defined by the given, or computed, Z-value". Look down the left-hand column "Z" of the Standard Normal Table on page 487 and find where the Z-value reads "0.5". Then go across the top-most row of the table looking for a column heading of 0.08333; the nearest headings available are 0.08 and 0.09. For our convenience we adopt the column labeled "08", so that we work with the rounded value Z = 0.58 instead of the computed value Z = 0.58333; after all, 0.58 is a good rounded figure. Where the row labeled "0.5" intersects the column labeled "08" lies the area value corresponding to this number of standard deviations from the mean (μ). This intersection defines an area (the "A_Z") under the normal curve of 0.2190. Using our percentage interpretation, 21.90% of the calves must weigh between 168 and 175 kilograms.

Step Four: Adding the two slices on either side of the mean, 0.4812 + 0.2190 = 0.7002. Therefore about 70.02% of the calves weigh between 143 kilograms and 175 kilograms.

Figure 3 - 4: Showing the Proportion of the Calf Population Which Weighs between 168 Kilograms and 175 Kilograms

The "t" Distribution and Sample Data

Sampling Distribution of Sample Means

It is not every time that we are lucky enough to deal with entire populations of the universe of data. Very often we find that we are limited by time and expense from dealing with the whole population. We therefore resort to sampling (see Chapter Four) in order to complete our investigations within the time available and, often, within the level of expense that we can bear. In this case we do not deal with the normal distribution exactly. We deal with its very close "cousin", called the Sampling Distribution of Sample Means. This distribution has very similar qualities to the normal distribution because, after all, all data is obtained from one population or another. However, we know that the characteristics of this data come from only one sample out of the very many which could be drawn from the main population. In this case we have a double transformation of our sample data. Firstly, the data is governed by the quality of all individuals in the entire population; and secondly, the sample is governed by its own internal quality as a sample, as compared to any other sample which could be randomly selected from the entire "universe" or population.
Mathematicians have struggled with this phenomenon for a long time, trying to find how sample data can be used for accurate estimation of the qualities of the parent population, without risking the inaccuracies which could arise because there is a real chance that the sample does not display the complete truth about the characteristics of the parent population. This can happen because of slight sampling errors, or because of the random character of the sample collected, which could differ in some way from all the other possible samples, and also from the main population. After prolonged study of the problem, the Expected Value of the Sample Mean, whatever the nature of the sample, was found to be the same as the population mean. However, suppose there are some fine differences? This is why and how mathematicians came up with the Central Limit Theorem. This theorem clarifies the fact that all sample means are estimates of the value of the entire population mean, but each sample may not be an exact estimate of that population's mean. Suppose, therefore, that sampling is repeated randomly for a long time, and the mean of all the sample means (with their very fine differences) is then sought. This mean is expected to be identical to the parent population mean. Obviously, this collection of sample means, with its fine differences after many repeated samples, forms a distribution in its own right. This is what mathematicians and statisticians call the Sampling Distribution of Sample Means. Since every sample mean is a very close estimator of the main population mean, such a sampling distribution is expected to be very closely nested about the main population mean (μ), because the frequencies of all samples whose sample means are very nearly the size of the parent population mean will be very high. In other words, these samples will be very many, and their count will be clustered about the main population mean. Here we must remember the counting exercise we did at Figure 1 - 1 (page 19). Then let us project our thoughts to counting the numbers of sample means around the population mean (μ), perhaps using a number of crosses like those we used in that diagram.

If we were to generalize and smooth the Sampling Distribution of Sample Means from discrete observations (or counts of data) like the "crosses" found in Figure 1 - 1, the result would be like the steeper curve in Figure 3 - 5. We would therefore obtain a very steep "cousin" of the normal curve, nested about the population mean (μ) of the normal distribution which is formed by the "raw" observations of the parent population. The ordinary normal distribution comprises the flatter curve in that diagram.

Figure 3 - 5: Theoretical Relationship between the Sampling Distribution of Sample Means and a Normal Curve Derived from Raw Scores

The sampling distribution of sample means also has its own kind of standard deviation, with its own peculiar mathematical characteristics, which are also affected by its origins. This is what is asserted by the statement of the Central Limit Theorem. Under these circumstances, the standard deviation of sample means is called the standard error of sample means. In order to discuss the Central Limit Theorem effectively, it is important to state the theorem, and thereafter explain its implications.
The Central Limit Theorem states that:

If repeated random samples of size n are drawn from any population whose mean is μ and whose variance is σ², then as the sample size n becomes large, the sampling distribution of sample means approaches normality, with its mean equal to the parent-population mean (μ) and its variance equal to σ²/n.

Firstly, we need to note that the theorem is stated in terms of the variance of the sampling distribution of sample means, σ²/n. This is because the variance is the measure used most by mathematicians for the analysis of statistical theorems. This does not matter, as long as we remember that the standard error of sample means can be obtained as the square root of this variance, expressed as:

σ_X̄ = √(σ²/n) = σ/√n

An ordinary standard deviation is a measure of the dispersion of an ordinary population of raw scores, while the standard error of sample means is a measure of the dispersion (the standard deviation) of the distribution of sample means.

Secondly, the peculiarities of the sampling distribution imply that we use it in analysis in a slightly different way from the ordinary normal distribution, even though the two distributions are closely related. This means we use a statistic which is a close relative of the normal distribution, called the "t" distribution. This statistic is sometimes called Student's t Distribution, after its inventor, W. S. Gosset, who wrote about the statistic under the pen-name "Student" because his employer, Guinness Breweries, had forbidden its staff from publishing in journals during the period when Gosset was writing (1908).

The statistic first uses an estimate of the standard deviation, computed with the usual formula which reflects the fact that one degree of freedom has been lost in making the estimate. This sample standard deviation is represented by S and computed using the usual formula:

S = √[ Σ (i = 1 to n) (Xᵢ − X̄)² / (n − 1) ]

where S means the estimated standard deviation of the sample, and n is the number of observations. The rest of the standard error formula can be compared to the usual formula for the computation of standard deviations using raw scores. Remember, the usual expression for the raw-score standard deviation of a normal distribution is

√[ Σ (i = 1 to n) (Xᵢ − X̄)² / n ]

which you should compare with the sample standard deviation expression just stated. Once the sample standard deviation has been computed, the standard error is computed from it in the following manner:

S_X̄ = √[ Σ (Xᵢ − X̄)² / (n − 1) ] / √n = S / √n

Sometimes this standard error is denoted by σ_X̄, where σ_X̄ means the standard error of sample means; this reflects the fact that we are actually dealing with a very large population of sample means. In all our work we shall use S_X̄ = S/√n as our designation for the standard error of sample means. Once this standard error has been computed, the expression for the t-distribution can be computed as follows:

t = (X̄ − μ) / S_X̄ = (X̄ − μ) / (S/√n)

This is the expression which has been used to construct all the t-tables. These tables are available at the end of your textbook (King'oriah, 2004, page 498). The distribution looks like the usual normal curve, and the interpretation of the results is similar. The distribution is bell-shaped and symmetrical about the parent population mean (μ). The scores recorded in this distribution are composed of the differences between the sample mean X̄ and the true population mean μ, each difference divided by S_X̄. The number of standard errors by which any sample mean stands away from the population mean can therefore be obtained, and used to compare individual sample means with the population mean (μ), or among themselves. The distribution can also be used to compute probabilities, as we shall see many times in the discussion within this module. The cardinal assumption which we make for this kind of statistical measure is that the underlying distribution is normal. Unless this is the case, the t-distribution is not appropriate for statistical estimation of the values of the normal curve, or anything else.
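Before turning to a numerical example, the Central Limit Theorem itself can be demonstrated by simulation. The sketch below is our own illustration, with numpy assumed available and a deliberately non-normal parent population chosen arbitrarily; it draws many samples and compares the spread of their means with σ/√n.

import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=2.0, size=100_000)   # a skewed, non-normal parent population
mu, sigma = population.mean(), population.std()

n = 30
sample_means = [rng.choice(population, size=n).mean() for _ in range(5_000)]

print(round(np.mean(sample_means), 3), round(mu, 3))                  # the two means are nearly equal
print(round(np.std(sample_means), 3), round(sigma / np.sqrt(n), 3))   # spread of sample means is close to sigma / sqrt(n)

Even though the parent population is far from normal, the sample means cluster around μ with a standard deviation close to σ/√n, which is exactly what the theorem asserts. We are now ready to use a small example to show how to compute the standard error of the mean.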
Example
Four lorries are tested to estimate the average fuel consumption per lorry by the manager of a fleet of vehicles of this kind. The known mean consumption rate per ten kilometers is 12 liters of diesel fuel. Estimate the standard error of the mean using the individual consumption figures given in the small table below.

Lorry number:  1     2     3     4
Consumption:  12.1  11.8  12.4  11.7

Solution
We first compute the variance of the observations in the usual manner. Observe the computation in Table 3 - 3 carefully, together with the following expressions, to make sure that the variance, the standard deviation and, finally, the standard error of the mean have all been accounted for.

1. S = √[ Σ (Xᵢ − X̄)² / (n − 1) ] = √[ 0.30 / (4 − 1) ] = √0.10 = 0.316

2. S_X̄ = S / √n = 0.316 / √4 = 0.316 / 2 = 0.158

Table 3 - 3: FIGURES USED IN COMPUTING THE ESTIMATED SAMPLE STANDARD DEVIATION IN PREPARATION FOR THE COMPUTATION OF THE STANDARD ERROR OF SAMPLE MEANS

Xᵢ    | Xᵢ − X̄            | (Xᵢ − X̄)²
12.1  | 12.1 − 12.0 = 0.1  | 0.01
11.8  | 11.8 − 12.0 = −0.2 | 0.04
12.4  | 12.4 − 12.0 = 0.4  | 0.16
11.7  | 11.7 − 12.0 = −0.3 | 0.09
TOTAL |                    | 0.30

(Here X̄ = 12.0.) Expression (2) above is a systematic instruction on how to compute the standard error of sample means from the sample standard deviation, which is itself computed in expression (1). Note how we lose one degree of freedom as we estimate the sample standard deviation. In expression (2) we must be careful not to subtract one degree of freedom a second time to adjust the denominator √4; doing so would introduce errors caused by double counting.

After our understanding of what the standard error of sample means is, we now need to use it to estimate areas under the t-distribution in the same way as we used the Standard Normal Tables. We reiterate that the use of the t-distribution tables is similar to the way we used the Standard Normal Tables, and we make similar deductions. The only difference is that these special tables have been designed to take care of the peculiarities of the t-distribution, as given in any of your textbooks (for example King'oriah, 2004, page 498). We shall, however, defer t-table exercises a little (until the following chapter) so that we can consider another very important statistic which we shall use frequently in our statistical work. This is the standard error of the proportion.
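The lorry computation can be verified with the short sketch below (an illustration only; the variable names are ours). Note that Python's statistics.stdev already divides by n − 1, so the degree of freedom must not be subtracted a second time.

import math
import statistics

consumption = [12.1, 11.8, 12.4, 11.7]      # liters per 10 km for the four lorries
s = statistics.stdev(consumption)           # sample standard deviation, divisor n - 1
se = s / math.sqrt(len(consumption))        # standard error of the sample mean
print(round(s, 3), round(se, 3))            # approximately 0.316 and 0.158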
The Proportion and the Normal Distribution

In statistics the population proportion is regarded as a measure of central tendency of population characteristics, like the population mean. The fact that 0.5 of all the people in a certain population drink alcoholic beverages means that there is a central tendency that any person you meet in the streets of the cities of that community drinks alcoholic beverages. Thus the proportion is a qualitative measure of the characteristics of the population, just as any observation (or any sample mean) is a quantitative measure of the characteristics of any population.

Remember that early in this chapter we said that the normal curve of any population can be approached via the binomial distribution. The proportion of success for any sample or population is a binomial variable. The lack of success, or failure, is also a binomial variable. Both satisfy Bernoulli conditions. Therefore the distribution of the proportion is very closely related to the normal distribution. In fact they are one and the same thing, only that for raw scores (or ordinary observations) we use quantitative measures (on the interval or ratio scales of measurement), while for proportion observations we use qualitative measures (on the nominal scale of measurement) comprising success (P, or π) and failure (1 − π, or 1 − P).

The universal (or population) measure of the proportion is denoted by the Greek letter π, while the sample proportion is denoted by the Roman letter "P", sometimes in the lower case and at other times in the upper case. This means that, given the sample proportion "Ps", we can use it to estimate the position of the sample proportion quality (Ps) under the normal distribution, in the same manner as we can place any observation or any population mean under the normal distribution, or any sample mean under the t-distribution. Like the sample mean X̄, the expected value of the sample proportion "P" is the population proportion π, because, after all, the sample whose proportion has been computed belongs to the main population. Therefore the Expected Value of any sample proportion is the population proportion: E(Ps) = π.

The parameter known as the Standard Error of the Proportion (whose mathematical nature we shall not have time to discuss in this elementary and applied course) is denoted symbolically as:

σ_P = √[ π(1 − π) / N ]

An estimate of this standard error using the sample proportion is:

S_P = √[ P(1 − P) / n ]

The symbols used in these two expressions for the standard errors of the proportion are translated in the following manner:

σ_P = standard error of the population proportion
S_P = standard error of the sample proportion
P = sample proportion for successful events, trials or characteristics
π = the population proportion parameter for successful trials
1 − π = the population proportion parameter for unsuccessful trials
1 − P = the sample proportion for unsuccessful trials
N, n = the number of observations in the population and in the sample, respectively

The associated normal deviate "Z" for the proportion, which is used with the standard normal curve to compute the position of a sample having a specified characteristic in any population whose population proportion is π, is:

Z = (Ps − π) / σ_P

where σ_P is the standard error of the proportion defined above (estimated by S_P when π is not known). The behavior of this normal deviate is identical to that of the normal deviate computed using raw observations (or scores), which we considered earlier. The Standard Normal Curve is used to estimate the positions of sample proportions in the same manner as we used the same curve to estimate the positions of individual members of the population under the normal curve. Refer to pages 69 to 74 in this document. We now need to compute the standard error of the proportion using the information which we have just learned.
After this we shall compute the normal deviate "Z" using the sample proportion and the population proportion. Then, using the number of standard errors which we calculate, we shall derive the required probability, as illustrated by Figure 3 - 6.

Figure 3 - 6: The shaded area is above P = 0.7.

Example
An orange farmer has been informed by the orange tree-breeders that 0.4 of all oranges harvested from a certain type of orange tree within Tigania East will have some green color patches on their skins. This is what has been found after a long period of orange tree breeding in that part of the country. What is the probability that more than 0.7 of ten (10) randomly selected oranges from all the trees within that area will have green patches mixed with orange patches on their skins?

Solution
1. In this example the long observation of orange skin color indicates that whatever was obtained after the long period of breeding is a population probability of success π. This is what will be used to compute the standard error of the proportion. This means that π = 0.4 and 1 − π = 0.6. The underlying assumption is that sampling has been done from a normal population. The latest sampling experience reveals a sample proportion Ps = 0.7. The number of oranges in this sample (n) is ten.

2. The standard error of the proportion is:

σ_P = √[ π(1 − π) / n ] = √[ 0.4 × 0.6 / 10 ] = 0.1549

3. The normal deviate is computed using the following method:

Z = (Ps − π) / σ_P = (0.7 − 0.4) / 0.1549 = 1.937

4. This means that, in terms of its characteristic, the current sample lies 1.937 standard errors away from the mean (typical) characteristic or quality, on the higher side. Remember that the typical characteristic or quality in this case is the population proportion of success, which we saw was π = 0.4. What remains now is to use the normal deviate to compute the required probability. We now introduce a simple expression for giving this kind of instruction. This type of expression will be used extensively in the following chapters; we use it here by way of introduction.

5. P(Ps ≥ 0.7) = P[ Z ≥ (0.7 − 0.4)/σ_P ] = P(Z ≥ 1.937)

This expression reads: "The probability that the sample proportion Ps will be equal to, or greater than, 0.7 {expressed as P(Ps ≥ 0.7)} is the same as the probability that the number of standard errors away from the mean (π) will exceed the number of standard errors defined by Z = 1.937." This number of standard errors (Z = 1.937) has been computed using the expression in step (3) above, the same way as we had done previously.

6. To obtain the probability that more than 0.7 of ten (10) randomly selected oranges from all the trees within that area will have green patches mixed with orange patches on their skins, we need the area beyond Z = 1.937; that is, we obtain the area between the mean and Z = 1.937 and subtract it from the total probability on that half of the normal curve, which is 0.5000. To do this we look at the standard normal tables and find the area subtended by Z = 1.937 standard deviations above the mean. From now henceforth we shall describe this kind of area with the expression A_Z; in this regard A_Z = A(1.937). Looking at the Standard Normal Tables on page 487 (King'oriah, 2004), we go down the left-hand "Z" column until we find the value 1.9. Then we find where that row intersects the column labeled .03, because .037 is not available. The figure at the intersection of the "1.9 row" and the "0.03 column" of the Standard Normal Table is 4732. We therefore conclude that:

A_Z = A(1.937) ≈ 0.4732

7. Remember this lies on the upper half of the standard normal frequency distribution, which comprises 0.5000 of the total distribution, or 0.5 out of the total available probability of 1.0. The probability that more than 0.7 of the sample of ten (10) randomly selected oranges from all the trees within this area will have green patches mixed with orange patches on their skins is therefore the difference between the total area of this half of the curve (0.5000) and the value A_Z = 0.4732:

P(Ps ≥ 0.7) = 0.5000 − 0.4732 = 0.0268

This means that if the farmer takes any ten oranges, there is only about a 2.7% chance that more than 0.7 of them will have green patches mixed with orange patches on their skins.
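The orange example can be checked with the short sketch below, which is our own illustration (scipy assumed available). It reproduces the standard error of the proportion, the normal deviate and the tail probability.

import math
from scipy.stats import norm

pi, n, p_sample = 0.4, 10, 0.7
se_p = math.sqrt(pi * (1 - pi) / n)      # standard error of the proportion, about 0.1549
z = (p_sample - pi) / se_p               # normal deviate, about 1.94 (1.937 in the hand computation)
prob = 1 - norm.cdf(z)                   # P(Ps >= 0.7), about 0.026
print(round(se_p, 4), round(z, 3), round(prob, 4))

The computed probability of roughly 0.026 agrees, to within rounding of the table entries, with the 0.0268 obtained above.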
Using the same kind of logic we can do one of these examples using the standard normal curve and the normal deviate Z, just to demonstrate the similarities. Observe all the steps very closely and compare them with the example we have just finished.

Example
The Institute of Primate Research has established that a certain type of tropical monkey found in the Mount Kenya Forest has a mean life span of μ = 24 years and a standard deviation of σ = 6 years. One hundred monkeys were caught by the researchers. Find the probability that the mean life span of this sample of monkeys will be more than 25 years. (Source: hypothetical data, as in King'oriah 2004, page 135.)

Solution
1. μ = 24 years, σ = 6 years, n = 100 monkeys.

P(X̄ > 25) = P[ Z > (25 − 24) / (6/√100) ] = P[ Z > 1 / 0.6 ] = P(Z > 1.6667)

2. The area defined by 1.6667 standard errors of X̄ above the population mean μ can now be found from the tables: A_Z = A(1.6667) ≈ 0.4525. This area lies in the upper portion of the normal distribution.

3. The area above A_Z = 0.4525 can only be the difference between the 0.5000 of the distribution lying in the upper portion and A(1.6667) = 0.4525:

0.5000 − 0.4525 = 0.0475

This means that there is only a 4.75% probability that the mean life span of a sample of one hundred of these monkeys will exceed 25 years.
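The same steps can be carried out in a minimal sketch (our own illustration, scipy assumed available):

import math
from scipy.stats import norm

mu, sigma, n = 24.0, 6.0, 100
se_mean = sigma / math.sqrt(n)           # standard error of the sample mean, 0.6
z = (25.0 - mu) / se_mean                # about 1.667
print(round(1 - norm.cdf(z), 4))         # P(sample mean > 25), about 0.048, in line with the 4.75% from the tables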
Activity
1. Look at the textbook (King'oriah 2004, page 167) for a computation of this kind, which involves the standard error of the sampling distribution of sample means and the use of the t-distribution. While you are doing so, you are advised to read Chapters Five, Six and Seven of the textbook in preparation for the coming Chapter Four of these guidelines.
2. Attempt as many problems as you can at the end of Chapters Five and Six in the same textbook. Tutors may set as many of these problems as possible for marking and grade awards.

EXERCISES
1. (a) Explain why the proportion is regarded as a kind of mean.
   (b) In each of the following situations, a random sample of n parts is selected from a production process in which the proportion of defective units is π. Calculate the standard error of the sample proportion, and the normal deviate that corresponds to the stated possible value of the sample proportion p.

Case  | π    | n     | p
(i)   | 0.1  | 100   | 0.16
(ii)  | 0.2  | 1,600 | 0.215
(iii) | 0.5  | 25    | 0.42
(iv)  | 0.01 | 9,900 | 0.013

2. In Sarah Mwenje's coffee farm, the yield per coffee bush is normally distributed, with a mean yield of 20 kg of ripe cherries per bush and a standard deviation of 5 kilos.
   (a) What is the probability that the average weight per bush in a random sample of n = 4 bushes will be less than 15 kg?
   (b) Assuming the yields per bush are normally distributed, is your answer in (a) above meaningful? Explain why.

3. (a) Explain, with reasons, why the expected value of a sample mean from any population is the parameter mean of that population.
   (b) State the Central Limit Theorem.
   (c) Why do you think the Central Limit Theorem works for any population with any kind of distribution and population mean?
   (d) How does the standard error of the mean assist in model building for all populations?

CHAPTER FOUR
STATISTICAL INFERENCE USING THE MEAN AND PROPORTION

Population and samples

It is often the task of a scientist, whether social, behavioral or natural, to examine the nature of the distribution of some variable character in a large population. This entails the determination of values of central tendency and dispersion, usually the arithmetic mean, the standard deviation and the proportion. Other features of the distribution, such as skewness, its peakedness (the so-called kurtosis), and a measure of its departure from some expected distribution, may also be required.

The term "population" is generally used in a wide but nevertheless strict sense. It means the aggregate of objects or events, not necessarily people, which vary in respect of some variable of interest. For example, one may talk about the population of all the school children of a certain age group in a given area. In biostatistics, the botanical population of some particular plant is often the subject of investigation. In the industrial sciences the population of all defective goods on a production line could be the subject of interest, and so on. In practice, the examination of a whole population is often either impossible or impracticable. When this is so, we commonly examine a limited number of individual cases which are part of the population; that is, we examine a sample of the population. The various distribution constants of the sample can then be determined, and on this basis the constants of its parent population can be estimated. Our knowledge of the sample constants can be mathematically precise. On the other hand, we can never know with certainty the constants of the parent population; we can only know what they probably are.

Whenever we make an estimate of a population characteristic from a sample, we are faced with the question of the precision of the estimate. Obviously, we aim at making our estimates as precise as possible. We shall presently see that the precision of an estimate can be stated in terms of the probability of the true value being within such-and-such a distance of the estimated parameter. There are one or two useful terms peculiar to the study of sampling, and it will be convenient for the reader to become familiar with them at the outset. The various constants, such as the mean, the standard deviation, etc., which characterize a population are called the population parameters. Parameters are the true population measures, and they cannot normally be known with certainty. The corresponding measures computed from samples can be known with precision, and these are called Sample Statistics. Thus sample statistics are estimates of population parameters. The precision of these estimates constitutes the so-called reliability of the statistics, and we shall see later that there are techniques which enable us to infer population parameters with high degrees of accuracy.
The Process of Sampling

Before concerning ourselves with the reliability and significance of statistics, it is necessary to have clearly in our minds the essential facts about the process of sampling. In general, the larger the sample size, the greater the degree of accuracy in the prediction of the related population parameters. As the sample becomes progressively larger, the sheer mass of numbers reduces the variation in the various sample statistics, so that the sample is more and more able to represent the population from which it was drawn. This does not mean, however, that samples must always be large. Even small samples can do. What matters is that the sample, to the best of our knowledge, is representative of the population from which it was taken. To achieve this, certain conditions must be satisfied in selecting the sample. If this is done, it is possible to reduce the size of the sample without sacrificing too much of the degree of accuracy which would be attained from using a larger sample.

Our chief purpose in taking samples is to achieve a practical representation of the members of the parent population, so that we can conveniently observe the characteristics of that population. For example, there are situations in the testing of materials for quality, or of manufactured components for strength, when the items under consideration are tested to destruction. Obviously, sampling is the only possible procedure here. In order for the test to be economical, it is important to estimate how many test pieces are to be selected from each batch, and how the selection of the pieces is to be made.

Our definition of populations implies that they are not always large. But very often they contain thousands, or even millions, of items. This is particularly true of investigations concerning characteristics or attitudes of individuals. In the case of such large populations, sampling is the only practical method of collecting data for any investigation. Even in cases where it would be possible from a financial point of view, measuring a characteristic of the total population is really not necessary. Appropriately selected small samples are capable of providing material from which accurate generalizations about the whole population can successfully be made. In the interest of efficiency and economy, investigators in the various data fields of the social, behavioral and natural sciences invariably resort to sampling procedures and study their subject populations by using sample statistics.

Selecting a sample

What is necessary in selecting a good sample is to ensure that it is truly representative of the larger population. The essential condition which must be satisfied is that the individual items must be selected in a random manner. The validity of sampling statistics rests firmly on the assumption that this randomizing has been done; without it, the conclusions which may be reached using unrepresentative samples may be meaningless. To say that the items must be selected in a random manner means that chance must be allowed to operate freely, and that every individual in the subject population must have an equal chance of being selected. Under these conditions, and under no other, if a sufficiently large number of items is collected, the sample will be a miniature cross-representation of the population from which it is drawn.
It must be remembered that, at best, sample statistics give only estimates of population parameters, from which conclusions must be drawn with varying degrees of assurance. The more accurately the sample represents the true characteristics of the population from which it is drawn, the higher the degree of assurance that is obtainable. It will be appreciated that, despite this limitation, without sample statistics it would be impossible to reach any generalized conclusions of either scientific or practical value.

Techniques of Sampling

In general there are two major techniques which are used in compiling samples. One technique is called Simple Random Sampling, and the other Stratified Random Sampling. Within these there are various modifications of the major genre, which the learner is advised to search for and peruse in relevant texts on research methodology, some of which are listed at the end of this chapter.

Simple Random Sampling refers to the situation indicated earlier, in which the choice of items is so controlled that each member of the population has an equal chance of being selected. The word "random" does not imply carelessness in selection, nor does it imply hit-or-miss selection. It indicates that the procedure of selection is so contrived as to ensure the chance nature of the selection. For example, if in any population the names of individuals are arranged in alphabetical order, and a one percent sample is required, the first name may be selected by sticking a pin somewhere in the list and then taking every one hundredth name following it. Such a selection would lack any form of bias. Another method commonly used is to make use of the Random Number Tables [King'oriah, (2004), pages 484 - 487]. All individuals may be numbered in sequence, and the selection made by following the random numbers in any systematic way; the numbers themselves have been recorded in sequence using a form of lottery procedure by the famous Rand Corporation.

The process of obtaining a random sample is not always that simple. Consider, for example, the problem of obtaining the views of housewives on a particular product in a market research project in a large city. The city may first be divided into a number of districts, and then each district into numbered blocks of approximately equal size. A selection of districts can be made by drawing lots, and the interviewer allocated to each selected district can then choose housewives from blocks selected in a similarly random manner within that district. In this way, a simple random sample of market attitudes to the product may be obtained. To secure a true random sample for a statistical study, great care must be exercised in the selection process. The investigator must be constantly aware of the possibility of bias creeping in.
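Random selection of the kind described here is easy to mechanize. The sketch below is our own illustration, with a hypothetical sampling frame; drawing a simple random sample in software is the counterpart of using a random number table.

import random

random.seed(42)                                                # fix the seed so the draw can be repeated
population = ["household_" + str(i).zfill(4) for i in range(1, 2001)]   # hypothetical sampling frame
sample = random.sample(population, k=20)                       # each unit has an equal chance of selection
print(sample[:5])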
Another common technique, which can be used to improve the accuracy of sampling results, to prevent bias and to assure a more representative sample, is called Stratified Random Sampling. In essence, this means making use of known characteristics of the parent population as a guide in the selection. A good example of this can be found in opinion polling. Suppose an investigation is undertaken to assess public attitudes to a proposed major reform in the private sector of the education system. It is probable that the political parties will tend to have diverse views on how to go about the education reform. It is also probable that people in different economic and social groupings, such as the professional societies, business interests, religious groups, skilled and non-skilled artisans, etc., will tend to react to the proposed education reforms systematically as groups. There might even be differences of opinion in general between men and women, and more probably still between other divisions of the population, such as urban and rural people, or regional and educational groupings. Obviously, in any given case, not all possible variables are necessarily important. The whole population is studied to ascertain what proportions fall into each category: into individual political blocks, men or women, town and country, and so on. Steps are then taken to ensure that any sample will have proportional representation from all the important sub-groups, the selection of items in each sub-group being carried out, of course, in the manner of simple random sampling. Clearly, the stratification made of the population will depend on the type and purpose of the investigation; but, where used, it will appreciably improve the accuracy of the sampling results and help to avoid the possibility of bias. Essentially it constitutes good systematic control of the experimental conditions. A stratified random sample is always likely to be more representative of a total population than a purely random one.

Paradoxically, purposeful sampling may be used to produce a sample which represents the population adequately in some one respect or another. For example, if the sample is of necessity very small, and there is a wide scatter among the parent population, a purely random sample may by chance yield a mean measure of the variable which is vastly different from the population mean. Provided that we are concerned with the mean only, we may in fact get nearer the truth if we select for our small sample only those individuals who seem, on inspection, to be close to what appears to be the population average. Where the required random characteristic is lacking in making the selection, a biased sample results, and such a sample must contain a systematic error. If certain items or individuals have a greater chance of being selected, the sample is not a true representation of the parent population. Assuming that the procedures are scientifically satisfactory, we wish to see how and when the conclusions based on observational and experimental data are, from a mathematical standpoint, statistically warrantable or otherwise. Having emphasized that unbiased sampling is a prerequisite of an adequate statistical treatment, we must begin to discuss the treatment itself.

Hypothesis testing

Introduction

One of the great corner-stones of the Scientific Method of Investigation is the clear statement of the problem through the formulation of hypotheses. This is usually followed by a clear statement of the nature of the research which will be involved in testing the hypotheses, together with a clear decision rule. The process lets all the other researchers in the field of academic enquiry know that the hypothesis has been properly tested and that the envisaged problem has been solved. It is considered unethical in research circles to state one's hypotheses, research methodology and decision rule after one has already seen what the data look like, or the possible trend of events. Hypotheses and the associated research methodology are therefore formed before any sampling or any data manipulation is done.
Thereafter, field data are used to test the validity of each such hypothesis. A hypothesis is therefore a theoretical proposition which is capable of being tested, statistically or indirectly. It is a statement about some future event which may be unknown, or only vaguely known, at the time of prediction, but which is set out in such a manner that it can either be accepted or rejected after appropriate testing. Such testing can be done statistically, or using other tools of data analysis and organization of facts. In this chapter we are interested in situations where quantitative or qualitative data have been gathered from field observations and statistical methods are then used to test hypotheses about the same data.

The Null and Alternative Hypotheses

Hypotheses meant to be tested statistically are usually formulated in a negative manner. It is expected that in stating such hypotheses one should concede every chance that the desirable event will fail to happen, so that, should the desired event take place despite such a conservative approach, one is in a position to confirm that it did not occur as a matter of chance. It is like a legal process, in which a person accused of a criminal offence in a court of law is presumed innocent until proved guilty beyond all reasonable doubt. This kind of strict ethical behavior and code of ethics is applied to scientific research. The negative statement of the suspected truth, which is going to be investigated through data collection and data manipulation, is called a Null Hypothesis. For example, if one suspects that there is a difference between two cultivars of millet, the research hypothesis statement must take the following two forms:

(a) Null Hypothesis (Ho): There is no difference between the two groups of millet cultivars, and any difference that exists is due to mere chance.

(b) Alternative Hypothesis (HA): There is a marked and statistically significant difference between the two groups of millet cultivars.

After this, statistical tools are used to test the validity of the data, and to see which of the two statements is supported by field investigations. Accepting the null hypothesis means rejecting the alternative hypothesis, and vice-versa.

Steps in Hypothesis Testing

1. Formulate the null hypothesis, after familiarization with the actual facts and after realizing that there is a suspected research problem.

2. Formulate the alternative hypothesis, which is always the complement and the direct opposite of the null hypothesis.

3. Formulate the decision rule under which, if the facts fall in line with the rule, the null hypothesis will be accepted; otherwise the null hypothesis will be rejected and its complement, the alternative hypothesis, will be accepted. This decision rule must contain clear criteria of success or failure regarding either of the two hypotheses. These three steps are performed before the researcher goes to collect the data in the field, or before manipulating any data if the source of data is documentary. Other researchers must know clearly what the present researcher is up to, and that the current researcher is not going to "cook" the success of the experiment in the field or from documents.

4. Collect and manipulate the data in accordance with the chosen statistical or probability model, e.g. the Normal Curve, the t-distribution or any other.
This is where the method of measurement is applied rigorously to test whether the data support or reject the null hypothesis, thus rejecting or accepting the alternative hypothesis.

5. Examine the results of the data manipulation and see whether the decision rule has been "obeyed". If so, accept the null hypothesis; if not, reject it and accept the alternative hypothesis.

Let us now do a small hypothetical example, so that we can see how hypothesis testing is done.

Example
Investigate whether there is any difference in the weights of male and female goats of the special breed found in the Buuri area of Meru District.

Solution
The process of hypothesis testing would proceed in the following manner.

Step One: Null Hypothesis (Ho): There is no difference in the weights of the male and female goats found in the Buuri area of North Imenti District.

Step Two: Alternative Hypothesis (HA): There is a marked and statistically significant difference between the weights of the male and female goats of the type found in the Buuri area of North Imenti District.

Step Three: Decision Rule: The null hypothesis will be tested at what statisticians call a confidence level. In symbolic terms this level is designated with a bold letter C. If you are writing in longhand, as many of us will be doing, it pays to cross your capital letter this way, "C", so that the reader of your manuscript will know you are talking about the bold capital C. In our case, let us adopt a 95% confidence level: C = 0.95. This confidence level means that, when the Normal distribution model is used for testing these hypotheses, we would consider the weights similar if the weights of the female and male goats fall within the area of the normal curve subtended by 1.96 standard deviations on both sides of the mean.

Figure 4 - 1: The probability of similarities and differences on both sides of the mean

Each side of this curve which is within 1.96 standard deviations of the mean (μ) covers a proportion of 0.4750 out of 1.0000 (the total available proportion). Both sides of (μ), namely −1.96 to the left and +1.96 to the right, would together include 0.4750 × 2 = 0.9500 out of the total area under the normal curve, which is 1.0000. Also remember that, with respect to the Normal Distribution, the word "proportion" is synonymous and identical to that other very important word that we have now become used to: "probability". Now examine Figure 4 - 1 carefully for these details. The diagram illustrates the fact that we are testing the characteristics of our population at a 0.9500 probability level of confidence, meaning a 95% confidence level. The complement of this level of confidence is what statisticians call the Significance Level, denoted by the Greek letter "α". This is obviously:

1.0000 − 0.9500 = 0.0500 = 5% significance level, or "α"

Therefore, if we are testing the characteristics of our population (or any sample) at the 95% confidence level, we are at the same time testing them at the 5% significance level. In research methodology and statistical analysis these two terms are used interchangeably. Figure 4 - 1 is an illustration of these levels of statistical significance.
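The 1.96 figure used in the decision rule comes directly from the standard normal distribution. The small sketch below (our own illustration, scipy assumed available) shows how the critical value follows from the chosen confidence level.

from scipy.stats import norm

confidence = 0.95
alpha = 1 - confidence                    # significance level, 0.05
z_critical = norm.ppf(1 - alpha / 2)      # two-tailed critical value, about 1.96
print(round(alpha, 2), round(z_critical, 2))
print(round(norm.cdf(1.96) - norm.cdf(-1.96), 4))   # about 0.95 of the area lies within +/-1.96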
Weigh all the goats of your chosen sex which are available to you, whether as a sample or a population (in our case, where we are using the Standard Normal Curve, we assume a population of all goats in Buuri). We can then take the mean weight of the female goats to represent the population mean (μ), the parameter against which we shall test the mean-weight statistic of all the male goats. If, within our desired confidence level (or significance level), there is no statistically significant difference between the weight of the male goats and that of the female goats, we shall accept the Null Hypothesis (Ho): that there is no difference in the weight of male and female goats found in the Buuri area of North Imenti District.

Step Five: The Rejection Regions: At the end-tails of Figure 4 - 1 you will notice that there is a region of dissimilarity on either side, each valued at α/2. This means that whenever an alpha level (or significance level) is stated in a problem of this kind, which involves a test of similarity, the alpha or significance level is divided into two. It has to be distributed between the upper tail and the lower tail, where observations which do not belong to the population of interest are expected to be located. Any observation beyond 1.96 standard deviations from the universal (population) mean, on either side of the mean, is either too great (too heavy in our example) or too small (too light in our example). Any observation falling in these regions is not a member of our population; our population is clustered about the mean on both sides. This means that when we cut off the five percent which does not belong to the population comprising the female goats, we are actually distributing that five percent of dissimilarity to both sides of the population mean. Therefore, at the 5% significance level we divide the 5% into two, assigning 2.5% to the upper-tail and 2.5% to the lower-tail region of dissimilarity. On each tail-end we expect to find 2½% of observations which do not belong to the main body of similarity. Any mean weight or observation of the male goats which falls within these rejection regions does not belong to the main body of the female goats, in terms of body-weight measurements. Specifically, when we compute the mean weight of the male goats, if that mean value, expressed in standard deviations, falls within the rejection region on either side of the normal curve, we must conclude that the weights of the male goats are not the same as the weights of the female goats. The same logic applies whether we are using the standard error of the sampling distribution of the proportion or of the sampling distribution of sample means. Using other statistics, which we shall learn later, we will find that any observation which falls within the rejection regions (defined by alpha levels of whatever kind) is rejected as not belonging to the main body of the population whose observations lie clustered close to the parameters of interest. In this case the parameter of the population is the mean (μ).
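The decision logic just described can be sketched in a few lines of Python. The figures used below (a female-goat mean of 32 kg, a standard error of 1.5 kg and a male-goat sample mean of 35 kg) are purely hypothetical, since the goat example gives no field data; only the comparison against ±1.96 standard errors is taken from the discussion above.

    # Two-tailed test of similarity at the 5% significance level (alpha/2 in each tail)
    mu_female = 32.0      # hypothetical parameter mean (kg)
    se = 1.5              # hypothetical standard error of the mean (kg)
    xbar_male = 35.0      # hypothetical male-goat sample mean (kg)

    z = (xbar_male - mu_female) / se   # distance from the mean in standard errors
    critical = 1.96                    # cuts off 0.025 in each tail of the normal curve

    if abs(z) > critical:
        print("Reject Ho: the male goats differ significantly in weight")
    else:
        print("Accept Ho: no statistically significant difference")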
Now let us attempt a more realistic example using the same kind of logic, so that we can see how hypothesis testing is done using statistical analysis and confidence intervals.

Example: Over many sittings of the Statistics examination in a certain university, the mean score has been found to be 60% (treated as a raw score or count, not as a proportion). This year a class of 25 students sat a similar examination. Their average was 52%, with a standard deviation of 12%. Are the results of this year's examination typical, at the 5% alpha level?

Solution:

Step One: Null Hypothesis (Ho): μ = 60%. There is no difference between the scores of this year and those of all the other years. Alternative Hypothesis (HA): μ ≠ 60%. There is a statistically significant difference between the scores of this year's class and those of all other years' classes.

Step Two: Decision Level: We test this null hypothesis at the 95% confidence level, that is, at the five percent alpha (significance) level. The sample size is n = 25. This is a small sample, fewer than 30 students, so the statistical model we shall use is the t-statistic. Accordingly we must adjust the sample by one degree of freedom: n − 1 = 24 degrees of freedom.

Step Three: The critical value of t: Now is the proper time to learn how to use the t-table at page 498 of your textbook (King'oriah, 2004). Turn to that page and look at the top two rows. One row has figures ranging from .10 to .0005 and is labelled "Level of significance for one-tailed test"; the row below it has figures ranging from .20 to .001 and is labelled "Level of significance for two-tailed test". Considering our sample, it is safe to use a one-tailed test, because the observation lies below the universal mean (μ). On the t-table we therefore use the column labelled ".05" under "Level of significance for one-tailed test". The left-most column records the degrees of freedom. We adjusted our sample size n = 25 for the loss of one degree of freedom to get 24. We now look for the intersection of 24 degrees of freedom (down the left-most column) and the ".05" column within the body of the table. At this intersection we find the expected value of t (designated t_α and pronounced "tee alpha"): t_α = 1.711, which we extract from the tables. This means that the rejection area begins 1.711 standard errors of the sample mean away from the mean (μ = 60%). See Figure 4 - 2. Because the sample mean lies below μ, the rejection region (the "−α area") lies on the left-hand side of the mean.

Step Four: The Standard Error: At this point we have not yet computed the standard error of the mean. This is as it should be, because this kind of standard error is usually computed after stating the decision rule very clearly, and not before. We have been given the raw standard deviation of 12%, and we assume that this figure was computed from the raw scores using the usual formula:

    S = √[ Σ (Xᵢ − X̄)² / (n − 1) ] = 12%.

The standard deviation is given as 12% in this example to save time, and because we do not have the raw scores of all the classes which preceded this year's class. Using this figure of 12%, we now compute the standard error of the mean (S_X̄) so that we can use the t-distribution table (page 498 of your textbook):

    S_X̄ = S / √n = 12 / √25 = 12 / 5 = 2.40.
Step Five: The computed (observed) value of t from the given facts: Now we compute the actual t-statistic for this year's sample using the formula

    t = (X̄ − μ) / S_X̄ = (52% − 60%) / 2.40.

Figure 4 - 2: The distance of X̄ = 52% from μ = 60% in terms of standard errors.

We repeat that the percentages used in this formula to compute the actual or observed "t" are used as raw scores, and not as proportions, in order to avoid any confusion with the computation of the standard error of the proportion which we considered in Chapter Three of this module. Now let us proceed with the computation of the actual value of "t":

    t = (X̄ − μ) / S_X̄ = (52% − 60%) / 2.40 = −8 / 2.40 = −3.333.

In terms of actual standard errors, this is how far the current sample mean X̄ = 52% lies from the universal (population) mean (μ = 60%): it is −3.333 standard errors away. We demonstrate this fact by means of Figure 4 - 2. Since −3.333 lies beyond the critical value of −1.711, we conclude that this year's class mean of 52% is far below the usual average score of all the previous classes. Therefore we reject the null hypothesis that there is no difference between the scores of this year and those of all the other years, and we accept the alternative hypothesis that there is a statistically significant difference between the scores of this year's class and those of all other years' classes. The null hypothesis has been rejected at the 5% alpha level.
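The arithmetic of this example can be verified with a short Python sketch. All the numbers are taken from the example above; the critical value 1.711 comes from the t-table cited in the text (scipy.stats could equally supply it, but the hard-coded value keeps the sketch self-contained).

    from math import sqrt

    mu0 = 60.0        # long-run mean score (%)
    xbar = 52.0       # this year's class mean (%)
    s = 12.0          # sample standard deviation (%)
    n = 25            # class size

    se = s / sqrt(n)              # standard error of the mean: 12 / 5 = 2.40
    t = (xbar - mu0) / se         # observed t: -8 / 2.40 = -3.333
    t_critical = 1.711            # one-tailed 5% point at n - 1 = 24 degrees of freedom

    print(round(se, 2), round(t, 3))
    if t < -t_critical:
        print("Reject Ho: this year's mean is significantly below 60%")
    else:
        print("Accept Ho: the result is typical")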
Confidence Intervals

It is now time to learn a closely related concept in statistical investigation - the Confidence Interval. In setting up this interval on both sides of the mean (μ), the onus of deciding how much risk to take lies solely with the statistician; there is no hard and fast rule about this. Sometimes one may use more than one confidence level (say 95% and 99%) in a single experiment to test the sensitivity of the model. The risk of making an error after settling on the appropriate confidence level is also borne solely by the statistician. The probability of making an error after setting up this confidence interval is the alpha level - the one we have discussed above. The confidence interval "CI" is the probability that the observed statistic belongs to the population from which the parameter has been observed. It takes the form of the actual number of standard deviations (or standard errors) which delineate the typical data as end-limits, with the rejection region containing all the data which does not belong to the subject population. The technique is no different from what we have just discussed in the preceding section; only the approach differs. In this case we are building a probability model for investigating the chances of finding a statistic within the band of similarity which is typical of some known parameter. The expression for the confidence interval looks like this:

    CI = P( μ = X̄ ± Z σ_X̄ ).

This statement says that the confidence interval (CI) is the probability "P" that the population mean (μ) will be found within a specified number of standard errors of the sample mean, where σ_X̄ is the actual standard error computed from the raw scores and Z is the number of such standard errors (for the purposes of standardization) on both sides of the mean. The same statement can be arranged in another, simpler fashion:

    CI = P( X̄ − Z σ_X̄ ≤ μ ≤ X̄ + Z σ_X̄ ).

The meaning of this expression is: within the probability "P", on both sides of the sample mean X̄, in a band defined by a specific number of actual standard errors Z σ_X̄, we expect to find the mean of the population (μ). Note that the population mean (μ) lies in the centre of the expression. This means that the expression allows the population mean (μ) to slide randomly between the end-limits of the interval set athwart the sample mean X̄, as long as (μ) does not go outside the interval we have set using our specified significance level. We include the significance level in the model by specifying the alpha level and stating it within the confidence interval notation, as follows:

    CI = P( X̄ − Z_α σ_X̄ ≤ μ ≤ X̄ + Z_α σ_X̄ ).

Here Z_α is the number of standard errors corresponding to the alpha level set by the statistician, and σ_X̄ denotes the actual standard error computed from the given data. Remember that, using raw data, the standard error of the mean is computed with the formula σ_X̄ = σ / √n. When we multiply this actual value by the expected number of standard errors obtained from the Standard Normal Table (Z_α), we obtain the actual standardized figure (in observational scores) which lies Z_α standard errors away from the statistic X̄. Let us now do a simple example of interval building to facilitate our understanding of this important concept.

Example: From your field experience, you have found that the mean weight of 9-year-old children in a certain district in central Kenya is 45 kilograms. You came to this conclusion after weighing 100 children randomly sampled from all over the district. The standard deviation of the weights in this large sample is 15 kilograms. Where would you expect the true mean of this population to lie at the 95% confidence level?

Solution:
1. In this example the level of confidence is C = 0.95, or 95%. Under the Standard Normal Table this corresponds to an area of 0.4750 on each side of the mean - the value we have hitherto called A_Z - which is subtended by Z = 1.96 standard errors.
2. Therefore A_Z = 0.4750 and Z = 1.96. Now let us compute the actual standard error (in kilograms) of the sample means obtained from our field data.
3. σ_X̄ = σ / √n = 15 kg / √100 = 1.5 kg. Next we build the confidence interval by inserting our field data into the formula:

    CI = P( X̄ − Z σ_X̄ ≤ μ ≤ X̄ + Z σ_X̄ )
       = P( 45 kg − 1.96 × 1.5 kg ≤ μ ≤ 45 kg + 1.96 × 1.5 kg )
       = P( 45 kg − 2.94 ≤ μ ≤ 45 kg + 2.94 )
       = P( 42.06 ≤ μ ≤ 47.94 ).

4. The true mean weight of this population (μ) is likely to be found between 42.06 kilograms and 47.94 kilograms. Any mean lying outside these end-limits does not belong to the population of the children you found and weighed during your field research.
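The interval just built can be reproduced with a short Python sketch; the figures are those of the children-weights example above.

    from math import sqrt

    xbar = 45.0     # sample mean weight (kg)
    sigma = 15.0    # standard deviation of the weights (kg)
    n = 100         # children weighed
    z = 1.96        # normal deviate for a 95% confidence level

    se = sigma / sqrt(n)            # 1.5 kg
    lower = xbar - z * se           # 42.06 kg
    upper = xbar + z * se           # 47.94 kg
    print(f"95% CI for the mean weight: {lower:.2f} kg to {upper:.2f} kg")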
Type I and Type II Errors

The interval set on both sides of the sample mean may lie so far out that it does not include the population mean. This happens if the sample used in deriving the mean is so unusual (atypical) that its mean lies in the critical (rejection) regions of the sampling distribution of sample means. All precautions must be taken to set up a good research design and to use accurate sampling techniques to avoid this eventuality. This is done by setting up an appropriate interval in all our experiments. If the controlling interval around the subject sample mean is too narrow, there is a chance of rejecting a hypothesis, through the choice of that interval, when in fact the hypothesis is true. This is called making a Type I error. On the other hand, a very wide interval exposes the researcher to the risk of making a Type II error. This is when the interval is so wide that he ends up accepting, into the fold of the population defined by his standard normal curve model, members who may actually not belong to the subject population. The researcher is usually in a dilemma. If he chooses a narrow interval, he increases the error caused by excluding the parameter, although the parameter could actually be included within a properly set interval; this amounts to rejecting the null hypothesis even if it is true, and therefore committing a Type I error. On the other hand, if he chooses a wide interval, he increases the error caused by including a parameter from outside, although the parameter could actually be excluded from a properly set interval; this amounts to accepting the null hypothesis even if it is false, and therefore committing a Type II error. We can therefore define our two types of error in the following manner:-

Rejecting a correct hypothesis through the choice of a narrow confidence interval, or by setting up large alpha (rejection) regions, amounts to making a Type I error. Accepting a false hypothesis through the choice of a wide confidence interval, or by setting up very small alpha (rejection) regions, amounts to making a Type II error.

We must therefore balance the setting of our confidence intervals carefully to guard against either of these two errors.

Confidence Interval Using the Proportion

Like the population mean, the population proportion may be estimated using confidence intervals. These may be built either around the parameter proportion "π" or around the statistic, the sample proportion "Ps". The logic involved in this estimation is identical to what we have discussed in the previous section - after all, the proportion is a kind of mean. The expression used in this estimation is analogous to the one we have just used; only the designation of the parameters and statistics within it differs, to reflect the fact that we are now dealing with the binomial variable, the proportion. The expression is as follows:

    CI = P( Ps − Z σ_Ps ≤ π ≤ Ps + Z σ_Ps ).

Study this equation well and you will find that:
    Ps = the sample proportion;
    σ_Ps = the sample standard error of the proportion;
    Z = the number of standard errors from the proportion parameter, determined by the confidence level we have set in order to build the model for estimating the population proportion;
    π = the universal or population proportion whose position within the Normal Curve model we are estimating.

The equation reads: "The confidence interval CI is the probability P that the parameter population proportion π will lie within the Standard Normal Distribution on both sides of the sample proportion Ps, which we are using as our comparison standard, in the area subtended by Z σ_Ps standard errors (of raw scores) of the proportion."
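Before the worked example which follows, here is a short Python sketch of the same interval-building technique for a proportion. The survey figures used (400 plants sampled, 40% showing a trait of interest) are hypothetical and are not part of the example below.

    from math import sqrt

    # Hypothetical survey: 400 plants sampled, 40% showing a trait of interest
    ps = 0.40
    n = 400
    z = 1.96                                  # 95% confidence level

    se_p = sqrt(ps * (1 - ps) / n)            # standard error of the proportion
    lower, upper = ps - z * se_p, ps + z * se_p
    print(f"95% CI for the population proportion: {lower:.3f} to {upper:.3f}")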
We now use a simple example to estimate the population parameter π.

Example: Nauranga Sokhi Singh is a sawmill operator within the Mount Kenya Forest. He wishes to calculate, at the 99% confidence level, the true proportion of undersize timber passing out of his sawmill's mechanical plane. He obtains a random sample of 500 strip boards, measures them and finds that 0.25 of them have been cut undersize by the machine. Mr. Singh is alarmed, and wants to know whether his sawmill will continue working at this shocking degree of error. Provide some statistical advice to your client, Mr. Singh. (Source: adapted from King'oriah, 2004, pages 169 to 170.)

Solution: The method of approaching this problem is the same as the one we have learned above.

Step One: C = 0.99 (or 99%) is given, especially because accuracy in obtaining timber strips from this machine is crucial.

Step Two: Using the sample proportion Ps = 0.25 we compute the standard error of the sample proportion:

    σ_Ps = √[ Ps (1 − Ps) / n ] = √[ 0.25 (1 − 0.25) / 500 ] = 0.0194.

Step Three: Obtain the area prescribed by the confidence level we have set. C = 0.99 (or 99%) is given, and therefore α = 0.01. Consequently, we know from the Standard Normal Tables that 0.9900 of the area (0.4950 × 2) is subtended by 2.57 standard errors on both sides of the population parameter π.

Step Four: Build the confidence interval:

    CI = P( Ps − Z σ_Ps ≤ π ≤ Ps + Z σ_Ps )
       = P( 0.25 − 2.57 × 0.0194 ≤ π ≤ 0.25 + 2.57 × 0.0194 )
       = P( 0.25 − 0.050 ≤ π ≤ 0.25 + 0.050 )
       = P( 0.20 ≤ π ≤ 0.30 ).

You tell Mr. Singh that the machine will continue to churn out between 0.2 and 0.3 undersize strip boards out of all the timber he will be planing, because the computation shows that the population parameter π of all the timber strips planed by this machine lies (and could slide) between 20% and 30% of all the timber it will be used to process. Advise Mr. Singh to have the machine either overhauled or replaced.

EXERCISES

1. Explain the meaning of the following terms:
   (a) The null hypothesis and the alternative hypothesis.
   (b) The standard error of the mean.
   (c) The standard error of the proportion.
   (d) The normal deviate, Z.

2. (a) Explain with reasons why the expected value of a sample mean from any population is the parameter mean of that population.
   (b) State the Central Limit Theorem.
   (c) Why do you think the Central Limit Theorem works for any population with any kind of distribution and population mean?
   (d) How does the standard error of the mean assist in model building for all populations?

3. (a) Explain the difference between hypothesis testing and statistical estimation.
   (b) Distinguish between the null and alternative hypotheses.
   (c) What is meant by a one-sided (single-tail) test, and a two-sided (double-tail) test?
   (d) Explain what is meant by a decision rule in statistical analysis.

4. (a) State the Central Limit Theorem.
   (b) Why do you think the Central Limit Theorem works for any population with any kind of distribution and population mean?
   (c) Why is the expected value of a sample mean from any population approximately equal to the population mean of that population?

5. A random variable has a normal distribution with a mean of μ = 102.4 and a standard deviation of σ = 3.6. What is the probability that this random variable will take on the following values:
   (a) 107.8
   (b) Greater than 99.7
   (c) Between 106.9 and 110.5
   (d) Between 96.1 and 104.2

6. A company fitting exhaust pipes to custom-made cars announces that you will receive a discount if it takes longer than 30 minutes to replace the silencer of your car. Experience has shown that the time taken to replace a silencer is approximately normally distributed with a mean of 25 minutes and a standard deviation of 2.5 minutes.
   (a) Explain with reasons the kind of probability model you would use to describe the distribution of the time it takes to replace silencers in custom-made cars.
   (b) What proportion of customers will receive a discount?
   (c) What proportion of the silencers take between 22 and 26 minutes to replace?
CHAPTER FIVE

THE CHI-SQUARE STATISTIC

Introduction

In some research situations one is confronted with qualitative variables for which the Normal Distribution is meaningless, either because of the small sizes of the samples involved or because the quantities being measured cannot be expressed in exact terms. Qualities such as marital status, the colour of cultivar skins, the sex of the animals being observed, and so on, are the relevant variables, instead of the exact numbers which we have used to measure variables all our lives. Chi-square is a distribution-free statistic which works at lower levels of measurement, such as the nominal and ordinal levels.

The Chi-Square Distribution

The Chi-Square Distribution is used extensively in testing the independence of any two distributions. The main clue to understanding this distribution lies in understanding its generalized uses, and then applying it to specific experimental situations. The question of interest is whether or not the observed proportions of a qualitative attribute in a specific experimental situation are identical to those of an expected distribution. For this comparison, the observed frequencies are compared with those of similar situations - which are called the expected frequencies. The differences are then noted and manipulated statistically in order to test whether they are significant. The expression for the Chi-Square statistic is

    χ² = Σ (Oᵢ − Eᵢ)² / Eᵢ , summed over the paired groups c = 1, 2, ..., k,

where:
1. The letter χ is the Greek letter Chi. The expression χ² ("Chi-Squared") is the quantity used in statistical testing and analysis of qualitative samples.
2. Oᵢ = the observed frequency of the characteristic of interest.
3. Eᵢ = the expected frequency of the characteristic of interest.
4. k = the number of paired groups in each class comprising the observed frequencies Oᵢ and the expected frequencies Eᵢ.
5. c = an individual observation (pair) within the paired groups.

Except for the fact that this distribution deals with lower levels of measurement, it is some form of variance. To see this, note that the algorithm for the computation of variances has the same configuration as the formula we have just considered. Compare the two formulas:

    χ² = Σ (Oᵢ − Eᵢ)² / Eᵢ        and        σ² = Σ (Xᵢ − μ)² / n .

Despite this, the assumptions we make for the chi-square statistic are not as rigorous as those of the variance. For example, we cannot assume that the data are obtained from a normal distribution; this is why we have to use a special type of mathematical distribution for the analysis of such data. The Chi-Square distribution is such that if the observed values differ very little from the expected values, the chi-square value is very small; when the differences are large, the chi-square values are also very large. This means that the mode of the distribution tends to be located among the smallest values. This mode is not like that of the normal curve: the Chi-Square curve tends to be highest between two and fifteen observations. For small numbers the curve is positively skewed, but for numbers of observations in excess of twenty the chi-square can be assumed to approach the shape of the Normal Curve. Beyond that point we can assume normality of the data, and the Chi-Square does not differ from the Normal Distribution; accordingly we may use the Standard Normal Curve for model building, data manipulation and analysis. For our purposes, the rejection (alpha) area lies to the right of the distribution because of its characteristic skew.
However, it is theoretically possible to have a rejection area situated to the left of the steepest part of the distribution; this is of no value to our current discussion.

Figure 5 - 1: Various values for the Chi-Square Distribution with increasing degrees of freedom. (Source: George K. King'oriah, Fundamentals of Applied Statistics, 2004, Jomo Kenyatta Foundation, Nairobi.)

Figure 5 - 1 outlines the characteristics of the Chi-Square distribution at various, increasing degrees of freedom. The distribution tables for the Chi-Square statistic, which are used in a manner analogous to the Standard Normal Tables for building confidence intervals and hypothesis testing, are available on page 499 of your textbook (King'oriah, 2004). The tail areas are given at the head of each column, and the entries within the body of the table are the corresponding Chi-Square values at specific levels of confidence for each number of degrees of freedom. For our purposes, the decision rule for the Chi-Square test has an upper rejection region. The location of the observed (calculated) Chi-Square is compared with the location of the expected Chi-Square at the degrees of freedom determined by our sample size. If the calculated Chi-Square is bigger than the one found in the tables at the prescribed degrees of freedom, the null hypothesis of similarity is rejected and the alternative hypothesis of difference between the variables is accepted at the prescribed level of confidence. A calculated Chi-Square which happens to be less than the appropriate critical value falls within the main body of the distribution; the null hypothesis of similarity of the variables being compared is then accepted, and the alternative hypothesis of dissimilarity is rejected.

The commonest use of this statistic is with a set of two variables at a time, although mathematically the distribution has many more uses than this. For a change, we can begin by using the distribution with one variable, where we lose one degree of freedom because we are operating in one dimension. A simple experiment using a six-sided die (plural: dice) which is suspected of being loaded will serve. Let us proceed with this example and see what happens with respect to the confidence interval using the Chi-Square Distribution.

Example: A fair die has an equal probability of showing any one of its six faces on top when it is tossed. Mr. Laibuni the gambler tosses a die 120 times and records the number of dots on the top face each time. Each of the faces is expected to show up twenty times during the 120 tosses. However, Mr. Laibuni finds the results listed in Table 5 - 1. Does Mr. Laibuni have any reason to believe that the die is fair? Advise Mr. Laibuni at the 95% confidence level. (Source: George K. King'oriah, Fundamentals of Applied Statistics, 2004, Jomo Kenyatta Foundation, Nairobi.)

Solution:
1. Frame the null and the alternative hypotheses.
   Ho: The die is not loaded to produce biased results. H0: t1 = t2 = .... = t6 ; the expected face turn-ups (tᵢ) are equal for all faces of the die.
   HA: The die is loaded to produce biased results. HA: t1 ≠ t2 ≠ .... ≠ t6 ; the expected face turn-ups (tᵢ) are not equal for all faces of the die.

TABLE 5 - 1: THE NUMBER OF DOTS FACING UP IN 120 TOSSES OF A FAIR GAMBLING DIE

    Face (number of dots)    Number of times the face turns up in 120 tosses
    1                        12 times
    2                        14 times
    3                        31 times
    4                        29 times
    5                        20 times
    6                        14 times

2. Formulate the decision rule. We are given the 95% confidence level.
Therefore, the alpha level is 0.05.
3. Using a Chi-Square table requires the appropriate degrees of freedom. There are six possibilities in this experiment. The nature of the experiment is such that we are counting within one dimension, from the first face to the sixth face. Therefore n = 6, and when we lose one degree of freedom we are left with n − 1 = 5 degrees of freedom with which to use the Chi-Square Distribution table on page 499 of your textbook.
4. Using the Chi-Square table on page 499 we must obtain the critical Chi-Square value at the 5% (or 0.05) significance level. This level is found along the top-most row, in the fourth column from the right. We find our critical Chi-Square where the values within this column coincide with 5 degrees of freedom, read down the left-most column of the table. The rejection region in this case therefore begins at the Chi-Square value χ² = 11.070. This is illustrated in Figure 5 - 2. We shall reject the null hypothesis that the die is not loaded if the observed (calculated) Chi-Square exceeds this critical value of χ² = 11.070. We now set up a table for computing our statistic.

TABLE 5 - 2: STEPS IN THE COMPUTATION OF THE CHI-SQUARE USING A GAMBLER'S DIE

    Face of die    Observed frequency Oᵢ    Expected frequency Eᵢ    Oᵢ − Eᵢ    (Oᵢ − Eᵢ)²    (Oᵢ − Eᵢ)² / Eᵢ
    1              12                       20                       −8         64            3.20
    2              14                       20                       −6         36            1.80
    3              31                       20                       11         121           6.05
    4              29                       20                       9          81            4.05
    5              20                       20                       0          0             0.00
    6              14                       20                       −6         36            1.80
    TOTALS         120                      120                      0                        16.90

The calculated Chi-Square turns out to be 16.90, which is much larger than the critical value separating the rejection region from the acceptance region. We reject the null hypothesis and accept the alternative hypothesis that the die is not fair; it is loaded. Had our computation produced a calculated Chi-Square below 11.070, we would have accepted the null hypothesis that the die is not loaded. This example is one of the simplest uses of the Chi-Square statistic: to test the goodness of fit of one distribution onto another. The same logic can be applied to compare observed values with expected values in any situation. A slightly modified method is required to handle the qualities of any distribution; but whatever device we use, we end up laying out the two variables being compared so that the difference between the actual values and the expected values can be discerned. Once this has been done, the differences, the squared differences and ultimately the Chi-Square are obtained as we have done above.

Figure 5 - 2: The Chi-Square Distribution at five degrees of freedom and 95% confidence level, showing the calculated χ²c = 16.90.
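The goodness-of-fit computation in Table 5 - 2 can be checked with a short Python sketch; the observed counts, the expected count of 20 per face and the critical value 11.070 are all taken from the example above.

    observed = [12, 14, 31, 29, 20, 14]   # Table 5 - 1
    expected = [20] * 6                   # 120 tosses / 6 faces
    critical = 11.070                     # chi-square table, 5 d.f., 5% level

    chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    print(round(chi_square, 2))           # 16.9

    if chi_square > critical:
        print("Reject Ho: the die appears to be loaded")
    else:
        print("Accept Ho: no evidence that the die is loaded")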
Contingency Tables

Contingency tables are a convenient means of displaying the interaction of two or more variables on one another. Using these tables, the quantities of each variable can be seen clearly, and the total effect of the interaction of the variables is clearly discernible. In addition, the computation of the nature of this interaction and other kinds of mathematical manipulation are possible. Tables of this kind are useful in Chi-Square analysis because the null hypothesis being tested often concerns the interaction of two factors. In most cases we are testing the independence of one variable (factor) from another: that one quality is not affected by the presence of another quality, although both qualities are found within the same interactional environment. We now try another example, in which we test whether one factor affects another under experimental circumstances at a lower level of measurement (the nominal level).

Example: It is generally accepted that, owing to underdevelopment in some rural areas of Kenya, there are primary schools which do not have coursework textbooks, and that such schools perform poorly in the final school examinations (K.C.P.E.). Samson M'Mutiga, an educational researcher based in Uringu Division, has noticed that the subject Mathematics is performed poorly in all the schools within the division. He suspects that this is because some children cannot afford to buy the prescribed Mathematics textbooks. After collecting data about the children's performance in Mathematics from four schools in Uringu Division, and recording them in a contingency table, he uses a Chi-Square statistic to test the null hypothesis that the pass rate in K.C.P.E. Mathematics within the schools of Uringu Division is not affected by possession of the prescribed textbooks, against the alternative hypothesis that possession of the set books has a statistically significant effect on the pass rate in K.C.P.E. Mathematics within the schools of this division, at the 95% confidence level. The contingency table he used for the analysis is Table 5 - 3. Demonstrate how he conducted the Chi-Square test of his null hypothesis at the 95% confidence level. (Source: King'oriah, 2004, pages 426 to 432)

TABLE 5 - 3: A CONTINGENCY TABLE FOR COMPARING TWO QUALITATIVE VARIABLES

    Possession status        Uringu    Kunene    Lubunu    Amwari
    With textbook            13        17        16        13
    Without textbook         17        3         14        7
    TOTALS                   30        20        30        20

Table 5 - 3 contains Samson M'Mutiga's field data. He wishes to use Chi-Square techniques of analysis to investigate whether having the required set book influences the pass rate in Mathematics within the schools of this division. He goes ahead and computes the degrees of freedom associated with his samples, which reflect the fact that the two variables - the textbook-possession status of the pupils and the primary school of their origin - are interacting within one environment.

Degrees of Freedom

In any contingency table the total number of cells is found by multiplying the number of rows by the number of columns; the cells of the table contain the total number of observations in the experiment. However, we must remember that these observations are in two dimensions, and in each dimension we lose one degree of freedom. Interaction between the two variables is reflected mathematically through cross-multiplication of their qualities, and this is also true of the degrees of freedom. Consequently, in computing the degrees of freedom we first adjust for one degree of freedom in every dimension, and then cross-multiply the result. Therefore, in this case, the degrees of freedom must be the number of rows minus one, times the number of columns minus one.
The Chi-Square table will be used just as in the previous example, after we have made sure that appropriate adjustments are made to the table and the correct degrees of freedom are applied to our expected Chi-Square statistic. The given table has two rows (r) and four columns (c). Therefore the degrees of freedom for the variable represented within the rows are (r − 1), and those for the variable represented within the columns are (c − 1), because we lose one degree of freedom in each direction. To reflect the interaction of the two variables, these two answers have to be cross-multiplied. Therefore, for this particular investigation, the degrees of freedom to be used in the Chi-Square test will be:

    Degrees of freedom = (r − 1)(c − 1).

The Chi-Square which we intend to compare against will be designated as:

    Expected Chi-Square = χ²e = χ²(0.05, (r − 1)(c − 1)).

This expression for the expected Chi-Square χ²e means that we are looking for the Chi-Square value below which there is no significant difference between the two variables, at the five percent significance level (95% confidence level) and (r − 1)(c − 1) degrees of freedom. That is how the whole composite expression χ²(0.05, (r − 1)(c − 1)) is interpreted. Now we put the expression in figures. The degrees of freedom are (rows − 1) × (columns − 1) = (r − 1)(c − 1), and the Chi-Square is expressed at the appropriate significance level and degrees of freedom:

    χ²e = χ²(0.05, (r − 1)(c − 1)) = χ²(0.05, (2 − 1)(4 − 1)).

Observed and Expected Frequencies

Identifying the cells in a contingency table: Now we are ready to manipulate the data within our table and see how we can apply the Chi-Square test to the significance of the interaction of the two variables - the school visited and the possession status (with regard to owning the relevant set book). To accomplish this task we shall construct a table which allows us to follow the instructions of the Chi-Square formula:

    χ² = Σ (Oᵢ − Eᵢ)² / Eᵢ , summed over all k cells.

In this new table we assign a number to each cell according to the row and the column which intersect (read: "interact") at the position of that cell. The row designation comes first, the column designation comes after it, and the two are separated by a comma. For example, the intersection of the second row and the third column is represented by "Cell (2, 3)". Let us now label the table of the interaction of the two nominal variables with which we started, as illustrated in Table 5 - 4. In that table, the observations from the field are represented by the "free" numbers (without brackets), and the expected observations - which we shall learn to compute presently - by other, bracketed numbers.

Calculating Expected Values for Each Cell

To do this we begin by drawing another table, similar to Table 5 - 5, in which we indicate the totals of the observations for each column and each row. Within this table we demonstrate how to represent the expected value of the Chi-Square for each cell. We use round brackets to enclose the cell numbers, and square brackets to enclose the magnitudes of the expected observations. Which kind of bracket is used for which purpose does not really matter; what is needed is consistency.
If you use round brackets for cell numbers, for example, be consistent throughout the table and use square brackets for expected observations. If you use square brackets for cell numbers, be equally consistent, reserving the round brackets for the expected values. The important thing is not to mix them: use one kind of bracket for one specific category throughout. In Table 5 - 5, round brackets are placed on top for each cell number, followed by the actual observation without brackets, and lastly the expected value in square brackets. We now turn to the question of how the expected values are calculated for each cell.

TABLE 5 - 4: ILLUSTRATING HOW TO LABEL INTERACTION CELLS IN ANY CONTINGENCY TABLE

    Possession status        Uringu          Kunene          Lubunu          Amwari
    With textbook            Cell (1, 1)     Cell (1, 2)     Cell (1, 3)     Cell (1, 4)
                             13              17              16              13
    Without textbook         Cell (2, 1)     Cell (2, 2)     Cell (2, 3)     Cell (2, 4)
                             17              3               14              7
    TOTALS                   30              20              30              20

TABLE 5 - 5: ILLUSTRATING HOW TO INSERT THE EXPECTED AND OBSERVED VALUES IN ANY CONTINGENCY TABLE

    Possession status        Uringu          Kunene          Lubunu          Amwari          TOTALS
    With textbook            (1, 1)          (1, 2)          (1, 3)          (1, 4)          59
                             13 [17.7]       17 [11.8]       16 [17.7]       13 [11.8]
    Without textbook         (2, 1)          (2, 2)          (2, 3)          (2, 4)          41
                             17 [12.3]       3 [8.2]         14 [12.3]       7 [8.2]
    TOTALS                   30              20              30              20              Grand total 100

The expected value for each cell is computed by dividing the row total at the end of the row containing that cell by the grand total - this gives the proportion of the grand total contributed by that row - and then multiplying this proportion by the column total of the column containing the cell of focus. For example, for cell (1, 1) the expected value of [17.7] has been obtained by dividing the row total of 59 by the grand total of 100, and then multiplying the resulting proportion by the column total of 30. The answer is 17.7 students. Of course we cannot have a fraction of a human being, but we need this hypothetical figure for the computation of the expected Chi-Square values.

Stating the Hypothesis

Obviously, before analyzing the data in any way we must state the hypotheses. Here we state the hypotheses which Samson M'Mutiga used at the beginning of his investigation, following exactly the rules discussed in Chapter Four and in the example involving the gambler above. The two important hypotheses, the null and the alternative, go as follows:

Ho: There is no significant difference in the pass rate of the subject Mathematics between those students who are in possession of the Mathematics set textbook and those who are not.

HA: There is a statistically significant difference in the pass rate of the subject Mathematics between those students who are in possession of the Mathematics set textbook and those who are not.

Once this is done, we set out to calculate the expected values for each cell and to compare them with the actual observed values from the field using the Chi-Square statistic. To do this we tabulate all the expected values against the observed values, and use the same technique as for Mr. Laibuni's gambling experiment. This gives us the computed Chi-Square statistic which compares the data collected from the field with the expected values computed using Table 5 - 5.
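The expected values in Table 5 - 5, and the Chi-Square statistic tabulated in the next section, can be reproduced with a short Python sketch. The observed counts come from Table 5 - 3; nothing else is assumed.

    observed = [[13, 17, 16, 13],    # with textbook
                [17,  3, 14,  7]]    # without textbook

    row_totals = [sum(row) for row in observed]            # 59, 41
    col_totals = [sum(col) for col in zip(*observed)]      # 30, 20, 30, 20
    grand_total = sum(row_totals)                          # 100

    # Expected value of each cell = (row total / grand total) * column total
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

    chi_square = sum((o - e) ** 2 / e
                     for o_row, e_row in zip(observed, expected)
                     for o, e in zip(o_row, e_row))

    print([[round(e, 1) for e in row] for row in expected])
    print(round(chi_square, 3))   # should agree with the hand tabulation that follows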
The columnar tabulation and summary of the expected data against the observed data is given in Table 5 - 6. Notice in this table how the instructions of the Chi-Square expression are followed column by column, until the total computed Chi-Square is finally obtained in the lowest right-hand cell.

TABLE 5 - 6: ILLUSTRATING HOW TO TABULATE AND COMPARE THE OBSERVED AND EXPECTED VALUES IN ANY CHI-SQUARE EXPERIMENT

    Cell number Cᵢⱼ    Oᵢ     Eᵢ      Oᵢ − Eᵢ    (Oᵢ − Eᵢ)²    (Oᵢ − Eᵢ)² / Eᵢ
    1, 1               13     17.7    −4.7       22.09         1.248
    1, 2               17     11.8    5.2        27.04         2.292
    1, 3               16     17.7    −1.7       2.89          0.163
    1, 4               13     11.8    1.2        1.44          0.122
    2, 1               17     12.3    4.7        22.09         1.796
    2, 2               3      8.2     −5.2       27.04         3.298
    2, 3               14     12.3    1.7        2.89          0.235
    2, 4               7      8.2     −1.2       1.44          0.176
    TOTALS             100    100     0.000                    9.330

Decision Rule

Remember that earlier on we obtained a value from the Chi-Square table on page 499 (King'oriah, 2004) using three degrees of freedom, because χ²(0.05, (2 − 1)(4 − 1)) has (2 − 1)(4 − 1) = 3 degrees of freedom. This is the expected Chi-Square (critical) value below which the null hypothesis of no difference is accepted, and above which the null hypothesis of no difference is rejected as we accept the alternative hypothesis of a statistically significant difference. This critical χ² value at the 5% significance level and 3 degrees of freedom is χ² = 7.815. Our computation in Table 5 - 6 records a total calculated χ² value of 9.330. We therefore reject the null hypothesis and accept the alternative hypothesis. We support Samson M'Mutiga's view that the performance of the students in the subject Mathematics within the K.C.P.E. examination depends on the possession of the set books.

Figure 5 - 3: The Chi-Square Distribution at three degrees of freedom and 95% confidence level, showing the calculated χ²c = 9.330.

Readers are requested to read the chapter on Chi-Square tests meticulously, and to note the example on pages 432 to 439 (King'oriah, 2004) where a student of the University of Nairobi (A. O. Otieno, 1988) uses a Chi-Square statistic to discover that the atmospheric pollution caused by the industrial activities of the Webuye Paper Mills is causing most small buildings exposed to this pollution to collapse within Webuye Town.

EXERCISES

1. Test the null hypothesis that there is no difference in the preference for the type of bathing facility within a residential unit among different sizes of families surveyed in the City of Kisumu, at the 5% significance level.

    Family structure             Shower and bath-tub    Bath-tub or shower
    Three children and less      10                     30
    Four children and above      30                     30

2. Test the null hypothesis that salary and education level are statistically independent, at the 95% confidence level.

    Education level                  Monthly salary in thousands of shillings
                                     0 to 4.99    5 to 9.99    10 to 14.99
    High School or less              10           10           14
    High School and some college     8            8            42
    University and postgraduate      0            0            10

3. A marketing firm is deciding whether food additive B is a better-tasting food than food additive A. A sample of ten individuals rated the taste on a scale of 1 to 10; the results of the focus groups are listed in the table below. Test the null hypothesis, at the 5% significance level, that food additive B is no better tasting than food additive A.
FOOD ADDITIVE TASTE COMPARISON

    Individual ID number    Additive A rating    Additive B rating
    1                       5.5                  6
    2                       7                    8
    3                       9                    9
    4                       3                    6
    5                       6                    8
    6                       6                    6
    7                       8                    4
    8                       6.5                  8
    9                       7                    8
    10                      6                    9

CHAPTER SIX

ANALYSIS OF VARIANCE

Introduction

After the discussion of hypothesis testing and contingency tables in the preceding chapters, we are now ready to discuss another technique which uses contingency tables and hypothesis testing of homogeneity or difference of samples. This statistic is called Analysis of Variance. The aim of Analysis of Variance is to find out whether several groups of observations have been drawn from the same population. The logic is as in the Chi-Square statistic: if so, the hypothesis of homogeneity is accepted and that of difference between the samples is rejected. The main difference between Analysis of Variance (ANOVA) and the Chi-Square statistic is the level of measurement. The Chi-Square statistic operates at the nominal level of measurement, where we cannot assume normality of the data, while Analysis of Variance operates at the ratio and interval scales, where normality of the populations can be assumed and the measurements can be taken to be highly precise. Later we shall see how ANOVA can be used to test the significance of linear and non-linear relationships in regression and correlation, because of its power as a test at high levels of measurement. There are many situations in real life where this kind of analysis is required. One of these, which we consider here, is whether we obtain the same yield when we subject different plots to different fertilizer regimes. Straight away the biostatistician's imagination is kindled into picturing all the situations in his career which require this kind of statistic, most of which we shall not have time to consider in these notes. Our interest here is to learn how to compute the test statistic. Let us now state our fertilizer problem more clearly and use it to learn how to carry out Analysis of Variance.

One-Way Analysis of Variance

Example: A maize research scientist has three types of fertilizer treatment, of different chemical composition, used by farmers in his area of operation. He would like to test whether these fertilizer regimes have a significant effect on maize yield, using three experimental plots over a duration of five seasons. Given the data in contingency Table 6 - 1 below, test the null hypothesis that there is no difference in maize yield caused by using the three fertilizer regimes, against the alternative hypothesis that there is a statistically significant difference arising from the use of the three different fertilizer regimes, at the 95% confidence level.

TABLE 6 - 1: DIFFERING MAIZE YIELDS IN THREE HYPOTHETICAL PLOTS SUBJECTED TO THREE DIFFERENT FERTILIZER REGIMES
(Yield per hectare, in bags)

    Season           Type A fertilizer    Type B fertilizer    Type C fertilizer
    1                75                   81                   78
    2                77                   89                   80
    3                85                   92                   84
    4                83                   86                   83
    5                76                   83                   78
    TOTALS: Tⱼ.      396                  431                  403      Grand total T.. = 1230
    MEANS: X̄ⱼ.       79.2                 86.2                 80.6     Grand mean X̄ = 82

Solution: In this situation we follow the usual hypothesis-testing techniques, with all the ethical matters taken care of. In particular we follow the steps of hypothesis testing outlined in Chapter Four.

Step One: Formulate the null and the alternative hypotheses.
Ho: There is no difference in the mean maize yield among the different fields which have been subjected to the different fertilizer regimes, at the five percent significance level (95% confidence level).

    H0: μ1 = μ2 = .... = μn

In this case n is the three columns representing the three different fertilizer regimes, and μ is the population mean from which the three samples are supposed to have been drawn. This small equation says that, although the three samples might be expected to come from different populations, the sample means come from one and the same population mean. [Note that here we are dealing with population means from a normal population, hence the use of the parameter μ.]

HA: There is a statistically significant difference in the mean maize yield among the different fields which have been subjected to the different fertilizer regimes, at the five percent significance level (95% confidence level).

    HA: μ1 ≠ μ2 ≠ .... ≠ μn

Here we assert that the populations from which our samples have been drawn are different, causing the differences in the maize yield which we detect using our statistical techniques.

Understanding the double-summation notation

1. We treat each cell as an area of interaction between any one row i and the corresponding column j, just as we did for the Chi-Square statistic. This way we have cells Xᵢⱼ = cells (1, 1), (1, 2), ......, (5, 3) in this contingency table. (Compare this with the notation for a one-dimensional observation Xᵢ which we discussed in Chapter One.) The sum of these interactions along one dimension, for example down the rows within one column, is Σᵢ Xᵢⱼ : for example, sum all the cells (1, 1), (2, 1), ......, (r, 1) from the first to the r-th down the first column. The number of the column is the second figure in the subscript "i j". Note carefully how this cell designation works for one column before going on.

2. When you are instructed to sum all observations down the columns first, and then - after obtaining the column totals Σᵢ Xᵢⱼ - to sum those column totals to obtain a grand total, the instruction is written like this:

    Σ (from j = 1 to k) Σ (from i = 1 to r) Xᵢⱼ .

Let us go slowly and interpret this instruction. Look at the right-most sum, Σ (from i = 1 to r) Xᵢⱼ. This tells you to sum all the rows from the first row (i = 1) up to the r-th row indicated on top of the summation sign. Then come back to the left-hand summation. This left-hand designation instructs us to sum over all the columns, from the first one (j = 1) across to the k-th column indicated at the top of the summation sign. In a double summation you deal with one summation expression at a time, starting with the right-most expression and ending with the left-most. Interpreted in full, the double summation Σ (from j = 1 to k) Σ (from i = 1 to r) Xᵢⱼ means: "Sum all the row observations, from the first observation on the first row to the r-th observation on the r-th row, down each column; then sum these column-by-column totals from the first column (j = 1) across to the k-th column indicated on top of the respective summation sign." Learners should understand this notation thoroughly before proceeding, and should supplement this information with what is available in the textbook (King'oriah, 2004, pages 237 to 238).
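The double summation can be illustrated with a short Python sketch using the yields in Table 6 - 1; the nested loops mirror the right-to-left reading of the two summation signs.

    # Columns are the three fertilizer treatments (A, B, C); rows are the five seasons.
    yields = [[75, 81, 78],
              [77, 89, 80],
              [85, 92, 84],
              [83, 86, 83],
              [76, 83, 78]]

    # Inner sum: down the rows of one column; outer sum: across the columns.
    column_totals = [sum(yields[i][j] for i in range(5)) for j in range(3)]
    grand_total = sum(column_totals)

    print(column_totals)          # [396, 431, 403]
    print(grand_total)            # 1230
    print(grand_total / 15)       # grand mean: 82.0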
Step Two: Summations and Double-Summations

(Look at Table 6 - 1 very carefully.) Sum all observations down the rows first, and then across the columns, to obtain the column totals "Tⱼ." and the grand total "T..". (These sums are read "Tee-Jay-Dot" and "Tee-Double-Dot" respectively.) This gives us the grand total T.. = 1230. When we divide the grand total by the total number of observations in the three plots we obtain the grand mean X̄ = 82. Individual column means "X̄ⱼ." are obtained by dividing the column sums by the number of observations in each column. In symbolic summary, these sums and their respective means are:

    Σ (from i = 1 to r) Xᵢ₁ = T₁. = 396,    X̄₁. = 79.2
    Σ (from i = 1 to r) Xᵢ₂ = T₂. = 431,    X̄₂. = 86.2
    Σ (from i = 1 to r) Xᵢ₃ = T₃. = 403,    X̄₃. = 80.6
    Σ (from j = 1 to k) Σ (from i = 1 to r) Xᵢⱼ = T.. = 1230,    X̄ = 82

Study these expressions carefully, because understanding them will help you with all future statistical work involving double summation. This is what has to be done first to open the way for data manipulation in one-way analysis of variance.

Step Three: Confidence Levels and Degrees of Freedom

Decide on the confidence level at which to test your hypotheses, and then calculate the relevant degrees of freedom.

    Confidence level: C = 0.95
    Significance level: α = 1 − C = 1 − 0.95 = 0.05

The degrees of freedom are stated in the following manner:

    F α, (c − 1), c(r − 1)

This expression means: "The statistic F [in the F-statistical tables] is defined by the significance level (α) and by (c − 1) and c(r − 1) degrees of freedom." If we know the confidence level and the degrees of freedom (as we are about to), we can easily define the critical value of F, which we designate F_α. As with the Chi-Square statistic, the degrees of freedom take account of the interaction between row and column observations within the ANOVA contingency table. The column degrees of freedom ("V1 d.f.") account for the loss of one degree of freedom because the columns run along one dimension; adjusting by one degree of freedom we obtain c − 1 d.f. (the number of columns minus one). These account for the degrees of freedom among the three samples. (The term among will be important presently: it denotes the interaction among the columns, that is, the three sample characteristics which we also designate as Treatments.) The degrees of freedom contributed by the whole population of observations within the three plots take into account the fact that, within each column, we adjust for the loss of one degree of freedom among its rows: (r − 1). The "row-adjusted" degrees of freedom are then multiplied by the existing number of columns c, which is three, accounting for the three fertilizer regimes or Treatments. This gives the expression c(r − 1) d.f. This second kind of degrees of freedom is called the "within degrees of freedom", denoted by the symbol "V2 d.f.". It takes into account the sampling errors and all the other random errors due to chance which are made when collecting data in the field from the three plots (data which is influenced by the environment - climate, weather, soil types and so on). The summary of the above discussion is as follows:

    Confidence level: C = 0.95
    Significance level: α = 1 − C = 1 − 0.95 = 0.05
    Degrees of freedom: V1 = c − 1 = 2 d.f.;  V2 = c(r − 1) = 3(5 − 1) = 12 d.f.

Step Four: Using the ANOVA Tables
Using the values of the degrees of freedom which we computed in Step Three, we turn to the ANOVA (F) tables on page 490 of your textbook (King'oriah, 2004). The column degrees of freedom (V1 d.f.) are found in the columns arranged across the table, designated by the numbers "1, 2, 3, ........, 9" in the top-most row. We look for the V1 d.f. column labelled "2" along the top row. Down this column we slide until we meet the row which begins at V2 d.f. = 12 (the V2 d.f. are read down the left-most column of the table). The number at the intersection of the column and row of interest reads 3.8853. We therefore conclude that the F-value we are looking for, which accounts for the entire interactional environment among the three plots (which we shall call the three "Treatments"), is:

    F α, (c − 1), c(r − 1) = F 0.05, (3 − 1), 3(5 − 1) = F 0.05, 2, 12 = 3.8853.

Readers should make sure they understand all the computations we have accomplished so far; unless we are sure we have understood these, we are likely to have difficulties with the discussion which follows.

Figure 6 - 1: The position of the calculated value of the F-statistic, called Fc, as compared with that of the critical value of F.

If you have not understood the computations and the associated logic, please supplement the above discussion with what is available in the textbook (King'oriah, 2004) before going on.
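If statistical tables are not to hand, the same critical value can be read off in Python with scipy; this is simply an optional check on the table look-up above, using the degrees of freedom computed in Step Three.

    from scipy.stats import f

    v1, v2 = 2, 12                      # among-treatments and within-treatments d.f.
    f_critical = f.ppf(0.95, v1, v2)    # upper 5% point of the F distribution
    print(round(f_critical, 4))         # approximately 3.8853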
Step Five : Computation of the Sums of Squares; Variation Within and Among Treatments

Having obtained the critical value of F, which we have designated "$F_\alpha$", we now have a statistical probability model for testing the similarity or the difference of the data from the three treatments. (See Figure 6 - 1; our critical value of F is clearly visible in that diagram.) All we need now is to compute the F-value which results from our activities in the field as we observe the three treatments over five seasons.

The F-statistic (ANOVA: Analysis of Variance), like the Chi-Square statistic, compares observed values with expected values. Go back again to Table 6 - 1. You will find that we computed the grand mean of all the observations from the three fields. The value of this grand mean is 82. If we view our data globally, this is the expected value of all the observations in all the cells $X_{ij}$ of Table 6 - 1. We shall now make use of a technique which allows us to compare this value $\bar{X} = 82$ with all the observations from the field. The difference between this grand mean and every observation from the three fields constitutes the so-called total variation in all the data we have collected.

(a) The Total Sum of Squares

In our example we compute this total squared variation by finding the sum of squared differences between the grand mean and every observation we have collected from the three treatments. The notational designation of this important value is SS; namely, the total sum of squares of all the observations we have recorded from the grand mean. This means we find the difference between the grand mean and every observation, each time squaring this difference and obtaining some kind of squared deviation. If you remember what we did in Chapter One when we computed variances, you will see that we are on our way to obtaining a variance of some kind for all the data recorded in our experiment. Now let us proceed:

$$SS = \sum_{j=1}^{c}\sum_{i=1}^{r}\left(X_{ij} - \bar{X}\right)^2$$

Amazing! This expression tells us first (within the brackets on the right) to take each observation recorded in each cell, $X_{ij}$, and subtract the grand mean $\bar{X} = 82$ from it. The exponent on the bracket tells us to square each of these differences. Then the summation sign immediately to the left of the bracket tells us that all this must happen down the treatment rows (observations) of each of the three treatments (columns), from the first row ($i = 1$) to the $r$th row (in our case the fifth row in each column). This is the meaning of the inner summation, $\sum_{i=1}^{r}\left(X_{ij} - \bar{X}\right)^2$. Once you have finished all that for each of the three treatments, sum the results across the three treatments. This is the use of the left-most summation sign, $\sum_{j=1}^{c}$. Therefore the complicated-looking expression

$$SS = \sum_{j=1}^{c}\sum_{i=1}^{r}\left(X_{ij} - \bar{X}\right)^2$$

is just an instruction telling us to do something down the columns, and then across the bottom of the columns. This is nothing special. We now compute the sum of all the squared deviations from the grand mean in our entire experiment using this formula. When we do this we obtain the following figures:

$$SS = (75-82)^2 + (77-82)^2 + (85-82)^2 + (83-82)^2 + (76-82)^2 + (81-82)^2 + (89-82)^2 + (92-82)^2 + (86-82)^2 + (83-82)^2 + (78-82)^2 + (80-82)^2 + (84-82)^2 + (83-82)^2 + (78-82)^2$$

$$SS = 49 + 25 + 9 + 1 + 36 + 1 + 49 + 100 + 16 + 1 + 16 + 4 + 4 + 1 + 16 = 328$$

If we wish to obey the formula strictly and deal with one column at a time, we go ahead and obtain the sum of squared differences for each column. This means we first add the five squared differences for the first column,

$$49 + 25 + 9 + 1 + 36 = 120,$$

then the five squared differences for the second column,

$$1 + 49 + 100 + 16 + 1 = 167,$$

and lastly the five squared differences for the third column,

$$16 + 4 + 4 + 1 + 16 = 41.$$

What we get after these systematic additions are the column totals. Then, adding all these column totals of the squared differences, we obtain

$$120 + 167 + 41 = 328.$$

This agrees with the SS figure that we just obtained above.
(b) The Variation among Treatments

It is easiest to compute this variation after computing the Sum of Squares due to Treatments (SST). Once you have found this figure, the variation due to random errors in the entire population can be computed; however, we defer that other task until later. We proceed with the calculation of SST. Aware of the manner in which we handle double summations, the expression for the Sum of Squares due to Treatments (SST) can be symbolically expressed as:

$$SST = r\sum_{j=1}^{c}\left(\bar{X}_{j.} - \bar{X}\right)^2$$

From the experience we have gained in (a) above, this is a much easier formula to interpret. Again study the treatment means given in Table 6 - 1. You will find the symbol $\bar{X}_{j.}$, pronounced "X-bar Jay-dot", within the brackets of the expression. This is the individual treatment mean for each set of yield observations due to the three fertilizer regimes in our field experiment, tabulated in Table 6 - 1. Since this mean is the typical value in each treatment, we assume that, were it not for random errors, each value within each treatment would have taken the size of the treatment mean $\bar{X}_{j.}$. Therefore, according to our assumption, each treatment should have five (5) observations which are the size of the mean, disregarding the random deviations. Accordingly, we look for the difference between each treatment mean $\bar{X}_{j.}$ and the grand mean and square it to obtain $\left(\bar{X}_{j.} - \bar{X}\right)^2$. We add the three results across the columns (c) to achieve the instruction given by $\sum_{j=1}^{c}\left(\bar{X}_{j.} - \bar{X}\right)^2$.

However, this accounts for only one set of squared differences between the three treatment means and the grand mean, namely $\left(\bar{X}_{1.} - \bar{X}\right)^2 + \left(\bar{X}_{2.} - \bar{X}\right)^2 + \left(\bar{X}_{3.} - \bar{X}\right)^2$. We remember that each of the three treatments has five observations (rows, denoted "r"). Therefore, what we have just obtained is multiplied by the number of rows in each one of the treatments, five (5) rows each:

$$5\left[\left(\bar{X}_{1.} - \bar{X}\right)^2 + \left(\bar{X}_{2.} - \bar{X}\right)^2 + \left(\bar{X}_{3.} - \bar{X}\right)^2\right]$$

This expression is the same as

$$SST = r\sum_{j=1}^{c}\left(\bar{X}_{j.} - \bar{X}\right)^2.$$

Ensure that you understand that it actually is the same. Using the formula, we insert the actual values and compute our SST in the following manner:

$$SST = 5\left[(79.2 - 82)^2 + (86.2 - 82)^2 + (80.6 - 82)^2\right] = 137.2$$

(c) Variation within Treatments

This is the variation due to random error in the entire experiment, usually expressed as the sum of squared differences between each observation recorded in the experiment (in all three treatments) and its own treatment mean. The mathematical expression for this is:

$$SSE = \sum_{j=1}^{c}\sum_{i=1}^{r}\left(X_{ij} - \bar{X}_{j.}\right)^2$$

This is called the variation within treatments because it records the variation between each treatment mean and the individual observations in each of the three samples. Although we can follow this formula and obtain the SSE, we can obtain the same figure the short way, by reasoning that whatever has not been explained by the treatments, $SST = r\sum_{j=1}^{c}\left(\bar{X}_{j.} - \bar{X}\right)^2$, out of the total variation, $SS = \sum_{j=1}^{c}\sum_{i=1}^{r}\left(X_{ij} - \bar{X}\right)^2$, is explainable by nothing else except the random error in each sample, SSE. Therefore we conclude that SS - SST = SSE. In that regard:

SS - SST = SSE
328 - 137.2 = 190.8

You can verify that this calculation is correct by using the full formula $SSE = \sum_{j=1}^{c}\sum_{i=1}^{r}\left(X_{ij} - \bar{X}_{j.}\right)^2$. This means you will have to make a table similar to the one in the textbook (King'oriah, 2004, page 234). The summary of Step Five is as follows:

SS = 328, SST = 137.2, SSE = 190.8

This result is very important for the next crucial step in one-way analysis of variance, which involves actually computing the variances to analyze using the analysis of variance technique and the ANOVA probability model.

Step Six : Computation of Mean Squared Values (or Variances)

For any variance to work there must be two things: a sum of squares of one form or another, and the total number of observations in each case adjusted by the appropriate degrees of freedom. We have already accomplished this. We may now compute the respective variances.

(a) Mean square due to treatment, MST: In our case, this is the variance caused by the differences in fertilizer regimes. It is sometimes known as the variance among samples or treatments, or the variance explained by treatments. The corresponding sum of squares due to treatment is SST = 137.2.

(b) Mean square due to random error, MSE: In our case, this is the variance caused by random errors within each of the treatments, for all the three treatments. It is sometimes known as the variance within the samples or treatments, or the variance explained by chance (random) error within the treatments.

These are the two variances which will be needed for the analysis. If they are equal, this indicates that the treatments have no effect on the overall population.
On the other hand, if they are not equal, then the treatment has some significant effect on the values of these means and hence on the variances. This is the logic which was followed by the discoverer of this test statistic, Sir Ronald Fisher (1890 - 1962), and it explains why the test statistic is called the "F-test", after this illustrious statistician.

The numbers of observations to be used as the denominators in the computation of the analysis of variance come from the degrees of freedom which we have had since we learned how to enter the F-table. This information is available on page 132 of this module (above). We now bring it forward for immediate use:

(i) $V_1$ d.f. $= c - 1 = 2$
(ii) $V_2$ d.f. $= c(r - 1) = 3(5 - 1) = 12$ d.f.

We then interpret these degrees of freedom as:

Treatment d.f. $= V_1$ d.f. $= 2$
Error d.f. $= V_2$ d.f. $= 12$

Accordingly, the mean square due to treatment is computed as:

$$MST = \frac{SST}{\text{Treatment d.f.}} = \frac{137.2}{2} = 68.6 = s_T^2$$

The mean square explained by chance (random) error [sometimes called the unexplained variance] is computed as:

$$MSE = \frac{SSE}{\text{Error d.f.}} = \frac{190.8}{12} = 15.9 = s_E^2$$

Step Seven

Fisher's ratio, ordinarily called the F-statistic, is the ratio between the treatment variance and the error variance. This is expressed as:

$$F_{calculated} = F_c = \frac{s_T^2}{s_E^2}$$

The nature of the test is that if the ratio is equal to 1.0, then $s_T^2$ is equal to $s_E^2$. This means that the variance due to treatment is equal to the random (chance, error) variance. In that case, the treatment has no effect on the entire population, since the variance due to treatment could have occurred by chance; after all, it is equal to the random variance. The mode of this statistic is therefore found at F = 1.0, and the distribution is positively skewed toward the right. The farther the calculated ratio lies from the steepest part of the curve (around 1.0), the more statistically different it is from 1.0. The critical value delineates the value of F beyond which the observed value can no longer be judged to be equal to 1.0 at the prescribed confidence level and the $V_1$ and $V_2$ degrees of freedom. It looks like Figure 6 - 1.

The calculated value of F, designated $F_c$, is

$$F_c = \frac{s_T^2}{s_E^2} = \frac{68.6}{15.9} = 4.31.$$

Compare this figure with the expected value found from the F-tables at the appropriate degrees of freedom and confidence level. The value of that figure is 3.8853. Our conclusion is that any value of F above 3.8853 does not belong to the population whose modal F-value is 1.0. Any distribution with this kind of F will therefore be rejected. Accordingly, we reject the null hypothesis that there is no difference between the yields caused by the different fertilizer regimes 1, 2 and 3, and accept the alternative hypothesis that there is a statistically significant difference between the three treatments. Therefore, application of fertilizers affects the maize yields among the three fields. The investigator would choose the fertilizer regime with the highest mean yield over the five seasons; from Table 6 - 1, this is the Type B regime, with a mean yield of 86.2 bags per hectare.

These steps comprise all there is in carrying out the one-way analysis of variance for equal-sized samples. For unequal sample sizes, for example in situations where some of the observations are not present, see your textbook (King'oriah, 2004, pages 242 to 250). The explanation there is simple and straightforward because we have now mastered the necessary symbolism; the logic is identical and the statistics used are the same.
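As a check on Steps Five to Seven, here is a short Python sketch (numpy and scipy assumed; the names are the writer's own) that recomputes SS, SST, SSE, the two mean squares and Fisher's ratio for the data of Table 6 - 1, and then cross-checks the result against scipy's built-in one-way ANOVA.

```python
import numpy as np
from scipy import stats

data = np.array([[75, 81, 78],
                 [77, 89, 80],
                 [85, 92, 84],
                 [83, 86, 83],
                 [76, 83, 78]])
r, c = data.shape
grand_mean = data.mean()                         # 82.0
col_means = data.mean(axis=0)                    # [79.2, 86.2, 80.6]

SS  = ((data - grand_mean) ** 2).sum()           # 328.0
SST = r * ((col_means - grand_mean) ** 2).sum()  # 137.2
SSE = SS - SST                                   # 190.8

MST = SST / (c - 1)                              # 68.6
MSE = SSE / (c * (r - 1))                        # 15.9
F_c = MST / MSE                                  # ~4.31

F_alpha = stats.f.ppf(0.95, c - 1, c * (r - 1))  # ~3.8853
print(F_c, F_alpha, F_c > F_alpha)               # reject H0 when F_c > F_alpha

# Cross-check with scipy's one-way ANOVA on the three treatment columns
F_check, p_value = stats.f_oneway(data[:, 0], data[:, 1], data[:, 2])
print(F_check, p_value)                          # same F ratio, p < 0.05
```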
We now turn to a slightly more complicated form of analysis of variance: the two-way analysis of variance.

Two-Way Analysis of Variance

Introduction

It is a fact of life that treatments are not the only causes of variation in any group of observations. We shall see later, especially when we deal with a closely related type of analysis, that there are many causes of variation which affect samples and related groups of observations. In the fertilizer example, we have allowed the treatments to be the only cause of variation. By implication, we have held constant all the other variables which could have influenced the yield from the three fields. In that case we find that a large part of the variation still remains unexplained. Consider, for example, the relatively small SST = 137.2 compared to the SSE of 190.8. This means that the sum of squared deviations due to random error in this population needs to be disaggregated some more, so that another variable, other than the fertilizer treatment regime, may perhaps be identified as affecting yields in this area.

Example

In Kenya, and anywhere else in the world, there are good seasons and bad ones. Crop yield is best in good seasons with copious rainfall and all the other factors that favour the growth of plants. During bad years crops do not do as well, and the yields are low. Suppose this investigator wished to test whether the seasons also have some effect on the maize yield. The following table would be relevant. Statistically, we say that we have introduced a Blocking variable into our analysis. This means that while we are investigating the main variable, fertilizer treatment, we shall also be interested in the effects of a second variable in our experiment; this time the seasons are the blocking variable. In this regard, we find that the time we invested in learning the various analytical symbols and notation in the last section will pay dividends here, because we can analyze the summation expressions and obtain our answers faster than if we were to learn the symbols afresh each time. The only thing we need to familiarize ourselves with is how to deal with the blocking variable. Even this is not a big deal, because the logic is identical. We take off and do our analysis straight away.

Step One : Arrange all the given data in a contingency table as shown in Table 6 - 2. Then formulate the null and the alternative hypotheses for the main variable and for the blocking variable. This is because we shall end up with two F-calculated results, each proving a different thing, but in an interacting environment among all the variables involved. In this arrangement, test the effects of the treatments, and then test the effects of the blocking variable, the seasons.

TABLE 6 - 2 : DIFFERING MAIZE YIELDS IN THREE DIFFERENT HYPOTHETICAL PLOTS SUBJECTED TO THREE DIFFERENT FERTILIZER REGIMES DURING DIFFERENT SEASONS

SEASONS    Type A Fertilizer     Type B Fertilizer     Type C Fertilizer     TOTALS    MEANS
           (bags per hectare)    (bags per hectare)    (bags per hectare)
1                 75                    81                    78               234       78
2                 77                    89                    80               246       82
3                 85                    92                    84               261       87
4                 83                    86                    83               252       84
5                 76                    83                    78               237       79
TOTALS $T_{j.}$ : 396                   431                   403           $T_{..}$ = 1230
MEANS $\bar{X}_{j.}$ : 79.2             86.2                  80.6          Grand mean $\bar{X}$ = 82

Ho: There is no difference in the mean maize yield among the different fields which have been subjected to different fertilizer regimes, at the five percent significance level (95% confidence level).

$$H_0 : \mu_1 = \mu_2 = \cdots = \mu_n$$
HA: There is a statistically significant difference in the mean maize yield among the different fields which have been subjected to different fertilizer regimes, at the five percent significance level (95% confidence level).

$$H_A : \mu_1 \neq \mu_2 \neq \cdots \neq \mu_n$$

Ho|B: Crop yield is not affected by any of the five seasons, at the 5% significance level (95% confidence level).

$$H_{0|B} : \mu_1 = \mu_2 = \cdots = \mu_n$$

HA|B: The five seasons have a statistically significant effect on the mean maize yield, at the five percent significance level (95% confidence level).

$$H_{A|B} : \mu_1 \neq \mu_2 \neq \cdots \neq \mu_n$$

This means that we have another set of hypotheses for the blocking variable. We have also, in effect, decided on the confidence level of our test statistic within this first step.

Step Two : Determine the degrees of freedom. Two types of degrees of freedom are to be determined. The treatment degrees of freedom are as before, only that this time we have to take into account the existence of a third dimension, the blocking variable, which has now been introduced. The treatment dimension is represented by columns as usual, where one d.f. is lost. The blocking (blocks) dimension, reflecting the influence of seasons, is represented by rows, where another d.f. is lost. In addition, the rows and columns must be cross-multiplied so that all the error is accounted for. This is where the difference between a two-way and a one-way ANOVA lies. This cross-multiplication is done by adjusting for the lost d.f. on the rows and on the columns. Consequently the degrees of freedom for the two-way ANOVA are determined as follows:

Treatment d.f. ($V_1$) $= c - 1 = 3 - 1 = 2$ d.f.
Error d.f. (in two dimensions, on the treatment and on the blocking dimension) ($V_2$) $= (c - 1)(r - 1) = (3 - 1)(5 - 1) = 2 \times 4 = 8$ d.f.

Therefore, our expected value of the treatment F (designated $F_T$) is determined from the tables at the appropriate alpha level (0.05), entering at the above degrees of freedom into the same table as we used for the one-way analysis of variance (King'oriah, 2004, page 490). We go ahead and straight away determine the critical value of F due to treatment, after taking care of the dimensional influences of the blocking variable through cross-multiplication of rows and columns (which, in turn, have been adjusted for the degrees of freedom along their dimensions, $(c - 1)(r - 1)$). This value of F due to treatment is designated as:

$$F_T = F_{0.05,\; c-1,\; (c-1)(r-1)} = F_{0.05,\; 3-1,\; (3-1)(5-1)}$$

Remember that all the small numbers and letters are not mathematical operations, but indicators or labels which show the kind of "F" we are talking about. Accordingly, our F due to treatment $F_T$ is finally and correctly described as $F_{0.05,\, 2,\, 8}$. The $V_1$ d.f. are 2 and the $V_2$ d.f. are 8. The significance level is 0.05. Looking up page 490 of your textbook we find that this time:

$$F_T = F_{0.05,\; 2,\; 8} = 4.4590$$

The same approach is adopted for the other kind of F which is required in the two-way ANOVA. This time we mechanically regard the rows (the blocking variable) as some kind of treatment "columns". We make two adjustments to obtain the blocking F:

$$F_B = F_{0.05,\; r-1,\; (r-1)(c-1)} = F_{0.05,\; 5-1,\; (5-1)(3-1)}$$

Study this F configuration very carefully and compare it with $F_T = F_{0.05,\; c-1,\; (c-1)(r-1)}$. The significance level is (as before) 0.05. Looking up page 490 of your textbook (King'oriah, 2004) we find that this time the value is $F_{0.05,\; 4,\; 8} = 3.8378$. Therefore, we summarize the statement of F due to the blocking variable as:

$$F_B = F_{0.05,\; 4,\; 8} = 3.8378$$

Step Three : From here the journey is downhill!
We calculate the sum of squares SS just like we did before:

$$SS = \sum_{j=1}^{c}\sum_{i=1}^{r}\left(X_{ij} - \bar{X}\right)^2$$

Then we obtain the SST:

$$SST = r\sum_{j=1}^{c}\left(\bar{X}_{j.} - \bar{X}\right)^2$$

Thereafter, the SSB:

$$SSB = c\sum_{i=1}^{r}\left(\bar{X}_{.i} - \bar{X}\right)^2$$

Comparing the SST and SSB formulae, observe how rows and columns change places. Note how $\bar{X}_{.i}$ represents the row means in Table 6 - 2, and $\bar{X}_{j.}$ represents the column means. The computations take place in accordance with the instruction of each formula. SSB is computed in the same manner as SST, only that this time the row means are used instead of the column means. This is the importance of differentiating row means from column means by different symbolic designations. Compare $\bar{X}_{.i}$ ("X-bar dot-i") with the column symbol $\bar{X}_{j.}$ ("X-bar j-dot"). The following is now a summary of the sum-of-squares computations:

$$SS = \sum_{j=1}^{c}\sum_{i=1}^{r}\left(X_{ij} - \bar{X}\right)^2 = 328 \qquad \text{(the same value as before)}$$

$$SST = r\sum_{j=1}^{c}\left(\bar{X}_{j.} - \bar{X}\right)^2 = 137.2 \qquad \text{(the same value as before)}$$

$$SSB = c\sum_{i=1}^{r}\left(\bar{X}_{.i} - \bar{X}\right)^2 \qquad \text{: this is the only statistic we have not computed before.}$$

All the others are the same as those we computed when we dealt with the one-way analysis of variance. We now go ahead and compute the SSB and use it in our analysis:

$$SSB = 3\left[(78 - 82)^2 + (82 - 82)^2 + (87 - 82)^2 + (84 - 82)^2 + (79 - 82)^2\right]$$
$$SSB = 3\left[16 + 0 + 25 + 4 + 9\right] = 3 \times 54 = 162$$

Having obtained SSB, the sum of squares due to random error (SSE) which remains after introducing the new blocking variable can easily be obtained through subtraction. Note that the error variation is now reduced from the 190.8 which we obtained from the one-way ANOVA on page 137.

SSE = SS - SST - SSB = 328 - 137.2 - 162 = 28.8

Step Four : Obtain the mean squared deviations for the treatment and blocking variables. These are calculated in the same manner as for the one-way ANOVA: divide the respective sums of squares by their corresponding degrees of freedom.

$$MST = \frac{SST}{\text{Treatment d.f.}} = \frac{137.2}{2} = 68.6 = s_T^2$$

$$MSB = \frac{SSB}{\text{Blocking d.f.}} = \frac{162}{5 - 1} = \frac{162}{4} = 40.5 = s_B^2$$

$$MSE = \frac{SSE}{\text{Error d.f.}} = \frac{28.8}{(5 - 1)(3 - 1)} = \frac{28.8}{8} = 3.6 = s_E^2$$

Step Five : Finish off by computing the calculated values of the two F-statistics, the treatment $F_T$ and the blocking $F_B$:

$$F_T = \frac{MST}{MSE} = \frac{s_T^2}{s_E^2} = \frac{68.6}{3.6} = 19.06$$

$$F_B = \frac{MSB}{MSE} = \frac{s_B^2}{s_E^2} = \frac{40.5}{3.6} = 11.25$$

Compare these with the corresponding critical values obtained from the F-table at the appropriate degrees of freedom; these were obtained on pages 143 to 145 above:

$F_{\alpha|T} = 4.4590$, calculated $F_T = 19.06$
$F_{\alpha|B} = 3.8378$, calculated $F_B = 11.25$

In both cases we reject the null hypotheses that neither the fertilizer regime nor the seasons have significant effects on crop yield in the three plots, and accept the alternative hypotheses that both the fertilizer regimes and the seasonal variations have a statistically significant effect on maize crop yield in our three fields at the 95% confidence level.
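The whole two-way procedure can again be verified by machine. The following Python sketch (numpy and scipy assumed; variable names are the writer's own) recomputes SS, SST, SSB, SSE, the mean squares and the two F ratios for Table 6 - 2, and looks up the two critical values.

```python
import numpy as np
from scipy import stats

data = np.array([[75, 81, 78],
                 [77, 89, 80],
                 [85, 92, 84],
                 [83, 86, 83],
                 [76, 83, 78]])
r, c = data.shape
grand = data.mean()

SS  = ((data - grand) ** 2).sum()                    # 328.0
SST = r * ((data.mean(axis=0) - grand) ** 2).sum()   # 137.2  (treatments / columns)
SSB = c * ((data.mean(axis=1) - grand) ** 2).sum()   # 162.0  (blocks / rows)
SSE = SS - SST - SSB                                 # 28.8

MST, MSB, MSE = SST / (c - 1), SSB / (r - 1), SSE / ((c - 1) * (r - 1))
F_T, F_B = MST / MSE, MSB / MSE                      # ~19.06 and 11.25

F_T_crit = stats.f.ppf(0.95, c - 1, (c - 1) * (r - 1))   # ~4.459
F_B_crit = stats.f.ppf(0.95, r - 1, (c - 1) * (r - 1))   # ~3.838
print(F_T, F_T_crit, F_B, F_B_crit)
```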
EXERCISES

1. Explain in detail all the steps involved in the performance of an analysis of variance test for several unequal samples.

2. Outline the steps you would take to compute the F statistics for a two-way analysis of variance.

3. Using two-way analysis of variance, test the null hypothesis that the crop yield in bags per hectare between the years 2000 and 2003 is due neither to the plot on which it was grown nor to the year of observation, at the 5% alpha level.

Year    Plot one   Plot two   Plot three
2000       87         78          90
2001       79         79          84
2002       83         81          91
2003       85         83          89

CHAPTER SEVEN

LINEAR REGRESSION AND CORRELATION

Introduction

By this time our reader is confident in statistical data analysis, even if at a rudimentary level. We now go a step farther and try to understand how to test the statistical significance of data at the same time as testing the relationship between any two variables of our investigation. Once we know how variables are related to one another, we will then look for methods of measuring the strength of their relationship. In the following discussion we shall also learn how to predict the value of one variable given the trend of the other variable. The two related statistics which assist us in all these tasks are called Regression and Correlation.

If any variable changes and influences another, the influencing variable is called an independent variable. The variable being influenced is the dependent variable, because its size and effects depend on the independent variable. The independent variable is usually called an exogenous variable, because its magnitude is decided outside the model by factors which are out of the control of the investigator; the dependent variable is an endogenous variable. Its value depends on the vicissitudes of the experiment at hand; it depends on the model under the control of the researcher. In referring to the relationship between the dependent variable and the independent variable, we always say that the dependent variable is a function of the independent variable. The dependent variable is always denoted by the capital letter Y, and the independent variable by the capital letter X. (Remember not to confuse these with the lower-case letters, because the lower-case letters mean other things, as we shall see later.) Therefore, in symbolic terms we write:

Y is a function of X:    Y = f(X)

meaning, "the values of Y depend on the values of X." Whenever Y changes with each change in X we say that there is a functional relationship between Y and X.

Assumptions of the Linear Regression/Correlation Model

1. There must be two populations, and each of these must contain members of one variable at a time, varying from the smallest member to the largest. One population comprises the independent variable and the other the dependent variable.

2. The observed value at each level (or each value) of the independent variable is one selection out of many which could have been observed. We say that each observation of the independent variable is stochastic, meaning probabilistic: it could occur by chance. This fact does not affect the model very much because, after all, the independent variable is exogenous.

3. Since the independent variable is stochastic, the dependent variable is also stochastic. This fact is of great interest to observers and analysts, and forms the basis of all analysis using these two statistics. The stochastic nature of the dependent variable lies within the model or the experiment, because it is the subject matter of the investigations and analyses of researchers under any specific circumstances.

4. The relationship being investigated between the two variables is assumed to be linear. This assumption will be relaxed later on, when we deal with non-linear regression and correlation in the succeeding chapters.

5. Each value of the dependent variable resulting from the influence of the independent variable is random, one of the very many near-equal values which could have resulted from the effect of the same level (or value) of the independent variable.

6. Both populations are stochastic, and also normal. In that connection, they are regarded as bi-variate normal.
Regression Equation of the Linear Form

The name Regression was invented by Sir Francis Galton (1822 - 1911) who, when studying the natural build of men, observed that the heights of fathers are related to those of their sons. Taking the heights of the fathers as the independent variable, he observed that the heights of their sons tend to follow the trends of the heights of the fathers. He observed that the heights of the sons regressed about the heights of the fathers. Soon the term came to mean that any dependent variable regresses with the independent variable. In this discussion we are interested in knowing how the values of Y regress with the values of X. This is what we have called the functional relationship between the values of Y and those of X.

The explicit regression equation which we shall be studying is

Y = a + bX.

In this equation, "a" marks the intercept, or the beginning of things, where the dependent variable might have been found before the independent variable began to act on it. This is not exactly the case, but we state it this way for purposes of understanding. The value "b" is called the regression coefficient. When evaluated, b records the rate of change of the dependent variable with the changing values of the independent variable. The nature of this rate of change is that when the functional relationship is plotted on a graph, "b" is the magnitude of the slope of this Regression Line.

Most of us have plotted graphs of variables in an attempt to investigate their relationship. If the independent variable is positioned along the horizontal axis and the dependent variable along the vertical axis, the stochastic nature of the dependent variable makes all the observations take a scatter on the graph. This scattering of the plotted values is the so-called scatter diagram, or in short the Scattergram. Closely related variables show scatter diagrams whose points tend to regress in one direction, either positive or negative. Unrelated points do not show any trend at all. See Figure 7 - 1.

The Linear Least Squares Line

Since the scatter diagram is the plot of the actual values of Y which have been observed to exist for every value of X, the locus of the conditional means of Y can be approximated by eye through the scatter diagram. In our earlier classes the teachers might have told us to observe the dots or crosses on the scatter diagram and try to fit the curve by eye. However, this is unsatisfactory because it is not accurate enough. Nowadays there are accurate mathematical methods and computer packages for plotting this line with great estimation accuracy, and for giving various values which ensure that the estimate is accurate. The statistical algorithm we are about to learn helps us understand these computations and assess their accuracy and efficacy.

Figure 7 - 1 : Scatter diagrams can take any of these forms.

In a scatter diagram, the least squares line lies exactly in the centre of all the dots or crosses which may happen to be regressing in any specific direction (see Figure 7 - 1). The distances between this line and all the dots in the scattergram which lie above the line balance the distances of those which lie below it, and the line lies exactly in the middle. This is why it is called the conditional mean of Y.
The dots on the scatter diagram are the observed values of Y, and the points along the line are those values which, for every dot position on the scattergram, define the mean value of Y given the corresponding value of X. The differences between the higher points and the conditional-mean line are called the positive deviations, and those between the lower points and the conditional-mean line are called the negative deviations. Now let us use a simple example to concretize what we have just said.

Example

Alexander ole Mbatian is a maize farmer in the Maela area of Narok. He records the maize yields in debes (tins, rough equivalents of English bushels) per hectare for various amounts of a certain type of fertilizer, which he used in kilograms per hectare, for each of the ten years from 1991 to 2000. The values in the table are plotted on the scatter diagram which appears as Figure 7 - 2. It looks as though the relationship between the number of debes produced and the amount of fertilizer applied on his farm is approximately linear, and the points look like they fall on a straight line. Plot the scatter diagram with the amount of fertilizer (in kilograms per hectare) as the independent variable and the maize yield per hectare as the dependent variable.

Solution

We now need a step-by-step method of computing the various coefficients which are used in estimating the position of the regression line. For this estimate we first of all need to estimate the slope of the regression line using this expression:

$$b = \frac{\sum\left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right)}{\sum\left(X_i - \bar{X}\right)^2}$$

TABLE 7 - 1: DIFFERENT QUANTITIES OF MAIZE PRODUCED FOR VARYING AMOUNTS OF FERTILIZER PER HECTARE USED ON THE PLOTS

Year    n    X (kg of fertilizer per ha.)   Y (debes of maize per ha.)
1991    1            6                             40
1992    2           10                             44
1993    3           12                             46
1994    4           14                             48
1995    5           16                             52
1996    6           18                             58
1997    7           22                             60
1998    8           24                             68
1999    9           26                             74
2000   10           32                             80

where:
$X_i$ = each observation of the variable X; in this case each value of the fertilizer, in kilograms used for every hectare.
$\bar{X}$ = the mean value of X.
$Y_i$ = each observation of the variable Y; in this case each value of the maize yield, in debes per hectare.
$\bar{Y}$ = the mean value of Y.
$\sum$ = the usual summation sign, $\sum_{i=1}^{n}$, shown in abbreviated form.

The regression/correlation statistic involves learning how to evaluate the b-coefficient using the equation $b = \sum\left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right)\big/\sum\left(X_i - \bar{X}\right)^2$, and how to compute the various results which can be obtained from this evaluation.

Figure 7 - 1: The scatter diagram of maize produced against fertilizer used.

The steps which we shall discuss involve analysing the various parts of this equation and performing the instructions in the equation to obtain the value of the coefficient "b". Once this coefficient has been obtained, the other coefficient in the regression equation, "a", is easily obtained because it can be expressed as:

$$a = \bar{Y} - b\bar{X}$$

In addition, other values which will help us in our analysis will be sought and learned. Table 7 - 2 is the tool we shall use to evaluate the equation for the coefficient b.

For the estimation of the b-coefficient we use Table 7 - 2 to assist us in the analysis. We must now learn to show the deviations using lower-case representative symbols, such that $X_i - \bar{X} = x_i$ and $Y_i - \bar{Y} = y_i$. The numerator of the expression for b is therefore $\sum x_i y_i$, and the denominator of the same expression is $\sum x_i^2$. These are the values of X, Y and XY in deviation form, and the summation sign is of course an instruction to add all the values involved. Accordingly, the equation for the b-coefficient is:

$$b = \frac{\sum x_i y_i}{\sum x_i^2}$$
Use Table 7 - 2 and fill in the values to calculate b.

TABLE 7 - 2 : CALCULATIONS TO ESTIMATE THE REGRESSION EQUATION FOR THE MAIZE PRODUCED (DEBES) WITH AMOUNTS OF FERTILIZER USED

YEAR (n)   X (kg/ha.)   Y (debes)   x_i = X_i - X-bar   y_i = Y_i - Y-bar   x_i^2    x_i y_i
  1            6            40            -12                 -17            144       204
  2           10            44             -8                 -13             64       104
  3           12            46             -6                 -11             36        66
  4           14            48             -4                  -9             16        36
  5           16            52             -2                  -5              4        10
  6           18            58              0                   1              0         0
  7           22            60              4                   3             16        12
  8           24            68              6                  11             36        66
  9           26            74              8                  17             64       136
 10           32            80             14                  23            196       322
TOTAL        180           570              0                   0            576       956
MEANS      X-bar = 18    Y-bar = 57

Solution (continued)

Using the values in Table 7 - 2, the solution for the b-coefficient is sought in the following manner:

$$b = \frac{\sum x_i y_i}{\sum x_i^2} = \frac{956}{576} = 1.66$$

This is the slope of the regression line. Then the value of "a", which statisticians also call "$b_0$" (because it is theoretically the estimated value of Y in the initial condition, before X begins to act), is calculated as:

$$a = \bar{Y} - b\bar{X} = 57 - 1.66(18) = 57 - 29.88 = 27.12$$

This is the Y-intercept. The estimated regression equation is therefore:

$$\hat{Y}_i = 27.12 + 1.66\,X_i$$

The meaning of this equation is that, given any value of fertilizer application $X_i$ by ole Mbatian, we can estimate for him how much maize he can expect (in debes per hectare) from that level of fertilizer application, using this regression equation. Assume that he chooses to apply 18 kilograms of fertilizer per hectare. The maize yield he can expect during a normal season (everything else, like rainfall, soil conditions and other climatic variables, remaining constant), estimated using the regression equation, will be:

$$\hat{Y}_i = 27.12 + 1.66\,X_i = 27.12 + 1.66(18) = 57.$$

The symbols for the calculated values of Y, as opposed to the observed values of Y, vary from textbook to textbook. In your textbook (King'oriah, 2004) we use "$Y_c$" or $Y_{calculated}$. Here we are using "$\hat{Y}_i$", pronounced "Y-hat". Other books use "$Y_e$", meaning "Y-estimated", and so on; it makes no difference.
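The slope and intercept just derived by hand can be reproduced in a few lines of Python (numpy assumed; names are the writer's own). The deviation formulas below are exactly those of Table 7 - 2, and numpy.polyfit is used as an independent cross-check.

```python
import numpy as np

X = np.array([6, 10, 12, 14, 16, 18, 22, 24, 26, 32], dtype=float)   # kg of fertilizer
Y = np.array([40, 44, 46, 48, 52, 58, 60, 68, 74, 80], dtype=float)  # debes of maize

x = X - X.mean()                     # deviations x_i
y = Y - Y.mean()                     # deviations y_i
b = (x * y).sum() / (x ** 2).sum()   # 956 / 576 ~ 1.66 (slope)
a = Y.mean() - b * X.mean()          # ~27.12 (intercept)

print(a, b)
print(a + b * 18)                    # predicted yield for 18 kg/ha -> 57 debes

# Cross-check with numpy's built-in least squares fit
b_chk, a_chk = np.polyfit(X, Y, 1)   # returns [slope, intercept]
print(a_chk, b_chk)
```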
Tests of Significance for Regression Coefficients

Once the regression equation has been obtained, we need to test whether the equation constants "a" and "b" are significant. This means we need to know whether they could have occurred by chance. In order to do this we need to find the variances of both parameters and their standard errors of estimate. The variance for "a", also called the variance for $b_0$, can be estimated by extending Table 7 - 2 and setting the estimated values of Y against the actual values of Y, to find each deviation of the actual value from the estimated value which lies on the regression line. All the deviations are then squared, and the sum of the squared errors of all the deviations is calculated. Once this is done, the variance of each of the two constants and its standard error of estimate can easily be computed. Let us now carry out this exercise to demonstrate the process.

In Table 7 - 3 the estimated values of Y are in the column labelled $\hat{Y}_i$. The deviations of the observed values of Y from the estimated values of Y are in the column labelled $e_i$, and their squared values are found in the column labelled $e_i^2$. These, and a few others in this table, are the calculations required to find the standard errors of estimate of the constants $b_0 = a$ and $b_1 = b$. The estimated values $\hat{Y}_i$ are found in the third column from the right of Table 7 - 3. They have been computed using the equation $\hat{Y}_i = 27.12 + 1.66\,X_i$: simply substitute the observed values of $X_i$ into the equation and solve it to find the estimated values $\hat{Y}_i$. The other deviation figures in this table will be used later in our analysis.

TABLE 7 - 3 : COMPUTED VALUES OF Y AND THE ASSOCIATED DEVIATIONS (ERRORS)

YEAR    X    X_i^2    Y    x_i^2   y_i    y_i^2   Y-hat_i    e_i      e_i^2
  1     6      36    40     144    -17     289     37.08     2.92     8.5264
  2    10     100    44      64    -13     169     43.72     0.28     0.0784
  3    12     144    46      36    -11     121     47.04    -1.04     1.0816
  4    14     196    48      16     -9      81     50.36    -2.36     5.5696
  5    16     256    52       4     -5      25     53.68    -1.68     2.8224
  6    18     324    58       0      1       1     57.00     1.00     1.0000
  7    22     484    60      16      3       9     63.64    -3.64    13.2496
  8    24     576    68      36     11     121     66.96     1.04     1.0816
  9    26     676    74      64     17     289     70.28     3.72    13.8384
 10    32    1024    80     196     23     529     80.24    -0.24     0.0576
TOTAL 180    3816   570     576      0    1634                       47.3056

The variance of the intercept is found using the equation:

$$s_{b_0}^2 = \frac{\sum e_i^2}{n - k}\cdot\frac{\sum X_i^2}{n\sum x_i^2}$$

In this equation, n = the number of observations, and k = the degrees of freedom taken up by the interaction of the two variables (here k = 2). The other values can be found in Table 7 - 3. Let us now use the equations:

$$s_{b_0}^2 = \frac{\sum e_i^2}{n - k}\cdot\frac{\sum X_i^2}{n\sum x_i^2} = \frac{47.3056}{10 - 2}\cdot\frac{3816}{10(576)} = 3.92$$

$$s_{b_1}^2 = \frac{\sum e_i^2}{(n - k)\sum x_i^2} = \frac{47.3056}{(10 - 2)(576)} = 0.01$$

Having found the variances of these constants, their standard errors are obviously the square roots of these figures:

$$s_{b_0} = \sqrt{3.92} = 1.98, \qquad s_{b_1} = \sqrt{0.01} = 0.10$$

Let us now test, in numbers of standard errors, how far each of the two constants lies from zero. This means we compute each of their t-values and compare them with the critical t at the 5% alpha level. If the t-values resulting from these constants exceed the expected critical value $t_\alpha$, then we conclude that each of them is significant. The calculated t-values for these parameters are:

$$t_0 = \frac{b_0 - 0}{s_{b_0}} = \frac{27.12}{1.98} = 13.7, \qquad t_1 = \frac{b_1 - 0}{s_{b_1}} = \frac{1.66}{0.10} = 16.6$$

Since both $t_0$ and $t_1$ exceed $t_\alpha = 2.306$, with 8 degrees of freedom at the 5% level of significance, we conclude that both the intercept and the slope are significant at the 5% level.
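Here is a brief Python continuation of the previous sketch (numpy and scipy assumed) that reproduces the two variances, the standard errors and the t-values using the same formulas as above.

```python
import numpy as np
from scipy import stats

X = np.array([6, 10, 12, 14, 16, 18, 22, 24, 26, 32], dtype=float)
Y = np.array([40, 44, 46, 48, 52, 58, 60, 68, 74, 80], dtype=float)
n, k = len(X), 2

x = X - X.mean()
b = (x * (Y - Y.mean())).sum() / (x ** 2).sum()
a = Y.mean() - b * X.mean()
e = Y - (a + b * X)                                  # residuals e_i

var_b0 = (e ** 2).sum() / (n - k) * (X ** 2).sum() / (n * (x ** 2).sum())  # ~3.92
var_b1 = (e ** 2).sum() / ((n - k) * (x ** 2).sum())                       # ~0.0103

t0 = a / np.sqrt(var_b0)              # ~13.7
t1 = b / np.sqrt(var_b1)              # ~16.4 (the module's rounded figures give 16.6)
t_crit = stats.t.ppf(0.975, n - k)    # ~2.306, two-tailed 5% critical value
print(t0, t1, t_crit)
```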
The Coefficient of Determination and the Correlation Coefficient

Using this maize-fertilizer example, a measure of the strength of the relationship can be derived from the data in Table 7 - 3. This measure of the strength of relationship is known as the Coefficient of Determination, from the fact that it determines how closely one data series is related to the other data available within the independent variable. We begin by computing the coefficient of non-determination, which is the ratio of the sum of squared errors between the predicted values $\hat{Y}_i$ and the observed values Y, to the sum of squared deviations between the observed values of Y and the mean $\bar{Y}$.

The sum of squared errors between $\hat{Y}_i$ and Y is $\sum e_i^2 = 47.31$.
The sum of squared deviations between Y and $\bar{Y}$ is $\sum y_i^2 = 1634$.

The coefficient of non-determination $= \dfrac{\sum e_i^2}{\sum y_i^2} = \dfrac{47.31}{1634} = 0.0290$.

This coefficient of non-determination is the proportion, or the probability, of the variation between the two variables X and Y which is not explained by changes in the independent variable X. The Coefficient of Determination is the complement of the coefficient of non-determination. It is the proportion, or the probability, of the variation between X and Y which is explained by changes in the independent variable X. Therefore, the coefficient of determination "$R^2$" is calculated using the following technique:

$$R^2 = 1 - \frac{\sum e_i^2}{\sum y_i^2} = 1 - \frac{47.31}{1634} = 1 - 0.0290 = 0.9710$$

This is a very strong relationship. About 97.1% of the changes in the maize yield (in debes per hectare) on Mr. Mbatian's farm is explained by the quantities of fertilizer per hectare applied on his farm. It also means that the regression equation which we have defined as $\hat{Y}_i = 27.12 + 1.66\,X_i$ explains about 97.1% of the variation in output. The remaining 3% or thereabouts (approximately 2.9%) is explained by other environmental factors on his farm which have not been captured in the model.

Figure 7 - 2 : The estimated regression line.

In any analysis of this kind, the strength of the relationship between X and Y is measured by means of the size of the coefficient of determination. This coefficient varies between a value of zero, for no relationship at all, and 1.0000, the value of a perfect relationship. The example we have from Mr. Mbatian's farm is that of a near-perfect relationship, which actually shows the observed data clinging very closely to the regression line that we have constructed, as shown in Figure 7 - 2.

The other value which is used very frequently for theoretical work in statistics is the Correlation Coefficient. This is sometimes called Pearson's product-moment correlation, after its discoverer, Prof. Karl Pearson (1857 - 1936). He also invented the Chi-Square statistic and many other analytical techniques while he was working at the Galton Laboratory of the University of London. From the computation above you will guess that he is the inventor of the measures which we have just considered. The Correlation Coefficient is the square root of the Coefficient of Determination. We shall use both measures extensively in Biostatistics from now on. Let us now compute the Correlation Coefficient:

$$r = \sqrt{R^2} = \sqrt{1 - \frac{\sum e_i^2}{\sum y_i^2}} = \sqrt{1 - 0.0290} = \sqrt{0.9710} = 0.9854$$

The measure is useful in determining the nature of the slope of the regression line. A negative relationship has a negatively sloping regression line and a negative correlation coefficient; a positive relationship has a positive correlation coefficient. In our case the measure is positive. This means that the more of this kind of fertilizer per hectare that is applied on Mr. Mbatian's farm, the more maize yield, in terms of debes per hectare, he realizes at the end of each season.
Computation Example

Having discussed the theory involved in the computation of the various coefficients of regression and correlation, we need to try an example to illustrate the techniques of computing the relevant coefficients quickly and efficiently.

Example

The following data were obtained for the time required by a drug quality-control department to inspect outgoing drug tablets, against the percentages of those tablets found defective.

Percent defective:           17   9  12   7   8  10  14  18  19   6
Inspection time (minutes):   48  50  43  36  45  49  55  63  55  36

(a) Find the estimated regression line $\hat{Y}_i = a + bX_i$.
(b) Determine the sum of deviations about this line for each of the ten observations.
(c) Test the null hypothesis that change in inspection time has no significant effect on the percentage of drug tablets found defective, using analysis of variance.
(d) Use any other test statistic to test the significance of the correlation coefficient.

Solution

1. The relevant data and preliminary computations are arranged in Table 7 - 4.

2. The following simple formulae assist in the solution of problems of this kind. We already know that the deviations of X and Y from their means are defined in the following manner:

$$X_i - \bar{X} = x_i, \qquad Y_i - \bar{Y} = y_i$$

3. The shortcut computations make use of these deviation formulas to compute the various figures which ultimately lead to the definition of the regression equation $\hat{Y}_i = a + bX_i$.

(a) To find the sum of squared deviations of all the observations of X from the mean value $\bar{X}$, and the sum of squared deviations of Y from $\bar{Y}$, we use the following shortcut expressions:

$$\sum x^2 = \sum X_i^2 - \frac{\left(\sum X\right)^2}{n}, \qquad \sum y^2 = \sum Y_i^2 - \frac{\left(\sum Y\right)^2}{n}$$

(b) The sum of the cross-products of the deviations of X and Y is found using the expression:

$$\sum xy = \sum XY - \frac{\sum X \sum Y}{n}$$

TABLE 7 - 4 : PERCENT OF TABLETS FOUND DEFECTIVE AGAINST INSPECTION TIME IN MINUTES

Observation   Time in minutes (X)   Percent found defective (Y)    X^2      Y^2      XY
  1                  48                       17                   2304     289      816
  2                  50                        9                   2500      81      450
  3                  43                       12                   1849     144      516
  4                  36                        7                   1296      49      252
  5                  45                        8                   2025      64      360
  6                  49                       10                   2401     100      490
  7                  55                       14                   3025     196      770
  8                  63                       18                   3969     324     1134
  9                  55                       19                   3025     361     1045
 10                  36                        6                   1296      36      216
Total               480                      120                  23690    1644     6049
Means             X-bar = 48               Y-bar = 12

(c) If we remember these three equations, we have at our disposal a very powerful tool for the fast computation of the regression coefficients. To find the slope coefficient "b" we use the results of the expressions in (a) and (b) above:

$$b = \frac{\sum xy}{\sum x^2}$$

Then the "a" coefficient can be found easily through the equation:

$$a = \bar{Y} - b\bar{X}$$

4. The correlation coefficient is then found using the following expression:

$$r = \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}}$$

The figures to be used in these computations are found in Table 7 - 4. Whenever you are faced with a problem of this nature, it is prudent to tabulate your data as in Table 7 - 4, and then to follow this with the computations using these simple formulas. We now demonstrate the immense power which is available in the memorization of the simple formulas we have demonstrated in (a), (b) and (c) above:

$$\sum x^2 = \sum X_i^2 - \frac{\left(\sum X\right)^2}{n} = 23690 - \frac{480^2}{10} = 650$$

$$\sum y^2 = \sum Y_i^2 - \frac{\left(\sum Y\right)^2}{n} = 1644 - \frac{120^2}{10} = 204$$

$$\sum xy = \sum XY - \frac{\sum X \sum Y}{n} = 6049 - \frac{480 \times 120}{10} = 289$$

$$b = \frac{\sum xy}{\sum x^2} = \frac{289}{650} = 0.445$$

$$a = \bar{Y} - b\bar{X} = 12 - 0.445(48) = -9.36$$

Accordingly, the regression equation is: $\hat{Y}_i = a + bX_i = -9.36 + 0.445X$.

5. The values given by the regression equation are relevant in the further tabulation of the figures which will assist us to derive further tests. They are recorded in the third column from the left of Table 7 - 5.

TABLE 7 - 5 : COMPUTATION OF DEVIATIONS AND THE SUM OF SQUARED DEVIATIONS

Time in minutes (X)   Percent found defective (Y)   Y-hat_i   D = Y - Y-hat_i     D^2
       48                       17                   12.00          5.00        25.0000
       50                        9                   12.89         -3.89        15.1321
       43                       12                    9.78          2.22         4.9284
       36                        7                    6.66          0.34         0.1156
       45                        8                   10.67         -2.67         7.1289
       49                       10                   12.45         -2.45         6.0025
       55                       14                   15.12         -1.12         1.2544
       63                       18                   18.68         -0.68         0.4624
       55                       19                   15.12          3.88        15.0544
       36                        6                    6.66         -0.66         0.4356
Sum of squared deviations $\sum D^2$ = 75.5143

Using the data in Table 7 - 5, you can see how fast we have been able to compute the important measures which take a lot of time to compute under ordinary circumstances. Obviously it is faster by computer, since there are proprietary packages designed for this kind of work. However, for learning and examination purposes this method has a lot of appeal: one is able to move quickly, and to learn quickly at the same time.

6. It is now very easy to compute the standard error of estimate, which helps us to build a probability model along the regression equation so that we can see how well our regression line serves as an estimator of the actual field situation. What we need now is a method of calculating the standard error of estimate. Given all the data in Table 7 - 5, we can use the expression below to find the standard error of estimate of the regression equation:

$$s_e = \sqrt{\frac{\sum\left(Y - \hat{Y}\right)^2}{n - 2}} = \sqrt{\frac{\sum D^2}{n - 2}} = \sqrt{\frac{75.5143}{10 - 2}} = \sqrt{9.4393} = 3.07$$
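The shortcut formulas in (a) to (c), together with the standard error of estimate, can be checked with a short Python sketch (numpy assumed; names are the writer's own).

```python
import numpy as np

X = np.array([48, 50, 43, 36, 45, 49, 55, 63, 55, 36], dtype=float)  # inspection time
Y = np.array([17,  9, 12,  7,  8, 10, 14, 18, 19,  6], dtype=float)  # percent defective
n = len(X)

Sxx = (X ** 2).sum() - X.sum() ** 2 / n          # 650.0
Syy = (Y ** 2).sum() - Y.sum() ** 2 / n          # 204.0
Sxy = (X * Y).sum() - X.sum() * Y.sum() / n      # 289.0

b = Sxy / Sxx                                    # ~0.445
a = Y.mean() - b * X.mean()                      # ~-9.34 (the module's rounded figure is -9.36)
r = Sxy / np.sqrt(Sxx * Syy)                     # ~0.79

D = Y - (a + b * X)                              # deviations about the fitted line
se = np.sqrt((D ** 2).sum() / (n - 2))           # standard error of estimate, ~3.07
print(b, a, r, se)
```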
If we state the hypothesis that there is no significant difference between the observed values and the calculated values of Y for each value of X, we can build a two-tailed t-distribution model centred on the regression line. This is done by choosing the confidence level C = 0.95, and a 0.05 alpha level. Since we need the upper and lower tails on both sides of the regression equation, we divide the alpha level by two to obtain 0.025 on either side. For the 0.05 alpha level we obtain the appropriate value from the usual t-tables on page 498 of your textbook (King'oriah, 2004). We find that our t-probability model on both sides of our regression equation is built by the following critical value of t, using the two-tail model. (Remember that this table in your textbook has two alternatives, the two-tail and the single-tail alternative; we use the columns indicated by the second row of the table for the two-tail model.) We find that:

$$t_{\alpha,\; 10-2} = t_{0.05,\; 8} = 2.306$$

We now have a probability model which states that, for any observed value of Y to belong to the population of all those values estimated by the regression line, it should not lie more than 2.306 standard errors of estimate on either side of the regression line. The t-value from the table can be useful if it is possible to compute the t-position of every observation. This is done by asking ourselves how many standard errors of estimate each observation lies away from the regression line. We therefore need a formula for computing the actual number of standard errors for each observation. The observed values and the calculated values are available in Table 7 - 5. The expression for computing the individual t-value for each observation is:

$$t_i = \frac{Y_i - \hat{Y}_i}{s_e} = \frac{D_i}{s_e}$$

Using this expression, all the observations can be located and their distances on either side of the regression line can be calculated. This is done in Table 7 - 7.

TABLE 7 - 7 : COMPUTATION OF DEVIATIONS, THE SUM OF SQUARED DEVIATIONS AND T-VALUES FOR EACH OBSERVATION

Time in minutes (X)   Percent found defective (Y)   Y-hat_i   D = Y - Y-hat_i     D^2       t_i = D_i / s_e
       48                       17                   12.00          5.00        25.0000        1.6270
       50                        9                   12.89         -3.89        15.1321       -1.2530
       43                       12                    9.78          2.22         4.9284        0.7220
       36                        7                    6.66          0.34         0.1156        0.1106
       45                        8                   10.67         -2.67         7.1289       -0.8689
       49                       10                   12.45         -2.45         6.0025       -0.7973
       55                       14                   15.12         -1.12         1.2544       -0.3645
       63                       18                   18.68         -0.68         0.4624       -0.2213
       55                       19                   15.12          3.88        15.0544        1.2626
       36                        6                    6.66         -0.66         0.4356       -0.2148
Sum of squared deviations $\sum D^2$ = 75.5143

From the t-values in the right-most column of Table 7 - 7, we find that there is not a single observation which lies further from the regression line than the expected value of

$$t_{\alpha,\; 10-2} = t_{0.05,\; 8} = 2.306.$$

This tells us that there is a very good relationship between X and Y. The relationship as outlined by the regression equation did not come about by chance; the regression equation is a very good predictor of what is actually happening in the field. This also means that whatever parameters of the regression line we have computed, they represent the actual situation regarding the changes in Y which are caused by the changes in X. We can confidently say, at the 95% confidence level, that the actual percentage of defective tablets found on the production line depends on the inspection time in minutes. We may want to instruct our quality-control staff to be more vigilant with the inspection, so that our drug product may have as few defective tablets as is humanly and technically possible.
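The per-observation t-positions in Table 7 - 7 are simply each deviation divided by the standard error of estimate. Continuing the previous sketch (the arrays D and the values se and n are defined there; scipy assumed):

```python
import numpy as np
from scipy import stats

t_i = D / se                                # one t-position per observation (Table 7 - 7)
t_crit = stats.t.ppf(0.975, n - 2)          # two-tailed 5% critical value, ~2.306

print(np.round(t_i, 4))
print(bool((np.abs(t_i) < t_crit).all()))   # True: no point lies beyond 2.306 s_e
```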
Analysis of Variance for Regression and Correlation

Statisticians are not content with merely finding the initial values of statistical computations. They are always keen to make doubly sure that what they report is not due to mere chance. Another tool which they employ for the purpose of data verification is the one we learned in Chapter Six, which we shall call the F-test in this discussion. In regression analysis we are also interested in changes within the dependent variable which are caused by each change in the independent variable. The string of values of the independent variable is in fact analogous to the Treatments which we learned about in analysis of variance. Each position or observation of the independent variable is a treatment, and we are interested in the impact of each one of them on the magnitude of the value of the dependent variable. Analysis of variance for regression/correlation operates at the highest level of measurement (the ratio level), while the other statistic which we considered in Chapter Six operates at all the other, lower levels of measurement.

Use of analysis of variance in regression/correlation analysis tests the null hypothesis, at whatever confidence level, that there is no linear relationship between the independent variable and what we take to be the dependent variable. The null hypothesis is that the variation in the dependent variable happened by chance and is not due to the effects of the independent variable. The alternative hypothesis is that what has been discovered in the initial stages of the regression/correlation analysis has not happened by chance: the relationship is statistically significant. Therefore, to use the F-test we assume:

1. A normal distribution of the values of Y for each changing value of X. Any observed value of Y is just one of the many which could have been observed. This means that the values of Y are stochastic about the regression line.

2. All the values of the independent variable X are stochastic as well, and therefore the distribution is bi-variate normal.
   (a) The null hypothesis is that there is no relationship between X and Y.
   (b) Also, there is no change in Y resulting from any change in X.

3. In symbolic terms, the null and the alternative hypotheses of the regression/correlation analysis can be stated in the following manner, to reflect all the assumptions we have made:

   (i) $H_0 : \mu_{Y_1} = \mu_{Y_2} = \cdots = \mu_{Y_n}$ (No change is recorded in the variable Y as a result of the changing levels of the variable X.)

   (ii) $H_A : \mu_{Y_1} \neq \mu_{Y_2} \neq \cdots \neq \mu_{Y_n}$ (There is some statistically significant change recorded in the variable Y as a result of the changing levels of the variable X.)

4. (a) The total error in the F-test comprises the explained variation and the unexplained variation. It is the sum of squared differences between every observed value of the dependent variable and the mean of the whole string of observations of the dependent variable.

   SS = TOTAL ERROR = EXPLAINED ERROR + UNEXPLAINED ERROR

   (b) The error caused by each observation, which we regard as an individual treatment, is what is regarded as the explained error.

   SST = EXPLAINED ERROR = VARIATION IN Y CAUSED BY EACH VALUE OF X

   (c) The residual error is the unexplained variation due to random circumstances.

   SSE = TOTAL ERROR - EXPLAINED ERROR
   SSE = SS - SST

In our example, SS is the sum of squared differences recorded in Table 7 - 7 as $\sum D^2 = 75.5143$.
The proportion of this which is explained error, or the probability out of 1.0, can be computed by multiplying this raw figure by the coefficient of determination:

$$SST = r^2 \sum D^2$$

The value of $r^2$ is easily obtainable from the calculations on page 164. Mathematically, it is expressed as:

$$r^2 = \frac{\left(\sum xy\right)^2}{\sum x^2 \sum y^2}$$

Using our data, $\sum xy = 289$, $\sum x^2 = 650$ and $\sum y^2 = 204$, so that:

$$r^2 = \frac{289^2}{650 \times 204} = 0.630$$

This is the coefficient of determination, which indicates what probability or proportion of the total variation is explained by the X-variable, or the treatment. Therefore the explained error is:

$$SST = r^2\sum D^2 = 0.630 \times 75.5143 = 47.574009$$

The error due to chance, SSE, is:

$$SSE = SS - SST = 75.5143 - 47.574009 = 27.940291$$

5. Degrees of Freedom

Total degrees of freedom, caused by the total variation of Y arising from the whole environment:
SS d.f. = n - 1 = 10 - 1 = 9 d.f.

Unexplained degrees of freedom are lost due to the investigation of parameters in two dimensions; here we lose two degrees of freedom:
SSE d.f. = n - 2 = 10 - 2 = 8 d.f.

Treatment degrees of freedom are the difference between the total degrees of freedom and the unexplained degrees of freedom:
SST d.f. = SS d.f. - SSE d.f. = (10 - 1) - (10 - 2) = 9 - 8 = 1.0 d.f.

6. This means that we have all the important ingredients for computing the calculated F-statistic and for obtaining the critical value from the F-tables, as we have done before. Observe the following data and pay meticulous attention to the accompanying discussion, because at the end of all this we shall come to an important summary. The summary of all we have obtained so far can be recorded in the kind of table which is suitable for all types of analysis of variance, called the ANOVA table.

TABLE 7 - 8 : ANOVA TABLE FOR A BI-VARIATE REGRESSION ANALYSIS

SOURCE OF VARIATION       SUM OF SQUARES               DEGREES OF FREEDOM        VARIANCES (MEAN SQUARES)              CALCULATED Fc
Total                     SS = $\sum D^2$ = 75.5143    SS d.f. = n - 1 = 9
Explained by Treatment    SST = 47.574009              SST d.f. = 1.0 d.f.       MST = 47.574009 / 1 = 47.57           Fc = MST / MSE = 47.57 / 3.493 = 13.62
Unexplained               SSE = 27.940291              SSE d.f. = n - 2 = 8      MSE = 27.940291 / 8 = 3.493

Study the summary table carefully. You will find that the computations included in its matrix are in fact the systematic steps in computing the F-statistic. The critical value of F is obtained from page 490 at the 5% significance level and [1, n - 2] degrees of freedom:

$$F_{0.05,\; 1,\; n-2} = F_{0.05,\; 1,\; 8} = 5.3172$$

7. Compare this $F_{0.05,\; 1,\; 8} = 5.3172$ with the calculated F-value, $F_c = 13.62$. You will find that we are justified in rejecting, at the 5% significance level, the null hypothesis that the changes in the values of X do not have any effect on the changes in the values of Y. In our example this means that the more inspection time the quality-control staff devote to the work, the more defective tablets they detect, at the 5% significance level.

Activity

Do all the necessary peripheral reading on this subject and attempt as many examples in your textbook as possible. Try to offer interpretations of the calculation results, as we have done in this chapter. It is only with constant exercise that one can master these techniques properly, be able to apply them with confidence, and be able to interpret most data which comes from proprietary computer packages.
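The figures in Table 7 - 8 can be reproduced in a few lines. The sketch below (numpy and scipy assumed; it reuses Sxx, Syy, Sxy, D and n from the earlier drug-example sketch) follows the chapter's procedure of splitting $\sum D^2$ with the coefficient of determination; the F ratio it yields is the same one a standard regression F-test gives.

```python
from scipy import stats

r2 = Sxy ** 2 / (Sxx * Syy)            # ~0.630, coefficient of determination
SS  = (D ** 2).sum()                   # ~75.51, the chapter's "total" figure
SST = r2 * SS                          # ~47.57, explained
SSE = SS - SST                         # ~27.94, unexplained

MST, MSE = SST / 1, SSE / (n - 2)
F_c = MST / MSE                        # ~13.6
F_crit = stats.f.ppf(0.95, 1, n - 2)   # ~5.32
print(F_c, F_crit, F_c > F_crit)       # reject the null hypothesis
```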
EXERCISES

1. For a long time scholars have postulated that the predominance of the agricultural labour force in any country is an indication of the dependence of that country on primary modes of production and that, as a consequence, the per-capita income of each country that has a preponderance of agricultural labour force should be low. A low level of agricultural labour force in any country would then indicate high per-capita income levels for that country. Using the data given below and the linear regression/correlation model, determine what percentage of per-capita income is determined by the agricultural labour force.

Agricultural labour force (millions):   9  10   8   7  10   4   5   5   6   8   7   4   9   5   8
Per-capita income (US $ '00):           6   8   8   7   7  12   9   8   9  10  10  11   9  10  11

2. An urban sociologist practising within Nairobi has done a middle-income survey, comprising a sample of 15 households, with a view to determining whether the level of education of any middle-income head of household within this city determines the annual income of their family. The following are the results of his findings:

Education level:                    7  12   8  12  14   9  18  14   8  12  17  10  16  10  13
Annual income (K.Shs. 100,000):    18  32  28  24  22  32  36  26  26  28  28  32  30  20  18

(i) Compute your bi-variate correlation coefficient and the coefficient of determination.
(ii) Use the shortcut formulae for regression/correlation analysis to compute the equation of the regression line of income (Y) on education level (X).
(iii) By what means can the sociologist confirm that there is indeed a relationship? (Here you should describe any one of the methods of testing the significance of your statistic.)

3. A survey of 12 couples is done on the number of children they have, Y, as compared to the number of children they had previously stated they would have liked to have, X.

(a) Find the regression equation for this phenomenon, computing all the appropriate regression coefficients.
(b) What are the correlation coefficient, the coefficient of determination, the coefficient of non-determination and the coefficient of alienation with regard to this experiment? What is your interpretation of all these?

Couple   1   2   3   4   5   6   7   8   9  10  11  12
Y        4   3   0   2   4   3   0   4   3   1   3   1
X        3   3   0   2   2   3   0   3   2   1   3   2

CHAPTER EIGHT

PARTIAL REGRESSION, MULTIPLE LINEAR REGRESSION AND CORRELATION

Introduction

There are situations where one variable alone is not enough to explain all the variation in the dependent variable. In such cases we are interested in telling how much of the total variation in the dependent variable can be explained by each of the variables which we suspect have some effect on the variability within the dependent variable. We now introduce a situation where more than one variable is influencing the dependent variable. In that case we shall have one major assumption: that all the independent variables are truly independent and none affects another independent variable. This means that there is no multi-collinearity among the independent variables. In this case, too, the dependent variable is assumed to be stochastic and to be determined by all the independent variables in the model. In that case, a simple regression model which represents the interaction of two variables, $\hat{Y}_i = a + bX_i$, will not be adequate. In the general case, the model which is applicable in the multi-variate situation is:

$$Y_i = a + b_1 X_1 + b_2 X_2 + \cdots + b_k X_k,$$

where the $X_k$ are the independent variables, and the $b_k$ are the changes in the dependent variable Y with respect to each of the $X_k$ independent variables.
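As an illustration of this general multi-variate model, the sketch below fits $Y = a + b_1X_1 + b_2X_2$ by ordinary least squares with numpy. The small data set is invented purely for illustration and is not taken from the module.

```python
import numpy as np

# Hypothetical illustration: yield (Y) against fertilizer (X1) and rainfall (X2)
X1 = np.array([6, 10, 12, 14, 16, 18, 22, 24], dtype=float)
X2 = np.array([20, 25, 22, 30, 28, 35, 33, 40], dtype=float)
Y  = np.array([40, 46, 47, 52, 55, 62, 64, 72], dtype=float)

# Design matrix with a column of ones for the intercept a
A = np.column_stack([np.ones_like(X1), X1, X2])
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
a, b1, b2 = coef
print(a, b1, b2)        # intercept and the two partial regression coefficients
```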
Partial Correlation

To be able to understand the concept of multiple regression and correlation we need to understand the concept of partial correlation. This comes about when the analyst examines how much one independent variable is affecting the dependent variable while the effects of all the other variables are held constant. This control is achieved by adjusting the values of the dependent variable to account for the disturbing effects of all the other independent variables. For convenience all variables are identified by labeling them with Arabic number subscripts, with the dependent variable being the first variable, and all the others following as X2, X3, ...., Xk. This means that the relationship between X1 and any other variable, say variable number 3, is designated as r13, and so on.

Partial correlations are designated in orders. These depend on how many variables are controlled in the investigation. Zero order means that no variables are controlled. First order means that the effects of one variable are controlled while we investigate the relationship between the other two variables; the first-order partial correlation therefore involves three variables. The order of control goes up in that manner, and the designation of the correlation coefficient reflects this type of control as well. The general equation for the first-order partial correlation coefficient is

rij.k = (rij - rik · rkj) / √[(1 - rik²)(1 - rkj²)]

Take the example of variables 1 and 2, holding the effects of variable 3 constant. In real terms this equation becomes

r12.3 = (r12 - r13 · r23) / √[(1 - r13²)(1 - r23²)]

In that case we can interpret the partial correlation coefficient rij.k as the ratio of the explained variation between i and j, having taken away the explanation of i and j on k, to the total variation between i and j, having accounted for the effects of i and k, and of j and k.

Computing Partial Correlation Coefficients

To accomplish our interesting exercise we begin by computing the zero-order correlation coefficients between the dependent variable and all the independent variables involved in the relationship being investigated. Once we have done this, the rest of the exercise is easy, because it is merely a question of filling in the correlation coefficients in the equation rij.k = (rij - rik · rkj) / √[(1 - rik²)(1 - rkj²)]. Let us use the climatic data example given in your textbook, pages 367 - 373 (King'oriah, 2004).

Example

After observing the weather for a long time in the Mid-Western United States we develop a strong feeling that a cold January and a cold February are always followed by a cold March. We feel that this is true for the following reasons:

1. When a strong high-pressure cell develops in Western Canada, it is very cold in Indiana and Illinois throughout January. This high-pressure cell remains strong throughout the winter and spring. Conversely, warm Januaries are experienced when this high-pressure cell is weak over Western Canada.

2. If the ground becomes chilled and frozen in January, the cold ground lowers the temperatures of all the subsequent air masses adjacent to it (immediately above it). This causes intense cold to persist throughout February and March.

Test the hypothesis that a cold January (X2) and a cold February (X3) are always followed by a cold March (X1), using the climatic data gathered between 1950 and 1964 in Table 8 - 1. (Source: Prof. J.C. Hook, Indiana State University, 1979.)

Computational techniques

This kind of model can be tackled using a few simple steps which aim to accomplish the instructions of the main equation.
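As a small computational aid for the steps that follow, the first-order formula above can be wrapped in a helper function. This is only a sketch; the function name and its arguments are my own.

```python
import math

def partial_r(r_ij, r_ik, r_kj):
    """First-order partial correlation r_ij.k: the correlation between i and j
    with the effect of variable k held constant."""
    return (r_ij - r_ik * r_kj) / math.sqrt((1 - r_ik ** 2) * (1 - r_kj ** 2))

# Example with arbitrary zero-order coefficients.
print(round(partial_r(0.60, 0.40, 0.50), 4))
```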
The rest of the task involves interpreting the results of the computation. Finally, there are significance tests to be done in order to tell whether or not the relationship came about by chance at specified significance levels.

TABLE 8 - 1 : MEAN TEMPERATURES IN DEGREES FAHRENHEIT* OF THE TOWN OF WINAMAC, INDIANA

Year | January (X2) | February (X3) | March (X1)
1950 | 33.5 | 26.8 | 34.1
1951 | 26.6 | 28.9 | 35.6
1952 | 28.5 | 31.9 | 35.4
1953 | 30.1 | 32.7 | 38.4
1954 | 27.2 | 37.1 | 34.1
1955 | 24.1 | 28.3 | 37.4
1956 | 26.5 | 29.0 | 37.6
1957 | 18.9 | 33.6 | 37.1
1958 | 26.4 | 20.7 | 35.3
1962*** | 19.4 | 25.2 | 35.1
1963 | 13.3 | 18.1 | 41.1
1964 | 29.0 | 27.4 | 36.9
TOTALS ΣXi | 303.5 | 340.7 | 438.1
ΣXi² | 8019.39 | 9980.91 | 16038.55
Σxi² (deviations) | 343.3691667 | 307.8691667 | 44.2491667
Standard deviations | σ2 = 5.349214636 | σ3 = 5.065151912 | σ1 = 1.920268355
Cross-products | Σx1x2 = -70.18916666 | Σx1x3 = -44.93916666 |
Correlation coefficients | r12 = -0.5694254725 ≈ -0.569 | r13 = -0.3850253892 ≈ -0.385 | r23 = 0.3932802 ≈ 0.393

* Changing the data to degrees Celsius would have the same effect, because this would involve a simple arithmetic transformation of the data.
*** The observer died and a new one came in 1962.
(Source: Prof. J.C. Hook, Indiana State University, 1979.)

We must caution that these days there are computer packages which aid in the computation of statistics of this kind, and which give exact results, saving considerable amounts of labor. However, this does not mean the end of learning, because what matters is not the accomplishment of the computation of the statistic; what matters is the interpretation of the results. One needs to understand very clearly how and why any statistical algorithm works, in order to guarantee correct interpretation of the data which may come out of the "mouth" of the computer. It is the view of this author that no matter how much we advance technologically, we shall not stop investigating why the environment around us works, particularly in areas which affect so many of us, such as research and the interpretation of research data. Therefore, a small amount of patience in learning a statistic such as this one pays a lot of dividends. We must mention that there is literature which aims at teaching the computer application techniques that assist in the solution of problems like this one. To understand those techniques, one also needs to understand how each of these statistics works, whether or not a computer is available to assist with the menial tasks.

Step One

We begin by drawing and completing a table like Table 8 - 1. We must do all the "housekeeping" tasks of completing all the cells of the table. These tasks involve nothing new: we need to compute all the zero-order correlation coefficients and standard deviations in the exact manner we learned in the first chapters of this module. Learners should make sure that they compute these measures step by step, and teachers must make sure that the learners understand how to obtain the zero-order measures by setting appropriate exercises to that effect.

The zero-order correlation coefficients show relatively weak relationships between the independent variables and the dependent variable, the March (X1) temperatures:

r12 = -0.5694254725 ≈ -0.569
r23 = 0.3932802 ≈ 0.393
r13 = r31 = -0.3850253892 ≈ -0.385

The question being asked in each of the three cases is whether the month which is the independent variable has any effect on the March temperature, the dependent variable. The strength of these relationships is tested by means of the size of the coefficients of determination computed from the three correlation coefficients.
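As a cross-check on Table 8 - 1, the zero-order coefficients can be recovered from the deviation sums of squares and cross-products. A sketch using the printed figures follows; the value Σx2x3 = 127.869 is taken from the slope computations later in this chapter, since it is not shown in the table.

```python
import math

# Deviation sums of squares and cross-products from Table 8 - 1.
sx1_sq, sx2_sq, sx3_sq = 44.2491667, 343.3691667, 307.8691667
sx1x2, sx1x3, sx2x3 = -70.1891667, -44.9391667, 127.86917

r12 = sx1x2 / math.sqrt(sx1_sq * sx2_sq)   # about -0.569
r13 = sx1x3 / math.sqrt(sx1_sq * sx3_sq)   # about -0.385
r23 = sx2x3 / math.sqrt(sx2_sq * sx3_sq)   # about  0.393

print(round(r12, 3), round(r13, 3), round(r23, 3))
```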
Square these coefficients each time to obtain their corresponding coefficients of determination. (Verify all these measures yourself.)

Step Two

The second step involves the computation of the partial correlation coefficients using the zero-order measures found in Step One. These are tricky areas, but not complicated as long as you follow the stipulations of the formula meticulously, inserting the relevant values in their correct positions. Compute the correlation between January and March controlling for the February effects. This is designated as r12.3:

r12.3 = (r12 - r13 · r23) / √[(1 - r13²)(1 - r23²)]
      = [-0.569 - (-0.385)(0.393)] / √[(1 - 0.385²)(1 - 0.393²)]
      = -0.417695 / √[(0.851775)(0.845551)]
      = -0.4921833 ≈ -0.492

Comparing this result with the zero-order correlation coefficient, we find a less negative relationship. This could imply that we are getting somewhere, but it does not help us very much. Now compute the relationship between March and February while controlling for the effects of January, which we have already investigated. This will give us the individual impact of the February temperatures, so that we can determine the size of this effect, and also its direction.

r13.2 = (r13 - r12 · r23) / √[(1 - r12²)(1 - r23²)]
      = [-0.385 - (-0.569)(0.393)] / √[(1 - 0.569²)(1 - 0.393²)]
      = -0.161383 / √[(0.676239)(0.845551)]
      = -0.2134213 ≈ -0.213

Again, we have a less negative relationship between the variation of February temperatures and March temperatures. Any further tests of significance are not really necessary at this point; we shall perform them when we complete the computation technique. The model is not yet complete. However, we needed to compute these first-order measures as a way of obtaining the final equation and the final model in which all the appropriate variables are controlled. We now proceed to examine the multiple simultaneous interaction of all the variables, which can be used to explain the changes in the dependent variable, the March temperatures. This second step shows that we have not concluded or resolved our problem. The following step will examine whether the multiple simultaneous interaction of all the variables will explain the changes in the temperatures of the dependent variable, the month of March.

Step Three

Examine the nature and the sizes of the standard deviations of the individual variables, especially the dependent variable. Use the standard deviation of all the observed values of the dependent variable X1 and adjust this standard deviation by multiplying it by the coefficient of alienation between the dependent variable and the independent variable. This adjustment is done to see how much of the variation in the dependent variable remains unexplained by the correlation coefficients which we have computed. The formula for doing this kind of adjustment is

S(Y.X) = σY √(1 - r²YX)

Here we must remember that the coefficient of alienation, √(1 - r²), tells us what proportion of the variation of the dependent variable X1 cannot be explained by the variations in any of the independent variables. Beginning with the dependent variable X1 itself, its standard deviation indicates all the variation in X1 produced by the entire environment in which it operates, including both the independent variables. Its standard deviation represents the total variation in X1.
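Before applying that adjustment, the two first-order coefficients of Step Two can be verified numerically with the helper sketched earlier; the rounded zero-order values are those of Table 8 - 1.

```python
import math

def partial_r(r_ij, r_ik, r_kj):
    return (r_ij - r_ik * r_kj) / math.sqrt((1 - r_ik ** 2) * (1 - r_kj ** 2))

r12, r13, r23 = -0.569, -0.385, 0.393   # zero-order coefficients from Table 8 - 1

r12_3 = partial_r(r12, r13, r23)   # January and March, February held constant (about -0.492)
r13_2 = partial_r(r13, r12, r23)   # February and March, January held constant (about -0.213)
print(round(r12_3, 3), round(r13_2, 3))
```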
If we use the expression which we have just learned, S(Y.X) = σY √(1 - r²YX), and multiply the standard deviation by the coefficient of alienation (that part of the variation which is not explained by the effects of each of the independent variables), we are able to isolate the standard error of estimating the dependent variable X1 given the effects of each of these variables. The individual standard deviations of all the variables involved are the zero-order standard deviations; the standard errors of estimate between any two variables are the first-order standard errors, and so on. To be able to find the total effect of all the independent variables on the predictability of the dependent variable, a method is required by which the total effect of all the independent variables is calculated. In a three-variable model like this one, the first-order standard error of estimating X1 given the effects of X2 is computed using the formula

S1.2 = σ1 √(1 - r12²)

We may now use this to compute the standard error of estimating variable X1 given the effects of X2. With σ1 = 1.920 and r12 = -0.569, so that r12² = 0.323761:

S1.2 = 1.920 √(1 - 0.323761) = 1.578888 ≈ 1.579

The effect of the action of the second independent variable X3 can be included by multiplying what we have just computed (S1.2 = 1.579) by the partial coefficient of alienation between this second variable X3 and the dependent variable, while holding the effects of the first independent variable constant; because, after all, we have just taken those effects into account, and to include them again would amount to double-counting. The result of this process is the multiple standard error of estimate, which is found in the following manner:

S1.23 = σ1 √(1 - r12²) · √(1 - r13.2²)

We must note the position of the dots in this formula. The formula represents the standard error of variable X1 given the effects of X2 and X3. It means that after accounting for the variables X2 and X3, there is still some net variation which remains unexplained. Variable X2 is accounted for by means of the expression S1.2 = σ1 √(1 - r12²). What remains unexplained of the variable X1 is further explained by variable X3, after taking into account that variable X2 has done its "job" beforehand. This is the meaning of the dot expression in the √(1 - r13.2²) part of the formula S1.23 = σ1 √(1 - r12²) · √(1 - r13.2²).

This multiple standard error of estimate is the key to the computation of the coefficient of multiple determination. It comprises the crude standard deviation of X1 (the dependent variable), which is denoted as σ1 = 1.920. This value is then corrected twice. First we account for the effects of X2, which is represented by S1.2 = σ1 √(1 - r12²); then we account for the effect of X3, given that the variable X2 has been allowed to operate. This latter effect is represented by S1.23 = σ1 √(1 - r12²) · √(1 - r13.2²).

We now set out to compute S1.23 using the figures which we have calculated above. These are summarized as follows: σ1 = 1.920, r12 = -0.569, r12² = 0.323761, and r13.2² = (0.2134213)² = 0.0455486.

S1.23 = 1.920 √(1 - 0.323761) √(1 - 0.0455486)
      = 1.920 × 0.8223375 × 0.9769602
      = 1.5425108 ≈ 1.543

Compare this result with the following results:

σ1 = 1.920 .......... zero-order standard error of estimate (the crude standard deviation of X1)
S1.2 = 1.579 .......... first-order standard error of estimating X1 given the effects of X2
S1.23 = 1.543 .......... second-order standard error of estimating X1 given the joint simultaneous effects of X2 and X3
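The same chain of adjustments can be written out directly; here is a short sketch using the example's figures, with variable names of my own.

```python
import math

sigma1 = 1.920     # crude (zero-order) standard deviation of X1
r12    = -0.569    # January against March
r13_2  = -0.213    # February against March, January controlled

s1_2  = sigma1 * math.sqrt(1 - r12 ** 2)    # first-order standard error, about 1.579
s1_23 = s1_2 * math.sqrt(1 - r13_2 ** 2)    # multiple (second-order) standard error, about 1.543
print(round(s1_2, 3), round(s1_23, 3))
```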
We can see that each time we include the partial effects of an additional variable, the standard error becomes smaller and smaller. This means that, given the joint simultaneous action of X2 and X3, we reduce the band around the least-squares line within which we may expect to find the estimated values of X1. The estimation of X1 becomes more and more accurate each time.

Step Four: The Coefficient of Multiple Determination

This coefficient is computed using the variation of X1 which cannot be explained by X2 and X3 acting together on X1. This variation is deducted from the total variation of X1, and the formula for all this is

R²1.23 = 1 - S²1.23 / σ1²

Intuitively, this expression reads: the multiple coefficient of determination is everything except the net variation in X1 which cannot be accounted for by the effects of X2 and X3 working together. It is the difference between all possible variation which can take place (1.0) and the ratio of the net variation unexplained by X2 and X3 to the total variation within X1. Taking the same expression:

R²1.23 = the coefficient of multiple determination of the variation in X1 given the effects of X2 and X3;
S²1.23 = the net variation in X1 which cannot be explained by the variation in X2 and X3;
σ1² = the total variance in X1, both explained and unexplained. As usual, it is the square of the crude standard deviation of X1, or the variance of X1.

Step Five: Coefficients of Multiple Non-Determination and Determination

Consequently,

S²1.23 / σ1² = (variance within X1 unexplained by X2 and X3) / (total variation within X1)

This expression is equal to the coefficient of multiple non-determination. It defines the fraction of the variation in X1 which cannot be explained by the joint simultaneous action of X2 and X3. The difference between the total determination which is possible (1.0000) and this coefficient is actually the coefficient of multiple determination. This is given by the formula

R²1.23 = 1 - S²1.23 / σ1²

Using the data at our disposal, we have already computed S1.23 = 1.543; therefore the unexplained variance of X1, given the effects of both X2 and X3, can be evaluated as

S²1.23 = (1.5425108)² = 2.3793397

We also know from the computations in Table 8 - 1 that

σ1² = (1.9202684)² = 3.6874303

Consequently,

R²1.23 = 1 - S²1.23 / σ1² = 1 - 2.3793397 / 3.6874303 = 0.3547431 ≈ 0.355

This means that the variation in March temperatures (X1) which can be jointly explained by January temperatures (X2) and February temperatures (X3) is about 35%. This is quite a high ratio, using only two variables to explain the variation of a third variable. Other variables in the environment which affect March temperatures (X1) need to be sought and included in the model, to see whether we can account for a higher fraction of the variation in X1. For the time being we may be satisfied that this is all we can manage. However, before we leave this trend of argument we need to note that:

r²12 = 0.324 explains about 32.4% of the variation in X1;
r²31 = 0.148 explains about 14.8% of the variation in X1;
R²1.23 = 0.355 explains about 35.5% of the variation in X1.

This indicates that a multiple linear regression model has a lot to offer in explaining the variations in March temperatures.
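The coefficient of multiple determination follows directly from the two variances just computed; a short sketch, again using the example's figures.

```python
s1_23_squared  = 1.5425108 ** 2    # net unexplained variance of X1, about 2.3793
sigma1_squared = 1.9202684 ** 2    # total variance of X1, about 3.6874

r_squared_1_23 = 1 - s1_23_squared / sigma1_squared   # about 0.355
multiple_r     = r_squared_1_23 ** 0.5                # about 0.596
print(round(r_squared_1_23, 3), round(multiple_r, 3))
```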
It is expected that the introduction of other variables, such as the nature of the ground and the elevation of the slope, the nature of the land use, the strength of the prevailing winds with the accompanying wind-chill factors, and so on, could account for more of the variation in March temperatures in Winamac, Indiana.

The multiple correlation coefficient is obviously the square root of the coefficient of multiple determination:

R²1.23 = 0.3547431 ;  R1.23 = √0.3547431 = 0.5956031 ≈ 0.596

There is no sign attached to this coefficient, because it comprises a multiple relationship within which some variables may have positive and others negative correlation. These effects cancel in a multi-variate situation.

Step Six: Looking for the Regression Coefficients

The process of computing the partial slope coefficients (the Bs) begins with the computation of the slopes accounting for the interaction with X1. Note that we are using capital letters instead of the lower-case letters (bs). This is to differentiate the coefficients resulting from the joint activity of the combined independent variables from those of only one independent variable. Generally, the formula for the multiple B, given the effects of the two variables on the dependent variable, is determined in a similar manner to the multiple coefficient of determination and correlation which we have computed above. The formula goes as hereunder:

Bij.k = (Bij - Bik · Bkj) / (Bjk · Bkj)

Specifically,

B12.3 = (b12 - b13 · b32) / (b23 · b32)

This is one of the few cases where we shall not need to verify the formula, because of the advanced form of mathematics involved. However, we can see that, to obtain a pure effect on any B while controlling for the effects of all the other variables, a similar operation is performed on the zero-order bs of the other variables as was performed on the zero-order rs when computing partial correlation coefficients.

We begin by computing b12, as follows:

b12 = Σx1x2 / Σx2²

Readers are expected to look for the relevant values to fill in this equation from Table 8 - 1. In so doing, solve for Σx1x2 and Σx2² in the following discussion as an exercise. Once this is done, you will find that the following series of computations is possible:

Σx1x2 = -70.18917 and Σx2² = 343.36917 ; this gives us the value b12 = -70.18917 / 343.36917 = -0.2044131.

Similarly, Σx1x3 = -44.93917 and Σx3² = 307.86917 ; this gives us the value b13 = -44.93917 / 307.86917 = -0.1459684.

Also, Σx2x3 = 127.86917 and Σx2² = 343.36917 ; this gives us the value b23 = 127.86917 / 343.36917 = 0.3723956.

Then, Σx2x3 = 127.86917 and Σx3² = 307.86917 ; this gives us the value b32 = 127.86917 / 307.86917 = 0.4154335.

Once these have been obtained, the computed values of b are inserted in their appropriate places within the equation for the computation of the partial B coefficients.

Step Seven: Calculating the Partial Regression Coefficients and the Regression Equation

This step is accomplished through the use of the equation which we discussed above, Bij.k = (Bij - Bik · Bkj) / (Bjk · Bkj). In that regard, the partial B between the first and the second variable, while controlling for the effects of the third one, is

B12.3 = (b12 - b13 · b32) / (b23 · b32) = [-0.2044131 - (-0.1459684)(0.4154335)] / [(0.3723956)(0.4154335)]

Calculating the final fraction we get B12.3 = -0.9295533.

B13.2 = (b13 - b12 · b23) / (b32 · b23) = [-0.1459684 - (-0.2044131)(0.3723956)] / [(0.4154335)(0.3723956)]

Calculating the final fraction we get B13.2 = -0.4515833.
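A sketch of these last two steps in code, computing the zero-order slopes from the deviation sums and cross-products and then combining them with the formula used above; the names are mine, and small differences from the rounded figures quoted in the text are to be expected.

```python
# Deviation sums of squares and cross-products (Table 8 - 1 and the discussion above).
sx1x2, sx1x3, sx2x3 = -70.18917, -44.93917, 127.86917
sx2_sq, sx3_sq = 343.36917, 307.86917

# Zero-order slope coefficients.
b12 = sx1x2 / sx2_sq      # about -0.2044
b13 = sx1x3 / sx3_sq      # about -0.1460
b23 = sx2x3 / sx2_sq      # about  0.3724
b32 = sx2x3 / sx3_sq      # about  0.4154

# Partial slopes, combined according to the formula used in this section.
B12_3 = (b12 - b13 * b32) / (b23 * b32)
B13_2 = (b13 - b12 * b23) / (b32 * b23)
print(round(B12_3, 4), round(B13_2, 4))
```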
The calculation of the partial slope coefficients opens the door to stating the intercept coefficient, which goes like this:

a1.23 = X̄1 - B12.3 X̄2 - B13.2 X̄3

Obviously, to compute the multiple a we have to compute the respective means. Using the data from Table 8 - 1 we find that X̄1 = 36.50833, X̄2 = 25.29167 and X̄3 = 28.391667. (Please verify that this is the correct position regarding these figures.) Substitute these figures in their correct positions within the equation a1.23 = X̄1 - B12.3 X̄2 - B13.2 X̄3, and the value for a1.23 is finally computed to be a1.23 = 72.84. Our regression equation (which you must verify through your own computation) therefore becomes

X1c = a1.23 + B12.3 X2 + B13.2 X3

which in actual terms of our example becomes

X1c = 72.84 - 0.9296 X2 - 0.452 X3

This means that, given any array of X2 and X3 figures for each year between 1950 and 1964, we can compute the predicted (or calculated) values of the March (X1) temperatures.

Significance Tests

Tests similar to those of the bi-variate regression/correlation are available for the multi-variate case. They test whether the changes in the dependent variable are necessarily influenced by the changes in the two independent variables. To be able to carry out these tests, assumptions analogous to those of the bi-variate case are made for this kind of statistic; otherwise the tests would be invalid.

1. There must be normal distributions for each of the variables involved, including the dependent variable. This means that the distribution of all the variables is multi-variate normal.

2. Apart from the relationship between them and the dependent variable, all the independent variables are not related to one another.

3. Each observation of each of the independent variables is one of the many that could have been possible.

When all three conditions hold we conclude that the model obeys the condition of homoscedasticity. The null hypothesis being tested is that the means of the dependent variable are equal, and remain so no matter what the values of the independent variables may be. This means that the model is amenable to the application of analysis of variance and other significance tests.

Analysis of Variance

Once the estimated values of the regression equation for X1 are available, they are compared to the observed values, as we did in the bi-variate case. In that regard, we must compute the necessary measures for the computation of the F-statistic.

ΣD² : the sum of squared deviations of the observed values of the dependent variable from their mean; this is the total variation in X1.

R²1.23 ΣD² : the explained variation in X1 caused by simultaneous changes in all the variables involved in the model (in this case X2 and X3). This is the same as the sum of squared deviations due to treatment, SST.

(1 - R²1.23) ΣD² : the unexplained variation which remains even when we have accounted for the effects of the two independent variables. This is the same as SSE.

Total degrees of freedom are the same as the number of paired cases less one degree of freedom: SS d.f. = N - 1. The capital N reflects the multi-variate nature of the investigation, as opposed to the lower-case n which represents the bi-variate case.
Recall the ANOVA relationship, which is expressed as

SS - SSE = SST

This means that the grand sum of squares (SS) less the sum of squares due to error (SSE) is equal to the sum of squares due to treatment. You can also treat the degrees of freedom in this way:

SS d.f. - SSE d.f. = SST d.f.

This is also true for the multi-variate case, where SS d.f. = N - 1 and SST d.f. = k, where k is the number of independent variables. These are the degrees of freedom explained by treatment. This means that

SS d.f. - SSE d.f. = SST d.f. = k ,

the degrees of freedom explained by treatment. Consequently SSE d.f. = N - 1 - k.

With this amount of data available to us we can construct the ANOVA table to summarize our findings. In this table we can at the same time calculate the F-value (F calculated), which is compared to the F-value obtainable from the tables at the 95% confidence level. This critical value of F is designated as

F(α, k, N - k - 1)

In our case it happens to be F(0.05, 2, 33). When we actually compute F using our formula we find

Fc = (R² / k) / [(1 - R²) / (N - k - 1)] = (0.355 × 33) / (0.645 × 2) = 11.715 / 1.29 = 9.08

We note that the calculated value of F is larger than the expected value. We therefore reject the null hypothesis that the two independent variables have no effect on the dependent variable X1, and accept the alternative hypothesis that the two independent variables have a significant impact on the dependent variable X1.

TABLE 8 - 2 : ANOVA TABLE FOR A MULTI-VARIATE LINEAR RELATIONSHIP TEST

Source of variation | Sum of squares | Degrees of freedom | Mean square | Fc
Total | ΣD² | N - 1 | |
Explained by treatment | SST = R² ΣD² | k | MST = R² ΣD² / k | Fc = MST / MSE = (R² / k) / [(1 - R²) / (N - k - 1)]
Unexplained | SSE = (1 - R²) ΣD² | N - k - 1 | MSE = (1 - R²) ΣD² / (N - k - 1) |

Using this table, one does not need to go through all the pains we have undergone above, because we can use the formula

F = (R² / k) / [(1 - R²) / (N - k - 1)]

right away to compute the calculated value of F. This is compared with the expected value of F, which is sought from the table as F(α, k, N - k - 1).

EXERCISES

1. A crop breeder would like to know what the relationship is between crop yield X1 (the dependent variable) and both the amount of fertilizer used on the farm (X2) and the amount of insecticide used on the farm (X3), taking into account his activities between 1991 and 2000.

Year: 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000
X1:   40   44   46   48   52   58   60   68   74   80
X2:    6   10   12   14   16   18   22   24   26   32
X3:    4    4    5    7    9   12   14   20   21   24

Using any of the techniques of multiple regression and correlation, and those of partial regression and correlation, that you have learned above:

(a) Calculate the multiple regression coefficients for the given data, and test their significance.
(b) Compute the coefficient of multiple determination, the coefficient of multiple non-determination and the multiple correlation coefficient.
(c) Using the F-test, do you think that the relationship between the two independent variables and the dependent variable came about by chance?

2. The accompanying data were collected on maize in a study of phosphate response, base saturation, and silica relationships in acid soils. Percentage response is measured as the difference between the yield on plots receiving phosphates and the yield on plots not receiving phosphates, divided by the yield on plots receiving no phosphates and multiplied by 100. Therefore, there is a correlation between Y and the Xi. The variable X2, labeled BEC, is base exchange capacity. Consider this as a regression problem with two independent variables X1 and X2.
207 Response to Phosphates (%)(Y) Yields in Kilograms per Ha. ( X1 ) Saturation of BEC (%) ( X2 ) pH of the soil ( X3 ) 88 80 42 37 37 20 20 18 18 4 2 2 -2 -7 844 1678 1573 3025 653 1991 2187 1262 4624 5249 4258 2943 5092 4096 67 57 39 54 46 62 69 74 69 76 80 79 82 85 5.75 6.05 5.45 5.70 5.55 5.00 6.40 6.10 6.05 6.15 5.55 6.40 6.55 6.50 (a) Write the least squares equation, and estimate the regression coefficients. (b) Test the significance of the regression using analysis of variance F-tests . (c) Construct a 95% confidence interval in the multiple regression coefficient, and on the intercept. 208 CHAPTER NINE NON-LINEAR CORRELATION AND REGRESSION Introduction Linear models have their own limitations. The relationship between any two variables may not necessarily be linear. If on the first sight using the scatter diagram the dots or crosses seem to be scattered in some manner that is not clearly linear it may imply, not the lack of relationship, but the fact that the relationship between the dependent variable and the independent variable are not linear, and that the relationship is non-linear. In this connection, therefore, we need to discuss a technique which is able to analyze this kind of data and bring out the results from a non-linear relationship, and the regression along a non-linear trend. This chapter is an extension of the regression/correlation techniques to answer some of the difficulties. Logarithmic Transformations and Curve fitting Consider the following hypothetical data in Table 9 - 1, which relates the proportion of working population in service industries and the Gross Domestic Product per Capita in twelve countries. We need to investigate the nature of correlation between the number of employees in service industries and per capita income of 12 countries. Firstly, we must arrange data in form of a table and do all the elementary calculations. Initial examination of our statistics give us a correlation coefficient of r = 0.85 and r 2 = 0.73. These are initial indications of the strength of the relationship of the two variables. Which means that we need to test whether indeed the relationship is linear, because when data is plotted on a scatter diagram (Figure 9 - 1) the dots show evidence of lying on a curve. Therefore, we need to fit a curve to our data. The easiest technique of fitting a curve to this data is to use logarithmic methods. After obtaining a logarithmic equation then data is translated again into its actual anti-logarithmic form and a fresh curve is plotted to indicate the real nature of data. The advantages of the logarithmic methods is that the least squares methods which we have hitherto discussed can be used without any modification of data. After transforming the data semi-logarithmically, and using the logarithms of the dependent variables we obtain a straight line graph given in 209 Figure 9 - 3, which fits the data better than Figure 9 - 2. The latter is a scatter diagram drawn for the untransformed data on both axes. 
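A minimal numpy sketch of this semi-logarithmic procedure, assuming arrays x and y hold the GDP-per-capita and service-employment figures of Table 9 - 1 below: take logarithms of Y, fit a straight line by least squares, then transform back to the real-number form. Results can be compared with the worked figures later in the chapter, allowing for rounding.

```python
import numpy as np

# x: GDP per capita, y: employees in service industries (Table 9 - 1 below).
x = np.array([2.0, 1.2, 14.8, 8.3, 8.4, 3.0, 4.8, 15.6, 16.1, 11.5, 14.2, 14.0])
y = np.array([12.0, 8.0, 76.4, 17.0, 21.3, 10.0, 12.5, 97.3, 88.0, 35.0, 38.6, 47.3])

# Fit log10(Y) = log(a) + log(b) * X, a straight line in the transformed data.
log_b, log_a = np.polyfit(x, np.log10(y), 1)

# Transform back to the real-number form  Y = a * b**X.
a, b = 10 ** log_a, 10 ** log_b
y_fitted = a * b ** x
print(round(a, 3), round(b, 3))
```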
TABLE 9 - 1: PROPORTION OF WORKING POPULATION IN SERVICE INDUSTRIES (Source: King’oriah, 2004, p.333) Country Number GDP per Capita (Hundreds of Dollars) (X) 2.0 1.2 14.8 8.3 8.4 3.0 4.8 15.6 16.1 11.5 14.2 14.0 113.9 1 2 3 4 5 6 7 8 9 10 11 12 Totals x, y Number of Employees (per Thousand in service Industries) (Y) 12.0 8.0 76.4 17.0 21.3 10.0 12.5 97.3 88.0 35.0 38.6 47.3 453.4 X2 Y2 XY 4.00 1.44 219.04 68.89 70.56 9.00 23.04 243.36 259.21 132.25 201.64 196.00 1428.43 144.00 64.00 5836.96 289.00 453.69 100.00 156.25 9467.29 7744.00 625.00 1489.96 2237.29 28607.44 24.00 9.60 1130.72 141.10 178.92 30.00 60.00 1517.88 1416.80 287.50 548.12 662.20 6006.84 347.3292 2 xy 1703.3183 r, r 2 0.8531411 11, 476.477 0.7278497 210 Figure 9 - 1 : Scatter diagram of employment data Figure 9 - 2 : Straight-line curve on employment data 211 Figure 9 - 3: Results of logarithmic curve fitting. Data has been transformed Semi-logarithmically and transformed back. The results of the Regression equation have been used to fir the curve into the data We can try to do this exercise using the data on table 13 - 2. The relevant equation here is Log Y 0 . 2881 1.8233 X This is the equation whose results are plotted on Figure 9 - 2. Figure 9 - 3 is the results of the re-transformation the Y-values into their usual real number form (looking for the antilogarithms of the results), and then using the results of this to fit a curve into the scatter diagram. In real number form, if you remember that logarithms describe the numbers in terms of their powers of 10, the actual regression equation which realistically fits this data is :Y 1941 . X 1.8233 Double-logarithmic transformation is where the figures on both the X and the Y variables are transformed into their logarithms. Their regression equation is computed using the 212 transformed figures. Thereafter the actual regression equation is computed by looking for the anti-logarithms of all the data for the dependent variable and the independent variable. Learners may try to do this as an exercise. Let us now tabulate the data which was used for semi-logarithmic curve fitting at Table 9 - 2. The equations for fitting the regression lines involve the computation of the regression coefficients in the usual manner - the way we have learned in Chapter 8. Any of the methods available in practice including the use of computers can be used to obtain the relevant results. Hereunder is the shortcut equation for the calculation of the bcoefficient using semi-logarithmic methods and the tabulated data in Table 9 - 2. We give the shortcut equation for the computation of the b-coefficient which is identical to the ones we considered in Chapter 8. logb X log Y N X N X log Y N X 2 2 TABLE 9 - 2. 
ANALYSIS OF THE DATA WHICH WAS USED FOR SEMI-LOGARITHMIC CURVE FITTING (Source: King’oriah, 2004, p.337) Country Number (X) (Y) Log Y X ( log Y) X2 1 2 3 4 5 6 7 8 9 10 11 12 2.0 1.2 14.8 8.3 8.4 3.0 4.8 15.6 16.1 11.5 14.2 14.0 12.0 8.0 76.4 17.0 21.3 10.0 12.5 97.3 88.0 35.0 38.6 47.3 1.1647 0.8156 3.5461 1.5139 1.7646 1.0000 1.2032 3.9535 3.7811 1.9541 2.5173 2.8053 2.1584 1.0837 27.8699 10.2123 11.1586 3.0000 5.2651 31.0144 31.3065 16.0759 22.5297 23.4486 4.00 1.44 219.04 68.89 70.56 9.00 23.04 243.36 259.21 132.25 201.64 196.00 Totals 113.9 453.4 26.0184 185.1231 1428.43 213 LogY 2 Like the slope coefficient “b ” above, The intercept, or the a-coefficient can also be computed using the data in Table 9 - 2 and the equation which is also a mathematical identity of the equation we encountered in Chapter Eight:- a log Y log b X N Compare this expression with a Y b X . When data is fitted into both equations the following real-value expressions are the result :- logb log b X log Y N X N X log Y N X 2 2 . 121851231 113.917.1131 2 121428.43 113. 9 272 . 2971 0.0653 4167.95 We also insert data into the expression for the constant :- a log Y log b X N 17.1131 0. 0653113.9 0.8063 12 The regression equation, in its untransformed configuration is :log Y 0 .8063 0.0653 X We transform this equation into its real-number form to obtain the actual model which defines the relationship of our data : Y 0 . 64011.162 X . This is the equation which has been used to fit the least squares line onto the scatter diagram at Figure 9 - 3. Like we said above, it is possible to transform all data recorded for both variables X and Y logarithmically. Table 9 - 3 is a record of all the data which has been transformed 214 logarithmically involving both variables. This time we are investigating the relationship between hypothetical data recording the relationship between all people employed in Coffee industry and the foreign exchange earning of a coffee growing country. We expect that if coffee is the main export good for the country it is grown using large numbers of human resources, and that the larger the number, the more the coffee produced, and the more the foreign exchange earning in this country. This time the relevant equation for the slope coefficient is as follows :- b N log X log Y N log X log a and for the intercept is 2 log Y log X log Y log X 2 log b log X N Now we fit in the real data into the shortcut equations :- b 1220 .5555 10.611722 .8053 129.4326 10.61172 Using your calculator you can verify that b = 1.8233. The equation of the intercept is log a log Y b log X N , which we fill with data in the following manner :log a 22 .8053 1.823310 . 6117 12 215 TABLE 9 - 3. 
ANALYSIS OF THE DATA WHICH WAS USED FOR DOUBLE-LOGARITHMIC CURVE FITTING (Source: King’oriah, 2004, p.341) Employment in Coffee (X) (000,000 People) 5.8 6.3 6.5 6.8 7.6 8.0 8.0 8.5 8.7 8.6 9.0 9.1 Foreign Exchange Earning (Y) ($ 000,000 ) 48.8 58.2 59.9 62.7 72.3 82.1 82.5 93.5 99.1 100.0 114.6 115.2 Log X ( Log X ) 2 LogY LogY 2 log X log Y 0.7634 0.7993 0.8129 0.8325 0.8808 0.9031 0.9031 0.9294 0.9395 0.9345 0.9542 0.9590 0.5828 0.6389 0.6608 0.6931 0.7758 0.8156 0.8156 0.8638 0.8827 0.8733 0.9105 0.9197 1.6884 1.7649 1.7774 1.7973 1.8591 1.9143 1.9165 1.9708 1.9961 2.0000 2.0591 2.0614 2.8507 3.1149 3.1592 3.2303 3.4563 3.6645 3.6730 3.8841 3.9844 4.0000 4.2399 4.2494 1.2889 1.4107 1.4448 1.4963 1.6375 1.7288 1.7308 1.8317 1.8753 1.8690 1.9648 1.9769 10.6117 9.4326 22.8053 43.5067 20.2555 Using your calculator you can verify that a = 0.2881. The double-logarithmic regression equation is : log Y log a b log X . Translated into actual logarithmic data, the equation reads :log Y 0.2881 18233 . log X The scatter diagram for the double logarithmic data is as given in Figure 9 - 4. When we find the anti-logarithm of the data and draw the least-squares curve we find that, first of all the actual regression equation is : Y 1. 941 X 1.82331 , and the actual least squares curve is as at Figure 9 - 5. 216 Non-Linear Regression and Correlation Polynomial Model Building A polynomial is a function consisting of successive powers of the independent variable as in the following expression :Y a b X c X 2 d X 3 .... k X n This is an expression of an n th degree polynomial. The first degree polynomial is the familiar equation Y a b X . The degree of the equation is determined by the exponent or the power of the polynomial. The nature of the least-squares line is determined by polynomial exponents on top of the independent variable. Each degree of the polynomial determines the undulation of the least-squares curve. The polynomial equation is one of these mathematical equations which could be used to describe the wave pattern of the curve. In general the degree of the polynomial generates a least-squares curve of n - 1 bends of the wavy curve. Since the higher-order polynomials can exhibit a positive slope at one point and negative at another they exhibit wide characteristics of relationship between variables, other than linear transformations, which do not have these important characteristics. Example Consider this person who wishes to investigate the relationship of land values to the distance from the town center as given in your textbook. (King’oriah, 2004, page 344). 217 Figure 9 - 4 : The Scatter Diagram for the Double-Logarithmic Data in Table 9 - 3. 218 Figure 9 - 5: Actual Least-squares Curve for Employees in Coffee Industry and Foreign Exchange. The following is actual data which was collected from the field by Prof (Dr.) Evaristus M. Irandu, early in the 1980s, as he was doing his master’s degree in Geography of the University of Nairobi, and is cited in (King’oriah, 2004, page 344 page 358). Follow his trend of thought and find the relationship between the distance and land values from the city centre of Mombasa by fitting a curve to the given data and calculating all the relevant statistics which are estimated by the given data in Table 9 - 4 Prof. Irandu’s scatter diagram, after some extensive data analysis, is given in Figure 9 - 6. After the necessary tabulation he used the shortcut equation below for computing the correlation coefficient. 
This equation is identical to any you may have used before, and has been known to bring forth accurate results, even without significant rounding error problems. 219 r n XY N X 2 X Y X N Y Y 2 2 2 When we obtain the relevant figures from Table 9 - 4 we have the following results :- r r r 15697.4 61.1341 15325. 97 61.1 2 1514,235 341 2 10461 208351 . 4889.55 3733.21213525 116281 10374 .1 0 . 9783094 1156.34 97244 The Coefficient of Determination “ r 2 r 0 . 978; r 2 0.957 ” which is shown in the last equation, revealed that as much as 95.7% of the change in land values as one moves away from the city center - from Mwembe Tayari area, or even from Ambalal House - is explainable by the distance from this city center. Again, using his data as laid out in Table 9 - 4, and the shortcut formula for the computation of the regression coefficient we may compute the b-Coefficient. The formula, as usual is :b N X Y N X 2 X Y X 2 The formula is identical to any that we have used before; and can be filled with actual data as follows :- 220 b 15697.4 61.1341 15325.97 61.1 2 10461 20835.1 4889 .55 3733.21 Table 9 - 4: Analysis of the variation of land values with distance ( Source : Adopted from E.M. Irandu, 1982 ) (X) Distance (Y) Land Values/ha. ( Sh.20,000 ) 0.4 0.8 1.3 2.1 2.3 3.1 3.0 4.2 4.7 5.3 6.1 59 55 54 40 41 28 18 16 9 6 5 X2 Y2 6.2 7.2 3 2 0.16 0.64 1.69 4.41 5.29 9.61 9.00 17.64 22.09. 28.09 37.21 38.44 51.84 6.9 7.5 3 2 47.61 52.25 9 4 20.7 15.0 341 325.97 14,235 697.4 61. 1 221 3481 3025 2916 1600 1681 784 324 256 81 36 25 XY 9 4 23.6 44.0 70.2 84.0 94.2 86.8 54.0 67.2 42.3 31.8 30.5 18.6 14.4 Figure 9 - 5: Variation of land values with distance ( Source : Adopted from E.M. Irandu, 1982 ) b 10374 .1 89714963; 1156 . 34 b 8 . 975 The interpretation of this coefficient means that a unit change in distance travelled from the city center in Kilometers will cause a decrease of (897.5 20)/- = Sh. 17,950/form any figure - given the linear relationship assumption. The Intercept “ a ” is easily computed using the data on the table and the usual formula for the calculation of the same :- a Y b X N 341 8 . 9714963 611 . 59 . 277228 15 222 This intercept figure revealed that the land values at 0.4 Kilometers from the city center (which center is defined to be around the area spreading somehow from Ambalal House to Mwembe Tayari) in 1982 were Shs. ( 20 59.277228/-) per hectare. This is approximately Sh. 1,185, 544 per hectare. This decline in land values is depicted in Figure 9 - 6. Testing for Non-Linear Relationship The nature of the data in the scatter diagram made Prof. Irandu curious. He needed to test whether a non-linear relationship existed, whose regression equation (if calculated) can be used to define a better fitting non-linear least squares curve. His equation of choice was a second degree polynomial function : Y a bX c X 2 . To accomplish this, he used a set of Normal Equations which are the equations of choice analyze the values of polynomial functions. In order to do this, he tabulated his data according to Table 9 - 5. This allowed him to fill in the relevant values within the Normal Equations and to compute various statistics which define the equation of the polynomial. These equations can be solved using simultaneous equation methods or matrix algebra. (For a detailed solution of the equations see King’oriah, 2004, pages 350 - 354.) 
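One quick way to carry out both the straight-line fit and the second-degree polynomial fit described here is numpy.polyfit on the distance and land-value columns of Table 9 - 4. This is only a sketch; its results can be compared with the hand computations, and small differences arising from rounding in the worked tables are to be expected.

```python
import numpy as np

# Distance from the city centre (km) and land values (Sh. 20,000 per ha), Table 9 - 4.
x = np.array([0.4, 0.8, 1.3, 2.1, 2.3, 3.1, 3.0, 4.2, 4.7, 5.3, 6.1, 6.2, 7.2, 6.9, 7.5])
y = np.array([59, 55, 54, 40, 41, 28, 18, 16, 9, 6, 5, 3, 2, 3, 2], dtype=float)

# Straight-line fit  Y = a + bX  (numpy returns the highest power first).
b_lin, a_lin = np.polyfit(x, y, 1)

# Second-degree polynomial  Y = a + bX + cX^2.
c2, c1, c0 = np.polyfit(x, y, 2)

# Correlation coefficient for the linear relationship.
r = np.corrcoef(x, y)[0, 1]

print(round(a_lin, 2), round(b_lin, 3), round(r, 3))
print(round(c0, 2), round(c1, 3), round(c2, 3))
```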
The Normal equations for the second degree polynomial curve are :- Y Na b X c X XY a X b X X Y a X b X 2 2 2 2 3 cX3 cX4 All we need to solve these equations is to substitute the respective values from the tables. These values are available at the ends of the columns of Table 9 - 5. Try to see if you could identify them and use simultaneous methods to solve the normal equations. Once 223 you have identified the values at the bottom of the table you will begin your solutions with the following figures for each equation :- Figure 9 - 6 : A straight-line Regression line fitted on data which was analyzed in Table 9 - 4 Actualizing the normal equations we find that :341. 00 15 a 611 b 325.97 c .............................( i ) 697. 40 61.1a 325. 97 b 1966 . 721c ........................( ii ) 2262.24 325.97 a 1966.721b 12 , 758 . 639 c .............( iii ) 224 TABLE 9 - 5; DATA FOR MANIPULATING NORMAL EQUATIONS IN COEFFICIENT COMPUTATIONS OF LINEAR LEAST SQUARES POLYNOMIAL CURVE (X) 0.4 0.8 1.3 2.1 2.3 3.1 3.0 4.2 4.7 5.3 6.1 6.2 7.2 6.9 7.5 61. 1 (Y) 59 55 54 40 41 28 18 16 9 6 5 3 2 3 2 X2 0.16 0.64 1.69 4.41 5.29 9.61 9.00 17.64 22.09. 28.09 37.21 38.44 51.84 47.61 52.25 341 325.97 3481 3025 2916 1600 1681 784 324 256 81 36 25 9 4 9 4 X3 0.064 0.512 2.197 9.261 12.167 29.791 27.000 74.088 103.823 148.877 226.981 238.328 373.428 528.509 391.875 X4 0.256 0.4096 2.8561 19.4481 27.9841 92.3521 81.0000 311.1696 487.9691 789.0481 1384.5841 1477.6336 2687.3856 2266.7121 2730.0625 XY 23.6 44.0 70.2 84.0 94.2 86.8 54.0 67.2 42.3 31.8 30.5 18.6 14.4 20.7 15.0 X 2Y 9.44 35.20 91.26 176.40 216.89 269.08 162.00 282.24 192.81 168.54 186.05 115.32 103.68 142.83 104.50 14,235 1966.721 12,758.639 697.4 2262.24 Y2 Source : Adopted from E.M. Irandu, 1982 . [Any errors in the interpretation of all this data in this text are mine, and not Prof. Irandu’s.] If you follow the argument in your textbook you will come to the least squares curve and equation answer :Y 70.37 165319 . X 0 . 966 X 2 The curve resulting from this equation minimizes the sum of squared deviations between the observed values of Y for each value of X. Using this equation; we can predict the values of Y for each value of X. The values and the process to obtain them are tabulated in Table 9 - 6. 225 Table 9 - 6: Tabulation for the computation of the Least-Squares Line for Yc X 0.4 0.8 1.3 1.4 1.7* 2.1* 2.3 3.1 3.0 4.2 4.7 5.3 6.1 6.2 7.2 6.9 7.5 Intercept a 70.370931 = –bX = – 16.5319 X – 6.61296 – 13.22552 – 21.491462 – 23.144652 – 28.10422 – 34.716699 – 38.02337 – 49.59570 – 51.248871 – 69.43398 – 77.69993 – 87.61907 – 100.84459 – 102.49778 – 114.07011 – 119.02968 – 123.98925 70.37 – 16 . 5319X + 0 . 966 X 2 2 X 0.16 0.64 1.69 1.96 2.89 4.41 5.29 9.61 9.00 17.64 22.09. 28.09 37.21 38.44 51.84 47.61 52.25 cX2 = 9066X2 0.1450594 0.580278 1.5321905 1.7769783 2.6201364 3.9982013 4.7960283 8.1595944 8.7126336 15.992805 20.027271 25.467001 33.735390 34.850534 43.164254 46.999264 47.370979 Y 59 55 54 / / 40 41 28 18 16 9 6 5 3 2 3 2 341 Y 59 55 54 / / 40 41 28 18 16 9 6 5 3 2 3 2 Yc 63.903030 57.725649 50.411660 49.003259* 44.886847* 39.652142 37.143589 28.934825 27.834694 16.929756 12.698272 8.218862 3.261731 2.723685 – 0.534925 – 1.659485 6.247340 Y c ** (Rounded) 63.9 57.7 50.4 49.0* 44.9* 39.7 37.1 28.9 27.8 16.9 12.7 8.2 3.3 2.7 – 0.5 – 1.7 – 6.2 340.9 ( Source : Adopted from E.M. Irandu, 1982 ) * Figures in these rows have been computed using the regression equation to enable close plotting of regression curve on the scatter diagram. 
** Values of Yc have been rounded to assist in approximating the position on ordinary graph paper. This is not necessary when a computer program is used for plotting the regression curve. When the values of Yc are plotted against those of X which are available in Table 9 - 6, an almost-perfect fit of the least squares line is obtained as in Figure 9 - 7. The computations in Table 9 - 6 are not mandatory always because in most cases we use appropriate computer packages to arrive at the necessary statistics for the estimation of the least-squares line. This computation has been done to ensure that we understand what is involved in this work. As an exercise the learner may wish to verify the figures and plot the graph on figure 9 - 7. 226 Figure 9 - 7 : The non-linear Regression line using a First order polynomial equation on Land values in Mombasa ( Source : Adopted from E.M. Irandu, 1982 ) Significant Tests Using Non-Linear Regression/Correlation Methods The coefficient of Non-Linear Determination Like the other coefficient of determination for the linear regression and correlation which we have discussed before, this coefficient tells us the amount of variation which is explained by the changes in the independent variable. The easiest and most easily understandable method - among the several which are available - is the one involving the computation of all the estimated values of Y using the regression equation which we have just computed. The difference between these and the observed values of Y, and then all the sum of squared differences, and the mean-square (variance) due to the regression is computed. This variation represents the explained variation in the 227 model - the variation explained by the independent variable. The Total Variation is the crude variance of the dependent variable. In principle, the ratio of the explained variance to that of the total variance is the coefficient of determination. In this kind of analysis, the variance is denoted by :- p Explained Variation Total Variation A2 2 2 Y X X r In this connection: 2 P2 the variance of the predicted values of Y. A2 the variance of the observed values of Y. Table 9 - 7 shows the values of the dependent variable ( Y ) which were observed listed against those which have been computed using the equation for the regression curve obtained using the regression equation. Within the table rounded figures have been used to simplify the computation. Usual methods of calculating the variance of predicted values of Y are employed. The examples we give below give the shortcuts for computing these variances in the same manner as we obtained the regression equations above. In that connection, the figures at the bottom of Table 9 - 7 are used to formulate the ratios of the shortcuts in the following manner :- 2 P Y2 P YP N 2 15087 .11 340.9 2 N 1 15 1 15 15087 .11 116,212 .81 15 1 15 P2 P2 6741. 7133 481.55095 14 228 TABLE 9 - 7 : DATA AND COMPUTATIONS USED FOR OBTAINING THE TOTAL VARIANCE OF Y AND THE EXPLAINED VARIANCE OF Y. (Y) Land Values/ha. ( Sh.20,000 ) Yc Y2 Yc 2 59 55 54 40 41 28 18 16 9 6 5 3 2 63.9 57.7 50.4 39.7 37.1 28.9 27.8 16.9 12.7 8.2 3.3 2.7 – 0.5 3481 3025 2916 1600 1681 784 324 256 81 36 25 9 4 4083.21 3329.29 2540.16 1576.09 1376.41 835.21 772.84 285.61 161.29 67.24 10.89 7.29 0.25 3 2 – 1.7 – 6.2 9 4 2.89 38.44 341 340.90 14,235 15087.11 You will note that there is a negative sign in the variance which we have obtained. 
Variances normally do not have negative signs, but the value of this one has been influenced by (among other things) the fact that we are dealing with a second degree polynomial function. The sign should not worry us if the actual variance is also of the same sign. This is what is happening in this case, as hereunder:- 2 A Y2 A YA N 2 14 , 235 . 11 341 2 N 1 15 1 15 229 14 , 235 116,281 15 1 15 A2 6803. 0667 14 48593333 . The ratio of the predicted variance to that of the actual variance is the coefficient of nonlinear determination. P2 481.55095 0 . 9909815 2 A 485. 9333 r 2 Y X X 2 0 . 991 The square root of this one is the correlation coefficient r = 0 . 996. It is evident that using the non-linear model we can explain 99.1% of the variation in the independent variable ( X ) and its square. Using the linear model we were able to explain only 95.7% of the variation in Y through the variation in X . We conclude that in this case the non-linear polynomial curve is the best model representing how land values declined from Mombasa City center to the periphery at the time of the field investigation. EXERCISES 1. Discuss the limitations of the bi-variate linear regression/correlation model and compare it to some of the available non-linear models that help solving the problems associated with these limitations. 2. Explain what you mean by :(a) Arithmetic Transformation (b) Semi-logarithmic Transformation (c) Double-logarithmic Transformation 230 3. The following data array records the changing pressure of oxygen at 25o C when filled into various volumes. Volume in Liters Pressure in Atmospheres 3.25 5.00 5.71 8.27 11.50 14.95 17.49 20.35 22.40 7.34 4.77 4.18 2.88 2.07 1.59 1.36 1.17 1.06 231 CHAPTER TEN NON-PARAMETRIC STATISTICS Need for Non-Parametric Statistics The techniques which we have discussed so far, especially those involving continuous distributions have stressed underlying assumptions for which the techniques are valid. These techniques are for the investigation of parameters and for testing hypotheses concerning them. They are called parametric, and their main concern is with statistics whose distribution is normal. A considerable amount of data is such that the underlying distribution is not easily specified. To handle such data we need distribution-free statistics which are not dependent on the distribution of the parent population. these are what are called NonParametric Statistics. If we do not specify the nature of the parent population then we will not deal with parameters like we have hitherto done. this means that non-parametric statistics compare distributions rather than parameters. they may be sensitive to changes in location, in spread, or in both. We will not try to maintain a distinction between distribution-free and non-parametric statistics. Rather, we shall collect them under the same fold-title and call them non-parametric statistics. This kind of statistics has a number of advantages. 1. When it is only possible to make weak assumptions about the nature of the distributions underlying the data. 2. When it is difficult to categorize the data because of adequate scale of measurement. 3. When it is only possible to rank the data but not to measure such data accurately because of the weak scale of measurement underlying experimental design and data collection methods. The only disadvantage about non-parametric statistics is that they are not good accurate estimators of parameters, because the distribution of their parent population is unknown. 
They are therefore better avoided if there are alternative parametric statistics which can be used with greater effects. 232 Chi-Square and the Test for Goodness of Fit We have already dealt with this kind of statistic, in our foregoing discussion. This is not because it falls within the parametric category, but because of its immense importance and frequency of applications in simple experiments. It was found prudent that the learners should be exposed to this statistic early so that they could compare it with Analysis of Variance, which falls immediately afterwards. No scale of measurement is required to define the categories, although some scale may exist and may be used. The probabilities may be determined by theory or may be estimated from the results of the analysis. The Chi-Square test operates at low levels of measurements. Categories are purely nominal and the data is not used to determine rations and probabilities. We tested the null hypothesis that a normal distribution is involved. The normal parameters had to be estimated before cell probabilities could be computed. The goodness of fit was something to do with the fitness of low-measurement data into the expected distribution, however crude. Remember how we shifted gears to a higher level of measurement - analysis of variance, and then on to Regression/correlation, where we used the normal distribution as the underlying assumption. We were then in Parametric Statistics proper, which included the normal distribution tests, F-tests, and Regression/correlation measures. The Median Test for one sample Recall when we said the median is the counterpart of the mean, especially when the end-observations are expected to have heavy influence on the central tendency of data. In that connection, the median is computed using the observation of the middle category in any range of data. Using the Median test, we test the hypothesis that any set of n-randomly drawn measurements came from some parent population with a specified median. The scale of measurement must be at least interval because the median cannot be determined using any other lower scale - like nominal or ordinal scales. Given any problem where this test must be applied, the differences between all observations and the expected median ( D i ) must be determined and ranked. If any 233 difference is zero, it is disregarded. If ties occur among ranks of the items involved the average rank for each tied item is used. After the ranking exercise, assign the proper sign for each difference, negative if the observation is below the median and positive if above the median. You may then arrange them in absolute value increasing order. Add all the values of the positive-ranked observations and those which are negatively ranked. Compare the two sums of the rank values. The test statistic is based on the smaller of the two types of rank totals regardless of whether it is for the positive or for the negative ranks. This smallest rank is called R, and is the one which is used in Table A-8 ( page 502 of your textbook; King’oriah 2004) to accept or reject the hypothesis. The hypothesis is rejected if the observed values of R exceed the tabulated values of R for the specific n-values and alpha level. Definition of the Test Assumptions Data consists of n measurements : X1 , X2 , X3 , ........... , X n. D1 denotes the difference between X i and the hypothesized median X M . Then :1. Each Di must be a continuous random variable. 2. The distribution of each Di must be symmetric. 3. 
The Median Test for One Sample

Recall that the median is the counterpart of the mean, especially when the end observations are expected to have a heavy influence on the central tendency of the data. In that connection, the median is computed using the observation of the middle category in any range of data. Using the median test, we test the hypothesis that a set of n randomly drawn measurements came from some parent population with a specified median. The scale of measurement must be at least interval, because the median cannot conveniently be determined using any lower scale, like the nominal or ordinal scales.

Given any problem where this test must be applied, the differences between all observations and the hypothesized median (the Dᵢ) must be determined and ranked. If any difference is zero, it is disregarded. If ties occur among the ranks, the average rank of the tied items is used for each of them. After the ranking exercise, assign the proper sign to each difference: negative if the observation is below the median and positive if it is above the median. You may then arrange them in increasing order of absolute value. Add all the rank values of the positively signed observations and, separately, those of the negatively signed ones, and compare the two sums. The test statistic is based on the smaller of the two rank totals, regardless of whether it belongs to the positive or to the negative ranks. This smaller rank total is called R, and it is the one which is used with Table A-8 (page 502 of your textbook; King'oriah, 2004) to accept or reject the hypothesis. The hypothesis is rejected if the observed value of R falls outside the tabulated critical values for the given n and alpha level.

Definition of the Test

Assumptions
The data consists of n measurements: X₁, X₂, X₃, ..., Xₙ. Dᵢ denotes the difference between Xᵢ and the hypothesized median XM. Then:
1. Each Dᵢ must be a continuous random variable.
2. The distribution of each Dᵢ must be symmetric.
3. All the measurements Xᵢ, i = 1, 2, ..., n, must represent a random sample from the population distribution.
4. The measurement scale must be at least interval (to enable the computation of the median, because the median cannot conveniently be computed at any lower scale).

Hypotheses
Ho: The population median is XM.
HA: The population median is not XM.

Test Statistic
Determine the differences Dᵢ = Xᵢ − XM, i = 1, 2, ..., n. If any Dᵢ = 0, drop it from the set and decrease the number n by one. Rank the absolute values |Dᵢ|. If ties occur among the ranks, average the ranks of the items involved in the tie and use the average as the rank of each tied item. Each rank is then given the sign of the difference corresponding to it. Let R⁺ be the total of the positive ranks, and let R⁻ be the total of the negative ranks. The test statistic is the smaller of the two totals (R⁺ or R⁻). Designate this number with the symbol R.

Decision Rule
Reject the null hypothesis when R exceeds W(1 − α/2) or when R is less than W(α/2). These critical values W(1 − α/2) and W(α/2) are given in Table A-8 of your textbook (King'oriah, 2004, page 502). Otherwise, accept the hypothesis.

Example
The manager of a large motor firm dealing in the newest models of agricultural small-load pickups randomly selects ten of these petrol-powered pickups. He subjects them to petrol consumption mileage tests and finds that their consumption, in terms of kilometers per liter, is as in Table 10-1 below. Test the null hypothesis, using the median test at the 0.10 significance level, that the median of the population petrol consumption rate is 30 kilometers per liter.

Solution
Eliminate D₂ = 0. Then R⁺ = 3.5 and R⁻ = 41.5. Since R⁺ = 3.5 is smaller than R⁻ = 41.5, R = 3.5 is the value of the test statistic. From Table A-8 of your textbook (King'oriah, 2004, page 502), with n = 9 (observation 2 having been eliminated), W(α/2) = W(0.05) = 9 and W(1 − α/2) = W(0.95) = 36. Since R is less than 9, we reject the null hypothesis that the population median is 30 kilometers per liter.

TABLE 10-1: PETROL CONSUMPTION RANKINGS OF 10 PICKUPS (XM = 30)

Measurement     Dᵢ       |Dᵢ|     Rank
24.6           −5.4      5.4       7
30.0            0.0      0.0       -
28.2           −1.8      1.8       2
27.4           −2.6      2.6       3.5
26.8           −3.2      3.2       5
23.9           −6.1      6.1       8
22.2           −7.8      7.8       9
26.4           −3.6      3.6       6
32.6            2.6      2.6       3.5
28.8           −1.2      1.2       1

Notice the treatment of the tied ranks |D₄| = |D₉| = 2.6. These two absolute differences tie for the ranks of 3 and 4; thus each is assigned the average rank of 3.5. Also note that the first assumption of the test does not allow for ties: since the Dᵢ must be continuous, ties should not occur. If the Dᵢ are not continuous, we can use the test as an approximate test by dealing with the ties in the manner described above.

Two observations are important regarding this example. First, if we can assume that the distribution of kilometers per liter is symmetric, then the median test is the same as testing the hypothesis that the mean is 30 kilometers per liter. Secondly, the data have been measured on the ratio scale, and the t-test given in Chapter Three is appropriate if we assume that the distribution of kilometers per liter is symmetric and normal, or if the sample average is approximately normal via the Central Limit Theorem.
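The same ranking of signed differences can be reproduced computationally. The sketch below applies SciPy's signed-rank routine to the petrol consumption figures of Table 10-1; Python and SciPy are assumptions of this sketch, and because the data contain a zero difference and a tie, SciPy may fall back on a normal approximation for the p-value rather than the exact table used above.

```python
import numpy as np
from scipy.stats import wilcoxon

# Petrol consumption of the ten pickups (km per litre), from Table 10-1.
km_per_litre = np.array([24.6, 30.0, 28.2, 27.4, 26.8, 23.9, 22.2, 26.4, 32.6, 28.8])
hypothesised_median = 30.0

# zero_method='wilcox' discards the single zero difference, and tied absolute
# differences receive average ranks, mirroring the hand computation above.
stat, p_value = wilcoxon(km_per_litre - hypothesised_median, zero_method="wilcox")

# For a two-sided test the reported statistic is the smaller of the two signed
# rank sums, so it should agree with the hand value R = 3.5.
print(f"R (smaller rank sum) = {stat}, p = {p_value:.4f}")
```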
The Mann-Whitney Test for Two Independent Samples

The test is designed to establish whether two random samples have been drawn from the same population or from different populations. It is based on the fact that if the two independent samples have been drawn from the same population, the average ranks of their scores should be approximately equal. If the average rank of one sample is much bigger or much smaller than that of the second sample, this suggests that the two samples come from different populations.

Assumptions
The two samples are independent. They also consist of continuous random variables, even if the actual observations are discrete. The measurement scale is at least ordinal. The test is designed to test the null hypothesis that both samples have been drawn from the same population distribution, against the alternative hypothesis that the two random samples come from different population distributions.

Example
A farmer is breeding one kind of exotic steer using two different methods of feeding. One group is fed using zero-grazing methods, and the other is released freely onto his grass paddocks for the same period as the first group. He would like to see whether the body weights resulting from the two feeding methods are identical. After some time, he selects ten steers from each method of feeding and records the body weight of each of the two types of steers. He would like to know whether the body weights of the two types of steers are the same, and whether the two feeding methods are therefore equally suitable for rearing his steers. The recorded body weights of the ten steers of each type are combined and ranked in Table 10-3 below. Test the null hypothesis that the two feeding methods give identical results, at the 5% alpha level.

Solution
1. Mix up the body-weight observations and order them from the smallest to the largest. Underline the body weights from the first group in order to retain their identity within the mixed group. A mean rank is assigned to each of the ties in body weight; where this is the case, the next rank up is skipped, and the next highest observation is given the next-but-one rank. What now remains is the computation of the S-value for the observations from the first group. The underlined scores come from Group One, and those not underlined from Group Two. We now go ahead and compute the statistic S using Group One.

2. The statistic S is the sum of the ranks assigned to the observations of the first group:

S = R(X₁) + R(X₂) + ... + R(X₁₀)

Either group's rank sum may be used; here we take the first group, so that the test statistic is T = S. The null hypothesis that the two samples have been drawn from the same distribution is rejected at the 5% alpha level if the value of T so obtained is less than the lower critical value found in Table A-9 of your textbook (King'oriah, 2004, page 503), or greater than the corresponding upper critical value. We compute the T-value as follows:

T = S = 1.5 + 3 + 4 + 5.5 + 7 + 8 + 9.5 + 9.5 + 11 + 15 = 74

The critical value found within the array of the table is W(0.025) = 24. The farmer would reject the hypothesis if the calculated T-value were less than W(0.025) = 24, or greater than:

W(0.975) = nm − W(0.025) = (10)(10) − 24 = 76

(This upper value is not available from the table, but is computed from the formula as we have done here.)

TABLE 10-3: THE TWENTY BODY WEIGHTS COMBINED, ORDERED AND RANKED
(In the original table the Group One weights are underlined. The twenty body weights are arranged in ascending order and ranked from 1 to 20; the tied pairs of weights at 50, 62 and 70 each receive the average ranks 1.5, 5.5 and 9.5 respectively. The ranks belonging to Group One are 1.5, 3, 4, 5.5, 7, 8, 9.5, 9.5, 11 and 15; the remaining ranks, 1.5, 5.5, 12, 13, 14, 16, 17, 18, 19 and 20, belong to Group Two.)

We find that the calculated value of T = 74 is neither smaller than the expected value of 24 nor greater than the expected value of 76; it lies between the two limits. We therefore accept the null hypothesis that the two groups are not different: they come from the same population. This is an indication that the two feeding methods are equally good.
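The same comparison of two independent samples can be run with SciPy's Mann-Whitney routine. Python and SciPy are assumptions of this sketch, and since the chapter's raw body-weight table is not fully reproduced here, the weights below are hypothetical figures used only to show the call.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Hypothetical body weights (kg) under the two feeding methods;
# these are illustrative figures, not the chapter's Table 10-3 data.
zero_grazing = np.array([52, 55, 58, 61, 64, 66, 70, 71, 74, 80])
paddock      = np.array([53, 63, 77, 79, 82, 84, 87, 89, 90, 92])

# Two-sided test of the null hypothesis that both samples come from
# the same population distribution.
u_stat, p_value = mannwhitneyu(zero_grazing, paddock, alternative="two-sided")

print(f"U = {u_stat}, p = {p_value:.4f}")
# The U statistic is related to the rank sum S of the first group by
# U = S - n(n + 1)/2, where n is the size of the first group.
```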
Wilcoxon Signed-Rank Test for Two Matched Samples

This statistic is very useful in the behavioral sciences. It enables the researcher to make ordinal judgments between the members of any matched pair. The statistic is designed to test the null hypothesis that the two means are equal against the alternative hypothesis that they are different.

Assumptions
1. The data consists of n matched pairs of observations (Xᵢ, Yᵢ) from randomly selected samples X and Y, and the Dᵢ are the differences between the members of each pair.
2. Each difference is a member of a continuous variable, and any observation of Dᵢ is a result of chance. This means that each Dᵢ is a random variable.
3. The distribution of the differences is symmetric.
4. All paired observations represent a random sample of pairs from a bi-variate distribution; very many other pairs might as well have been chosen from the same population.
5. The scale of measurement is at least ordinal.

Procedure
If Dᵢ is the difference between the members of any matched pair under the two different treatments, then the Dᵢ are ranked in the order of their absolute values. After that, assign each rank the sign of its difference, to show whether each Dᵢ originated from the negative or from the positive side - depending on which sample observation was bigger at each time. If treatments A and B are equivalent and there is no difference between samples A and B, one would expect to find a good mix of ranks carrying a minus sign and ranks carrying a plus sign. In that case the two sums of signed ranks should be about equal if the null hypothesis of no difference between the samples is to be accepted.

If the sum of the positive ranks is very different from the sum of the negative ranks, we should suspect that treatment A differs from treatment B, and therefore reject the hypothesis. The same rejection would follow if the sum of the negative ranks predominated, meaning that B is very different from A.

After the ranking exercise, one is required to sum all the ranks bearing each sign. If R⁺ is the sum of all the positive-ranked values and R⁻ the sum of all the negative-ranked values, one examines the smaller of the two sums, no matter whether it belongs to the negative or to the positive ranks. Reject the null hypothesis if the smaller rank-value sum (R) is less than W(α/2), or greater than:

W(1 − α/2) = n(n + 1)/2 − W(α/2)

The value of W(α/2) is found in the same manner as the one for the median test, and is available in Table A-8 of your textbook (King'oriah, 2004, page 502). Now let us do an actual example.

Example
A professor of applied psychology selects 10 students randomly from a group of senior classmen to determine whether a speed-reading course has improved their reading speeds. Their readings, in words per minute, are recorded in the following manner:

TABLE 10-4: RESULTS OF A STANDARD SPEED-READING TEST

Student     Before     After
1            210        235
2            160        385
3            310        660
4            410        390
5            130        110
6            260        260
7            330        420
8            185        190
9            100        140
10           500        610

Assist the professor to test the null hypothesis that the two population means - one for before, and the other for after the course - are equal, at the 5% alpha level.

Solution
Take the differences of the two observations for each pair and rank them as illustrated in Table 10-5. Notice that the sixth difference is eliminated, because there is no difference between the two scores there. This reduces the number of paired scores used in our test to n = 9. There are two positive-ranked values, 2.5 + 2.5 = 5; all the other rank values add to 40.0. Notice that the signs of the ranks are derived from the signs of the Dᵢ.
In that case we have R⁺ = 5.0 and R⁻ = 40.0. This means that for our speed-reading course we shall use the test statistic involving the smaller of the two values, R = 5.0. We now refer to the table with α = 0.05, α/2 = 0.025, 1 − α/2 = 0.975 and n = 9.

TABLE 10-5: RANKED DIFFERENCES OF THE RESULTS OF A STANDARDIZED SPEED-READING COURSE

Student       Dᵢ       Rank of |Dᵢ|
1            −25            4
2           −225            8
3           −350            9
4             20            2.5
5             20            2.5
6              0            -
7            −90            6
8             −5            1
9            −40            5
10          −110            7

Examining the table we find that the lower critical value is W(0.025) = 6. We then compute the upper critical value using the formula:

W(0.975) = n(n + 1)/2 − W(0.025) = (9)(9 + 1)/2 − 6 = 45 − 6 = 39

Conclusion
The computed value of R = 5.0 is smaller than the critical value at n = 9, which is 6.0. It therefore lies outside the interval demarcated by the two end-limits, 6 and 39. We therefore reject the null hypothesis that the speed-reading course is ineffective, and accept the alternative hypothesis that the professor's speed-reading course is effective.
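The paired comparison above can be reproduced with SciPy's signed-rank routine applied to the before and after columns of Table 10-4. Python and SciPy are assumptions of this sketch; because of the zero difference and the tie among the absolute differences, SciPy may use a normal approximation for the p-value rather than the exact table consulted above.

```python
import numpy as np
from scipy.stats import wilcoxon

# Reading speeds (words per minute) before and after the course, Table 10-4.
before = np.array([210, 160, 310, 410, 130, 260, 330, 185, 100, 500])
after  = np.array([235, 385, 660, 390, 110, 260, 420, 190, 140, 610])

# zero_method='wilcox' discards student 6's zero difference, as in the hand
# computation; for a two-sided test the statistic is the smaller rank sum,
# so it should agree with the hand value R = 5.0.
stat, p_value = wilcoxon(before, after, zero_method="wilcox")

print(f"R (smaller rank sum) = {stat}, p = {p_value:.4f}")
```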
The Kruskal-Wallis Test for Several Independent Samples

This test is an extension of the Mann-Whitney test to situations where there are more than two populations. It is the non-parametric analog of the parametric single-factor, completely randomized design analysis of variance method. The data for the test must be in the following form:

TABLE 10-6: ARRANGEMENT OF DATA FOR THE KRUSKAL-WALLIS TEST FOR SEVERAL INDEPENDENT SAMPLES

Sample 1      Sample 2      ..........      Sample k
X₁₁           X₂₁           ..........      Xₖ₁
X₁₂           X₂₂           ..........      Xₖ₂
 .             .                             .
 .             .                             .
X₁ₙ₁          X₂ₙ₂          ..........      Xₖₙₖ

The total number of observations is given by N = n₁ + n₂ + ... + nₖ. The test depends on the ranks, and is similar to the Mann-Whitney test. Assign ranks from 1 to N to all N observations once they have been ordered from the smallest to the largest, disregarding which of the k populations the observations came from. Let Rᵢ be the sum of the ranks assigned to the i-th sample, and R̄ᵢ = Rᵢ/nᵢ its average rank:

Rᵢ = R(Xᵢ₁) + R(Xᵢ₂) + ... + R(Xᵢₙᵢ)

where R(Xᵢⱼ) is the rank assigned to Xᵢⱼ. If the R̄ᵢ are about the same, this supports the null hypothesis that the k samples came from the same population; if they differ considerably, it indicates that one or more populations are likely comprised of larger values than the rest. If ties occur, they are treated as in the Mann-Whitney test.

Definition of the Test

Assumptions
1. The k random samples are mutually independent.
2. All random variables Xᵢⱼ are continuous.
3. The measurement scale is at least ordinal.

Hypotheses
Ho: The k population distributions are equal.
HA: At least one population tends to yield larger observations than the rest.

The Test Statistic

T = [12 / (N(N + 1))] Σ nᵢ [R̄ᵢ − (N + 1)/2]²    (the sum running over i = 1, 2, ..., k)

where:
nᵢ = the i-th sample size, i = 1, 2, ..., k;
N = n₁ + n₂ + ... + nₖ;
Rᵢ = the sum of the ranks assigned to the i-th sample and R̄ᵢ = Rᵢ/nᵢ; and
R(Xᵢⱼ) = the rank assigned to observation Xᵢⱼ.

Table A-10 at the back of your textbook gives the critical T-values at exact significance levels for k = 3 and sample sizes up to and including five. If k > 3 and/or nᵢ > 5 for at least one sample, the χ² distribution with k − 1 degrees of freedom may be used to find the approximate critical T-value. We shall now work out an example for which Table A-10 is appropriate. The Chi-Square approximation for the critical T-value appears to be good even if k and the nᵢ are only slightly larger than 3 and 5, respectively. For example, if k = 6 and we have an alpha level of 0.05, the critical T-value is χ²(0.05, k − 1) = χ²(0.05, 5) = 11.1 (see the Chi-Square Table A-6; King'oriah, 2004, page 499). We would reject the null hypothesis if the T-value were greater than 11.1.

If the populations differ, but only in location, then the Kruskal-Wallis test is equivalent to testing the equality of the k population means, which is identical to testing for significance using the Analysis of Variance.

Example
A manager wishes to study the production output of three machines A, B and C. The hourly output of each machine is measured for five randomly selected hours of operation. From the data given in Table 10-7, test the null hypothesis, at the 5% significance level, that the three population distributions are equal, using the Kruskal-Wallis test.

TABLE 10-7: PRODUCTION OUTPUT OF THREE MACHINES

Observation     Machine A     Machine B     Machine C
1                  25            18            26
2                  22            23            28
3                  31            21            24
4                  26             *            25
5                  20            24            32

Solution
The data are ordered as in Table 10-8 and ranked accordingly. The computations are carried out as hereunder to determine the T-value for the necessary comparison.

R₁ = 2 + 4 + 8.5 + 10.5 + 13 = 38
R₂ = 1 + 3 + 5 + 6.5 = 15.5
R₃ = 6.5 + 8.5 + 10.5 + 12 + 14 = 51.5

And therefore, with N = 14, R̄₁ = 38/5 = 7.6, R̄₂ = 15.5/4 = 3.875, R̄₃ = 51.5/5 = 10.3 and (N + 1)/2 = 7.5:

T = [12 / (14 × 15)] [5(7.6 − 7.5)² + 4(3.875 − 7.5)² + 5(10.3 − 7.5)²]
  = (0.057)(0.05 + 52.5625 + 39.2)
  ≈ 5.23

From Table A-10 in the appendix of your textbook (King'oriah, 2004, page 506), the critical T-value is 5.6429. Notice that this value corresponds to n₁ = 5, n₂ = 5, n₃ = 4, but the order of the sample sizes does not affect the critical value. Since T ≈ 5.23 is not greater than 5.6429, we accept the null hypothesis that the three population distributions are equal.

TABLE 10-8: ORDERED DATA FOR PRODUCTION OUTPUT OF THREE MACHINES

Rank        A       B       C
1                  18
2          20
3                  21
4          22
5                  23
6.5                24      24
8.5        25              25
10.5       26              26
12                         28
13         31
14                         32
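For comparison, the machine-output data of Table 10-7 can be fed straight into SciPy's Kruskal-Wallis routine. Python and SciPy are assumptions of this sketch; SciPy applies a correction for the tied ranks and reports a chi-square based p-value rather than the exact Table A-10 value used above, so its statistic will differ slightly from the uncorrected hand figure of about 5.23.

```python
from scipy.stats import kruskal

# Hourly output of the three machines, from Table 10-7
# (machine B has only four recorded hours).
machine_a = [25, 22, 31, 26, 20]
machine_b = [18, 23, 21, 24]
machine_c = [26, 28, 24, 25, 32]

# Null hypothesis: the three population distributions are equal.
t_stat, p_value = kruskal(machine_a, machine_b, machine_c)

print(f"T = {t_stat:.3f}, p = {p_value:.4f}")
```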
Rank Correlation Coefficient: Spearman's Rho

This is the non-parametric equivalent of the Pearson product-moment correlation coefficient which we discussed in Chapter Seven. It is a measure of association for data measured on at least an ordinal scale, so that it is possible to rank the observations under study in two ordered series. The rankings of the two scores are compared by taking the differences between the ranks of the first variable X and the second variable Y. These differences are squared and then added. Finally the results are manipulated in order to obtain a statistic which equals +1.0 if the two rankings are in perfect concordance, or −1.0 if they are in perfect disagreement. The derivation of this statistic involves applying the ordinary formula for the product-moment correlation coefficient to the ranks of the two variables instead of to the raw observations. Consequently, the resulting statistic is called "Spearman's rho" and designated with the symbol rₛ or the Greek letter ρ. The interpretation of the statistic is broadly analogous to that of the correlation coefficient, although in this case it is only the relationship between the underlying distributions that is relevant.

Definition of the Test

Assumptions
1. The n pairs (Xᵢ, Yᵢ) represent a random sample drawn from a bi-variate population distribution of continuous random variables X and Y.
2. The measurement scale is at least ordinal.

Hypotheses
Ho: The Xᵢ and Yᵢ are uncorrelated.
HA: Either there is a tendency for larger values of X to be paired with larger values of Y, or there is a tendency for smaller values of X to be paired with larger values of Y.

The Test Statistic
Let R(Xᵢ) be the rank of Xᵢ and R(Yᵢ) the rank of Yᵢ. Ties are handled as usual - assign to each tied value the average of the ranks that would have been assigned had there been no ties. Spearman's rho - the correlation measure and the test statistic - is given by:

ρ = 1 − 6 Σ [R(Xᵢ) − R(Yᵢ)]² / [n(n² − 1)]    (the sum running over i = 1, 2, ..., n)

The Decision Rule
Reject the null hypothesis if the measure is greater than w(1 − α/2) or if it is less than w(α/2), where w(1 − α/2) and w(α/2) are given in Table A-11 of your textbook. If there are no ties, Spearman's rho can be calculated from the product-moment correlation coefficient,

r = Σ (Xᵢ − X̄)(Yᵢ − Ȳ) / √[ Σ (Xᵢ − X̄)² Σ (Yᵢ − Ȳ)² ]

by replacing the Xᵢ's and Yᵢ's with their ranks.

Example
"Nyama Choma" (roast meat) is a big public health problem in this country. Men and women have been eating it for a long time, and are now succumbing to debilitating maladies which are associated with the over-consumption of proteins and fats together with alcohol. This causes the formation of hard calcites within the joints, impairing efficient bone manipulation during locomotion and causing gout and great hardship within the families of breadwinners. In addition, the existence of high levels of lipids in the blood causes cholesterol problems which ultimately cause heart disease, stroke and associated illnesses.

Men and women in this country are eating Nyama Choma in great quantities, because beef, goat meat and mutton are relatively cheap and readily available. In addition, the dishes accompanying the roast meat - like Kachumbari, Mukimo, Ughali and Mutura, among others - are easy to prepare and delicious. In an attempt to attack this unfortunate habit of Nyama Choma abuse through public education, Dr. Asaph Mwalukware intends to study the meat-eating habits among men and women of this country, so that he can determine which group of people to target with his information dossiers about the debilitating consequences of continuous eating of Nyama Choma combined with copious consumption of alcohol.

Dr. Mwalukware feels that in any home, husbands have the habit of eating more Nyama than their wives (because of hanging out with "buddies" in pubs and night-clubs until late in the evening, while the wives wait at home), but he cannot say for sure that this is the case. He would like to test the null hypothesis of no difference in roast-meat-eating habits, using ten randomly selected couples in Maua Municipality, Kenya. Each couple is asked for their opinion about habitual Nyama Choma eating, and to rate their feelings from "0" (strongly dislike and condemn Nyama in the strongest terms possible) to "100" (I love and adore the habit, Nyama is delicious! "Poa!" How can you do without it?). The ten pairs of ratings are shown in Table 10-9 (source: hypothetical data, adapted from Pfaffenberger and Patterson, page 680).

TABLE 10-9: PREFERENCE FOR EATING NYAMA CHOMA AMONG TEN RANDOMLY SELECTED COUPLES IN MAUA MUNICIPALITY

Couple     Husband (Xᵢ)     Wife (Yᵢ)
1               90              70
2              100              60
3               75              60
4               80              80
5               60              75
6               75              90
7               85             100
8               40              75
9               95              85
10              65              65

Solution
The ranks of all the Xᵢ's and Yᵢ's are given in Table 10-10, and the value of rho is computed using the Spearman formula, as shown below.
TABLE 10-10: RANKS OF THE NYAMA CHOMA EATING PREFERENCES AMONG THE TEN RANDOMLY SELECTED COUPLES IN MAUA MUNICIPALITY

Couple     Husband R(Xᵢ)     Wife R(Yᵢ)
1               8                4
2              10                1.5
3               4.5              1.5
4               6                7
5               2                5.5
6               4.5              9
7               7               10
8               1                5.5
9               9                8
10              3                3

ρ = 1 − 6 Σ [R(Xᵢ) − R(Yᵢ)]² / [n(n² − 1)]
  = 1 − 6[(8 − 4)² + (10 − 1.5)² + (4.5 − 1.5)² + ... + (3 − 3)²] / [10(10² − 1)]
  = 1 − 6(161)/990
  = 1 − 0.976 = 0.024

This value is obviously insignificant, because it is very nearly zero, given that the Spearman rank correlation coefficient ranges from zero to +1.0 (for a very strong positive relationship) and from zero to −1.0 (for a very strong negative relationship). We might as well conclude that there is no significant relationship between the two groups of ratings.

However, since we are testing whether the ratings of husbands and those of wives are uncorrelated, we begin by identifying the mean of no correlation, which is 0.0000. We then look for a way of building a confidence interval at the appropriate significance level which may help us determine the rejection regions. That means we have to be able to compute the number of standard errors from zero within which the parameter rho may be found for there to be no correlation. We look at our t-tables for n − 2 = 10 − 2 = 8 degrees of freedom and a two-tail alpha level of 0.05. We find that:

t(α/2, n − 2) = t(0.025, 8) = 2.306

For there to be a significant relationship, we must be able to use the rank correlation coefficient to compute a t-value which is greater than 2.306 in absolute value. Otherwise the relationship is gravitating around a correlation of 0.0000, and is therefore non-existent or insignificant. This means that the preference for eating Nyama Choma is not shared jointly by husband and wife: each person has an individual opinion about the importance of the meal. We therefore come to the conclusion that husbands and wives are likely to make independent judgments regarding the meal of their choice. Dr. Mwalukware's target members of the Maua population must be selected on criteria other than gender. He may perhaps choose to investigate Miraa eaters and non-Miraa eaters, married and unmarried men or women, owners and non-owners of businesses, and others.
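The rank correlation for the couples' ratings can also be obtained directly from the raw scores of Table 10-9. Python and SciPy are assumptions of this sketch; note that SciPy computes rho as the product-moment correlation of the ranks, so with tied ratings its value (roughly 0.015 here) differs slightly from the 0.024 produced by the shortcut formula above, and both are essentially zero.

```python
from scipy.stats import spearmanr

# Nyama Choma preference ratings of the ten couples, from Table 10-9.
husbands = [90, 100, 75, 80, 60, 75, 85, 40, 95, 65]
wives    = [70, 60, 60, 80, 75, 90, 100, 75, 85, 65]

# spearmanr assigns average ranks to ties and correlates the two rank series.
rho, p_value = spearmanr(husbands, wives)

print(f"rho = {rho:.3f}, p = {p_value:.4f}")
# A rho close to zero, with a large p-value, supports the conclusion that
# husbands' and wives' ratings are uncorrelated.
```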
EXERCISES

1. A drug company is interested in determining how the chemical treatment for a specific form of cancer changes body temperature. Ten patients with the disease are selected at random from a set of patients under experimental control. Their temperatures are measured before and after taking the treatment. The data, given in degrees Fahrenheit, are listed below.

Patient      1      2      3      4      5      6      7       8      9      10
Before     98.2   98.4   98.0   99.0   98.6   97.5   98.4   100.0   99.2   98.6
After      99.4  101.2   97.6   99.8   98.0   98.4   98.4   102.3  101.6   98.8

Test the null hypothesis that the two population means are equal at the 1% significance level, using the Wilcoxon Signed-Rank Test.

2. (a) Using the Mann-Whitney test at the 0.05 alpha level and the data below, test the null hypothesis that the two samples come from the same population.
(b) Given the crop yield/fertilizer figures below, use the analysis of variance method to test the null hypothesis of the equality of the populations, and compare your results with the Kruskal-Wallis test results. What further assumptions must you make to use the analysis of variance F-test?

Observation      1      2      3      4       5      6      7      8
Fertilizer A   80.5   76.4   93.2   90.6    84.7   81.2   78.5   82.0
Fertilizer B   95.4   84.7   88.1   98.2   101.6   88.6   96.4   97.3

3. A production manager suspects that the level of production among a specific class of workers in the firm is related to their hourly pay. The following data is collected on eight randomly selected workers.

Worker                 1      2      3      4      5      6      7      8
Hourly Pay (X) ($)    3.10   2.50   4.45   2.75   5.00   5.00   2.90   4.75
Production (Y)         50     20     62     30     75     60     42     60

(a) Calculate Spearman's rho for the data.
(b) Calculate the Pearson product-moment correlation coefficient for this data. Outline what assumptions are necessary to compute the Pearson r.
(c) Using Spearman's rho, test the null hypothesis that the values of X and Y are uncorrelated.

REFERENCES

Class Texts
King'oriah, George K. (2004), Fundamentals of Applied Statistics. Jomo Kenyatta Foundation, Nairobi.
Steel, Robert G.D. and James H. Torrie (1980), Principles and Procedures of Statistics: A Biometric Approach. McGraw-Hill Book Company, New York.

Additional References
Irandu, Evaristus Makunyi (1982), "The Road Network in Mombasa Municipal Area: A Spatial Analysis of its Effects on Land Values, Population Density and Travel Patterns". Unpublished M.A. Thesis, University of Nairobi.
Gibbons, Jean D. (1970), Non-Parametric Statistical Inference. McGraw-Hill Book Company, New York (N.Y.).
Keller, Gerrald, Brian Warrack and Henry Bartel (1994), Statistics for Management and Economics. Duxbury Press, Belmont (California).
Levine, David M., David Stephan, et al. (2006), Statistics for Managers. Prentice Hall of India, New Delhi.
Pfaffenberger, Roger C. (1977), Statistical Methods for Business and Economics. Richard D. Irwin, Homewood, Illinois (U.S.A.).
Salvatore, Dominick, and Derrick Reagle (2002), Statistics and Econometrics. McGraw-Hill Book Company, New York (N.Y.).
Siegel, Sidney C. (1956), Non-Parametric Statistics for the Behavioral Sciences. McGraw-Hill Book Company, New York.
Snedecor, George W. and William G. Cochran (1967), Statistical Methods. Iowa University Press, Ames, Iowa (U.S.A.).