MATH 2441 Probability and Statistics for Biological Sciences
Looking Ahead … Terminology and Directions

There is certain basic terminology, and there are certain concepts, used throughout the explanation and use of the methods of probability and statistics. You will become familiar with these ideas as they arise again and again in the course. However, to alert you to their importance, we mention a few of them here.

I. Variables

As in many branches of mathematics, the notion of a variable arises immediately. You can find some very formal definitions of a variable in reference books, but for our purposes, a variable is simply a specific property of each member of a population of things. For example:

- in reference to a population of human beings, a person's height, weight, age, annual income, years of school completed, hair color, favorite flavor of ice cream, favorite political candidate, etc. are all variables;
- in reference to the population of all cans of chicken soup produced by Acme Soup Inc., such properties as the percent salt, the number of millilitres of soup in the can, the date on which the can of soup was produced, etc. are all variables;
- in reference to a population of wheat plants of a certain variety, the height of the plant at a certain time after planting, the number of days required for the seed to germinate, the percent protein in the harvested grains, etc. are all variables;
- in reference to flipping a coin, the identity of the face showing (heads or tails) is a variable; if five coins are tossed simultaneously, the number of coins falling heads up is a variable.

In statistics, we often wish to characterize relationships between variables (for example, we might wish to determine whether the value of a person's annual income is related to the value of their hair color, or perhaps whether the percent protein in wheat kernels is related to how much fertilizer was applied to the field during growth, etc.).
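The coin-tossing example above is simple enough to simulate directly. Here is a minimal Python sketch (the language choice and the function name are ours, not part of the course) treating the number of heads among five tossed coins as a variable whose value changes from one performance of the experiment to the next:

```python
import random

random.seed(1)  # fixed seed so repeated runs give the same illustration

def heads_in_five_tosses():
    # Toss five fair coins; the number falling heads up is a variable whose
    # value cannot be predicted with certainty before the tosses are made.
    return sum(random.choice(("heads", "tails")) == "heads" for _ in range(5))

# Repeating the experiment produces (generally different) values of the variable.
observations = [heads_in_five_tosses() for _ in range(10)]
print(observations)  # ten integers, each between 0 and 5
```

Each call to heads_in_five_tosses is one performance of the experiment; each printed number is one observed value of the variable.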
In cases such as these, we still retain the distinction between independent variables and dependent variables. The value of the dependent variable is thought to be determined, or at least influenced, by the values assigned or observed for the independent variables.

When the value a variable takes comes as the result of some random process (a process in which the specific result is not predictable with certainty in advance), we refer to it as a random variable. Random variables play a large role in statistical work, since we are mostly concerned with properties of members of random samples.

II. Samples and Populations

Always remember that the basic goal of statistics is to estimate values of properties of, draw conclusions about, or make predictions about populations using data obtained from a random sample of that population. Very early in the course, we shall have to be careful to distinguish whether we are dealing with a population property (called a population parameter) or a property of a random sample (called a sample statistic).

Sample statistics are always random variables, because their values depend on which elements of the population get randomly selected to be part of the sample. Once the sample has been selected, the value of the statistic is known. Population parameters, in contrast, are always fixed quantities, never random variables (simply because once a population is described, it is a fixed collection of things). However, the values of population parameters are rarely known -- hence the need for the methods of statistical analysis.

In statistics, the term experiment refers generally to the procedure required to measure the value of a sample statistic.
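The distinction between a fixed population parameter and a random sample statistic can be seen in a small simulation. The population values below are invented purely for illustration:

```python
import random
import statistics

random.seed(42)  # reproducible illustration

# A fixed, fully described population (hypothetical ages, for illustration).
# Its mean is a population parameter: a fixed number, not a random variable.
population = [21, 25, 30, 34, 41, 47, 52, 58, 63, 69]
mu = statistics.mean(population)
print("population mean (parameter):", mu)

# The sample mean is a sample statistic, and it IS a random variable:
# its value depends on which elements happen to be selected.
for trial in range(3):
    sample = random.sample(population, 4)  # random sample of size 4, drawn without replacement
    print("sample", sample, "-> sample mean (statistic):", statistics.mean(sample))
```

Note that in real statistical work we would see only one sample and would not know the parameter; the simulation can report the population mean only because the population here is invented.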
III. Sampling Error and Statistical Significance/Confidence Levels

Since the estimates we make of population properties, and the conclusions we draw or the predictions we make about a population, will always be based on direct observation of the elements of a smaller random sample selected from that population, we need to keep in mind that we can obtain a quite erroneous result about the population just because of the coincidence of which elements of the population actually turn up in the sample.

For example, suppose we are studying a population of, say, all graduates of BCIT's Biotechnology Diploma Program. For the sake of argument, let's say that there are now exactly 400 such individuals, of whom 10 have become millionaires since graduation (though we wouldn't know this unless we determined the current wealth of each one of the 400 -- that is, unless we studied the entire population). Instead of going to the time and expense of locating each of the 400 individuals in this population and forcing each of them to tell us how much money they have (a somewhat questionable experimental approach for several reasons), we decide to put the names of all 400 on small slips of paper in a box, and then, while blindfolded, draw just two names as our random sample of individuals for more detailed study. We then contact those two people, who, as it turns out, are quite happy to tell us how much money they have.

Now, it could be that both of these individuals in our sample are from the group of 390 grads who are not millionaires. We would then be led by our sample to conclude that no biotechnology grads are millionaires -- clearly an incorrect conclusion. On the other hand, it might also be that, by wild coincidence, the two names we draw for our sample just happen to be in the small group of ten grads who are now millionaires.
In this case we will come to the incorrect conclusion that all biotechnology grads are millionaires.

It is true that basing conclusions about a population on a random sample of size two is too risky to be considered an appropriate approach. The point, though, is that varying degrees of such risk are present no matter how large the sample is, except when the sample includes every element of the population. The difference between the value of a property of the population and the value of the corresponding property of a sample is referred to as sampling error. It is a feature of the random sampling process itself, and some degree of sampling error cannot be avoided in statistical experiments involving real populations.

Whenever we estimate some property of a population based on observations for a random sample of that population (in the example above, we might be estimating the percentage of biotechnology grads who are millionaires), we will attach a level of confidence to our estimate. The level of confidence is a number on a scale of 0 to 1, written as a percentage, where numbers near 100% indicate a high likelihood that our estimate is correct as stated. (A level of confidence of 100% would mean we are certain our estimate is correct, whereas a level of confidence of 0% would mean we are certain our estimate is incorrect. For most work, a level of confidence of 95% is considered acceptable.)

On the other hand, when we use the data in a sample to draw conclusions about a population, the measure of reliability of that conclusion is called its level of significance. Again, the level of significance is a number between 0 and 1, written as a percent, representing the likelihood that our conclusion is wrong as a result of sampling error. A level of significance of 1 (100%) would mean we are certain our conclusion is incorrect, whereas a level of significance of 0 (0%) would mean we are certain our conclusion is correct.
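The coincidences in the millionaire example above can be quantified exactly. This short sketch (the variable names are ours; the counting is the standard hypergeometric argument for sampling without replacement) uses the example's own numbers -- 400 graduates, 10 millionaires, a sample of two names:

```python
from fractions import Fraction
from math import comb

# Population from the example: 400 graduates, 10 of whom are millionaires.
N, M, n = 400, 10, 2   # population size, number of millionaires, sample size

# Probabilities for two names drawn without replacement from the box of 400 slips:
p_none = Fraction(comb(N - M, n), comb(N, n))   # both drawn from the 390 non-millionaires
p_both = Fraction(comb(M, n), comb(N, n))       # both drawn from the 10 millionaires

print(f"P(sample suggests no grads are millionaires)  = {float(p_none):.4f}")
print(f"P(sample suggests all grads are millionaires) = {float(p_both):.6f}")
```

So roughly 95 times out of 100, this tiny sample would lead us to conclude that no grads are millionaires, even though the true proportion is 2.5% -- sampling error in action.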
Generally, conclusions based on levels of significance of 0.05 (that is, 5%) or smaller are considered acceptable. When detailed calculations of the likelihood of a particular conclusion being incorrect are done, that likelihood is called the p-value of the conclusion.

We will spend quite a lot of time fleshing out these notions -- they are central to the whole discipline of statistics. From this point on, however, you need to be aware of the sort of results we will be able to obtain about populations from information obtained from a random sample of that population, and you must never lose track of the presence of sampling error.

David W. Sabo (1999)